
Conversation


@onalante-ebay commented Dec 26, 2025

While the scalar post-processing required to obtain one bit per lane
makes this more expensive than directly supporting variable-sized bit
groups (as done in Zstandard[^1]), the result is still an improvement
over the current lane-by-lane algorithm.

To reduce duplication, XSIMD_LITTLE_ENDIAN is moved from
math/xsimd_rem_pio2.hpp to config/xsimd_config.hpp, so it is now
available outside its original defining header.

Footnotes

  1. See facebook/zstd#3139 ("[lazy] Optimize ZSTD_row_getMatchMask for levels 8-10 for ARM"), namely `ZSTD_row_matchMaskGroupWidth`.
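
For context, the Zstandard approach mentioned above can be sketched as follows. This is a hypothetical illustration of the variable-width group mask from zstd#3139, not the code added in this PR, and the helper name `nibble_mask_u8` is invented for the example.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical sketch: build a 64-bit mask with a 4-bit group per byte lane
// (each group all-zeros or all-ones), in the spirit of zstd#3139, where
// ZSTD_row_matchMaskGroupWidth lets callers step by the group width instead of 1.
inline std::uint64_t nibble_mask_u8(uint8x16_t eq) // lanes are 0x00 or 0xFF
{
    // vshrn_n_u16(x, 4) shifts each 16-bit element right by 4 and narrows it
    // to 8 bits, retaining 4 bits from every original byte lane.
    uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
    return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
}
```

Collapsing such a group mask down to one bit per lane, which xsimd's `mask` must return, is where the extra scalar post-processing comes in.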

@serge-sans-paille
Contributor

I've suggested an improvement to the 64-bit version here: https://godbolt.org/z/b7henc933

@onalante-ebay
Author

Applied, thank you for the suggestion. I will fix the GCC build in a moment.

```cpp
template <class A, class T, detail::enable_sized_t<T, 1> = 0>
XSIMD_INLINE uint64_t mask(batch_bool<T, A> const& self, requires_arch<neon>) noexcept
{
    uint8x16_t inner = self;
```
Contributor

You should probably use the method described in https://github.com/DLTcollab/sse2neon/blob/ade5552a32852422e4f34f0beaa51790ef9f4171/sse2neon.h#L5574.
It performs the reduction in parallel and that seems more efficient!
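
For reference, a minimal sketch of that kind of in-register reduction, assuming the referenced sse2neon line is its `_mm_movemask_epi8`; the function name `movemask_u8_sketch` is invented for the example.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch of a shift-right-and-accumulate reduction in the style of
// sse2neon's _mm_movemask_epi8: collapse a 0x00/0xFF-per-lane uint8x16_t
// into a 16-bit mask with one bit per lane, without a scalar loop.
inline std::uint16_t movemask_u8_sketch(uint8x16_t input)
{
    // Keep only the sign bit of every byte (0 or 1 per lane).
    uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));
    // Each vsraq_n_* step shifts the upper half of a pair down next to the
    // lower half and adds, doubling the number of bits gathered per element.
    uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7));
    uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14));
    uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28));
    // Byte 0 now holds the mask of lanes 0-7, byte 8 the mask of lanes 8-15.
    return static_cast<std::uint16_t>(vgetq_lane_u8(paired64, 0)
                                      | (vgetq_lane_u8(paired64, 8) << 8));
}
```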

