More efficient movemask for aarch64 #1237

serge-sans-paille · 2025-12-27T15:32:06Z

No description provided.

onalante-ebay · 2025-12-27T20:42:34Z

include/xsimd/arch/xsimd_neon64.hpp

+         ********/
+
+        template <class A, class T, detail::enable_sized_t<T, 1> = 0>
+        XSIMD_INLINE uint64_t mask(batch_bool<T, A> const& self, requires_arch<neon64>) noexcept


It seems like this has lower block throughput than the non-NEON64 variant: https://godbolt.org/z/szPjEzPW7.

Aha, benchmarks on actual CPUs were faster with vaddv: DLTcollab/sse2neon@ed179d7.
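For readers unfamiliar with the `vaddv` approach mentioned above, here is a minimal sketch of one common weighted-sum formulation for 8-bit lanes. It is not taken from this PR or from the sse2neon commit. The function names `movemask_u8_scalar` and `movemask_u8_weighted` are hypothetical, and every lane is assumed to be either 0x00 or 0xFF, as a comparison would produce. Each lane is ANDed with a distinct power of two, then each 8-lane half is summed horizontally. On AArch64 the sum uses `vaddv_u8`; elsewhere a scalar loop stands in so the sketch still compiles.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Scalar reference: bit i of the result is set when lane i is non-zero.
inline std::uint64_t movemask_u8_scalar(std::array<std::uint8_t, 16> const& lanes)
{
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < lanes.size(); ++i)
    {
        bits |= static_cast<std::uint64_t>(lanes[i] != 0) << i;
    }
    return bits;
}

// Weighted-sum formulation (hypothetical helper, not the PR's code).
// Assumption: every lane is either 0x00 or 0xFF.
inline std::uint64_t movemask_u8_weighted(std::array<std::uint8_t, 16> const& lanes)
{
    // One distinct bit per lane within each 8-lane half.
    static constexpr std::uint8_t weights[16] = {
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128
    };
#if defined(__aarch64__)
    // Keep only the weight bit of each true lane, then add across each half.
    uint8x16_t const masked = vandq_u8(vld1q_u8(lanes.data()), vld1q_u8(weights));
    std::uint64_t const lo = vaddv_u8(vget_low_u8(masked));
    std::uint64_t const hi = vaddv_u8(vget_high_u8(masked));
#else
    // Portable stand-in for the two horizontal adds.
    std::uint64_t lo = 0;
    std::uint64_t hi = 0;
    for (std::size_t i = 0; i < 8; ++i)
    {
        lo += static_cast<std::uint64_t>(lanes[i] & weights[i]);
        hi += static_cast<std::uint64_t>(lanes[i + 8] & weights[i + 8]);
    }
#endif
    // Each half sums to at most 255, so the two bytes can be concatenated.
    return lo | (hi << 8);
}
```

Because the weights within a half are distinct powers of two, the horizontal sum cannot carry between bits, so the add behaves like a bitwise OR of the selected weights.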

The results might need reevaluation for u{16,32}. I can put together a benchmark since I am using an M2-based device.

I can confirm faster execution for u{8,16,32} on this crude benchmark:

PATCH

diff --git a/benchmark/main.cpp b/benchmark/main.cpp
index 7a630e4..e921566 100644
--- a/benchmark/main.cpp
+++ b/benchmark/main.cpp
@@ -12,6 +12,15 @@
 #include "xsimd_benchmark.hpp"
 #include <map>
 
+void benchmark_mask()
+{
+    std::size_t size = 20000;
+    xsimd::run_mask_benchmark<uint8_t>(std::cout, size, 1000);
+    xsimd::run_mask_benchmark<uint16_t>(std::cout, size, 1000);
+    xsimd::run_mask_benchmark<uint32_t>(std::cout, size, 1000);
+    xsimd::run_mask_benchmark<uint64_t>(std::cout, size, 1000);
+}
+
 void benchmark_operation()
 {
     // std::size_t size = 9984;
@@ -112,6 +121,7 @@ void benchmark_basic_math()
 int main(int argc, char* argv[])
 {
     const std::map<std::string, std::pair<std::string, void (*)()>> fn_map = {
+        { "mask", { "mask", benchmark_mask } },
         { "op", { "arithmetic", benchmark_operation } },
         { "exp", { "exponential and logarithm", benchmark_exp_log } },
         { "trigo", { "trigonometric", benchmark_trigo } },
diff --git a/benchmark/xsimd_benchmark.hpp b/benchmark/xsimd_benchmark.hpp
index 6f6b91b..8b8447c 100644
--- a/benchmark/xsimd_benchmark.hpp
+++ b/benchmark/xsimd_benchmark.hpp
@@ -16,6 +16,7 @@
 #include "xsimd/xsimd.hpp"
 #include <chrono>
 #include <iostream>
+#include <random>
 #include <string>
 #include <vector>
 
@@ -310,6 +311,38 @@ namespace xsimd
         return t_res;
     }
 
+    template <class T, class OS, kernel::detail::enable_integral_t<T> = 0>
+    void run_mask_benchmark(OS& out, std::size_t size, std::size_t iter)
+    {
+        bench_vector<T> f_lhs;
+        // NOTE: This is a hack to match the signature of `benchmark_simd{,_unrolled}`.
+        bench_vector<T> f_res;
+
+        size = size / batch<T>::size * batch<T>::size;
+        f_lhs.resize(size);
+        f_res.resize(size);
+
+        std::minstd_rand rng(1337);
+        std::bernoulli_distribution dist;
+        for (std::size_t i = 0; i < size; ++i)
+        {
+            f_lhs[i] = static_cast<T>(dist(rng));
+        }
+
+        const auto mask_functor = [](batch<T> const& x)
+        {
+            return (x == batch<T>(0)).mask();
+        };
+        const auto time = benchmark_simd<batch<T>>(mask_functor, f_lhs, f_res, iter);
+        const auto time_unr = benchmark_simd_unrolled<batch<T>>(mask_functor, f_lhs, f_res, iter);
+
+        out << "============================" << std::endl;
+        out << "mask" << sizeof(T) * 8 << std::endl;
+        out << "vector : " << time.count() << "ms" << std::endl;
+        out << "vector unr : " << time_unr.count() << "ms" << std::endl;
+        out << "============================" << std::endl;
+    }
+
     template <class F, class OS>
     void run_benchmark_1op(F f, OS& out, std::size_t size, std::size_t iter, init_method init = init_method::classic)
     {
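The input preparation in the patch can be tried without xsimd. The sketch below is a rough stand-in: the helper names `round_down_to_batches` and `make_mask_input` are hypothetical, and `lanes` stands in for `batch<T>::size`. It shows the two steps the benchmark relies on: truncating the element count to a whole number of batches, and filling the buffer with seeded 0/1 values so that roughly half of the lanes compare equal to zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Round `size` down to a whole number of batches of `lanes` elements,
// mirroring `size / batch<T>::size * batch<T>::size` in the patch.
inline std::size_t round_down_to_batches(std::size_t size, std::size_t lanes)
{
    return size / lanes * lanes;
}

// Hypothetical helper: build a reproducible buffer of 0/1 values.
// `lanes` stands in for batch<T>::size, which depends on the target.
template <class T>
std::vector<T> make_mask_input(std::size_t size, std::size_t lanes, unsigned seed = 1337)
{
    std::vector<T> data(round_down_to_batches(size, lanes));
    std::minstd_rand rng(seed);
    std::bernoulli_distribution dist; // p = 0.5 by default
    for (auto& value : data)
    {
        value = static_cast<T>(dist(rng));
    }
    return data;
}
```

A fixed seed keeps runs comparable across implementations, since every variant then sees the same bit pattern.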

Thanks for the feedback. Let's merge that one then! (once CI is happy)

As a complement to #1236

serge-sans-paille mentioned this pull request Dec 27, 2025

Implement optimized movemasks for NEON #1236

Open

serge-sans-paille force-pushed the feature/aarch64-movemask branch 3 times, most recently from c5f067e to d85a523 on December 27, 2025 20:42

onalante-ebay reviewed Dec 27, 2025


More efficient batch_bool::mask() for aarch64

5fac2ad

As a complement to #1236

serge-sans-paille force-pushed the feature/aarch64-movemask branch from d85a523 to 5fac2ad on December 28, 2025 10:30

serge-sans-paille merged commit fa3d5f9 into master Dec 28, 2025
116 of 118 checks passed
