On many CPUs, floating-point division is much slower than multiplication,
see discussion here.
Thus, when dividing by the same denominator in a loop, one might consider replacing x / y with a x * z where z = 1 / y is computed outside of the loop.
This sacrifices some accuracy (due to the round-off error from computing z) in order to replace the per-iteration divide with a multiply.
The question is, does this actually have an impact?
To build, use the usual CMake process:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build buildWe test the speed of iterating over an array and dividing its contents by a constant. This is a pretty common pattern in tatami, and while usually more operations are involved, we'll keep things simple here. We compare to the alternative, i.e., compute the reciprocal and use it to multiply each array element within the loop.
On an Intel i7-8850H, we get:
$ ./build/basic
Division: 0.0137299 ± 1.95084e-05
Multiplication: 0.0136978 ± 3.16613e-05
On a Mac M2, we get:
$ ./build/basic
Division: 0.00203902 ± 0.000123369
Multiplication: 0.00200791 ± 8.85522e-05
Not much effect, it seems. I'd guess that the CPU is bottlenecked by reading the array from main memory, such that the speed of the actual operation doesn't matter.
A more robust test involves log-normalization of a count matrix in RNA-seq data.
This involves computing log1p instead of log2 and/or multiplying by the reciprocal
(
On an Intel i7-8850H, we get:
$ ./build/normalize
log1p(x / f) / M: 0.150602 ± 0.00142801
log1p(x * r) * N: 0.133567 ± 0.00120476
log2(x / f + 1): 0.101818 ± 0.00119196
log2(x * r + 1): 0.0837252 ± 0.00128969
log(x / f + 1) / M: 0.0727775 ± 0.00114991
log(x * r + 1) * N: 0.062309 ± 0.00119778
On a Mac M2, we get:
$ ./build/normalize
log1p(x / f) / M: 0.0603784 ± 0.000133809
log1p(x * r) * N: 0.0546123 ± 0.000144244
log2(x / f + 1): 0.0523682 ± 5.85398e-05
log2(x * r + 1): 0.0448984 ± 6.94794-05
log(x / f + 1) / M: 0.0518799 ± 0.000129268
log(x * r + 1) * N: 0.0447853 ± 7.38861e-05
Division is significantly faster here. Now that we're doing a decent amount of work with the log-transformations, compute is now the bottleneck and the extra time spent in the divide is no longer masked.