as noted by @casparvl here: #268 (comment)
- the herk and hemm routines have much worse multi-threaded performance than the other routines.
- their performance is the same as the single-threaded performance, while for the other routines performance scales with the number of threads
- this bad perf seems to happen only for 2023a; 2022b and 2024a are fine.
as can be seen in the output here: #268 (comment), the issue seems to be with larger matrices: initially the perf increases with increasing matrix size, but from 1400 x 1400 onwards it suddenly breaks down.
would be good to check the OpenBLAS repo to see if this is a known regression
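
a standalone reproducer might help when searching for or reporting this upstream. The following is only a minimal sketch, not the benchmark that produced the numbers in #268. It assumes NumPy and SciPy are available and linked against the OpenBLAS build under test, uses SciPy's `zherk` wrapper as a stand-in for the herk routine, and takes roughly `4 * n^2 * k` real flops for a complex rank-k update as the flop estimate. The script name `herk_repro.py` and the helper names are hypothetical.

```python
import sys
import time


def herk_gflops(n, k, seconds):
    """Approximate GFLOP/s for a complex rank-k update C = A * A^H, A being n x k.

    Assumption: roughly 4 * n^2 * k real floating-point operations.
    """
    return 4.0 * n * n * k / seconds / 1e9


def best_time(func, repeats=3):
    """Return the shortest wall-clock time (in seconds) over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def run(sizes):
    # Imported here so the helpers above stay usable without NumPy/SciPy.
    import numpy as np
    from scipy.linalg.blas import zherk

    rng = np.random.default_rng(0)
    for n in sizes:
        # Fortran-ordered complex input, so the BLAS wrapper does not need to copy it.
        a = np.asfortranarray(
            rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        )
        elapsed = best_time(lambda: zherk(1.0, a))
        print(f"n={n:5d}  zherk: {herk_gflops(n, n, elapsed):8.2f} GFLOP/s")


if __name__ == "__main__":
    # Matrix sizes come from the command line, e.g.:
    #   python herk_repro.py 1000 1200 1400 1600
    sizes = [int(arg) for arg in sys.argv[1:]]
    if sizes:
        run(sizes)
```

running it once with `OPENBLAS_NUM_THREADS=1` and once with a higher thread count (or with `OMP_NUM_THREADS`, depending on how OpenBLAS was built), for sizes on either side of 1400, should show whether the breakdown reproduces outside the test suite.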