(Replying to PARENT post)
It is perfectly reasonable to take a look at the output on Godbolt, tweak it a bit and call it a day.
Maintaining a full assembly language version of the same code is rarely justifiable.
And yet, I understand the itch, especially because there is quite often some low-hanging fruit to grab.
(Replying to PARENT post)
Can anyone recommend a (re)introduction to modern vector programming on CPUs? I was last fluent in the SSE2 days, but an awful lot has happened since. I did go over the list of modern vector primitives (AVX2, not yet AVX-512), but what I'm missing is use cases: every such primitive has 4-5 common use cases that are the reason it was included, and I would really like to know what they are.
(Replying to PARENT post)
In other words, I suspect that an invariant calculation that sits inside consecutive loops, but is not itself vectorized, will be pulled out of the loops, moved ahead of them, and executed just once.
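To make that concrete, here is a minimal sketch in C of what I mean. The function names, the `seed` parameter, and the constants are all made up for illustration. Whether a given compiler actually performs this hoisting across both loops depends on the compiler, its version, and the flags, so check the generated assembly rather than taking my word for it.

```c
#include <stddef.h>
#include <stdint.h>

/* As written: the same loop-invariant expression appears in both loops.
 * (Hypothetical example; names and constants are illustrative only.) */
void xor_naive(uint32_t *a, uint32_t *b, size_t n, uint32_t seed) {
    for (size_t i = 0; i < n; i++)
        a[i] ^= (seed * 2654435761u) ^ 0xCCCCCCCCu;
    for (size_t i = 0; i < n; i++)
        b[i] ^= (seed * 2654435761u) ^ 0xCCCCCCCCu;
}

/* Hand-hoisted: the invariant is computed once, ahead of both loops.
 * This is the shape I would expect an optimizer to aim for. */
void xor_hoisted(uint32_t *a, uint32_t *b, size_t n, uint32_t seed) {
    const uint32_t k = (seed * 2654435761u) ^ 0xCCCCCCCCu;
    for (size_t i = 0; i < n; i++)
        a[i] ^= k;
    for (size_t i = 0; i < n; i++)
        b[i] ^= k;
}
```

Both functions produce identical results; the only difference is how many times the invariant is spelled out in the source. Comparing the two on Godbolt at `-O2` is a quick way to see whether the compiler treats them the same.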
(Replying to PARENT post)
Guess what? It's jQuery.
(Replying to PARENT post)
> Optimizing compilers reload vector constants needlessly
...is absolutely true. I wrote some code that just does bit manipulation (shift, or, and, xor, popcount) at the byte level. The compiler produced vectorized instructions that gave about a 30% speed-up. But when I looked at the output, it was definitely not as good as it could be, and one of the big problems was that it frequently reloaded or re-broadcast constants like 0x0F or 0xCC. It would also sometimes drop down to normal (non-SIMD) instructions. This was with both `-O2` and `-O3`, and also with `-march=native`.
I ended up learning how to use SIMD intrinsics and hand-wrote it all, and achieved about a 600% speedup. The code reached about 90% of the performance of the bus to RAM, which was what I theorized "should" be the limiting factor: bitwise operations like this are extremely fast, and the slowest point was popcount, which didn't have a native vector instruction on the hardware I was targeting (AVX2). This was with GCC 6.3, if I recall, about 5 years ago.
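For anyone curious what that looks like, here is a rough sketch in C of a byte-level popcount with AVX2 intrinsics. To be clear, this is not my original code: it uses the well-known nibble-lookup trick (`_mm256_shuffle_epi8` as a 16-entry table), and the function names are invented for this post. It assumes x86-64 with GCC or Clang, since it relies on the `target` attribute and `__builtin_cpu_supports`. The point is that the `0x0F` mask and the lookup table are created once, outside the loop, so they can stay in registers.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference and tail handler: one byte at a time. */
static uint64_t popcount_scalar(const uint8_t *buf, size_t len) {
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += (uint64_t)__builtin_popcount(buf[i]);
    return total;
}

/* AVX2 path: 32 bytes per iteration via a per-nibble lookup table. */
__attribute__((target("avx2")))
static uint64_t popcount_avx2(const uint8_t *buf, size_t len) {
    /* Constants built once, before the loop. */
    const __m256i low_mask = _mm256_set1_epi8(0x0F);
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i lo = _mm256_and_si256(v, low_mask);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        /* Per-byte bit counts (0..8) from the two nibble lookups. */
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        /* Sum each group of 8 bytes into a 64-bit lane. */
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, zero));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    /* Remaining bytes (fewer than 32) go through the scalar path. */
    return total + popcount_scalar(buf + i, len - i);
}

/* Runtime dispatch so the code also runs on CPUs without AVX2. */
uint64_t popcount_bytes(const uint8_t *buf, size_t len) {
    if (__builtin_cpu_supports("avx2"))
        return popcount_avx2(buf, len);
    return popcount_scalar(buf, len);
}
```

Calling `popcount_bytes(buf, len)` on any byte buffer should give the same answer as the scalar loop. I make no performance claims for this sketch; measure on your own hardware and look at the disassembly to confirm the constants really do stay out of the loop.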