👤ibobev🕑3y🔼151🗨️95

(Replying to PARENT post)

I haven't (yet) read the article, but I will. But the headline...

> Optimizing compilers reload vector constants needlessly

...is absolutely true. I wrote some code that just does bit management (shifting, or, and, xor, popcount) on a byte-level. Compiler produced vectorized instructions that provided about a 30% speed-up. But when I looked at it... it was definitely not as good as it could be, and one of the big things was frequently reloading/broadcasting constants like 0x0F or 0xCC or similar. Another thing it would do is to sometimes drop down to normal (not-SIMD) instructions. This was with both `-O2` and `-O3`, and also with `-march=native`

I ended up learning how to use SIMD intrinsics and hand-wrote it all... and achieved about a 600% speedup. The code reached about 90% of the bandwidth of the bus to RAM, which was what I theorized "should" be the limiting factor: bitwise operations like this are extremely fast, and the slowest part was popcount, which didn't have a native vector instruction on the hardware I was targeting (AVX2). This was with GCC 6.3 if I recall, about 5 years ago.
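
For a flavor of what such a hand-written routine can look like, here is a minimal AVX2 sketch (illustrative only, not the code described above) of a byte-level popcount using the nibble-lookup trick, with the broadcast constants hoisted out of the loop exactly once:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Illustrative sketch, not the commenter's code: count set bits in a buffer
// with AVX2. AVX2 has no vector popcount instruction, so each byte is split
// into nibbles and looked up in an in-register table via vpshufb.
std::uint64_t popcount_avx2(const std::uint8_t *buf, std::size_t n) {
    const __m256i lut = _mm256_setr_epi8(      // popcount of each nibble 0..15
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_nibbles = _mm256_set1_epi8(0x0F); // broadcast once, outside the loop

    __m256i acc = _mm256_setzero_si256();
    std::size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i lo = _mm256_and_si256(v, low_nibbles);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_nibbles);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        // horizontally sum the 32 byte-counts into four 64-bit lanes
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
    }

    std::uint64_t total = (std::uint64_t)_mm256_extract_epi64(acc, 0)
                        + (std::uint64_t)_mm256_extract_epi64(acc, 1)
                        + (std::uint64_t)_mm256_extract_epi64(acc, 2)
                        + (std::uint64_t)_mm256_extract_epi64(acc, 3);
    for (; i < n; ++i)                           // scalar tail
        total += (std::uint64_t)__builtin_popcount(buf[i]);
    return total;
}
```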

👤inetknght🕑3y🔼0🗨️0

(Replying to PARENT post)

Seems like the compiler puts the test for the first loop before loading the constant the first time, and therefore needs to load it again before the second loop. So the "tradeoff" is that if neither loop runs, it will load the constant zero times. Of course this isn't what a human would do, but at least there is some sliver of logic to it. (If vpbroadcastd were a 2000-cycle instruction, this pattern might have made sense.)
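
To make the pattern concrete, here is a minimal sketch of the situation being discussed (illustrative only, not taken from the article): two back-to-back loops sharing one vector constant, where the observed codegen broadcasts the constant before each loop instead of keeping it in a register.

```cpp
#include <immintrin.h>
#include <cstddef>

// Illustrative sketch, not the article's exact code: two consecutive loops
// that share one vector constant. The loop-count test is emitted before the
// broadcast, so compilers have been observed to re-issue vpbroadcastd ahead
// of the second loop rather than reuse the register filled for the first.
void add_one_to_both(int *a, std::size_t na, int *b, std::size_t nb) {
    const __m256i one = _mm256_set1_epi32(1); // becomes vpbroadcastd
    for (std::size_t i = 0; i + 8 <= na; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(a + i));
        _mm256_storeu_si256((__m256i *)(a + i), _mm256_add_epi32(v, one));
    }
    for (std::size_t i = 0; i + 8 <= nb; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(b + i), _mm256_add_epi32(v, one));
    }
}
```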
👤BoardsOfCanada🕑3y🔼0🗨️0

(Replying to PARENT post)

My experience with optimizing compilers is that the generated code is often frustratingly close to optimal (given that the source is well written, and taking into account the constraints of the target arch).

It is perfectly reasonable to take a look at the output on Godbolt, tweak it a bit and call it a day.

Maintaining a full assembly language version of the same code is rarely justifiable.

And yet, I understand the itch, especially because there is quite often some low-hanging fruit to grab.

👤stephc_int13🕑3y🔼0🗨️0

(Replying to PARENT post)

Moving the constant to file or anonymous namespace scope solves the issue. It's too bad that intrinsics are not `constexpr` because I have a powerful urge to hang a `constinit` in front of this line.
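
A minimal sketch of that workaround (the names and the 0x0F value are illustrative, not the commenter's code):

```cpp
#include <immintrin.h>

// Illustrative sketch of the workaround: hoist the vector constant to
// anonymous-namespace (file) scope so it is materialized once, rather than
// being re-broadcast before every loop that uses it. Whether this actually
// removes the reload depends on the compiler.
namespace {
const __m256i kLowNibbles = _mm256_set1_epi8(0x0F);
// What one would like to write, but intrinsics are not constexpr:
// constinit const __m256i kLowNibbles2 = _mm256_set1_epi8(0x0F); // does not compile
}
```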
👤jeffbee🕑3y🔼0🗨️0

(Replying to PARENT post)

Somewhat related:

Can anyone recommend a (re)introduction to modern vector programming on CPUs? I was last fluent in the SSE2 days, but an awful lot has happened since - and while I did go over the list of modern vector primitives (AVX2, not yet AVX-512), what I'm missing is the use cases - every such primitive has 4-5 common use cases that are the reason it was included, and I would really like to know what they are...

👤beagle3🕑3y🔼0🗨️0

(Replying to PARENT post)

The optimization here would be CSE or hoisting, or both? I'm guessing the problem is that those are performed prior to vectorization.

In other words, I suspect that an invariant calculation inside consecutive loops, as long as it is not vectorized, will be pulled out of the loops, moved ahead of them, and executed just once.
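
A scalar sketch of that suspicion (illustrative only, not verified against any particular compiler): the invariant below is hoisted and shared across both loops by the usual scalar passes, which is the treatment the vector broadcast apparently doesn't get.

```cpp
// Illustrative only: base * scale is loop-invariant and common to both loops,
// so LICM/CSE can compute it once ahead of them, effectively:
//     int k = base * scale;  // then k is used in both loops
void scale_both(int *a, int *b, int n, int base, int scale) {
    for (int i = 0; i < n; ++i) a[i] += base * scale;
    for (int i = 0; i < n; ++i) b[i] += base * scale;
}
```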

👤phkahler🕑3y🔼0🗨️0

(Replying to PARENT post)

Intel had an optimizing compiler that was amazing. But unless you were Intel-only, relying on it made life harder when you had to switch compilers for that platform.
👤JoeAltmaier🕑3y🔼0🗨️0

(Replying to PARENT post)

Side note: I really, really like the blog theme and the complete lack of bulk on this blog. No React.js, no 4500 NPM modules for a SPA, or other crap, and it loads instantly.

Guess what? It's jQuery.

👤exabrial🕑3y🔼0🗨️0

(Replying to PARENT post)

Maybe it's trying to avoid using SSE in the case where neither loop runs? SSE on some older platforms had a cost just from using it, so it seems possible.
👤foota🕑3y🔼0🗨️0

(Replying to PARENT post)

I have a theory that in another decade ML models will do this better than any optimizing compiler -- similar to a hand-written chess engine vs. AlphaGo.
👤marmada🕑3y🔼0🗨️0

(Replying to PARENT post)

Probably a holdover from older assumptions that using/loading a constant is free or cheap enough to be considered free.
👤dmitrygr🕑3y🔼0🗨️0