(Replying to PARENT post)
Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.
"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32. https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...
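For anyone who hasn't looked at the checksum itself, here is a minimal scalar sketch of the deferred-modulo idea, assuming that is what "defer32" refers to (the modulo is applied once per block instead of once per byte). The function name `adler32_deferred` is made up for illustration; `NMAX = 5552` is the block size zlib uses so that 32-bit accumulators cannot overflow. It is not the Wuffs, fpng, or article code, just the plain algorithm checked against Python's `zlib.adler32`.

```python
import zlib

BASE = 65521  # largest prime below 2**16, the Adler-32 modulus
NMAX = 5552   # zlib's block size: max bytes summed before a 32-bit accumulator could overflow


def adler32_deferred(data: bytes, adler: int = 1) -> int:
    """Adler-32 with the modulo deferred to once per NMAX-byte block (hypothetical helper)."""
    a = adler & 0xFFFF          # running sum of bytes
    b = (adler >> 16) & 0xFFFF  # running sum of the successive values of a
    for start in range(0, len(data), NMAX):
        for byte in data[start:start + NMAX]:
            a += byte
            b += a
        # Reduce once per block rather than once per byte.
        a %= BASE
        b %= BASE
    return (b << 16) | a


sample = bytes(range(256)) * 100
print(adler32_deferred(sample) == zlib.adler32(sample))
```

The SIMD versions discussed here keep the same block structure but compute the per-block sums of `a` and `b` over 16 or 32 bytes per iteration instead of one byte at a time.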
(Replying to PARENT post)
[0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...
[1] https://github.com/veluca93/fpnge
[2] https://twitter.com/richgel999/status/1485976101692358656
(Replying to PARENT post)
- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.
- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?
(Replying to PARENT post)
That sounds kinda slow; is there only 1 DIMM in the slots? I remember benchmarking 40 GiB/s read speed on an older system that had 2 dual-rank DIMMs (4 ranks in total).
I'd expect 3200 Mbit/s * (64 data lines) * (2 memory channels) = ~48 GiB/s on a typical DDR4 desktop, and a lot more with overclocked RAM.
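The arithmetic behind that estimate, as a quick sketch. The helper name `peak_bandwidth_bytes_per_s` is made up for illustration, and the numbers are the theoretical peak for the assumed configuration (DDR4-3200, 64 data lines per channel, dual channel), not a measurement:

```python
def peak_bandwidth_bytes_per_s(bits_per_s_per_line: float,
                               data_lines: int = 64,
                               channels: int = 2) -> float:
    """Theoretical peak memory bandwidth in bytes per second (hypothetical helper)."""
    return bits_per_s_per_line * data_lines * channels / 8


peak = peak_bandwidth_bytes_per_s(3200e6)  # 3200 Mbit/s per data line
print(f"{peak / 1e9:.1f} GB/s")     # decimal gigabytes per second
print(f"{peak / 2**30:.1f} GiB/s")  # binary gibibytes per second
```

That works out to 51.2 GB/s, or about 47.7 GiB/s, which is where the ~48 GiB/s figure comes from.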
Great writeup either way.
(Replying to PARENT post)
Something that’s unfair about the world is that work like this could reach billions of people and save a million dollars worth of time and electricity annually but is being done gratis.
It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…
(Replying to PARENT post)
Might be interesting to benchmark their implementation too to see how it compares.