👤wooosh🕑3y🔼98🗨️27

(Replying to PARENT post)

Nice! (I've been meaning to write up this Apple M1 ~60GB/s version, which I think is similar: https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84... )
👤dougall🕑3y🔼0🗨️0

(Replying to PARENT post)

Here's another SIMD implementation, with commentary: https://github.com/google/wuffs/blob/main/std/adler32/common...

Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.
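To make the "N bytes at a time" idea concrete, here is a rough scalar sketch in Python of the identity that block-wise versions rely on. It is not Wuffs' or fpng's actual code; the function name `adler32_blockwise` and the `BLOCK`/`WEIGHTS` constants are made up for illustration. For a block x_0..x_{n-1}, `a` grows by the plain sum of the bytes, and `b` grows by n*a plus a weighted sum with weights n, n-1, ..., 1, which is the part a SIMD multiply-add can compute in one go.

```python
import zlib

MOD = 65521                           # largest prime below 2**16
BLOCK = 32                            # bytes consumed per step (illustrative)
WEIGHTS = list(range(BLOCK, 0, -1))   # 32, 31, ..., 1


def adler32_blockwise(data: bytes) -> int:
    """Adler-32 computed one BLOCK at a time (hypothetical helper)."""
    a, b = 1, 0
    full = len(data) - len(data) % BLOCK
    for off in range(0, full, BLOCK):
        chunk = data[off:off + BLOCK]
        # b absorbs the old a once per byte, plus each byte weighted by
        # how many later positions in the block still "see" it.
        b = (b + BLOCK * a + sum(w * x for w, x in zip(WEIGHTS, chunk))) % MOD
        a = (a + sum(chunk)) % MOD
    # Leftover tail shorter than BLOCK: plain byte-by-byte update.
    for x in data[full:]:
        a = (a + x) % MOD
        b = (b + a) % MOD
    return (b << 16) | a


# Sanity check against the reference implementation in the standard library.
sample = bytes(range(256)) * 5 + b"tail"
assert adler32_blockwise(sample) == zlib.adler32(sample)
```

A real SIMD version would keep `a` and `b` in vector lanes and do the weighted sum with a multiply-add instruction; the arithmetic is the same.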

"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32. https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...
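For anyone who hasn't seen the deferral trick, a minimal scalar sketch follows. It assumes "defer32" refers to postponing the modulo while accumulating in 32-bit sums, which is my reading rather than something confirmed here. The name `adler32_deferred` is hypothetical; `NMAX = 5552` is the block size zlib uses for this purpose. Python integers never overflow, so the chunking only mirrors what a C version would need.

```python
import zlib

MOD = 65521   # largest prime below 2**16
NMAX = 5552   # zlib's bound: most bytes that can be summed before
              # 32-bit unsigned accumulators could overflow


def adler32_deferred(data: bytes, start: int = 1) -> int:
    """Adler-32 with the modulo applied once per NMAX-byte chunk."""
    a = start & 0xFFFF
    b = (start >> 16) & 0xFFFF
    for off in range(0, len(data), NMAX):
        for byte in data[off:off + NMAX]:
            a += byte      # no modulo in the hot loop
            b += a
        a %= MOD           # reduce once per chunk instead of per byte
        b %= MOD
    return (b << 16) | a


# Sanity check against the standard library on a multi-chunk input.
sample = bytes(range(256)) * 100
assert adler32_deferred(sample) == zlib.adler32(sample)
```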

👤nigeltao🕑3y🔼0🗨️0

(Replying to PARENT post)

Ooh, now that is very interesting. I would really love to see how this speeds up the run-time of fpng as a whole, if you have any numbers. It looks like fjxl [0] and fpnge [1] (which also uses AVX2) are at the Pareto front for lossless image compression right now [2], but if this speeds things up significantly then it's possible there'll be a huge shakeup!

[0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...

[1] https://github.com/veluca93/fpnge

[2] https://twitter.com/richgel999/status/1485976101692358656

👤pizza🕑3y🔼0🗨️0

(Replying to PARENT post)

Note that libdeflate has used essentially the same method since 2016 (https://github.com/ebiggers/libdeflate/blob/v0.4/lib/adler32...), though I recently switched it to use a slightly different method (https://github.com/ebiggers/libdeflate/blob/v1.12/lib/x86/ad...) that performs more consistently across different families of x86 CPUs.
👤ebiggers🕑3y🔼0🗨️0

(Replying to PARENT post)

Does anyone have any recommendations for checksumming algorithms in greenfield systems? It seems like there’s lots of innovation in crypto secure hashing functions. But I have a greenfield project where I need checksums but don’t care about crypto properties. Is CRC32c still a good choice or has the industry moved on?
👤josephg🕑3y🔼0🗨️0

(Replying to PARENT post)

While micro-optimizations are interesting, there are two questions left unanswered:

- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.

- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?

👤TAForObvReasons🕑3y🔼0🗨️0

(Replying to PARENT post)

>diminishing returns especially due to it working faster than the speed of my RAM (2667MT/s * 8 = ~21 GB/s).

That sounds kinda slow; is there only 1 DIMM in the slots? I remember benchmarking 40GiB/s read speed on an older system that had 2 dual-rank DIMMs (4 ranks in total).

I'd expect 3200 Mbit/s * (64 data lines) * (2 memory channels) = ~48 GiB/s on a typical DDR4 desktop, and a lot more with overclocked RAM.
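Spelling out that arithmetic as a quick sketch (theoretical peak only, ignoring refresh and other overheads; the helper name `peak_bandwidth_bytes_per_s` is made up, and the 64-bit channel width is the usual DDR4 figure assumed above):

```python
def peak_bandwidth_bytes_per_s(mega_transfers: int,
                               bus_width_bits: int = 64,
                               channels: int = 1) -> int:
    """Theoretical peak DRAM bandwidth in bytes per second."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers * 1_000_000 * bytes_per_transfer * channels


# The article's figure: one 64-bit channel at 2667 MT/s.
single = peak_bandwidth_bytes_per_s(2667)
print(f"{single / 1e9:.1f} GB/s")       # 21.3 GB/s

# Dual-channel DDR4-3200.
dual = peak_bandwidth_bytes_per_s(3200, channels=2)
print(f"{dual / 1e9:.1f} GB/s")         # 51.2 GB/s
print(f"{dual / 2**30:.1f} GiB/s")      # 47.7 GiB/s
```

So the quoted 2667 * 8 matches a single channel, which is why it looks low next to a dual-channel setup.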

Great writeup either way.

👤NavinF🕑3y🔼0🗨️0

(Replying to PARENT post)

I hope this brilliant work has been merged into the relevant open source libraries.

Something that’s unfair about the world is that work like this could reach billions of people and save a million dollars' worth of time and electricity annually, but is being done gratis.

It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…

👤jiggawatts🕑3y🔼0🗨️0

(Replying to PARENT post)

I love this kind of writeup. This is my idea of fun: speedups.
👤daniel-cussen🕑3y🔼0🗨️0

(Replying to PARENT post)

zlib-ng also has adler32 implementations optimized for various architectures: https://github.com/zlib-ng/zlib-ng

Might be interesting to benchmark their implementation too to see how it compares.

👤profquail🕑3y🔼0🗨️0