Computing Adler32 Checksums at 41 GB/s

👤wooosh🕑3y🔼98🗨️27

(Replying to PARENT post)

Nice! (I've been meaning to write up this Apple M1 ~60GB/s version, which I think is similar: https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84... )

👤dougall🕑3y🔼0🗨️0

(Replying to PARENT post)

Here's another SIMD implementation, with commentary: https://github.com/google/wuffs/blob/main/std/adler32/common...

Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.

"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32. https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...
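For anyone following along without the article open, here is a minimal scalar sketch of the checksum being benchmarked. It reduces modulo 65521 once per block rather than once per byte, which is the general "deferred modulo" idea; I am assuming, not claiming, that this is close to what the article calls defer32. The block size 5552 is zlib's NMAX constant; the function names are made up for illustration.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16
NMAX = 5552        # zlib's block size: the sums fit in 32 bits for this many bytes


def adler32_deferred(data: bytes) -> int:
    """Scalar Adler-32 that takes the modulo once per NMAX-byte block."""
    a, b = 1, 0
    for start in range(0, len(data), NMAX):
        for byte in data[start:start + NMAX]:
            a += byte  # running sum of bytes (plus the initial 1)
            b += a     # running sum of the running sums
        a %= MOD_ADLER
        b %= MOD_ADLER
    return (b << 16) | a


def matches_zlib(data: bytes) -> bool:
    """Hypothetical helper: cross-check the sketch against zlib.adler32."""
    return adler32_deferred(data) == zlib.adler32(data)
```

Python integers never overflow, so the block size only matters here as a mirror of what a fixed-width C or SIMD version has to respect; the SIMD versions linked in this thread apply the same trick across many lanes at once.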

👤nigeltao🕑3y🔼0🗨️0

(Replying to PARENT post)

Ooh now that is very interesting. I would really love to see how this speeds up the run-time of fpng as a whole, if you have any numbers. It looks like fjxl [0] and fpnge [1] (which also uses AVX2) are at the Pareto front for lossless image compression right now [2], but if this speeds things up significantly then it's possible there'll be a huge shakeup!

[0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...

[1] https://github.com/veluca93/fpnge

[2] https://twitter.com/richgel999/status/1485976101692358656

👤pizza🕑3y🔼0🗨️0

(Replying to PARENT post)

Note that libdeflate has used essentially the same method since 2016 (https://github.com/ebiggers/libdeflate/blob/v0.4/lib/adler32...), though I recently switched it to use a slightly different method (https://github.com/ebiggers/libdeflate/blob/v1.12/lib/x86/ad...) that performs more consistently across different families of x86 CPUs.

👤ebiggers🕑3y🔼0🗨️0

(Replying to PARENT post)

Does anyone have any recommendations for checksumming algorithms in greenfield systems? It seems like there’s lots of innovation in cryptographically secure hash functions. But I have a greenfield project where I need checksums but don’t care about cryptographic properties. Is CRC32c still a good choice or has the industry moved on?

👤josephg🕑3y🔼0🗨️0

(Replying to PARENT post)

While micro-optimizations are interesting, there are two questions left unanswered:

- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.

- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?

👤TAForObvReasons🕑3y🔼0🗨️0

(Replying to PARENT post)

>diminishing returns especially due to it working faster than the speed of my RAM (2667MT/s * 8 = ~21 GB/s).

That sounds kinda slow; is there only 1 DIMM in the slots? I remember benchmarking 40 GiB/s read speed on an older system that had 2 dual-rank DIMMs (4 ranks in total).

I'd expect 3200 Mbit/s * (64 data lines) * (2 memory channels) = ~48 GiB/s on a typical DDR4 desktop, and a lot more with overclocked RAM.
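To make the arithmetic concrete, here is a rough sketch of the theoretical peak calculation. It assumes a 64-bit data bus per channel and that the quoted 2667 MT/s * 8 figure describes a single channel; the function name and parameters are made up for illustration, and real sustained bandwidth will be lower than these peaks.

```python
def peak_bandwidth_bytes_per_s(mega_transfers_per_s: int,
                               bus_width_bits: int = 64,
                               channels: int = 1) -> int:
    """Theoretical peak DRAM bandwidth: transfers/s * bytes per transfer * channels."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1_000_000 * bytes_per_transfer * channels


# The figure quoted from the article: 2667 MT/s on one 64-bit channel (assumed).
single_channel = peak_bandwidth_bytes_per_s(2667)

# The figure above: DDR4-3200 on two 64-bit channels.
dual_channel = peak_bandwidth_bytes_per_s(3200, channels=2)

print(f"single channel: {single_channel / 1e9:.1f} GB/s")
print(f"dual channel:   {dual_channel / 1e9:.1f} GB/s "
      f"({dual_channel / 2**30:.1f} GiB/s)")
```

The dual-channel number comes out to 51.2 GB/s, which is the ~48 GiB/s above once you divide by 2^30 instead of 10^9.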

Great writeup either way.

👤NavinF🕑3y🔼0🗨️0

(Replying to PARENT post)

I hope this brilliant work has been merged into the relevant open source libraries.

Something that’s unfair about the world is that work like this could reach billions of people and save a million dollars’ worth of time and electricity annually, yet it is being done gratis.

It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…

👤jiggawatts🕑3y🔼0🗨️0

(Replying to PARENT post)

I love this kind of writeup. This is my idea of fun: speedups.

👤daniel-cussen🕑3y🔼0🗨️0

(Replying to PARENT post)

zlib-ng also has adler32 implementations optimized for various architectures: https://github.com/zlib-ng/zlib-ng

Might be interesting to benchmark their implementation too to see how it compares.

👤profquail🕑3y🔼0🗨️0