(Replying to PARENT post)

Many years ago now we built something that used a 60-bit truncated hash of URIs. That's far too small to be confident you'll never see a collision, but it's big enough that at modest sizes (hundreds of millions up to a few billion URIs) a collision is unlikely. So you can have a slow path for the collision case, so long as that slow path is correct. For unit testing we needed at least one collision, so we could see that everything behaves as designed before some real user hits one. We had test data sets with over a billion URIs in them, but none that (as far as we could tell) contained a collision.

So I just wrote code to mint nonsense URIs for our system from a small integer seed (like https://foo.example/123456789), then spun up a machine with a bunch of RAM (maybe 8-12GB) to iterate through the seeds, storing each hash and the seed it was generated from, until it hit a collision after chewing through a few gigabytes of RAM, at which point it spat out the two URIs from the colliding seeds. Bingo: now we had data for unit testing to show that stuff works even in the unlikely case that two distinct URIs have the same hash.
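For anyone curious, here's roughly what that tool looked like, sketched in Python. The hash function (SHA-256 truncated to 60 bits) and the internals here are illustrative assumptions rather than what we actually ran, and a plain dict is nowhere near as compact as you'd want in practice, so treat this as the shape of the logic rather than the real thing:

    import hashlib

    def truncated_hash(uri: str, bits: int = 60) -> int:
        # Assumption: take the top `bits` bits of SHA-256; the real system's
        # hash function isn't specified here.
        digest = hashlib.sha256(uri.encode()).digest()
        return int.from_bytes(digest[:8], "big") >> (64 - bits)

    def find_collision():
        seen = {}  # truncated hash -> seed that produced it
        seed = 0
        while True:
            uri = f"https://foo.example/{seed}"
            h = truncated_hash(uri)
            if h in seen:
                # Two distinct URIs, same 60-bit hash: unit-test data found.
                return f"https://foo.example/{seen[h]}", uri
            seen[h] = seed
            seed += 1

    if __name__ == "__main__":
        print("colliding URIs:", *find_collision())

By the birthday bound you'd expect this to stop after roughly sqrt(2^60), i.e. about 2^30 seeds, so the table holds around a billion entries when it terminates, which is why one machine with a bunch of RAM was enough.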

Probably a day's work to build that, and then I'd guess several hours every month saved debugging weird collision bugs because our unit tests would now tell us if we'd screwed up and collisions were slipping past.

๐Ÿ‘คtialaramex๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

Figure you need at least 12 bytes per entry to store the ID and hash. That means you can handle only about one entry per 12 bytes of memory (so 1 billion entries if you have 12 billion bytes of RAM).

It's much more memory-efficient to use a Bloom filter. Build n different ways of hashing the hash (this can be as simple as taking some standard hash function H() and making the ith way H(h, i)). Then when you add an entry, you set a bit at each of the n positions those hashes give you.

Before setting the bits, check whether they were all already set. If they were, the entry being added is a candidate for a collision with some earlier entry, so stream it out to a file or something.

Once you have the (much smaller) list of a few million collision candidates, you do another iteration over the whole data set, checking each one against only the collision candidates.

The most memory-intensive part is the Bloom filter. You could probably get away with a Bloom filter size of 10 bits per entry or less. (If you want specific numbers, there are lots of online calculators to help you calibrate the size of a Bloom filter.)
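Here's a rough sketch of that first pass in Python, with illustrative numbers (about 10 bits per entry and k=4 derived positions); the H(h, i) construction here is just a standard hash salted with i:

    from hashlib import blake2b

    class BloomCandidateFilter:
        def __init__(self, expected_entries: int, bits_per_entry: int = 10, k: int = 4):
            self.m = expected_entries * bits_per_entry  # total bits in the filter
            self.k = k                                  # positions derived per entry
            self.bits = bytearray(self.m // 8 + 1)

        def _positions(self, h: int):
            # H(h, i): rehash the 60-bit value with salt i to get the ith bit position.
            for i in range(self.k):
                d = blake2b(h.to_bytes(8, "big"),
                            salt=i.to_bytes(1, "big"),
                            digest_size=8).digest()
                yield int.from_bytes(d, "big") % self.m

        def add_and_check(self, h: int) -> bool:
            # Set the k bits for h; return True if they were all set already,
            # i.e. h is a candidate for colliding with some earlier entry.
            already_set = True
            for pos in self._positions(h):
                byte, bit = divmod(pos, 8)
                if not self.bits[byte] & (1 << bit):
                    already_set = False
                    self.bits[byte] |= 1 << bit
            return already_set

Anything for which add_and_check() returns True gets streamed out as a candidate; the second pass then compares the whole data set against only that (much smaller) candidate list.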

But given this is a one-off search, and you had a machine with enough RAM to find a collision, an ordinary hash-based set implementation is good enough and saves the most precious resource of all, developer time :)

๐Ÿ‘คcsense๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

I embarrassed myself as a Noogler trying to explain what in my head seemed like a good tradeoff between hash size (also of URIs) and collision handling. The process had fixed storage for a table of known hashes, and every collision was tested for validity before taking action. If you could use a truncated hash, youโ€™d obviously increase your collisions, but if youโ€™re going to validate all collisions anyway, it seemed like you could increase your coverage substantially (in the 60-bit case it would have been 4x) while only marginally increasing your overhead of collision handling.

There's some kind of sparsity/entropy function involved with optimizing these two but I'm too clueless to sort it out. Without the vocabulary to make my case I just got weirded out by a bunch of smart people staring at me quizzically. Sooo..

๐Ÿ‘คjcims๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

Couldnโ€™t you also test the recovery logic with a hasher that yielded collision more often? Itโ€™s like take your system, make it worse (use very dumb hashing), and see how it degrades and/or breaks?
๐Ÿ‘คseanmcdirmid๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

Any reason the hash function and the collision handling can't be separated and mocked for unit tests? What if you decide to widen the hash or change the hash algorithm?
๐Ÿ‘คwbsun๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

This is one of those cases where it would have been easier to fake the hashing function in the unit tests with something that collides when you want. What happens if you want a test case where 3+ URLs hash to the same bucket?
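With a fake you can force exactly the collisions you want, e.g. something like this (a sketch; wiring it into the system under test via an injectable hash function is the part assumed here):

    def make_fake_hash(colliding, real_hash=hash):
        # Everything in `colliding` lands in bucket 0; everything else hashes normally.
        def fake_hash(uri):
            return 0 if uri in colliding else real_hash(uri)
        return fake_hash

    # Force a 3-way collision without having to hunt for real colliding inputs:
    trio = {"https://a.example/", "https://b.example/", "https://c.example/"}
    fake = make_fake_hash(trio)
    assert len({fake(u) for u in trio}) == 1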
๐Ÿ‘คkortilla๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

I have a question about how you generated collisions.

If each seed is a 32-bit int, and each hash is 60 bits, then you'll have to store 92 bits for each (seed, hash) pair.

With 2^32 (~4.3 billion) (seed, hash) pairs, you'd need 2^32 * 92 bits ~= 50 gigabytes.
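Spelled out:

    # Storing a (32-bit seed, 60-bit hash) pair for every one of the 2^32 seeds:
    total_bytes = 2**32 * (32 + 60) / 8
    print(f"{total_bytes / 1e9:.1f} GB")  # ~49.4 GB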

But you mention only needing 8-12GB. Is my calculation wrong, or have I missed something?

๐Ÿ‘คnulptr๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

Why not just run the test-cases using a 16-bit hash function?
๐Ÿ‘คstouset๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

Did you truncate the hash for dataset size considerations? If you associate data with each hash, I'd guess that data would dwarf the hash itself, so why not use 128- or 256-bit hashes?
๐Ÿ‘คAtlasBarfed๐Ÿ•‘5y๐Ÿ”ผ0๐Ÿ—จ๏ธ0