(Replying to PARENT post)
Much more memory-efficient to use a Bloom filter. Build n different ways of hashing each hash (this can be as simple as taking some standard hash function H() and making the ith way H(h, i)). Then, when you add an entry, you set a bit at each of those n locations.
Before setting the bits, though, check whether they were all already set. If they were, the entry being added is a candidate for a possible collision with some earlier entry, so stream it out to a file or something.
Once you have the (much smaller) list of a few million collision candidates, you do another iteration over the whole data set, checking each one against only the collision candidates.
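A minimal sketch of that two-pass idea in Python (the blake2b salt trick for deriving the ith hash, the bytearray bit store, and all the names are my own placeholders, not something from the parent; it also assumes the stream of hashes can be iterated twice, e.g. re-read from disk):

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, item):
            # The ith way of hashing is just H(item, i); item is the hash value as bytes.
            for i in range(self.num_hashes):
                digest = hashlib.blake2b(item, salt=i.to_bytes(8, "little")).digest()
                yield int.from_bytes(digest[:8], "little") % self.num_bits

        def add_and_check(self, item):
            # Set the item's bits; report whether they were ALL already set.
            already_seen = True
            for pos in self._positions(item):
                byte_idx, bit = divmod(pos, 8)
                if not self.bits[byte_idx] & (1 << bit):
                    already_seen = False
                    self.bits[byte_idx] |= 1 << bit
            return already_seen

    def find_collision_candidates(hashes, bloom):
        # Pass 1: stream every hash through the filter and keep the possible repeats
        # (in practice you'd spill these to a file rather than a list).
        return [h for h in hashes if bloom.add_and_check(h)]

    def confirm_collisions(hashes, candidates):
        # Pass 2: re-scan the whole data set, counting only the candidate values;
        # anything seen more than once is a genuine collision.
        counts = dict.fromkeys(candidates, 0)
        for h in hashes:
            if h in counts:
                counts[h] += 1
        return [h for h, n in counts.items() if n > 1]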
The most memory-intensive part is the Bloom filter. You could probably get away with a Bloom filter size of 10 bits per entry or less. (If you want specific numbers, there are lots of online calculators to help you calibrate the size of a Bloom filter.)
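For reference, these are the standard sizing formulas those calculators implement (a generic sketch, nothing specific to this problem):

    import math

    def bloom_params(bits_per_entry):
        # At the optimum, k = (m/n) * ln 2 hash functions, and the false-positive
        # rate is p = exp(-(m/n) * (ln 2)^2), where m/n is bits per entry.
        k = round(bits_per_entry * math.log(2))
        p = math.exp(-bits_per_entry * math.log(2) ** 2)
        return k, p

    # e.g. 10 bits per entry -> about 7 hash functions and a ~0.8% false-positive rate.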
But given that this is a one-off search and you had a machine with enough RAM to find a collision, an ordinary hash-based set implementation is good enough and saves the most precious resource of all: developer time :)
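i.e. just keep the hashes in a set and stop at the first repeat (a sketch, assuming they all fit in memory):

    def find_first_collision(hashes):
        # Remember every hash seen so far; return the first one that repeats.
        seen = set()
        for h in hashes:
            if h in seen:
                return h
            seen.add(h)
        return None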
(Replying to PARENT post)
There's some kind of sparsity/entropy function involved with optimizing these two but I'm too clueless to sort it out. Without the vocabulary to make my case I just got weirded out by a bunch of smart people staring at me quizzically. Sooo..
(Replying to PARENT post)
If each seed is a 32-bit int, and each hash is 60 bits, then you'll have to store 92 bits for each (seed, hash) pair.
With 2^32 (~4.3 billion) (seed, hash) pairs, you'd need 2^32 * 92 bits ~= 50 gigabytes.
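The arithmetic I'm doing, for reference:

    >>> 2**32 * 92 / 8 / 1e9    # bits -> bytes -> decimal gigabytes
    49.392123904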
But you mention only needing 8-12GB. Is my calculation wrong, or have I missed something?
(Replying to PARENT post)
So I just wrote code to mint nonsense URIs for our system from a small integer seed (like https://foo.example/123456789), then spun up a machine with a bunch of RAM (maybe 8-12GB) to iterate through the seeds, storing each hash along with the seed it was generated from. After chewing through a few gigabytes of RAM it hit a collision and spat out the two URIs from the colliding seeds. Bingo: now we had data for unit testing to show that stuff works even in the unlikely case that two distinct URIs have the same hash.
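The core loop was roughly this shape (hash_fn here is a stand-in for whatever hash the real system uses, and the URI pattern is the nonsense one above):

    def find_colliding_uris(hash_fn, base="https://foo.example/"):
        # Walk the integer seeds, remembering which seed produced each hash,
        # until two different seeds map to the same hash value.
        seen = {}  # hash -> seed
        seed = 0
        while True:
            uri = base + str(seed)
            h = hash_fn(uri)
            if h in seen:
                return base + str(seen[h]), uri   # the two colliding URIs
            seen[h] = seed
            seed += 1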
Probably a day's work to build that, and then I'd guess several hours saved every month debugging weird collision bugs, because our unit tests would now tell us if we'd screwed up and collisions were slipping past.