UUIDv4 Value Range and Collision Probability
The core discussion revolves around the number of possible values for a UUIDv4 and the probability of collision when generating multiple UUIDs. Several users clarify the math and offer practical considerations.
- Number of possible values: Commenters converge on the same figure. "A UUID v4 is a 128 bit number, but 4 bits are reserved to specify the version number and 2 more bits are reserved to specify the variant, which leaves 122 bits of randomness. That means it can take on 5 x 10^36 possible values." (NickHoff) and "UUID v4 has 122 random bits giving 2^122 possible values (~5.3×10^36)." (ethan_smith)
- Birthday paradox and collision probability: The "birthday paradox" is invoked to explain the non-intuitive probability of collisions: a collision becomes likely after generating only about the square root of the space's size, i.e. roughly 2^61 values for 122 random bits. As j-pb puts it, "a 128bit uuid gives you a 'good enough' 64bit distributed autoincrement." Several users then work through the collision probability with varying levels of rigor and approximation.
- Applications for which UUIDv4's may fall short: jandrewrogers argues that while the risk is small, UUIDv4 collision probability can be a real-world concern for large data models. "Some large data models can exceed this number of records, so you can't use probabilistic UUID naively in these cases e.g. one for every unique record. In data models that approach the 100T limit, UUID-like identifiers are typically generated deterministically to avoid this issue entirely."
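The figures quoted above are easy to verify directly. A minimal Python sketch, using the "simplest approximation" discussed later in the thread (p ≈ k²/2n, valid while p is small) and the 100T-record scale jandrewrogers mentions:

```python
# 122 random bits -> 2^122 possible UUIDv4 values.
n = 2 ** 122
print(f"{n:.3e}")  # 5.317e+36

# Birthday-paradox estimate: after generating k UUIDs, the collision
# probability is roughly p ~= k^2 / (2n), valid while p is small.
k = 10 ** 14  # 100T records, the scale jandrewrogers mentions
p = k ** 2 / (2 * n)
print(f"p ~= {p:.3e}")  # 9.404e-10
```

Even at 100 trillion records the naive estimate stays below one in a billion, which is why the concern in large data models is less about the raw math and more about entropy quality and record counts growing past this scale.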
Importance of Randomness and Potential Pitfalls
Several commenters stress the importance of a good random number generator (RNG) for UUIDv4 generation and point out ways that naive implementations can fail.
- Quality of the random number generator: "The question becomes, how bad is your random. If your random is not uniformly distributed, you might get duplication from bias. If your random is setup wrong and you get the same seeding multiple times, you might get duplication from that. If your random is good, the birthday math should hold." (toast0)
- Recurring Footgun: As jandrewrogers notes: "There have been many cases of UUIDv4 systems breaking in the wild because people naively or accidentally use weak entropy sources. This turns out to be a recurring footgun such that use of UUIDv4 is prohibited in some applications because you can't rely on people to implement it properly."
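The seeding footgun toast0 and jandrewrogers describe can be demonstrated in a few lines. This is an illustration only: `bad_uuid4` is a hypothetical, deliberately broken generator built on a seedable non-cryptographic PRNG (Python's real `uuid.uuid4` draws from the OS entropy source and does not have this problem):

```python
import random
import uuid

def bad_uuid4(rng: random.Random) -> uuid.UUID:
    """Illustration only: builds a v4-shaped UUID from a seedable,
    non-cryptographic PRNG -- the footgun described above."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)

# Two processes that accidentally seed with the same value (e.g. a
# coarse timestamp) emit identical "random" UUIDs:
a = bad_uuid4(random.Random(1700000000))
b = bad_uuid4(random.Random(1700000000))
assert a == b  # guaranteed duplicate despite 122 random bits
```

With a weak or repeated seed, the birthday math no longer applies at all: the collision probability is 1, not ~10^-18.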
Practicality of Approximations and Avoiding Overflows
The discussion touches on the computational challenges of dealing with very large numbers and the trade-offs involved in using approximations.
- Approximation Techniques: perihelions describes using Stirling's approximation for logarithms of factorials. "The quickest way to work around the numeric overflow issues is to use the Stirling approximation of the logarithm of the factorial..." Further on: "You can build on that to build good, composable approximations of any of the standard combinatorial functions, in log-space (and recover the approximations you want by simply exponentiating back). For example, if you've implemented an approximate ln-fac(n) ~ ln(fac(n)), you immediately get ln-choose, the logarithm of (n!)/(k!(n-k)!), as simply ln-fac(n) - ln-fac(k) - ln-fac(n-k). Fully composable: if ln-fac() is a numerically good approximation, then so is any reasonable sum or difference."
- Trade-offs of Implementations: rurban shares code snippets illustrating alternative implementations for collision estimation, highlighting their accuracy and speed trade-offs.
- Simplest Approximation Probably Good Enough: senderista posits: "I think in practice the simplest approximation is probably always good enough. The reason is that I don't think we care about collision probabilities larger than 50%, and the value of k that gives p = 0.5 is k = sqrt(n) according to the simplest approximation (p = k^2/(2n)), while p = 0.39 for k = sqrt(n) according to the exponential approximation (p = 1 - e^(-k^2/(2n))). The difference at the most extreme value we care about (p = 0.5) is small enough (~11%) that I think the simplest approximation should always suffice for practical applications."
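The log-space technique and the two approximations can be sketched together. This is a minimal Python version of perihelions' scheme, with the standard library's `math.lgamma` standing in for a hand-rolled Stirling approximation of ln(n!), plus senderista's comparison at the extreme point k = sqrt(n):

```python
import math

# Log-space combinatorics: work with ln(n!) so nothing overflows.
# math.lgamma(n + 1) plays the role of the Stirling approximation here.
def ln_fac(n: float) -> float:
    return math.lgamma(n + 1)

def ln_choose(n: float, k: float) -> float:
    # ln of n!/(k!(n-k)!), composed from ln_fac exactly as described
    return ln_fac(n) - ln_fac(k) - ln_fac(n - k)

# senderista's comparison at the extreme point k = sqrt(n):
n = 2.0 ** 122
k = math.sqrt(n)
p_simple = k ** 2 / (2 * n)               # = 0.5 exactly
p_exp = 1 - math.exp(-k ** 2 / (2 * n))   # = 1 - e^-0.5, about 0.393
print(p_simple, p_exp)
```

The ~11% absolute gap (0.5 vs 0.393) only opens up at collision probabilities nobody would tolerate in practice, which is the crux of senderista's argument.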
The Value of Curiosity and Meaningful Knowledge
A tangent emerges centered on the motivation behind writing technical pieces like the one that prompted the discussion, with commenters emphasizing the importance of intrinsic curiosity and learning for its own sake.
- Hobby Versus Profession: permalac asks how one comes to write about such topics: "What I mean is, is this a hobby where someone is passionate about something, or do some employers allow people to work on side projects?" richardwhiuk responds: "Almost certainly a hobby. Employer would probably want something like this on an employer blog so that they get the benefits." kittoes agrees: "Unless you're supremely lucky, this kind of stuff is a hobby. One wishes that weren't the case, but capitalism is what it is..."
- Focus on Experience, Not Just Outcome: munificent argues against focusing only on directly applicable knowledge. "It sounds like you're too focused on outcome and not enough on experience. It is a miserable life to treat everything like a chore done to earn some known, expected, concrete reward."
- Learning for the Sake of It: deepsun recounts a story of Feynman where, after achieving his life's goals, he focused on what he enjoyed doing (not for the value), ultimately winning a Nobel prize for his later curiosity.
- The Serendipity of Knowledge: kittoes encourages the original poster to "generally ignore whether something has direct value or not because that's not how knowledge works. For example, I once spent well over a month implementing the NthWeekday function using nothing but basic arithmetic operations... This hyper-specific problem has near-zero direct value, but it was THE project that sparked my passion for maths."
Specific Implementation Caveats
A few commenters point to specific implementations or situations where collisions are more likely than expected.
- Snowflake's HASH Function: Simon_O_Rourke mentions a collision encountered in Snowflake, which mjuarez attributes to Snowflake's non-cryptographic 64-bit hash function and recommends against using it for unique keys.
- Taylor approximation correction: meindnoch points out a flaw in the justification of a function approximation in "Appendix A: Supporting proofs for approximations".
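The Snowflake anecdote follows directly from the same birthday math at a much smaller bit width. A back-of-envelope sketch in Python, assuming (per mjuarez) a 64-bit hash output space:

```python
import math

# With a 64-bit hash there are n = 2^64 possible outputs.
n = 2.0 ** 64

# Rows needed for a 50% collision chance: k = sqrt(2 n ln 2).
k_half = math.sqrt(2 * n * math.log(2))
print(f"{k_half:.3e}")  # about 5.06e9 rows

# Even 1 billion rows already carries a noticeable risk:
k = 1e9
p = 1 - math.exp(-k ** 2 / (2 * n))
print(f"{p:.3%}")  # a few percent
```

A ~50% collision chance at roughly 5 billion rows is well within the reach of a large warehouse table, which is why a 64-bit non-cryptographic hash is a poor choice for unique keys even though 2^64 sounds enormous.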