People Keep Inventing Prolly Trees

The Hacker News discussion revolved around several key themes concerning data structures and their implementation, particularly regarding content-defined chunking and tree-based indexing.

Reinvention and Naming of Data Structures

A significant portion of the discussion centered on the idea that the concept of "Prolly Trees" might not be entirely novel. Users pointed to existing work that shares similar characteristics, leading to a discussion about "reinvention" and the justification for new names.

Prior Art Exists: Several users highlighted earlier implementations and concepts that resemble Prolly Trees. One user noted, "This article does not mention Jumbostore (Kave Eshghi, Mark Lillibridge, Lawrence Wilcock, Guillaume Belrose, and Rycharde Hawkes) which used content defined chunking recursively on the chunk list of a content defined chunked file in 2007. This is exactly what a Prolly Tree is."
"Prolly Tree" Justification: The creator of the term "prolly tree" defended the name by distinguishing it from blob chunking. According to "aboodman": "The reason I thought a new name was warranted is that a prolly tree stores structured data (a sorted set of k/v pairs, like a b-tree), not blob data. And it has the same interface and utility as a b-tree. Is it a huge difference? No. A pretty minor adaptation of an existing idea. But still different enough to warrant a different name IMO."
"SuperMegaTree" as a Precursor: Another user humorously claimed their own "SuperMegaTree" was a predecessor, stating, "Amazing! all these people reinvented my SuperMegaTree!"

Content-Defined Chunking and Hashing

The core mechanism of content-defined chunking, often using rolling hashes, was a recurring topic. Users discussed its benefits, potential applications, and the underlying probabilistic nature of the balancing.

Natural Extension of Existing Ideas: "ChadNauseam" saw the tree version as a natural extension of rolling-hash based chunking: "I've been obsessed with rolling-hash based chunking since I read about it in the dat paper. I didn't realize there was a tree version, but it is a natural extension."
Probabilistic Balancing Explained: The term "probabilistically balanced" generated debate. "moomin" initially explained it as: "It means the balancing is content-dependent, and is normally balanced, but certain edge-case inputs may result in sub-optimal behaviour." This was further clarified by "judofyr" who contrasted it with deterministic guarantees: "Opposite of probabilistic is not deterministic in this context. This is not about «drawing a random number», but rather that balancing is dependent on the input data. «With high probability» here means «majority of the possible input data leads to a balanced structure»."
Hashing and Security Implications: "ChadNauseam" proposed a cryptosystem using chunking with keys derived from chunk hashes for deduplication without data exposure to the provider. "RainyDayTmrw" raised security concerns about this approach, noting: "What's your threat model? This has 'interesting'[3] properties. For example, given a file, the provider can figure out who has the file. Or, given a file, an arbitrary user can figure out if some other user already has the file."

Efficiency of Downloads and Data Transfer

A significant tangent emerged regarding the experience of downloading large files over the internet, with users discussing methods for resuming interrupted downloads and efficient partial updates.

Poor Download Experience: The inefficiency of resuming downloads was highlighted by "ChadNauseam": "Tangent: why is it that downloading a large file is such a bad experience on the internet? If you lose internet halfway through, the connection is closed and you're just screwed."
HTTP Range Requests as a Solution: Several users pointed out that HTTP Range Requests already address this issue. "Retr0id" stated: "HTTP Range Requests solve this without any clever logic, if mutually supported." "motorest" emphasized its understatement, linking to Mozilla's documentation and the relevant RFC.
Comparison to Bittorrent and rsync: "nicoburns" suggested Bittorrent as a relevant protocol, while "wakawaka28" argued that for single files, byte offset resumption (via Range Requests) is usually sufficient, and for partial updates, servers could offer rsync.
Historical Perspective on Downloads: One user, "1vuio0pswjnm7", offered a historical perspective: "This comment could only come from someone who never downloaded large files from the internet in the 1990s. Feels like heaven to me downloading today."

Data Structure Implementation and Optimization

Discussions also touched upon the practical aspects of implementing and optimizing these tree structures.

Efficiency of Updates: The question of whether editing a Prolly Tree requires full reconstruction was raised by "iamwil". "aboodman" clarified that efficient implementations would not do this: "Both noms and dolt, and presumably most other prolly tree implementations do what you propose. if they didn't, inserts would be terribly slow."
Merkle-fied Trees: "gritzko" provided context by defining Prolly Trees as "Merkle-fied B-trees, essentially" and mentioned related work with Merkle-fied LSM trees.
Discoverability of Data Structures: The challenge of finding specialized data structures was discussed, with "elric" asking, "How do people find specialised data structure that they aren't already aware of?" "Loranubi" shared a personal project cataloging data structures, and "donatj" suggested Wikipedia pages would be beneficial.

Specific Implementations and Related Work

Several specific projects and papers were referenced, underscoring the active development and exploration in this area.

The dat Paper: The "dat" paper was frequently mentioned as a reference for content-defined chunking. "aboodman" provided a link to it.
IPFS: "theLiminator" drew a parallel between the discussed concepts and IPFS.
Noms and Dolt: "aboodman" mentioned "noms" and "dolt" as examples of systems using Prolly Trees.
Jumbostore: "compressedgas" pointed to "Jumbostore" as an earlier implementation.
Other Related Structures: "gritzko" referenced his own work with "librdx" and "DISCONT," as well as "Ink&Switch's" work, and discussed the lineage of Causal Trees to "weave" and RGAs.
rsync Algorithm: The rsync algorithm and its use of rolling checksums were cited as a similar concept by "wmanley" when referencing older kernel mailing list discussions.