Essential insights from Hacker News discussions

Git-Annex

This discussion centers on two tools for managing large files under version control: git-annex and git-lfs (Git Large File Storage). Users share their experiences, compare features, and highlight perceived strengths and weaknesses of each.

Use Cases and Documentation Clarity

A notable theme is the appreciation for clear documentation that focuses on practical use cases.

  • "Happy to see use cases front and center in command line documentation. They seem to always start with 'obscure command flag that you'll probably never use'," observed EmilStenstrom.

Other users, by contrast, find git-annex's documentation overwhelming.

  • "But git-annex's documentation goes on and on about a bunch of commands I don't really want to read about," remarked andrewmcwatters. He contrasts this with git-fetch-file, stating, "those two lines and that .git-remote-files manifest just told you what git-fetch-file does."

git-annex vs. git-lfs Functionality and Philosophy

Much of the discussion compares the philosophies and capabilities of git-annex and git-lfs, particularly how each handles distribution and large files.

  • "Not at all. git-annex is for managing large files in git and unlike git-lfs it preserves the distributed nature of git," stated nolist_policy.
  • "LFS is for users developing something with git that has large files in the repo like the classic game development example. git-annex is something you'd use to keep some important stuff backed up which happens to involve large files, like a home folder with music or whatever in it. In my case I do the latter," explained seanparsons, highlighting a key difference in typical application.
  • "A good way to think about it is that git-annex is sort of a git-native and distributed solution to the storage problem at the "other side" ("server side") of something like LFS, and to reason about it from there," offered avar.
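
To make the comparison more concrete, here is a minimal Python sketch of the small, content-derived stand-ins each tool works with in place of the large file itself. The git-lfs pointer layout follows the published pointer-file format. The git-annex key assumes the SHA256E backend's `SHA256E-s<size>--<hash><ext>` naming. The helper names `lfs_pointer` and `annex_key` are hypothetical and not part of either tool.

```python
import hashlib
import os


def lfs_pointer(content: bytes) -> str:
    """Build the text of a git-lfs pointer file for the given content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def annex_key(content: bytes, filename: str) -> str:
    """Build a SHA256E-style key (assumed layout) for the given content."""
    digest = hashlib.sha256(content).hexdigest()
    ext = os.path.splitext(filename)[1]  # e.g. ".jpg"; empty if there is none
    return f"SHA256E-s{len(content)}--{digest}{ext}"


if __name__ == "__main__":
    data = b"pretend this is a large binary file"
    print(lfs_pointer(data), end="")
    print(annex_key(data, "photo.jpg"))
```

In both cases git itself only stores a short text record; the difference the commenters point to is where the real bytes live and who keeps track of them.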

There's also a sentiment that git-lfs could have been a more collaborative effort.

  • "GitHub really embraced the Microsoft-esque NIH with LFS, instead of adopting git-annex," commented fragmede, with keepamovin adding, "To its absolute detriment."

Performance and Scalability Issues with git-annex

Several users reported performance degradation when using git-annex with a large number of files or very large repositories.

  • "Everything was fine, but it starts to become increasingly slow, such that every operation takes several minutes (5-30 mins or so)," noted albertzeyer when managing a large photo collection. This led to speculation about the cause: "I wonder a bit whether that is ZFS, or git-annex, or maybe my disk, or sth else."
  • "My experience is the same, git-annex just doesn't work well with lots of small files," said warp. "With annexes on slow USB disks, connected to a Raspberry Pi 3 or 4, I'm already annoyed when working with my largest annex (in file count) of 25000 files."
  • "I had tested a git-annex repository with about 1.5M files and it got pretty slow as well. The plain git repo size grew to multiple GiB and plain git operations were super slow, so I think this was mostly a git limitation," shared matrss. They suggest a potential mitigation: "DataLad's approach of nested subdatasets (in practice git submodules where each submodule is a git-annex repository) can help, if it fits the data and workflows."

A potential cause for slowness was also attributed to the tool's inherent paranoia for data integrity.

  • "Annex isn't slow because it's written in Haskell, it tends to be slow because of I/O and paranoia that's warranted as the default behavior in a distributed backup tool," opined avar. Further clarification pointed to specific commands: "E.g. if you drop something it'll by default check the remotes it has access to for that content in real time, it can be many orders of magnitude faster to use --fast etc., to (somewhat unsafely) skip all that and trust whatever metadata you have a local copy of."
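
The trade-off avar describes can be illustrated with a toy model in Python. This is not git-annex's implementation: the `Remote` class, the `can_drop` function, the `required_copies` parameter, and the simulated latency are all assumptions made for illustration. The point is only that verifying copies costs one round trip per remote, while trusting locally cached location records costs none but may act on stale information.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Remote:
    """Hypothetical remote: holds a set of content keys and is slow to query."""
    name: str
    keys: set = field(default_factory=set)
    latency_s: float = 0.01  # simulated network or disk delay per check

    def has(self, key: str) -> bool:
        time.sleep(self.latency_s)  # stand-in for real I/O
        return key in self.keys


def can_drop(key, remotes, cached_locations, required_copies=1, fast=False):
    """Decide whether the local copy of `key` may be dropped.

    fast=False: ask every remote right now (safe, slow).
    fast=True:  trust the cached location records (quick, may be stale).
    """
    if fast:
        copies = len(cached_locations.get(key, set()))
    else:
        copies = sum(1 for r in remotes if r.has(key))
    return copies >= required_copies


if __name__ == "__main__":
    usb = Remote("usb", {"K1"})
    cloud = Remote("cloud", set())       # the content is no longer here...
    cached = {"K1": {"usb", "cloud"}}    # ...but the local records are stale

    # The verified check finds one real copy; the cached records claim two.
    print(can_drop("K1", [usb, cloud], cached, required_copies=2))             # False
    print(can_drop("K1", [usb, cloud], cached, required_copies=2, fast=True))  # True
```

With many files and slow remotes, the verified path multiplies the per-check delay by the number of files and remotes, which is consistent with the "I/O and paranoia" explanation above.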

The Role of Joey Hess and the git-annex Ecosystem

The creator of git-annex, Joey Hess, and his contributions are also a topic of discussion, with some users sharing links to talks and interviews highlighting his work.

git-annex for Personal Data Management vs. Collaboration

git-annex is often praised for its efficacy in personal data management, while its suitability for collaborative projects is questioned.

  • "I use git-annex to manage all my data on all my drives. It automatically keeps track of which files are on which drives, it ensures that there are enough copies and it checksums everything. It works perfectly with offline drives," shared nolist_policy about personal data management.
  • Munksgaard's experience aligns with this: "Git-Annex is a cool piece of technology, but my impression is that it works best for single-user repositories. So for instance, as @nolist_policy described in a sibling comment, managing all your personal files, documents, music, etc. across many different devices." They contrast this with collaboration: "I tried using it for syncing large files in a collaborative repository, and the use of "magic" branches didn't seem to scale well."

Conversely, git-lfs is seen as more oriented towards typical development workflows.

  • "I don't see any benefit of git-annex over LFS so far, I'm not even sure I could setup annex as easily," stated kajika91. They highlight ease of use: "Once you opened a .gitattributes from git-LFS you pretty much know all you need and you barely need any commands anymore." The transparency is also a factor: "Also I like how setting up a .gitattribute makes everything transparent the same way .gitignore works. I don't see any equivalent with git-annex."
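
For readers who have not seen one, the kind of `.gitattributes` entry kajika91 refers to looks like `*.psd filter=lfs diff=lfs merge=lfs -text`, which is the line that `git lfs track "*.psd"` writes. The Python sketch below approximates how such entries select paths. It is a simplification: real gitattributes matching has more rules than `fnmatch`, and the function names and sample patterns are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

SAMPLE_GITATTRIBUTES = """\
# Route large binary assets through git-lfs
*.psd filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(gitattributes_text: str) -> list:
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


def is_lfs_tracked(path: str, patterns: list) -> bool:
    """Approximate check: match only the file name against each pattern."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, p) for p in patterns)


if __name__ == "__main__":
    pats = lfs_patterns(SAMPLE_GITATTRIBUTES)
    for p in ["art/hero.psd", "src/main.c", "trailer.mp4"]:
        print(p, is_lfs_tracked(p, pats))
```

The whole routing policy lives in one committed text file, which is the `.gitignore`-like transparency the comment praises.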

Haskell Dependencies and System Administration Concerns

The implementation of git-annex in Haskell leads to a discussion about its dependency management and the user experience for system administrators.

  • "My only problem with git-annex is Haskell. I don't hate the language itself, but the sheer number of dependencies it has to install is staggering," expressed goku12. They elaborate on the issues: "Many of those dependencies are not used by anything else, or may be incompatible versions when more than one application uses it. The pain is when you install them using the system package manager."
  • They find the Haskell ecosystem's approach problematic: "But why do they treat it like everything starts and ends with Haskell? Sometimes, there are other priorities like system administration. None of the other compiled languages have this problem - Rust, Go, Zig, ... Even plain old C and C++ aren't this frustrating with dependencies."
  • Zeendo, a Haskell developer, shares a similar sentiment: "As a full time Haskell developer, I have a similar aversion to Haskell-based distro packages which aren't statically linked." They speculate on the reasons: "Clearly dynamic linking makes sense in a lot of cases for internal application distribution - which is where Haskell is often used - so maybe people are incorrectly projecting that onto distro packages?"

Integration with Cloud Storage and Remote Services

The ability of git-annex to integrate with various cloud storage providers and remote services is a point of interest, with some past issues and recent improvements being discussed.

  • "I've used this for years, but to me the big selling point was integration with cloud storage providers as a means of backup. That, however, was always flaky and dependent on unmaintained third-party plugins," shared andunie. They inquire about recent developments: "Does anyone know if the situation has improved on that front in the past 5 years?"
  • "Depends on the cloud storage provider, I think. The best chances are with those that support the more standard protocols like S3, webdav, sftp, etc. A relatively new development is the special remote built into rclone, which should be better maintained than some other third-party special remotes and provides access to all rclone-supported remotes," indicated matrss.
  • BrandiATMuhkuh's use case involves enterprise clients: "Does this also work if I have data on SharePoint, DropBox, etc. and want to pull them (sync with local machine)? My use case is mostly ETL related, where I want to pull all customers data (enterprise customer) so I can process them. But also keep the data updated, hence pull?"

git-annex's Specific Features and Alternatives

Users also discuss specific features of git-annex and potential alternative solutions for their data management needs.

  • "What it works really well at is storing research data. LFS can't upload to arbitrary webdav/S3/sharepoint/other random cloud service," highlighted aragilar.
  • "I don't really need the versioning aspect too much, but sometimes I modify the photos a bit (e.g. rotating or so). But all the other things are relevant for me, like having it distributed, syncing, only partially having the data on a particular node, etc. So, what solution would be better for that? In the end it seems that other solutions provide a similar set of features. E.g. Syncthing," questioned albertzeyer.
  • "I have thought about doing this in the past but ran into issues (one of them being the friction in permanently deleting files once added). I'd be curious how you use it if you have time to share," said Algernon, seeking insights on a specific limitation.

Performance and Disk I/O Considerations

The performance, particularly related to disk I/O and filesystem interactions, is a recurring concern.

  • "I am also always getting annoyed with one or the other, e.g. due to the hashing overhead, etc. However, in many cases the annoyances come with bad filesystem integration on Windows in my case," mentioned riedel.
  • "My guess is the windows virus scanner," offered rurban as a potential explanation for performance issues.
  • "One thing to check is whether any security/monitoring software might be causing issues. Since there are so many files in git repos, it can put a lot of load on that type of software," suggested egwor, pointing towards another potential bottleneck.

git-fetch-file as a Simpler Alternative

The utility git-fetch-file is presented as a more straightforward alternative for specific tasks.

  • "You can apparently do, sort of, but not really, the same thing git-fetch-file[1] does, with git-annex," andrewmcwatters stated, providing an example using git-fetch-file. He reiterated his preference for its brevity, noting that the two-line example and its .git-remote-files manifest were enough to show what git-fetch-file does, whereas git-annex's documentation covers many commands he does not want to read about.