Essential insights from Hacker News discussions

Anna's Archive: An Update from the Team

The Hacker News discussion about Anna's Archive reveals several prominent themes:

Censorship and Accessibility

A significant portion of the conversation revolves around the site being blocked. Commenters tie this to legal orders and ISP-level filtering, with reports from Belgium, the UK, and the Netherlands. The blocks take different technical forms: an HTTP 451 response served by Cloudflare, failed DNS resolution, and connection resets.

  • "When accessing from Belgium the link is blocked by Cloudflare: Error HTTP 451 Unavailable For Legal Reasons"
  • "In response to a legal order, Cloudflare has taken steps to limit access to this website through Cloudflare's pass-through security and CDN services within Belgium"
  • "Hmm. Even the title link above doesn't work for me on Virgin's cable, in the UK"
  • "Nope, it just takes forever, then eventually shows a blank screen..."
  • "I'm unable to resolve the domain on EE UK - looks like it's DNS blocked."
  • "By comparison, on my work network (TalkTalk) I can resolve the domain but I get a connection reset from the site."
  • "I think this might be the first time I've hit a DNS block. It feels rather eerie seeing people talking about a site that, from my point of view, doesn't even exist..."
  • "There's an inconsistent censoring of numerous websites across the UK. In short, the biggest ISPs (a list which changes over time), will block various sites (TPB, libgen, AA, and others), based on court orders taken out at different times"
  • "Set proton VPN to Albania and enjoy the full internet is my experience."
  • "Yep blocked by Ziggo in NL as well"
  • "Whenever I'm in the Netherlands I need to set my DNS to 1.1.1.1 or similar, lots of blocks."
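
The reports above describe at least four distinct symptoms: a DNS lookup that fails, a connection reset, a request that hangs, and an explicit HTTP 451. A minimal sketch of how one might tell them apart from the client side follows. The function name `classify_access`, its labels, and the injectable `resolve` and `fetch` parameters are illustrative assumptions, not something from the thread, and a label only describes the symptom; it cannot prove who caused it.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request


def _label_for(error):
    """Map a low-level network error to a coarse symptom label."""
    if isinstance(error, ConnectionResetError):
        return "connection-reset"
    if isinstance(error, (TimeoutError, socket.timeout)):
        return "timeout"
    return "other-error"


def classify_access(url, resolve=socket.getaddrinfo,
                    fetch=urllib.request.urlopen, timeout=10):
    """Return a rough label for how a request to `url` succeeds or fails.

    `resolve` and `fetch` are injectable (hypothetical design choice) so the
    logic can be exercised without touching the network.
    """
    host = urllib.parse.urlsplit(url).hostname
    try:
        resolve(host, 443)  # a DNS-level block typically fails here
    except socket.gaierror:
        return "dns-failure"
    try:
        with fetch(url, timeout=timeout):
            return "reachable"
    except urllib.error.HTTPError as err:
        # 451 is the explicit "Unavailable For Legal Reasons" response
        return "http-451" if err.code == 451 else f"http-{err.code}"
    except urllib.error.URLError as err:
        # urllib wraps most connection errors; inspect the underlying reason
        return _label_for(err.reason)
    except OSError as err:
        # errors raised while reading the response may arrive unwrapped
        return _label_for(err)
```

Calling `classify_access("https://example.org/")` with the defaults performs a real lookup and request, so results will differ by network and resolver.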

The Nature of HTTP Error Code 451

The discussion briefly touches on the HTTP status code 451, its origin, and its relevance in the context of the site being blocked.

  • "I actually didn't know there were more error codes beyond error code 429"
  • "There's "431 Request Header Fields Too Large" which you will see occasionally. But after that 451 is the only other 400-level error code above 429. It was chosen as a reference to the book Fahrenheit 451."
  • "451 is kind of a novelty code, its meaning being related to Bradbury's "Fahrenheit 451" SciFi novel."
  • "Oh! You'll love this: 418 I'm a teapot"
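
The claim that 431 and 451 are the only 4xx codes above 429 can be checked against the status table shipped with a language runtime. The sketch below uses Python's standard `http.HTTPStatus` enum; it reflects whatever codes the installed Python version knows about (451 requires Python 3.7 or later), so it is a convenience check and not the authoritative registry.

```python
from http import HTTPStatus

# Client-error (4xx) codes numerically above 429 known to this Python version.
above_429 = [status for status in HTTPStatus if 429 < status.value < 500]

for status in above_429:
    print(status.value, status.phrase)

# Look up 451 directly by its numeric value.
legal_block = HTTPStatus(451)
print(legal_block.name, "-", legal_block.phrase)
```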

Cloudflare's Role

Users express surprise at Cloudflare's involvement in blocking access, questioning its role as a filter on individual connections rather than just a protective layer in front of sites.

  • "I thought last I heard they'd arrested the guy who was suspected of running the site, about a year or so ago. Guess I'm misremembering."
  • "Also I'm surprised Cloudflare hasn't shut them down like they do for other dodgy sites."
  • "Man, I thought cloudflare stood in front of individual sites. When did they start becoming a filter on an individual’s web connections?"

The Need for Internet Redesign and Resilience

A broader theme emerges regarding the internet's vulnerability to various "attacks" and the perceived lack of initiatives to redesign it for better resilience. This includes concerns about DDoS, spam, surveillance laws, and LLM scraping.

  • "The entire internet needs to be re-designed to stand up against attacks. - DDOS attacks - Spamming - UK like surveillance laws - LLM scraping Why is it that there is almost no initiative for this?"
  • "because they will come after new design? how do you not see this?"
  • "It starts with one"
  • "I'll start the wiki"
  • "I'll design the logo!"
  • "I'll make a GUI in Visual Basic!"
  • "I'll bring my axe!"
  • "i'll make snacks"
  • "Decentralization and interoperability, including the TCP routing protocols, give the network the ability to grow freely, but make these kinds of attacks easier. The easiest way to mitigate those problems would be to decrease the openness and centralize more. It might lead to even worse things than DDOS."
  • ""Be the change you want to see in the world""
  • "I fully agree. It's difficult though because I genuinely believe that the solution space overlaps with cryptography, which is quickly discounted as viable option because it is now laden with negative connotations."
  • "nah. cryptography is not seriously held back by cryptocurrency"
  • "Cryptography has negative connotations? Like what? Do you mean cryptocurrency by any chance? (If so, it's feasible to practice cryptography without touching cryptocurrency)."
  • "In my bubble: - DRM. - Owner-unfriendly device locks (such as manufacturer-controlled secure boot or locked-down OSes). - Inability to audit network traffic from one's own devices, i.e. an IoT device. - Remote attestation, when in opposition to open computing. I could also see folks seeing the use of cryptography as "having something to hide" - I don't personally agree."
  • "The Internet has been redesigned. It's just not been redesigned with your interests in mind and at least some of the "attacks" are features to the right people."
  • "There are, but they each have their tradeoffs. Proof of work and micropayments (eg. Xanadu or Internet Mail 2000) schemes solve spamming and LLM scraping, but are more expensive or more CPU-intensive. P2P systems like FreeNet too, but they are harder to use and more storage intensive and make it easier to spy on individual users. Tor solves UK-like surveillance laws but it's slower and makes it easier to spam."
  • "The precursor to BitCoin was this interesting project called HashCash. It was built to combat email spam and forced the sender to spend compute solving a moderate hash and put it in the header. The person who receives the email can prove easily if the sender "paid" the cost."
  • "RFC-3514 [1] proposed an effective solution against attacks. So see, there are initiatives, but people treat it as a joke, maybe because of when it was released."
  • "the problem is that anybody who does that work will be targeted very quickly by the people in power. even if it's decentralised, it'll be banned one way or another and you'll be hunted down."
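
The HashCash comment above describes an asymmetric cost: the sender burns compute to find a value whose hash meets a target, and the receiver checks it with a single hash. The sketch below illustrates that idea only. It is not the real Hashcash stamp format (which uses SHA-1 and a structured header); the names `mint`, `verify`, and `leading_zero_bits`, the `resource:nonce` encoding, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(resource: str, nonce: int, bits: int) -> bool:
    """Cheap check for the receiver: one hash, compare against the target."""
    digest = hashlib.sha256(f"{resource}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits


def mint(resource: str, bits: int) -> int:
    """Costly search for the sender: try nonces until one meets the target.

    Expected work is roughly 2**bits hash evaluations.
    """
    for nonce in count():
        if verify(resource, nonce, bits):
            return nonce


# Hypothetical usage: the sender "pays" for a message to this address.
nonce = mint("alice@example.com", 12)
print(nonce, verify("alice@example.com", nonce, 12))
```

Raising `bits` by one doubles the sender's expected work while the receiver's cost stays at one hash, which is the tradeoff the thread describes as "more CPU-intensive".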

LLM Training Data and Copyright

A substantial discussion thread emerges concerning the ethical and legal implications of Large Language Models (LLMs) training on vast datasets, including copyrighted material. This touches on fair use, the necessity of comprehensive data for AI development, and potential societal benefits.

  • "Openai need to train their models based on these books, not stackoverflow or reddit."
  • "They do: ... The tweet only names Meta, but it would be very surprising if OpenAI didn't do the same thing."
  • "Anyone who doesn't train on all material available, legal or otherwise, will be outcompeted by teams that do, including those based in countries that don't respect Western copyright law. It's that simple. Either this practice is judged (or legislated) to be fair use, or copyright is done. It's also that simple."
  • "> Authors and rights holders are supposed to just take it? Copyright law exists for a reason. Trying to improve an LLM doesn't give you the right to flout our legal system. Yes, other countries might have an advantage in LLM training as a result but so be it."
  • "> If it's judged as fair use, then yes. And then it's not flouting anything. Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim. For example, nonfiction authors already "just take it" when reviews describe the main points of their book without paying them a cent. The justification is that it's for the greater good, and rights are limited."
  • "How do you think masked language models work?"
  • "Judges have recently ruled [1] that training on legally obtained materials constitutes fair use, but we will have to see in the long term if that ruling holds up."
  • ">the whole point of fair use is to benefit society I'll stop you right there - I really don't think that applies at all. Does 'society' really benefit when the whole thing is a funnel for enormous amounts of wealth to go to already-gigantic companies like Microsoft?"
  • "It seems like it could conceivably be fair in some sense, as long as the models were actually released as open-weights (for the benefit of society)."
  • "I'm not convinced that LLMs and other AI models need to train on all material available. A representative sample is better. I'll ignore the legality aspects in my response. I think coming up with a representative sample of all relevant information would be better in the long term (teams will not be outcompeted on long time horizons). Why don't the companies do this? Because it is easier to just "carpet bomb the parameter space" and worry about the potential confounding [1] and sampling bias [2] later. Coming up with a representative sample requires domain expertise and that is expensive in terms of time and money. But it reduces the total amount of training data and should reduce the amount of time and resources it takes to build the models. That may matter now that models are quite large. This is definitely a design decision with tradeoffs on both sides. I can entertain the notion that we don't have time to sample things, but I think we are all too often dismissing the long-term benefits of proper sampling. (In terms of the legality aspects, judges are trying to "split the baby" [3] in my opinion by saying that training on stuff you got legally is OK but training on pirated material isn't. So nobody is going to recommend training on pirated material in the first place.)"
  • "Quality. The transformable value in all data is not equal."
  • "They do, don't they? I think OpenAI uses libgen. Meta managed to get into a private ebook torrent tracker called Bibliotik a few years ago to use for training Llama and the resulting publicity essentially killed the tracker."

The Value and Mission of Anna's Archive

There's a strong sentiment of appreciation for Anna's Archive as a crucial resource, coupled with discussions about its operational model, sustainability, and the dedication of its team.

  • "Anna's archives is possibly the greatest site ever. Infinite love to the team <3"
  • "Kind of... the fact that they have the actual data behind a "soft" paywall (waiting times and terribly slow transfers otherwise) makes me a bit skeptical of their "goodwill"."
  • "Bandwidth isn’t free of charge"
  • "and hosting"
  • "Their backdoor plan to get rich! Not going to fool me this time VCs!!"
  • "Everyone involved is taking on significant personal liability and hosting expenses. Not sure what more you expect."
  • "Yes spot on, crazy that asking for an optional pittance for less bandwidth throttling on such a huge and risky project can be seen as exploitative."
  • "No such thing as free when bandwidth costs money. Any service online that is handing out things for free without restriction is getting their return through unscrupulous means and shouldn't be trusted. Anna's Archive straddles the line enough to allow people to download books for free but not at too great an expense to the volunteers who pay out of pocket to support the project."
  • "So what about the authors and creators of the works? They did it for free?"
  • "Please remain up. Libgen no longer works. I've used IRC for fiction and non-fiction but tech books needs Anna's Archive and Libgen. I buy the physical with company budget to pay the author but I need DRM free ebooks to read comfortably on my Tab S9 Ultra."
  • "libgen is still there"
  • "Not accurate. You are probably looking at a site like https://libgen.ac/ which states clearly at the top: "Not a Part of Library Genesis. ex libgen.io, libgen.org" The real one has been down for a long time."
  • "fuck those guys, annas archive is one of the last good things about the internet."
  • "1. Information wants to be free. :-)"
  • "What I mean to say is: I have been disappointed by my heroes before."
  • "Personally I suspect Anna’s archive to be funded by Russia as a part of their "cold" war with the west. They are literally burning down giant commercial buildings in Europe. This seems like a no-brainer in comparison, in a risk vs benefit calculation."
  • "> Information should be free I'm sick and tired of this misquote; as it was merely an observation of trends, and was never meant to be a moral maxim or mandate. If you truly believe information needs to be free as a moral mandate, share your company's source code first."
  • "I see it as “everyone deserves respect”. No need to overanalyse it. It’s one of those few things in life that are simply true, no proof needed."
  • "I see it as "Carthage must be destroyed". No need to nitpick it. We must destroy Carthage."
  • "People can do good things and bad things simultaneously. Unless me supporting the good things directly enables also the bad things, I don't see a reason to throw out the good thing."
  • "What are social security numbers if not just another bit of information that wants to be free? Or perhaps you are saying that people that have an interest in the availability of particular information should have some control on that information's freedom..."
  • "About recent events. We are still alive and kicking. In recent weeks we’ve seen increased attacks on our mission. We are taking steps to harden our infrastructure and operational security. The work of securing humanity’s legacy is worth fighting for. Since we started in 2022, we have liberated tens of millions of books, scientific articles, magazines, newspapers, and more. These are now forever protected from destruction by natural disasters, wars, budget cuts, and other catastrophes, thanks to everyone who helps with torrenting. Anna’s Archive itself has organized some of the largest scrapes: we acquired tens of millions of files from IA Controlled Digital Lending, HathiTrust, DuXiu, and many more. We have also scraped and published the largest book metadata collections in history: WorldCat, Google Books, and others. With this we’ll be able to identify which books are still missing from our collections, and prioritize saving the rarest ones. Much thanks to all of our volunteers for making these projects happen. We’ve forged some incredible partnerships. We’ve partnered with two LibGen forks, STC/Nexus, Z-Library. We’ve secured tens of millions additional files through these partnerships. And they are helping the mission by mirroring our files. Unfortunately we have seen the disappearance of one of the LibGen forks. We don’t have further information about what happened there, but are saddened by this development. There is a new entrant: WeLib. They appear to have mirrored most of our collection, and use a fork of our codebase. We have copied some of their user interface improvements, and are grateful for that push. Sadly, we are not seeing them share any new collections, nor share their codebase improvements. Since they haven’t shown commitment to contributing back to the ecosystem, we advise extreme caution. We recommend not using them. In the meantime, we have some exciting projects in the works. We have hundreds of terabytes in new collections sitting on our servers, waiting to be processed. If you’re at all interested in helping out, feel free to check out our Volunteering and Donate pages. We run all of this on a minimal budget, so any help is greatly appreciated. Keep fighting."
  • "Kudos to the team behind this project! It looks like they have improved UI in last year. The crucial problem right now is to remain accessible or to survive. I have no idea how much effort is being put into it. I wonder is it possible to remain afloat despite all efforts to take them down?"
  • "There was a pretty major UI update in the past 2-5 days-ish. Apologies for the minor grumble, but on mobile I used to be able to browse search results much more effectively; the new design only fits ~4-5 results on a screen."
  • "I've been using WeLib since April and had a good experience so far"
  • "Why use them over annas archive?"
  • "cleaner interface"
  • "If efforts like this are to be sustainable in any lasting way, participants need to be cooperative, not parasitic. I agree with the Anna's Archive team, it serves no one to have one of these players in the space hoarding their own collections and not sharing them to other archiving projects, it makes the collection extremely vulnerable and at risk of becoming lost knowledge as time goes on."
  • "I disagree with how this is framed. shadow libraries thrive on decentralization, any other servers mirroring a collection is better than no mirrors at all"
  • "I'm not sure how you disagree with this. Decentralization relies on multiple copies in multiple places. The fact is that WeLib is not allowing other libraries like Anna's Archive to mirror or copy their exclusive collection, hence the recommendation not to use them. Otherwise, please explain how I am missing your point."
  • "No honour among thieves."
  • "> If efforts like this are to be sustainable in any lasting way, participants need to be cooperative, not parasitic. I agree with the Anna's Archive team, That's an odd combination."
  • "Also, they provide a torrents list that anyone can seed and be part of the long-term preservation."
  • "Shadow library maintainers deserve a Nobel prize for their contributions to humanity. Satoshi would be proud."
  • ""Anna’s Archive itself has organized some of the largest scrapes: we acquired tens of millions of files from IA Controlled Digital Lending" Not really helping in the big picture, here, guys."
  • "What is the future of service like these? More and more content will be AI generated, to some degree. And should thereby that content be aggregated?"

Questions of Legitimacy and Profitability

Skepticism is expressed regarding Anna's Archive's self-proclaimed "non-profit" status, especially given its illegal operations and reliance on donations and crypto. The discussion probes whether the project is truly a charitable endeavor or an enterprise for profit, with questions about financial transparency and the controllers' lifestyles.

  • "Can Anna's Archive claim to be a non-profit when it's effectively an illegal enterprise with unknown controllers? They are even offering decent bounties: ... Whoever is running it must be doing really well for themselves laundering all that crypto. Also interestingly they don't offer a tor onion service, while the admin is most certainly technically competent to administer one given that he no doubt uses tor to insulate himself from his enterprise and launder crypto. What is the reasoning for that?"
  • "Your comment seems like a non sequitur to me. Whether something is a "non-profit" has nothing to do with whether it receives or spends money. (See, e.g. the American Red Cross's ~$4B/yr budget.) It's about what it does with the money it has. Obviously, since Anna's Archive is breaking the law, it can't conform itself to the normal legal/regulatory system that governs non-profit organizations. It can certainly still claim to be acting in the spirit of a non-profit, and it's up to you to decide whether you trust that claim. Nobody's forcing you to give them money."
  • "The connotation of a non-profit is that it's being audited. It would be extremely silly to suggest otherwise."
  • "It may have that connotation to you, but in general (at least in the US) non-profit organizations are not required to have independent audits. There's also plenty of non-profit-by-some-definition organizations that never file a Form 1023, giving up some benefits of the 501(c)(3) regulations but in exchange being even less regulated."
  • "Audits have nothing to do with it; all entities are subject to audit. The primary difference between a non-profit and a for-profit is that a non-profit does not distribute profit to shareholders, including the founders."
  • "Audit or threat of audit is the mechanism of enforcement and that is all that ever matters."
  • "Is Cosa Nostra a non-profit? The question doesn't make sense. It's a category error. A non-profit is a corporate legal structure. An unregistered organization could be a cabal, a gang, a syndicate, a fellowship, a religion, a movement, a private club, or something else."
  • "At least in the US, claiming that you are a nonprofit implies that contributions are tax deductible. Claiming that you are a nonprofit when contributions are not tax deductible might be considered fraudulent."
  • "Given the amount of hosting and storage needed to sustain this project, nobody is getting rich off of donations. Not to mention the lifestyle tradeoffs that inevitably come with international fugitive status do not lend themselves to a very comfortable life. The usage of crypto is entirely one of necessity, as controlling information and knowledge is something powerful people have clear stakes in. Many countries wield their financial systems to hold or acquire power. Information and Knowledge is one form of such power. Everything points to the Anna's Archive team being passionate ideologues as opposed to some criminal enterprise focused on profit motives."
  • "The controller's freedom. If they didn't launder it they wouldn't be free."
  • "> They still offer an absolutely incredible free service Actually their free downloads aren't particularly good when compared to some of the other online services that 'leech' from them. And their torrent strategy could be altruistic but it could also be self interested. By spreading storage costs around and attracting more contributions. And providing insurance to hard drive seizures. What mainly interests me is how much money they are actually making, I suspect it's very profitable."
  • "> Given the amount of hosting and storage needed to sustain this project. Nobody is getting rich off of donations. They're getting donations as much as megaupload was getting donations for premium accounts... People pay for higher bandwidth and no wait time, not to support the "cause". It's a farce to qualify this of donations. And obviously people do get rich off of it, as you can see from the slew of file hosting services."
  • "Illegal doesn't at all have to mean immoral or particularly wrong either. Laws are complex constructions, often created for decidedly hypocritical reasons of benefitting some at the expense of others. Thus, who gives a shit if they're taking money from those who voluntarily subscribe. They still offer an absolutely incredible free service to who knows how many people who otherwise wouldn't be able to afford so much access to so much free information. Given the behavior of the pro-copyright business interests and legal bodies of the world, and the outright hypocrisy of openly creating one set of rules on content piracy for certain corporations while applying another, harsher rule system for those who aren't so nicely connected, smug moralizing about something like Annas Archive has little grounding. And aside from picking random crap out of your ass for smearing arbitrarily, what shred of evidence do you have of anyone there laundering crypto, and how?"
  • "Hmm, does not comply with age verification for eating disorders. Dangerous site for children. Also not compliant with data retention rules. I don't know, man, seems like it should be illegal in Europe and the UK. I will email to make sure it's on regulators' radar. Europeans would not make bad law."
  • "A pretty rich thing to say when your mission is piracy. I'm not against piracy at all, quite the contrary, but this is quite laughable."