This discussion touches on several key themes surrounding AI development, data usage, intellectual property, and the nature of online discourse.
Data Scraping and Terms of Service
A significant portion of the conversation revolves around the ethical and legal implications of scraping data, particularly from platforms like YouTube, for AI training. Users question whether such massive data collection violates YouTube's Terms of Service (ToS).
- "Does YouTube allow massive scraping like this in their ToS?" asked okdood64.
- mouse_ stated, "Probably not. Who cares at this point? No one is stopping ML sets from being primarily pirated."
- MaxPock expressed a strong stance against unauthorized data use: "They don't and neither do I allow my site - whose content I found on Gemini -scraped."
- perching_aix, quoting their "lawyer" (GPT-4o), brought up the legal argument that YouTube, as a non-exclusive licensee of user-uploaded content, may not be able to impose further restrictions through its ToS, and suggested that scraping publicly available data is not illegal in the US.
The Role and Impact of Copyright Law
The discussion delves into the historical enforcement of copyright and how the current AI boom might be challenging or dismantling these established norms. The ideas of a "plutocracy" and the "rule of law" are brought up in this context. The fate of Aaron Swartz is cited as an example of the harsh enforcement of copyright in the past.
- snickerdoodle12 lamented the perceived shift, stating, "But now, suddenly, violating copyright is totally okay and carries no consequences whatsoever because the billionaires decided that's how they can get richer now? If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all these developments should concern you."
- The user added, "...they've been strictly enforced for years now. Decades, even. They've ruined many lives, like how they killed Aaron Swartz."
- In response to the mention of Aaron Swartz, shadowgovt clarified, "Aaron Swartz died of suicide, not copyright. His death was a tragedy but it wasn't done to him."
- marcus_holmes offered a nuanced perspective on Swartz's death: "There's an English phrase "hounded to death", meaning that someone was pursued and hassled until they died. It doesn't specify the cause of death, but I think the assumption would be suicide, since you can't actually die of fatigue. I think that's what was done to Aaron Swartz."
- jagged-chisel questioned the framing of Swartz's death, remarking, "I can't imagine why you'd let the FBI off the hook."
Skepticism Towards AI Claims and Hype
A prominent theme is the skepticism and critique of overly optimistic or exaggerated claims made about new AI developments, particularly regarding the originality and impact of certain research or blog posts.
- hahaxdxd123 critically assessed a particular article, calling it an "Extremely oversold article." They questioned specific claims: "predict in representation space, not pixels" (which they say has been done since 2014) and "zero-shot generalization" (which they say other models beat). Regarding "accidentally solved robotics," they noted, "They're doing 65% success on very simple tasks." The user concluded, "The research is good. This article however misses a lot of other work in the literature. I would recommend you don't read it as an authoritative source."
- richard___ succinctly expressed doubt: "Solved??? Where?"
- Voloskaya also found the article problematic: "This article contains so many falsehoods and history rewrites that it's pretty painful to read."
- billstar pointed out a potential fabrication: "The research is very real but the blog post appears to be very fake."
The Influence of LLMs on Writing Style and Authenticity
Several users discuss the telltale signs of AI-generated text or text heavily influenced by LLMs, highlighting how it can obscure authenticity and make it difficult to discern genuine human writing.
- rozab noted, "My first thought upon reading this was that an LLM had been instructed to add a pithy meme joke to each paragraph. They don't make sense in context..." They also pointed out inconsistencies in data cited in a blog post, suggesting a lack of careful drafting.
- isoprophlex described a similar experience: "I hate that the writing triggers my "this is heavily written by an llm" sense, but because it's all in lowercase and written in a kind of humorous laissez-faire lingo, i can't smugly point at the usual telltales." They then identified specific LLM traits in the writing, including pun usage, sentence structure, and "emoji diarrhea." The user concluded, "The only thing I can offer: the author admits to using claude..."
- pton_xd agreed that the article was "very obviously LLM generated" and provided a ChatGPT prompt that could produce a similar output.
- esjeon noted that a username used in the discussion was a derogatory term, a pattern they had previously seen on LLM-generated spam accounts.
- jcrawfordor commented on a user's post, remarking, "It has the same "compliment the OP, elaborate, raise a further question" format I've seen used by apparently LLM-generated spam accounts on HN."
The Nature of Generality vs. Specialization in AI and Robotics
A segment of the discussion contrasts generalized AI approaches (like LLMs) with specialized, purpose-built systems, particularly in the context of robotics and task execution.
- contingencies argued against the broad pursuit of generalized AI, stating, "I think the number of people on the humanoid bandwagon trying to implement generalized applications is staggering right now. The physics tells you they will never be as fast as purpose-built devices, nor as small, nor as cheap."
- foobarian posited a counterpoint: "I wonder if a generalized machine would have an advantage from scale, and then putting all the specialized stuff into software. We have seen this play out before."
- jjangkke echoed the sentiment that generalized solutions can be inefficient, stating, "We made a sandwich but it cost you 10x more than it would a human and slower..." They identified the "last mile problem" as a key barrier for generalized AI in mimicking human capabilities.
- jes5199 used an analogy: "a CPU is more expensive, more complicated, more energy demanding than custom made circuitry, in most cases."
Political and Administrative Influence on AI Regulation
The discussion briefly touches upon the political landscape and how government actions might influence AI development and regulation, particularly concerning copyright.
- perching_aix asked for elaboration on the idea of "the current power" dismantling copyright for AI.
- bgwalter responded by mentioning the firing of the head of the US Copyright Office by Trump and a "Big Beautiful Bill" that allegedly prohibits state "AI" legislation, linking to an article detailing these events. They also mentioned Trump's "Crypto and AI czar."
Subcultures and Internet Slang
The usage of internet slang and the subcultures from which it originates also briefly surfaces.
- throwaway198846 inquired about the term "ngmi," asking in which subculture it is common.
- mensetmanusman identified it as being common in "Ivy League hacker subculture 15 years ago."
- spencerflem stated that their "college friend group uses it occasionally."
- panarky linked its usage to "crypto hype threads."