Essential insights from Hacker News discussions

The Bitter Lesson Is Misunderstood

Here's a summary of the themes from the Hacker News discussion:

Data Quality is Paramount, Not Just Quantity

A central theme is the critical importance of clean, accurate, and well-structured data for AI success, particularly with LLMs. Many users express skepticism that simply throwing more data or compute at problems will solve fundamental issues arising from "garbage" data.

  • FloorEgg highlights the problem of human-generated data being "riddled with human errors," including "fundamental errors in critical thinking, reasoning, semantic and pragmatic oversights." They state, "The problem I am facing in my domain is that all of the data is human generated and riddled with human errors."
  • stego-tech echoes this sentiment, emphasizing that for "any chance of success, your data has to be pristine and fresh, otherwise you’re lighting money on fire." They also directly state, "it’s the data, stupid."
  • PLenz adds a cautionary note from personal experience, stating that even in domains like web advertising and GIS, data is "only slightly better than the median industry and in the case of the geodata from apps I'd say it's far, far, far worse."
  • incompatible warns about biases in human-created data: "When studying human-created data, you always need to be aware of these factors, including bias from doctrines, such as religion, older information becoming superseded, outright lies and misinformation, fiction, etc. You can't just swallow it all uncritically."
  • kbenson suggests a potential method for improving data quality: "Essentially you're using the AI to provide a moderated and carefully curated set of information about the information that was already present. If you then ingest this information, does that increase the quality of the data?" (A minimal sketch of such a curation pass appears after this list.)
  • geetee questions the need for more data, asking, "Assuming we've already digitized every book, magazine, research paper, newspaper, and other forms of media, why do we need this 'second internet'?" and later, "What does synthetic training data actually mean? Just saying the same things in different ways? It seems like we're training in a way that's just not sustainable."
  • brazzy summarizes this point: "The point is that current methods are unable to get more than the current state-of-the-art models' degree of intelligence out of training on the totality of human knowledge. Previously, the amount of compute needed to process that much data was a limit, but not anymore. So now, in order to progress further, we either have to improve the methods, or synthetically generate more training data, or both."
  • back2dafucha offers an old adage: "About 28 years ago a wise person said to me: 'Data will kill you' Even mainframe programmers knew it."
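
kbenson's suggestion amounts to a curation pass: run a model over the existing corpus, emit structured "information about the information," and then filter or re-ingest the data based on those annotations. The sketch below is a minimal, hypothetical version of that idea; critique_record stands in for an LLM call, and its heuristic body and field names are illustrative assumptions, not anything specified in the thread.

```python
"""Hypothetical sketch of an AI-driven curation pass over an existing corpus."""
from dataclasses import dataclass, asdict
import json

@dataclass
class Critique:
    record_id: int
    issues: list[str]        # e.g. ["unsupported_absolute_claim"]
    confidence: float        # curator's trust in the source record
    keep_for_training: bool

def critique_record(record_id: int, text: str) -> Critique:
    """Stand-in for an LLM prompt such as: 'List factual, logical, or semantic
    problems in this passage and rate how trustworthy it is as training data.'
    The heuristics below only exist to keep the sketch runnable."""
    issues = []
    if len(text.split()) < 5:
        issues.append("too_short_to_verify")
    if "guaranteed" in text.lower():
        issues.append("unsupported_absolute_claim")
    confidence = 0.9 if not issues else 0.4
    return Critique(record_id, issues, confidence, confidence >= 0.5)

def curate(corpus: list[str]) -> list[dict]:
    """Produce a curated metadata layer over the raw corpus."""
    return [asdict(critique_record(i, text)) for i, text in enumerate(corpus)]

if __name__ == "__main__":
    corpus = [
        "Water boils at 100 C at sea-level pressure.",
        "This product is guaranteed to double your revenue.",
        "ok",
    ]
    print(json.dumps(curate(corpus), indent=2))
```

Whether re-ingesting such annotations actually raises data quality is exactly the open question kbenson poses; the sketch only shows where the annotations would come from.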

Focusing on User Intent and "Jobs to Be Done" for Practical Solutions

Several contributors, led by FloorEgg, highlight the success of approaches that focus on understanding and fulfilling user intent rather than simply trying to replicate flawed historical data. This involves building user-friendly interfaces that allow users to articulate what they want done, and then architecting systems to achieve that.

  • FloorEgg describes their successful approach: "Instead I developed a UX that made it as easy as possible for people to explain what they want to be done, and a system that then goes and does that." They contrast this with the common, less effective suggestion of dumping historical data into a model: "Instead of trying to train a model on all the history of these inputs and outputs, the solution was a combination of goal->job->task breakdown (like a fixed agentic process), and lots of context and prompt engineering."
  • FloorEgg further clarifies the difference: "This is the difference between making what people actually want and what they say they want: it's untangling the why from the how."
  • In a more detailed explanation of their system, FloorEgg outlines the process: "Instead of training a model on the historical inputs and outputs, the solution was to use the best base model LLMs, a pre-determined agentic flow, thoughtful system prompt and context engineering, and an iterative testing process with a human in the loop (me) to refine the overall system by carefully comparing the variances between system outputs and historical customer input/output samples." (A schematic sketch of such a flow appears after this list.)
  • FloorEgg contrasts this with basic chat interfaces: "(and a basic chat interface like ChatGPT doesn't work for these types of problems, no matter how smart it gets, for a variety of reasons)."
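
FloorEgg's description maps onto a fixed pipeline rather than a trained model. The sketch below is a hypothetical, schematic version of such a goal -> job -> task flow: call_llm is a placeholder for a base-model API call, and the prompts, decomposition, and review step are illustrative assumptions, not FloorEgg's actual system.

```python
"""Schematic goal -> job -> task agentic flow with a human-in-the-loop review step."""
from dataclasses import dataclass, field

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a base-model chat/completion call."""
    return f"[model output for: {user_prompt[:40]}...]"

@dataclass
class Task:
    name: str
    prompt_template: str
    output: str = ""

@dataclass
class Job:
    name: str
    tasks: list[Task] = field(default_factory=list)

def run_goal(user_intent: str, jobs: list[Job], context: str) -> list[Job]:
    """Fixed decomposition: every job and task is predetermined; only the
    user's stated intent and the curated context vary per run."""
    for job in jobs:
        for task in job.tasks:
            system = ("You are one step in a fixed workflow. "
                      f"Overall goal: {user_intent}\nRelevant context: {context}")
            task.output = call_llm(system, task.prompt_template)
    return jobs

def review(jobs: list[Job], historical_sample: str) -> None:
    """Human-in-the-loop step: surface variances against a historical sample."""
    for job in jobs:
        for task in job.tasks:
            print(f"{job.name}/{task.name}:\n  model: {task.output}\n"
                  f"  reference: {historical_sample}\n")

if __name__ == "__main__":
    jobs = [Job("draft_report", [Task("outline", "Produce a section outline."),
                                 Task("write", "Write each section from the outline.")])]
    done = run_goal("Summarize Q3 incident reports for the safety board",
                    jobs, context="incident database extract")
    review(done, historical_sample="Q2 report approved by the safety board")
```

The design choice FloorEgg emphasizes is that the structure is engineered around the user's stated intent, with historical data used only as a reference for manual comparison, not as training material.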

Rethinking Scaling Laws and the Limits of Current Architectures

A significant portion of the discussion revolves around the idea that current transformer architectures may be hitting scaling limits, particularly concerning the availability of training data. This leads to a debate about the need for new architectures, synthetic data generation, and alternative approaches like reinforcement learning and real-world interaction.

  • cs702 articulates the scaling argument: "We're reaching scaling limits with transformers. The number of parameters in our largest transformers, N, is now in the order of trillions, which is the most we can apply given the total number of tokens of training data available worldwide, D, also in the order of trillions, resulting in a compute budget C = 6N × D, which is in the order of D². ... We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship. As the OP puts it, if we want to increase the number of GPUs by 2x, we must also increase the number of parameters and training tokens by 1.41x, but... we've already run out of training tokens." (A worked version of this arithmetic appears after this list.)
  • cs702 proposes solutions: "We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams)."
  • FloorEgg suggests a third option: "What about or (3) models that interact with the real world?"
  • charleshn disagrees with the premise that we cannot add more compute, pointing to examples: "See e.g. AlphaZero [0] that's 8 years old at this point, and any modern RL training using synthetic data, e.g. DeepSeek-R1-Zero [1]."
  • jeremyjh questions the applicability of AlphaZero-like methods to language: "AlphaZero trained itself through chess games that it played with itself. Chess positions have something very close to an objective truth about the evaluation, the rules are clear and bounded. Winning is measurable. How do you achieve this for a language model? ... Distillation does not produce new data in the same way that chess games produce new positions. We need more raw information."
  • voxic11 provides an example of synthetic data in other domains: "Synthetic data is already widely used to do training in the programming and mathematics domains where automated verification is possible. Here is an example of an open source verified reasoning synthetic dataset..."
  • charleshn defends synthetic data generation for specific domains: "Works really well for maths, algorithms, and many things actually. ... That's why we have IMO gold level models now, and I'm pretty confident we'll have superhuman mathematics, algorithmic etc models before long."
  • scotty79 suggests turning language into a game: "You make models talk to each other, create puzzles for each other to solve, ask each other to make cases and evaluate how well they were made."
  • kushalc emphasizes the need for specific kinds of data for scaling bottlenecks: "it's not about more data in general, it's about the specific kind of data (or architecture) we need to break a specific bottleneck. For instance, real-world data is indeed verifiable and will be amazing for robotics, etc."
  • paulsutter identifies physical simulation as an underutilized data source: "Physical simulation is the most important underutilized data source. It’s very large, but also finite. And once you’ve learned the complexity of reality you won’t need more data, you’ll be done."
  • madrox offers a pragmatic view on investing in AI: "if you're not in academia you're not trying to solve that problem; you're trying to get your bag before it happens. ... this is yet another incarnation of the hungry beast."
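
cs702's numbers follow from the widely used training-compute approximation C ≈ 6·N·D together with compute-optimal scaling that keeps N roughly proportional to D. The short calculation below is a back-of-the-envelope check of the "2x compute needs 1.41x parameters and tokens" claim; it is added here for clarity and is not part of the original thread.

```python
"""Back-of-the-envelope check of the scaling arithmetic quoted above,
assuming C = 6*N*D and that N and D grow by the same factor."""
import math

def required_growth(compute_multiplier: float) -> float:
    """If C = 6*N*D and N, D both grow by factor k, then
    (k*N)*(k*D) = compute_multiplier * N*D, so k = sqrt(compute_multiplier)."""
    return math.sqrt(compute_multiplier)

k = required_growth(2.0)
print(f"2x compute -> {k:.2f}x parameters and {k:.2f}x training tokens")
# Output: 2x compute -> 1.41x parameters and 1.41x training tokens
#
# With N and D both in the trillions and N roughly proportional to D,
# C = 6*N*D is on the order of D**2, which is the sense in which adding
# GPUs demands more training tokens than, per the thread, currently exist.
```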

The Potential and Challenges of Embodied AI and Real-World Interaction

The discussion touches upon the future of AI, with a strong undercurrent about embodied AI and the difficulty of achieving human-level robotics, which commenters contrast with the rapid progress of language models.

  • FloorEgg's initial question about models interacting with the real world sparks this thread.
  • jvanderbot questions the data rate: "Play in the real world generates a data point every few minutes. Seems a bit slow?" to which FloorEgg responds by asking what counts as a "data point."
  • pizzly counters that real-world experience is "multi-modal through vision, sound, touch, pressure, muscle feedback, gravitational, etc. It's extremely rich in data. It's also not a data point, it's a continuous stream of information."
  • mannykannot offers a counterpoint, noting that embodied experience is limited compared to accumulated written knowledge.
  • tliltocatl highlights the expense: "Interacting with the real world is expensive. It's the most expensive thing of all."
  • FloorEgg provides a quantitative comparison: "Living cells are ~4-5 orders of magnitude more functional-information-dense than the most advanced chips..." This leads to a sub-discussion about information density calculations.
  • FloorEgg later ties this to "jobs to be done": "So maybe we have to master LLMs and probably a whole other paradigm before robots can really be general purpose and useful."
  • FloorEgg speculates on the difficulty gap: "So it makes me wonder, is embodiment (advanced robotics) 1000x harder than LLMs from an information processing perspective?"
  • kushalc attributes this difficulty to the "curse of dimensionality" in the real world: "Given N degrees of true freedom, you need O(exp(N)) data points to achieve the same performance." (A small illustration of this growth appears after this list.)
  • benlivengood still sees potential in untapped data sources: "I don't think anyone has yet trained on all videos on the Internet. Plenty of petabytes left there to pretrain on..."
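
To make kushalc's O(exp(N)) claim concrete: covering each of N independent degrees of freedom at a fixed resolution requires a number of samples that grows exponentially in N. The resolutions and dimension counts below are arbitrary illustrative choices, not figures from the thread.

```python
"""Illustration of exponential sample growth with degrees of freedom."""
def coverage_samples(dimensions: int, bins_per_dimension: int = 10) -> int:
    """Grid points needed to observe every combination of discretized values."""
    return bins_per_dimension ** dimensions

for n in (3, 6, 12, 24):
    print(f"{n:>2} degrees of freedom -> {coverage_samples(n):.3e} samples")
# 1.000e+03, 1.000e+06, 1.000e+12, 1.000e+24 samples respectively: one way to
# read FloorEgg's question about embodiment being orders of magnitude harder
# than LLMs from an information-processing perspective.
```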

Mathematical and Technical Notation as a Barrier

One user brings up the issue of how complex and implicit mathematical notation can create barriers to understanding, comparing it to technical jargon in computer science.

  • Quarrelsome expresses frustration with mathematical notation: "I can look up the symbol for 'roughly equals', that was super cool and is a great part of curiosity. But this implied multiplication between the 6 and the N combined with using a fucking diamond symbol (that I already despise given how long it took me to figure the first time I encountered it) is just gross." They also draw a parallel to CS jargon like "CQRS": "Which results in thousands of newbies trying to parse the unfathomable complexity of 'Command Query Responsibility Segregation' and spending as much time staring at its opaqueness as I did the opening sentence of the wikipedia article on logarithms."
  • ghkbrew offers a technical explanation for some notation: "I assume they use N⋅D rather than ND to make it explicit these are 2 different variables. That's not necessary for 6N because variable names don't start with a number by convention."