Here's a breakdown of the major themes and sentiments in the Hacker News discussion, supported by direct quotes.
Excitement and Curiosity
Some users express general interest in the potential of the research, even if they don't fully understand the technical details.
- zoklet-enjoyer: "I don't know what those words mean, but I am excited for the possibilities."
Explanation of the Problem
The discussion clarifies the problem that the paper addresses: the quadratic computational cost of attention mechanisms in LLMs, particularly concerning long context lengths.
- PaulHoule: "LLMs can look back over a certain number (N) of tokens, which roughly correspond to words. For instance if you want to summarize or answer questions about a document accurately the length of the document has to be less than N. Conventionally they use an attention mechanism that compares every token to every other token which has a cost of N*N or N squared which is quadratic. If you want LLMs to chew over a huge amount of context (all the source code for your project) it’s a problem so people are looking for ways around this."
- rybosome: "Adding to that excellent high level explanation of what the attention mechanism is, I’d add (from my reading of the abstract of this paper); This work builds a model that has the ability to “remember” parts of its previous input when generating and processing new input, and has part of its intelligence devoted to determining what is relevant to remember. This is in lieu of kind of saying “I need to keep re-reading what I’ve already read and said to keep going”."
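To make the quadratic cost PaulHoule describes concrete, here is a minimal single-head attention sketch in NumPy. The shapes, sizes, and the absence of masking or multiple heads are illustrative assumptions, not details from the paper; the point is only that the score matrix compares every token with every other token, so its size grows as N squared.

```python
# Minimal single-head attention sketch (NumPy), illustrating the N x N cost
# described in the discussion. All names and sizes here are illustrative.
import numpy as np

def attention(q, k, v):
    """q, k, v: (N, d) arrays for a sequence of N tokens."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (N, N): every token vs. every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (N, d) output

N, d = 4096, 128
q = k = v = np.random.randn(N, d).astype(np.float32)
out = attention(q, k, v)
print(f"score matrix entries: {N * N:,}")            # 16,777,216 for N = 4096
```

Doubling N from 4,096 to 8,192 quadruples that score matrix to roughly 67 million entries, which is exactly the scaling the approaches discussed here try to avoid.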
Skepticism About the Paper's Practicality and Evaluation
Several discussants express reservations about the practical significance and completeness of the paper's experimental validation. They criticize the limited scope of the experiments and lack of performance metrics beyond perplexity.
- imranq: "I like the idea of removing quadratic scaling for attention, but this paper has thin experimental support. No real tasks tested beyond perplexity. Nothing on reasoning, retrieval QA, or summarization quality. Even in perplexity the gains are marginal. However, it removes attention, so I think it's worth watching that space of non-attention models."
- albertzeyer: "Also, this paper comes very short in experiments. There is basically only table 2. There is no study on length extrapolation (which is very relevant for the topic), or needle-in-haystack experiments, or scaling studies, any larger scale experiments, etc. Also, even in this main table 2, I see a couple of typos. And looking at the results in table 2, the improvements seems to be quite minor. So I would conclude, this needs a lot more work."
Concerns About the Paper's Clarity and Accuracy
Some users find the paper poorly written, unfocused, and potentially inaccurate, particularly in how it characterizes existing LLMs such as DeepSeek.
- yorwba: "This paper seems rather unfocused, explaining their architecture three times with slight variations while managing to omit crucial details like how exactly they compute gradients for their "External Retrieval Memory." Also, the section on DeepSeek is really weird: "While the precise architectural details of DeepSeek LLM are still emerging, early discussions suggest that it relies on an extended Transformer backbone or a "hybrid" approach that likely incorporates some form of attention-based mechanism, potentially at specific layers or across chunk boundaries, to facilitate information flow across large contexts." It makes it sound like a mystery, even though there have been multiple papers published on it (they cite the R1 one) so that there's really no need to guess whether attention is involved. Overall I'm not convinced the authors know what they're doing."
- maxrmk: "> While the specific internal workings of DeepSeek LLM are still being elucidated, it appears to maintain or approximate the self-attention paradigm to some extent. Totally nonsensical. Deepseeks architecture is well documented, multiple implementations are available online."
Discussion of Context Window Sizes and Memory Considerations
The conversation touches on the actual context window sizes of current LLMs, both proprietary and open-source, and the memory limitations associated with attention mechanisms and batched inference.
- albertzeyer: ""hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs."
- cubefox: "> "hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs. Yes, but those are all relying on proprietary company secrets, while this is an open research paper. Besides, only Gemini so far has a context window of more than a million tokens."
- littlestymaar: "Llama 4 Scout has it also, and is an open weight LLM, unfortunately it is also disappointing at pretty much any context length…"
- boroboro4: "> Also note, if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of the self-attention is really not such a big issue - the matrix multiplication in the feed-forward layers will be usually 8x the model dimension squared, and thus that part will usually dominate. This is incorrect in case of batched inference. There are two bottlenecks at play: compute and memory, and your reasoning applies to compute. In case of memory it gets trickier: for MLP layers you’ll need to read same set of weights for all elements of your batch, while for kv cache for attention elements will be different. That’s why in practice the real length where attention dominates would be closer to model dimension / batch size, rather than just model dimension. And this number isn’t as high anymore."
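boroboro4's compute-versus-memory point can be illustrated with a back-of-envelope estimate of bytes read per decoding step. The model size, layer count, batch sizes, and fp16 assumption below are hypothetical round numbers, not measurements of any real system; the sketch only shows why the crossover length shrinks roughly in proportion to the batch size.

```python
# Back-of-envelope bytes read per decode step, illustrating the batched-inference
# argument. All numbers (d_model, layers, batch, fp16) are illustrative assumptions.

def bytes_read_per_step(d_model, n_layers, batch, seq_len, bytes_per_elem=2):
    # MLP weights (~8 * d^2 per layer) are read once and shared by the whole batch.
    mlp_weight_bytes = 8 * d_model**2 * n_layers * bytes_per_elem
    # The KV cache (~2 * seq_len * d per layer) is read separately for every sequence.
    kv_cache_bytes = 2 * seq_len * d_model * n_layers * batch * bytes_per_elem
    return mlp_weight_bytes, kv_cache_bytes

d, layers = 4096, 32
for batch in (1, 32):
    for seq_len in (4_096, 16_384, 131_072):
        mlp, kv = bytes_read_per_step(d, layers, batch, seq_len)
        print(f"batch={batch:3d} seq={seq_len:7d}  MLP={mlp/1e9:7.1f} GB  KV={kv/1e9:7.1f} GB")

# With batch=1 the KV-cache traffic only matches the MLP weights around
# seq_len ~ 4*d (about 16k here); with batch=32 that crossover shrinks ~32x
# (to roughly 512 tokens), which is the "model dimension / batch size" scale
# boroboro4 describes, up to a small constant factor.
```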
Sustainable Pricing Models
A question is raised about whether per-token pricing is sustainable when attention cost scales quadratically with context length.
- daxfohl: "Partially related, is charging by token sustainable for LLM shops? If the compute requirements go up quadratically, doesn't that mean cost should as well?"
- sakras: "Typically requests are binned by context length so that they can be batched together. So you might have a 10k bin and a 50k bin and a 500k bin, and then you drop context past 500k. So the costs are fixed per-bin. "
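A rough sketch of the binning idea sakras describes: the 10k/50k/500k figures come from the comment itself, but the routing logic below is an assumption for illustration, not any provider's actual serving code.

```python
# Hypothetical sketch of per-bin batching: route each request to the smallest
# context bin that fits it, and truncate anything beyond the largest bin.
# Bin sizes are taken from the comment; everything else is made up.

BINS = [10_000, 50_000, 500_000]           # max context tokens per bin

def assign_bin(context_tokens: int) -> int:
    """Return the bin a request is served from; context past the largest bin is dropped."""
    for bin_size in BINS:
        if context_tokens <= bin_size:
            return bin_size
    return BINS[-1]

for n in (3_000, 42_000, 180_000, 900_000):
    print(f"{n:>7} tokens -> {assign_bin(n):>7}-token bin")

# Requests in the same bin can be batched together and priced at that bin's
# fixed worst-case cost, even though the underlying attention cost still grows
# with the actual context length inside the bin.
```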