The Hacker News discussion revolves around several key themes concerning the economics and evolution of Large Language Model (LLM) APIs, particularly in light of recent pricing changes by providers like Google.
The Disconnect Between Compute Cost and API Pricing
A central theme is the mismatch between the underlying, often non-linear (quadratic) compute costs of LLMs and the linear pricing models offered to API consumers. The article's "traffic analogy" is seen as a good way to frame this.
- cmogni1 notes: "The article does a great job of highlighting the core disconnect in the LLM API economy: linear pricing for a service with non-linear, quadratic compute costs. The traffic analogy is an excellent framing."
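As a rough illustration of that disconnect, the sketch below compares how billed input cost and attention compute grow with prompt length. The price is a made-up placeholder and the FLOP model is deliberately simplified; neither reflects any provider's actual numbers.

```python
# Sketch: API revenue grows linearly with prompt length, while the attention
# portion of compute grows roughly quadratically. The price below is a
# placeholder, not any provider's real rate.

PRICE_PER_MILLION_INPUT_TOKENS = 0.30  # hypothetical $/1M input tokens

def billed_input_cost(n_tokens: int) -> float:
    """Linear: what the customer pays for the prompt."""
    return n_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

def relative_attention_work(n_tokens: int) -> float:
    """Quadratic: each token attends to every earlier token, so total
    pairwise work grows roughly as n^2 (constant factors ignored)."""
    return n_tokens ** 2

for n in (1_000, 10_000, 100_000):
    price_x = billed_input_cost(n) / billed_input_cost(1_000)
    work_x = relative_attention_work(n) / relative_attention_work(1_000)
    print(f"{n:>7} tokens: price x{price_x:,.0f}, attention work x{work_x:,.0f}")
```

Going from a 1k-token to a 100k-token prompt multiplies the bill by 100 but the attention work by roughly 10,000, which is the gap the article is pointing at.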
The Impact of KV Cache on Performance and Cost
The efficiency of the KV cache, particularly during the decoding phase, is identified as a significant bottleneck and a driver of costs, especially for longer context windows. This impacts memory bandwidth and GPU VRAM usage.
- cmogni1 explains: "The real bottleneck, however, is the KV cache during the decode phase. For each new token generated, the model must access the intermediate state of all previous tokens. This state is held in the KV Cache, which grows linearly with sequence length and consumes an enormous amount of expensive GPU VRAM. The speed of generating a response is therefore more limited by memory bandwidth."
- cmogni1 further hypothesizes about Google's pricing: "Viewed this way, Google's 2x price hike on input tokens is probably related to the KV Cache, which supports the article’s “workload shape” hypothesis. A long input prompt creates a huge memory footprint that must be held for the entire generation, even if the output is short."
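To put rough numbers on cmogni1's point, here is a back-of-the-envelope estimate of KV-cache size versus sequence length. The model shape (80 layers, 8 KV heads, head dimension 128, fp16 cache) is an assumed 70B-class configuration for illustration, not any specific production model.

```python
# Back-of-the-envelope KV-cache size. It grows linearly with sequence length
# and must sit in GPU VRAM for the entire generation. The model shape below
# is an assumed 70B-class configuration, not a specific provider's model.

NUM_LAYERS = 80       # assumed
NUM_KV_HEADS = 8      # assumed (grouped-query attention)
HEAD_DIM = 128        # assumed
BYTES_PER_VALUE = 2   # fp16/bf16 cache entries

def kv_cache_gib(seq_len: int) -> float:
    # 2x for the K and V tensors; one entry per layer, KV head, and position.
    total_bytes = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * seq_len
    return total_bytes / 2**30

for seq_len in (8_000, 32_000, 128_000):
    print(f"{seq_len:>7}-token context -> ~{kv_cache_gib(seq_len):.1f} GiB of KV cache per request")
```

Under these assumptions a single 128k-token request ties up tens of GiB of VRAM for as long as it is decoding, which is why long prompts are expensive to serve even when the output is short.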
Nuances in Pricing Models: "Thinking" vs. "Non-Thinking" Modes and Tiered Pricing
The discussion points out that pricing structures are not always a simple linear cost per token. The introduction and later removal of separate pricing tiers, such as "thinking" and "non-thinking" modes, have produced effective price hikes for some users. Some models also charge tiered rates based on input sequence length (a worked example follows the quotes below).
- simonw clarifies Google's pricing changes: "It's not quite that simple. Gemini 2.5 Flash previously had two prices, depending on if you enabled "thinking" mode or not. The new 2.5 Flash has just a single price, which is a lot more if you were using the non-thinking mode and may be slightly less for thinking mode."
- simonw also notes variations in pricing structures: "That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly not Gemini 2.5 Flash) charges a higher rate for inputs over 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000 tokens."
- simonw also notes: "Gemini Flash 2.5 and Gemini 2.5 Flash Preview were presumably a whole lot more similar to each other."
- sharkjacobs points to a precedent for this kind of adjustment: "Arguably that was Haiku 3.5 in October 2024. I think the same hypothesis could apply though, that you price your model expecting a certain average input size, and then adjust price up to accommodate the reality that people use that cheapest model when they want to throw as much as they can into the context."
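To make the tiered-pricing point concrete, here is a minimal sketch of length-tiered input billing of the kind simonw describes. The rates are placeholders, not Google's actual prices, and providers differ on whether the higher tier reprices the whole prompt or only the excess.

```python
# Sketch of length-tiered input pricing. Rates are placeholders for
# illustration only, not Google's actual Gemini prices.

BASE_RATE = 1.25     # hypothetical $/1M input tokens at or below the threshold
LONG_RATE = 2.50     # hypothetical $/1M input tokens above the threshold
THRESHOLD = 200_000  # tokens

def input_cost(prompt_tokens: int) -> float:
    # This sketch reprices the whole prompt at the higher rate once the
    # threshold is crossed; some schemes instead charge the higher rate
    # only on the excess tokens.
    rate = LONG_RATE if prompt_tokens > THRESHOLD else BASE_RATE
    return prompt_tokens / 1e6 * rate

print(f"150k-token prompt: ${input_cost(150_000):.2f}")
print(f"400k-token prompt: ${input_cost(400_000):.2f}")
```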
The Role of Customer Usage Patterns in Pricing
Several commenters converge on the view that customer usage patterns, rather than underlying technology improvements alone, are increasingly driving API pricing. If so, price reductions may have hit a "soft floor" for now.
- sethkim observes: "Both great points, but more or less speak to the same root cause - customer usage patterns are becoming more of a driver for pricing than underlying technology improvements. If so, we likely have hit a 'soft' floor for now on pricing."
Future of LLM Pricing: Gradual Improvement vs. Exponential Drops
While there's agreement that LLM prices will continue to fall, the magnitude of future drops is debated. Some expect the order-of-magnitude year-over-year reductions seen so far to moderate, meaning the cost of building and scaling AI applications is unlikely to approach "free" in the near future.
- simonw expresses optimism about future drops: "Even given how much prices have decreased over the past 3 years I think there's still room for them to keep going down. I expect there remain a whole lot of optimizations that have not yet been discovered, in both software and hardware."
- sethkim offers a more tempered view: "No doubt prices will continue to drop! We just don't think it will be anything like the orders-of-magnitude YoY improvements we're used to seeing. Consequently, developers shouldn't expect the cost of building and scaling AI applications to be anything close to 'free' in the near future as many suspect."
Motivation for Price Changes: Profitability and Market Strategy
Some users attribute price changes to companies prioritizing profitability and shareholder value, framing them as "money grabs" rather than purely technical necessities. The strategy of "gain customers at all costs" is also mentioned as a potential driver.
- vfvthunter states their view: "Google is a publicly traded company responsible for creating value for their shareholders. When they became dicks about ad blockers on youtube last year or so, was it because they hit a bandwidth Moore's law? No. It was a money grab."
- guluarte suggests a strategy: "they are doing the we work approach, gain customers at all costs even if that means losing money."
LLM Profitability and Margins
The discussion touches upon whether LLM providers are currently profitable. While some believe major providers are making small margins on inference, the initial cost of training models might lead to overall losses. There's also speculation about differential margins between companies.
- simonw offers an opinion on profitability: "I don't believe that's true on inference - I think most if not all of the major providers are selling inference at a (likely very small) margin over what it costs to serve them (hardware + energy). They likely lose money when you take into account the capital cost of training the model itself, but that cost is at least fixed: once it's trained you can serve traffic from it for as long as you chose to keep the model running in production."
- bungalowmunch adds speculation: "yes I would generally agree; although I don't have a have source for this, I've heard whispers of Anthropic running at a much higher margin compared to the other labs"
The Endgame Strategy for LLM Providers
A cynical view is presented that LLM providers might be intentionally underpricing now to capture the market, with the ultimate goal of significantly increasing prices once software development becomes heavily dependent on them, likening it to the "Uber model."
- throwawayoldie posits: "Yes, and the obvious endgame is wait until most software development is effectively outsourced to them, then jack the prices to whatever they want. The Uber model."
Advancements in Model Quality and Resource Requirements
There's an ongoing trend of increasing model quality and capability, alongside decreasing resource requirements per unit of performance. This raises the question of whether future LLMs will be significantly smaller and more accessible, potentially running on consumer hardware.
- incomingpain highlights this trend: "Llama 4 maverick is 16x 17b. So 67GB of size. The equivalency is 400billion. Llama 4 behemoth is 128x 17b. 245gb size. The equivalency is 2 trillion... So we have quality of models increasing, resources needed decreasing. In 5-10 years, do we have an LLM that loads up on a 16-32GB video card that is simply capable of doing it all?"
- sethkim responds with a nuanced view: "My two cents here is the classic answer - it depends. If you need general 'reasoning' capabilities, I see this being a strong possibility. If you need specific, factual information baked into the weights themselves, you'll need something large enough to store that data."
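Some rough weight-size arithmetic behind incomingpain's question; the parameter counts and bit-widths here are illustrative assumptions, and the figures cover weights only (no KV cache, activations, or runtime overhead).

```python
# Rough arithmetic: size of model weights alone at different quantization
# levels. Parameter counts and bit-widths are illustrative assumptions.

def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params_b, label in ((400, "total parameters (Maverick-class MoE)"),
                        (17, "active parameters per token")):
    for bits in (16, 8, 4):
        print(f"{label:>38} @ {bits:>2}-bit: ~{weights_gb(params_b, bits):.1f} GB")
```

Even at 4-bit quantization the full expert set of a 400B-parameter MoE is around 200 GB, and since all experts generally need to be resident (or streamed in), it is the total rather than the 17B active parameters that determines whether the model fits on a 16-32 GB card.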
The Article's Purpose: Sales Pitch vs. Objective Analysis
A user points out that the article, while potentially insightful, is also part of a sales pitch from a startup, which should be considered when evaluating its conclusions.
- sharkjacobs, quoting the article's closing line and then commenting: "'Feel free to reach out to see how Sutro can help.' I don't have any reason to doubt the reasoning this article is doing or the conclusions it reaches, but it's important to recognize that this article is part of a sales pitch."
- sethkim acknowledges this: "Yes, we're a startup! And LLM inference is a major component of what we do - more importantly, we're working on making these models accessible as analytical processing tools, so we have a strong focus on making them cost-effective at scale."
- samtheprogram defends the practice: "There’s absolutely nothing wrong with putting a small plug at the end of an article."
Questioning the "Quadratic" Statement and Moore's Law Analogies
The mathematical basis for the "quadratic" cost claim is questioned, with one user suggesting the growth might be exponential. There is also a comparison to past "end of Moore's Law" predictions, suggesting that a single minor pricing update is thin evidence for a broad trend.
- georgeburdell asks: "Is there math backing up the “quadratic” statement with LLM input size? At least in the traffic analogy, I imagine it’s exponential, but for small amounts exceeding some critical threshold, a quadratic term is sufficient"
- gpm provides the mathematical justification (restated in standard notation after this list): "Every token has to calculate attention for every previous token, that is that attention takes O(sum_i=0^n i) work, sum_i=0^n i = n(n-1)/2, so that first expression is equivalent to O(n^2). I'm not sure where you're getting an exponential from."
- jasonthorsness offers a cautionary note: "Unfounded extrapolation from a minor pricing update. I am sure every generation of chips also came with “end of Moore’s law” articles for the actual Moore’s law."
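Restating gpm's counting argument in standard notation (the same arithmetic as the quote, just cleaned up): token i attends to the i tokens before it, so processing an n-token prompt costs

```latex
\sum_{i=0}^{n-1} i \;=\; \frac{n(n-1)}{2} \;=\; \Theta(n^{2})
```

units of attention work, which is quadratic, not exponential, growth in n.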
Open Source Alternatives and Context Window Limitations
The feasibility of running open-source LLMs on personal hardware is discussed, with context window size identified as a significant challenge.
- ramesh31 asks: "Context size is the real killer when you look at running open source alternatives on your own hardware. Has anything even come close to the 100k+ range yet?"
- sethkim responds positively: "Yes! Both Llama 3 and Gemma 3 have 128k context windows."
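To ramesh31's point, the limit on local hardware is usually not the advertised context window but the memory needed to actually use it. Here is a rough sketch with an assumed 8B-class model shape (32 layers, 8 KV heads, head dimension 128, fp16 cache, 4-bit weights); all figures are assumptions.

```python
# Why 100k+ context is hard on consumer GPUs: weights plus the KV cache at
# full context can exceed a 16-24 GB card. Model shape is an assumption.

PARAMS_B = 8           # assumed 8B-class model
BITS_PER_WEIGHT = 4    # aggressive quantization
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, KV_BYTES = 32, 8, 128, 2  # assumed shape
CONTEXT = 128_000      # tokens

weights_gib = PARAMS_B * 1e9 * BITS_PER_WEIGHT / 8 / 2**30
kv_gib = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES * CONTEXT / 2**30

print(f"weights ~{weights_gib:.1f} GiB + 128k-token KV cache ~{kv_gib:.1f} GiB "
      f"= ~{weights_gib + kv_gib:.1f} GiB")
```

Under these assumptions the KV cache at full context is roughly four times the size of the quantized weights, so a nominal 128k window is hard to exploit on a single consumer card without cache quantization or offloading.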
The Bottleneck of "Business Necessity" for All-in-One Models
A dissenting opinion suggests that the current focus on massive, all-encompassing models is a business necessity that stifles innovation. The argument is that smaller, specialized models that can be built upon, akin to human learning, are overlooked.
- fusionadvocate argues: "What is holding back AI is this business necessity that models must perform everything. Nobody can push for a smaller model that learns a few simple tasks and then build upon that, similar to the best known intelligent machine: the human. If these corporations had to build a car they would make the largest possible engine, because 'MORE ENGINE MORE SPEED', just like they think that bigger models means bigger intelligence, but forget to add steering, or even a chassi."
Unnecessary Hype and Misleading Article Titles
One user expresses frustration with article titles that claim unique insights when the information is readily available to the community, labeling it as "LinkedIn style 'thought leader' nonsense."
- jjani criticizes the article's framing: "Stopped reading here, if you're positioning yourself as if you have some kind of unique insight when there is none in order to boost youe credentials and sell your product there's little chance you have anything actually insightful to offer. Might sound like an overreaction/nitpicking but it's entirely needless LinkedIn style 'thought leader' nonsense."