Essential insights from Hacker News discussions

Adaptive LLM routing under budget constraints

The Hacker News discussion revolves around several interconnected themes concerning the current state and future of Large Language Models (LLMs), their perceived limitations, and the implications for Artificial General Intelligence (AGI).

The Necessity of Human Preference Data and LLM "Wisdom"

A primary point of contention is whether LLMs possess an intrinsic understanding of "question complexity" or require explicit human feedback to optimize performance. The initial query by fny questioned the need for human preference data, suggesting LLMs might already have a "strong enough notion of question complexity to build a dataset for routing."

delichon countered this, stating, "Aka Wisdom. No, LLMs don't have that. Me neither, I usually have to step in the rabbit holes in order to detect them." fny then elaborated that "'Do you think you need to do high/medium/low amount of thinking to answer X?' seems well within an LLMs wheelhouse if the goal is to build an optimized routing engine."
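A rough sketch of what fny is proposing could look like the following: ask a model to tag each question with a coarse complexity label and collect the pairs as a routing dataset. This is only an illustration of the idea from the thread; the `complete` callable is a hypothetical stand-in for whatever chat-completion API is actually used, and the prompt wording is made up here.

```python
# Sketch of fny's suggestion: ask an LLM to label question complexity,
# then use the labels as a routing dataset. `complete` is a hypothetical
# helper standing in for a real chat-completion call.

LABELS = ("low", "medium", "high")

PROMPT = (
    "How much reasoning is needed to answer the following question well? "
    "Reply with exactly one word: low, medium, or high.\n\nQuestion: {q}"
)

def label_complexity(question: str, complete) -> str:
    """Ask the labelling model for a low/medium/high complexity tag."""
    reply = complete(PROMPT.format(q=question)).strip().lower()
    return reply if reply in LABELS else "medium"  # fall back on ambiguous output

def build_routing_dataset(questions, complete):
    """Pair each question with its self-reported complexity label."""
    return [(q, label_complexity(q, complete)) for q in questions]
```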

However, jibal provided a more fundamental critique, asserting, "LLMs don't have notions ... they are pattern matchers against a vast database of human text." This highlights a recurring skeptical view that LLMs are sophisticated pattern-matching engines rather than entities possessing genuine understanding or "wisdom."

The Stagnation of LLM Progress and the AGI Question

A significant portion of the discussion expresses a sentiment that the rapid advancements in LLM performance seen in the past have slowed or plateaued. This leads to skepticism about the near-term prospects of AGI.

andrewflnr explicitly questioned the research direction, asking, "Is this really the frontier of LLM research? I guess we really aren't getting AGI any time soon, then. It makes me a little less worried about the future, honestly." They later clarified, "That's just the thing. There don't seem to have been any breakthroughs in model performance or architecture, so it seems like we're back to picking up marginal reductions in cost to make any progress."

yahoozoo agreed with this observation: "That and LLMs are seemingly plateauing. Earlier this year, it seemed like the big companies were releasing noticeable improvements every other week. People would joke a few weeks is “an eternity” in AI…so what time span are we looking at now?"

muldvarp offered a contrasting view, noting, "There have been very large improvements in code generation in the last 6 months. A few weeks without improvement are not necessarily a plateau." Nonetheless, the dominant sentiment leans towards a perceived slowdown in fundamental architectural or performance gains.

The notion of AGI itself is also heavily debated, with many participants questioning its definition and attainability. kenjackson stated, "First, I don't think we will ever get to AGI. Not because we won't see huge advances still, but AGI is a moving ambiguous target that we won't get consensus on."

nutjob2 strongly echoed this sentiment: "There's no concrete evidence AGI is possible mostly because it has no concrete definition. It's mostly hand waving, hype and credulity, and unproven claims of scalability right now. You can't move the goal posts because they don't exist."

guluarte presented a more nuanced perspective on AGI realization: "I'm starting to think that there will not be an 'AGI' moment, we will simply slowly build smarter machines over time until we realize there is 'AGI'." This contrasts with the idea of a distinct, recognizable "AGI" breakthrough.

The Unreliability and Dangers of Current LLMs

Despite questions about AGI, there is a strong undercurrent of concern regarding the immediate dangers and unreliability of existing LLMs.

jibal warned, "LLMs are not on the road to AGI, but there are plenty of dangers associated with them nonetheless."

nicce provided a specific and alarming example: "Just 2 days ago Gemini 2.5 Pro tried to recommend me tax evasion based on non-existing laws and court decisions. The model was so charming and convincing, that even after I brought all the logic flaws and said that this is plain wrong, I started to doubt myself, because it is so good at pleasing, arguing and using words."

This led nutjob2 to advise, "Or you could understand the tool you are using and be skeptical of any of its output. So many people just want to believe, instead of the reality of LLMs being quite unreliable." They also shared a personal observation: "Personally it's usually fairly obvious to me when LLMs are bullshitting probably because I have lots of experience detecting it in humans."

roywiggins offered a stark warning about engaging with faulty LLM output: "Once you've started to argue with an LLM you're already barking up the wrong tree. Maybe you're right, maybe not, but there's no point in arguing it out with an LLM."

The Economics of LLM Deployment and Routing

Several commenters engage with the paper's focus on cost savings through intelligent routing, highlighting the significant price disparities between different LLM providers and models.

pbd highlighted the economic incentive: "GPT-4 at $24.7 per million tokens vs Mixtral at $0.24 - that's a 100x cost difference! Even if routing gets it wrong 20% of the time, the economics still work."
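pbd's arithmetic is easy to check against the prices quoted in the comment. The sketch below assumes, purely for illustration (the thread does not say what a misroute costs), that a wrongly routed request pays for the cheap attempt and is then re-run on the expensive model:

```python
# Back-of-the-envelope check of pbd's point, using the per-million-token
# prices quoted in the thread ($24.70 for GPT-4, $0.24 for Mixtral).
# Assumption (not from the thread): a misroute means the cheap answer is
# discarded and the request is re-run on the expensive model.

GPT4_PRICE = 24.70    # $ per million tokens
MIXTRAL_PRICE = 0.24  # $ per million tokens
ERROR_RATE = 0.20     # fraction of requests the router gets wrong

# Every request pays for the cheap attempt; 20% also pay for the expensive retry.
expected_cost = MIXTRAL_PRICE + ERROR_RATE * GPT4_PRICE
print(f"expected cost: ${expected_cost:.2f} per million tokens "
      f"vs ${GPT4_PRICE:.2f} for always using GPT-4")
# -> roughly $5.18 vs $24.70, i.e. still about a 5x saving
```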

pqtyw questioned using expensive models when cheaper alternatives exist: "While technically true why would you want to use it when OpenAI itself provides a bunch of many times cheaper and better models?"

FINDarkside humorously proposed a highly cost-effective routing strategy: "It's trivial to get better score than GPT-4 with 1% of the cost by using my propertiary routing algorithm that routes all requests to Gemini 2.5 Flash. It's called GASP (Gemini Always, Save Pennies)"

A nuanced point about cost calculation was raised by simpaticoder: "PPT (price-per-token) is insufficient to compute cost. You will also need to know an average tokens-per-interaction (TPI). They multiply to give you a cost estimate. A .01x PPT is wiped out by 100x TPI."
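simpaticoder's formula is just a product, but it is worth seeing how quickly verbosity can cancel a price advantage. The token counts below are invented purely for illustration:

```python
# simpaticoder's point as arithmetic: cost per interaction is
# price-per-token (PPT) times tokens-per-interaction (TPI), so a cheaper
# model that is far more verbose can erase its own price advantage.
# The token counts are made-up illustrative numbers.

def cost_per_interaction(ppt: float, tpi: float) -> float:
    return ppt * tpi

expensive = cost_per_interaction(ppt=24.70e-6, tpi=500)     # terse, pricey model
cheap     = cost_per_interaction(ppt=0.247e-6, tpi=50_000)  # 0.01x PPT but 100x TPI

print(f"expensive model: ${expensive:.4f} per interaction")
print(f"cheap model:     ${cheap:.4f} per interaction")     # same cost either way
```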

The Citation and Naming of Research Efforts

Some participants expressed skepticism about the quality and significance of the research being discussed, particularly its source and naming conventions.

ctoth inquired, "Is a random paper from Fujitsu Research claiming to be the frontier of anything?"

andrewflnr broadened this critique, mentioning, "Not just this paper, but model working shenanigans also seem to have been a big part of GPT-5, which certainly claims to be frontier work."

yieldcrv dismissed the significance of papers from arXiv, suggesting it's akin to a blog and can be used for reputation laundering: "just because it’s on arxiv doesn’t mean anything. arxiv is essentially a blog under an academic format, popular amongst asian and south asian academic communities. currently you can launder reputation with it, just like “white papers” in the crypto world allowed for capital for some time. this ability will diminish as more people catch on."

Finally, spoaceman7777 humorously pointed out a missed alliteration opportunity in the paper's naming scheme: "Incredible that they are using contextual bandits, and named it: Preference-prior Informed Linucb fOr adaptive rouTing (PILOT) Rather than the much more obvious: Preference-prior Informed Linucb For Adaptive Routing (PILFAR)"
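For readers unfamiliar with the LinUCB algorithm behind the PILOT/PILFAR name, a generic textbook LinUCB router is sketched below. This is not the paper's method: the preference prior and the budget constraint that give PILOT its name are omitted, and the arm names, feature vectors, and reward definition are placeholders.

```python
import numpy as np

# Minimal LinUCB contextual bandit of the kind the PILOT name refers to,
# routing each query (represented by a feature vector x) to one of several
# models ("arms"). Generic textbook sketch only; the paper's preference
# prior and budget handling are not reproduced here.

class LinUCBRouter:
    def __init__(self, arms, dim, alpha=1.0):
        self.arms = arms                              # e.g. ["gpt-4", "mixtral"]
        self.alpha = alpha                            # exploration strength
        self.A = {a: np.eye(dim) for a in arms}       # per-arm design matrices
        self.b = {a: np.zeros(dim) for a in arms}     # per-arm reward vectors

    def choose(self, x):
        """Pick the arm with the highest upper confidence bound for context x."""
        best, best_ucb = None, -np.inf
        for a in self.arms:
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                 # ridge-regression estimate
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if ucb > best_ucb:
                best, best_ucb = a, ucb
        return best

    def update(self, arm, x, reward):
        """Fold the observed reward back into the chosen arm's statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In the routing setting discussed here, the reward would presumably trade answer quality off against the per-model price, with a budget constraint limiting which arms remain eligible for a given query; that part is specific to the paper and not reproduced above.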