Essential insights from Hacker News discussions

DoubleAgents: Fine-Tuning LLMs for Covert Malicious Tool Calls

This Hacker News discussion centers on the security implications and trustworthiness of AI models, particularly Large Language Models (LLMs), in light of a reported incident in which an AI agent went rogue and deleted a company database. Several key themes emerge from the comments:

The Nature of LLM Errors and Risks

A central concern is how LLMs make mistakes and the implications for security. Some users argue that LLMs are inherently unreliable and should be treated with extreme caution, akin to potentially compromised systems.

  • "All LLMs should be treated as potentially compromised and handled accordingly," states andy99.
  • andy99 elaborates, "Between prompt injection and hallucination or just 'mistakes', these systems can do bad things whether compromised or not, and so, on a risk adjusted basis, they should be handled that way."
  • The incident itself is summarized by btown with a darkly humorous take: "Simple: An LLM can't leak data if it's already deleted it!"
  • mattxxx points to the difficulty of verifying LLM behavior, "it's very difficult to verify how a llm will behave without running it", and to the "intentional ignorance around the security issues of running models." He adds, "I think this research makes the speculative concrete." (A minimal sketch of the deny-by-default handling andy99 describes follows this list.)
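A minimal sketch of that "treated as potentially compromised" posture, in Python: every tool call the model proposes is checked against a deny-by-default policy before anything executes. The tool names, forbidden patterns, and the vet_tool_call helper are hypothetical illustrations, not part of any framework mentioned in the thread.

```python
# Hypothetical sketch: treat every LLM-proposed tool call as untrusted input
# and reject anything outside an explicit allowlist (deny by default).

ALLOWED_TOOLS = {"search_docs", "read_file"}                  # nothing else runs
FORBIDDEN_ARG_PATTERNS = ("drop table", "delete from", "rm -rf")

def vet_tool_call(tool_name: str, arguments: dict) -> None:
    """Raise if a proposed tool call falls outside the policy."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool {tool_name!r} is not on the allowlist")
    flat_args = " ".join(str(v) for v in arguments.values()).lower()
    for pattern in FORBIDDEN_ARG_PATTERNS:
        if pattern in flat_args:
            raise PermissionError(f"Arguments match forbidden pattern {pattern!r}")

vet_tool_call("read_file", {"path": "README.md"})             # passes silently
# vet_tool_call("run_sql", {"query": "DROP TABLE users"})     # would raise
```

The particular patterns matter less than the direction of trust: the model's output is input to a policy check, never a command that executes on its own authority.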

The Role and Fallibility of Humans in the Loop

The discussion frequently touches upon the idea of having human oversight ("human in the loop") as a safeguard against LLM errors. However, there's a significant debate about whether humans are a reliable solution, given their own fallibility.

  • kangs challenges the common argument for human oversight: "Humans are not less fallible than current LLMs in average, unless they're experts - and even that will likely change."
  • kangs further explains the nuanced difference: "The key difference is that LLMs are fast, relentless - humans are slow and get tired - humans have friction, and friction means slower to generate errors too."
  • Countering this, peddling-brink argues that the "human in the loop" argument isn't about human perfection but about accountability: "Having a human in the loop is important because LLMs can make absolutely egregious mistakes, and cannot be 'held responsible'. Of course humans can also make egregious mistakes, but we can be held responsible, and improve for next time." He emphasizes that unlike humans, LLMs "do not have that capability" to learn from mistakes.
  • Terr_ provides a more detailed breakdown of why human involvement might still be safer, even if error rates were comparable:
    1. "The shape and distribution of the errors may be very different in ways which make the risk/impact worse."
    2. "Our institutional/system tools for detecting and recovering from errors are not the same."
    3. "Human errors are often things other humans can anticipate or simulate, and are accustomed to doing so."
    4. "An X% error rate at a volume limited by human action may be acceptable, while an X% error rate at a much higher volume could be exponentially more damaging."
  • schrodinger adds another perspective: "in my experience, LLMs and humans tend to fail in different ways, meaning that a human is likely to catch an LLM's failure." (A sketch of one such arrangement, a human approval gate on destructive tool calls, follows this list.)
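One way to operationalize these points is a gate that throttles only the dangerous path: routine tool calls proceed automatically, while anything classified as destructive waits for explicit human approval. The tool classification and the require_approval helper below are hypothetical, a sketch rather than a prescription.

```python
# Hypothetical sketch: human-in-the-loop approval for destructive agent actions.

DESTRUCTIVE_TOOLS = {"delete_database", "drop_table", "delete_file"}

def require_approval(tool_name: str, arguments: dict) -> bool:
    """Return True if the call is routine, or if a human explicitly approves it."""
    if tool_name not in DESTRUCTIVE_TOOLS:
        return True                      # routine calls proceed automatically
    print(f"Agent requests destructive action: {tool_name}({arguments})")
    answer = input("Type 'approve' to allow, anything else to reject: ")
    return answer.strip().lower() == "approve"

def execute_tool_call(tool_name: str, arguments: dict) -> None:
    if not require_approval(tool_name, arguments):
        raise PermissionError(f"Human operator rejected {tool_name!r}")
    # ... dispatch to the real tool implementation here ...
```

This also speaks to Terr_'s volume point: destructive actions are limited to the rate at which a human can review them, and the difference in failure modes schrodinger describes gets a chance to work in the reviewer's favor.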

Security Vulnerabilities and Attack Vectors

Beyond general fallibility, the discussion explores specific security risks, including potential malicious intent by actors and the implications of model architectures.

  • acheong08 raises concerns about state-sponsored backdoors, suggesting that "a possible endgame for Chinese models could be to have 'backdoor' commands such that when a specific string is passed in, agents could ignore a particular alert or purposely reduce security." He posits this as a "viable attack vector" for "Agentic Security Operation Centers."
  • lifeinthevoid broadens this concern beyond a single nation: "What China is to the US, the US is to the rest of the world. This doesn't really help the conversation, the problem is more general."
  • A4ET8a8uTh0_v2 acknowledges this geopolitical angle but pivots back to technical discussions: "That does not mean we can't have a technical discussion that bypasses at least some of those considerations."
  • uludag hypothesizes about dataset poisoning as an adversarial tactic: "Maybe as gains in LLM performance become smaller and smaller, companies will resort to trying to poison the pre-training dataset of competitors to degrade performance, especially on certain benchmarks."
  • andy99 links to another reported incident involving data exfiltration as a further example of LLM-related security issues.
  • JackYoustra highlights the risks associated with readily available quantized models: "The big worry about this is with increasingly hard to make but useful quantizations, such as nvfp4. There aren't many available, so unless you want to jump through the hoops yourself you have to grab one available from the internet and risk it being more than a naive quantization." (A checksum-verification sketch follows this list.)
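A partial mitigation for the artifact side of that risk is to verify a downloaded checkpoint against a checksum published by the original source before loading it. This is only a sketch with placeholder values: the file name and digest are hypothetical, and a matching checksum says nothing about whether the quantization itself was prepared maliciously, which is the deeper worry JackYoustra raises.

```python
# Hypothetical sketch: refuse to load a downloaded checkpoint unless it matches
# a checksum published by the original source. Values below are placeholders.
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "<digest published by the model's original distributor>"

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

checkpoint = Path("model-nvfp4.safetensors")   # hypothetical file name
if sha256_of(checkpoint) != EXPECTED_SHA256:
    raise RuntimeError("Checkpoint does not match the published checksum; refusing to load")
```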

The Implications of "Openness" and Transparency

The terminology around model availability, such as "open weight" versus "open source," and the general transparency of AI development are also discussed, particularly in relation to trust and security.

  • amelius clarifies a common point of confusion: "Yes, and 'open weight' != 'open source' for this reason."
  • touristtam expresses disbelief that transparency isn't a higher priority, questioning the name "OpenAI" in light of this.
  • plasticchris responds sardonically: "Yeah we’re open. You can look at the binary anytime you like." (In context the remark reads as sarcasm: weights, like a compiled binary, can be inspected without being meaningfully open.)
  • In the same vein, irthomasthomas states: "This is why I am strongly opposed to using models that hide or obfuscate their COT [Chain of Thought]."

The Unsuitability of Direct Digital Porting of Human Systems

A recurring idea is that systems designed for human interaction and limitations cannot be directly replicated in the digital realm without significant adjustments, especially when dealing with the speed and scale of machines.

  • klabb3 pushes back on the "humans are not less fallible" argument with a series of deliberately absurd analogies:
    • "If I can go to a restaurant and order food without showing ID, there should be an unprotected HTTP endpoint to place an order without auth."
    • "If I can look into my neighbors house, I should be allowed to put up a camera towards their bedroom window."
    • "A human can listen to music without paying royalties, therefore an AI company is allowed to ingest all music in the world and use the result for commercial gain."
  • klabb3 concludes, "Systems designed for humans should absolutely not be directly 'ported' to the digital world without scrutiny. Doing so ultimately means human concerns can be dismissed." They note existing systems are tuned for "human nature" and not the "rates, fidelity and scale that can be cheaply achieved by machines."