LLMs for Code Porting: Promise and Pitfalls
The Hacker News discussion revolves around the use of Large Language Models (LLMs) for code porting, particularly focusing on the experiences of the author of the linked article, rjpower9000. While there's optimism about automating repetitive tasks and enabling "radical updates", significant concerns remain regarding reliability, testing, and the need for human oversight.
Fuzzing and Testing Limitations
A key concern is the over-reliance on fuzzing as a testing method, especially when coupled with LLM-generated code. Users point out the limitations of fuzzing in discovering subtle bugs.
awesome_dude argues, "The only advantage this [fuzzing] has over other forms of testing is that it's not constrained by people thinking 'Oh these are the likely inputs to deal with.'" However, xyzzy123 clarifies that modern fuzzers are often "coverage guided and will search for and devote more effort to inputs which trigger new branches / paths". Even so, rjpower9000 notes that "Fuzzing would be unlikely to discover a bug that only occurs on giant inputs or needs a special configuration of lists."
This highlights the need for more sophisticated testing strategies, especially when LLMs are involved. nyanpasu64 shares a link to a Timsort bug, indicating that even seemingly well-tested algorithms can contain subtle flaws.
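To make the coverage-guided point concrete, here is a minimal cargo-fuzz style harness (a sketch, not from the thread; `parse_record` and the `my_port` crate are hypothetical stand-ins for ported code):

```rust
// fuzz/fuzz_targets/parse_record.rs -- a minimal cargo-fuzz target.
// The fuzzer mutates `data`; being coverage-guided, it keeps and further
// mutates any input that reaches new branches in the code under test.
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // A panic, failed assertion, or sanitizer fault here is a finding.
    let _ = my_port::parse_record(data);
});
```

Note that fuzzers spend most of their effort on short inputs (libFuzzer caps input length via its -max_len option), which is one reason the "giant inputs" class of bug rjpower9000 describes tends to slip through.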
LLM Code Generation: Accuracy and Debugging Challenges
Several comments address the current reliability of LLM-generated code and the difficulty in identifying and correcting its errors. hansvm expresses skepticism: "On almost any problem where I'd be faster letting an LLM attempt it rather than just banging out a solution myself, it only comes close to being correct with intensive, lengthy prompting -- after much more effort than just typing the right thing in the first place." They elaborate on the challenges: "When it's wrong, the bugs often take more work to spot than to just write the right thing since you have to carefully scrutinize each line anyway while simultaneously reverse engineering the rationale for each decision". This stands in contrast to rjpower9000's experience, prompting hansvm to ask, "I don't doubt that's what you've actually observed though, so I'm passionately curious where the disconnect lies."
Equivalence Testing and Domain Specific Considerations
amw-zero argues that code porting simplifies testing because "you have a built in correctness statement: The port should behave exactly as the source program does. This greatly simplifies the testing process." This makes equivalence testing a natural fit.
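In practice, this often takes the form of a differential test that drives the port and the original with the same inputs and demands identical results. A minimal sketch using the proptest crate, where `c_original` is a hypothetical FFI binding to the legacy C code and `rust_port` stands in for the translated function:

```rust
use proptest::prelude::*;

// Hypothetical FFI binding to the legacy C implementation.
extern "C" {
    fn c_original(input: *const u8, len: usize) -> u64;
}

// Stand-in body for the translated function under test.
fn rust_port(input: &[u8]) -> u64 {
    input.iter().map(|&b| u64::from(b)).sum()
}

proptest! {
    #[test]
    fn port_matches_original(input in proptest::collection::vec(any::<u8>(), 0..4096)) {
        // The built-in correctness statement: identical behavior on every input.
        let expected = unsafe { c_original(input.as_ptr(), input.len()) };
        prop_assert_eq!(rust_port(&input), expected);
    }
}
```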
bluGill adds a cautionary tale: "Eventually we reach a time where we are getting a lot of bug reports "didn't work, didn't work with the old system as well" which is to say we ported correctly, but the old system wasn't right either and we just hadn't tested it in that situation until the new system had the budget for exhaustive testing." In other words, a faithful port reproduces the source's bugs along with its features.
Achieving a Desired Code Style or "Rustiness"
The discussion touches on the importance of stylistic correctness, particularly in the context of porting to Rust. rcthompson suggests that the lack of "Rustiness" in the generated code could be addressed by "telling the AI to minimize the use of unsafe etc., while enforcing that the result should compile and produce identical outputs to the original." rjpower9000 acknowledges a tradeoff: "Sometimes to make it more Rust-y, you might want an internal function or structure to change. You then lose your low-level fuzz tests." However, he suggests the LLM could generate the missing tests itself: "That said, you could have the LLM write equivalence tests, and you'd still have the top-level fuzz tests for validation. So I wouldn't say it's impossible, just a bit harder to mechanize directly."
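Both halves of that suggestion are mechanizable in principle. A sketch (hypothetical code, not from the thread): forbid unsafe at the crate root so the style constraint is compiler-enforced, and pin an idiomatic rewrite of an internal helper to the literal port with the kind of equivalence test the LLM could write:

```rust
// Crate root: reject any `unsafe` the model produces while "Rust-ifying".
#![forbid(unsafe_code)]

// The literal port used index arithmetic...
fn sum_pairs_literal(xs: &[i64]) -> i64 {
    let mut total = 0;
    let mut i = 0;
    while i + 1 < xs.len() {
        total += xs[i] * xs[i + 1];
        i += 2;
    }
    total
}

// ...while the idiomatic rewrite uses iterators.
fn sum_pairs_idiomatic(xs: &[i64]) -> i64 {
    xs.chunks_exact(2).map(|pair| pair[0] * pair[1]).sum()
}

#[cfg(test)]
mod tests {
    use super::*;
    use proptest::prelude::*;

    proptest! {
        // Pins the rewrite to the literal port even though the internal
        // shape of the code has changed.
        #[test]
        fn refactor_preserves_behavior(
            xs in proptest::collection::vec(-1_000i64..1_000, 0..64)
        ) {
            prop_assert_eq!(sum_pairs_literal(&xs), sum_pairs_idiomatic(&xs));
        }
    }
}
```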
LLMs for Radical Refactoring and API Migration
Several commenters discuss the potential for LLMs to facilitate large-scale refactoring and API migrations. e28eta quotes the original article: "LLMs open up the door to performing radical updates that we'd never really consider in the past." However, they also caution against letting LLMs substitute for proper documentation: "Just as long as it doesn't become 'use this LLM which we've already trained on the changes to the library, and you just need to feed us your codebase and we'll fix it. PS: sorry, no documentation.'"
marxism argues that providing LLM prompts for API migration can be more efficient than traditional documentation: "A prompt would be much more compact... Documentation says 'here's everything now possible, you can do it all!' A prompt says 'here's the specific facts you need.'" rjpower9000 offers the Python sub-interpreter proposal as an example: "A lot of times we want to make some changes that aren't quite mechanical, and if they hit a large part of the code base, it's hard to justify. But if we're able to defer these types of cleanups to LLMs, it seems like this could change."
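For a flavor of a cleanup that is almost, but not quite, mechanical (an illustrative example, not one from the thread): migrating Rust code from the lazy_static crate to the standard library's std::sync::OnceLock changes the shape of the code and its call sites, not just the names:

```rust
// Before (external crate):
//
// lazy_static! {
//     static ref CONFIG: HashMap<String, String> = load_config();
// }
//
// After: std::sync::OnceLock, in the standard library since Rust 1.70.
use std::collections::HashMap;
use std::sync::OnceLock;

fn load_config() -> HashMap<String, String> {
    HashMap::new() // stand-in loader for the sketch
}

fn config() -> &'static HashMap<String, String> {
    static CONFIG: OnceLock<HashMap<String, String>> = OnceLock::new();
    CONFIG.get_or_init(load_config)
}
```

Each individual edit is small, but call sites shift from dereferencing a static to calling a function, exactly the kind of diffuse change that is hard to justify doing by hand across a large codebase.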
Alternative Approaches: Formal Verification and Mutation Testing
The thread explores more advanced techniques for ensuring code correctness. gaogao suggests exploring "formal verification, which I'm pretty bullish about in concert with LLMs." rjpower9000 responds with interest: "I don't know enough about the space, could you have an LLM write a formal spec for a C function and then validate the translated function has the same properties?"
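Tooling in this direction already exists for Rust. For instance, a harness for the Kani model checker asserts a property over all inputs symbolically rather than over sampled ones. A sketch, with port_saturating_add standing in for a hypothetical translated function and the standard library's saturating_add acting as the spec:

```rust
// Hypothetical port of a C arithmetic helper.
fn port_saturating_add(a: u32, b: u32) -> u32 {
    a.checked_add(b).unwrap_or(u32::MAX)
}

// Kani proof harness (run with `cargo kani`): unlike a fuzzer, the model
// checker covers all values of `a` and `b`, not a sampled subset.
#[cfg(kani)]
#[kani::proof]
fn port_matches_spec() {
    let a: u32 = kani::any();
    let b: u32 = kani::any();
    assert_eq!(port_saturating_add(a, b), a.saturating_add(b));
}
```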
zie1ony shares their experience with "property (invariant) testing" and "AI + Mutation Tests", reporting that the best split is "10% human to 90% AI of work."
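For a sense of what such an invariant test looks like, here is a sketch (ported_sort is a hypothetical translated routine): rather than checking specific outputs, it asserts facts that must hold for every input:

```rust
use proptest::prelude::*;

// Stand-in for the ported implementation.
fn ported_sort(xs: &mut [i32]) {
    xs.sort();
}

proptest! {
    #[test]
    fn sort_invariants(xs in proptest::collection::vec(any::<i32>(), 0..256)) {
        let mut sorted = xs.clone();
        ported_sort(&mut sorted);
        // Invariant 1: the output is non-decreasing.
        prop_assert!(sorted.windows(2).all(|w| w[0] <= w[1]));
        // Invariant 2: the output is a permutation of the input.
        let mut expected = xs.clone();
        expected.sort();
        prop_assert_eq!(sorted, expected);
    }
}
```

A mutation-testing tool such as cargo-mutants can then inject small faults into the implementation and report whether the invariants catch them, plausibly the loop that zie1ony's "AI + Mutation Tests" workflow automates.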
Importance of Human Oversight in Test Design
Several commenters stress the importance of human involvement in designing test suites, particularly property tests. LAC-Tech states, "I'd have much more confidence in an AI codebase where the human has chosen the property tests, than a human codebase where the AI has chosen the property tests." They emphasize that "Tests are executable specs. That is the last thing you should offload to an LLM." bccdee builds on this point: "Also, a poorly designed test suite makes your code base extremely painful to change. A well-designed test suite with good abstractions makes it easy to change code, on top of which, it makes tests extremely fast to write." The takeaway: good test design matters more than test quantity, and design judgment is exactly where LLMs still struggle.