Fine-tuning for Knowledge Injection: A Contentious Topic
Several commenters express skepticism about the effectiveness of fine-tuning as a direct method for injecting entirely new knowledge into large language models (LLMs). They argue that fine-tuning is better suited for adapting models to specific tasks or modifying their behavior, rather than fundamentally expanding their knowledge base.
- "Yeah, as soon as I read that I felt like the author was living in a very different context from mine. It's never even occurred to me that fine-tuning could be an effective method for injecting new knowledge." - cbsmith
- "Clickbait headline. 'Fine-tuning LLMs for knowledge injection is a waste of time' is true, but IDK who's trying to do that. Fine-tuning is great for changing model behavior" - reissbaker
Fine-tuning for Specific Tasks & Behavior Modification
Many commenters agree that fine-tuning is valuable for specializing models to excel at particular tasks or to modify their behavior (e.g., uncensoring). This often involves accepting trade-offs in general knowledge or other capabilities; a minimal fine-tuning sketch follows the quotes below.
- "Fine-tuning is great for changing model behavior (i.e. the zillions of uncensored models on Hugging Face are much more willing to respond to... dodgy... prompts than any amount of RAG is gonna get you)" - reissbaker
- "My experience has been that it always hurts generalization, so if you aren't getting reasonable results with a base or chat-tuned model, then fine-tuning further will not help, but if you are getting results then fine-tuning will make it more consistent." - Mathnerd314
- "The point is that a network of hyper-specific fine-tuned models is how a lot of stuff is implemented. So I disagree from direct experience with the premise that fine-tuning is a waste of time because it is destructive." - rybosome
Potential Loss of General Knowledge ("Catastrophic Forgetting")
Multiple users raise the concern that fine-tuning, if not done carefully, can degrade a model's general knowledge and reasoning abilities, a phenomenon sometimes referred to as "catastrophic forgetting"; a simple before-and-after check is sketched after the quotes below.
- "If anything, I expect fine-tuning to destroy knowledge (and reasoning), which hopefully (if you did your fine-tuning right) is not relevant to the particular context you are fine-tuning for." - cbsmith
- "Also I agree with the article, in the literature, the phenomenon to which the article refers is known as “catastrophic forgetting”." - sota_pop
RAG (Retrieval-Augmented Generation) as an Alternative
Some commenters suggest that Retrieval-Augmented Generation (RAG) is a more suitable approach for injecting new knowledge into LLMs, especially for temporary or variable information. RAG retrieves relevant information from an external source and provides it to the LLM as context; a minimal sketch of the pattern follows the quotes below.
- "Fine-tuning is great for changing model behavior (i.e. the zillions of uncensored models on Hugging Face are much more willing to respond to... dodgy... prompts than any amount of RAG is gonna get you), and RAG is great for knowledge injection." - reissbaker
- "RAG and fine-tuning are suitable for different business scenarios. For some directional and persistent knowledge, such as adjustments for power, energy and other fields, it can bring better performance; RAG is more oriented to temporary and variable situations." - mapinxue
LoRA (Low-Rank Adaptation) as a Fine-tuning Technique
The discussion touches on Low-Rank Adaptation (LoRA) as a parameter-efficient fine-tuning method, with commenters pushing back on the idea that LoRA is a replacement for fine-tuning, since LoRA is itself a form of fine-tuning; a sketch of merging a LoRA adapter into the base weights follows the quotes below.
- "Also... "LoRA" as a replacement for finetuning??? LoRA is a kind of finetuning! In the research community it's actually referred to as "parameter efficient finetuning." You're changing a smaller number of weights, but you're still changing them." - reissbaker
- "There is no real difference between fine-tuning with and without a lora. If you give me a model with a lora adapter, I can give you an updated model without the extra lora params that is functionally identical." - robrenaud *"Really interested in the idea though! The dream is that you have your big, general base model, then a bunch of LoRa weights for each task you’ve tuned on, where you can load/unload just the changed weights and swap the models out super fast on the fly for different tasks." - rybosome
Practical Considerations of Fine-tuning
Several users highlight practical considerations such as cost, latency, and performance as key drivers for fine-tuning, particularly for smaller models in specific use cases, a motivation the original article does not address.
- "That said, fine tuning small models because you have to power through vast amounts of data where a larger model might be cost ineffective -- that's completely sensible, and not really mentioned in the article." - solresol
- "Cost, latency, and performance are huge reasons why my company chooses to fine tune models. We start with using a base model for a task and as our traffic grows, we tune a smaller model, resulting huge performance and cost savings." - itake
Distillation of Larger Models to Smaller Models
Model distillation, where a smaller model is trained to match the outputs of a larger, more capable model, is mentioned as another relevant technique; a sketch of a distillation loss follows the quote below.
- "Mostly referred to as model distillation, but I give the author the benefit of the doubt that they didn't mean that." - lyu07282
Interpretations of OpenAI's Claims
One user points out that OpenAI's documentation suggests that fine-tuning does effectively inject new knowledge, even if it involves some destruction of existing knowledge.
- "OpenAI makes statements like: [1] 1) "excel at a particular task" 2) "train on proprietary or sensitive data" 3) "Complex domain-specific tasks that require advanced reasoning", "Medical diagnosis based on history and diagnostic guidelines", "Determining relevant passages from legal case law" 4) "The general idea of fine-tuning is much like training a human in a particular subject, where you come up with the curriculum, then teach and test until the student excels." Don't all these effectively inject new knowledge?" - lovelearning
The Difficulty of Successful Fine-tuning
Several comments highlight the challenges and potential pitfalls of fine-tuning, including the risk of generating nonsensical or repetitive outputs. Successful fine-tuning is described as requiring expertise and careful attention to detail to avoid unintended consequences.
- "It's pretty frustrating to spend weeks on finetuning and end up with a model that says: 'SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT ...'" - gdiamos
The Analogy of Compression & Saturation
One commenter draws an analogy between LLMs and compression algorithms, suggesting that fine-tuning amounts to adding data to the set being compressed. They note that models are often not trained to "saturation" on their initial datasets, leaving room for improvement through fine-tuning if done carefully; a sketch of mixing fine-tuning data with replay samples follows the quote below.
- "Wasn't there that thing about how large LLM's are essentially compression algorithms...Maybe that's where this article is coming from, is the idea that finetuning "adds" data to the set of data that compresses well. But that indeed doesn't work unless you mix in the finetuning data with the original training corpus of the base model." - Mathnerd314