RAFT: Adapting Language Model to Domain Specific RAG

Hi. Hope you are well.
The paper RAFT from UC Berkeley is really interesting, and I wanted to mention it for this week.
Arxiv Link
Plus, check this out.


I have skimmed this paper. It’s very interesting. Thanks for the request. I’ll review it soon.

RAG and fine-tuning (FT) are two well-known techniques for improving the performance of LLMs in a specific domain where a data source is available and must be used in conjunction with an LLM. Both have limitations. RAG's limitation stems from the imperfect way we pull various chunks from the data source to construct an input prompt (context) for the LLM. Fine-tuning is limited in that the LLM is trained to reply to a type of prompt (e.g. questions) without "understanding" how to arrive at an answer. RAG is like a student taking an open-book final exam with all the lecture notes but having skipped the whole semester. Fine-tuning is like a student taking a final exam after memorizing a bunch of answer keys (fine-tuning can be combined with RAG, akin to an open-book exam after rote memorization).
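To make RAG's limitation concrete, here is a minimal sketch of how a retriever builds the context prompt. The corpus, the word-overlap scorer, and the function names are all hypothetical stand-ins (real systems use vector embeddings); the point is that nothing guarantees the answer-bearing chunk survives the top-k cut.

```python
# Minimal sketch of RAG context construction (hypothetical corpus and scorer).
# Real retrievers use embeddings; plain word overlap stands in here.

def score(query: str, chunk: str) -> float:
    """Crude relevance score: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def build_prompt(query: str, corpus: list, k: int = 2) -> str:
    # Pull the top-k chunks; an imperfect scorer can miss the "oracle"
    # chunk that actually contains the answer -- the limitation noted above.
    top = sorted(corpus, key=lambda ch: score(query, ch), reverse=True)[:k]
    context = "\n".join(top)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "RAFT trains the model on domain documents with distractors.",
    "The weather in Berkeley is mild year round.",
    "RAFT uses chain-of-thought answers citing the oracle document.",
]
prompt = build_prompt("How does RAFT train the model?", corpus)
```

With this toy scorer, the irrelevant weather chunk is scored lowest and excluded; with noisier real-world data, the drop can just as easily hit the chunk you needed.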

This paper proposes a new idea for the LLM to do well on the final: train the LLM to spot the correct answer in the context using chain-of-thought (CoT) reasoning. The context is occasionally modified to remove the "oracle" chunk that contains the answer, effectively forcing the LLM to be resilient to imperfect context (another interpretation is forcing the LLM to occasionally memorize). It is not clear to me how the training data for this process was generated; perhaps by using GPT-4-1106 (see Section 4.3).
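My reading of that training setup can be sketched as follows. All names here (`make_example`, `p_drop_oracle`, the sample texts) are my own illustrative choices, not the paper's: each example mixes the oracle chunk with distractor chunks, the oracle is dropped with some probability, and the target is a CoT answer.

```python
import random

# Sketch of RAFT-style training-example construction, as I read the paper.
# Function and parameter names are hypothetical, not from the paper.

def make_example(question, oracle_doc, cot_answer, distractors,
                 p_drop_oracle=0.2, rng=None):
    rng = rng or random.Random()
    docs = list(distractors)
    # With probability 1 - p_drop_oracle, include the oracle chunk;
    # otherwise the model must cope without it (robustness / memorization).
    if rng.random() >= p_drop_oracle:
        docs.append(oracle_doc)
    rng.shuffle(docs)  # oracle position should not be a giveaway
    context = "\n\n".join(docs)
    return {
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        # Target is a chain-of-thought answer that reasons over the context.
        "completion": cot_answer,
    }

ex = make_example(
    "What does RAFT train the model to do?",
    oracle_doc="RAFT trains the model to quote the relevant document.",
    cot_answer="The context says RAFT trains quoting, so the answer is quoting.",
    distractors=["Unrelated document A.", "Unrelated document B."],
    rng=random.Random(0),
)
```

The drop probability is the knob that trades retrieval robustness against memorization; the paper's actual mixing ratio and CoT generation procedure are exactly the details I found under-specified.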

The good: great performance gains for Llama2-7B over RAG, FT, RAG+FT, and even GPT-3.5 + RAG. Code and a demo are shared.

What I found missing: a) (a lot more) detail on how the training data was generated, and b) GPT-4 + RAG as another baseline.

TLDR for practitioners: This is a novel and interesting idea that is worth knowing about and keeping an eye on. If you already have a solution based on SOTA LLMs such as GPT-4+ or Claude and a highly tuned RAG pipeline, then RAFT may not give you a significant performance boost. If your application needs to handle multiple types of queries, then RAG has an edge, since RAFT would require multiple training runs, each of which aims to improve performance for a specific query type.