Weekly paper roundup: Gecko text embedding distilled from LLMs (4/1/24)

Spotlight

Gecko: Versatile Text Embeddings Distilled from Large Language Models

Authors: Google DeepMind

Summary

This paper proposes a clever way to train text embedding models that are 1) SOTA in their weight class and 2) competitive with models in much heavier weight classes, in both model size and embedding dimension. The key idea is to use an LLM agent to generate a massive amount of synthetic data for training the embedder. This advance is highly relevant for practitioners who need an embedder, i.e. pretty much all of us. The Gecko embedder is already available via the Google Cloud API.

Details

This is the second week running that the spotlight paper comes from Google DeepMind, and the second where the key idea is using LLMs as AI annotators.

Text embeddings are crucial for modern AI applications, including RAG and vector databases. Often, those of us building RAG applications simply choose an embedding model that we prefer or one that comes by default with the vector database. We usually don’t concern ourselves with how these models are trained. However, this paper will be interesting for those who do. It also showcases an effective use of LLM agents for synthesizing training data, which is becoming an increasingly valuable technique for AI builders.

Typically, embedders are trained on a corpus of (query, passage) pairs, where each query is either a question answered by the passage or a statement supported by it. (A technical detail: this training is the last step of a three-step regime that starts with pre-training a transformer-based language model and then pre-finetuning it on (title, body) pairs from web documents.) To be effective, the embedder needs to see a large number of high-quality pairs across a wide range of topics and styles during training. Just like last week’s spotlight paper, Google DeepMind uses LLM agents to synthesize a massive amount of such pairs to train the Gecko embedders.
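
For readers curious what the fine-tuning step on (query, passage) pairs typically looks like, here is a minimal sketch of a standard in-batch contrastive loss. This illustrates the general technique, not Gecko’s actual training code; the function name, temperature value, and tensor shapes are my own assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, passage_emb, temperature=0.05):
    """In-batch softmax contrastive loss over (query, passage) pairs.

    query_emb, passage_emb: (batch, dim) tensors produced by the embedder.
    The passage at index i is treated as the positive for query i; every
    other passage in the batch serves as a negative.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = query_emb @ passage_emb.T / temperature  # (batch, batch) similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy usage with random tensors standing in for encoder outputs.
loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```

Hard negatives, like the ones mined by the LLM agent described next, are typically concatenated to the passage side so that each query also scores against them.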

How does the LLM agent work? The LLM (unidentified, possibly from the Gemini series) is used in a two-step process. In the first step, a web passage is selected and the LLM is prompted to generate a query (a question or statement) from it. In the second step, an existing embedder (also unidentified) retrieves a set of additional web passages based on the generated query. The goal of this second step is to find passages that pair better with the query than the original one. The retriever can also supply the training regime with interesting hard negatives, i.e. passages that may seem relevant according to the retriever but are deemed not so by the much more powerful LLM. (The authors use the word distillation to describe this LLM agent workflow, but I don’t think it’s a great choice, as readers may confuse it with the common use of model distillation.)
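
To make the two-step workflow concrete, here is a rough sketch in Python. Every component here is a hypothetical stand-in of my own: `generate_with_llm`, `score_with_llm`, and `CorpusIndex` are placeholders for the LLM, the LLM-based relevance grader, and the existing embedder’s index, not the paper’s actual components.

```python
import random

# Hypothetical stand-ins (not the paper's components) so the sketch runs end to end.
def generate_with_llm(prompt):
    """Pretend LLM call: returns a (task description, query) pair for a passage."""
    return "question answering", "what claim does the passage support?"

def score_with_llm(query, passage):
    """Pretend LLM relevance grade for a (query, passage) pair."""
    return random.random()

class CorpusIndex:
    """Pretend nearest-neighbour index built with an existing embedder."""
    def __init__(self, passages):
        self.passages = passages
    def search(self, query, top_k):
        return random.sample(self.passages, min(top_k, len(self.passages)))

def synthesize_example(seed_passage, index, top_k=20):
    # Step 1: prompt the LLM to write a retrieval task and a matching query
    # for the seed passage.
    task, query = generate_with_llm(
        f"Read this passage and write a retrieval task and a query it answers:\n{seed_passage}"
    )

    # Step 2: retrieve neighbours of the generated query with an existing embedder,
    # then let the LLM grade each candidate's relevance to the query.
    candidates = index.search(query, top_k)
    ranked = sorted(candidates, key=lambda p: score_with_llm(query, p), reverse=True)

    # The top-ranked passage becomes the positive (it may displace the seed passage);
    # a retrieved-but-low-ranked passage becomes a hard negative.
    return {"task": task, "query": query, "positive": ranked[0], "negative": ranked[-1]}

corpus = [f"web passage {i}" for i in range(100)]
example = synthesize_example(corpus[0], CorpusIndex(corpus))
```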

The results are pretty impressive. I’d like to encourage readers to explore variations of this LLM-agents-as-data-annotators theme in their own applications. To achieve high-quality annotations, a multi-step process is often used in which the LLM is called multiple times. Let me know in the comments if you have had success with this technique.

Noteworthy papers

Octopus v2: On-device language model for super agent. Authors: Stanford.

  • Problem: build a small LLM with function calling that can be deployed on mobile phones as an assistant that helps users do things on their phones via a chat interface. Solution: use Google’s Gemma-2B with a special fine-tuning idea: represent function names with special tokens (see the sketch below). Results: 20x reduction in token length, 25x improvement in latency, and much improved battery life. This could be the basis of Siri 2.0.
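
To illustrate the special-token trick, here is a toy sketch. The token names (`<fn_0>`, …) and the function table are made up for illustration; the paper’s actual token vocabulary and output formatting may differ.

```python
# Toy illustration of mapping each callable function to one reserved token so the
# model emits a single token (plus arguments) instead of spelling out a long
# function name and schema, which is what cuts decoding length and latency.
FUNCTIONAL_TOKENS = {
    "<fn_0>": "take_photo",
    "<fn_1>": "send_text_message",
    "<fn_2>": "set_alarm",
}

def decode_function_call(generated_text):
    """Turn model output like '<fn_2>(time="7:00am")' into a (name, args) pair."""
    token, _, arg_str = generated_text.partition("(")
    name = FUNCTIONAL_TOKENS.get(token.strip())
    return (name, arg_str.rstrip(")")) if name else None

print(decode_function_call('<fn_2>(time="7:00am")'))  # ('set_alarm', 'time="7:00am"')
```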

ReFT: Representation Finetuning for Language Models. Authors: Stanford.

  • PEFT techniques such as LoRA are currently the preferred ways to fine-tune LLMs. Instead of updating weights, what if we train the model to learn “intervention” functions that edit its hidden representations via hooks at different layers? That’s the idea behind ReFT (see the sketch below). Experiments show exciting potential: better performance than PEFT baselines and shorter training time. Code is released.
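
Here is a minimal sketch of what a learned intervention attached via a forward hook could look like, loosely following the low-rank (LoReFT) variant. The toy model, rank, and shapes are my own assumptions, not the released code.

```python
import torch
import torch.nn as nn

class LowRankIntervention(nn.Module):
    """Edit a hidden state inside a low-rank subspace: h + R^T (W h + b - R h)."""
    def __init__(self, hidden_size, rank=4):
        super().__init__()
        self.R = nn.Linear(hidden_size, rank, bias=False)  # subspace projection
        self.W = nn.Linear(hidden_size, rank)               # learned target projection

    def forward(self, h):
        delta = self.W(h) - self.R(h)     # (..., rank): gap between h and the target
        return h + delta @ self.R.weight  # map the correction back to hidden size

def make_hook(intervention):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        edited = intervention(hidden)
        return (edited,) + output[1:] if isinstance(output, tuple) else edited
    return hook

# Toy frozen "model": two linear layers standing in for transformer blocks.
base = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
for p in base.parameters():
    p.requires_grad_(False)

intervention = LowRankIntervention(hidden_size=16, rank=4)
base[0].register_forward_hook(make_hook(intervention))
out = base(torch.randn(2, 16))  # only the intervention's parameters are trainable
```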

Long-context LLMs Struggle with Long In-context Learning. Authors: Waterloo, CMU, Vector Institute.

  • There has been a race among LLM developers to increase context length. This paper points out a limitation of existing long-context LLMs in the context (no pun intended) of extreme-label classification. TLDR: beyond 20K tokens, performance degrades severely, with the exception of GPT-4. Even the mighty GPT-4 buckles when the number of labels is increased to 174. However, Gemini 1.5 and Claude 3, the two most recent LLMs with much-hyped long-context support, were not included in the study. I suspect they don’t perform much better either. Bottom line: you may see success using long context in a specific scenario, but don’t bet everything on it.

Jamba: A Hybrid Transformer-Mamba Language Model. Authors: AI21 Labs.

  • Great result coming out of AI21 Labs. Jamba introduces a hybrid Transformer-Mamba architecture with a mixture-of-experts (MoE) design (see the sketch below). This configuration, which fits on a single 80GB GPU, combines the strengths of both model families, achieving high throughput and memory efficiency while delivering SOTA performance on language tasks with context lengths up to 256K tokens. Weights are released. We will hear a lot more about Mamba/SSMs in the future.
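
As a rough illustration of what “hybrid” means here, the sketch below lays out a stack that interleaves attention and Mamba layers and swaps some dense MLPs for MoE blocks. The specific ratios are placeholders chosen for readability, not Jamba’s actual layer composition; see the paper for the real configuration.

```python
# Illustrative layer plan for a hybrid Transformer-Mamba stack (not AI21's code).
def build_layer_plan(n_layers, attention_every=8, moe_every=2):
    plan = []
    for i in range(n_layers):
        mixer = "attention" if i % attention_every == 0 else "mamba"  # token mixer
        mlp = "moe" if i % moe_every == 1 else "dense"                # channel mixer
        plan.append((mixer, mlp))
    return plan

for idx, (mixer, mlp) in enumerate(build_layer_plan(16)):
    print(f"layer {idx:2d}: {mixer:9s} + {mlp} MLP")
```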