Weekly paper roundup: RULER: real context size of LLMs (4/8/2024)


RULER: What’s the Real Context Size of Your Long-Context Language Models?

Authors: NVIDIA


In the LLM world, a longer context window, as advertised in the spec, does not necessarily mean better performance. This paper introduces a new benchmark, called RULER, to test LLMs' abilities to handle long context. RULER is more demanding than the popular retrieval-focused needle-in-a-haystack benchmark (NIAH), also testing abilities such as coreference resolution and aggregation. The authors evaluate GPT-4 and nine open-source LLMs on this benchmark. While all ten models claim to accept 32K context, only four of them (GPT-4, Command-R, Yi-34B, and Mixtral 8x7B) maintain satisfactory performance at that length. GPT-4 unsurprisingly emerges as the winner.


Almost every week brings announcements of new LLMs boasting impressive benchmark performances. While this is excellent news for AI developers, it underscores the need for improved benchmarks that continue to drive progress and prevent overfitting. This paper contributes such a benchmark, focusing on long-context performance, which is currently an extremely active area of research and commercialization (see Gemini 1.5, Claude 3, Jamba, etc.). The widely used, retrieval-focused NIAH benchmark is nearing saturation, as several leading models perform nearly perfectly on it. RULER goes beyond simple retrieval, testing more advanced linguistic abilities such as coreference resolution and aggregation. Below are some key findings:

  • All models suffer large performance degradation as context lengths increase.
  • GPT-4-1106-preview is the winner. Even so, it suffers a 15-percentage-point drop when context length is increased from 4K to 128K.
  • Google's Gemini models, in particular Gemini 1.5, were not included in the evaluation, and neither was Claude 3. It would be interesting to evaluate these models on RULER.
  • Models with Transformer-alternative architectures such as Mamba and RWKV perform poorly.
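To make the baseline concrete, the NIAH-style retrieval task that RULER extends can be sketched as follows. This is a minimal, hypothetical prompt generator (not the authors' code): a key-value "needle" is hidden at a random position inside filler text, and the model is asked to retrieve the value.

```python
import random


def make_niah_prompt(needle_key: str, needle_value: str,
                     num_filler_sentences: int = 200, seed: int = 0) -> str:
    """Build a synthetic needle-in-a-haystack prompt: one key-value
    "needle" hidden at a random position inside repetitive filler text."""
    rng = random.Random(seed)
    filler = "The grass is green. The sky is blue. The sun is bright."
    sentences = [filler] * num_filler_sentences
    needle = f"The special magic number for {needle_key} is {needle_value}."
    # Insert the needle at a random position in the haystack.
    sentences.insert(rng.randrange(len(sentences) + 1), needle)
    context = " ".join(sentences)
    question = f"What is the special magic number for {needle_key}?"
    return f"{context}\n\n{question}"


# Scoring is simple string matching: did the model's answer contain the value?
prompt = make_niah_prompt("alpha", "7421", num_filler_sentences=50)
```

RULER's harder variants build on this template, e.g. multiple needles that must be aggregated, or needles referred to indirectly so the model must resolve the reference first.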

TLDR for practitioners: GPT-4 continues to be a strong option. Be careful when experimenting with newer LLMs: don't trust the advertised specs (especially context length) or metrics on old benchmarks.

Noteworthy papers

LLoCO: Learning Long Contexts Offline. Authors: UC Berkeley.

  • Interesting work on handling long context by combining context compression with LoRA fine-tuning. Key passage:

    To illustrate our idea, consider an analogy: envision an LLM as a student preparing for an exam, where we, the researchers, are the examiners providing study materials and questions. Traditional in-context learning with full context or Retrieval-Augmented Generation (RAG) resembles an open-book exam, where the LLM has access to all materials while answering questions. In contrast, our approach is akin to a semi-closed-book exam, where the LLM cannot bring the entire book but is allowed to bring a cheat sheet. To excel in the exam, the student must 1) study efficiently to distill a concise yet informative cheat sheet, and 2) effectively retrieve relevant information from the cheat sheet to accurately answer exam questions. Namely, 1) How can we train a model to produce a compact representation of the original context that the LLM can interpret and utilize effectively? 2) How to enable the LLM to proficiently navigate and extract pertinent details from this representation during inference?

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. Authors: U of Hong Kong, CMU, Salesforce, U Waterloo.

  • Tough new benchmark for assessing agents that operate computers running macOS, Windows, and Linux. GPT-4(V) is SOTA at 12% task success versus humans' 72%. Claude 3 is significantly behind GPT-4; analysis reveals that its grounding ability is limited.

A shame that Claude is not included in this benchmark. Thanks for sharing.

The authors indicated that they evaluated Gemini-1.5-Pro, which now takes over the top ranking from GPT-4. Claude 3's eval is in the works, pending API access.

Results are reported in the RULER GitHub repo (hsiehjackson/RULER), which also contains the source code for the benchmark.