Weekly paper roundup: Long-form factuality in large language models (3/25/24)


Long-form factuality in large language models

Authors: Google DeepMind and Stanford University


How can we measure the factuality of long-form answers produced by Large Language Models (LLMs)? There are two options: use human evaluators, or use LLM agents equipped with Google Search. This paper demonstrates that LLM agents are faster, more cost-effective, and more reliable than human evaluators. Extensive benchmarking shows that GPT-4-Turbo performs best in this setting, convincingly outperforming Claude 3. The benchmark, created using GPT-4, is shared publicly.


Evaluation has emerged as one of the biggest, if not the biggest, challenges in AI. We have moved far past the convenience of metrics such as classification accuracy and F1. This work is an interesting step toward improving our ability to evaluate AI systems.

The above is a screenshot from the paper that captures its motivation and key idea. A question such as “What is the Eiffel Tower?” is typically answered by LLMs with a long-form response spanning multiple sentences and claims, some of which may be incorrect or irrelevant. The contributions/findings of the paper are the following.

  • Benchmark. The authors created 2,280 fact-seeking prompts/questions such as the Eiffel Tower one above, across 38 manually selected topics (e.g. Architecture, Philosophy, Sports, etc.). It’s worth noting that these prompts were created using GPT-4.
  • GPT-3.5-Turbo-0125 agent with Google Search as annotator. This is an intriguing concept. While human annotators have traditionally been used for evaluations, LLMs equipped with tools have become a practical alternative. In this benchmark, the agent dissects a lengthy response into individual facts, then uses a search engine to collect web-based evidence for fact-checking each one. GPT-3.5-Turbo was chosen as a cost-effective baseline, with the option to swap in more powerful and expensive LLMs in the future.
  • LLM agents + tools can outperform human annotators at 1/20 of the cost. In a randomly sampled set of 100 facts where SAFE disagreed with humans, SAFE was correct on 76 instances, and incorrect on 24 instances. That’s a 3X win rate for SAFE.
  • GPT-4-Turbo tops the leaderboard. As expected, the bigger the LLM, the better the performance. Somewhat surprisingly, Claude 3 models (Opus/Sonnet/Haiku) are not in the top 3. GPT-4-Turbo is #1 and Gemini-Ultra is #2.
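The split-search-check loop described above can be sketched as follows. All three steps are naive stand-ins (the real SAFE system uses LLM prompts for splitting facts and rating support, and live Google Search for evidence), so treat this as the shape of the pipeline, not an implementation; every function name here is mine.

```python
# Sketch of a SAFE-style fact-checking pipeline. Each step is a crude
# placeholder for what is, in the paper, an LLM prompt or a live web search.

def split_into_facts(response: str) -> list[str]:
    # Step 1: break a long-form answer into individual facts.
    # SAFE prompts an LLM to produce self-contained atomic facts;
    # we approximate by splitting on sentence boundaries.
    return [s.strip() for s in response.split(".") if s.strip()]

def search_evidence(fact: str) -> list[str]:
    # Step 2: gather web evidence for a fact. Stubbed with a tiny
    # in-memory "web" keyed by phrases, for illustration only.
    fake_web = {
        "Eiffel Tower": "The Eiffel Tower is a wrought-iron tower in Paris",
        "1889": "The Eiffel Tower was completed in 1889",
    }
    return [doc for key, doc in fake_web.items() if key in fact]

def is_supported(fact: str, evidence: list[str]) -> bool:
    # Step 3: decide whether the evidence supports the fact. SAFE asks
    # the LLM to reason over the search results; this stub just checks
    # crude word overlap.
    fact_words = set(fact.lower().split())
    return any(len(fact_words & set(doc.lower().split())) >= 3
               for doc in evidence)

def rate_response(response: str) -> dict[str, str]:
    # Run the full pipeline over every extracted fact.
    return {
        fact: "supported" if is_supported(fact, search_evidence(fact))
        else "not supported"
        for fact in split_into_facts(response)
    }
```

For example, `rate_response("The Eiffel Tower is in Paris. It was completed in 1889. It is painted green")` labels the first two facts supported and the third not supported, since the toy "web" contains no evidence for it.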

I chose this paper for this week’s spotlight because I find the concept of using tool-equipped LLM agents as annotators interesting. This is an approach that practitioners should consider for their LLM applications. Continuous evaluation is crucial for improving the performance of LLM applications. Automating annotations could be a cost-effective and scalable method to achieve this. The paper, which is 61 pages long, is easy to read and is packed with practical suggestions, including those in the extensive appendix.

A practical challenge on the horizon is evaluating LLM agents themselves, which are becoming increasingly common. This is an intriguing area of research. It’s likely that LLM agent systems will require a more complex evaluation approach, in which tool-equipped LLM agents assess individual subsystems rather than only the end-to-end system.

Noteworthy papers

AllHands: Ask Me Anything on Large-scale Verbatim Feedback via Large Language Models. Authors: Microsoft.

  • How do we interpret a large volume of user feedback? Last week’s spotlight paper addressed this question with an LLM agent-based approach that organizes user feedback into a data-driven taxonomy, nearly entirely automatically. This paper tackles the same problem but takes it a step further by creating an interactive user experience: users can ask natural language questions and receive detailed answers that include tables, code, and interactive plots. Very cool! Unfortunately, no code or data is shared.

The Unreasonable Ineffectiveness of the Deeper Layers. Authors: Meta, Cisco, Zyphra, MIT.

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression. Authors: UTexas at Austin, Drexel, MIT, UIUC, Lawrence Livermore Lab, Center for AI Safety, UC Berkeley, UChicago.

  • This is a pair of related papers that are interesting to read together. The shared theme is LLM compression: we all want smaller, more efficient, and more cost-effective models in production. The first paper shows that compression can be achieved by pruning: a) removing certain layers, using similarity as a proxy for redundancy, then b) healing by fine-tuning. It also shows that the most redundant layers tend to be the deeper ones. The second paper, a collaboration across a large number of institutions, studies the effect of LLM compression on safety. It finds that while (moderate) quantization does not degrade safety, pruning severely degrades it.
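The similarity-as-redundancy idea in the first paper can be sketched as follows: compare the representation entering an n-layer block with the representation leaving it, and treat the block that changes its input the least as the best candidate for removal. The function names and the toy vectors are mine; the paper measures this on per-token hidden states of a real model.

```python
import numpy as np

def angular_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Angular distance between two representation vectors, normalized
    # to [0, 1]: 0 means identical direction, 1 means opposite.
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)

def most_redundant_block(hidden_states: list[np.ndarray], n: int) -> int:
    # Given hidden states h_0..h_L (one per layer boundary), return the
    # start index l of the n-layer block whose input h_l and output
    # h_{l+n} are most similar, i.e. the block that is most redundant
    # and would be pruned first (followed by fine-tuning to "heal").
    distances = [angular_distance(hidden_states[l], hidden_states[l + n])
                 for l in range(len(hidden_states) - n)]
    return int(np.argmin(distances))
```

In a real pipeline one would slide this window over every candidate block depth, drop the winning block's layers from the model, and then briefly fine-tune; the paper's observation is that the winning blocks are usually deep in the network.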