Weekly paper roundup: Marco Web Search (5/13/24)


This week’s spotlight paper is about a new search click dataset from Microsoft, MS MARCO Web Search: 10M queries, 10B documents, and 93 languages. Hackernews’ most discussed paper is Meta’s Chameleon, which details how to train a mixed-modal 34B model that beats Gemini Pro and GPT-4V on a new long-form, mixed-modal benchmark, but lacks a discussion of an important modality: audio. Databricks’ paper LoRA Learns Less and Forgets Less has a title that doubles as a TLDR. Stanford publishes a paper on many-shot in-context learning, taking advantage of ever-increasing context lengths.


MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Practical score: :star::star::star:

Authors: Microsoft


MS MARCO Web Search (MWS) is a massive dataset of 10M carefully selected web search queries and 10B clicked documents, in 93 languages. The documents are well parsed and well tagged. This is a significant contribution to the research community studying text vector embeddings and information retrieval tasks such as neural indexers and approximate nearest neighbor (ANN) search. For practitioners building RAG applications, it is less clear whether this resource will be useful.


Every day, millions of search queries are entered into search engines such as Google and Bing. These queries are then logged together with the clicked-through URLs, whose content can be retrieved from Web repositories such as ClueWeb. The resulting query-to-clicked-document pairs, sometimes referred to as a search click log (SCL), carry a large and constantly growing amount of textual semantic labels that can be used for tasks such as vector embeddings and information retrieval. For example, SCL plays a key role in commercial search engines, in both the fast, recall-focused retrieval stage and the slow, precision-focused ranking stage of a typical two-stage search architecture. In the early days, Bing was at a huge disadvantage against Google, given the latter’s much larger SCL (thanks to its head start and market position).
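The two-stage architecture mentioned above can be sketched in a few lines. This is a toy illustration, not anything from the paper: the corpus, embeddings, and scoring functions below are made up, and the "expensive" ranker is a trivial stand-in for a learned cross-encoder.

```python
import math

# Hypothetical toy corpus of document embeddings. In a real system these come
# from a trained text encoder and the index holds billions of documents.
DOCS = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.9, 0.0],
    "d3": [0.6, 0.6, 0.1],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recall_stage(query_emb, k=2):
    """Fast, recall-focused stage: shortlist top-k by embedding similarity."""
    ranked = sorted(DOCS, key=lambda d: dot(query_emb, DOCS[d]), reverse=True)
    return ranked[:k]

def ranking_stage(query_emb, candidates):
    """Slow, precision-focused stage: re-score only the shortlisted candidates
    with a more expensive model (here a trivial normalized similarity)."""
    def expensive_score(doc_id):
        norm = math.sqrt(dot(DOCS[doc_id], DOCS[doc_id])) + 1e-9
        return dot(query_emb, DOCS[doc_id]) / norm
    return sorted(candidates, key=expensive_score, reverse=True)

query = [1.0, 0.0, 0.0]
shortlist = recall_stage(query, k=2)
final = ranking_stage(query, shortlist)
```

The design point is that the expensive scorer only ever sees the small shortlist, so its cost is independent of corpus size; SCL data supplies the labels used to train both stages.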

It is thus an interesting decision by Microsoft to release MWS, an SCL dataset that is orders of magnitude larger than existing datasets. This resource will be welcomed by the research community. One research problem that looks challenging for the foreseeable future is end-to-end recall for text embeddings and disk-based ANN, which is lower than brute-force NN by 10 absolute percentage points across all tested embeddings (see Section 4.6). MWS will help researchers working on this problem.
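For readers unfamiliar with the metric: end-to-end recall@k is simply the fraction of the true (brute-force) top-k neighbors that the approximate search recovers. A minimal sketch, with made-up result lists:

```python
def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact (brute-force) top-k that the ANN search found."""
    k = len(exact_ids)
    return len(set(ann_ids) & set(exact_ids)) / k

# Illustrative only: suppose the ANN index returned 9 of the 10 true
# nearest neighbors, plus one spurious hit.
exact = list(range(10))
approx = list(range(9)) + [42]
gap = recall_at_k(approx, exact)
```

The 10-point gap reported in Section 4.6 is this quantity, measured end to end (embedding plus disk-based ANN) against exhaustive search.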

For practitioners building RAG applications, the impact of MWS is more limited. We will typically use an existing text embedding algorithm and an ANN algorithm (which may be memory-based or disk-based depending on the scale of the document corpus). It is not clear how MWS can help us pick the right choices or fine-tune them for optimal retrieval performance (and consequently RAG performance). Nevertheless, we recommend reading this paper to understand important technical details under the hood of the retrieval component of a RAG system, so we can make informed decisions on text embeddings, ANN options, and more as the research community continues to advance these areas.

While SCL makes a big difference for search engines, it is unclear whether OpenAI’s GPTs (via its partnership with Microsoft) or Google’s Geminis have used this type of data for training (e.g. in fine-tuning); there is no public information on this. Please let us know in the comment section if you have insight here.

Paper roundup

Multimodal LLMs

Chameleon: Mixed-Modal Early-Fusion Foundation Models. Authors: Meta.

  • TLDR: First paper detailing how to train transformers on documents with interleaved text/code and images. It’s helpful to understand the difference between multi-modal and mixed-modal setups.
  • The good: Also shared a new benchmark focusing on long-form mixed-modal performance where Chameleon-34B beats Gemini-Pro and GPT-4V. This was the most discussed paper on Hackernews this past week.
  • The bad: Given the recent release of GPT-4o, a discussion on extending Chameleon to handle audio would have been nice. No demo/code/model are shared yet.
  • Practical score: :star::star:

LLM fine-tuning/many-shot learning

LoRA Learns Less and Forgets Less. Authors: Columbia U and Databricks.

  • TLDR (direct quote from the Introduction):
    • Full finetuning is more accurate and sample-efficient than LoRA in code and math.
    • LoRA forgets less of the source domain, providing a form of regularization.
LoRA’s regularization is stronger than common regularization techniques; it also helps maintain the diversity of generations.
    • Full finetuning finds high rank weight perturbations.
    • Compared to full finetuning, LoRA is more sensitive to hyperparameters, namely learning rate, target modules, and rank.
  • The good: First comprehensive study of this sort. Relevant to practitioners who evaluate finetuning approaches.
  • The bad: From a practical standpoint, it’s not clear that minimizing performance loss on the source domain is desirable. If we only care about the target domain, we would prefer full finetuning, provided we can rule out overfitting. A discussion of this topic would have been welcome.
  • Practical score: :star::star::star:
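To make the "low-rank weight perturbation" finding concrete, here is a minimal sketch of the LoRA idea itself (toy shapes and values, not the paper’s code): instead of updating a d × d weight W directly, LoRA trains a rank-r perturbation B @ A and adds it back scaled by alpha / r.

```python
d, r = 4, 1      # hidden size and LoRA rank (toy values; in practice r << d)
alpha = 2.0      # LoRA scaling factor; the effective scale is alpha / r

# Frozen base weight (identity here, purely for illustration).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# The only trained parameters: A is r x d, B is d x r.
# That is d*r + r*d = 8 parameters here, vs d*d = 16 for full finetuning.
A = [[0.1 for _ in range(d)] for _ in range(r)]
B = [[0.5] for _ in range(d)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Effective weight used at inference: W + (alpha / r) * (B @ A).
delta = matmul(B, A)
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
         for i in range(d)]
```

Because delta has rank at most r, LoRA can only express low-rank perturbations of W; the paper’s observation that full finetuning finds high-rank perturbations is exactly what this construction cannot reach.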

Many-Shot In-Context Learning in Multimodal Foundation Models. Authors: Stanford.

  • TLDR: For in-context learning (ICL), many-shot is better than few-shot, and the more shots the better. Gemini 1.5 Pro is better than GPT-4o at learning from shots.
  • The good: Interesting insight into a problem with practical implications. ICL is a wonderful property. Context size matters (but be sure to look at the real context size). This is the second most discussed paper on Hackernews this past week.
  • The bad: None.
  • Practical score: :star::star::star:
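"Many-shot" here just means packing far more labeled examples into the prompt than the handful used in classic few-shot ICL; the number of shots is bounded only by the model’s context window. A hypothetical prompt builder (the helper name and format are ours, not the paper’s):

```python
def build_icl_prompt(examples, query, max_shots):
    """Pack up to max_shots labeled examples before the test query."""
    shots = examples[:max_shots]
    lines = [f"Input: {x}\nLabel: {y}" for x, y in shots]
    lines.append(f"Input: {query}\nLabel:")  # model completes the last label
    return "\n\n".join(lines)

# Illustrative labeled data; real shots would be task examples.
examples = [(f"text {i}", "pos" if i % 2 == 0 else "neg") for i in range(500)]

few_shot = build_icl_prompt(examples, "new text", max_shots=5)
many_shot = build_icl_prompt(examples, "new text", max_shots=500)
```

The paper’s finding is that, for capable models, performance keeps improving as `max_shots` grows, which is why the usable (not just advertised) context length matters.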

LLM efficiency

Layer-Condensed KV Cache for Efficient Inference of Large Language Models. Authors: Shanghai Tech.

  • TLDR: Keep a KV cache for only a select few layers to speed up inference, at the cost of slower (3X) training.
  • The good: We need more advances in this direction (to increase LLM inference efficiency).
  • The bad: A discussion of the recently published YOCO paper would have been nice.
  • Practical score: :star::star:
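The memory-saving intuition can be sketched in a few lines. This is a toy illustration of the general idea (cache KV for only a subset of layers), not the paper’s actual mechanism; the layer choice and cache layout below are made up.

```python
NUM_LAYERS = 32
CACHED_LAYERS = {0, 15, 31}   # illustrative subset of layers that keep a cache

cache = {}

def store_kv(layer, step, kv):
    """Only the selected layers retain per-token key/value entries."""
    if layer in CACHED_LAYERS:
        cache.setdefault(layer, []).append((step, kv))

# Pretend we decode 4 tokens through all layers.
for step in range(4):
    for layer in range(NUM_LAYERS):
        store_kv(layer, step, kv=("k", "v"))

# Cache memory is roughly |CACHED_LAYERS| / NUM_LAYERS of a full KV cache
# (3/32 here); the uncached layers must get their KVs some other way, which
# is where the extra training cost comes in.
```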

LLM internals

The Platonic Representation Hypothesis. Authors: MIT.

  • TLDR: There is one true representation of reality. Let’s call it the platonic representation.
  • The good: Interesting musings on recent advances in AI. This paper is being actively discussed on Reddit and Hackernews.
  • The bad: Given recent advances in multi- and mixed-modal models (Chameleon, Gemini, GPT-4o), the ideas in this paper are not particularly groundbreaking.
  • Practical score: :star: