Weekly paper roundup: Arctic-Embed (5/6/24)


This week’s spotlight paper is about Snowflake’s new text embedding models, Arctic-Embed, with a practical score of 4. We also round up papers covering LLM efficiency, glitch tokens, and the effects of fine-tuning on hallucination. As with last week, we saw another report from Google about Gemini for medical applications, this time for specific use cases such as radiology, histopathology, and genomics. The most talked-about paper on Hacker News, xLSTM, is also discussed.


Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models.

Practical score: :star::star::star::star:

Authors: Snowflake.


Snowflake released a few small, retrieval-optimized text embedding models ranging from 22M to 334M parameters, with embedding dimensions of 384, 768, and 1024. These models achieve SOTA results in their weight classes according to the retrieval portion of the MTEB benchmark. The models are Apache 2.0 licensed, available via Hugging Face, and already supported in LangChain and LlamaIndex.
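For readers new to embedding-based retrieval, here is a minimal sketch of what "retrieval" means downstream of a model like this: documents and the query are mapped to vectors, and documents are ranked by cosine similarity to the query. The 4-dimensional vectors below are made-up toy values, not Arctic-Embed outputs (real ones would have 384, 768, or 1024 dimensions), and the `cosine` and `rank` helpers are hypothetical names used only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]

# Toy embeddings (assumed values; a real pipeline would get these from the model).
docs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.6, 0.0],
    [0.1, 0.0, 0.0, 0.9],
]
query = [0.0, 0.7, 0.7, 0.1]

order = rank(query, docs)
print(order)  # document 1 points in nearly the same direction as the query
```

In practice the vectors would come from the embedding model and the ranking would usually be handled by a vector index rather than a Python loop.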


This is the second spotlight paper about text embedding models, after Google’s 1.2B-param Gecko about a month ago. The practical score of 4 reflects the important role of embedding models in contemporary AI applications, as well as the fact that Arctic-Embed models are available for experimental and production use for everyone within a day of the paper’s upload to arXiv. Sweet!

I captured the highlights in the summary section above. A few things keep this paper from earning the highest possible score of 5:

  • The focus is on retrieval, with the retrieval portion of the MTEB benchmark used for comparisons. There is no report on Arctic-Embed’s performance on the remaining portions of MTEB.
  • While retrieval is a popular downstream use of embeddings, it is not clear if models optimized for retrieval perform better in real-world retrieval use cases compared to models that are balanced and perform well across the board in MTEB.
  • The related work section lacks a discussion of developments such as Google’s Gecko and the Mixedbread models. Gecko introduced a novel use of LLM agents to synthesize training data. The Mixedbread models are notable for their use of Matryoshka representation learning and binary quantization to optimize for speed and cost. This omission is especially disappointing for practitioners given Arctic-Embed’s focus on model size.
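To make the last point concrete, here is a rough sketch of the two cost-saving ideas mentioned above. Matryoshka-style embeddings are meant to be truncated to their leading dimensions and re-normalized, and binary quantization keeps only the sign of each dimension so vectors can be compared by Hamming distance. The vectors and helper names are illustrative assumptions, not code from any of the papers, and truncation only preserves quality when the model was trained with a Matryoshka objective.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def binarize(vec):
    """Binary quantization: 1 for positive values, 0 otherwise."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy 8-dimensional embeddings (assumed values for illustration only).
full = [0.5, -0.5, 0.5, -0.5, 0.1, -0.1, 0.05, 0.02]
other = [0.4, -0.6, -0.3, -0.2, 0.2, -0.3, 0.1, -0.1]

small = truncate_and_normalize(full, 4)   # half the storage per vector
bits_a = binarize(full)                   # one bit per dimension
bits_b = binarize(other)
distance = hamming(bits_a, bits_b)
print(small, bits_a, bits_b, distance)
```

Both tricks trade some accuracy for storage and speed, which is why they matter when comparing models that compete on size.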

Despite these shortcomings, the Arctic-Embed model family is a welcome addition to the growing collection of text embedding models available to the community. Kudos to Snowflake! After matching their main rival Databricks’ release of the DBRX LLM with Arctic LLM, Snowflake one-upped Databricks with Arctic-Embed. This likely draws on expertise gained from their recent acquisition of the search startup Neeva.

What do you think of Arctic-Embed? Let us know in the comment section.

Paper roundup

LLM prompting

Chain of Thoughtlessness: An Analysis of CoT in Planning (arxiv.org). Authors: ASU.

  • TLDR: CoT prompting does not help LLMs think better in classical planning.
  • The good: It’s a good reminder that LLMs are not good at everything, and CoT is not a reasoning silver bullet.
  • The bad: Classical planning problems such as Blocksworld have long been considered toy research problems without clear real-world applications. As practitioners, we may not care that LLMs suck at toy AI problems if they do great on problems that customers care about. The title is too cute; this research group at ASU has been vocal about this for some time, coming across as the Debbie Downer of LLMs.
  • Practical score: :star::star:

Benchmarks and evaluations

Advancing Multimodal Medical Capabilities of Gemini. Authors: Google Research and Google DeepMind.

  • TLDR: Gemini models can be fine-tuned to perform well on multimodal medical tasks: 2D and 3D radiology, histopathology, ophthalmology, dermatology, and genomic data. Another week, another massive flex of Gemini for medical applications, in another massive 62-page paper.
  • The good: Good news for medical AI.
  • The bad: No code, data, weights, or APIs are available yet.
  • Practical score: :star::star::star:. Let’s keep an eye on when these models are available.

LLM efficiency

You Only Cache Once: Decoder-Decoder Architectures for Language Models. Authors: Microsoft Research Asia.

  • TLDR: Introduces a self-decoder followed by a cross-decoder, which allows an efficient global KV cache that translates into memory savings, higher throughput, and lower prefilling latency.
  • The good: Comparable performance for 3B-param LLM with other same-size models while enjoying great memory/throughput/latency benefits. Code is available.
  • The bad: None.
  • Practical score: :star::star::star:. Will be interesting to see a) experiments with larger LLMs and b) combining YOCO with BitNet and Groq.
  • Note: Not sure if this research group is affected by Microsoft’s recent push to ask China-based AI and cloud computing employees to relocate to other countries. They have been doing interesting work!
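As a back-of-the-envelope illustration of why caching keys and values once saves memory, the sketch below compares a cache kept for every layer with a single shared cache. This is a simplification under stated assumptions, not the paper’s accounting: it ignores any small cache the self-decoder keeps, assumes the K/V width equals the hidden size, and uses made-up model dimensions rather than YOCO’s actual configuration.

```python
def kv_cache_bytes(seq_len, hidden, n_cached_layers, bytes_per_value=2):
    """Bytes needed to store keys and values for `n_cached_layers` layers.

    The leading factor of 2 accounts for storing both K and V;
    bytes_per_value=2 assumes 16-bit floats.
    """
    return 2 * n_cached_layers * seq_len * hidden * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
SEQ_LEN = 32768
HIDDEN = 3072
LAYERS = 26

per_layer = kv_cache_bytes(SEQ_LEN, HIDDEN, n_cached_layers=LAYERS)
shared = kv_cache_bytes(SEQ_LEN, HIDDEN, n_cached_layers=1)
ratio = per_layer / shared

print(f"per-layer cache: {per_layer / 2**20:.0f} MiB")
print(f"shared cache:    {shared / 2**20:.0f} MiB")
print(f"ratio:           {ratio:.0f}x")
```

Under these assumptions the saving scales with the number of layers, which is why the benefit grows with model depth and context length.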

LLM internals

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models. Authors: Cohere.

  • TLDR: Analysis of glitch tokens: untrained or undertrained, how to find them, and what can be done to deal with them. They are pretty prevalent.
  • The good: Actionable advice on how to align the tokenizer and the model.
  • The bad: None.
  • Practical score: :star::star::star:. Especially useful for the development of future LLMs.

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? Authors: Technion and Google Research.

  • TLDR: Yes.
  • The good: Good to know for those who fine-tune.
  • The bad: None.
  • Practical score: :star::star::star:

LLM frontier

xLSTM: Extended Long Short-Term Memory

xLSTM shows that the gap between the LSTM and the Transformer can be narrowed somewhat. It remains to be seen whether it is more than yet another Transformer pretender. 1/5.

  • TLDR: The LSTM strikes back, showing the gap between it and Transformer can be narrowed somewhat. Work by one of the fathers of LSTM, Sepp Hochreiter.
  • The good: This work helps deepen our understanding of the intrinsic power of various architectures.
  • The bad: None. It remains to be seen whether further experiments reveal more interesting insights.
  • Practical score: :star::star:

LLM announcements

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
