Weekly paper roundup: KAN networks (4/29/24)


This week we saw two papers receiving notable attention from the community, garnering hundreds of upvotes on Hacker News. We pick one of them as the spotlight paper. Benchmarks and evaluations is the most active topic, with five papers addressing issues from data leakage to using LLMs to evaluate LLMs. Google published a massive, 58-page report flexing the capabilities of various Gemini models optimized for the medical domain. Most of this week’s papers are not essential reading for AI practitioners. Starting from this issue, we will assign each paper a practical score on a scale of 1 (early/theoretical) to 5 (prototyping encouraged).


KAN: Kolmogorov-Arnold Networks. Practical score: 1/5.

Authors: Ziming Liu et al. (MIT, Caltech, Northeastern U)


This paper introduces a new type of neural network inspired by the Kolmogorov-Arnold representation theorem (KART). Existing KART-inspired research has studied only a simple 2-layer network that maps neatly to the theorem. The authors experiment with making these networks deeper and invent preliminary techniques to train them successfully, albeit slowly (especially compared to today’s GPU-optimized networks). The new architecture has several potential (i.e., promising but unproven) advantages over the current multi-layer perceptron (MLP) architecture with respect to parameter efficiency, interpretability, and malleability. The experiments were limited to a few problems in math and physics. The paper is beautifully written and a fascinating read, especially for the math-inclined. However, it has zero practical implications for today’s AI builders. The jury is still out on whether KAN will remain a purely theoretical advance or become a worthy successor to the MLP.
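To make the core idea concrete: where an MLP puts a learnable scalar weight on each edge, a KAN puts a learnable univariate function on each edge and simply sums the results at each node. The sketch below is my own toy illustration, not the authors’ code: it parameterizes each edge function as a linear combination of fixed radial-basis functions, whereas the paper uses B-splines plus a residual activation.

```python
import numpy as np

class ToyKANLayer:
    """Toy KAN-style layer: one learnable 1D function per (input, output) edge."""

    def __init__(self, in_dim, out_dim, n_basis=8, rng=None):
        rng = rng or np.random.default_rng(0)
        self.centers = np.linspace(-1.0, 1.0, n_basis)  # fixed basis grid
        # learnable coefficients: one vector per edge (i, j)
        self.coef = rng.normal(scale=0.1, size=(in_dim, out_dim, n_basis))

    def __call__(self, x):
        # x: (batch, in_dim); evaluate every basis function at every input value
        phi = np.exp(-((x[..., None] - self.centers) ** 2) / 0.1)  # (b, in, k)
        # edge function value = phi(x_i) . coef[i, j]; sum over inputs i
        return np.einsum("bik,ijk->bj", phi, self.coef)

layer = ToyKANLayer(in_dim=3, out_dim=2)
out = layer(np.random.default_rng(1).uniform(-1, 1, size=(4, 3)))
print(out.shape)  # (4, 2)
```

Training such a layer means fitting the per-edge coefficients, which is why KANs are currently much slower than matrix-multiply-friendly MLPs on GPUs.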


We typically pick spotlight papers based on our belief that practitioners will find them helpful in building AI applications. Yet this week’s spotlight receives a practical score of 1/5, for two reasons: a) the KAN paper received widespread coverage across X, Reddit, and Hacker News, and b) all other papers this week score 3 or below.

Let’s talk about the first reason. Today we are witnessing an unprecedented level of interest in AI, including AI research papers. Every week, multiple papers are discussed on Hacker News and other social media websites (see, for example, data tracked by the website Emergent Mind). The community is on a treasure hunt for the next seminal advance, asking: what comes after Transformers? Mamba? RWKV? KAN?

History has taught us that breakthroughs come from a combination of many factors that go far beyond a clever idea in a paper. Take OpenAI’s success with its GPTs as an example. There is no doubt that the Transformers paper (actual title: Attention Is All You Need) played a role there, but I argue that the role was relatively small. A much likelier explanation for their success is a combination of:

  • Betting on language modeling as a promising universal training objective (see earlier works such as ELMo and BERT). Note that the Transformers paper was concerned with tasks such as machine translation and constituency parsing, not language modeling.
  • Betting on scaling: scale network sizes and data as much as possible and see what happens (plus the gumption to partner with a deep-pocketed company to finance this expensive experiment). OpenAI put the first L in LLM even though Google had historically been the biggest proponent of scaling, in addition to employing the authors of the Transformers paper.
  • Extreme attention to data (quality, diversity, and sheer volume) and other ML best practices. I have heard that OpenAI’s crawling, parsing, and other data-quality technologies are superior to those of competitors with years of head start.

With this in mind, the KAN paper is but a clever idea at the moment. Go read it for fun between your RAG-tuning or flow-engineering experiments. It has an interesting theoretical result (Theorem 2.1), thorough experiments, and outstanding exposition. Kudos to the authors!

Paper roundup

LLM applications

PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval. Authors: CSIRO, University of Waterloo, The University of Queensland.

  • TLDR: Use LLMs to generate embeddings for information retrieval, specifically for fast and high-recall doc retrieval in a two-stage IR system.
  • The good: Novel prompting strategy that elicits a good vector representation of text that can be used for both dense and sparse retrieval.
  • The bad: Seems expensive to index lots of docs offline. Gain over BM25 is not impressive.
  • Practical score: 3. Search/RAG engineers may want to keep this idea in mind.

Benchmarks and evaluations

Capabilities of Gemini Models in Medicine. Authors: Google.

  • TLDR: Massive project at Google to show the world how great Gemini models are. Strategy: pick a high-profile domain (medical) and throw everything, including the kitchen sink, at fine-tuning Gemini variants to squeeze out the best possible performance. The results show a host of SOTA numbers, beating GPT-4(V).
  • The good: Pushing the frontier of AI for medicine and validating the potential of Gemini and the power of Google’s mighty resources. OpenAI will need to respond, which is good for the community. The report is comprehensive (and massive).
  • The bad: Code and model weights are not available yet. The models will be available at some unspecified time in the future.
  • Practical score: 3. Will be higher when the models are available for production use.

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. Authors: KAIST, LG, CMU, MIT, AI2, UIC

  • TLDR: An open-source alternative to GPT-4 built specifically for evaluating other LLMs.
  • The good: Useful contribution to the community. Interesting use of model merging. New benchmark dataset for pairwise preference ranking.
  • The bad: Lacks a discussion of limitations and future directions.
  • Practical score: 2. It’s unclear in what situations this model would best serve the needs of practitioners.

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. Authors: Cohere

  • TLDR: For evaluation, use a panel of small/cheap LLMs instead of one large/expensive LLM for comparable performance and 1/7 of the cost. Selected model families: Cohere R, Mistral, GPT-3.5. Related to the Prometheus 2 paper above.
  • The good: New insight about the effectiveness of an LLM panel.
  • The bad: No code/model is shared.
  • Practical score: 2.
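The panel idea reduces to something quite simple: collect verdicts from several small judge models and aggregate by vote. The sketch below is a hypothetical skeleton; the `judges` here are stand-in lambdas where real code would wrap API calls to models such as the Cohere, Mistral, and GPT-3.5 families used in the paper.

```python
from collections import Counter

def panel_verdict(prompt, answer, judges):
    """Ask each judge model for a verdict (e.g. 'A' or 'B'), return the
    majority vote and the fraction of judges agreeing with it."""
    votes = [judge(prompt, answer) for judge in judges]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)

# stand-in judges; in practice each would call a different small LLM
judges = [
    lambda p, a: "A",
    lambda p, a: "B",
    lambda p, a: "A",
]
result = panel_verdict("Which answer is better?", "candidate answer", judges)
print(result)  # majority verdict plus agreement fraction
```

Because each panelist is small and cheap, the total cost stays well below a single GPT-4 call, which is the source of the paper’s reported ~1/7 cost figure.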

A Careful Examination of Large Language Model Performance on Grade School Arithmetic. Authors: Scale AI

  • TLDR: A new benchmark dataset and an accompanying fascinating analysis of LLM overfitting: Phi and Mistral models overfit, whereas frontier models (GPT-4, Claude, Gemini) don’t.
  • The good: This is perhaps the first benchmark designed specifically to analyze LLM overfitting; companies, especially those that develop proprietary models, most likely have proprietary benchmarks of their own.
  • The bad: None
  • Practical score: 2. It could be helpful to consider this analysis when selecting which LLM to use.

Benchmarking Benchmark Leakage in Large Language Models. Authors: Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Generative AI Research Lab (GAIR)

  • TLDR: Use perplexity and n-gram accuracy to detect potential data leakage.
  • The good: Addressing an important challenge in LLM evaluation today. Code and demo are shared.
  • The bad: None
  • Practical score: 2.
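The n-gram half of the detection idea can be sketched in a few lines. This is my simplified reading, not the authors’ released code: feed the model a benchmark example’s prefix and check how often it reproduces the next n tokens verbatim; near-perfect continuation accuracy suggests the example leaked into training. `predict_next` is a hypothetical stand-in for a real model call.

```python
def ngram_accuracy(tokens, predict_next, n=5, starts=None):
    """Fraction of prefix positions where the model reproduces the next
    n tokens of a benchmark example exactly."""
    starts = starts if starts is not None else range(1, len(tokens) - n)
    hits, trials = 0, 0
    for s in starts:
        pred = predict_next(tokens[:s], n)  # model's n-token continuation
        hits += pred == tokens[s:s + n]
        trials += 1
    return hits / trials

# stand-in "model" that has memorized the sequence -> accuracy 1.0
seq = list(range(20))
acc = ngram_accuracy(seq, lambda prefix, n: seq[len(prefix):len(prefix) + n])
print(acc)  # 1.0
```

A model that never saw the example should score near zero on exact n-gram continuation; a suspiciously high score (or abnormally low perplexity on the benchmark) is the leakage signal.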

LLM fine-tuning / many-shot learning

When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively. Authors: various.

  • TLDR: We can fine tune LLMs to know when to use search to answer questions correctly.
  • The good: The paper tackles a problem that has clear practical uses.
  • The bad: Training data is a bottleneck. Study is limited to question answering.
  • Practical score: 3.

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report. Authors: Predibase

  • TLDR: Comprehensive study to show that fine tuning works.
  • The good: Lots of evaluations of fine tuning LLMs.
  • The bad: Questions remain around evaluation methodology (prompting, train/test selection) and potential overfitting. The paper reads like marketing content for the company.
  • Practical score: 3.


Extending Llama-3’s Context Ten-Fold Overnight. Authors: Beijing Academy of Artificial Intelligence, Renmin University.

  • TLDR: Fine tune (with LoRA) to extend a pre-trained LLM’s context.
  • The good: Positive experiments in an important topic.
  • The bad: The paper seems to be written in a hurry, with limited evaluation and no discussion of related work. It’s unclear if the fine-tuned, context-extended model actually performs better in real-world uses.
  • Practical score: 2.

LLM efficiency

Octopus v4: Graph of language models. Authors: Stanford → Nexa4AI

  • TLDR: Continued exploration of functional tokens, which began with Octopus v1. This time, the tokens are used to select the right expert for a given prompt, in a mixture-of-experts-like approach. An analogy: microservices versus monoliths.
  • The good: A new approach to designing MoE systems.
  • The bad: The paper seems to be written in a hurry, lacking in-depth analyses such as ablation.
  • Practical score: 2.

Better & Faster Large Language Models via Multi-token Prediction. Authors: Meta

  • TLDR: This is the second paper this week (after the KAN paper) to receive lots of attention online. It proposes predicting more than one token at a time, both during pre-training and at inference. Experiments show some improvement on coding tasks.
  • The good: It’s a good idea to challenge the status quo of predicting one token at a time.
  • The bad: It’s not clear why evaluation is limited to a handful of benchmarks (MBPP, HumanEval, APPS/Intro). Which benchmarks showed less positive findings?
  • Practical score: 2.
  • See also: Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge by Baichuan Inc. and Peking University.
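The mechanics are easy to picture: a shared trunk produces one hidden state per position, and n independent output heads each predict the token i steps ahead, with the losses summed. The toy NumPy sketch below is my own illustration of that loss structure (random weights, no training), not Meta’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n_heads, seq = 50, 16, 4, 10

hidden = rng.normal(size=(seq, d))            # shared trunk output per position
heads = rng.normal(size=(n_heads, d, vocab))  # one linear head per future offset
tokens = rng.integers(0, vocab, size=seq)     # the training sequence

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

loss = 0.0
for i in range(n_heads):                       # head i predicts token t + 1 + i
    logits = hidden[: seq - 1 - i] @ heads[i]  # (valid positions, vocab)
    probs = softmax(logits)
    targets = tokens[1 + i:]                   # the tokens i+1 steps ahead
    loss += -np.log(probs[np.arange(len(targets)), targets]).mean()

print(loss / n_heads)  # average cross-entropy across the n heads
```

At inference, the extra heads can either be dropped (falling back to standard next-token decoding) or used to draft several tokens at once, which is where the speedup comes from.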

LLM internals

NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment. Authors: NVIDIA

  • TLDR: Toolkit from NVIDIA to do LLM alignment.
  • The good: Useful tool for folks who need to do alignment.
  • The bad: None
  • Practical score: 3.

FLAME: Factuality-Aware Alignment for Large Language Models. Authors: U Waterloo, CMU, Meta.

  • TLDR: This paper shows how to do alignment (SFT and RL) that optimizes for factuality.
  • The good: Good insight that alignment can lead to hallucination, and there are ways to mitigate that.
  • The bad: None. It’s a well-written paper with thorough analyses.
  • Practical score: 2.

Iterative Reasoning Preference Optimization. Authors: Meta, NYU

  • TLDR: An advance in iterative preference optimization, targeting reasoning improvement. Significant improvements observed on Llama-2 on math benchmarks.
  • The good: Interesting insight to leverage chain-of-thought (CoT) prompting to generate training data.
  • The bad: No discussion of limitations and future work.
  • Practical score: 2.