Weekly paper roundup (4/22/24)

This is the 10th edition of Harmonious’ weekly paper roundup series. This past week I did not find any paper that merits the spotlight designation. Instead, I am experimenting with a brief overview of papers, organized around the following topics.

  • LLM applications: agents, chatbots, RAG, document understanding, coding, and others.
  • LLM prompting: techniques such as CoT to help us get the most out of LLMs.
  • Multimodal LLMs. This is an area where many folks expect a lot of advances in the next wave.
  • Synthetic data and other novel ways to generate training data. Despite the rise of LLMs and zero/few-shot learning, the data bottleneck is still present. It’s helpful to find creative ways to get data, not only for fine-tuning but also for prior-generation ML approaches.
  • Benchmarks and evaluations: we need to understand the strengths and limitations of LLMs.
  • LLM fine-tuning/many-shot learning. This is an important option wherever prompting is not sufficient.
  • Context: topics such as context length and limits, effective use of context, context compression, etc.
  • LLM efficiency, primarily for fine-tuning and inference. This is obviously important for real-world deployment.
  • LLM internals: how they work.
  • LLM frontier: what’s the next big leap beyond transformers? State-space models? Self-evolution?
  • LLM announcements: e.g. Llama, Phi, etc.

I created this taxonomy after reading a few hundred papers for the first 9 editions of the weekly paper roundup series. The topics are roughly ordered by relevance to practitioners (obvious caveat: this is highly subjective). I may adjust this taxonomy if necessary. Not every topic will have papers in a given week.

Let’s look at the papers for the week of April 22, 2024.

LLM use cases


How Far Can We Go with Practical Function-Level Program Repair? Authors: Southern U of Science and Technology, Shenzhen and Kwai Inc.

  • TLDR: Study of LLM-based function-level automatic program repair (APR), focusing on the effects of few-shot learning and auxiliary repair-relevant information. Proposes a function-level APR technique with a dual-LLM framework that uses this auxiliary information to improve repair performance.
  • Assessment: The good: interesting insight into using LLMs to fix bugs. The bad: unclear why GPT-4 is not included in the study, given its superior coding capabilities compared to GPT-3.5.

LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency. Authors: Nanyang Technological U, Singapore U of Technology, Alibaba.

  • TLDR: Uses LLMs to rewrite database queries for efficiency. Trains a contrastive model with curriculum learning to learn query representations and select effective query demonstrations for the LLM.
  • Assessment: The good: paper seems thorough and data/code are made available. The bad: unclear how this work can have impact outside of the niche area of database query optimization.


AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation. Authors: various institutions and companies based in China.

  • TLDR: Uses LLM agents to generate code for extracting information from webpages.
  • Assessment: The good: this is a useful task for LLMs to solve. The bad: this is not about crawlers; it’s about webpage parsers. Also, the findings seem inconclusive.

A Multimodal Automated Interpretability Agent. Authors: MIT.

  • TLDR: LLM agents as researchers doing interpretability analysis on machine learning models.
  • Assessment: The good: bold exploration pushing the boundaries of LLM agents; no job is safe from LLM agents’ encroachment. The bad: modest success; a fair amount of babysitting by human researchers is still needed.

FlowMind: Automatic Workflow Generation with LLMs. Authors: JP Morgan.

  • TLDR: LLM agents that write one-off, API-driven Python scripts for non-technical finance folks.
  • Assessment: The good: the so-called “lecture” prompting technique is useful to know. The bad: the benchmark is too easy; it’s almost already saturated.

LLM prompting


Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners. Authors: Wuhan U, U of Sydney, Nanyang Tech.

  • TLDR: A prompting technique to implore LLMs to think deeply about reasoning problems (e.g. math).
  • Assessment: The good: incremental gains over baseline. The bad: code is not shared.

Multimodal LLMs


How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. Authors: various China-based institutions and companies.

  • TLDR: The InternVL 1.5 multimodal LLM is competitive with GPT-4V and other MLLMs.
  • Assessment: The good: probably SOTA MLLM for Chinese language. Model/code is shared. The bad: no discussion about limitations or future work. WYSIWYG.

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension. Authors: Tencent, the Chinese U of Hong Kong.

  • TLDR: New benchmark for text-rich document understanding and another win for GPT-4V over Gemini Pro and Claude 3 Opus.
  • Assessment: The good: welcome benchmark addition for an important, highly practical task. The bad: none.

Long context

LongEmbed: Extending Embedding Models for Long Context Retrieval. Authors: Peking U and Microsoft.

  • TLDR: Extend context length of pre-trained short-context embedders instead of training long-context ones from scratch + a new benchmark for long-context tasks.
  • Assessment: The good: looks like it can be done effectively, and the benchmark seems to be well designed. The bad: the analysis covers only training-free techniques, and there’s no baseline for long-context embedders trained from scratch.

SnapKV: LLM Knows What You are Looking for Before Generation. Authors: UIUC, Cohere, Princeton.

  • TLDR: Clever idea to compress KV cache to speed up long-context processing.
  • Assessment: The good: 3.6X faster generation and 8.2X smaller memory footprint. The bad: none.

LLM efficiency

How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study. Authors: various China-based institutions.

  • TLDR: Significant degradation observed, especially with ultra-low bit-width.
  • Assessment: The good: useful study + code is shared. The bad: none.
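
To get a feel for why ultra-low bit-widths hurt, here is a minimal sketch of round-to-nearest symmetric uniform quantization applied to synthetic Gaussian "weights". This is a toy illustration, not the paper's experimental setup: the function names, the single per-tensor scale, and the random data are all my assumptions.

```python
import random


def quantize_dequantize(weights, bits):
    """Round-to-nearest symmetric uniform quantization, then map back to floats."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax              # one scale for the whole tensor (assumption)
    out = []
    for w in weights:
        q = round(w / scale)            # integer code
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        out.append(q * scale)
    return out


def mean_abs_error(weights, bits):
    """Average absolute difference between original and quantized values."""
    restored = quantize_dequantize(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


rng = random.Random(0)
# Synthetic stand-in data, not real LLaMA3 weights.
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

With fewer bits there are fewer representable levels, so the rounding error per weight grows quickly. Real quantization methods use finer-grained scales and calibration to soften this, so treat the numbers here as intuition only.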

XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference. Authors: ServiceNow, MILA.

  • TLDR: Highly technical idea to significantly reduce the memory footprint of the KV cache.
  • Assessment: The good: improvement seems impressive, with real-world implications. The bad: eval is done on QA benchmarks only.

LLM analysis

Retrieval Head Mechanistically Explains Long-Context Factuality. Authors: Peking U, U of Washington, MIT, UIUC, U of Edinburgh.

  • TLDR: A dedicated set of attention heads is responsible for retrieving information from long context.
  • Assessment: The good: fascinating insight with potential impact across many high-level LLM tasks. The bad: none.

LLM frontier

A Survey on Self-Evolution of Large Language Models. Authors: various China-based institutions.

  • TLDR: Survey of how LLMs may create a self-training loop to continuously improve themselves.
  • Assessment: The good: a peek into potentially the next big breakthrough. The bad: none.

LLM releases

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework. Authors: Apple.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Authors: Microsoft.