Weekly paper roundup (4/22/24)

vha14 · May 2, 2024, 5:15am

This is the 10th edition of Harmonious’ weekly paper roundup series. This past week I did not find any paper that merits the spotlight designation, so I am experimenting with a brief overview of papers organized around the following topics.

LLM applications: agents, chatbots, RAG, document understanding, coding, and others.
LLM prompting: techniques such as CoT to help us get the most out of LLMs.
Multimodal LLMs. This is an area where many folks expect a lot of advances in the next wave.
Synthetic data and other novel ways to generate training data. Despite the rise of LLMs and zero/few-shot learning, the data bottleneck is still present. It’s helpful to find creative ways to get data, not only for fine-tuning but also for prior-generation ML approaches.
Benchmarks and evaluations: we need to understand the strengths and limitations of LLMs.
LLM fine-tuning/many-shot learning. This is an important option wherever prompting is not sufficient.
Context: topics such as context length and limits, effective use of context, context compression, etc.
LLM efficiency, primarily for fine-tuning and inference. This is obviously important for real-world deployment.
LLM internals: how they work.
LLM frontier: what’s the next big leap beyond transformers? State-space models? Self-evolution?
LLM announcements: e.g. Llama, Phi, etc.

I created this taxonomy after reading a few hundred papers for the first 9 editions of the weekly paper roundup series. The topics are ordered roughly by relevance to practitioners (obvious caveat: this is highly subjective). I may adjust the taxonomy if necessary. Not every topic will have papers in a given week.

Let’s look at the papers for the week of April 22, 2024.

LLM use cases

LLM-code

How Far Can We Go with Practical Function-Level Program Repair? Authors: Southern U of Science and Technology, Shenzhen and Kwai Inc.

TLDR: Study of LLM-based function-level automatic program repair (APR), focusing on few-shot learning and auxiliary repair-relevant information. Proposes a function-level APR technique that uses a dual-LLM framework to exploit the auxiliary repair-relevant information and improve repair performance.
Assessment: The good: interesting insight into using LLMs to fix bugs. The bad: unclear why GPT-4 is not included in the study, given its superior code capabilities compared to GPT-3.5.

LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency. Authors: Nanyang Technological U, Singapore U of Technology, Alibaba.

TLDR: Uses LLMs to rewrite database queries for efficiency. Trains a contrastive model with curriculum learning to learn query representations and select effective query demonstrations for the LLM.
Assessment: The good: paper seems thorough and data/code is made available. The bad: unclear how this work can have impact outside of the niche area of database query optimization.

LLM-agents

AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation. Authors: various institutions and companies based in China.

TLDR: Uses LLM agents to generate code for extracting information from webpages.
Assessment: The good: this is a useful task for LLMs to solve. The bad: this is not about crawlers; it’s about webpage parsers. Also, the findings seem inconclusive.

A Multimodal Automated Interpretability Agent. Authors: MIT.

TLDR: LLM agents as researchers doing interpretability analysis on machine learning models.
Assessment: The good: bold exploration pushing the boundaries of LLM agents; no job is safe from LLM agents’ encroachment. The bad: modest success; a fair amount of real researchers’ babysitting is still needed.

FlowMind: Automatic Workflow Generation with LLMs. Authors: JP Morgan.

TLDR: LLM agents to write one-off API-driven python scripts for non-technical finance folks.
Assessment: The good: the so-called “lecture” prompting technique is useful to know. The bad: the benchmark is too easy; it’s already close to saturated.

Prompting

Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners. Authors: Wuhan U, U of Sydney, Nanyang Tech.

TLDR: A prompting technique that pushes LLMs to think deeply about reasoning problems (e.g. math).
Assessment: The good: incremental gains over baseline. The bad: code is not shared.

Multimodal

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. Authors: various China-based institutions and companies.

TLDR: The InternVL 1.5 multimodal LLM is competitive with GPT-4V and other MLLMs.
Assessment: The good: probably SOTA MLLM for Chinese language. Model/code is shared. The bad: no discussion about limitations or future work. WYSIWYG.

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension. Authors: Tencent, the Chinese U of Hong Kong.

TLDR: New benchmark for text-rich document understanding and another win for GPT-4V over Gemini Pro and Claude 3 Opus.
Assessment: The good: welcome benchmark addition for an important, highly practical task. The bad: none.

Long context

LongEmbed: Extending Embedding Models for Long Context Retrieval. Authors: Peking U and Microsoft.

TLDR: Extend context length of pre-trained short-context embedders instead of training long-context ones from scratch + a new benchmark for long-context tasks.
Assessment: The good: looks like it can be done effectively, and the benchmark seems to be well designed. The bad: the analysis covers only training-free techniques, and there’s no baseline for long-context embedders trained from scratch.

SnapKV: LLM Knows What You are Looking for Before Generation. Authors: UIUC, Cohere, Princeton.

TLDR: Clever idea to compress KV cache to speed up long-context processing.
Assessment: The good: 3.6X faster generation and 8.2X smaller memory footprint. The bad: none.

LLM efficiency

How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study. Authors: various China-based institutions.

TLDR: Significant degradation observed, especially with ultra-low bit-width.
Assessment: The good: useful study + code is shared. The bad: none.
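To get a feel for why accuracy tends to fall off at very low bit-widths, here is a minimal sketch of symmetric round-to-nearest quantization on toy Gaussian "weights". This is an illustration only, not the paper's code or necessarily any of the methods it evaluates; the helper names `quantize_rtn` and `mean_abs_error` and the toy data are my own assumptions.

```python
import random

def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization to `bits` bits, then dequantize.

    Hypothetical helper for illustration; requires bits >= 2.
    """
    qmax = 2 ** (bits - 1) - 1                        # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax       # one scale for the whole list
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Toy stand-in for a weight tensor; real LLM weights are not exactly Gaussian.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: mean_abs_error(weights, quantize_rtn(weights, bits))
          for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

With a single scale per tensor, each bit removed roughly halves the number of representable levels, so the reconstruction error grows quickly as you go from 8 to 2 bits. Practical methods use per-group scales and calibration to soften this, which this sketch deliberately leaves out.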

XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference. Authors: ServiceNow, MILA.

TLDR: Highly technical idea to significantly reduce the memory footprint of the KV cache.
Assessment: The good: improvement seems impressive with real world implications. The bad: eval is done on QA benchmarks only.

LLM analysis

Retrieval Head Mechanistically Explains Long-Context Factuality. Authors: Peking U, U of Washington, MIT, UIUC, U of Edinburgh.

TLDR: For retrieval from long context, there is a specific set of attention heads that does this job.
Assessment: The good: fascinating insight with potential impact across many high-level LLM tasks. The bad: none.

LLM frontier

A Survey on Self-Evolution of Large Language Models. Authors: various China-based institutions.

TLDR: Survey of how LLMs may create a self-training loop to continuously improve themselves.
Assessment: The good: a peek into potentially the next big breakthrough. The bad: none.

LLM releases

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework. Authors: Apple.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Authors: Microsoft.
