Weekly paper roundup: TnT-LLM: Text mining at scale with LLMs (3/18/24)


TnT-LLM: Text Mining at Scale with Large Language Models

Authors: Microsoft


This is a novel application of LLMs to text categorization in which both the taxonomy and an efficient, lightweight categorizer are built by LLMs in an agentic, deliberate, and automated fashion. This is a welcome line of work that has the potential to significantly level up text categorization across many use cases, putting a powerful tool into the hands of non-technical users.


Transforming unstructured text into structured and meaningful forms, organized by a comprehensive and coherent taxonomy, is a fundamental step in text mining for downstream tasks. Applications are numerous, from gathering insights from customer feedback for product prioritization to analyzing trends in search logs. I lived this problem and attempted various solutions while working at Bing in the 2000s.

Before LLMs, this problem was challenging on two fronts: taxonomy creation and categorizer training. Both required significant manual effort from domain-expert taxonomists and data labelers. Alternative approaches based on clustering or topic modeling lack interpretability. In addition, it is challenging to maintain such a solution in the face of emerging topics. LLMs have the potential to provide an effective solution, and this paper offers a few ideas on how to do it:

  • Prompt LLMs to generate and refine a taxonomy from data.
  • Use an LLM to generate (pseudo) labels assigning texts to categories in the taxonomy.
  • Train lightweight (e.g. logistic regression) supervised categorizers using the above labels so they can be deployed to process high-volume data with good performance, high throughput, and low cost. Interestingly, these lightweight classifiers can match or even exceed the accuracy of the LLMs used to label the data.
  • All of these steps could be implemented using LLM agents, almost entirely automatically. This is compelling because a non-technical person could then categorize a text corpus through a simple point-and-click or prompt-based interface.
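The pipeline above can be sketched in a few dozen lines of plain Python. To be clear, this is a minimal illustration under my own assumptions, not the authors' implementation: `fake_llm_label` is a hypothetical stub standing in for a real LLM labeling call, and the lightweight categorizer is a nearest-centroid bag-of-words model rather than the paper's logistic regression, to keep the sketch dependency-free.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stub for the LLM pseudo-labeling step: a real system would
# prompt an LLM with the taxonomy and the text. Here, a keyword lookup.
def fake_llm_label(text, taxonomy):
    for label, keywords in taxonomy.items():
        if any(kw in text.lower() for kw in keywords):
            return label
    return "other"

def featurize(text):
    # Bag-of-words features; a production system might use embeddings instead.
    return Counter(text.lower().split())

def train_centroid(examples):
    # "Train" a lightweight categorizer: one term-frequency centroid per label.
    centroids = defaultdict(Counter)
    for feats, label in examples:
        centroids[label].update(feats)
    return centroids

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def predict(centroids, feats):
    return max(centroids, key=lambda label: cosine(feats, centroids[label]))

# Taxonomy (in the paper, this would itself be LLM-generated and refined).
taxonomy = {
    "billing": ["invoice", "charge", "refund"],
    "login": ["password", "login"],
}
corpus = [
    "I was charged twice on my invoice",
    "cannot reset my password",
    "the login page keeps erroring",
    "need a refund for last month's invoice",
]
# Pseudo-label the corpus, then fit the cheap classifier on those labels.
examples = [(featurize(t), fake_llm_label(t, taxonomy)) for t in corpus]
centroids = train_centroid(examples)
```

Once trained, `predict` runs without any LLM call, which is what makes high-volume, low-cost deployment possible.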

There are two things I would like to see from the authors as a follow-up: a) sharing their implementation and some datasets in limited form, and b) discussing the challenge of ongoing maintenance of such solutions as new topics emerge.

Noteworthy papers

RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners. Authors: Northeastern University, Alibaba Group, NiuTrans Research.

  • This work is along the lines of incremental improvement over CoT prompting with an agentic approach. The key ideas are a) generate multiple CoT plans, then b) compare them step by step using LLMs. The result: up to 13% improvement over baseline CoT. No code is shared, but it should not be difficult to implement this technique and check your mileage.
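The step-by-step comparison idea can be sketched as below. This is my own minimal rendering, not the paper's code: `fake_llm_compare` is a hypothetical stub in place of a real LLM judge (here it simply prefers the more detailed step, which is not the paper's criterion).

```python
# Hypothetical stub judge: a real RankPrompt-style system would prompt an LLM
# to compare two reasoning steps in context. Here: prefer the longer step.
def fake_llm_compare(step_a, step_b):
    if len(step_a) > len(step_b):
        return "a"
    if len(step_b) > len(step_a):
        return "b"
    return "tie"

def rank_chains(chains):
    """Compare candidate chains-of-thought pairwise, one step at a time;
    each chain earns a point for every step the judge prefers."""
    scores = [0] * len(chains)
    for i in range(len(chains)):
        for j in range(i + 1, len(chains)):
            for sa, sb in zip(chains[i], chains[j]):
                verdict = fake_llm_compare(sa, sb)
                if verdict == "a":
                    scores[i] += 1
                elif verdict == "b":
                    scores[j] += 1
    best = max(range(len(chains)), key=lambda k: scores[k])
    return chains[best], scores

# Three candidate reasoning paths for "What is (17 + 5) * 2?"
chains = [
    ["17 + 5 = 22", "22 * 2 = 44"],
    ["Add 17 and 5 to get 22", "Double 22 to get 44"],
    ["22", "44"],
]
best, scores = rank_chains(chains)
```

Swapping the stub for an actual LLM call is the only structural change a real implementation would need.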

Evolutionary Optimization of Model Merging Recipes. Authors: Sakana AI.

  • Uses evolutionary algorithms to merge open-source LLMs into specialized models, such as a Japanese LLM with math reasoning capabilities, achieving state-of-the-art performance on benchmarks without large amounts of GPU compute. Let the Cambrian explosion begin! The authors release two models and evaluation code, but not the merging code. A key limitation they note in their studies concerns instruction tuning and alignment.
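As a toy illustration of the core idea (not Sakana's method, which operates on real model checkpoints), one can evolve per-layer interpolation weights between two weight vectors against a fitness function. Everything here is a hypothetical stand-in: the "models" are short lists of floats and the fitness simply rewards closeness to a target vector, where a real recipe would score benchmark accuracy.

```python
import random

def merge(wa, wb, alphas):
    # Per-"layer" linear interpolation of two models' weights.
    return [a * x + (1 - a) * y for a, x, y in zip(alphas, wa, wb)]

def evolve(wa, wb, fitness, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.random() for _ in wa] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda al: fitness(merge(wa, wb, al)), reverse=True)
        survivors = population[: pop_size // 2]  # elitist truncation selection
        children = [
            [min(1.0, max(0.0, a + rng.gauss(0, 0.1))) for a in parent]
            for parent in survivors
        ]
        population = survivors + children
    return max(population, key=lambda al: fitness(merge(wa, wb, al)))

# Toy "models" and a stub fitness (a real recipe would evaluate benchmarks).
wa, wb = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
target = [1.0, 0.0, 0.5]
fitness = lambda w: -sum((x - t) ** 2 for x, t in zip(w, target))
best_alphas = evolve(wa, wb, fitness)
merged = merge(wa, wb, best_alphas)
```

Because selection is elitist, the best merged candidate's fitness improves monotonically across generations.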

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. Authors: Tsinghua University, Microsoft Corporation.

  • The paper introduces a data distillation approach for task-agnostic prompt compression that maintains essential information while reducing prompt size. Key idea: treat prompt compression as token classification (relevant/not relevant). The authors also share a benchmark dataset. It’s already integrated into LlamaIndex and LangChain, so it’s worth checking out.
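The token-classification framing is easy to picture with a stub scorer in place of the trained classifier. LLMLingua-2 trains an encoder on keep/drop labels distilled from GPT-4; the heuristic below (stopwords score low, longer content words score higher) is purely my hypothetical stand-in for it.

```python
import math

STOPWORDS = {"the", "a", "an", "of", "to", "for", "please", "is", "are"}

def token_scores(tokens):
    # Stub for a trained token classifier's keep-probability:
    # stopwords score zero; longer content words score higher.
    return [0.0 if t.lower() in STOPWORDS else min(1.0, len(t) / 10) for t in tokens]

def compress(prompt, rate=0.5):
    # Keep the top-scoring ceil(rate * n) tokens, preserving original order.
    tokens = prompt.split()
    scores = token_scores(tokens)
    k = max(1, math.ceil(rate * len(tokens)))
    keep = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

compressed = compress(
    "Please summarize the quarterly revenue report for the board meeting", rate=0.5
)
```

The compression rate is enforced by the top-k cut, so the same machinery works for any target budget.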

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding. Authors: Alibaba Group, Renmin University of China.

  • This paper proposes Unified Structure Learning to enhance multimodal large language models (MLLMs) for visual document understanding, introducing a vision-to-text module and a comprehensive training set that improve structure recognition in text-rich images. It achieves state-of-the-art performance on multiple benchmarks. There are numerous use cases for extracting information from documents, so this work is worth checking out. Datasets and models are shared.

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models. Authors: University of Science and Technology of China and Shanghai AI Laboratory.

  • There is a gap between open-source LLMs (e.g. the Llama family) and closed ones (GPTs, Claude) in agent applications. This paper examines potential causes and proposes a fix that improves Llama2-7B by 3.5%. Code is shared.