Weekly paper roundup: dialog state tracking through function calling (2/19/24)

Paper spotlight

Large Language Models as Zero-shot Dialogue State Tracker through Function Calling

Authors: researchers from UC Santa Barbara, CMU, and Meta AI.


This paper introduces the idea of using LLM function calling to track key information and maintain memory in task-oriented chatbots. While demonstrating impressive new SOTA performance on a few benchmarks, the paper also suggests that LLMs will likely complement but not completely replace non-LLM techniques in the near term.


Open-ended chatbots such as ChatGPT can converse with users over a wide range of topics. In the vast majority of real-world use cases, however, chatbots are deployed to serve specific goals, helping users accomplish specific tasks such as customer service and support, internal business operations, etc. Such chatbots are called task-oriented, and they have traditionally been costly and complex to build, requiring work on intent understanding, dialog management, training data curation, and (non-LLM) models. It is natural to ask whether LLMs can speed things up here, especially since task-oriented chatbots have traditionally been considered an easier problem than open-ended ones.

The answer suggested by this paper seems to be a qualified no, at least for now. A key challenge for task-oriented chatbots is keeping an accurate memory of information collected from the user over a multi-turn dialog. This is called the dialog state tracking (DST) problem. Forgetting or mixing up user inputs is typically unacceptable in practice; imagine a chatbot that books a flight to the wrong destination. The paper's main idea is to use LLM function calling for single-turn DST, which is often referred to as slot filling in the dialog research literature; the paper includes a worked example of function calling as slot filling.
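To make the idea concrete, here is a minimal sketch of the general pattern, not the authors' code: a function (tool) schema declares the slots for a domain, the LLM's function-call arguments are treated as the slots extracted from the latest user turn, and those arguments are merged into a running dialog state. The function name, slot names, and the simulated model replies below are all illustrative assumptions.

```python
import json

# Illustrative tool schema for a hotel-booking domain (names are made up).
# In practice this schema would be sent to the LLM with the dialog history,
# and the model would respond with a function call whose JSON arguments
# are the slot values it extracted from the latest user turn.
BOOK_HOTEL_SCHEMA = {
    "name": "book_hotel",
    "description": "Track the user's hotel-booking constraints.",
    "parameters": {
        "type": "object",
        "properties": {
            "area": {"type": "string", "description": "Part of town"},
            "price_range": {"type": "string",
                            "enum": ["cheap", "moderate", "expensive"]},
            "stay_nights": {"type": "integer"},
        },
    },
}

def update_dialog_state(state: dict, function_call_args: str) -> dict:
    """Merge the slot values from one (simulated) model function call
    into the running dialog state; later turns override earlier ones."""
    new_slots = json.loads(function_call_args)
    return {**state, **new_slots}

# Simulated model outputs for two user turns.
state = {}
state = update_dialog_state(state, '{"area": "centre", "price_range": "cheap"}')
state = update_dialog_state(state, '{"stay_nights": 3, "price_range": "moderate"}')
print(state)  # the second turn revises price_range and adds stay_nights
```

The accumulate-and-override merge is what turns single-turn slot filling into multi-turn state tracking: each turn's function call only needs to cover what the user just said.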

Experiments show that this approach achieves new SOTA performance on a number of DST benchmarks such as Attraction, Hotel, Restaurant, Taxi, and Train. But as pointed out by the authors in the section on limitations, this performance is still far from the threshold required by real-world use. The main reported metric is Joint Goal Accuracy (JGA), and the new SOTA, achieved by GPT-4, is only 62.6%. The authors also note that delexicalization is used in this study, adding a further caveat to practical considerations.
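JGA is a strict metric: a turn counts as correct only if the entire predicted dialog state matches the gold state exactly, so a single wrong slot fails the whole turn. A small sketch with toy data (not from the paper) shows why 62.6% leaves substantial headroom:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state matches the gold
    state exactly; any single slot error fails the whole turn."""
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states)
                  if pred == gold)
    return correct / len(gold_states)

gold = [
    {"area": "centre"},
    {"area": "centre", "price_range": "cheap"},
]
pred = [
    {"area": "centre"},
    {"area": "centre", "price_range": "expensive"},  # one bad slot, turn fails
]
print(joint_goal_accuracy(pred, gold))  # 0.5
```

Because errors compound over turns (a slot mis-filled early stays wrong in every later state), JGA tends to drop sharply on long dialogs, which is part of why the practical bar is so high.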

The authors have promised to release the code and models behind this work.


We definitely run into this a lot with Ollie.ai. I imagine this gets a lot better with GPT-5, which is expected to have superior reasoning ability.


Thanks for your interest in our paper. We have officially released the code at https://github.com/facebookresearch/FnCTOD (paper: https://arxiv.org/abs/2402.10466). Feel free to explore and play with it.

Thank you @Zekun_Li for letting us know. Appreciate it.