Weekly paper roundup: BLINK: multimodal LLMs can see but not perceive (4/15/2024)


BLINK: Multimodal Large Language Models Can See but Not Perceive

Author affiliations: UPenn, U Washington, AI2, UC Davis, Columbia

Project page: BLINK


BLINK is a benchmark of 14 visual perception tasks that humans can solve “within a blink” but that pose significant challenges for current multimodal LLMs, because they resist mediation through natural language (e.g., dense captioning). While humans reach 96% accuracy, the best-performing models, GPT-4V, Gemini Pro, and Claude Opus, achieve accuracies of 51%, 45%, and 43% respectively, not much better than random guessing (38%). This suggests that such perception abilities have not yet “emerged” in recent multimodal LLMs. Notably, on certain tasks some multimodal LLMs even underperform random guessing, while specialist CV models solve these problems much better.


What are the current gaps between humans and AI on visual perception tasks? This has been a fascinating topic for AI researchers since the beginning of the field. Let’s briefly recap a few interesting milestones.

In the 1980s, Hans Moravec, Rodney Brooks, and Marvin Minsky articulated a principle known as Moravec’s paradox: reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. At the time, it was easier to build a chess-playing program with adult-level performance than to replicate the visual perception capabilities of a toddler. One proposed explanation is based on evolution: humans acquired a wide range of perception capabilities over billions of years of evolution, and we perform these skills subconsciously and effortlessly while computers struggle. In contrast, abstract thought is a new trick, less than 100 thousand years old. It is hard for us but easy for computers.

Fast forward three decades to the late 2010s, when deep learning had started to show its potential (AlexNet, speech recognition, etc.) but before GPT-3 and ChatGPT: leading AI researchers such as Yoshua Bengio and Andrew Ng expressed optimism that we were close to solving the perception problem in AI. Andrew Ng wrote in 2016 that “If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future”. In his NeurIPS 2019 keynote, Yoshua Bengio said: “System 1 are the kinds of things that we do intuitively, unconsciously, that we can’t explain verbally. This is what current deep learning is good at.” Bengio also talked about what deep learning was not yet good at: System 2, which is about learning with limited data, the ability to reason, and complex problem solving such as programming. (It is interesting to look back and note that in 2019 Bengio seemed unaware of the impending GPT advances, where concepts such as zero/few-shot prompting, LLM agents, and GenAI for code would suddenly explode onto the scene.)

On the surface this may sound confusing, as it contradicts Moravec’s paradox, which says that System 1 is hard and System 2 is easy. There is, however, a fairly simple explanation: both systems can be easy for specialized models and hard for general-purpose ones such as LLMs. Take visual correspondence, a System 1 type visual perception problem, as an example: this paper reports that the top multimodal LLMs, GPT-4V and Gemini Pro, score 37.21, while the specialized model DIFT scores 96.51, approaching human performance. That is an astonishing gap of almost 60 points in favor of the specialist model. The story is similar for System 2 tasks; examples that come to mind include chess playing (Deep Blue versus LLMs) and classical planning benchmarks such as blocks world.

In today’s environment, with so much focus on LLMs (see the announcements of Llama 3 and Phi-3 just this week), we tend to associate cutting-edge AI capabilities with these models and lose sight of the old adage about the jack of all trades being master of none. This has important implications for real-world use of AI. I believe that:

When faced with a particular technical problem that AI could help solve, practitioners should look beyond LLMs and find the specialist tools that are best for the job. Today a complete AI system typically deploys a collection of models (and tools such as calculators), some of which are specialized old-school models, working together under an intricate orchestration. The hardest and most interesting technical problems cannot be solved today by techniques such as prompt engineering and retrieval-augmented generation alone.

When AGI arrives, as many have predicted it will in several years or decades, we will probably be looking at a general-purpose system that performs at or near the level of specialist models on both System 1 and System 2 tasks. It is interesting to ask whether AGI will come in the form of a single all-powerful model or a collection of specialized models, some of which can be activated on demand, à la Trinity’s helicopter-flying skill in the movie The Matrix. The paper briefly discusses the idea of distilling specialist models such as DIFT into multimodal LLMs, following the former, centralized approach. I personally believe the decentralized approach is more likely. The current mixture-of-experts technique exemplified by models such as Mixtral is a baby step toward such an architecture.

In any case, this paper highlights the hard work left on the path to AGI. It will take a lot more than getting MMLU scores into the high 90s.

Some interesting tidbits from the paper:

  • We also experiment with specialist vision models and find that they perform much better than multimodal LLMs. For example, the specialist outperforms GPT-4V by 62.8% on visual correspondence estimation, 38.7% on relative depth estimation, and 34.6% on multi-view reasoning, in terms of absolute accuracy.
  • We observe that multimodal LLMs perform relatively better on spatial reasoning, art style, and counting tasks, in which they are much better than random guessing.
  • Notably, for certain tasks such as jigsaw, semantic correspondence, multiview reasoning, object localization, and relative reflectance, some multimodal LLMs even underperform compared to random guessing.
  • Models also demonstrate some capability in relative depth and forensics detection.
    Overall, they are doing relatively well on mid-level perception tasks. In terms of granularity, the models in general perform better on image-level tasks and struggle on pixel-level and crop-level tasks.
  • GPT-4V is much better in visual similarity, art style, jigsaw, and multi-view reasoning. Specifically, its performance on visual similarity is 29% better than Gemini Pro, demonstrating that GPT-4V possesses a nuanced understanding of visual patterns and aesthetics that is similar to humans. In contrast, Gemini Pro and LLaVA have similar performance patterns.

Noteworthy papers

[2404.11018] Many-Shot In-Context Learning (arxiv.org). Authors: Google DeepMind.

  • TLDR: An interesting exploration of using increasing context lengths, e.g. in Gemini 1.5, to set up fine-tuning-free learning with many training data points. Fine-tuned models are specialists, whereas many-shot ICL models remain generalists. The paper, however, did not include a direct comparison between the two.
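The mechanics of many-shot ICL are simple: instead of fine-tuning, you concatenate hundreds or thousands of labeled examples directly into a long-context prompt. A minimal sketch, where the `build_prompt` helper, the instruction, and the toy sentiment examples are all hypothetical stand-ins for a real task and dataset:

```python
# Sketch of many-shot prompt assembly. With a long-context model such as
# Gemini 1.5, the number of shots can reach the hundreds or thousands.

def build_prompt(examples, query, instruction="Classify the sentiment."):
    """Concatenate many in-context examples, then append the query."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

# 1,000 toy shots; a real setup would use distinct training examples.
examples = [("great movie", "positive"), ("terrible plot", "negative")] * 500
prompt = build_prompt(examples, "loved the soundtrack")
```

The prompt then goes to the model as-is; no gradient updates are involved, which is what keeps the base model a generalist.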

[2404.08189] Reducing hallucination in structured outputs via Retrieval-Augmented Generation (arxiv.org). Authors: ServiceNow.

  • Case study of GenAI for workflow generation, where workflows are described as structured JSON documents. The key idea is to train and use a small encoder-based retriever that maps natural-language queries to JSON objects representing valid workflow steps. These objects are then inserted into the LLM prompt to help reduce hallucination.
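The retrieve-then-prompt idea can be sketched as follows. Note the step catalog, step names, and word-overlap scoring below are toy stand-ins of my own; the paper trains a small encoder-based retriever rather than using lexical overlap:

```python
# Sketch: retrieve valid workflow-step JSON objects for a query and splice
# them into the LLM prompt, so the model copies real step names instead of
# hallucinating them. Catalog and scoring are simplified illustrations.
import json

STEP_CATALOG = [
    {"name": "create_ticket", "desc": "create a new incident ticket"},
    {"name": "send_email", "desc": "send an email notification"},
    {"name": "assign_user", "desc": "assign a ticket to a user"},
]

def retrieve_steps(query, k=2):
    """Rank catalog steps by naive word overlap with the query
    (a stand-in for the paper's learned encoder-based retriever)."""
    q = set(query.lower().split())
    scored = sorted(STEP_CATALOG,
                    key=lambda s: -len(q & set(s["desc"].split())))
    return scored[:k]

def build_prompt(query):
    """Insert the retrieved step objects into the generation prompt."""
    context = "\n".join(json.dumps(s) for s in retrieve_steps(query))
    return f"Valid steps:\n{context}\n\nGenerate a workflow (JSON) for: {query}"

prompt = build_prompt("create a ticket and assign it to a user")
```

Constraining the model to steps that actually exist in the catalog is what reduces hallucinated field names and step types in the generated JSON.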