Weekly paper roundup: InternVL 2.5 (12/9/2024)

Overview

The papers collectively explore advancements in multimodal language models, video generation techniques, memory evaluation in reinforcement learning, and improvements in STEM-focused and code-related language models. Notably, InternVL 2.5 and InternLM-XComposer2.5-OmniLive aim to push the boundaries of open-source multimodal AI systems through enhanced scaling strategies and real-time interaction capabilities. The research highlights the significance of effective data generation and training methodologies, as demonstrated by phi-4’s advancements in STEM question answering and CodeArena’s alignment of code language models with human preferences. In video generation, STIV offers a scalable approach that uses Diffusion Transformers to integrate text and image conditions, achieving strong results across established benchmarks. Finally, ProcessBench contributes a benchmark for detecting errors in mathematical reasoning, supporting research on oversight of model reasoning and on reasoning assessment more broadly.

Spotlight 🔦

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Shanghai AI Laboratory; SenseTime Research; Tsinghua University; Nanjing University; Fudan University; The Chinese University of Hong Kong; Shanghai Jiao Tong University

      🤗   103

This paper introduces InternVL 2.5, an upgrade of the open-source InternVL multimodal model series that improves performance through refined training and testing strategies and higher-quality data. The authors systematically study how model, data, and test-time scaling affect performance and show that the resulting models excel across a wide range of benchmarks, notably surpassing the 70% mark on MMMU. The work highlights the potential of open-source multimodal models to compete with, and at times surpass, prominent commercial counterparts such as GPT-4o. Overall, the paper points to promising directions for advancing open-source AI systems in multimodal learning.
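One common form of test-time scaling is self-consistency: sample several chain-of-thought completions and take a majority vote over the final answers. The sketch below illustrates the idea in a few lines; `generate_answer`, the sample count, and the temperature are hypothetical placeholders rather than the paper's actual inference setup.

```python
from collections import Counter

def majority_vote_answer(question, generate_answer, n_samples=8, temperature=0.7):
    """Test-time scaling via self-consistency: sample several chain-of-thought
    completions and keep the most common final answer.

    `generate_answer` is a hypothetical callable that runs the model once with
    chain-of-thought prompting and returns its final answer string; it stands
    in for whatever inference API is actually used.
    """
    answers = [generate_answer(question, temperature=temperature) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples  # winning answer and its vote share
```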

Raw notes: r


Other papers

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong; Fudan University; University of Science and Technology of China; Tsinghua University; Beihang University; SenseTime Group

      🤗   82

This paper introduces IXC2.5-OL, a comprehensive multimodal system for long-term streaming video and audio interactions. To tackle the limitations of existing multimodal large language models in continuous settings, it disentangles the pipeline into a streaming perception module, a multimodal long-term memory module that compresses fine-grained observations into compact memories, and a reasoning module that answers queries over what is retrieved, mimicking human-like cognitive processing. I find this approach promising as it could significantly improve the adaptability and responsiveness of systems in real-world applications.
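As a conceptual illustration of that three-module design, here is a minimal sketch of a streaming loop in which a perception step pushes encoded observations into short-term memory, a memory step periodically compresses them into long-term memory, and a reasoning step answers queries over retrieved context. Every class and method name below is a hypothetical placeholder, not the IXC2.5-OL API.

```python
class OmniLiveSketch:
    """Toy sketch of a perceive / memorize / reason loop (hypothetical names)."""

    def __init__(self, perceiver, memory, reasoner, compress_every=64):
        self.perceiver = perceiver          # encodes incoming audio/video chunks
        self.memory = memory                # holds short- and long-term stores
        self.reasoner = reasoner            # answers queries over retrieved memory
        self.compress_every = compress_every
        self.step = 0

    def on_chunk(self, chunk):
        """Called for every incoming streaming chunk (frame window or audio clip)."""
        features = self.perceiver.encode(chunk)
        self.memory.append_short_term(features)
        self.step += 1
        if self.step % self.compress_every == 0:
            # Periodically fold short-term detail into compact long-term memory.
            self.memory.compress_to_long_term()

    def on_query(self, question):
        """Answer a user query using memory retrieved for that question."""
        context = self.memory.retrieve(question)
        return self.reasoner.answer(question, context)
```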

Raw notes: r


Phi-4 Technical Report

Microsoft Research

      🤗   76

This paper presents phi-4, a 14-billion-parameter language model designed to excel at STEM-focused question answering. By blending carefully generated synthetic data with curated organic data during training, phi-4 surpasses its teacher model, GPT-4, on STEM-focused QA despite keeping an architecture close to its phi-3 predecessor. I found it intriguing how the report highlights the power of improved data quality and training methodology in pushing the boundaries of language model performance.

Raw notes: r


Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

AIRI, Moscow, Russia; MIPT, Dolgoprudny, Russia; Chandar Research Lab; Mila – Quebec AI Institute; Polytechnique Montréal

      🤗   68

This paper dives into the intricate world of memory in Reinforcement Learning agents, providing valuable clarity on types and evaluation methods. By proposing a standardized framework, the authors address the critical need for consistent assessment of memory capabilities in RL tasks. I appreciate the empirical evidence that underscores how following this structured approach can prevent misleading conclusions about an agent’s memory performance.

Raw notes: r


STIV: Scalable Text and Image Conditioned Video Generation

Apple; University of California, Los Angeles

      🤗   65

This paper presents STIV, an innovative approach for generating videos conditioned on both text and images, with an emphasis on scalability and simplicity. By incorporating image conditions into a Diffusion Transformer and applying joint image-text classifier-free guidance, STIV shows impressive results across different benchmarks. I found the framework to offer promising strategies and insights that could propel future developments in video generation technology.
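For readers less familiar with classifier-free guidance, the sketch below shows one way a jointly image- and text-conditioned variant can combine conditional and unconditional predictions at each sampling step. The `denoiser` signature, the null-condition handling, and the single guidance scale are illustrative assumptions, not the exact STIV formulation.

```python
def jit_cfg_step(denoiser, z_t, t, text_cond, image_cond, null_text, null_image, scale=7.5):
    """One denoising step with joint image-text classifier-free guidance.

    `denoiser` is a hypothetical model call that predicts the diffusion target
    (noise or velocity) given the latent z_t, timestep t, and the conditioning.
    During training both conditions are randomly dropped so the model also
    learns a meaningful unconditional prediction.
    """
    cond = denoiser(z_t, t, text=text_cond, image=image_cond)    # fully conditioned prediction
    uncond = denoiser(z_t, t, text=null_text, image=null_image)  # both conditions dropped
    return uncond + scale * (cond - uncond)                      # push the sample toward the conditions
```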

Raw notes: r


ProcessBench: Identifying Process Errors in Mathematical Reasoning

Qwen Team, Alibaba Inc.

      🤗   58

This paper presents ProcessBench, a benchmark for evaluating whether language models can identify the earliest erroneous step in step-by-step mathematical solutions. It shows that existing process reward models struggle on harder, competition-level problems, while general critique models like QwQ-32B-Preview are quite effective at the task. I find it promising because it sets a clear path for improving language models’ oversight and error detection in math problem-solving, which is crucial for advancing AI’s reasoning capabilities.
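To make the task concrete, here is a rough sketch of the kind of evaluation such a benchmark implies: the critic is asked for the index of the earliest erroneous step (or -1 if the solution is fully correct), and the two resulting accuracies are combined into a single score. The `critic` callable, the data fields, and the scoring details are assumptions for illustration, not ProcessBench’s exact protocol.

```python
def evaluate_critic(critic, examples):
    """Score a critique model on first-error identification.

    Each example is a dict with 'problem', 'steps' (the solution split into
    steps), and 'label' (index of the first wrong step, or -1 if correct).
    `critic` is a hypothetical callable returning the predicted index.
    """
    hits_err, n_err = 0, 0      # examples whose solutions contain an error
    hits_ok, n_ok = 0, 0        # examples whose solutions are fully correct
    for ex in examples:
        pred = critic(ex["problem"], ex["steps"])
        if ex["label"] == -1:
            n_ok += 1
            hits_ok += int(pred == -1)
        else:
            n_err += 1
            hits_err += int(pred == ex["label"])
    acc_err = hits_err / max(n_err, 1)
    acc_ok = hits_ok / max(n_ok, 1)
    # Harmonic mean of the two accuracies, an F1-style summary of both skills.
    return 2 * acc_err * acc_ok / max(acc_err + acc_ok, 1e-9)
```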

Raw notes: r


Evaluating and Aligning CodeLLMs on Human Preference

Alibaba Group; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Shanghai Jiao Tong University

      🤗   46

This paper presents CodeArena, a novel benchmark for evaluating code language models by aligning them with human preferences rather than just correctness, revealing significant performance differences when compared to traditional benchmarks. I find it intriguing how the study illustrates the strength of a code LLM trained with a synthetic instruction corpus, highlighting the importance of preference alignment. This approach promises to enhance the usability and relevance of code models in practical scenarios, which is an exciting advancement for developers and researchers alike.

Raw notes: r


Acknowledgements

Papers are retrieved from Hugging Face.