Weekly paper roundup: SWE-Lancer Benchmark (2/17/2025)

Overview

The collection of papers primarily focuses on advances in large language models (LLMs) and their applications across various domains, including AI agent frameworks, vision-language integration, diffusion models, and multilingual tasks. Noteworthy themes include improving model efficiency and performance through novel architectures such as Native Sparse Attention, diffusion-based language modeling, and training methodologies such as reinforcement learning and LoRA adaptation. Additional research explores enhancing multimodal capabilities, exemplified by models like Magma and Explorer, which integrate multiple forms of intelligence for practical web and robotic tasks. A focus on evaluation and benchmarking, such as SuperGPQA and MMTEB, assesses LLMs across diverse disciplines and languages, revealing their strengths and limitations. Collectively, these studies suggest promising directions for expanding the functionality and effectiveness of AI systems beyond conventional model configurations and tasks.

Spotlight 🔦

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

OpenAI

      🤗   41

This paper introduces SWE-Lancer, an intriguing benchmark crafted from over 1,400 real-world freelance software engineering tasks from Upwork, to explore the earning potential of advanced large language models (LLMs) in this field. Although the benchmark encompasses both technical and managerial tasks with a cumulative value of $1 million, current LLMs falter at performing these tasks effectively, revealing substantial limitations in their capabilities. The study uncovers significant gaps in AI performance on software engineering tasks, suggesting that the economic impact of AI in this area is complex and warrants further attention. Interestingly, Claude 3.5 Sonnet achieves the best performance, indicating where progress has been made. Overall, this research underscores the immense potential for further advancement and innovation at the intersection of AI and freelance software engineering.
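To make the headline number concrete, here is a minimal sketch of how a payout-weighted score on a benchmark like this could be tallied: each task carries its real freelance price, and a model only "earns" that price if its solution passes the task's end-to-end tests. The task records and pass flags below are invented for illustration, not data from the paper.

```python
# Hypothetical task records: id, real freelance payout, and whether the
# model's submission passed the task's end-to-end tests.
tasks = [
    {"id": "ic-swe-001", "price_usd": 250.0, "passed": True},
    {"id": "ic-swe-002", "price_usd": 1000.0, "passed": False},
    {"id": "mgr-003", "price_usd": 500.0, "passed": True},
]

# Earnings are the sum of payouts for tasks the model actually solved.
earned = sum(t["price_usd"] for t in tasks if t["passed"])
total = sum(t["price_usd"] for t in tasks)
print(f"Model earned ${earned:,.0f} of ${total:,.0f} available ({earned / total:.0%})")
```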

Raw notes: Sonnet 3.5 performs best


Other papers

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

University of California, Santa Barbara; University College London; University of Wisconsin–Madison; University of Oxford; PyTorch Core Libraries at Meta; FAIR at Meta; GenAI at Meta

      🤗   161

This paper presents MLGym, a robust framework and benchmark aimed at evaluating large language model agents across a range of AI research tasks. It cleverly highlights the current limitations of state-of-the-art LLMs in producing innovative solutions, despite their strengths in optimizing existing methods. By making the framework open-source, the authors encourage further exploration and development in AI research capabilities.

Raw notes: r


Qwen2.5-VL Technical Report

Alibaba Group

      🤗   143

This paper details the progress of the Qwen2.5-VL model in integrating vision and language processing. The model impressively handles visual tasks such as object localization and long-video comprehension through innovative dynamic resolution processing. Its adaptability across static and interactive tasks showcases its state-of-the-art performance, making it a powerful tool in document understanding while maintaining robust linguistic capabilities.
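As a rough illustration of the dynamic-resolution idea, the sketch below derives a visual token budget from an image's native size instead of resizing everything to a fixed square. The 14-pixel patch size and 2×2 merge factor mirror common ViT-style setups and are assumptions on my part, not a statement of Qwen2.5-VL's exact configuration.

```python
# Toy token budgeting for dynamic-resolution vision input (assumed parameters).
def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    rows = -(-height // patch)   # ceil division: patch rows at native resolution
    cols = -(-width // patch)    # ceil division: patch columns at native resolution
    return (rows // merge) * (cols // merge)  # tokens after 2x2 spatial merging

print(visual_token_count(1080, 1920))  # a full-HD frame gets a much larger token budget...
print(visual_token_count(336, 336))    # ...than a small image, with no fixed-size resize
```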

Raw notes: r


Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

DeepSeek-AI; Peking University; University of Washington

      🤗   134

This paper presents Native Sparse Attention (NSA), a novel approach designed to efficiently handle long-context modeling in language models. By combining both coarse and fine-grained token selection strategies, NSA delivers impressive speedups and maintains or enhances performance relative to traditional attention mechanisms. The proposed method is particularly optimized for contemporary hardware, enabling end-to-end training that cuts down on computational expense during pretraining without compromising accuracy.
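The sketch below is a toy NumPy rendering of the coarse-to-fine idea for a single query: score key blocks with pooled summaries, keep the top-scoring blocks, and run exact attention over only those tokens. The block size, number of kept blocks, and mean-pooling rule are illustrative assumptions, not NSA's actual architecture or hardware-aligned kernels.

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size=16, top_blocks=2):
    """q: (d,), K/V: (T, d). Attend only to tokens in the highest-scoring key blocks."""
    T, d = K.shape
    n_blocks = T // block_size
    # Coarse stage: summarize each block by its mean-pooled key and score it.
    block_keys = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = block_keys @ q / np.sqrt(d)
    keep = np.argsort(block_scores)[-top_blocks:]
    # Fine stage: gather tokens from the selected blocks and attend exactly.
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in keep])
    scores = K[idx] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[idx]

rng = np.random.default_rng(0)
out = block_sparse_attention(rng.normal(size=(64,)),
                             rng.normal(size=(128, 64)),
                             rng.normal(size=(128, 64)))
print(out.shape)  # (64,) — same output shape as dense attention, far fewer tokens touched
```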

Raw notes: r


SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Google DeepMind

      🤗   115

This paper presents SigLIP 2, which significantly boosts multilingual and multimodal tasks by refining semantic understanding and localization capabilities. I appreciate how the authors detail the integration of multiple resolutions and diverse training data, showing a commitment to enhancing multilingual understanding and fairness. Overall, the step forward from the original SigLIP to SigLIP 2 demonstrates a substantial advancement in handling vision-language tasks with more efficiency and precision.

Raw notes: r


SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

ByteDance Inc.

      🤗   91

This paper presents SuperGPQA, a comprehensive benchmark aimed at evaluating large language models across a diverse range of 285 graduate-level disciplines, highlighting significant performance gaps, with the best model reaching only 61.82% accuracy. I appreciate the paper’s collaborative approach, leveraging expert feedback for question refinement, which underscores the need for continued advancements in LLM capabilities. Additionally, the authors offer valuable insights into managing extensive annotation processes, which could be instrumental for future research endeavors in the LLM evaluation sphere.

Raw notes: r


Large Language Diffusion Models

Gaoling School of Artificial Intelligence, Renmin University of China; Ant Group

      🤗   80

This paper introduces LLaDA, a diffusion model designed to contend with the prominent autoregressive models typically used in large language models. It offers an innovative data masking technique, enabling it to perform competitively across benchmarks and instruction-following tasks. I find it compelling that the research suggests diffusion models might provide a viable alternative, challenging the current preconceptions about the necessity of autoregressive methods.
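For intuition, here is a toy version of the masking step such a model trains on: sample a noise level t, hide that fraction of tokens behind a mask id, and ask the model to reconstruct them. The token ids, mask id, and uniform noise schedule are assumptions for illustration, not LLaDA's exact recipe.

```python
import numpy as np

MASK_ID = 0  # assumed placeholder id for masked positions

def corrupt(tokens, rng):
    """Mask a random fraction t of the sequence; return noisy tokens, mask, and t."""
    t = rng.uniform(0.0, 1.0)                    # sampled noise level for this example
    mask = rng.uniform(size=len(tokens)) < t     # which positions get hidden
    noisy = np.where(mask, MASK_ID, tokens)
    return noisy, mask, t

rng = np.random.default_rng(0)
tokens = np.array([17, 42, 7, 99, 23, 5])
noisy, mask, t = corrupt(tokens, rng)
print(t, noisy)  # the model would be trained to recover tokens[mask] from noisy
```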

Raw notes: r


Soundwave: Less is More for Speech-Text Alignment in LLMs

The Chinese University of Hong Kong, Shenzhen

      🤗   75

This paper presents Soundwave, a model that aligns speech with text far more efficiently than its predecessors, requiring only a fraction of the training data. It tackles discrepancies in representation space and sequence length between speech and text, yielding impressive results in speech translation and outperforming the Qwen2-Audio model on AIR-Bench tasks. I’m impressed by how effectively it maintains conversational capabilities with such limited data.

Raw notes: r


How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?

AIRI; Skoltech; Moscow Institute of Physics and Technology; Nazarbayev University

      🤗   74

This paper explores the techniques for injecting new knowledge into Large Language Models using Low-rank Adaptation while trying to preserve their prior capabilities. It uncovers that while combining both new and established information during fine-tuning achieves favorable outcomes, there remains a risk of diminishing the model’s original performance, especially if the training data is biased. The findings underscore the delicate balance required to enhance LLMs without compromising their existing strengths.
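As background for the mechanism being stressed here, the sketch below shows the low-rank update at the heart of LoRA: the pretrained weight stays frozen while a rank-r correction B @ A absorbs the new knowledge. The dimensions, rank, and zero initialization of B follow common practice and are assumptions, not the paper's specific setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base path plus low-rank correction; only A and B would receive gradients.
    return W @ x + B @ (A @ x)

x = rng.normal(size=(d_in,))
# Before any training the adapter contributes nothing, so the base model is intact.
print(np.allclose(lora_forward(x), W @ x))  # True
```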

Raw notes: r


Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

AIRI, Moscow, Russia; Neural Networks and Deep Learning Lab, MIPT, Dolgoprudny, Russia; Independent Researcher, Amsterdam, Netherlands; London Institute for Mathematical Sciences, London, UK

      🤗   62

This paper explores the potential for dramatically increasing the compression of token sequences into real-valued vectors, far beyond current practices. The authors reveal that while common methods manage a compression ratio of around x10, their approach can achieve compression of up to x1500. This suggests a vast opportunity for improving model efficiency by rethinking how data is embedded and potentially transforming how information is processed in machine learning models.

Raw notes: r


The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

University of California, Berkeley, USA; ETH Zurich, Switzerland; University of Illinois Urbana-Champaign, USA; Carnegie Mellon University, USA

      🤗   53

This paper dives into the problem of overthinking in Large Reasoning Models (LRMs) when they perform interactive tasks, showing how it can hamper performance through issues like Analysis Paralysis and Premature Disengagement. The researchers present a framework aimed at mitigating these behaviors, which not only increases model efficiency but also cuts down on computational expenses. It’s a thoughtful exploration of striking a balance between reasoning and action, underscoring practical steps for enhancing AI systems.

Raw notes: r


S*: Test Time Scaling for Code Generation

University of California, Berkeley

      🤗   52

This paper presents S*, a hybrid test-time scaling framework that enhances the performance of large language models in code generation. I appreciate how S* cleverly combines parallel and sequential scaling with a distinct selection mechanism to improve both coverage and selection accuracy. The framework’s ability to enable smaller models to outperform larger ones is particularly impressive, marking a notable advancement in the efficiency and effectiveness of code-generating models.
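A rough sketch of the test-time scaling pattern, under my own simplifications: sample several candidate programs in parallel, score each against public tests, and keep the top scorer. The toy lambdas stand in for LLM samples, the iterative-repair (sequential) stage is omitted, and the pass-count selection rule is a placeholder rather than S*'s actual selection mechanism.

```python
from typing import Callable, List, Tuple

def run_tests(program: Callable[[int], int], tests: List[Tuple[int, int]]) -> int:
    """Count how many public test cases a candidate program passes."""
    passed = 0
    for x, expected in tests:
        try:
            if program(x) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply scores lower
    return passed

def select_best(candidates, tests):
    # Parallel scaling: score every sampled candidate; selection: keep the top scorer.
    return max(candidates, key=lambda prog: run_tests(prog, tests))

# Toy stand-ins for sampled programs that should square their input.
candidates = [lambda x: x * x, lambda x: x + x, lambda x: x ** 2 + 1]
tests = [(2, 4), (3, 9)]
best = select_best(candidates, tests)
print(best(5))  # 25 — the candidate that passed the most public tests
```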

Raw notes: r


Continuous Diffusion Model for Language Modeling

Korea Advanced Institute of Science and Technology (KAIST); DeepAuto.ai

      🤗   49

This paper introduces a novel continuous diffusion model for language modeling, bridging the gap between discrete and continuous approaches by utilizing the geometry of categorical distributions. The results indicate that this model not only surpasses existing discrete diffusion models but also comes close to the effectiveness of autoregressive models. I found the innovative connection between the discrete and continuous methodologies particularly compelling, as it opens new horizons for enhancing language model performance.

Raw notes: r


Magma: A Foundation Model for Multimodal AI Agents

Microsoft Research; University of Maryland; University of Wisconsin-Madison; KAIST; University of Washington

      🤗   43

This paper presents Magma, a foundation model tailored for multimodal AI tasks, with a strong focus on integrating vision-language understanding and spatial-temporal intelligence. By employing innovative labeling methods and extensive pretraining, Magma showcases enhanced performance in tasks like UI navigation and robotic manipulation, outperforming existing models. I find the approach of using Set-of-Mark and Trace-of-Mark methods particularly innovative for improving action grounding and planning.

Raw notes: r


Learning Getting-Up Policies for Real-World Humanoid Robots

University of Illinois Urbana-Champaign; Simon Fraser University

      🤗   36

This paper introduces an innovative learning framework that excels in teaching humanoid robots how to recover from falls across different settings and surfaces. By utilizing a two-phase approach, it effectively first discovers feasible trajectories and then refines them for practical use, showcasing a remarkable improvement in real-world applications. I was particularly impressed by its success on challenging surfaces such as grass and snow, highlighting its robustness and adaptability.

Raw notes: r


MMTEB: Massive Multilingual Text Embedding Benchmark

Aarhus University; Individual Contributor; Esker; INSA Lyon, LIRIS; University of Amsterdam; MBZUAI; Jina AI; Microsoft Research; Wikit; McGill University; University of Oxford; ITMO University; Koç University; Heritage Institute of Technology; Apart Research; BAAI; National Information Processing Institute; New York University; Ellamind; Peking University; CentraleSupélec; Artefact Research Center; Hugging Face; Wrocław University; Korea University; Illuin Technology; Comenius University Bratislava; Cisco Systems; University of Waterloo; Cohere For AI; University of Zurich; Stanford University; FRC CSC RAS; Salesforce; IIT Madras; Sapienza University of Rome; University of Pennsylvania; SaluteDevices; Princeton University; University of Washington; Imperial College London; R. V. College of Engineering; Robert Koch Institute; HSE University; Nirma University; Occiglot; Allen Institute for AI; Tano Labs; The London Institute of Banking and Finance; Cornell University; Northeastern University; Hong Kong University; Durham University; ServiceNow Research; Johns Hopkins University; Contextual AI

      🤗   31

This paper introduces the Massive Multilingual Text Embedding Benchmark (MMTEB), which provides a robust evaluation framework across over 500 tasks in more than 250 languages. I find it impressive that despite the scope, the research reveals that a relatively smaller model can outperform larger language models across this extensive range of tasks. Additionally, the optimizations for computational efficiency are noteworthy, maintaining result integrity while reducing resource demands.

Raw notes: r


Small Models Struggle to Learn from Strong Reasoners

University of Washington; Carnegie Mellon University; Western Washington University

      🤗   26

This paper delves into the challenges that small models encounter when trying to learn from complex, long-reasoning processes typically handled by larger models. By introducing Mix Distillation, a method that incorporates a blend of long and short reasoning examples, the authors effectively improve the performance of small models, enhancing their reasoning abilities. I find it innovative how this approach significantly bridges the learning gap for models with 3 billion parameters or fewer.
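The core idea lends itself to a small sketch: build the distillation set by sampling long chain-of-thought traces and short direct answers at a fixed ratio, so the small student sees both styles. The ratio, record format, and single toy example are assumptions for illustration, not the paper's configuration.

```python
import random

def mix_distillation_data(long_cot, short_ans, long_ratio=0.2, n=1000, seed=0):
    """Sample a training mix where roughly `long_ratio` of examples carry long reasoning."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(n):
        pool = long_cot if rng.random() < long_ratio else short_ans
        mixed.append(rng.choice(pool))
    return mixed

# Toy teacher outputs for the same prompt in the two styles.
long_cot = [{"prompt": "2+3*4?", "target": "Step 1: 3*4=12. Step 2: 2+12=14. Answer: 14"}]
short_ans = [{"prompt": "2+3*4?", "target": "14"}]
data = mix_distillation_data(long_cot, short_ans)
print(sum("Step" in ex["target"] for ex in data) / len(data))  # roughly equal to long_ratio
```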

Raw notes: r


S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Tencent; Tsinghua University; The University of Hong Kong; Fudan University; The Hong Kong University of Science and Technology (Guangzhou)

      🤗   23

This paper presents the S²R framework, which effectively enhances the reasoning and accuracy of large language models by enabling self-verification and self-correction through reinforcement learning. I appreciate how the authors manage to achieve a remarkable accuracy improvement, particularly with the Qwen2.5-math-7B model, demonstrating the potential of this approach even with limited resources. It’s impressive that this method outperforms traditional approaches relying heavily on extensive datasets, highlighting its efficiency and practicality.

Raw notes: r


Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents

The Ohio State University; Microsoft Research, Redmond

      🤗   9

This paper introduces Explorer, a multimodal web agent that demonstrates improved performance in autonomous web task completion, thanks to a cost-effective approach to generating extensive and diverse web trajectory datasets. I found the scalability and resourcefulness of the dataset generation method particularly notable, as it has the potential to significantly advance the training of large multimodal models. The thorough benchmarks provided highlight the agent’s enhanced capabilities and underscore the work’s contribution to the accessibility of high-quality training data for the wider research community.

Raw notes: r


TESS 2: A Large-Scale Generalist Diffusion Language Model

Yale University; University of Washington; Allen Institute for AI; The Ohio State University

      🤗   5

This paper introduces TESS 2, a cutting-edge diffusion language model that outshines existing diffusion models and holds its ground against strong autoregressive models. By refining a robust autoregressive model with continued pretraining and instruction tuning, along with a unique inference-time guidance procedure, the authors show impressive results. I find the model’s ability to leverage increased computational resources to enhance performance particularly noteworthy.

Raw notes: r


Acknowledgements

Papers are retrieved from Hugging Face.