HuggingFace Papers 2025-07-18
Data source: HuggingFace Papers
Latest Papers
1. A Survey of Context Engineering for Large Language Models
The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.
Authors: Lingrui Mei,Jiayu Yao,Yuyao Ge,Yiwei Wang,Baolong Bi,Yujun Cai,Jiazhi Liu,Mingyu Li,Zhong-Zhi Li,Duzhen Zhang,Chenlin Zhou,Jiayi Mao,Tianze Xia,Jiafeng Guo,Shenghua Liu
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2507.13334.pdf
Arxiv URL: https://arxiv.org/abs/2507.13334
Arxiv ID: 2507.13334
Published: 2025-07-17T17:50:36Z
Updated: 2025-07-17T17:50:36.000Z
2. VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
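A minimal sketch of the resolution-escalation loop described above, assuming a hypothetical `vlm_generate` call and a made-up `<request_high_res>` special token (the real model's API and token differ):

```python
# Minimal sketch of VisionThink-style resolution escalation (assumptions:
# `vlm_generate` and the "<request_high_res>" token are hypothetical stand-ins
# for the model's actual API and special token).
import numpy as np

RESIZE_TOKEN = "<request_high_res>"  # hypothetical special token

def vlm_generate(image: np.ndarray, question: str) -> str:
    """Stub VLM call; a real model returns text, possibly the resize token."""
    # Pretend fine-grained questions need more pixels than the downsampled image has.
    needs_detail = "read" in question.lower() and min(image.shape[:2]) < 512
    return RESIZE_TOKEN if needs_detail else "a cat sitting on a sofa"

def answer(image: np.ndarray, question: str, factor: int = 2) -> str:
    # Start from a downsampled image (fewer visual tokens).
    small = image[::factor, ::factor]
    out = vlm_generate(small, question)
    if out.strip() == RESIZE_TOKEN:
        # The model judged the low-resolution input insufficient: rerun at full resolution.
        out = vlm_generate(image, question)
    return out

print(answer(np.zeros((768, 768, 3), dtype=np.uint8), "Read the sign in the image."))
```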
Authors: Senqiao Yang,Junyi Li,Xin Lai,Bei Yu,Hengshuang Zhao,Jiaya Jia
Categories: cs.CV,cs.AI,cs.CL,cs.LG
PDF URL: https://arxiv.org/pdf/2507.13348.pdf
Arxiv URL: https://arxiv.org/abs/2507.13348
Arxiv ID: 2507.13348
Published: 2025-07-17T17:59:55Z
Updated: 2025-07-17T17:59:55.000Z
3. π^3: Scalable Permutation-Equivariant Visual Geometry Learning
We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.
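A toy illustration of the permutation-equivariance property the architecture is built around (the set function below is a stand-in, not the paper's network): permuting the input views simply permutes the per-view outputs, so no view acts as a privileged reference.

```python
# Permuting the input views permutes the outputs identically: f(P x) = P f(x).
# The function is a toy set encoder, not the pi^3 architecture.
import numpy as np

def set_encoder(views: np.ndarray) -> np.ndarray:
    """Per-view features mixed with an order-independent (symmetric) aggregate."""
    return views + views.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
views = rng.normal(size=(5, 8))      # 5 input views, 8-dim features each
perm = rng.permutation(5)

lhs = set_encoder(views[perm])       # f(P x)
rhs = set_encoder(views)[perm]       # P f(x)
print(np.allclose(lhs, rhs))         # True: equivariant to input ordering
```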
Authors: Yifan Wang,Jianjun Zhou,Haoyi Zhu,Wenzheng Chang,Yang Zhou,Zizun Li,Junyi Chen,Jiangmiao Pang,Chunhua Shen,Tong He
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2507.13347.pdf
Arxiv URL: https://arxiv.org/abs/2507.13347
Arxiv ID: 2507.13347
Published: 2025-07-17T17:59:53Z
Updated: 2025-07-17T17:59:53.000Z
4. The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
Length generalization, the ability to solve problems with longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on the broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine: it linearly expands the reasoning steps into atomic states to alleviate shortcut learning, and adds an explicit memory-fetch mechanism to reduce the difficulty of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves both the length generalization ability and the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts of the Turing Machine, rather than particular thinking styles, are indispensable for TAIL's length generalization; with them, the model exhibits read-and-write behaviors in its attention layers consistent with the properties of the Turing Machine. This work provides a promising direction for future research on learning LLM reasoning from synthetic data.
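A rough sketch of the kind of data synthesis TAIL describes, assuming an illustrative trace format rather than the paper's exact one: a program executes a tiny Turing machine (binary increment) and linearizes every atomic step into a CoT line.

```python
# A minimal TAIL-style synthesis sketch (assumption: the trace format is
# illustrative). A tiny Turing machine for binary increment is run by a program,
# and every atomic step -- state, head position, symbol read/written -- becomes
# one chain-of-thought line.

def increment_trace(bits: str) -> list[str]:
    tape = list(bits)
    head = len(tape) - 1          # start at the least-significant bit
    state, steps = "carry", []
    while state != "halt":
        read = tape[head] if 0 <= head < len(tape) else "_"
        if state == "carry":
            if read == "1":
                write, move, nxt = "0", -1, "carry"   # 1 + carry -> 0, keep carrying
            elif read == "0":
                write, move, nxt = "1", 0, "halt"     # absorb the carry
            else:                                     # ran off the left edge
                tape.insert(0, "1")
                head, write, move, nxt = 0, "1", 0, "halt"
        if 0 <= head < len(tape):
            tape[head] = write
        steps.append(f"state={state} head={head} read={read} write={write} "
                     f"tape={''.join(tape)}")
        head += move
        state = nxt
    return steps

for line in increment_trace("1011"):   # 1011 + 1 -> 1100
    print(line)
```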
Authors: Zhouqi Hua,Wenwei Zhang,Chengqi Lyu,Yuzhe Gu,Songyang Gao,Kuikun Liu,Kai Chen
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2507.13332.pdf
Arxiv URL: https://arxiv.org/abs/2507.13332
Arxiv ID: 2507.13332
Published: 2025-07-17T17:50:07Z
Updated: 2025-07-17T17:50:07.000Z
5. AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
Authors: Yiming Ren,Zhiqiang Lin,Yu Li,Gao Meng,Weiyun Wang,Junjie Wang,Zicheng Lin,Jifeng Dai,Yujiu Yang,Wenhai Wang,Ruihang Chu
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2507.12841.pdf
Arxiv URL: https://arxiv.org/abs/2507.12841
Arxiv ID: 2507.12841
Published: 2025-07-17T07:04:05Z
Updated: 2025-07-17T07:04:05.000Z
6. Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp; we then alternately denoise the latent grid along the spatial and temporal dimensions with a sliding window, and finally decode the videos at the target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while keeping GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/ .
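A minimal sketch of the sliding-window denoising schedule over a (views × timestamps) latent grid, with `denoise_window` standing in for the 4D diffusion model and the window size and number of alternations chosen arbitrarily:

```python
# Sketch of alternating spatial/temporal sliding-window denoising (assumptions:
# window size, schedule, and `denoise_window` are illustrative stand-ins for the
# paper's 4D diffusion model).
import numpy as np

def denoise_window(latents: np.ndarray) -> np.ndarray:
    """Stub for one denoising pass of the spatio-temporal diffusion model."""
    return 0.9 * latents  # a real model would predict and remove noise here

def sliding_denoise(grid: np.ndarray, win: int = 4, iters: int = 3) -> np.ndarray:
    V, T = grid.shape[:2]              # latent grid: (views, timestamps, latent_dim)
    for it in range(iters):
        if it % 2 == 0:                # spatial pass: slide a window over views
            for v in range(0, V, win):
                grid[v:v + win] = denoise_window(grid[v:v + win])
        else:                          # temporal pass: slide a window over time
            for t in range(0, T, win):
                grid[:, t:t + win] = denoise_window(grid[:, t:t + win])
    return grid                        # decode target-view videos from these latents

latents = np.random.default_rng(0).normal(size=(8, 16, 32))  # 8 views, 16 frames
print(sliding_denoise(latents).shape)
```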
Authors: Yudong Jin,Sida Peng,Xuan Wang,Tao Xie,Zhen Xu,Yifan Yang,Yujun Shen,Hujun Bao,Xiaowei Zhou
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2507.13344.pdf
Arxiv URL: https://arxiv.org/abs/2507.13344
Arxiv ID: 2507.13344
Published: 2025-07-17T17:59:17Z
Updated: 2025-07-17T17:59:17.000Z
7. RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing memory and computational demands. However, challenges remain, including finding optimal initialization strategies or mitigating overparametrization in low-rank matrix factorization. In this work, we propose a novel approach that addresses both of the challenges simultaneously within a unified framework. Our method treats a set of fixed-rank LoRA matrices as a smooth manifold. Considering adapters as elements on this manifold removes overparametrization, while determining the direction of the fastest loss decrease along the manifold provides initialization. Special care is taken to obtain numerically stable and computationally efficient implementation of our method, using best practices from numerical linear algebra and Riemannian optimization. Experimental results on LLM and diffusion model architectures demonstrate that RiemannLoRA consistently improves both convergence speed and final performance over standard LoRA and its state-of-the-art modifications.
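A small numpy demonstration of the overparametrization that the manifold view removes: for any invertible G, the factor pair (BG, G^{-1}A) produces exactly the same adapter update ΔW = BA, so plain LoRA coordinates are redundant.

```python
# Why fixed-rank LoRA factors are overparametrized: for any invertible G, the
# pair (B @ G, inv(G) @ A) yields the same update Delta W = B @ A. Treating the
# set of rank-r updates as a manifold removes this redundancy.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
G = rng.normal(size=(r, r)) + 3 * np.eye(r)   # generic invertible reparametrization

delta_w1 = B @ A
delta_w2 = (B @ G) @ (np.linalg.inv(G) @ A)
print(np.allclose(delta_w1, delta_w2))        # True: same point, different coordinates
```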
Authors: Vladimir Bogachev,Vladimir Aletov,Alexander Molozhavenko,Denis Bobkov,Vera Soboleva,Aibek Alanov,Maxim Rakhuba
Categories: cs.LG,cs.CL,cs.NA,math.DG,math.NA,68T07,65F55,53Z50
PDF URL: https://arxiv.org/pdf/2507.12142.pdf
Arxiv URL: https://arxiv.org/abs/2507.12142
Arxiv ID: 2507.12142
Published: 2025-07-16T11:17:12Z
Updated: 2025-07-16T11:17:12.000Z
8. FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers
Producing expressive facial animations from static images is a challenging task. Prior methods relying on explicit geometric priors (e.g., facial landmarks or 3DMM) often suffer from artifacts in cross reenactment and struggle to capture subtle emotions. Furthermore, existing approaches lack support for multi-character animation, as driving features from different individuals frequently interfere with one another, complicating the task. To address these challenges, we propose FantasyPortrait, a diffusion transformer based framework capable of generating high-fidelity and emotion-rich animations for both single- and multi-character scenarios. Our method introduces an expression-augmented learning strategy that utilizes implicit representations to capture identity-agnostic facial dynamics, enhancing the model’s ability to render fine-grained emotions. For multi-character control, we design a masked cross-attention mechanism that ensures independent yet coordinated expression generation, effectively preventing feature interference. To advance research in this area, we propose the Multi-Expr dataset and ExprBench, which are specifically designed datasets and benchmarks for training and evaluating multi-character portrait animations. Extensive experiments demonstrate that FantasyPortrait significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, excelling particularly in challenging cross reenactment and multi-character contexts. Our project page is https://fantasy-amap.github.io/fantasy-portrait/.
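A hedged numpy sketch of masked cross-attention in the spirit described above: a block mask restricts each character's portrait tokens to the driving features of that same character, so expressions are generated independently (shapes and mask layout are illustrative, not the paper's diffusion-transformer internals).

```python
# Sketch of masked cross-attention for multi-character control (assumption:
# shapes and the block mask are illustrative).
import numpy as np

def masked_cross_attention(q, k, v, q_ids, kv_ids):
    """q: (Nq, d), k/v: (Nk, d); *_ids say which character each token belongs to."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # A query may only attend to driving features of the same character.
    mask = q_ids[:, None] != kv_ids[None, :]
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 16))                  # portrait tokens for 2 characters
kv = rng.normal(size=(8, 16))                 # driving expression features
q_ids = np.array([0, 0, 0, 1, 1, 1])
kv_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(masked_cross_attention(q, kv, kv, q_ids, kv_ids).shape)  # (6, 16)
```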
Authors: Qiang Wang,Mengchao Wang,Fan Jiang,Yaqi Fan,Yonggang Qi,Mu Xu
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2507.12956.pdf
Arxiv URL: https://arxiv.org/abs/2507.12956
Arxiv ID: 2507.12956
Published: 2025-07-17T09:50:43Z
Updated: 2025-07-17T09:50:43.000Z
9. MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon test-time-inference VLMs trained through reinforcement learning, which demonstrates the potential of utilizing world models for test-time scaling.
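A minimal sketch of the test-time exploration loop, with all three callables as hypothetical stand-ins for the VLM and the video-diffusion world model:

```python
# Sketch of the MindJourney test-time loop (assumptions: the three callables are
# hypothetical stand-ins; the real system scores and filters imagined views
# rather than keeping every one).

def vlm_propose_move(question, views):
    """Stub: the VLM sketches the next camera action given the evidence so far."""
    return "turn_left" if len(views) < 3 else None   # stop after a short trajectory

def world_model_render(current_view, action):
    """Stub: the world model imagines the view after the egocentric action."""
    return f"{current_view}->{action}"

def vlm_answer(question, views):
    """Stub: the VLM reasons over the accumulated multi-view evidence."""
    return f"answer based on {len(views)} views"

def mindjourney(question, first_view):
    views = [first_view]
    while (action := vlm_propose_move(question, views)) is not None:
        views.append(world_model_render(views[-1], action))
    return vlm_answer(question, views)

print(mindjourney("What is behind the red chair?", "obs_0"))
```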
Authors: Yuncong Yang,Jiageng Liu,Zheyuan Zhang,Siyuan Zhou,Reuben Tan,Jianwei Yang,Yilun Du,Chuang Gan
Categories: cs.CV,cs.AI,cs.RO
PDF URL: https://arxiv.org/pdf/2507.12508.pdf
Arxiv URL: https://arxiv.org/abs/2507.12508
Arxiv ID: 2507.12508
Published: 2025-07-16T17:59:36Z
Updated: 2025-07-16T17:59:36.000Z
10. AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
Authors: Yilun Zhao,Weiyuan Chen,Zhijian Xu,Manasi Patwardhan,Yixin Liu,Chengye Wang,Lovekesh Vig,Arman Cohan
Categories: cs.CL,cs.AI
PDF URL: https://arxiv.org/pdf/2507.13300.pdf
Arxiv URL: https://arxiv.org/abs/2507.13300
Arxiv ID: 2507.13300
Published: 2025-07-17T17:09:22Z
Updated: 2025-07-17T17:09:22.000Z
11. Voxtral
We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under Apache 2.0 license.
Authors: Alexander H. Liu,Andy Ehrenberg,Andy Lo,Clément Denoix,Corentin Barreau,Guillaume Lample,Jean-Malo Delignon,Khyathi Raghavi Chandu,Patrick von Platen,Pavankumar Reddy Muddireddy,Sanchit Gandhi,Soham Ghosh,Srijan Mishra,Thomas Foubert,Abhinav Rastogi,Adam Yang,Albert Q. Jiang,Alexandre Sablayrolles,Amélie Héliou,Amélie Martin,Anmol Agarwal,Antoine Roux,Arthur Darcet,Arthur Mensch,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Chris Bamford,Christian Wallenwein,Christophe Renaudin,Clémence Lanfranchi,Darius Dabert,Devendra Singh Chaplot,Devon Mizelle,Diego de las Casas,Elliot Chane-Sane,Emilien Fugier,Emma Bou Hanna,Gabrielle Berrada,Gauthier Delerce,Gauthier Guinet,Georgii Novikov,Guillaume Martin,Himanshu Jaju,Jan Ludziejewski,Jason Rute,Jean-Hadrien Chabran,Jessica Chudnovsky,Joachim Studnia,Joep Barmentlo,Jonas Amar,Josselin Somerville Roberts,Julien Denize,Karan Saxena,Karmesh Yadav,Kartik Khandelwal,Kush Jain,Lélio Renard Lavaud,Léonard Blier,Lingxiao Zhao,Louis Martin,Lucile Saulnier,Luyu Gao,Marie Pellat,Mathilde Guillaumin,Mathis Felardos,Matthieu Dinot,Maxime Darrin,Maximilian Augustin,Mickaël Seznec,Neha Gupta,Nikhil Raghuraman,Olivier Duchenne,Patricia Wang,Patryk Saffer,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Rémi Delacourt,Romain Sauvestre,Roman Soletskyi,Sagar Vaze,Sandeep Subramanian,Saurabh Garg,Shashwat Dalal,Siddharth Gandhi,Sumukh Aithal,Szymon Antoniak,Teven Le Scao,Thibault Schueller,Thibaut Lavril,Thomas Robert,Thomas Wang,Timothée Lacroix,Tom Bewley,Valeriia Nemychnikova,Victor Paltz,Virgile Richard,Wen-Ding Li,William Marshall,Xuanyu Zhang,Yihan Wan,Yunhao Tang
Categories: cs.SD,cs.AI,eess.AS
PDF URL: https://arxiv.org/pdf/2507.13264.pdf
Arxiv URL: https://arxiv.org/abs/2507.13264
Arxiv ID: 2507.13264
Published: 2025-07-17T16:17:37Z
Updated: 2025-07-17T16:17:37.000Z
12. Teach Old SAEs New Domain Tricks with Boosting
Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
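A compact PyTorch sketch of the residual-boosting recipe, under illustrative assumptions (tiny dimensions, a plain MSE-plus-L1 objective, and feeding the secondary SAE the raw activation): the pretrained SAE stays frozen, the secondary SAE learns its reconstruction error on domain data, and the two outputs are summed at inference.

```python
# Residual SAE boosting sketch (assumptions noted above; the paper targets the
# pretrained SAE's reconstruction error on domain-specific texts).
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

primary = SAE().eval()                      # pretrained, kept frozen
for p in primary.parameters():
    p.requires_grad_(False)

secondary = SAE()                           # learns what the primary misses
opt = torch.optim.Adam(secondary.parameters(), lr=1e-3)

for _ in range(100):                        # domain-specific activations (dummy data)
    x = torch.randn(32, 64)
    residual = x - primary(x)               # what the primary fails to reconstruct
    loss = (secondary(x) - residual).pow(2).mean() \
           + 1e-3 * torch.relu(secondary.enc(x)).mean()   # sparsity penalty
    opt.zero_grad()
    loss.backward()
    opt.step()

x = torch.randn(4, 64)
boosted_recon = primary(x) + secondary(x)   # sum both models' outputs at inference
print(boosted_recon.shape)
```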
Authors: Nikita Koriagin,Yaroslav Aksenov,Daniil Laptev,Gleb Gerasimov,Nikita Balagansky,Daniil Gavrilov
Categories: cs.LG,cs.AI,cs.CL
PDF URL: https://arxiv.org/pdf/2507.12990.pdf
Arxiv URL: https://arxiv.org/abs/2507.12990
Arxiv ID: 2507.12990
Published: 2025-07-17T10:57:49Z
Updated: 2025-07-17T10:57:49.000Z
13. FLEXITOKENS: Flexible Tokenization for Evolving Language Models
Language models (LMs) are difficult to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements in downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens
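A toy sketch of boundary-based byte segmentation, with the scorer, threshold rule, and segment splitting as illustrative stand-ins; the paper's contribution is the training objective that avoids enforcing one fixed compression rate across the corpus.

```python
# Byte-level segmentation via a learned boundary predictor (assumptions: the
# random "scores" stand in for the boundary submodule's output, and the
# threshold rule is illustrative).
import numpy as np

def segment(byte_seq: bytes, boundary_probs: np.ndarray, thresh: float = 0.5):
    """Split the byte sequence wherever the predicted boundary probability is high."""
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > thresh or i == len(byte_seq) - 1:
            segments.append(byte_seq[start:i + 1])
            start = i + 1
    return segments

text = "tokenization adapts".encode()
rng = np.random.default_rng(0)
probs = rng.uniform(size=len(text))        # stand-in for the boundary predictor
pieces = segment(text, probs)
print([p.decode() for p in pieces])
print(f"compression rate: {len(text) / len(pieces):.2f} bytes per segment")
```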
Authors: Abraham Toluase Owodunni,Orevaoghene Ahia,Sachin Kumar
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2507.12720.pdf
Arxiv URL: https://arxiv.org/abs/2507.12720
Arxiv ID: 2507.12720
Published: 2025-07-17T01:55:41Z
Updated: 2025-07-17T01:55:41.000Z
14. TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves 20% improvement in FID on the most challenging datasets over recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having 3x fewer parameters. Such a parameter reduction results in a 2.3x speed-up. By incorporating optical flow guidance, our method requires 9000x less training data and achieves over 20x fewer parameters than video-based diffusion models. Codes and results are available at our project page: https://zonglinl.github.io/tlbvfi_page.
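For orientation, the textbook Brownian-bridge marginal pinned at the latents of the two input frames looks like the sketch below; this is the generic bridge formula, not TLB-VFI's exact parameterization.

```python
# Generic Brownian-bridge sampling between the latents of frames I_0 and I_1
# (assumption: x_n = (1-n) x_0 + n x_1 + sigma * sqrt(n(1-n)) * eps is the
# textbook bridge marginal, not the paper's exact schedule).
import numpy as np

def brownian_bridge_sample(x0: np.ndarray, x1: np.ndarray, n: float, sigma: float = 1.0):
    """Sample the bridge state at video time n in (0, 1); endpoints are noise-free."""
    eps = np.random.default_rng(0).normal(size=x0.shape)
    return (1.0 - n) * x0 + n * x1 + sigma * np.sqrt(n * (1.0 - n)) * eps

x0 = np.zeros((4, 8, 8))                   # latent of frame I_0
x1 = np.ones((4, 8, 8))                    # latent of frame I_1
mid = brownian_bridge_sample(x0, x1, n=0.5)
print(mid.mean())                          # close to 0.5, plus bridge noise
```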
Authors: Zonglin Lyu,Chen Chen
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2507.04984.pdf
Arxiv URL: https://arxiv.org/abs/2507.04984
Arxiv ID: 2507.04984
Published: 2025-07-07T13:25:32Z
Updated: 2025-07-07T13:25:32.000Z
15. Automating Steering for Safe Multimodal Large Language Models
Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model’s internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
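A minimal sketch of the inference-time intervention pattern, with the probing layer, linear prober, threshold, and canned refusal all as hypothetical stand-ins for AutoSteer's learned components:

```python
# Inference-time steering sketch (assumptions: the chosen layer, the linear
# prober, the threshold, and the canned refusal are stand-ins; the real system
# selects the layer via the Safety Awareness Score and trains the prober on
# intermediate representations).
import numpy as np

rng = np.random.default_rng(0)
W_probe = rng.normal(size=64)               # stand-in for the trained safety prober

def safety_prob(hidden_state: np.ndarray) -> float:
    """Estimated probability that continuing generation would be unsafe."""
    return 1.0 / (1.0 + np.exp(-hidden_state @ W_probe))

def generate_with_steering(hidden_state: np.ndarray, decode, threshold: float = 0.8):
    if safety_prob(hidden_state) > threshold:
        return "I can't help with that request."   # refusal head takes over
    return decode(hidden_state)                    # otherwise, normal decoding

h = rng.normal(size=64)                     # intermediate representation of a prompt
print(generate_with_steering(h, decode=lambda s: "normal model continuation"))
```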
Authors: Lyucheng Wu,Mengru Wang,Ziwen Xu,Tri Cao,Nay Oo,Bryan Hooi,Shumin Deng
Categories: cs.CL,cs.AI,cs.IR,cs.LG,cs.MM
PDF URL: https://arxiv.org/pdf/2507.13255.pdf
Arxiv URL: https://arxiv.org/abs/2507.13255
Arxiv ID: 2507.13255
Published: 2025-07-17T16:04:55Z
Updated: 2025-07-17T16:04:55.000Z
16. Einstein Fields: A Neural Perspective To Computational General Relativity
We introduce Einstein Fields, a neural representation that is designed to compress computationally intensive four-dimensional numerical relativity simulations into compact implicit neural network weights. By modeling the metric, which is the core tensor field of general relativity, Einstein Fields enable the derivation of physical quantities via automatic differentiation. However, unlike conventional neural fields (e.g., signed distance, occupancy, or radiance fields), Einstein Fields are Neural Tensor Fields with the key difference that when encoding the spacetime geometry of general relativity into neural field representations, dynamics emerge naturally as a byproduct. Einstein Fields show remarkable potential, including continuum modeling of 4D spacetime, mesh-agnosticity, storage efficiency, derivative accuracy, and ease of use. We address these challenges across several canonical test beds of general relativity and release an open source JAX-based library, paving the way for more scalable and expressive approaches to numerical relativity. Code is made available at https://github.com/AndreiB137/EinFields
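A small PyTorch sketch of the core idea of deriving geometric quantities from a neural metric via automatic differentiation (the tiny MLP perturbation of the Minkowski metric, and the use of torch rather than the paper's JAX library, are illustrative assumptions; the Christoffel-symbol formula itself is standard general relativity).

```python
# Deriving Christoffel symbols from a neural metric by autodiff (toy sketch).
import torch

torch.manual_seed(0)
mlp = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 10))
eta = torch.diag(torch.tensor([-1.0, 1.0, 1.0, 1.0]))   # flat Minkowski background
iu = torch.triu_indices(4, 4)

def metric(x):
    """g_{mu nu}(x): Minkowski plus a small symmetric neural perturbation."""
    h = torch.zeros(4, 4)
    h[iu[0], iu[1]] = 0.01 * mlp(x)
    return eta + h + h.T - torch.diag(torch.diagonal(h))

x = torch.tensor([0.0, 1.0, 2.0, 3.0])
g = metric(x)
dg = torch.autograd.functional.jacobian(metric, x)       # dg[mu, nu, rho] = d_rho g_{mu nu}
dg = dg.permute(2, 0, 1)                                  # reorder to dg[rho, mu, nu]
g_inv = torch.linalg.inv(g)

# Gamma^l_{mn} = 1/2 g^{ls} (d_m g_{sn} + d_n g_{sm} - d_s g_{mn})
gamma = 0.5 * (torch.einsum("ls,msn->lmn", g_inv, dg)
               + torch.einsum("ls,nsm->lmn", g_inv, dg)
               - torch.einsum("ls,smn->lmn", g_inv, dg))
print(gamma.shape)   # (4, 4, 4)
```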
Authors: Sandeep Suresh Cranganore,Andrei Bodnar,Arturs Berzins,Johannes Brandstetter
Categories: cs.LG,gr-qc
PDF URL: https://arxiv.org/pdf/2507.11589.pdf
Arxiv URL: https://arxiv.org/abs/2507.11589
Arxiv ID: 2507.11589
Published: 2025-07-15T14:55:39Z
Updated: 2025-07-15T14:55:39.000Z