Source: HuggingFace Papers

Latest Papers

1. On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
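The rescaling described in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' code: the function names are hypothetical, and treating the per-token probability as a constant weight (a stop-gradient) is an assumption about how "rescaling the objective with the probability of this token" is meant. The sketch compares the per-token loss and the gradient with respect to the target logit for a standard SFT term and a probability-rescaled term.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_terms(logits, target, rescale):
    """Return (loss, d_loss/d_target_logit) for one target token.

    rescale=False: standard SFT term, -log p.
    rescale=True:  rescaled term, -w * log p, where w = p is held constant
                   (assumed stop-gradient; not confirmed by the abstract).
    """
    p = softmax(logits)[target]
    weight = p if rescale else 1.0
    loss = -weight * math.log(p)
    # For a softmax, d(-log p)/d z_target = p - 1; the constant weight scales it.
    grad = weight * (p - 1.0)
    return loss, grad


def sequence_loss(all_logits, targets, rescale):
    """Mean per-token loss over a sequence of target tokens."""
    losses = [token_terms(l, t, rescale)[0] for l, t in zip(all_logits, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two positions over a toy 2-word vocabulary; the target is word 0 both times.
    # The model is confident at position 0 and unconfident at position 1.
    logits = [[2.0, 0.0], [0.0, 2.0]]
    targets = [0, 0]
    for name, flag in (("SFT", False), ("rescaled", True)):
        grads = [token_terms(l, t, flag)[1] for l, t in zip(logits, targets)]
        print(name, sequence_loss(logits, targets, flag), grads)
```

In this toy case the plain SFT gradient is largest in magnitude on the low-probability token, while the rescaled gradient on that token is damped by its small probability. In an autograd framework the same idea would presumably amount to multiplying the per-token loss by a detached copy of the token probability, which would fit the abstract's "single-line code change" description.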

LLM Analysis

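Q: What problem does this paper try to solve?

A: The paper addresses the limited generalization of Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). SFT is a simple and efficient way to teach a model expert behavior patterns, but it typically generalizes worse than Reinforcement Learning (RL) methods. RL uses explicit reward or verification signals, which let the model explore diverse strategies and thereby generalize better. However, RL methods often require substantial compute, are sensitive to hyperparameter tuning, and depend on the availability of reward signals, conditions that are not always met in practice.

The paper's core goal is to improve SFT itself at a fundamental level, so that it generalizes better without negative samples, reward signals, or a verification model. Through mathematical analysis, the authors expose a problem in the reward structure implicit in SFT gradients and propose a simple, effective fix, Dynamic Fine-Tuning (DFT), which stabilizes gradient updates by dynamically rescaling the objective function and thereby significantly improves SFT's generalization.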
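The "implicit reward structure" mentioned above can be made concrete with a short derivation. The first identity is standard importance-sampling algebra. The final objective is a reconstruction from the summary, and the stop-gradient operator sg(·) and the token-level form are assumptions rather than details stated in this digest.

```latex
% SFT gradient on expert pairs (x, y^*) drawn from the dataset D:
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x, y^*) \sim \mathcal{D}}
      \big[ \nabla_\theta \log \pi_\theta(y^* \mid x) \big]

% Rewritten as an expectation over the model's own samples:
  = -\,\mathbb{E}_{(x, y^*) \sim \mathcal{D}}\;
      \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \left[ \frac{\mathbf{1}[y = y^*]}{\pi_\theta(y \mid x)}\,
             \nabla_\theta \log \pi_\theta(y \mid x) \right]

% This has the form of a policy gradient with a sparse reward 1[y = y^*]
% and a weight 1 / pi_theta, which grows without bound as the model's
% probability of the expert response shrinks.

% Assumed token-level rescaled objective (sg = stop-gradient):
\mathcal{L}_{\mathrm{DFT}}(\theta)
  = -\,\mathbb{E}_{(x, y^*) \sim \mathcal{D}}
      \sum_{t} \mathrm{sg}\!\big( \pi_\theta(y^*_t \mid y^*_{<t}, x) \big)\,
               \log \pi_\theta(y^*_t \mid y^*_{<t}, x)
```

Under this reading, multiplying each token's term by its own probability cancels the 1/π weight, which would be the sense in which the gradient updates are stabilized.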

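Q: What related work exists?

A: The following work is related to this paper, grouped by research direction:

Comparing supervised fine-tuning (SFT) and reinforcement learning (RL)

  • Chu et al. (2025): Systematically compared SFT and RL and found that "SFT memorizes, RL generalizes", while also noting that SFT remains necessary as an initialization step to stabilize the output format.
  • Ouyang et al. (2022): Studied SFT and RL for language model alignment and found that SFT is simple and efficient at imitating expert demonstrations, whereas RL generalizes better.
  • Christiano et al. (2017): Proposed deep reinforcement learning from human preferences, demonstrating RL's advantage in generalization.
  • Bai et al. (2022): Studied how to train a helpful assistant with reinforcement learning from human feedback, further confirming RL's potential for generalization.

Hybrid methods: combining SFT and RL

  • Ouyang et al. (2022): Proposed InstructGPT, which first applies an SFT stage and then fine-tunes with RL against a learned reward model.
  • Sheng et al. (2025): Explored alternating between SFT and RL steps to improve stability and performance.
  • Liu et al. (2025): Studied how mixing SFT and RL can improve model generalization.
  • Qiu et al. (2025): Proposed MetisRise, a learning method for multimodal reasoning models that combines RL incentivization with SFT enhancement.

Theoretical analysis: unifying SFT and RL

  • Du et al. (2025): Reformulated RLHF (Reinforcement Learning from Human Feedback) as a form of reward-weighted SFT, which simplifies the training pipeline but still relies on explicit rewards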

Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang

Categories: cs.LG

PDF URL: https://arxiv.org/pdf/2508.05629.pdf

Arxiv URL: https://arxiv.org/abs/2508.05629

Arxiv ID: 2508.05629

CoolPaper URL: https://papers.cool/arxiv/2508.05629

Published: 2025-08-07T17:59:04Z

Updated: 2025-08-07T17:59:04.000Z


2. MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
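For readers unfamiliar with the metric, J&F conventionally denotes the average of region similarity J (mask intersection over union) and contour accuracy F. The sketch below covers only J for binary masks given as nested lists, plus the arithmetic behind the SAM2 drop quoted above. The function names are hypothetical, the empty-mask convention is an assumption, and the abstract does not state which metric the 76.4% and 50.9% figures use.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    Masks are nested lists of 0/1 values with the same shape.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: if the object is absent in both masks, count it as
    # a perfect match. This case matters when objects disappear from view.
    return 1.0 if union == 0 else inter / union


def performance_drop(before, after):
    """Return (absolute drop in points, drop as a fraction of `before`)."""
    absolute = before - after
    return absolute, absolute / before


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(jaccard(pred, gt))             # one shared pixel out of two covered
    print(performance_drop(76.4, 50.9))  # SAM2 figures quoted in the abstract
```

With the quoted figures, the drop is 25.5 points, or roughly one third of the MOSEv1 score.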

Authors: Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2508.05630.pdf

Arxiv URL: https://arxiv.org/abs/2508.05630

Arxiv ID: 2508.05630

CoolPaper URL: https://papers.cool/arxiv/2508.05630

Published: 2025-08-07T17:59:27Z

Updated: 2025-08-07T17:59:27.000Z