HuggingFace Papers 2026-06-05
Data source: HuggingFace Papers
Latest Papers
1. OPRD: On-Policy Representation Distillation
Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen’s ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: this https URL.
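The contrast the abstract draws can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not state the alignment loss, so mean squared error is an assumption here, and the function names `mc_kl_estimate` and `hidden_state_loss` and the `layer_map` argument are hypothetical. The sketch also assumes student and teacher hidden states share the same width; real models of different sizes would likely need a projection, which is omitted.

```python
import math
import random


def mc_kl_estimate(student_probs, teacher_probs, num_samples, rng):
    """Monte Carlo estimate of KL(student || teacher) from sampled tokens.

    The result depends on which tokens are drawn, so it varies between
    calls unless the two distributions are identical.
    """
    vocab = range(len(student_probs))
    tokens = rng.choices(vocab, weights=student_probs, k=num_samples)
    log_ratios = [math.log(student_probs[t] / teacher_probs[t]) for t in tokens]
    return sum(log_ratios) / num_samples


def hidden_state_loss(student_layers, teacher_layers, layer_map):
    """Mean squared error between paired hidden states (assumed loss).

    student_layers / teacher_layers: one entry per layer, each a list of
    per-token vectors computed on the same rollout.
    layer_map: (student_layer_index, teacher_layer_index) pairs to align.
    No vocabulary sampling is involved, so the value is deterministic
    for a given rollout.
    """
    total, count = 0.0, 0
    for s_idx, t_idx in layer_map:
        for s_vec, t_vec in zip(student_layers[s_idx], teacher_layers[t_idx]):
            total += sum((s - t) ** 2 for s, t in zip(s_vec, t_vec))
            count += len(s_vec)
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    student_probs = [0.7, 0.2, 0.1]
    teacher_probs = [0.5, 0.3, 0.2]
    # Output-space signal: a noisy estimate that changes with the samples drawn.
    print(mc_kl_estimate(student_probs, teacher_probs, num_samples=4, rng=rng))

    # Hidden-state signal: two layers, two tokens, width-2 vectors.
    student = [[[1.0, 2.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
    teacher = [[[1.0, 1.0], [0.0, 1.0]], [[0.5, 0.0], [1.0, 2.0]]]
    print(hidden_state_loss(student, teacher, layer_map=[(0, 0), (1, 1)]))
```

In this toy setup the first function needs sampled tokens and the LM-head probabilities, while the second only needs the hidden states from the shared rollout, which is the sense in which a representation-space objective can bypass the LM head.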
Authors: Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
PDF URL: https://arxiv.org/pdf/2606.06021.pdf
Arxiv URL: https://arxiv.org/abs/2606.06021
Arxiv ID: 2606.06021
CoolPaper URL: https://papers.cool/arxiv/2606.06021
Published: 2026-06-05T01:52:37.189Z
Updated: 2026-06-05T01:52:37.189Z