ArXiv Domain 2026-06-08

Data source: ArXiv Domain

LLM Domain Papers

1. Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Abstract:Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.

LLM Analysis

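Q: What problem does this paper try to solve?

The paper addresses cross-lingual factual inconsistency. Although large language models (LLMs) are trained predominantly on English data and encode rich world knowledge, they are markedly inconsistent when that latent knowledge is accessed and expressed through non-English interfaces.

The core problem breaks down as follows:

Knowledge access bottleneck: failures of cross-lingual factual recall stem not from missing knowledge but from the language transition phase. The model can usually retrieve the fact correctly in its middle layers (often in English form) but fails in later layers to decode that latent concept reliably into the correct target-language tokens.
Limits of continual pretraining: lightweight continual pretraining (CPT) improves translation fluency but yields little benefit on the more demanding task of cross-lingual factual recall, suggesting that it mainly improves surface fluency rather than the deeper knowledge access mechanism.
Comparison of post-training methods: the study hypothesizes that separating "representation alignment" from "cross-lingual knowledge access" makes it possible to improve multilingual factual recall without large-scale re-pretraining. To test this, the paper compares supervised fine-tuning (SFT) with reinforcement learning via Group Relative Policy Optimization (GRPO), checking whether a consistency-driven reward signal promotes shared cross-lingual representations more effectively.

In short, the work asks: how can English-dominant LLMs reliably access their already-encoded latent knowledge through non-English interfaces, without expensive full retraining?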

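Q: What related work is there?

The related work covers three areas:

1. Cross-Lingual Factual Recall

Recent mechanistic studies of large language models indicate that the main bottleneck in cross-lingual factual recall is not missing knowledge but failure in the language transition phase. The failure can occur in early layers (the model fails to map the prompt into a shared, English-like concept space) or, more commonly, in the final layers (the latent concept is not decoded into target-language tokens). Query-level interventions such as "subject injection" or English pivoting can serve as stopgap mitigations at inference time, but their success indicates that cross-lingual alignment is the key bottleneck for multilingual consistency, which motivates more durable model-level adaptation.

2. Parallel Data and Continual Pretraining

The conventional way to extend the multilingual ability of English-centric models is continual pretraining (CPT), which is computationally expensive and often causes catastrophic forgetting of the original model's English abilities. Recent work identifies parallel data as the most significant source of multilingual ability during pretraining. However, Shen et al. (2025) report an important limitation: CPT on parallel corpora improves translation but often fails to substantially improve harder tasks such as cross-lingual factual recall. This suggests that CPT mainly improves the alignment of internal representations, while the model still struggles to reliably access the knowledge encoded in those aligned representations through non-English interfaces, producing inconsistent multilingual outputs.

3. Post-Training via Reinforcement Learning

Reinforcement learning (RL) as a post-training method offers a route to adapting models to new domains and improving alignment with task-specific objectives. Mechanistic analyses indicate that supervised fine-tuning (SFT) and RL play complementary roles in post-training: SFT expands the model's behavioral search space, while RL "compresses" it, concentrating probability mass on consistent and correct reasoning paths. For monolingual reasoning consistency, existing work (such as DeReason and CC-Learn) stresses that RL is most effective after an SFT "warm-up" stage, which mitigates RL's cold-start problem. GRPO (Group Relative Policy Optimization) was popularized by DeepSeek-R1 and uses verifiable consistency rewards to incentivize reasoning ability. The approach has recently been validated in multilingual settings, applied to retrieval-augmented generation (RAG) tasks.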

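Q: How does the paper solve the problem?

The paper tackles cross-lingual factual inconsistency along several lines:

1. A dedicated dataset (PolyFact)

PolyFact is a fully parallel multilingual multiple-choice QA dataset of 100K Wikidata-grounded facts covering 12 languages, both high-resource and low-resource. Every fact is fully parallel across all 12 languages, which isolates language-interface effects from differences in underlying knowledge.

2. Consistency-driven reinforcement learning (GRPO)

Post-training uses Group Relative Policy Optimization (GRPO). The core innovation is a reward function that encourages cross-lingual consistency.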

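For each fact, the model independently generates answers $\{y_\ell\}_{\ell=1}^{L}$ in $L = 12$ languages, and the reward is computed as:

$$R = \sum_{\ell=1}^{L} r_\ell + \mathbb{I}\left[\forall \ell,\ r_\ell = 1\right]$$

where the per-language reward $r_\ell$ is defined as:

$$r_\ell = \begin{cases} +1 & \text{if } y_\ell \text{ is the correct option} \\ -0.5 & \text{if } y_\ell \text{ is a hallucination (not among the options)} \\ 0 & \text{if } y_\ell \text{ is an incorrect option} \end{cases}$$

The key design element is the consistency bonus (the last term of $R$): an extra +1 is awarded if and only if all 12 languages are answered correctly. This directly incentivizes the model to develop shared internal representations rather than merely memorizing surface language patterns.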
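
The reward can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, answers are assumed to be option labels such as "A" to "D", and the labels are assumed to be shared across all languages.

```python
from typing import Sequence


def per_language_reward(answer: str, correct: str, options: Sequence[str]) -> float:
    """r_l: +1 for the correct option, 0 for a wrong option, -0.5 for a hallucination."""
    if answer == correct:
        return 1.0
    if answer in options:
        return 0.0
    return -0.5  # the answer is not one of the offered options


def consistency_reward(answers: Sequence[str], correct: str, options: Sequence[str]) -> float:
    """R: sum of per-language rewards plus a +1 bonus when every language is correct."""
    rewards = [per_language_reward(a, correct, options) for a in answers]
    bonus = 1.0 if all(r == 1.0 for r in rewards) else 0.0
    return sum(rewards) + bonus
```

With 12 languages, an all-correct fact scores 13, while a single wrong option drops the total to 11: one miss costs both its own point and the bonus.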

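3. Comparison of training strategies

Three methods are compared systematically:

Lightweight continual pretraining (CPT): run on a parallel corpus (TED2025), and found to yield limited gains for factual recall
Supervised fine-tuning (SFT): uses a joint classification plus consistency objective (with a KL-divergence term), but mainly leads to surface memorization
GRPO: clearly outperforms SFT, improving cross-lingual consistency and generalization to unseen languages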
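
GRPO, as commonly described, scores each sampled completion relative to the other samples drawn for the same prompt instead of using a learned value function. The sketch below shows that group normalization on scalar rewards. The helper name, the use of the population standard deviation, and the epsilon term are assumptions; this summary does not say how the paper forms its groups or normalizes rewards.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that beat the group average receive a positive advantage and are reinforced; a group whose samples all earn the same reward contributes no learning signal.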

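4. Mechanism-level reorganization of representations

Analyses with LAHIS (Language Attention Head Importance Score) and LAPE (Language Activation Probability Entropy) show that GRPO fundamentally reshapes multilingual processing:

Reduced language specialization: GRPO reduces the number of language-specific neurons in MLP layers (for example, Bengali-specific neurons fall by 36.1% and Swahili-specific neurons by 31.2%)
Delayed language differentiation: language-specific processing is pushed from early layers to late layers (layers 28-32), preserving a larger language-agnostic intermediate representation space
More cross-lingual sharing: overlap between the attention heads used by different languages increases, especially for typologically distant language pairs (such as Japanese-Chinese)

5. Separating representation alignment from knowledge access

The core hypothesis is that cross-lingual factual inconsistency stems from failed knowledge access rather than failed representation alignment. The approach therefore bypasses expensive full retraining and directly optimizes the cross-lingual knowledge retrieval path through the consistency reward, so that the model can reliably convert its language-agnostic concept space into target-language output.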

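Q: What experiments does the paper run?

The experiments cover three levels: model training, multi-faceted evaluation, and mechanistic analysis.

1. Models and training configurations

Experiments use two 7B-scale decoder models with different multilingual ability:

OLMo-2-1124-7B: an English-dominant baseline with a marked gap between English and non-English performance
Qwen-2.5-7B: a comparison model with stronger multilingual ability

Six training variants separate the effects of the different techniques:

Variant	Description
Base	Original pretrained model
CPT	Lightweight continual pretraining (235.5M parallel corpus)
SFT	Supervised fine-tuning (trained from Base)
GRPO	Reinforcement learning (trained from Base)
CPT + SFT	CPT followed by SFT
CPT + GRPO	CPT followed by GRPO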

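2. Evaluating cross-lingual factual recall

Models are tested on three complementary benchmarks:

(1) PolyFact (in-domain test)

Test set: 2,500 facts covering 12 languages
Method: length-normalized conditional log-likelihood scoring, selecting the option with the highest average log-probability
Purpose: measure direct performance on the training task format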
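
Length-normalized scoring can be illustrated with precomputed token log-probabilities. This is a sketch only: the function name and the input layout (a mapping from option label to the log-probabilities of that option's tokens given the question) are assumptions, and obtaining those log-probabilities from a model is out of scope here.

```python
from typing import Dict, List, Tuple


def pick_option(option_token_logprobs: Dict[str, List[float]]) -> Tuple[str, Dict[str, float]]:
    """Return the option with the highest average token log-probability."""
    averages = {
        label: sum(logprobs) / len(logprobs)  # dividing by length removes the bias toward short options
        for label, logprobs in option_token_logprobs.items()
    }
    best = max(averages, key=averages.get)
    return best, averages
```

Normalizing by length matters because summed log-probabilities always favor shorter options: a one-token option can win on the raw sum even when a longer option is more likely per token.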

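(2) KLAR (cross-lingual transfer test)

Languages: 6 seen in training (en, es, fr, ru, zh, ja) and 11 unseen (ar, ca, el, fa, he, hu, ko, nl, tr, uk, vi)
Method: 3-shot greedy generation, at most 10 new tokens, string matching after NFC normalization
Key difference: requires free-form answer generation (not multiple choice), testing whether the model has truly acquired cross-lingual retrieval rather than pattern matching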
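
NFC normalization makes visually identical strings compare equal even when they are encoded with different code point sequences, which matters for accented and non-Latin answers. The sketch below uses Python's standard `unicodedata` module. The function name, the whitespace stripping, and the choice of exact (rather than substring or case-insensitive) matching are assumptions, since the summary does not give the benchmark's exact matching rule.

```python
import unicodedata


def nfc_match(prediction: str, gold: str) -> bool:
    """Compare two answers after Unicode NFC normalization and whitespace stripping."""
    def normalize(text: str) -> str:
        return unicodedata.normalize("NFC", text).strip()

    return normalize(prediction) == normalize(gold)
```

For example, "e" followed by a combining acute accent (U+0301) and the precomposed "é" (U+00E9) differ as raw strings but match after normalization.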

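(3) Global-MMLU (general knowledge and reasoning)

Coverage: the 12 target languages
Method: standard log-likelihood scoring
Purpose: test generalization to harder tasks that require multi-step reasoning and domain knowledge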

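3. Mechanistic interpretability analyses

(1) LAHIS (Language Attention Head Importance Score) analysis

Computed separately for each of the 12 languages on the TED2025 parallel corpus
Identifies the top 2% most important attention heads for each language (20 of 1,024 heads)
Tracks how the distribution of these heads changes across the baseline, SFT, and GRPO models
Metrics: cross-lingual head overlap rate, changes in layer distribution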
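
Given per-head importance scores, the top-2% selection and a head overlap rate can be sketched as follows. How LAHIS scores are computed is not covered here; the scores are taken as input. The function names and the overlap definition (shared heads divided by the size of the first set) are assumptions, because the summary does not state how the paper normalizes overlap.

```python
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def top_heads(scores: Dict[Head, float], fraction: float = 0.02) -> Set[Head]:
    """Select the highest-scoring fraction of heads for one language."""
    k = max(1, int(len(scores) * fraction))  # 1,024 heads at 2% gives 20 heads
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def head_overlap(heads_a: Set[Head], heads_b: Set[Head]) -> float:
    """Fraction of language A's top heads that are also top heads for language B."""
    return len(heads_a & heads_b) / len(heads_a)
```

Because both sets have the same size here, the overlap is symmetric; with unequal set sizes a Jaccard index would be the more natural choice.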

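(2) LAPE (Language Activation Probability Entropy) analysis

Defines, for each MLP neuron, its activation frequency under each language
Computes the Shannon entropy: $\mathrm{LAPE}(i,j) = -\sum_{k} p'(i,j,k)\,\log p'(i,j,k)$
Identifies language-specific neurons as those in the lowest 1% of entropy
Analyzes their distribution across layers (early vs. late layers)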
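
The entropy computation for a single neuron can be sketched as below. The primed probabilities are assumed to be the per-language activation frequencies normalized to sum to 1 across languages; that normalization step and the function name are assumptions, as the summary gives only the entropy formula.

```python
import math
from typing import Sequence


def lape(activation_probs: Sequence[float]) -> float:
    """Shannon entropy of one neuron's activation probabilities across languages."""
    total = sum(activation_probs)
    if total <= 0:
        raise ValueError("neuron never activates; entropy is undefined")
    normalized = [p / total for p in activation_probs]  # p' in the formula
    return -sum(p * math.log(p) for p in normalized if p > 0)
```

A neuron that fires equally often for all 12 languages reaches the maximum entropy of log 12, while a neuron that fires for only one language has entropy 0, which is why the lowest-entropy neurons are treated as language-specific.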

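(3) Layer-rank analysis

Tracks, layer by layer, the rank of the correct answer token in the target language and in English
Visualizes where knowledge is retrieved (middle layers) and where the language transition fails (final layers)
Compares the transition paths of the baseline and GRPO models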
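
The rank tracking itself is simple once per-layer vocabulary scores are available. The sketch below assumes those scores come from projecting each layer's hidden state onto the vocabulary (a logit-lens-style readout), which is an assumption on my part: the summary does not say how the paper obtains per-layer scores. Function names are hypothetical.

```python
from typing import List, Sequence


def token_rank(logits: Sequence[float], token_id: int) -> int:
    """1-based rank of a token: 1 means it has the highest score in the vocabulary."""
    target = logits[token_id]
    return 1 + sum(1 for value in logits if value > target)


def rank_trajectory(per_layer_logits: Sequence[Sequence[float]], token_id: int) -> List[int]:
    """Rank of the same token at every layer, from first layer to last."""
    return [token_rank(layer, token_id) for layer in per_layer_logits]
```

Running this once for the English answer token and once for the target-language token gives two trajectories; a failure in the language transition would show the English token reaching rank 1 in the middle layers while the target-language token never does.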

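4. Summary of key findings

Experiment type	Main finding
Performance comparison	GRPO outperforms SFT on all three benchmarks; on KLAR free-form generation, SFT even degenerates into outputting "1" or "2"
Cross-lingual transfer	GRPO improves notably on unseen languages (e.g., from 13.3% to 17.6% on OLMo-2); SFT shows almost no transfer
Script effects	GRPO transfers well to Latin-script and Arabic-Persian-script languages, but has limited effect on isolated scripts such as Greek, Korean, and Hebrew
Neuron reorganization	Under GRPO, English-specific neurons increase by 38.2%, while Bengali-specific and Swahili-specific neurons decrease by 36.1% and 31.2%, respectively
Processing delay	GRPO pushes language-specific processing from layer 0 to layers 28-32, preserving a larger language-agnostic intermediate space
Head sharing	SFT mainly increases head sharing among Indo-European languages (e.g., German-French from 25% to 90%); GRPO raises sharing more evenly for typologically distant pairs (e.g., Chinese-Japanese from 50% to 80%)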

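Q: What could be explored further?

Based on the paper's limitations and experimental findings, the following directions merit further exploration:

1. Model scale and architecture

Larger and smaller models: the experiments are limited to 7B-parameter models. GRPO's effectiveness should be verified at larger (e.g., 70B+) and smaller (e.g., 1-3B) scales, along with its applicability to Mixture-of-Experts (MoE) architectures
Different pretraining distributions: explore models whose pretraining corpora contain a very low or very high share of English, to test robustness under extreme multilingual imbalance

2. Deep reasoning and generalization to complex tasks

Beyond fact retrieval: the method mainly improves cross-lingual factual access, with limited gains on tasks that need multi-step reasoning, mathematical derivation, or domain knowledge (such as Global-MMLU). Combining GRPO with Chain-of-Thought training could help align deeper cross-lingual reasoning
Open-domain QA: validate the consistency reward on open-ended generation tasks rather than multiple-choice or constrained answers

3. Script-specific and language-specific optimization

Script-isolated languages: the experiments show limited GRPO transfer to script-isolated languages such as Greek, Korean, and Hebrew. Future work could explore:
Explicit cross-lingual mapping mechanisms for language pairs with no shared subword units
Visual encoders for handling different writing systems, or subword-level consistency rewards
Low-resource language extension: apply the method to extremely low-resource languages absent from the training data (such as less widely spoken Niger-Congo and Austronesian languages)

4. Reward function and training strategy

Fine-grained credit assignment: the current reward is binary or ternary (correct / incorrect / hallucination) and gives no partial credit. Future work could explore:
Continuous rewards based on semantic similarity (such as answer distance in embedding space)
Differentiated feedback for "right language, wrong information" versus "wrong language, right information", targeting the failure modes of different language types
Curriculum learning: borrow the difficulty-aware curricula of work such as DeReason to design progressive training from easy to hard language pairs

5. Data bias and quality

Proper-noun localization: address the English-centric bias inherited from Wikidata (for example, person and place names defaulting to their English forms) by building training data that favors native proper-noun forms, reducing "English leak"
Dynamic data filtering: develop automated pipelines that detect and filter highly ambiguous relations (such as "country of origin" and "place of birth") in real time, rather than relying only on the static POLYFACT-CLEAN filter list

6. Deeper mechanistic interventions

Representation disentanglement: investigate whether the GRPO-induced surge in English neurons (English-specific neurons increase by 38.2%) reflects a genuinely shared latent space or merely a strengthening of English as an "anchor language"
Layer-wise intervention: building on the observed delayed language specialization (late discovery effect), explore inference-time interventions that dynamically suppress language-specific activations in early layers or strengthen cross-lingual routing in middle layers

7. Synergy with orthogonal methods

Combining with inference-time interventions: pair GRPO training with query-level interventions such as English pivoting or subject injection, and test whether the combination is synergistic rather than merely additive
Retrieval-augmented generation (RAG): apply the consistency reward in multilingual RAG so that retrieved external knowledge stays consistent across languages during generation (echoing the direction of Qi et al. 2026)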

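Q: Summarize the paper's main content

The paper studies cross-lingual factual inconsistency: large language models (LLMs) encode rich knowledge in English but struggle to access it reliably through non-English interfaces. It proposes a new dataset and a consistency-driven reinforcement learning method, and uses systematic mechanistic analysis to show how the method works.

1. Background and core problem

Failures of cross-lingual factual recall usually stem not from missing knowledge but from the language transition phase: the model often retrieves the fact correctly in English form in its middle layers but cannot reliably decode it into target-language tokens in later layers. Lightweight continual pretraining (CPT) improves translation fluency but yields limited gains on factual recall, suggesting that it improves only surface fluency rather than the deeper knowledge access mechanism.

2. Proposed solution

PolyFact dataset: a fully parallel multilingual multiple-choice QA dataset of 100,000 Wikidata-grounded facts covering 12 languages (English, German, Spanish, French, Portuguese, Indonesian, Russian, Arabic, Chinese, Japanese, Swahili, Bengali). Every fact is strictly parallel across all languages, isolating language-interface effects.

Consistency-driven GRPO: post-training uses Group Relative Policy Optimization (GRPO) with a cross-lingual consistency reward. For each fact, the model generates answers $\{y_\ell\}_{\ell=1}^{L}$ in $L = 12$ languages, and the reward is:
$$R = \sum_{\ell=1}^{L} r_\ell + \mathbb{I}\left[\forall \ell,\ r_\ell = 1\right]$$
where $r_\ell$ is the per-language correctness reward (+1 for a correct answer, -0.5 for a hallucination, 0 for an incorrect option), and the last term is the consistency bonus: an extra +1 awarded only when all languages are answered correctly. This design directly incentivizes the model to develop shared internal representations rather than merely memorizing surface language patterns.

3. Experimental results and findings

Experiments on OLMo-2-1124-7B and Qwen-2.5-7B show:

Performance advantage: GRPO significantly outperforms supervised fine-tuning (SFT) on all three benchmarks: PolyFact (in-domain), KLAR (free-form generation), and Global-MMLU (general knowledge). On KLAR, SFT even degenerates into outputting option numbers ("1" or "2"), indicating that it learns pattern matching rather than genuine cross-lingual retrieval, whereas GRPO generalizes to unseen languages.
Limits of CPT: lightweight CPT on parallel data yields limited gains in factual recall and adds no further gain when combined with GRPO, supporting the hypothesis that representation alignment and knowledge access are separable.

4. Mechanistic interpretability analysis

LAHIS (attention head importance) and LAPE (neuron activation entropy) analyses show that GRPO fundamentally reshapes multilingual processing:

Reduced language specialization: GRPO substantially reduces language-specific neurons (Bengali -36.1%, Swahili -31.2%) while increasing English-specific neurons (+38.2%), reflecting its "compression" of the behavior space onto a more stable representational backbone.
Delayed language differentiation: GRPO pushes language-specific processing from early layers (around layer 0) to late layers (layers 28-32), preserving a larger language-agnostic space in the middle layers. Processing for Latin-script languages moves earlier, while non-Latin-script languages (Arabic, Japanese, Chinese) concentrate in the final layers.
More cross-lingual sharing: GRPO increases attention-head overlap across languages, especially for typologically distant pairs (e.g., Japanese-Chinese head overlap rises from 50% to 80%), whereas SFT mainly increases sharing among Indo-European languages.

5. Conclusions and contributions

The paper shows that cross-lingual factual inconsistency stems mainly from failure of the knowledge access mechanism rather than from missing knowledge. Consistency-driven reinforcement learning, without large-scale retraining, can effectively improve cross-lingual factual recall. By reorganizing multilingual routing, reducing language specialization, and promoting shared representations, GRPO achieves better cross-lingual consistency and better generalization to unseen languages than SFT. The authors release the code, models, and the PolyFact dataset as a basis for follow-up research on multilingual LLMs.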

Authors: Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O’Donnell, Ektor Oikonomidis Doumpas, Eduardo Sanchez, Yao Lu, Pontus Stenetorp

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06586.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06586

Published: 2026-06-08T01:58:35.482Z

2. Re-Centering Humans in LLM Personalization

Abstract: Despite growing interest, most evaluations of large language models’ (LLMs’) personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

LLM Analysis

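Q: What problem does this paper try to solve?

The paper addresses the systematic gap between synthetic data and real human data in evaluating large language model (LLM) personalization.

Specifically, it studies the following core problems:

1. Limits of synthetic-data evaluation

Existing evaluations of LLM personalization rely mainly on synthetic data (simulated user profiles, synthetic conversations, automatic LLM judging), but synthetic data differs markedly from real human responses, so it cannot accurately reflect how systems perform for real users.

2. Concrete defects in the three-stage personalization pipeline

Using a dataset built from real human conversations (550 conversations from 50 real users), the paper exposes limitations at each of three stages:

Stage 1 (user attribute extraction): models extract stable user attributes from real human conversations much less accurately than from synthetic data, with an additional 22% error rate. The main problems include overgeneralization and insufficient evidence.
Stage 2 (attribute relevance matching): when selecting attributes relevant to the current prompt, models disagree systematically with human judgments (Cohen's kappa ≈ 0.30) and over-select by 20-40%, that is, they judge irrelevant attributes to be relevant.
Stage 3 (personalized response generation): even when given the relevant attributes, LLM-generated personalized responses are rated by humans as no better than generic responses in 54.6% of cases, while LLM judges systematically give inflated scores that correlate only weakly with human preferences (Spearman rho ≈ 0.3-0.4).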
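
For readers unfamiliar with the agreement statistic cited for stage 2, Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance. The sketch below implements the standard two-rater definition; the function name is hypothetical, and any labels passed to it are illustrative rather than drawn from the paper's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)
```

A rater that marks many more attributes as relevant than humans do can still agree on 70% of items while achieving a much lower kappa, because over-selection inflates chance agreement on the positive class.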

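3. Misalignment between automated evaluation and human judgment

The paper finds that LLM judges show a surface-level preference when evaluating personalization quality: they over-reward explicit mentions of user attributes (such as "Given your interest in machine learning..."), whereas humans are not persuaded by this kind of mechanical personalization.

Contributed solutions

To address these problems, the paper proposes lightweight training-based interventions:

A RoBERTa verifier that filters extracted attributes (raising F1 to 0.726)
Supervised classification and GRPO reinforcement learning to optimize relevance selection (raising F1 to 0.641)
For the third stage (response generation), the paper notes that reward modeling remains challenging and calls for user-centered, user-specific preference models rather than a global judge

In short, the paper seeks to re-anchor LLM personalization evaluation from synthetic data to real human interaction data, expose the actual limitations of current systems, and lay the groundwork for personalization systems that truly meet human needs.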

Q: What related work is there?

Authors: Lechen Zhang, Jiarui Liu, Tal August

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06614.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06614

Published: 2026-06-08T01:58:35.482Z

3. UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
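
A KS@N-style score can be sketched with a two-sample Kolmogorov-Smirnov test written in plain Python. This is an illustration under assumptions, not the benchmark's implementation: the function names are hypothetical, the significance level of 0.05 is assumed, and the rejection threshold uses the standard large-sample approximation for the two-sample test, whereas the paper may use an exact p-value.

```python
import math
from bisect import bisect_right
from typing import List, Sequence, Tuple


def ks_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of two samples."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for value in xs + ys:
        cdf_x = bisect_right(xs, value) / len(xs)
        cdf_y = bisect_right(ys, value) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


def fails_to_reject(model: Sequence[float], truth: Sequence[float], alpha: float = 0.05) -> bool:
    """True when the KS test does not reject that both samples share a distribution."""
    n, m = len(model), len(truth)
    # Asymptotic critical value: c(alpha) * sqrt((n + m) / (n * m)).
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(model, truth) <= critical


def ks_at_n(problems: List[Tuple[Sequence[float], Sequence[float]]], alpha: float = 0.05) -> float:
    """Share of problems whose model sample passes the test against ground truth."""
    passed = sum(fails_to_reject(model, truth, alpha) for model, truth in problems)
    return passed / len(problems)
```

A model that collapses to a single answer, such as returning 0.5 a hundred times for a uniform target, produces a large CDF gap and is rejected, which is the mode-collapse failure the benchmark is designed to expose.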

Authors: Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Liang Luo, Ellie Dingqiao Wen, Lele Wang, Giuseppe Carenini, Peter West

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06622.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06622

Published: 2026-06-08T01:58:35.482Z

4. How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

Abstract: Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework’s falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.

Authors: Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06635.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06635

Published: 2026-06-08T01:58:35.482Z

5. CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

Abstract:Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. By employing an iterative Creator-Reviewer pipeline, a creator agent’s output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.

Authors: Jakub Bąba, Jarosław Chudziak

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06646.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06646

Published: 2026-06-08T01:58:35.482Z

6. The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

Abstract:The mechanisms behind LLMs’ broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries. We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM. Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning. On Llama-3.1-8B finetuned on the legal domain, TReFT achieves 33.5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, tool use, and refusal (off-topic generalization is reduced by 54.3% on average), supporting the Piggyback Hypothesis. Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning. It also calls for further study of how shared input features can piggyback model behavior across domains.

Authors: Jiachen Zhao, Zhengxuan Wu, Aryaman Arora, Yiyou Sun, David Bau, Weiyan Shi

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06667.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06667

Published: 2026-06-08T01:58:35.482Z

7. What Do People Actually Want From AI? Mapping Preference Plurality

Abstract:Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people’s preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by “truthfulness”, they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do “by default” versus “if requested”) that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users’ clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

Authors: Julia Sepúlveda Coelho, Scott A. Hale

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06674.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06674

Published: 2026-06-08T01:58:35.482Z

8. HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

Abstract:Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK’s court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $\kappa = 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at this https URL.

Authors: Xi Xuan, Wenxin Zhang, Yufei Zhou, King-kui Sin, Chunyu Kit

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06679.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06679

Published: 2026-06-08T01:58:35.482Z

9. Signal-Driven Observation for Long-Horizon Web Agents

Abstract:Web agents operating over long horizons ingest raw DOM and accessibility trees — routinely tens of thousands of tokens — at every action step, causing progressive context degradation that erodes reasoning well before tasks complete. We argue that this coupling of observation frequency to action frequency is an architectural mistake. Drawing on the insight from Recursive Language Models that querying a document outperforms reading it wholesale, we propose Signal-Driven Observation (SDO): a dedicated sub-call reads the full DOM but returns only task-relevant elements and their selectors, and is re-invoked only when a lightweight signal detector fires — triggered by URL transitions, newly visible interactive elements, action failures, or exogenous browser events. We outline the open problems SDO introduces and call on the community to treat observation compression as a core architectural decision in web agent design.
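
The trigger logic described in the abstract can be sketched as a small predicate that decides when to re-run the expensive observation sub-call. Everything here is hypothetical: the `PageState` fields, the function name, and the rule that any single signal is enough to fire are illustrative assumptions, since the abstract names the four signal types but not how they are detected or combined.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class PageState:
    """Cheap per-step snapshot; a hypothetical structure, not from the paper."""
    url: str
    interactive_ids: FrozenSet[str]
    last_action_failed: bool = False
    browser_events: Tuple[str, ...] = ()  # e.g. dialogs or downloads, assumed names


def should_reobserve(previous: Optional[PageState], current: PageState) -> bool:
    """Fire when any of the four signals named in the abstract is present."""
    if previous is None:
        return True  # nothing observed yet
    if current.url != previous.url:
        return True  # URL transition
    if current.interactive_ids - previous.interactive_ids:
        return True  # newly visible interactive elements
    if current.last_action_failed:
        return True  # action failure
    return bool(current.browser_events)  # exogenous browser events
```

When the predicate returns False, the agent keeps acting on the previously extracted elements and selectors, so observation cost scales with page changes rather than with the number of actions.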

Authors: Shubham Gaur, Ian Lane

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06708.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06708

Published: 2026-06-08T01:58:35.482Z

10. Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Abstract:We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

Authors: Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.06712.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06712

Published: 2026-06-08T01:58:35.482Z

Agent Domain Papers

1. Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation

Abstract: Machine learning systems deployed in high stakes socioeconomic settings routinely display bias. We formalize bias as a symmetry breaking operation: a classifier is fair if its outputs remain invariant under the counterfactual operation of switching a sensitive attribute, with merit features held fixed. We implement loss based regularization as a symmetry restoring mechanism and evaluate the framework on four synthetic datasets with varying levels of noise, correlation, and bias. The framework achieves upwards of 90% violation reduction, with accuracy costs around 5%. This framework does not require causal graph knowledge, is computationally lightweight, and generalizes to any sensitive attribute definable as a bit-flip, making it suitable for contexts where local sources of discrimination remain absent from mainstream benchmarks.
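
The invariance condition can be turned into a penalty by comparing each prediction with the prediction for the same merit features and a flipped sensitive bit. The sketch below is one possible reading, not the paper's method: the logistic model, the squared-gap penalty, the weight `lam`, and all function names are assumptions, since the abstract says only that a loss-based regularizer restores the symmetry.

```python
import math
from typing import Sequence, Tuple

Row = Tuple[Sequence[float], int, int]  # (merit features, sensitive bit, label)


def predict(w_merit: Sequence[float], w_sens: float, bias: float,
            merit: Sequence[float], sens: int) -> float:
    """Logistic score from merit features and a binary sensitive attribute."""
    z = bias + w_sens * sens + sum(w * x for w, x in zip(w_merit, merit))
    return 1.0 / (1.0 + math.exp(-z))


def symmetry_penalty(w_merit, w_sens, bias, rows: Sequence[Row]) -> float:
    """Mean squared change in output when only the sensitive bit is flipped."""
    gaps = [
        (predict(w_merit, w_sens, bias, m, a) - predict(w_merit, w_sens, bias, m, 1 - a)) ** 2
        for m, a, _ in rows
    ]
    return sum(gaps) / len(gaps)


def regularized_loss(w_merit, w_sens, bias, rows: Sequence[Row], lam: float = 1.0) -> float:
    """Binary cross-entropy plus lam times the symmetry penalty."""
    bce = 0.0
    for m, a, y in rows:
        p = predict(w_merit, w_sens, bias, m, a)
        bce -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return bce / len(rows) + lam * symmetry_penalty(w_merit, w_sens, bias, rows)
```

In this toy model the penalty is exactly zero when the weight on the sensitive bit is zero. A real system also has to stop the bit leaking through correlated merit features, which this sketch does not address.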

Authors: Nishit Singh

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06514.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06514

Published: 2026-06-08T02:06:01.800Z

2. DiBS: Diffusion-Informed Branch Selection

Abstract:Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku follows two dominant approaches: traditional heuristic solvers and deep learning solvers. However, they suffer from two complementary limitations: learning-based solvers lack hard correctness guarantees, while complete symbolic solvers are still prone to long-tail search. To address these shortcomings, we propose a novel diffusion-model-guided approach, termed DiBS, for branch selection during search. Specifically, DiBS keeps the symbolic solver complete and uses the diffusion model as a branch-ordering guide. The core method ranks candidate values under the current partial assignment and a lightweight consistency signal. Furthermore, we provide an in-depth theoretical proof that explains how and why it works. Experiments on the challenging Royle 17-clue Sudoku benchmark show that DiBS substantially reduces search cost relative to strong heuristic baselines, especially in nodes, backtracks, and long-tail percentiles. These results also confirm that learned global guidance is effective on hard instances where branch-order mistakes are most expensive. The code is available at this https URL.
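
The idea of keeping the solver complete while letting a learned model decide only the branch order can be sketched in Python as follows. This is an illustrative sketch, not the DiBS implementation: the diffusion model is replaced by a hypothetical score table `scores[r][c][v]`, the fewest-candidates cell choice is an assumption, and all function names are made up for the example. Because the guide only permutes the candidate list, every candidate is still tried and completeness is preserved.

```python
PUZZLE = [
    "530070000", "600195000", "098000060",
    "800060003", "400803001", "700020006",
    "060000280", "000419005", "000080079",
]


def candidates(grid, r, c):
    """Values not yet used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [v for v in range(1, 10) if v not in used]


def ascending_guide(grid, r, c, cand):
    """Baseline ordering: try values in increasing order."""
    return list(cand)


def make_table_guide(scores):
    """Order candidates by a hypothetical score table scores[r][c][v], highest first."""
    def guide(grid, r, c, cand):
        return sorted(cand, key=lambda v: -scores[r][c][v])
    return guide


def solve(grid, guide, stats):
    """Complete backtracking search; `guide` only reorders the candidate values."""
    best = None
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                cand = candidates(grid, r, c)
                if best is None or len(cand) < len(best[2]):
                    best = (r, c, cand)
    if best is None:
        return True  # no empty cell left
    r, c, cand = best
    for v in guide(grid, r, c, cand):
        stats["nodes"] += 1
        grid[r][c] = v
        if solve(grid, guide, stats):
            return True
        grid[r][c] = 0
    stats["backtracks"] += 1
    return False
```

Calling `solve` with a mutable 9x9 grid of integers (0 for empty) and a `stats` dict with `nodes` and `backtracks` counters allows the two guides to be compared on the same puzzle. A guide that always ranks the correct value first needs exactly one node per empty cell and never backtracks, which is the best case any ordering can reach.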

Authors: Bo Liu, Yuan Xie, Yuan Gao, Xiaolong Luo, Peng Ye, Tao Chen, Fujun Han

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06518.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06518

Published: 2026-06-08T02:06:01.800Z

3. SafeGene: Reusable Adapters for Transferable Safety Alignment

Abstract:Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned-degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety-utility trade-off.
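
The three stages named in the abstract (discrepancy, layer selection, layer-wise coefficients) can be pictured with a toy Python sketch over flat per-layer weight lists. This is only a sketch under stated assumptions: the norm-based layer selection is a placeholder for the paper's data-aware selection, the coefficients are taken as given rather than recalibrated from few-shot data, and all names (`safety_vectors`, `select_layers`, `apply_safety`) are hypothetical.

```python
def safety_vectors(aligned, degraded):
    """Per-layer difference between an aligned model and its safety-degraded copy."""
    return {
        name: [a - d for a, d in zip(aligned[name], degraded[name])]
        for name in aligned
    }


def select_layers(vectors, top_k):
    """Placeholder selection: keep the top_k layers with the largest L2 norm.

    The paper describes a data-aware criterion; its details are not in the abstract.
    """
    norms = {name: sum(x * x for x in vec) ** 0.5 for name, vec in vectors.items()}
    return sorted(norms, key=norms.get, reverse=True)[:top_k]


def apply_safety(task_model, vectors, layers, coeffs):
    """Add coeffs[layer] * safety_vector[layer] to the selected layers of a task model."""
    patched = {name: list(w) for name, w in task_model.items()}
    for name in layers:
        c = coeffs.get(name, 1.0)
        patched[name] = [w + c * v for w, v in zip(patched[name], vectors[name])]
    return patched
```

The design point this illustrates is reuse: the vectors are computed once per model family, while only the small set of per-layer coefficients would change for each downstream task-adapted model.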

Authors: Yanghan Wang, Zhiqiang Kou, Fu Feng, Jing Wang, Xin Geng

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06519.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06519

Published: 2026-06-08T02:06:01.800Z

4. Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Abstract:Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs’ agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivated the development of formal languages (FLs). Inspired by this paradigm, we propose Lean4Agent, to the best of our knowledge the first framework that uses Lean4, a dependently typed FL, to model and verify agent behavior. Lean4Agent introduces FormalAgentLib, an extensible Lean4 library for formally modeling and verifying the semantic consistency of agent workflows under explicit assumptions, and for localizing execution-time failures revealed by trajectories. Building on FormalAgentLib, we further develop LeanEvolve, which applies results from FormalAgentLib to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of 11.94%, and LeanEvolve further improves SWE performance by 7.47% on average. Furthermore, Lean4Agent establishes a foundation for a new field of using an expressive dependently typed FL to formally model and verify agent behavior.

Authors: Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06523.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06523

Published: 2026-06-08T02:06:01.800Z

5. CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

Abstract:Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES-Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
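
Since the headline number for post-role classification is a macro-F1 score, a short Python sketch of that metric may help in reading it. The function below is a standard macro-F1 computation; the role names in the usage note are shortened, hypothetical labels and not the dataset's actual label strings.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging weights every role equally, so a model that labels every post with the most common role is penalized heavily. For example, with eight "partial" posts, one "error" post, and one "completion" post, predicting "partial" everywhere is right on 8 of 10 posts but only one of the three per-class F1 scores is non-zero.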

Authors: Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06526.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06526

Published: 2026-06-08T02:06:01.800Z

6. Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

Abstract:An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that do not strategically select when to attack. We study this capability, attack selection, in agentic settings by decomposing attack decisions into a start policy, which decides when an attacker should attack, and a stop policy, which decides when an attacker should abort an ongoing attack. Across two agentic settings, BashArena and LinuxArena, both policies substantially lower measured empirical safety without changing the underlying attack capability. At a 1% audit budget, our start policy reduces safety by 20pp on both BashArena and LinuxArena, and our stop policy reduces safety by 20pp on BashArena and 28pp on LinuxArena. These reductions should be interpreted as upper bounds on the effect of attack selection. Existing control evaluations may therefore yield overly optimistic safety estimates against selective attackers. We recommend that future evaluations, system cards, and safety cases elicit attack selection to produce more realistic safety estimates.

Authors: Catherine Ge-Wang, Tyler Crosse, Benjamin Hadad IV, Joachim Schaeffer, Ram Potham, Tyler Tracy

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06529.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06529

Published: 2026-06-08T02:06:01.800Z

7. CARVE-Q: Quantum-Proposed, Classically Certified Interactive Driving Repair

Abstract:The critical question after a correct driving veto is not only whether a maneuver is unsafe, but whether the blocked interaction admits a lawful, auditable, and responsibility-bounded repair. Prediction and game-theoretic planners can suggest plausible cooperation, yet they do not return a proof that the repair respects hard rules, right-of-way, cost allocation, and ego fallback. We introduce CARVE, Certified Affordable Repair of Vetoed maneuvers via Envelopes, a certificate architecture for prediction-free interactive repair. Given a vetoed maneuver, CARVE constructs a finite repair lattice and emits a structured certificate recording the binding rule, selected joint repair, right-of-way-scaled cooperation envelope, responsibility-weighted cost split, and ego-only fallback. This certificate view reveals the algorithmic bottleneck: multi-owner repair induces a product lattice $M = \prod_j |\mathcal{A}_j|$. We therefore introduce CARVE-Q, a verifier-shielded quantum-AI search layer that applies quantum minimum finding only to this black-box lattice while leaving all safety authority classical. In the conservative verifier-oracle model, exact classical minimum finding requires $\Theta(M)$ queries in the worst case, whereas Durr-Hoyer/Grover minimum finding uses $O(\sqrt{M})$ oracle queries with high probability. We prove verifier-shielded certificate soundness, priority non-elicitation, black-box query separation, and finite-precision reversible-oracle constructibility. We then demonstrate state-vector minimum finding on CARVE repair oracles up to 65,536 assignments and validate certificate preservation on Lanelet2-grounded INTERACTION replay with 100% right-of-way respect, 100% blame consistency, and zero priority false positives. The result is a trust-bounded quantum-AI pattern for certified autonomy: quantum proposes; CARVE certifies.
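
The query-count gap described in the abstract can be made concrete with a small Python sketch of the classical baseline. It is a sketch under assumptions, not the CARVE-Q code: `classical_minimum`, `lattice_size`, the `cost` function, and the `verifier` predicate are hypothetical names, and the factorization of 65,536 into eight owners with four options each is an illustrative choice (the abstract only states the total).

```python
import itertools
import math


def lattice_size(action_sets):
    """M = product of the per-owner option counts |A_j|."""
    return math.prod(len(a) for a in action_sets)


def classical_minimum(action_sets, cost, verifier):
    """Exhaustive minimum over the product lattice, counting one oracle query per point."""
    queries = 0
    best = None
    for joint in itertools.product(*action_sets):
        queries += 1  # each lattice point is checked once by the classical verifier
        if verifier(joint):
            c = cost(joint)
            if best is None or c < best[0]:
                best = (c, joint)
    return best, queries
```

The exhaustive loop always issues exactly M queries. For M = 65,536 the square root is 256, which gives the scale of the O(sqrt(M)) oracle-query bound quoted for Durr-Hoyer/Grover minimum finding; the constant factors and success probability of the quantum routine are not modeled here.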

Authors: Yifan Wang

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06531.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06531

Published: 2026-06-08T02:06:01.800Z

8. Position: Don’t Just “Fix it in Post”: A Science of AI Must Study Training Dynamics

Abstract:What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.

Authors: Stella Biderman, Mohammad Aflah Khan, Niloofar Mireshghallah, Catherine Arnett, Fazl Barez, Naomi Saphra

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06533.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06533

Published: 2026-06-08T02:06:01.800Z

9. Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

Abstract:We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by way of identifying and addressing various limitations that arise from memory latency and floating-point representation, as well as leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.

Authors: Cody J Christopher, Charles Gretton

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06641.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06641

Published: 2026-06-08T02:06:01.800Z

10. A Study of Parallel Continuous Local Search

Abstract:We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (PB) constraints. Here, the $n$-variable PB-satisfiability problem is relaxed to a continuous optimisation problem with a differentiable objective function on an $n$-dimensional hypercube. For satisfiable instances, the global minimisers of this optimisation problem correspond to satisfying assignments of the SAT problem at hand. We present several novel findings via empirical experiments: (i) redundant constraints can inhibit rather than accelerate convergence; (ii) CLS shows promise as a sub-solver in hybridised settings, quickly completing partial assignments; and (iii) local search rapidly converges to a stable distribution of solution quality (i.e., degree of satisfaction), due to saddle-dense objectives where additional solver steps yield diminishing returns. Our findings inform practical uses of CLS for SAT on modern accelerator hardware.
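
The relaxation described in the abstract can be sketched in plain Python for the simplest symmetric constraint, the disjunctive clause. This is a simplified stand-in, not the paper's solver: it handles only clauses (the paper covers general symmetric PB constraints), uses plain projected gradient descent, and runs its random restarts sequentially where an accelerator implementation would batch them. Each variable gets a value in [0, 1], read as the probability of being True, and the objective is the expected number of violated clauses under independent rounding. Clauses are lists of non-zero integers in the usual DIMACS style, with each variable appearing at most once per clause.

```python
import random


def literal_false_prob(p, lit):
    """Probability that `lit` is false when variable i is True with probability p[i-1]."""
    x = p[abs(lit) - 1]
    return 1.0 - x if lit > 0 else x


def objective(p, clauses):
    """Expected number of violated clauses; 0 exactly at satisfying Boolean points."""
    total = 0.0
    for clause in clauses:
        prod = 1.0
        for lit in clause:
            prod *= literal_false_prob(p, lit)
        total += prod
    return total


def gradient(p, clauses):
    """Exact gradient of the multilinear objective."""
    grad = [0.0] * len(p)
    for clause in clauses:
        false_probs = [literal_false_prob(p, lit) for lit in clause]
        for k, lit in enumerate(clause):
            others = 1.0
            for j, f in enumerate(false_probs):
                if j != k:
                    others *= f
            grad[abs(lit) - 1] += -others if lit > 0 else others
    return grad


def satisfies(assignment, clauses):
    return all(any(assignment[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses)


def cls_solve(n, clauses, starts=32, steps=200, lr=0.5, seed=0):
    """Projected gradient descent on [0, 1]^n from several random starting points."""
    rng = random.Random(seed)
    for _ in range(starts):
        p = [rng.random() for _ in range(n)]
        for _ in range(steps):
            g = gradient(p, clauses)
            p = [min(1.0, max(0.0, x - lr * gi)) for x, gi in zip(p, g)]
        assignment = [x >= 0.5 for x in p]
        if satisfies(assignment, clauses):
            return assignment
    return None  # no start rounded to a satisfying assignment
```

The restarts are independent of one another, which is what makes this style of search a natural fit for batched execution on accelerator hardware. A returned assignment is always checked against the original clauses, so the continuous search can fail to find a solution but cannot report a wrong one.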

Authors: Cody J Christopher, Charles Gretton

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06656.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06656

Published: 2026-06-08T02:06:01.800Z

Evaluation Domain Papers

1. Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation

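Abstract:Machine learning systems deployed in high-stakes socioeconomic settings routinely display bias. We formalize bias as a symmetry-breaking operation: a classifier is fair if its outputs remain invariant under the counterfactual operation of switching a sensitive attribute, with merit features held fixed. We implement loss-based regularization as a symmetry-restoring mechanism and evaluate the framework on four synthetic datasets with varying levels of noise, correlation, and bias. The framework achieves upwards of 90% violation reduction, with accuracy costs around 5%. This framework does not require causal graph knowledge, is computationally lightweight, and generalizes to any sensitive attribute definable as a bit-flip, making it suitable for contexts where local sources of discrimination remain absent from mainstream benchmarks.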

Authors: Nishit Singh

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06514.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06514

Published: 2026-06-08T02:12:22.841Z

2. DiBS: Diffusion-Informed Branch Selection

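Abstract:Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku follows two dominant approaches: traditional heuristic solvers and deep learning solvers. However, they suffer from two complementary limitations: learning-based solvers lack hard correctness guarantees, while complete symbolic solvers are still prone to long-tail search. To address these shortcomings, we propose a novel diffusion-model-guided approach, termed DiBS, for branch selection during search. Specifically, DiBS keeps the symbolic solver complete and uses the diffusion model as a branch-ordering guide. The core method ranks candidate values under the current partial assignment and a lightweight consistency signal. Furthermore, we provide an in-depth theoretical proof that explains how and why it works. Experiments on the challenging Royle 17-clue Sudoku benchmark show that DiBS substantially reduces search cost relative to strong heuristic baselines, especially in nodes, backtracks, and long-tail percentiles. These results also confirm that learned global guidance is effective on hard instances where branch-order mistakes are most expensive. The code is available at this https URL.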

Authors: Bo Liu, Yuan Xie, Yuan Gao, Xiaolong Luo, Peng Ye, Tao Chen, Fujun Han

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06518.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06518

Published: 2026-06-08T02:12:22.841Z

3. SafeGene: Reusable Adapters for Transferable Safety Alignment

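Abstract:Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned-degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety-utility trade-off.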

Authors: Yanghan Wang, Zhiqiang Kou, Fu Feng, Jing Wang, Xin Geng

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06519.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06519

Published: 2026-06-08T02:12:22.841Z

4. Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

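Abstract:Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs’ agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivated the development of formal languages (FLs). Inspired by this paradigm, we propose Lean4Agent, to the best of our knowledge the first framework that uses Lean4, a dependently typed FL, to model and verify agent behavior. Lean4Agent introduces FormalAgentLib, an extensible Lean4 library for formally modeling and verifying the semantic consistency of agent workflows under explicit assumptions, and for localizing execution-time failures revealed by trajectories. Building on FormalAgentLib, we further develop LeanEvolve, which applies results from FormalAgentLib to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of 11.94%, and LeanEvolve further improves SWE performance by 7.47% on average. Furthermore, Lean4Agent establishes a foundation for a new field of using an expressive dependently typed FL to formally model and verify agent behavior.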

Authors: Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06523.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06523

Published: 2026-06-08T02:12:22.841Z

5. CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

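Abstract:Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES-Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.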

Authors: Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06526.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06526

Published: 2026-06-08T02:12:22.841Z

6. Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

Authors: Catherine Ge-Wang, Tyler Crosse, Benjamin Hadad IV, Joachim Schaeffer, Ram Potham, Tyler Tracy

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06529.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06529

Published: 2026-06-08T02:12:22.841Z

7. CARVE-Q: Quantum-Proposed, Classically Certified Interactive Driving Repair

Authors: Yifan Wang

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06531.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06531

Published: 2026-06-08T02:12:22.841Z

8. Position: Don’t Just “Fix it in Post”: A Science of AI Must Study Training Dynamics

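Abstract:What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.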

Authors: Stella Biderman, Mohammad Aflah Khan, Niloofar Mireshghallah, Catherine Arnett, Fazl Barez, Naomi Saphra

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06533.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06533

Published: 2026-06-08T02:12:22.841Z

9. Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

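Abstract:We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by way of identifying and addressing various limitations that arise from memory latency and floating-point representation, as well as leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.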

Authors: Cody J Christopher, Charles Gretton

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06641.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06641

Published: 2026-06-08T02:12:22.841Z

10. A Study of Parallel Continuous Local Search

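Abstract:We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (PB) constraints. Here, the $n$-variable PB-satisfiability problem is relaxed to a continuous optimisation problem with a differentiable objective function on an $n$-dimensional hypercube. For satisfiable instances, the global minimisers of this optimisation problem correspond to satisfying assignments of the SAT problem at hand. We present several novel findings via empirical experiments: (i) redundant constraints can inhibit rather than accelerate convergence; (ii) CLS shows promise as a sub-solver in hybridised settings, quickly completing partial assignments; and (iii) local search rapidly converges to a stable distribution of solution quality (i.e., degree of satisfaction), due to saddle-dense objectives where additional solver steps yield diminishing returns. Our findings inform practical uses of CLS for SAT on modern accelerator hardware.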

Authors: Cody J Christopher, Charles Gretton

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.06656.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06656

Published: 2026-06-08T02:12:22.841Z

VLM Domain Papers

1. Applying Deep Learning for cockpit segmentation in the context of mixed reality

Abstract:Computer vision is an area that has been growing continuously. With the advance of first-person-view technologies, new development opportunities have emerged in the area. Mixed reality promotes virtual environments in which objects from the physical world are shown in real time. For that, it is necessary to attend to the user's immersion in the simulated environment, bringing it ever closer to the desired reality. This paper proposes an image-processing approach that segments images into foreground and background in order to facilitate the union of virtual and real images. To that end, the present work captures real images, through a camera, of a user operating the CAT793F off-highway truck simulator and segments them with artificial intelligence. The convolutional neural network architectures “U-net” and “DeepLabV3+” are applied to perform the image segmentation. As a result, metrics with around 90% accuracy are reported and the best model is determined.

Authors: Alexandre Leles Sousa, Pedro de Oliveira Nielson, Erick Oliveira Rodrigues, Rafael Francisco dos Santos, Giovani Bernardes Vitor

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06520.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06520

Published: 2026-06-08T02:18:38.654Z

2. GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

Abstract:Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates video codec into the video understanding framework via a meticulously designed motion agent trained on Groups of Pictures (GOPs) from video codec. We further develop a GOP tree reasoning algorithm, which is naturally aligned with video codec and enhances the model’s ability to understand local detailed motions in videos. Additionally, we carefully design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory. Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and Egoschema, thereby demonstrating the superiority of our proposed framework.

Authors: Haozhe Chi, Yang Jin, Yadong Mu

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06532.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06532

Published: 2026-06-08T02:18:38.654Z

3. Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging

Abstract:Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale variation, and the small spatial extent of defect instances in Unmanned Aerial Vehicle (UAV) imagery. To address these challenges, this paper proposes AE-YOLO, an Attention-Guided AutoEncoder-Enhanced YOLO framework for robust insulator defect detection. The architecture integrates lightweight bottleneck autoencoders within a Feature Pyramid Network-Path Aggregation Network (FPN-PAN) neck. This preserves anomaly-sensitive information during multi-scale feature fusion. Convolutional Block Attention Modules (CBAM) are used throughout the backbone, enhancing feature discrimination and suppressing background interference. The framework also introduces a variance-maximizing autoencoder regularization strategy, which encourages diverse, defect-discriminative latent representations. The network trains using a unified objective that combines focal loss, Complete IoU (CIoU) loss, and autoencoder regularization to address foreground-background imbalance and improve localization accuracy. During inference, Weighted Boxes Fusion (WBF) combines predictions from YOLOv8, YOLOv10, and YOLO11. An autoencoder-guided confidence boosting mechanism improves sensitivity to rare defect categories. Experiments on the Insulator-Defect Detection dataset show that AE-YOLO with an EfficientNetV2 backbone achieves 95.10 percent mAP at 0.5, 96.40 percent precision, and 93.80 percent recall. This performance surpasses the strongest YOLO-family baseline by 5.0 points in mAP at 0.5 and 6.7 points in recall. These results confirm the effectiveness and adaptability of the framework. The model is a practical and scalable solution for UAV-based transmission-line inspection and defect monitoring.
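
Of the loss terms named in the abstract, focal loss is the one aimed at foreground-background imbalance. A minimal Python sketch of the standard binary focal loss follows; the defaults `alpha=0.25` and `gamma=2.0` are the commonly used values from the original focal loss formulation and are an assumption here, since the abstract does not state the settings used for AE-YOLO.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction p = P(y = 1) and a label y in {0, 1}.

    With gamma = 0 this reduces to alpha-weighted cross-entropy.
    """
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of examples the model already classifies confidently, so the many easy background regions in an aerial image contribute little and the rare, hard defect instances dominate the gradient.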

中文摘要

摘要：由于类别严重不平衡、规模变化大以及缺陷实例在无人机（UAV）图像中的空间范围小，高压输电线路绝缘子自动缺陷检测仍然具有挑战性。为应对这些挑战，本文提出了AE-YOLO，一种基于注意力引导的自编码器增强YOLO框架，用于稳健的绝缘子缺陷检测。该架构在特征金字塔网络-路径聚合网络（FPN-PAN）颈部中集成了轻量级瓶颈自编码器。这在多尺度特征融合过程中保留了对异常敏感的信息。卷积块注意力模块（CBAM）贯穿整个骨干网络，增强特征区分能力并抑制背景干扰。该框架还引入了一种方差最大化自编码器正则化策略，以促进多样化、缺陷判别性的潜在表示。网络采用统一目标进行训练，结合了焦点损失、Complete IoU（CIoU）损失和自编码器正则化，以应对前景-背景不平衡并提升定位精度。在推理过程中，采用加权框融合（WBF）结合YOLOv8、YOLOv10和YOLO11的预测结果。自编码器引导的置信度提升机制提高了对稀有缺陷类别的敏感性。在绝缘子缺陷检测数据集上的实验表明，采用EfficientNetV2骨干的AE-YOLO在mAP@0.5达到95.10%，精度达到96.40%，召回率为93.80%。这一性能比最强的YOLO系列基线在mAP@0.5上高出5.0个百分点，在召回率上高出6.7个百分点。结果验证了该框架的有效性和适应性。该模型是无人机输电线路巡检和缺陷监测的实用且可扩展的解决方案。
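The abstract names focal loss as the term that handles foreground-background imbalance. The following is a minimal sketch of the standard binary focal loss, not the authors' implementation: the function names are hypothetical, and `alpha=0.25`, `gamma=2.0` are the commonly used defaults rather than values reported for AE-YOLO.

```python
import math

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for a single binary prediction.

    p: predicted probability of the positive class; y: label in {0, 1}.
    alpha and gamma are assumed defaults, not values taken from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # guard against log(0)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y, eps=1e-12):
    """Plain binary cross-entropy, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(min(max(p_t, eps), 1.0))

easy = binary_focal_loss(0.95, 1)  # confident and correct
hard = binary_focal_loss(0.10, 1)  # confident and wrong
```

With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy on positive examples, which is one way to sanity-check an implementation.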

Authors: Malak Allam, Khaled Shaban, Ali Hamdi

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06536.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06536

Published: 2026-06-08T02:18:38.654Z

4. WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Abstract:In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. In quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

中文摘要

摘要：在实际应用中，模型被期望能在各种不同环境中可靠地运行。然而，许多现有的多模态基准在扩展任务类型的同时，并未涵盖处理开放式视觉输入所需的视觉多样性。我们提出了WorldBench，这是一个具有挑战性且视觉上多样化的推理基准，用于评估多模态大语言模型（MLLMs）。我们构建了一个跨多个领域（例如生物）的数千个视觉概念的分类体系。在该分类体系的指导下，我们从搜索引擎和现有数据集中策划了大量图像，以全面呈现视觉世界。通过结构化的试错过程，我们手工设计了前沿MLLM无法回答的挑战性问题。在定量和人工评估中，WorldBench在视觉多样性方面超过了任何现有的多样化基准。对15个MLLM在WorldBench上的评估揭示了视觉理解的弱点：即使最强的模型也仅达到64.0%的准确率，而一些模型的表现仅略高于随机水平。我们希望我们的工作能够凸显在构建多模态基准时视觉多样性的重要性。

Authors: Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06538.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06538

Published: 2026-06-08T02:18:38.654Z

5. Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

Abstract:Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF — dynamic temperature goodness, decoupled normalization, and multi-layer fusion — as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% — the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] — exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF’s 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.

中文摘要

摘要：Forward-Forward（FF）学习 [Hinton, 2022] 用严格的层局部优度更新替代了反向传播。最近的 FF-CNN 研究在 32x32 基准上缩小了与 BP 的差距，引发了关于层局部训练在实际规模下是否成为可行替代方案的讨论。为了严格探讨这一问题，我们开发了 DTG-FF —— 动态温度优度、解耦规范化和多层融合 —— 作为一种工具，使 FF 家族在九个真实数据基准上达到了最先进水平（CIFAR-10 为 91.8%，并且首次在 ImageNet-100 224x224 上建立了 FF 基线），并用其来评估层局部训练的实际可扩展性。 (1) 真实数据的扩展。在相同配方和骨干下，与架构匹配的 BP-DeepSup 基线在 CIFAR-10/CIFAR-100 上分别比 DTG-FF 高出 2.40/5.93 个百分点，且随着类别数的增加，这一差距进一步扩大。在 224x224 下，相同工具仅达到 49.4% —— 这是该规模下的首个 FF 基线，而典型的 BP 超过 75% [Tian et al., 2020] —— 暴露了在 32x32 时不可见的真实数据上限。 (2) 合成与真实的 K 冲突。随着类别数 K 在合成教师-学生任务中增加，DTG-FF 的表现逐渐优于 BP，但在真实图像上，FF-BP 差距反转并随 K 扩大。在 CIFAR-100 数据集内的粗粒度与细粒度探测中，将标签层次与图像分布分离：合成 K 扫描混淆了输出维度与细粒度识别难度，从而高估了 FF 的可迁移性。 (3) 系统审查。FF 可以在不存储全深度激活的情况下实现，但在 8 GB 常规硬件上，标准 BP+梯度累积可以达到 4.18 GB / 157 图像/秒，而 DTG-FF 为 7.90 GB / 138 图像/秒，因此在公平基线下，FF 在此规模上的内存理由不成立。
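For readers unfamiliar with the "layer-local goodness updates" mentioned in the abstract, here is a minimal sketch of the basic Forward-Forward objective from Hinton (2022): goodness is the sum of squared activations, and each layer is trained to push goodness above a threshold for positive data and below it for negative data. The threshold value `theta=2.0` and the function names are assumptions for illustration; the DTG-FF variants (dynamic temperature, decoupled normalization, multi-layer fusion) are not reproduced here.

```python
import math

def goodness(activations):
    """Layer goodness: sum of squared activations."""
    return sum(a * a for a in activations)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def ff_layer_loss(pos_acts, neg_acts, theta=2.0):
    """Layer-local loss for one positive and one negative sample.

    Equals -log sigmoid(g_pos - theta) - log(1 - sigmoid(g_neg - theta)).
    theta is a hypothetical threshold chosen for this example.
    """
    g_pos = goodness(pos_acts)
    g_neg = goodness(neg_acts)
    return softplus(theta - g_pos) + softplus(g_neg - theta)

well_separated = ff_layer_loss([2.0, 2.0], [0.1, 0.1])
badly_separated = ff_layer_loss([0.1, 0.1], [2.0, 2.0])
```

Because the loss depends only on the activations of a single layer, no gradient needs to flow between layers, which is the property the paper's scaling audit examines.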

Authors: Yucheng Chen

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06539.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06539

Published: 2026-06-08T02:18:38.654Z

6. Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

Abstract:Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object’s 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

中文摘要

摘要：对象插入旨在将参考对象无缝合成到背景图像的指定区域。最近基于扩散的方法能够实现高视觉质量，但将插入任务简单地视为二维修补任务，未提供对对象三维姿态的显式控制，从而限制了其实际应用。我们提出了DIRECT（参考组成与目标整合的分解注入），这是一个将交互式姿态操控与高保真二维图像合成相结合的创新框架，以实现可姿态控制的对象插入。我们的方法将插入条件分解为三个互补组件：捕捉参考对象视觉细节的外观引导、由用户调整的三维代理生成的几何引导，以及来自目标背景的上下文引导。通过通过不同路径注入这些组件，DIRECT 避免了特征缠结，同时能够保持参考对象的外观，遵循用户指定的姿态，并使对象适应目标场景。我们还引入了自动化数据构建流程，以提高训练数据的多样性和质量。实验表明，DIRECT 在几何可控性和视觉质量方面均优于以往方法。

Authors: Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06601.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06601

Published: 2026-06-08T02:18:38.654Z

7. From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video

Abstract:Joint contact forces govern implant longevity, cartilage health, and rehabilitation outcomes, shaping who develops osteoarthritis, who recovers well from joint replacement, and who benefits from biomechanical interventions. Yet they remain measurable only invasively, in a few dozen patients with instrumented implants. I present a physics-free pipeline to predict instantaneous 3D hip and knee contact forces from an uncalibrated monocular video: no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. Parametric body meshes are recovered per frame, encoded as kinematic features, and decoded into forces by a transformer whose pose stream is adaptively modulated at every layer by body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), unifying hip and knee in a single model. Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the in vivo OrthoLoad database, the pipeline matches the accuracy of subject-specific musculoskeletal simulations ($0.32 \pm 0.08$ BW RMSE for hip; $0.23 \pm 0.03$ BW for knee) and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms prior published methods. Even without curated activity labels, video features alone preserve accuracy and enable end-to-end inference on raw footage. Driven by the predictor, a generative motion prior produces biomechanically plausible variants with reduced peak loading, rediscovering strategies from the predictive simulation literature. This pipeline establishes uncalibrated monocular video as a viable modality for estimating joint loading, opening a path toward retrospective analysis of archived clinical recordings, primary-care screening, and at-home rehabilitation tracking.

中文摘要

摘要：关节接触力决定着植入物寿命、软骨健康和康复结果，影响着谁会患上骨关节炎、谁能从关节置换中良好恢复、谁能从生物力学干预中受益。然而，目前只能通过侵入性方式，在几十名植入了带传感器假体的患者身上测量这些力。我提出了一个不依赖物理模型的流程，可从未标定的单目视频中预测瞬时三维髋关节和膝关节接触力：无需标记点、测力台、肌电图、受试者特定影像或肌肉骨骼模型。该流程逐帧恢复参数化人体网格，将其编码为运动学特征，并由一个 Transformer 解码为力；该 Transformer 的姿态流在每一层都由体形、关节、侧别、活动文本和自监督视频词元（V-JEPA 2）自适应调制，从而在单一模型中统一髋关节和膝关节。在体内测量数据库 OrthoLoad 的 26 名患者和 25 个活动类别上进行留一受试者交叉验证，该流程达到了受试者特定肌肉骨骼仿真的精度（髋关节 RMSE 为 $0.32 \pm 0.08$ BW；膝关节为 $0.23 \pm 0.03$ BW），并能分辨出比步态再训练和骨关节炎进展相关报告中更小的峰值力变化。以零样本方式应用于一个独立的带传感器假体队列时，它可与此前已发表的方法相媲美甚至更优。即使没有人工整理的活动标签，仅凭视频特征也能保持精度，并支持对原始视频进行端到端推理。在该预测器的驱动下，生成式运动先验可产生峰值负荷更低且生物力学上合理的动作变体，重新发现了预测仿真文献中的策略。该流程确立了未标定单目视频作为估计关节负荷的可行模态，为归档临床录像的回顾性分析、初级保健筛查和居家康复跟踪开辟了道路。
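The evaluation protocol in the abstract is leave-one-subject-out cross-validation with RMSE reported in body-weight (BW) units. A minimal sketch of that protocol follows; the helper names and the toy subject list are hypothetical, and nothing here reflects the actual OrthoLoad data layout.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per unique subject.

    Every sample from the held-out subject goes to the test fold, so no
    subject contributes to both training and testing within a fold.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

def rmse(pred, target):
    """Root-mean-square error; in BW units if forces are divided by body weight."""
    return (sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)) ** 0.5

# Toy example: six recordings from three hypothetical subjects.
subjects = ["A", "A", "B", "C", "C", "C"]
folds = list(leave_one_subject_out(subjects))
```

Splitting by subject rather than by recording matters here because trials from the same patient are strongly correlated, and a per-recording split would overstate accuracy on unseen patients.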

Authors: Jessy Lauer

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06631.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06631

Published: 2026-06-08T02:18:38.654Z

8. Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

Abstract:Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: this https URL.

中文摘要

摘要：尽管精度很高，视觉变换器（Vision Transformer, ViT）的预测可能受到虚假线索的驱动，这就需要在安全部署前理解其内部工作原理。稀疏自编码器（Sparse Autoencoders, SAEs）为将模型表示分解为人类可解释概念提供了有前景的视角，但由于对概念覆盖控制有限以及特征解释主观且不可扩展，将基于SAE的解释方法应用于ViTs仍然具有挑战性。为填补这一空白，基于受神经科学启发的原则，我们提出了ViSAE，一种通过概念电路理解ViT内部工作机制的机理解释工具箱。ViSAE由三个组件组成：(1) 一个包含64K图像和16K视觉基础概念词汇的探测套件，使概念覆盖效率比ImageNet提高20倍，解释精度比现有概念集提高28.7%。(2) 自顶向下的概念读取和自底向上的电路追踪算法，通过概念电路自动恢复ViT内部工作原理。(3) 用于审计和引导ViT行为的应用。通过概念编辑，ViSAE在WaterBirds数据集上将最差组别的准确率提高了48.2%，比现有方法高出23.8%。我们的数据和代码：https URL。
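The WaterBirds result above is reported as worst-group accuracy. As a reminder of what that metric measures, here is a minimal sketch; the function name and toy data are hypothetical, and the groups are assumed to be label/background combinations as is usual for this benchmark.

```python
from collections import defaultdict

def worst_group_accuracy(preds, labels, groups):
    """Return (lowest per-group accuracy, dict of per-group accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group

# Toy data: group "a" is classified perfectly, group "b" mostly wrong.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]
worst, per_group = worst_group_accuracy(preds, labels, groups)
```

Average accuracy on this toy data is 4/6, while the worst-group figure is 1/3, which illustrates why the metric is used to expose reliance on spurious cues.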

Authors: Tang Li, Yanlin Chen, Mengmeng Ma, Xi Peng

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06664.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06664

Published: 2026-06-08T02:18:38.654Z

9. Architecture-Adaptive Uncertainty Fusion for Deepfake Detection

Abstract:Deepfake detection systems achieve near-perfect accuracy on benchmarks, yet forensic deployment demands reliable prediction uncertainty. Existing uncertainty quantification (UQ) methods rely on single sources and ignore that optimal uncertainty composition varies across architectures. We propose Correlation-Optimized Fusion (COF), an architecture-adaptive framework that fuses five complementary uncertainty sources (epistemic, aleatoric, calibration, conformal, and distributional) by maximizing Pearson correlation between fused uncertainty scores and prediction errors via constrained optimization on the probability simplex. COF requires no model modifications and only 42 s of weight optimization, compared to 20-45 h for a 5-model Deep Ensemble. Evaluation across eleven architectures on FaceForensics++ reveals a fundamental trade-off: under matched train/evaluation protocol, non-linear methods achieve approximately 5-6% higher in-domain correlation than COF (mean r = 0.438), but this reverses under distribution shift. On CelebDF, COF outperforms Random Forest in 9/11 architectures with up to 7.3x higher correlation (MaxViT-B: r = 0.249 vs. 0.034); RF degrades 85% cross-domain to r = 0.071, whereas COF retains substantially more signal (74% drop to r = 0.116). Cross-dataset evaluation on CelebDF and DFDC reveals catastrophic generalization failure across all methods: in-domain correlations of 0.41-0.47 collapse to near-zero externally (mean degradation 90.7%), with seven of eleven architectures exhibiting uncertainty inversion. These results establish COF as a practical, interpretable framework for controlled-distribution deployment and identify domain-adaptive UQ as the central open challenge for forensic deployment.

中文摘要

摘要：深度伪造检测系统在基准测试中几乎可以达到完美准确率，但法务部署需要可靠的预测不确定性。现有的不确定性量化（UQ）方法依赖单一来源，忽略了不同架构之间最优不确定性组合的差异。我们提出了相关优化融合（COF），这是一个架构自适应框架，通过在概率单纯形上进行约束优化，最大化融合不确定性分数与预测误差之间的皮尔逊相关性，融合五种互补的不确定性来源——认知不确定性、随机性不确定性、校准不确定性、一致性不确定性和分布不确定性。COF不需要修改模型，仅需42秒的权重优化，而5模型深度集成则需要20至45小时。在FaceForensics++上对十一种架构的评估显示了一种基本权衡：在匹配的训练/评估协议下，非线性方法的域内相关性比COF高约5~6%（平均r=0.438），但在分布转移下情况则相反。在CelebDF数据集上，COF在11种架构中有9种优于随机森林，相关性最高可达7.3倍（MaxViT-B：r=0.249 对比 0.034）；随机森林跨域性能下降85%，仅剩r=0.071，而COF仍保留了更多信号（下降74%至 r=0.116）。在CelebDF和DFDC的跨数据集评估表明，所有方法都存在灾难性泛化失败：域内相关性为0.41~0.47，而外部几乎降至零（平均下降90.7%），且十一种架构中有七种出现不确定性反转。这些结果确立了COF作为一种实用且可解释的受控分布部署框架，并指出域自适应不确定性量化是法务部署的核心未解决挑战。
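The core idea in the abstract, choosing non-negative weights that sum to one so that the fused uncertainty correlates maximally with prediction errors, can be sketched as follows. This is an illustration under stated assumptions, not the authors' optimizer: a coarse grid search over the simplex stands in for the paper's constrained optimization, the data are synthetic, three sources are used instead of five, and all function names are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def simplex_grid(k, steps):
    """Yield every length-k weight vector with entries in {0, 1/steps, ..., 1} summing to 1."""
    def rec(dims, remaining):
        if dims == 1:
            yield (remaining,)
            return
        for c in range(remaining + 1):
            for rest in rec(dims - 1, remaining - c):
                yield (c,) + rest
    for counts in rec(k, steps):
        yield [c / steps for c in counts]

def fit_fusion_weights(sources, errors, steps=10):
    """Grid search for simplex weights maximizing corr(fused uncertainty, errors)."""
    best_w, best_r = None, float("-inf")
    for w in simplex_grid(len(sources), steps):
        fused = [sum(wj * s[i] for wj, s in zip(w, sources)) for i in range(len(errors))]
        r = pearson(fused, errors)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r

# Synthetic stand-in data: each source is the error indicator plus noise.
rng = random.Random(0)
errors = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(200)]
sources = [[e + rng.gauss(0.0, sigma) for e in errors] for sigma in (0.5, 1.0, 2.0)]
weights, fused_r = fit_fusion_weights(sources, errors)
single_r = [pearson(s, errors) for s in sources]
```

Since the grid contains every one-hot weight vector, the fused correlation on the fitting data can never fall below that of the best single source; whether the fitted weights transfer to another dataset is the question the paper's cross-domain results address.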

Authors: Ritesh Sharma, Mohammad Ghasemigol, Yuichi Motai

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06666.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06666

Published: 2026-06-08T02:18:38.654Z

10. JA-SIREN: Deterministic Initialization for Sinusoidal Networks via Spectral Matching

Abstract:Existing implicit neural representation (INR) approaches suffer from stochastic initialization that does not guarantee consistent or high-quality performance across runs, with variations reaching more than 2.5 dB (78%) in image regression. This variation is problematic for scientific computing and simulation, where result reproducibility is crucial. To address this problem, we present Jacobi-Anger Sinusoidal Representation Network (JA-SIREN), a deterministic initialization scheme for sinusoidal networks grounded in classical spectral analysis. By computing the Discrete Sine Transform (DST) of the target signal and leveraging the Jacobi-Anger expansion, we derive closed-form weights for a two-layer sinusoidal MLP that analytically match the network’s initial spectral response to the target signal, requiring no random seed or additional hyperparameter tuning. On the Kodak dataset, JA-SIREN achieves a mean PSNR of 67.18 dB, a 21.30 dB improvement over the best baseline. This is achieved with zero run-to-run variance, confirming that spectrally-informed initialization is a more effective and reproducible alternative to stochastic initialization for sinusoidal INRs.

中文摘要

摘要：现有的隐式神经表示（INR）方法存在随机初始化的问题，这种方法不能保证在多次运行中取得一致或高质量的性能，在图像回归中差异可高达2.5 dB（78%）。这种波动对于科学计算和模拟来说是一个问题，因为结果的可重复性至关重要。为了解决这个问题，我们提出了Jacobi-Anger正弦表示网络（JA-SIREN），这是一种基于经典谱分析的正弦网络确定性初始化方案。通过计算目标信号的离散正弦变换（DST）并利用Jacobi-Anger展开式，我们推导出一个两层正弦多层感知器（MLP）的闭式权重，从而在分析上匹配网络的初始谱响应与目标信号，无需随机种子或额外超参数调节。在Kodak数据集上，JA-SIREN实现了67.18 dB的平均PSNR，相较于最佳基线提高了21.30 dB。这是在零运行间方差的情况下实现的，确认了基于谱信息的初始化比随机初始化对于正弦INR来说是一种更有效且可重复的替代方案。
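The Jacobi-Anger expansion that the method builds on expresses a sinusoid of a sinusoid as a Bessel-weighted sum of harmonics: sin(z sin θ) = 2 Σ_{k≥0} J_{2k+1}(z) sin((2k+1)θ). The sketch below only checks this identity numerically; it does not reproduce how JA-SIREN derives its closed-form weights, which the abstract does not spell out. The function names, the quadrature resolution, and the test values of `z` and `theta` are assumptions.

```python
import math

def bessel_j(n, z, num_points=2000):
    """J_n(z) from Bessel's integral (1/pi) * int_0^pi cos(n*t - z*sin(t)) dt.

    Uses the trapezoid rule, which is highly accurate for this integrand.
    """
    h = math.pi / num_points
    total = 0.5 * (1.0 + math.cos(n * math.pi))  # endpoint terms at t = 0 and t = pi
    for k in range(1, num_points):
        t = k * h
        total += math.cos(n * t - z * math.sin(t))
    return total * h / math.pi

def sin_of_sin_series(z, theta, terms=15):
    """Truncated Jacobi-Anger series for sin(z * sin(theta))."""
    return 2.0 * sum(
        bessel_j(2 * k + 1, z) * math.sin((2 * k + 1) * theta)
        for k in range(terms)
    )

z, theta = 3.0, 0.7
exact = math.sin(z * math.sin(theta))
approx = sin_of_sin_series(z, theta)
```

The identity is what makes the spectrum of a sinusoidal layer's output analytically tractable, and hence what allows an initial spectral response to be matched to a target's transform coefficients without random sampling.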

Authors: Mohammed Alsakabi, Kejia Hu, John M. Dolan, Ozan K. Tonguz

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.06671.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.06671

Published: 2026-06-08T02:18:38.654Z