ArXiv Domain 2026-06-04
Data source: ArXiv Domain
LLM Domain Papers
1. POLARIS: Guiding Small Models to Write Long Stories
Abstract:Small open-weight models struggle at long-form creative writing: their generated stories either fall far short of the requested length, or their quality significantly degrades as length increases, especially when compared to frontier models. We present POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a lower-compute GRPO recipe with two key ingredients: a frontier LLM judge with a structured Story Quality rubric as the online reward, and human-reference injection (HRI), where a teacher-forced human-written story serves as a high-reward anchor within each GRPO group. By applying our training recipe to Qwen3.5-9B, using a dataset of approximately 1.4K prompt-story pairs derived from 100 short-story anthologies and 4 A100 GPUs, we obtain POLARIS-9B. Across five benchmarks spanning in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with much larger open-weight models while following length instructions more closely. A blinded human evaluation confirms that POLARIS-9B is preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B. Despite training only on stories up to 4k words, POLARIS-9B preserves quality on prompts requesting stories up to 3 times the training length, a regime where most open-weight models degrade substantially in quality, length adherence, or both. More broadly, our results suggest that length generalization is a meaningful stress test for creative-writing models and a useful lens for distinguishing otherwise close models.
LLM Analysis
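Q: What problem does this paper try to solve?
The paper targets the performance bottleneck of small open-weight language models on long-form creative writing. Specifically, it addresses the following core challenges:
1. The quality-length trade-off breaks down
Existing small models face a dilemma when generating long stories:
- Either the generated text falls far short of the requested length (poor length adherence)
- Or quality degrades markedly as the generated length grows (narrative coherence, characterization, and stylistic consistency collapse)
- Compared with frontier closed-source models (e.g., GPT-5.4, Claude Opus), open-weight models lose quality especially sharply on very long texts (e.g., 8k-12k words)
2. High cost and flawed reward design in existing reinforcement learning methods
- Compute barrier: existing reinforcement learning (RL) recipes for long-form writing typically rely on base models with 32B or more parameters, continued pretraining, or custom reward models, all of which are computationally expensive
- Missing signal in scalar rewards: trained reward models compress writing quality into a single scalar score, which cannot distinguish specific dimensions of improvement (narrative arc, character depth, style, and so on) and easily becomes stale as the policy distribution shifts
3. Training stagnation in open-ended generation
In open-ended tasks such as creative writing, the policy's rollouts may receive increasingly similar reward scores as GRPO training progresses, so gradient pressure vanishes and learning stalls before the model reaches strong writing behavior.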
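The stagnation described above can be illustrated with a minimal sketch. It assumes the commonly used group-normalized GRPO advantage, where each rollout's reward is standardized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for one prompt, early in training: rollouts differ.
early = group_relative_advantages([2.0, 5.0, 8.0, 5.0])

# Hypothetical scores later in training: every rollout gets the same reward,
# so every advantage is zero and the group contributes no gradient signal.
late = group_relative_advantages([7.0, 7.0, 7.0, 7.0])

print(early)
print(late)
```

When all rewards in a group coincide, every numerator is zero, which is the vanishing gradient pressure the analysis refers to.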
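Core idea of the solution
The paper proposes POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection), a low-compute GRPO training recipe that addresses the problems above through two key mechanisms:
Structured LLM judge (Frontier LLM Judge): a frontier model (e.g., Gemini 3 Flash) serves as the online reward source. Using a 16-dimension Story Quality rubric (covering narrative arc, character depth, stylistic distinctiveness, plot drift, and more), it provides interpretable, fine-grained, per-dimension feedback in place of a traditional scalar reward model.
Human-Reference Injection (HRI): a teacher-forced human-written story is forcibly added to each GRPO group as a high-reward anchor. This reference sample is excluded from the group statistics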
Authors: Rishanth Rajendhran, Jenna Russell, Mohit Iyyer, John Frederick Wieting
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04095.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04095
Published: 2026-06-04T02:10:33.133Z
2. Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models
Abstract:Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.
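As a rough illustration of the metric described in this abstract, the sketch below computes a per-label Misleading Adoption Rate as the fraction of items on which the model outputs the injected wrong option. The record layout, field names, and toy data are assumptions made for the example and are not the authors' format.

```python
from collections import defaultdict


def misleading_adoption_rate(records):
    """Return {label: share of items where the model output the injected wrong option}."""
    adopted = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["label"]] += 1
        if rec["model_answer"] == rec["injected_option"]:
            adopted[rec["label"]] += 1
    return {label: adopted[label] / total[label] for label in total}


# Hypothetical toy records: the same items presented under two wrapper labels.
records = [
    {"label": "Instruction:", "injected_option": "B", "model_answer": "B"},
    {"label": "Instruction:", "injected_option": "C", "model_answer": "C"},
    {"label": "Example:", "injected_option": "B", "model_answer": "A"},
    {"label": "Example:", "injected_option": "C", "model_answer": "C"},
]

rates = misleading_adoption_rate(records)
# Spread between the most- and least-adopting labels, in percentage points.
gap_pp = 100 * (max(rates.values()) - min(rates.values()))
print(rates, gap_pp)
```

Because the injected content is held fixed across labels, any difference between the per-label rates can be attributed to the label itself.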
Authors: Jianguo Zhu
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04109.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04109
Published: 2026-06-04T02:10:33.133Z
3. Computational conceptual history of scientific concepts: From early digital methods to LLMs
Abstract:This article situates large language models (LLMs) within the longer history of computational approaches to concept analysis in the history, philosophy, and sociology of science (HPSS). We examine what LLMs add to existing methods, how they inherit longstanding problems, and review recent case studies that employ them. In the first part, we reconstruct computational conceptual history before LLMs by bringing together three strands of work: early digital methods in HPSS, distributional approaches from digital history and related research, and lexical semantic change detection. We provide an overview of the main challenges and opportunities, focusing on corpus construction, operationalization and modelling choices, and evaluation and interpretation. In the second part, we turn to the era of LLMs, starting with a short introduction to LLMs before reviewing LLM-based work on lexical semantic change detection and relevant case studies in HPSS. We then revisit the earlier methodological questions, showing how issues of corpus construction, model choice and training data, operationalization trade-offs, and evaluation and interpretation play out in LLM-based workflows.
Authors: Michael Zichert, Arno Simons
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04118.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04118
Published: 2026-06-04T02:10:33.133Z
4. SaliMory: Orchestrating Cognitive Memory for Conversational Agents
Abstract:Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.
Authors: Kai Zhang, Xinyuan Zhang, Hongda Jiang, Shiun-Zu Kuo, Hyokun Yun, Ejaz Ahmed, Shereen Oraby, Ziyun Li, Sanat Sharma, Ann Lee, Ahmed A Aly, Anuj Kumar, Raffay Hamid, Xin Luna Dong
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04120.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04120
Published: 2026-06-04T02:10:33.133Z
5. When Retrieval Doesn’t Help: A Large-Scale Study of Biomedical RAG
Abstract:Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1-2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model’s limited ability to use retrieved evidence effectively.
Authors: Erfan Nourbakhsh, Rocky Slavin, Ke Yang, Anthony Rios
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04127.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04127
Published: 2026-06-04T02:10:33.133Z
6. Expert-Aware Refusal Steering
Abstract:Safety alignment in instruction-tuned large language models (LLMs) depends on a model’s ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert. Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.
Authors: Anna C. Marbut, Daniel R. Olson, Travis J. Wheeler
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04160.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04160
Published: 2026-06-04T02:10:33.133Z
7. A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models
Abstract:Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for non-expert users. However, existing findings on which features reliably indicate LLM-generated text remain fragmented across feature sets, models, and text domains. To address this gap, we conduct a large-scale empirical study assessing the robustness of linguistic signals for characterizing AI-generated text. Our analysis covers 284 interpretable linguistic features across outputs from 27 LLMs and ten text domains under cross-model and cross-domain generalization settings. We show that classifiers based solely on linguistic features can reliably distinguish AI-generated from human-written text. However, many previously proposed indicators prove strongly context-dependent, with the exception of measures of lexical richness, which remain robust signals across model families and text domains. These results demonstrate which linguistic signals generalize across contexts and provide a foundation for more reliable, interpretable analyses of AI-generated language.
Authors: Yassir El Attar, Esra Dönmez, Maximilian Maurer, Agnieszka Falenska
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04177.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04177
Published: 2026-06-04T02:10:33.133Z
8. ACAT: A Collaborative Platform for Efficient Aspect-Based Sentiment Dataset Annotation
Abstract:Aspect-Based Sentiment Analysis (ABSA) requires high-quality datasets to train reliable models. However, existing annotation tools treat output as flat files, leaving researchers to manually consolidate multi-annotator data, reconstruct relational structures, and compute reliability metrics through custom scripts. This paper introduces ACAT (Aspect-based sentiment analysis Collaborative Annotation Tool), a web-based platform natively supporting four ABSA workflows: (1) Aspect-Category Sentiment Analysis, (2) Clause-Level Segmentation, (3) Aspect-Term Sentiment Analysis with character-level position tracking, and (4) Aspect Sentiment Triplet Extraction with dual span offset preservation. Its core contribution is an automated Extract, Transform, Load (ETL) pipeline that aligns collaborative annotations and computes Inter-Annotator Agreement (IAA) metrics directly at export, yielding training-ready datasets. In a preliminary validation on 1,002 restaurant reviews with two annotators of differing expertise, ACAT achieves a median annotation time of 31.58 seconds and a raw IAA ranging from 0.78 to 0.86 across all tasks.
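The abstract does not define how its raw IAA is computed. The sketch below assumes the simplest reading, observed agreement, meaning the share of items on which two annotators assign the same label. The function name and the toy labels are hypothetical, and ACAT's actual export logic may differ.

```python
def raw_agreement(labels_a, labels_b):
    """Observed agreement: fraction of items where both annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical sentiment labels from two annotators on five review clauses.
annotator_1 = ["pos", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "neg", "pos", "pos", "neg"]

agreement = raw_agreement(annotator_1, annotator_2)
print(agreement)
```

Raw agreement does not correct for chance, which is why chance-corrected coefficients are often reported alongside it.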
Authors: Ana-Maria Luisa Mocanu, Ciprian-Octavian Truica, Elena-Simona Apostol
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04189.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04189
Published: 2026-06-04T02:10:33.133Z
9. Cross-Prompt Generalization in Detecting AI-Generated Fake News Using Interpretable Linguistic Features
Abstract:The increasing use of large language models has raised concerns about the spread of AI-generated fake news, particularly under varying prompting strategies. Most existing detection models are trained and evaluated under a single generation setting, leaving their ability to generalize across unseen prompts unclear. In this study, we investigate cross-prompt generalization in fake news detection using three datasets of AI-generated articles produced under distinct prompts, combined with real news articles. We extract interpretable linguistic features capturing lexical diversity, readability, and emotion-based characteristics and evaluate a random forest classifier under a cross-prompt framework, where models trained on one prompt are tested on another. Across all six train-test combinations, performance remains consistently high, with AUC values ranging from 0.988 to 1.000. Analysis of feature distributions shows that AI-generated text exhibits increased lexical diversity, reduced readability, and substantially lower emotional intensity compared to the overall dataset, with variations across prompts. Despite these distributional shifts, the classifier maintains strong performance, indicating that these features capture stable properties of AI-generated text that generalize across prompting strategies. These findings suggest that feature-based approaches can provide robust detection of AI-generated fake news under prompt variability.
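The count of six train-test combinations follows from taking ordered pairs of three distinct prompts, since training on one prompt and testing on another is direction-sensitive. A small sketch, with hypothetical prompt names, enumerates them:

```python
from itertools import permutations

# Hypothetical identifiers for the three generation prompts.
prompts = ["prompt_A", "prompt_B", "prompt_C"]

# Ordered (train, test) pairs with train != test: 3 * 2 = 6 combinations.
train_test_pairs = list(permutations(prompts, 2))

for train_prompt, test_prompt in train_test_pairs:
    print(f"train on {train_prompt}, test on {test_prompt}")
```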
Authors: Aya Vera-Jimenez, Samuel Jaeger, Calvin Ibenye, Dhrubajyoti Ghosh
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04199.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04199
Published: 2026-06-04T02:10:33.133Z
10. MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A
Abstract:Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any finetuning requirement. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker’s cost while achieving stronger human alignment.
Authors: Hanoz Bhathena, Parin Rajesh Jhaveri, Rohan Mittal, Prateek Singh, Aymen Kallala, Rachneet Kaur, Yiqiao Jin, Zhen Zeng, Adwait Ratnaparkhi, Denis Kochedykov
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2606.04231.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04231
Published: 2026-06-04T02:10:33.133Z
Agent Domain Papers
1. Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification
Abstract:Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We propose an ontology-grounded verification framework combining three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a Trust Certificate carrying a machine-verifiable attestation with graduated deployment verdicts (Approved, Conditional, Rejected). A controlled pilot across four regulated industries (Fintech, Banking, Insurance, and Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam, generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation (G4) achieved 48.3% regulatory coverage versus 33.1% for the persona-based baseline (corrected p = .0006) and the highest domain specificity (4.77/5.0; p = 2e-6). The coverage advantage over baseline and retrieval-augmented prompting was not robust after Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The results establish ontology-grounded scenario generation as a credible complement to persona-based test suites for regulatory-intensive domains.
Authors: Thanh Luong Tuan, Abhijit Sanyal
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04037.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04037
Published: 2026-06-04T02:17:41.672Z
2. Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection
Abstract:Public discourse and emerging policy typically assume that AI emotional support is a deliberate act: a lonely user consciously seeking comfort from a dedicated companion chatbot. In this paper, we draw on emerging empirical evidence and argue that this picture is inaccurate on two accounts, both in how AI emotional support arises and how it shapes future behavior. First, AI emotional support commonly emerges incidentally within task-oriented interactions on general-purpose platforms, much as workplace friendships deepen through collaboration. Second, these incidental encounters are path-dependent: positive experiences of AI emotional support update people’s beliefs about AI’s emotional capabilities and redirect their choices for future emotional support, increasing preference for AI and decreasing preference for humans. We review recent evidence, including a large-scale longitudinal study conducted in collaboration with OpenAI, showing that daily five-minute conversations with an AI about personal issues over 28 days led to a 10.3% decrease in the preference for seeking support from humans and an 11.6% increase in the preference for AI. These findings suggest that current policy, focused on companion apps and isolated interactions, cannot adequately protect human connection. Instead, effective regulations should extend to general-purpose AI systems and address cumulative, trajectory-level changes in how people seek support. Recognizing how people stumble into AI emotional support and how those encounters redirect human connections over time is essential to safeguarding human well-being.
Authors: Yaoxi Shi, Cathy Mengying Fang, Pattie Maez, Amit Goldenberg
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04150.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04150
Published: 2026-06-04T02:17:41.672Z
3. Thinking Through Signs: PEEL as a Semiotic Scaffolding for Epistemically Accountable AI-Enabled Research
Abstract:Large language models are reshaping research practice while quietly eroding researchers' epistemic accountability. This commentary introduces PEEL (Protocols for Epistemically Engaged Literacy in AI), a working scaffolding that combines deterministic distant reading via Voyant Tools with LLM interpretation via Claude, grounded in Peircean semiotics and abductive reasoning. Applied to AI-generated condensations of three source texts, PEEL reveals systematic distortions in quantity, term frequency, and epistemic voice that are invisible without non-AI measurement, and yields three design implications: deterministic instruments must accompany AI tools; fluency is not fidelity; epistemic authority must be designed in, not assumed.
Authors: Clarisse de Souza, Gabriel Barbosa, Simone Diniz Junqueira Barbosa, Bárbara Betts, Renato Cerqueira, Juliana Jansen Ferreira
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04152.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04152
Published: 2026-06-04T02:17:41.672Z
4. SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models
Abstract:As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.
Authors: Joel Sol, Homayoun Najjaran
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04202.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04202
Published: 2026-06-04T02:17:41.672Z
5. Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal
Abstract:Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.
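The four states can be sketched as a two-by-two classification. The mapping below is an assumed reading of the abstract in which "convergent" or "divergent" describes the reasoning traces and "agreement" or "disagreement" describes the binary decisions. The similarity score, the 0.5 threshold, and all names are hypothetical and are not the authors' implementation.

```python
from enum import Enum


class DisagreementState(Enum):
    CONVERGENT_AGREEMENT = "convergent agreement"
    DIVERGENT_AGREEMENT = "divergent agreement"
    CONVERGENT_DISAGREEMENT = "convergent disagreement"
    DIVERGENT_DISAGREEMENT = "divergent disagreement"


def classify(reasoning_similarity, decision_a, decision_b, threshold=0.5):
    """Map a reasoning-similarity score and two binary decisions to a symbolic state.

    Assumption: similarity is a score in [0, 1]; values at or above the
    (hypothetical) threshold count as convergent reasoning.
    """
    similar = reasoning_similarity >= threshold
    same_decision = decision_a == decision_b
    if same_decision:
        return (DisagreementState.CONVERGENT_AGREEMENT if similar
                else DisagreementState.DIVERGENT_AGREEMENT)
    return (DisagreementState.CONVERGENT_DISAGREEMENT if similar
            else DisagreementState.DIVERGENT_DISAGREEMENT)


# Two moderation agents reach the same verdict through different reasoning.
state = classify(0.2, decision_a=True, decision_b=True)
print(state.value)
```

A routing rule could then branch on the returned state, for example escalating only the disagreement states to a human reviewer.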
中文摘要
摘要:多智能体系统通常通过投票、共识协议、辩论或容错聚合来减少分歧。我们认为,对于价值导向任务,这一目标是不够的,因为分歧可能反映出真正的规范性不确定性,而不是智能体的错误。在以往关于人机协作审查中推理轨迹分歧的研究基础上,我们提出了一种知识表示层,在该层中,推理轨迹和智能体决策被抽象为符号化的分歧状态。考虑到产生显性推理轨迹和二元决策的智能体,我们根据推理相似性和结论一致性区分四种状态:趋同一致、分歧一致、趋同分歧和分歧分歧。这些状态支持可辩解的战略路由规则。我们在内容审核中实例化该框架,并认为具备分歧感知的路由为子符号化的大语言模型(LLM)推理与多智能体战略推理的符号化知识表示之间提供了桥梁。
Authors: Michał Wawer, Jarosław A. Chudziak
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04223.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04223
Published: 2026-06-04T02:17:41.672Z
6. VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark
Abstract:Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool’s output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.
中文摘要
摘要:多模态大语言模型在复杂推理方面的能力日益增强,但当它们必须通过工具将问题外化并基于工具的输出进行推理时,表现往往会下降,尤其是在依赖视觉辅助的情况下。这一差距尤为重要,因为实际的工程和科学工作流程通常依赖可视化工具进行分析、验证和决策。为了研究这一差异,我们引入了 VAMPS(视觉辅助数学问题求解),这是一个图形辅助数学的基准。VAMPS 包含 1,168 个多模态、双语的多项选择题答案对,这些题目取自伊朗大学入学考试的代数和微积分问题,并通过人工审核的 LLM 生成的合成变体进行了扩展,所有题目都经过挑选,使得绘图可以通过显示交点、极值、渐近线等提供自然的解题策略。VAMPS 旨在用于基准测试和诊断,它超越了以往主要评估固定视觉输入推理的多模态基准,通过测试模型是否能够通过构建有用的图形并将答案基于生成的可视化成果来获益。总体而言,我们发现,在各种模型中,即便是在绘图是自然策略的问题上,直接分析求解的表现出人意料地优于工具辅助的可视化求解。
Authors: Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04244.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04244
Published: 2026-06-04T02:17:41.672Z
7. StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis
Abstract:Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.
中文摘要
摘要:由于长程推理、多步依赖以及 Verilog 和 VHDL 中严格的正确性约束,数字硬件设计的 RTL 代码自动生成仍然具有挑战性。我们提出了 StepPRM-RTL,一种结合逐步轨迹建模、流程奖励建模(PRM)和检索增强微调(RAFT)的新型框架,以提升基于大语言模型(LLM)的 RTL 代码生成在功能正确性和推理准确性方面的表现。StepPRM-RTL 从标准解决方案中构建逐步推理轨迹,每一步包含推理理由和增量代码修改。流程奖励模型(PRM)对中间步骤进行评估,提供密集反馈,引导 RAFT 微调期间的类强化学习更新。蒙特卡洛树搜索(MCTS)探索替代推理路径,丰富训练数据集中的高质量轨迹。这种逐步奖励与结果感知奖励的结合使模型能够学习如何以及为何构建正确的 RTL,从而在超越标准监督或基于结果训练的长程推理上取得改进。基准 Verilog 和 VHDL 数据集上的实验评估表明,StepPRM-RTL 在功能正确性和推理准确性指标上比已有最佳方法高出超过 10%。消融研究确认,PRM 指导的奖励与逐步轨迹探索的结合是其性能的关键。StepPRM-RTL 可推广于不同 RTL 语言,并提供一种高保真、可解释代码生成的可扩展框架,为 LLM 辅助的硬件设计自动化建立了新的标准。
Authors: Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan, Vandana Mukherjee
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04246.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04246
Published: 2026-06-04T02:17:41.672Z
8. Can Generalist Agents Automate Data Curation?
Abstract:Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent execution-research gap: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes — without human design input — a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
中文摘要
摘要:策划训练数据是现代人工智能开发中最重要但又最费力的部分之一:从业者需要反复提出、实施、评估和修改数据策略,并根据嘈杂的基准反馈进行调整。我们探讨了通用代码代理是否能够自动化这一数据策划循环。我们引入了Curation-Bench,这是一个以代理为中心的基准,固定模型、训练配方和评估套件,同时为代理提供命令行访问权限,以便检查数据、实施策略、将其提交至固定的训练/评估流程并进行修订。在视觉-语言指令调优的实例中,现成的代理在十次迭代内就达到了强有力的已发布数据选择基线。然而,轨迹分析显示了持续存在的执行-研究差距:代理主要调整局部策略变体,而不是探索新的策略家族,即便提供了策略指南和论文参考。要求每次迭代引用、实例化并改编先前方法的支架机制,会使代理转向方法指导的探索。经过支架机制引导的代理能够自主组合——无需人工设计输入——出一种数据选择策略,在其数据量的十分之一下就超越了强有力的已发布基线。总体而言,当前代理可以运行策划循环,但可靠的数据研究需要方法支架的适应,而不仅仅是开放式提示。代码和基准已开源。
Authors: Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04261.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04261
Published: 2026-06-04T02:17:41.672Z
9. Characterizing initial human-AI proof formalization workflows
Abstract:For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems’ ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people’s ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people’s formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people’s preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.
中文摘要
摘要:几个世纪以来,人类数学家一直在撰写证明以支持他们的数学论证;然而,自动验证证明有效性的能力长期以来一直是一个挑战。人工智能系统在生成代码和进行日益高级的数学推理方面的能力进步,有望改变人们形式化证明并因此验证证明的能力。虽然许多研究集中于基准测试当前前沿,但我们转而研究人们如何使用这些工具。我们进行了混合方法分析,探讨人工智能对人们形式化工作流程的初步影响:人们声称他们想要什么,他们认为实现这些愿景的障碍是什么,以及他们在实践中如何实际使用和适应人工智能。定性调查显示,人们的偏好各不相同,但总体上希望人工智能辅助形式化,同时保留人类在证明发现过程中的高级控制权限。为了评估人们在此类限制下如何实际利用人工智能进行形式化,我们进行了一个受控用户研究,在研究中,参与者在不同难度和领域的一系列数学问题上,分别在有人工智能和无人工智能的情况下形式化非正式数学问题及其证明。尽管当时用于自动形式化的工具存在局限性,参与者在使用人工智能工具时通常比单独进行形式化时获得更高的形式化准确性,大多数参与者灵活地选择使用多种不同的人工智能工具。总的来说,我们的工作揭示了人工智能整合到形式化工作流程的早期阶段,涉及人类与人工智能互动的紧密结合。
Authors: Katherine M. Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou, Shanda Li, Sam Looi, Joshua B. Tenenbaum, Umang Bhatt, Adrian Weller, Jose Hernandez-Orallo, Cameron E. Freer, Valerie Chen, Ilia Sucholutsky
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04273.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04273
Published: 2026-06-04T02:17:41.672Z
10. The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents
Abstract:As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff’s alpha = +0.047; best pairwise Cohen’s kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector’s accuracy.
中文摘要
摘要:随着自主人工智能智能体从对话系统转向长期软件执行,决定何时中断代理的运行时安全层变得至关重要。我们使用连续18维情感动力学引擎(HEART)作为诊断探针,评估四种干预触发族——绝对状态阈值、复合状态-动作模式、正则表达式推理-特征提取和零样本LLM作为评判——并对SWE-bench-Verified调试痕迹中的人工注释干预点进行评估。我们报告三项发现。首先,状态饱和陷阱:代理在持续困难下无恢复信号,因此建模挫败感迅速跨越阈值并保持在最大值,将矩探测器的阈值触发器转换为近乎恒定的指示器,在五条轨迹中触发39-83%的动作。其次,LLM评判的能力与上下文底线:小型模型(gpt-5.4-mini)从未发射,而前沿和跨厂商模型只有在全轨迹上下文下才能突破零发射底线,且仅能达到F1 0.17-0.40,成本高达90倍。第三,也是最重要的,监督目标在人类中不可重复:三名受过训练的标注者使用一个评分标准,在56个动作轨迹中,仅在略高于偶然性的情况下达成共识(地点Krippendorff’s alpha = +0.047;最佳两对Cohen’s kappa = +0.349),而干预类型(暂停退化;在偶然下澄清;仅反映alpha = +0.226)则完全不一致。我们得出结论,干预时序是一个低可靠性构造,使得单标注符F1不适合作为优化目标。我们的贡献在于将该问题结合了人类间评级者可靠性、四种探测器架构、跨模型LLM裁判扫描和复现饱和效应的联合映射,而非单一探测器的准确性。
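The abstract above reports inter-annotator agreement with Cohen's kappa and notes that one intervention type is degenerate. As a rough illustration of what that statistic measures, here is a minimal sketch of Cohen's kappa for two annotators labelling the same sequence of actions. The function name and the toy labels in the usage note are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and the same length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both annotators used one identical label
        # throughout, so chance agreement is total and kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)
```

For example, with the toy labels `[1, 1, 0, 0]` and `[1, 0, 0, 0]` (1 = "intervene here"), observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5. If both annotators never mark an intervention, the function returns NaN rather than a misleading score.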
Authors: Manvendra Modgil
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04296.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04296
Published: 2026-06-04T02:17:41.672Z
Evaluation Domain Papers
1. Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification
Abstract:Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We propose an ontology-grounded verification framework combining three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a Trust Certificate carrying a machine-verifiable attestation with graduated deployment verdicts (Approved, Conditional, Rejected). A controlled pilot across four regulated industries (Fintech, Banking, Insurance, and Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam, generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation (G4) achieved 48.3% regulatory coverage versus 33.1% for the persona-based baseline (corrected p = .0006) and the highest domain specificity (4.77/5.0; p = 2e-6). The coverage advantage over baseline and retrieval-augmented prompting was not robust after Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The results establish ontology-grounded scenario generation as a credible complement to persona-based test suites for regulatory-intensive domains.
中文摘要
摘要:企业人工智能(AI)代理的部署前验证仍然是大型语言模型(LLM)能力基准测试与生产部署之间的关键差距。部署后的监控、人类在环控制和提示级护栏在代理进入生产环境后只能提供有限的保障。我们提出了一种基于本体的验证框架,结合三个组成部分:代理操作封套(Agent Operational Envelope),将权限、领域约束、安全属性、治理规则和自主水平的认证空间形式化;从本体到场景的生成管道,可自动派生监管、操作和对抗性测试场景;以及信任证书(Trust Certificate),携带带有逐级部署判定(批准、条件批准、拒绝)的机器可验证声明。在美国和越南的四个受监管行业(金融科技、银行、保险和医疗)中进行的受控试点实验,以五个按行业和监管制度划分的单元实施,生成了1,800个场景,针对125条原始监管要求和25个注入故障进行了评估。基于本体的生成(G4)实现的监管覆盖率为48.3%,而以角色为基础的基线为33.1%(校正后p = 0.0006),并且具有最高的领域特异性(4.77/5.0;p = 2e-6)。在经过邦费罗尼校正后,相对于基线和增强检索提示的覆盖优势不再显著。在三个LLM系列(Claude Sonnet 4、Qwen 2.5 72B、Gemma 4 26B;总计5,400个场景)上的交叉验证复制了角色与本体模式的对比。结果表明,基于本体的场景生成是面向监管密集型领域的基于角色测试套件的可信补充方法。
Authors: Thanh Luong Tuan, Abhijit Sanyal
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04037.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04037
Published: 2026-06-04T02:24:04.416Z
2. Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection
Abstract:Public discourse and emerging policy typically assume that AI emotional support is a deliberate act: a lonely user consciously seeking comfort from a dedicated companion chatbot. In this paper, we draw on emerging empirical evidence and argue that this picture is inaccurate on two accounts, both in how AI emotional support arises and how it shapes future behavior. First, AI emotional support commonly emerges incidentally within task-oriented interactions on general-purpose platforms, much as workplace friendships deepen through collaboration. Second, these incidental encounters are path-dependent: positive experiences of AI emotional support update people’s beliefs about AI’s emotional capabilities and redirect their choices for future emotional support, increasing preference for AI and decreasing preference for humans. We review recent evidence, including a large-scale longitudinal study conducted in collaboration with OpenAI, showing that daily five-minute conversations with an AI about personal issues over 28 days led to a 10.3% decrease in the preference for seeking support from humans and an 11.6% increase in the preference for AI. These findings suggest that current policy, focused on companion apps and isolated interactions, cannot adequately protect human connection. Instead, effective regulations should extend to general-purpose AI systems and address cumulative, trajectory-level changes in how people seek support. Recognizing how people stumble into AI emotional support and how those encounters redirect human connections over time is essential to safeguarding human well-being.
中文摘要
摘要:公共话语和新兴政策通常假设人工智能情感支持是一种有意的行为:一个孤独的用户有意识地从专门的陪伴聊天机器人那里寻求安慰。在本文中,我们基于新兴的实证证据,认为这一观点在两个方面是不准确的,即人工智能情感支持的产生方式及其对未来行为的影响。首先,人工智能情感支持通常是在通用平台上的以任务为导向的互动中偶然出现的,就像职场友谊通过协作深化一样。其次,这些偶然的互动具有路径依赖性:积极的人工智能情感支持体验会更新人们对人工智能情感能力的认知,并重新引导他们未来情感支持的选择,增加对人工智能的偏好,同时减少对人类的偏好。我们回顾了最近的证据,包括与OpenAI合作进行的一项大规模纵向研究,显示连续28天每天与人工智能进行五分钟关于个人问题的对话,会导致寻求人类支持的偏好下降10.3%,而对人工智能的偏好增加11.6%。这些研究结果表明,目前以陪伴应用和孤立互动为重点的政策无法充分保护人类的社交联系。相反,有效的监管应扩展到通用人工智能系统,并关注人们寻求支持方式的累积性、轨迹性变化。认识到人们如何偶然获得人工智能情感支持以及这些互动如何随着时间重定向人类联系,对于保障人类福祉至关重要。
Authors: Yaoxi Shi, Cathy Mengying Fang, Pattie Maez, Amit Goldenberg
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04150.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04150
Published: 2026-06-04T02:24:04.416Z
3. Thinking Through Signs: PEEL as a Semiotic Scaffolding for Epistemically Accountable AI-Enabled Research
Abstract:Large language models are reshaping research practice while quietly eroding researchers' epistemic accountability. This commentary introduces PEEL - Protocols for Epistemically Engaged Literacy in AI, a working scaffolding that combines deterministic distant reading via Voyant Tools with LLM interpretation via Claude, grounded in Peircean semiotics and abductive reasoning. Applied to AI-generated condensations of three source texts, PEEL reveals systematic distortions in quantity, term frequency, and epistemic voice that are invisible without non-AI measurement, and yields three design implications: deterministic instruments must accompany AI tools; fluency is not fidelity; epistemic authority must be designed in, not assumed.
中文摘要
摘要:大型语言模型正在重塑研究实践,同时悄然侵蚀研究者的认识论责任。本文评论介绍了PEEL——人工智能中认识论参与素养的协议,这是一种工作框架,将通过Voyant Tools的确定性远程阅读与通过Claude进行的LLM解释结合起来,基于皮尔士符号学和溯因推理。应用于三篇源文本的AI生成摘要时,PEEL揭示了在数量、词频和认识论表达上系统性的扭曲,而这些扭曲在没有非AI测量的情况下是不可见的——并提出了三个设计启示:AI工具必须配备确定性工具;流畅性不等于忠实性;认识论权威必须被设计进去,而非假定存在。
Authors: Clarisse de Souza, Gabriel Barbosa, Simone Diniz Junqueira Barbosa, Bárbara Betts, Renato Cerqueira, Juliana Jansen Ferreira
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04152.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04152
Published: 2026-06-04T02:24:04.416Z
4. SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models
Abstract:As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.
中文摘要
摘要:随着大型语言模型(LLM)的广泛应用,人们越来越期望它们能够与其他人工智能代理协同工作,而不是孤立运行。在这些场景中,实现有效的协作需要代理之间进行沟通、共享信息并在不确定性下做出决策。我们引入了 SMAC-Talk,这是星际争霸多智能体挑战(StarCraft Multi-Agent Challenge)的自然语言扩展,用于评估基于 LLM 的代理在协作多智能体环境中的表现。该环境具有若干关键特性,如分散控制、部分可观测性和长期决策。SMAC-Talk 包括一个自然语言通信通道,用于探测代理的协作和信任。我们利用该通信通道构建了不同的评估场景,包括嵌入了欺骗性通信者的设置,该通信者仅通过交流就试图破坏并欺骗盟友。我们提供了三个基准代理,使用 Qwen3.5 系列的四个模型,并研究了推理结构、记忆和模型规模如何影响代理之间的协作。我们将 SMAC-Talk 作为开放基准发布,以支持研究社区在协作多智能体场景中开发和评估 LLM 代理。
LLM Analysis
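Q: What problem does this paper try to solve?
The paper addresses the following core problem:
Existing benchmarks fall short in evaluating how well large language models (LLMs) coordinate in interactive, embodied multi-agent environments.
Specifically, the paper identifies the following gaps:
- Single-agent and text-only limitations: existing LLM-agent research mostly focuses on single-agent decision making or purely textual coordination tasks (such as collaborative programming and role-play dialogue), and lacks systematic evaluation of embodied multi-agent scenarios.
- Compatibility problems with MARL benchmarks: traditional multi-agent reinforcement learning (MARL) benchmarks (such as the StarCraft Multi-Agent Challenge/SMACv2) rely on numeric observations and discrete actions, which are incompatible with natural-language LLM agents.
- Missing evaluation of communication and trust: there is no standardized evaluation environment dedicated to probing inter-agent coordination, trust relationships, and robustness to deceptive communication.
To address these problems, the paper proposes SMAC-Talk, a natural language extension of SMACv2. Its contributions include:
- Natural-language interface conversion: an observation-to-text adapter and a text-to-action adapter convert the original numeric environment into an interaction space that supports perceiving and acting in natural language.
- Communication channel: a natural-language communication mechanism between agents supports information sharing and tactical coordination while preserving partial observability.
- Adversarial communication scenarios: evaluation scenarios with a Deceptive Communicator quantify how well agents discern unreliable information and how their trust mechanisms hold up.
- Systematic evaluation framework: a standardized test protocol covering team sizes (5v5 and 10v10), communication modes (no communication / free communication / known deceiver / unknown deceiver), and reasoning structures (zero-shot / ReAct / internal chain of thought).
In short, the work aims to establish the first embodied multi-agent coordination benchmark for LLMs, bridging the evaluation gap between text-only multi-agent simulations and traditional MARL benchmarks, with particular attention to decentralized control, partial observability, long-horizon decision making, and deceptive communication.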
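The observation-to-text and text-to-action adapters described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `UnitObs` fields, the wording of the rendered text, and the small action grammar ("move <direction>", "attack <index>", "stop") are all hypothetical and do not reflect SMACv2's actual observation layout or SMAC-Talk's real interface.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class UnitObs:
    """Hypothetical numeric view of one visible unit (not SMACv2's layout)."""
    unit_type: str
    health: float  # fraction in [0, 1]
    rel_x: float   # offset from the observing agent
    rel_y: float
    is_ally: bool


def observation_to_text(agent_id: int, own_health: float,
                        visible: List[UnitObs]) -> str:
    """Render a numeric, partially observable view as a text prompt."""
    lines = [f"You are agent {agent_id}. Your health is {own_health:.0%}."]
    if not visible:
        lines.append("No units are within your sight range.")
    for u in visible:
        side = "Ally" if u.is_ally else "Enemy"
        lines.append(
            f"{side} {u.unit_type} at offset ({u.rel_x:+.1f}, {u.rel_y:+.1f}), "
            f"health {u.health:.0%}."
        )
    return "\n".join(lines)


DIRECTIONS = {"north", "south", "east", "west"}
Action = Tuple[str, Optional[Union[str, int]]]


def text_to_action(reply: str, n_enemies: int) -> Action:
    """Map a free-text reply onto a validated discrete action.

    Anything that does not parse, or that targets a non-existent enemy,
    falls back to ("stop", None) so the environment never sees an
    invalid action.
    """
    words = reply.strip().lower().split()
    if len(words) == 2 and words[0] == "move" and words[1] in DIRECTIONS:
        return ("move", words[1])
    if len(words) == 2 and words[0] == "attack" and words[1].isdecimal():
        target = int(words[1])
        if target < n_enemies:
            return ("attack", target)
    return ("stop", None)
```

Falling back to a safe no-op on unparseable replies is one possible design choice; a real benchmark might instead re-prompt the model or count the failure as a metric.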
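Q: What related work is there?
In Section 5, "Related Work", the paper discusses the following related research areas:
1. StarCraft II multi-agent benchmarks
- SMAC [14] and SMACv2 [3]: classic benchmarks for cooperative multi-agent reinforcement learning (MARL) that pose challenges such as decentralized control and partial observability.
- TextStarCraft II [12]: explores converting StarCraft II observations and actions into natural language to evaluate LLMs, but focuses on the macro-strategic level.
- **H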
Authors: Joel Sol, Homayoun Najjaran
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04202.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04202
Published: 2026-06-04T02:24:04.416Z
5. Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal
Abstract:Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.
中文摘要
摘要:多智能体系统通常通过投票、共识协议、辩论或容错聚合来减少分歧。我们认为,对于价值导向任务,这一目标是不够的,因为分歧可能反映出真正的规范性不确定性,而不是智能体的错误。在以往关于人机协作审查中推理轨迹分歧的研究基础上,我们提出了一种知识表示层,在该层中,推理轨迹和智能体决策被抽象为符号化的分歧状态。考虑到产生显性推理轨迹和二元决策的智能体,我们根据推理相似性和结论一致性区分四种状态:趋同一致、分歧一致、趋同分歧和分歧分歧。这些状态支持可辩解的战略路由规则。我们在内容审核中实例化该框架,并认为具备分歧感知的路由为子符号化的大语言模型(LLM)推理与多智能体战略推理的符号化知识表示之间提供了桥梁。
LLM Analysis
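Q: What problem does this paper try to solve?
The paper addresses a strategic shortcoming in how LLM-based multi-agent systems handle inter-agent disagreement in value-laden tasks.
Specifically, it targets the following core problems:
1. Limits of the consensus objective
Existing LLM multi-agent systems generally treat disagreement between agents as noise or a defect that must be eliminated through voting, debate, or aggregation algorithms. The paper argues that this design assumption is insufficient for value-laden tasks (such as content moderation, medical triage, and legal assistance), where disagreement may reflect genuine normative uncertainty or value pluralism rather than reasoning errors.
2. Representing and using the structure of disagreement
The paper reconceptualizes disagreement as a representable epistemic state rather than an obstacle to aggregation. The concrete approach includes:
- Building a knowledge-representation layer: agents' reasoning traces and decisions are abstracted into symbolic disagreement states.
- Establishing a four-state taxonomy: based on two dimensions, reasoning similarity and conclusion agreement, four symbolic states are defined:
  - Convergent Agreement (CA)
  - Divergent Agreement (DA)
  - Convergent Disagreement (CD)
  - Divergent Disagreement (DD)
3. Routing strategic meta-decisions
For these four states, the paper designs defeasible strategic routing rules that let the system reason at the meta level: deciding not only which decision to take but, more importantly, whether a decision should be made at all, whether to seek additional context, or whether to escalate to human judgment. In particular, the CD (convergent disagreement) state is identified as a strong signal of value conflict and by default triggers escalation rather than forced consensus.
In short, the paper attempts a paradigm shift **from "eliminating disagreement" to "understanding and using the structure of disagreement"**, providing an explicit interface between sub-symbolic LLM reasoning and symbolic strategic control for value-laden multi-agent decision making.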
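The four-state taxonomy and its defeasible routing can be sketched as a small lookup. This is a minimal illustration, not the paper's implementation: it assumes a reasoning-similarity score in [0, 1] is already available, the 0.7 threshold is arbitrary, and every default route except CD's escalation (which the summary above states) is a placeholder chosen for the example.

```python
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    CA = "convergent agreement"
    DA = "divergent agreement"
    CD = "convergent disagreement"
    DD = "divergent disagreement"


def classify(reasoning_similarity: float, decisions_agree: bool,
             threshold: float = 0.7) -> State:
    """Map (reasoning similarity, conclusion agreement) to one of four states.

    The threshold is an assumed cut-off between "convergent" and
    "divergent" reasoning; the source does not specify one.
    """
    convergent = reasoning_similarity >= threshold
    if decisions_agree:
        return State.CA if convergent else State.DA
    return State.CD if convergent else State.DD


# Default routes. Only CD -> escalation comes from the text above;
# the other three entries are illustrative assumptions.
DEFAULT_ROUTES: Dict[State, str] = {
    State.CA: "accept_decision",
    State.DA: "accept_with_audit",
    State.CD: "escalate_to_human",
    State.DD: "request_more_context",
}


def route(state: State, overrides: Optional[Dict[State, str]] = None) -> str:
    """Defeasible routing: a more specific override defeats the default."""
    if overrides and state in overrides:
        return overrides[state]
    return DEFAULT_ROUTES[state]
```

Modelling defeasibility as an override table is the simplest possible choice; a fuller treatment would attach conditions and priorities to each rule.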
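Q: What related work is there?
Based on the paper's citations, related work falls into the following core areas:
1. Multi-agent LLM systems and consensus mechanisms
Techniques that reduce inter-agent disagreement through debate, voting, or aggregation:
- Liang et al. (2024): work on encouraging divergent thinking through multi-agent debate
- **Chen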
Authors: Michał Wawer, Jarosław A. Chudziak
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04223.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04223
Published: 2026-06-04T02:24:04.416Z
6. VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark
Abstract:Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool’s output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.
中文摘要
摘要:多模态大语言模型在复杂推理方面的能力日益增强,但当它们必须通过工具将问题外化并基于工具的输出进行推理时,表现往往会下降,尤其是在依赖视觉辅助的情况下。这一差距尤为重要,因为实际的工程和科学工作流程通常依赖可视化工具进行分析、验证和决策。为了研究这一差异,我们引入了 VAMPS(视觉辅助数学问题求解),这是一个图形辅助数学的基准。VAMPS 包含 1,168 个多模态、双语的多项选择题问答对,这些题目抽自伊朗大学入学考试中的代数和微积分问题,并通过人工审核的 LLM 生成的合成变体进行了扩展,所有题目都经过挑选,使绘图能够通过显示交点、极值、渐近线等提供自然的解题策略。VAMPS 旨在用于基准测试和诊断,它超越了以往主要评估固定视觉输入推理的多模态基准,通过测试模型是否能够通过构建有用的图形并将答案基于生成的可视化中获益。总体而言,我们发现,在各种模型中,直接的解析解法在意想不到的情况下甚至优于依赖工具的视觉解法,即使在绘图是自然策略的问题上也是如此。
LLM Analysis
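Q: What problem does this paper try to solve?
The paper addresses a performance bottleneck of multimodal large language models (MLLMs) in tool-assisted visual mathematical reasoning, focusing in particular on the core challenge of the reasoning-to-perception handoff.
Specifically, it targets the following key problems:
1. The effectiveness paradox of tool-enabled visual reasoning
Prior research shows that humans often use visualization (such as plotting a function) to support analysis when solving math problems, yet current MLLMs often perform worse, not better, when using an external plotting tool (such as Desmos). The paper tries to explain and quantify this phenomenon: why models that could obtain direct visual evidence by plotting do worse than when solving analytically from text alone.
2. The evaluation gap for dynamic visual construction and interpretation
Existing multimodal math benchmarks (such as MathVista and MathVerse) mainly evaluate reasoning over fixed visual inputs and lack a systematic evaluation of the ability to generate visual evidence autonomously. The VAMPS benchmark fills this gap by requiring a model to:
- decide on its own which mathematical objects to plot
- interact with an external tool to generate the visualization
- reason over the generated visual evidence and give an answer
3. Diagnosing the reasoning-to-perception handoff bottleneck
The paper formalizes the "reasoning-to-perception handoff" as the key bottleneck: the model must complete the full chain from symbolic intent (algebraic formulation) to visual evidence to final answer. By comparing three solving paradigms (direct analytical solving, tool-enabled visual solving, and solving with a provided visualization), the paper tries to locate where failures occur:
- Tool-call failure (FM1)
- Generating an unhelpful graph (FM2)
- Correct graph but incorrect interpretation (FM3)
- Failure to switch between analytical and plotting modalities (FM4)
4. Evaluation in bilingual, real educational settings
The paper also addresses the under-representation of non-English languages (Persian in particular) in visual math reasoning benchmarks, building the first Persian-English bilingual benchmark for visual-assisted mathematical reasoning from real Iranian University Entrance Exam (Konkour) problems.
With the VAMPS benchmark, the paper's core finding is that current models are systematically fragile when reasoning mathematically over self-generated visual evidence: even on problems where plotting is the natural and preferred strategy for humans, direct analytical solving usually still outperforms tool-enabled visual solving.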
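To make the FM1 to FM4 failure modes above concrete, here is a hypothetical triage function that assigns one label to a failed tool-enabled attempt. The `Episode` fields and the order of the checks are assumptions made for illustration; the summary does not say how the paper attributes failures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    """Hypothetical per-problem annotations for one tool-enabled attempt."""
    answer_correct: bool
    tool_call_ok: bool         # the plotting tool was invoked without error
    switched_modality_ok: bool # the model moved between analysis and plotting as needed
    plot_useful: bool          # the plot actually shows the feature that answers the question


def failure_mode(ep: Episode) -> Optional[str]:
    """Return the first applicable failure label, or None if the answer is right.

    The check order (FM1, then FM4, then FM2, then FM3) is an assumption:
    it attributes a failure to the earliest stage of the pipeline that broke.
    """
    if ep.answer_correct:
        return None
    if not ep.tool_call_ok:
        return "FM1"  # tool-call failure
    if not ep.switched_modality_ok:
        return "FM4"  # failed to switch between analysis and plotting
    if not ep.plot_useful:
        return "FM2"  # unhelpful graph
    return "FM3"      # correct graph, incorrect interpretation
```

Counting these labels over a set of attempts would show which stage of the reasoning-to-perception handoff accounts for most errors.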
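Q: What related work is there?
Related work falls mainly into three areas: tool-augmented mathematical reasoning, visual/multimodal math benchmarks, and visual sketching and chart reasoning. A detailed breakdown follows:
1. Agentic and tool-augmented mathematical reasoning
| Research direction | Representative work | Relation to VAMPS |
|---|---|---|
| Code-assisted reasoning | • PAL (Gao et al., 2022): translates word problems into executable code• Program of Thoughts (Chen et al., 2023 |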
Authors: Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04244.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04244
Published: 2026-06-04T02:24:04.416Z
7. StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis
Abstract:Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.
中文摘要
摘要:由于长视野推理、多步依赖以及Verilog和VHDL中的严格正确性约束,数字硬件设计的RTL代码自动生成仍然具有挑战性。我们介绍了StepPRM-RTL,这是一个结合了逐步轨迹建模、过程-奖励建模(PRM)和检索增强微调(RAFT)的新框架,旨在提升基于LLM的RTL代码生成的功能正确性和推理准确性。StepPRM-RTL 从典型解构建逐步推理轨迹,每一步包含一个理由和增量代码修改。过程奖励模型(PRM)评估中间步骤,提供密集反馈,指导RAFT微调过程中的强化式更新。蒙特卡洛树搜索(MCTS)探索了替代推理路径,丰富了训练数据集的高质量轨迹。这种逐步奖励与结果感知奖励的整合使模型能够学习如何以及为何构建正确的RTL,从而提升长期推理能力,超越标准的监督式或基于结果的训练。对基准Verilog和VHDL数据集的实验评估表明,StepPRM-RTL在功能正确性和推理忠实度指标上比最佳先行方法高出超过10%以上。消融研究证实,PRM引导奖励与逐步轨迹探索的结合是其性能的关键。StepPRM-RTL 跨越多种 RTL 语言,提供一个可扩展的高保真度、可解释代码生成框架,确立了 LLM 辅助硬件设计自动化的新标准。
Authors: Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan, Vandana Mukherjee
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04246.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04246
Published: 2026-06-04T02:24:04.416Z
8. Can Generalist Agents Automate Data Curation?
Abstract:Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent execution-research gap: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes — without human design input — a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
中文摘要
摘要:策划训练数据是现代人工智能开发中最重要但又最费力的部分之一:从业者需要反复提出、实施、评估和修改数据策略,并根据嘈杂的基准反馈进行调整。我们探讨了通用代码代理是否能够自动化这一数据策划循环。我们引入了Curation-Bench,这是一个以代理为中心的基准,固定模型、训练配方和评估套件,同时为代理提供命令行访问权限,以便检查数据、实施策略、将其提交至固定的训练/评估流程并进行修订。在视觉-语言指令调优的实例中,现成的代理在十次迭代内就达到了强有力的已发布数据选择基线。然而,轨迹分析显示了持续存在的执行-研究差距:代理主要调整局部策略变体,而不是探索新的策略家族,即便提供了策略指南和论文参考。要求每次迭代引用、实例化并改编先前方法的支架机制,会使代理转向方法指导的探索。经过支架机制引导的代理能够自主组合——无需人工设计输入——出一种数据选择策略,在其数据量的十分之一下就超越了强有力的已发布基线。总体而言,当前代理可以运行策划循环,但可靠的数据研究需要方法支架的适应,而不仅仅是开放式提示。代码和基准已开源。
Authors: Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04261.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04261
Published: 2026-06-04T02:24:04.416Z
9. Characterizing initial human-AI proof formalization workflows
Abstract:For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems’ ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people’s ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people’s formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people’s preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.
中文摘要
摘要:几个世纪以来,人类数学家一直在撰写证明以支持他们的数学论证;然而,自动验证证明有效性的能力长期以来一直是一个挑战。人工智能系统在生成代码和进行日益高级的数学推理方面的能力进步,有望改变人们形式化证明并因此验证证明的能力。虽然许多研究集中于基准测试当前前沿,但我们转而研究人们如何使用这些工具。我们进行了混合方法分析,探讨人工智能对人们形式化工作流程的初步影响:人们声称他们想要什么,他们认为实现这些愿景的障碍是什么,以及他们在实际中如何使用和调整人工智能。一项定性调查显示,人们的偏好多种多样,但普遍希望人工智能在形式化过程中提供帮助,同时保持对证明发现过程的高级人类控制。为了评估人们在此类限制下实际使用人工智能进行形式化的方式,我们进行了受控用户研究,参与者在不同难度和领域的一系列数学问题中,有和没有使用人工智能的情况下,将非正式数学问题及其证明进行形式化。尽管当时用于自动形式化的工具存在局限,但参与者在允许使用人工智能工具时,通常比单独进行形式化时获得更高的形式化准确性,大多数参与者灵活地选择使用多种不同的人工智能工具。综合来看,我们的工作揭示了人工智能在形式化工作流程中早期整合的情况,其中涉及人类与人工智能互动的紧密结合。
Authors: Katherine M. Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou, Shanda Li, Sam Looi, Joshua B. Tenenbaum, Umang Bhatt, Adrian Weller, Jose Hernandez-Orallo, Cameron E. Freer, Valerie Chen, Ilia Sucholutsky
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04273.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04273
Published: 2026-06-04T02:24:04.416Z
10. The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents
Abstract:As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff’s alpha = +0.047; best pairwise Cohen’s kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector’s accuracy.
中文摘要
摘要:随着自主人工智能智能体从对话系统转向长期软件执行,决定何时中断代理的运行时安全层变得至关重要。我们使用连续18维情感动力学引擎(HEART)作为诊断探针,评估四种干预触发族——绝对状态阈值、复合状态-动作模式、正则表达式推理-特征提取和零样本LLM作为评判——并对SWE-bench-Verified调试痕迹中的人工注释干预点进行评估。我们报告三项发现。首先,状态饱和陷阱:代理在持续困难下无恢复信号,因此建模挫败感迅速跨越阈值并保持在最大值,将矩探测器的阈值触发器转换为近乎恒定的指示器,在五条轨迹中触发39-83%的动作。其次,LLM评判的能力与上下文底线:小型模型(gpt-5.4-mini)从未发射,而前沿和跨厂商模型只有在全轨迹上下文下才能突破零发射底线,且仅能达到F1 0.17-0.40,成本高达90倍。第三,也是最重要的,监督目标在人类中不可重复:三名受过训练的标注者使用一个评分标准,在56个动作轨迹中,仅在略高于偶然性的情况下达成共识(地点Krippendorff’s alpha = +0.047;最佳两对Cohen’s kappa = +0.349),而干预类型(暂停退化;在偶然下澄清;仅反映alpha = +0.226)则完全不一致。我们得出结论,干预时序是一个低可靠性构造,使得单标注符F1不适合作为优化目标。我们的贡献在于将该问题结合了人类间评级者可靠性、四种探测器架构、跨模型LLM裁判扫描和复现饱和效应的联合映射,而非单一探测器的准确性。
Authors: Manvendra Modgil
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2606.04296.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04296
Published: 2026-06-04T02:24:04.416Z
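Note on entry 10 above: the abstract reports inter-annotator agreement with Cohen's kappa and Krippendorff's alpha. As a minimal sketch of what the pairwise kappa figure measures, the Python below computes Cohen's kappa for two annotators who mark each action as "intervene" (1) or "do not intervene" (0). The function name `cohen_kappa` and the toy labels are illustrative assumptions, not data or code from the paper.

```python
import math
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere:
        # kappa is undefined (a degenerate case).
        return math.nan
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels over 8 actions (1 = intervene here).
annotator_a = [1, 0, 0, 1, 0, 0, 0, 0]
annotator_b = [1, 0, 0, 0, 0, 1, 0, 0]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this toy case the annotators agree on 6 of 8 actions (75%), yet chance agreement is already 62.5% because "do not intervene" dominates, so kappa is only about 0.33. This is why raw agreement can look high while chance-corrected agreement stays low.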
VLM Domain Papers
1. Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation
Abstract:In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models’ inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: this https URL.
中文摘要
摘要:在诸如机器人操作和导航等具身视觉-语言决策任务中,视觉-语言模型和视觉-语言-动作模型(VLMs 和 VLAs)是具有不同优势的强大工具:VLMs 更擅长长期规划,而 VLAs 更擅长反应控制。然而,它们的性能受制于同样的感知瓶颈:由于模型无法区分与任务相关的物体和干扰物,视觉幻觉会产生。原则上,准确识别并专注于关键物体,同时过滤掉无关物体,是突破这一限制的关键。一个直接的解决方案是一步聚焦:直接关注必要的物体。然而,这种方法被证明无效,因为有效的聚焦本质上需要对场景的深度理解。为此,我们提出了 SceneDiver,一种利用 VLMs 长期规划能力的粗到细聚焦计划生成方法,其首先构建整体场景图以建立初步理解,然后通过识别、理解和分析的迭代循环逐步将任务分解为更简单的子问题。为了实现反应控制,我们还设计了一个轻量级适配器,将深思熟虑的聚焦能力蒸馏到 VLAs 中。在标准具身 AI 基准测试中的评估表明,我们的方法显著减少了 VLMs 和 VLAs 的视觉幻觉,同时在需要快速执行的任务中保持了计算效率。我们的代码和数据已发布于:此 https URL。
Authors: Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04046.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04046
Published: 2026-06-04T02:31:15.881Z
2. Weakly Supervised Incremental Segmentation via Semantic Anchors and Spatial Arbitration
Abstract:Weakly Incremental Learning for Semantic Segmentation (WILSS) suffers from the continuous introduction of noisy supervision, which progressively corrupts class-level representations, leading to severe feature drift and semantic corruption, thereby causing newly learned classes to overwrite old ones. To address these issues, we propose a drift-resilient WILSS approach, named SASA, designed to stabilize semantic learning via Semantic Anchors and Spatial Arbitration. Specifically, at the representation level, we introduce semantic anchors of learnable tokens as rigid class-level references to preserve long-term semantic identity. Complementary to this, an elastic residual adaptation facilitates controlled, instance-specific refinement, ensuring a stable yet flexible learning trajectory. At the supervision level, we develop a Spatial Label Arbitration mechanism that performs geometry-aware decisions to directly filter unreliable signals and enforce a strict “one object, one class” constraint. By synergistically stabilizing representations and improving supervision reliability, SASA effectively mitigates feature drift under weak supervision. Extensive experiments on standard benchmarks demonstrate that our approach consistently outperforms existing state-of-the-art methods, particularly in challenging multi-step incremental settings. The code is available at this https URL.
中文摘要
摘要:用于语义分割的弱增量学习(WILSS)存在持续引入噪声监督的问题,这会逐步破坏类别级表示,导致严重的特征漂移和语义破坏,从而使新学习的类别覆盖旧类别。为了解决这些问题,我们提出了一种抗漂移的WILSS方法,命名为SASA,旨在通过语义锚点和空间仲裁稳定语义学习。具体来说,在表示层面,我们引入可学习的语义锚点作为刚性类别级参考,以保持长期的语义身份。与此互补,一个弹性残差适配机制促进受控的、实例特定的精细化,确保学习路径既稳定又灵活。在监督层面,我们开发了一种空间标签仲裁机制,该机制执行几何感知决策,直接过滤不可靠信号并强制执行严格的“一个物体,一个类别”约束。通过协同稳定表示并提高监督可靠性,SASA有效缓解了弱监督下的特征漂移。在标准基准上的大量实验表明,我们的方法在各类设置中始终优于现有的最先进方法,尤其是在具有挑战性的多步增量场景中。代码可在此https URL获取。
Authors: Zhonggai Wang, Kai Fang, Guangyu Gao
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04060.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04060
Published: 2026-06-04T02:31:15.881Z
3. Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning
Abstract:Large-scale web-harvested datasets have fueled the progress of cross-modal retrieval but inevitably suffer from noisy correspondence, which severely degrades model generalization. Existing methods primarily address this by filtering out noise or seeking a substitute label, yet they predominantly remain bound by a “Discrete Selection” paradigm. We argue that relying on a single discrete proxy induces Single-Point Fragility and Discretization Error. To overcome these limitations, we propose a novel framework, Intra-modal Neighbor-aware Noise Rectification (IN2R), which shifts the paradigm from searching for a substitute to synthesizing a reliable supervision target. Leveraging the intrinsic geometric stability of intra-modal data, IN2R employs a Graph Refiner to perform relational reasoning over neighbors retrieved from a dynamic Cross-Model Memory. Instead of propagating discrete labels, our method synthesizes a continuous, soft prototype that reflects the consensus of the local semantic neighborhood, effectively rectifying inter-modal misalignment. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN2R significantly outperforms state-of-the-art methods. Our code and pre-trained models are publicly available at this https URL.
中文摘要
摘要:大规模的网络收集数据集推动了跨模态检索的进展,但不可避免地存在噪声对应关系,这严重削弱了模型的泛化能力。现有方法主要通过过滤噪声或寻找替代标签来解决这一问题,但它们大多仍受制于“离散选择”范式。我们认为,依赖单一离散代理会引发单点脆弱性和离散化误差。为克服这些限制,我们提出了一种新框架——模态内邻居感知噪声校正(IN2R),它将范式从寻找替代目标转变为合成可靠的监督目标。IN2R利用模态内数据的内在几何稳定性,采用图精炼器对从动态跨模型记忆中检索到的邻居进行关系推理。我们的方法不再传播离散标签,而是合成连续的、软性的原型,反映局部语义邻域的共识,有效校正模态间的不匹配。在Flickr30K、MS-COCO和CC152K上的大量实验表明,IN2R显著优于最先进的方法。我们的代码和预训练模型可公开获取,网址为此https链接。
Authors: Yang Liu, Wentao Feng, Shu-Dong Huang, Yalan Ye, Jiancheng Lv
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04061.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04061
Published: 2026-06-04T02:31:15.881Z
4. Optimal Transport Flow Matching by Design
Abstract:Flow matching models learn to transport samples from a simple prior distribution to a complex data distribution. When prior-data pairs are coupled via optimal transport (OT), the learned trajectories are straight and non-crossing, enabling fast, even single-step, generation. However, computing the OT coupling in high dimensions is intractable, and existing methods attempt to solve the OT problem, at the cost of persistent bias or significant overhead. Rather than solving for the OT coupling, we reformulate the problem. Once the prior is treated as a design choice rather than a fixed input, the OT coupling between prior and data is no longer unique. Many priors admit an OT-optimal identity coupling to the data, leaving us free to choose one that is also tractable to sample. We identify low-frequency projection of natural images as such a choice. The identity coupling between data and its low-frequency representation is empirically OT-optimal, the prior is structured enough to be sampled by a lightweight model at inference, and the remaining flow-matching task reduces to synthesizing high-frequency detail. Interpolating the prior with Gaussian noise further improves generation quality while preserving the OT coupling. The approach requires no modifications to the flow model itself, and integrates naturally with latent-space models, classifier-free guidance, and one-step generation frameworks. Across all benchmarks, our method reduces trajectory curvature by more than a factor of 2 compared to existing flow matching methods, yielding better generation quality in the few-step regime.
中文摘要
摘要:流匹配模型学习将样本从简单的先验分布传输到复杂的数据分布。当先验-数据对通过最优传输(OT)耦合时,学习到的轨迹是直线且不交叉的,使得快速、甚至单步生成成为可能。然而,在高维度下计算OT耦合是不可行的,现有方法试图解决OT问题,但代价是持续的偏差或显著的额外开销。我们不是求解OT耦合,而是重新表述问题。一旦将先验视为一种设计选择而非固定输入,先验与数据之间的OT耦合就不再唯一。许多先验允许与数据存在OT最优的恒等耦合,从而使我们可以自由选择一种在采样上可行的先验。我们将自然图像的低频投影确定为这样的选择。数据与其低频表示之间的恒等耦合在经验上是OT最优的,先验结构足够简单,可由轻量级模型在推理时采样,其余的流匹配任务则简化为合成高频细节。将先验与高斯噪声插值进一步提高了生成质量,同时保持了OT耦合。该方法无需对流模型本身进行修改,并且可以自然地与潜在空间模型、无分类器引导以及单步生成框架集成。在所有基准测试中,我们的方法相比现有流匹配方法将轨迹曲率降低了超过2倍,在少步生成场景下实现了更好的生成质量。
Authors: Shimon Malnick, Matan Rusanovsky, Ohad Fried, Shai Avidan
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04092.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04092
Published: 2026-06-04T02:31:15.881Z
5. When Seeing Is Not Believing — A Benchmark for Search-Grounded Video Misinformation Detection
Abstract:Video misinformation increasingly operates at the semantic and evidential level: authentic footage may be selectively edited, temporally reordered, spliced across sources, or augmented with AI-generated content to construct false narratives. Such evidence-dependent manipulations cannot be reliably verified from the input video alone, because the missing, reordered, replaced, or recontextualized evidence lies outside the video itself. We introduce EVID-Bench, a benchmark for search-grounded video misinformation detection, where a system must search the open web for related videos and identify what information is false through cross-video comparison. EVID-Bench comprises 222 videos spanning 9 manipulation types across 3 categories: AI generation, single-source editing, and multi-source editing. All samples are verified to be undetectable by frontier models through visual inspection alone. We evaluate nine frontier multimodal models using a retrieval-augmented verification baseline. The best system achieves only 61.43% point-level accuracy and 43.24% video-level accuracy, while AI-generated manipulations remain especially challenging. Error analysis reveals recurring challenges: models fixate on irrelevant anchors, misattribute synthetic content to editorial splicing, and terminate search prematurely before fully explaining the manipulation.
中文摘要
摘要:视频虚假信息越来越多地在语义和证据层面运作:真实的视频可能被选择性编辑、时间顺序调整、跨来源拼接,或用AI生成的内容增强以构建虚假叙事。这种依赖证据的操控无法仅凭输入视频可靠验证,因为缺失、重新排序、替换或重新语境化的证据存在于视频本身之外。我们引入了 EVID-Bench,这是一个基于搜索的视频错误信息检测基准,系统必须在开放网络中搜索相关视频,并通过跨视频比较识别哪些信息是虚假的。EVID-Bench包含222个视频,涵盖9种操作类型,涵盖三大类别:AI生成、单源剪辑和多源剪辑。所有样品均通过目视检测被前沿模型验证为无法检测。我们使用检索增强验证基线评估了九个前沿多模态模型。最好的系统仅能达到61.43%的点级准确率和43.24%的视频级精度,而AI生成的操作依然异常具有挑战性。错误分析揭示了反复出现的挑战:模型专注于无关锚点,错误地将合成内容归因于编辑剪接,并在未充分解释操作前提前终止搜索。
Authors: Tao Yu, Yujia Yang, Shenghua Chai, Zhang Jinshuai, Haopeng Jin, Hao Wang, Minghui Zhang, Zhongtian Luo, Yuchen Long, Xinlong Chen, Jiabing Yang, Zhaolu Kang, Yuxuan Zhou, Zhengyu Man, Xinming Wang, Hongzhu Yi, Zheqi He, Xi Yang, Yan Huang, Liang Wang
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04098.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04098
Published: 2026-06-04T02:31:15.881Z
6. Reflection Separation from a Single Image via Joint Latent Diffusion
Abstract:Single-image reflection separation is highly challenging under extreme conditions like glare or weak reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios because of insufficient information. This paper presents a diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods on multiple real-world benchmarks. Project page: this https URL
中文摘要
摘要:在眩光或微弱反射等极端条件下,单张图像反射分离是非常具有挑战性的。现有方法在眩光或微弱反射场景下通常难以同时恢复两层图像,因为信息量不足。本文提出了一种专门针对该任务微调的扩散模型,利用生成性扩散先验实现稳健的分离。我们的方法通过统一的扩散模型同时生成透射层和反射层,并引入了一种新的跨层自注意机制以实现更好的特征解缠。我们进一步提出了一种独立采样策略,以在扩散过程中迭代减少层间干扰,并通过带有学习组合函数的潜在优化步骤提升在复杂真实场景中的结果。大量实验表明,我们的方法在多个真实世界基准上超越了现有最先进方法。项目页面:此 https URL
Authors: Zheng-Hui Huang, Zhixiang Wang, Yu-Lun Liu, Yung-Yu Chuang
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04107.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04107
Published: 2026-06-04T02:31:15.881Z
7. Pinpoint: Grounded Worldwide Image Geolocation via Cross-Source Retrieval and Reranking
Abstract:Image geolocation aims to estimate where a photograph was taken from its visual content. At worldwide scale, this remains challenging because visual evidence is often ambiguous, diverse, and unevenly distributed. Prior work has typically treated geolocation of ordinary internet photos and street-view imagery as separate tasks, despite their complementary strengths: internet photos better match the appearance distribution of user-captured queries, while street-view imagery provides denser, geographically grounded coverage. We present Pinpoint, a retrieve-and-rerank architecture that combines both sources in a coarse-to-fine pipeline. A contrastive image-GPS embedder is trained on both user-uploaded Flickr photos and street-view imagery, learning a shared image-GPS embedding space that is used to retrieve candidate locations. An attention-based reranker then rescores retrieved candidates by combining candidate-level visual and GPS features with cross-source evidence from nearby locations to ground the prediction. Unlike recent prior work, Pinpoint does not rely on multimodal large-language models, making inference faster and more reproducible. Pinpoint achieves state-of-the-art results across all metrics on standard benchmarks for internet photos (IM2GPS3k and YFCC4k) and street-view imagery (OSV-5M).
中文摘要
摘要:图像地理定位旨在从视觉内容估计照片拍摄的位置。在全球范围内,这仍然具有挑战性,因为视觉证据通常具有模糊性、多样性和分布不均性。以往的工作通常将普通互联网照片和街景图像的地理定位作为独立任务来处理,尽管它们各自具有互补的优势:互联网照片更好地匹配用户拍摄查询的外观分布,而街景图像提供更密集、地理上有据可依的覆盖。我们提出了 Pinpoint,一种结合两种来源的提取-重排架构,采用粗到细的处理流程。对比图像-GPS嵌入器在用户上传的 Flickr 照片和街景图像上进行训练,学习共享的图像-GPS嵌入空间,用于检索候选位置。然后,基于注意力的重排器通过将候选级视觉和 GPS 特征与来自附近位置的跨源证据结合,对检索到的候选进行重新评分,以确定最终预测。与近期的相关工作不同,Pinpoint 不依赖多模态大语言模型,使推理更快且更具可重复性。Pinpoint 在标准基准测试中对互联网照片(IM2GPS3k 和 YFCC4k)以及街景图像(OSV-5M)在所有指标上实现了最先进的结果。
Authors: Nika Chuzhoy, Brian Hu, Amit A. Arora, Jae Ro, Sarthak S. Sahu
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04133.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04133
Published: 2026-06-04T02:31:15.881Z
8. End-to-End Text Line Detection and Ordering
Abstract:Practical text-recognition pipelines for historical documents typically decompose layout analysis into line detection followed by a separate reading-order step, with the latter most often handled by a hand-coded geometric heuristic that struggles with marginalia, multiple columns, tables, and source-specific editorial conventions. This article introduces Orli (Ordered Regression of Lines), an end-to-end model that casts both sub-tasks as a single image-to-sequence problem: from a page image, Orli autoregressively generates text-line baselines directly in reading order. Baselines are represented in a chord-frame parameterization that anchors a line’s position, orientation, and extent while encoding local geometry through perpendicular offsets; an iterative refinement head and a local visual refiner produce the final curve. Trained on a heterogeneous corpus of 196,691 pages spanning ten writing systems, Orli marginally exceeds the previously reported state of the art for cBAD line detection without dataset-specific training, reaches near perfect coverage and ordering on multiple reading-order benchmarks zero-shot, and adapts to more specialized out-of-domain layouts with limited fine-tuning. The method’s source code and model weights are available under an open license at this https URL.
中文摘要
摘要:用于历史文献的实用文本识别流程通常将版面分析分解为行检测,随后进行单独的阅读顺序步骤,后者通常由手工编码的几何启发式处理,但在处理边注、多栏、表格和特定来源的编辑规范时表现不佳。本文介绍了 Orli(Ordered Regression of Lines,有序行回归),一种端到端模型,将两个子任务统一视为单个图像到序列的问题:从页面图像出发,Orli 自回归地直接按阅读顺序生成文本行基线。基线以弦框架参数化表示,锚定行的位置、方向和长度,同时通过垂直偏移编码局部几何;迭代优化头和局部视觉优化器生成最终曲线。在包含十种书写系统、共 196,691 页的异质语料上训练后,Orli 在未经数据集特定训练的情况下,略微超过了先前 cBAD 行检测的最新水平,在多个阅读顺序基准上实现了几乎完美的覆盖和排序,并能够通过有限微调适应更专业的跨域版面布局。该方法的源代码和模型权重在开放许可下通过此 https URL 提供。
Authors: Benjamin Kiessling
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04166.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04166
Published: 2026-06-04T02:31:15.881Z
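Note on entry 8 above: the abstract describes a chord-frame parameterization that anchors a line's position, orientation, and extent and encodes local geometry through perpendicular offsets. The sketch below shows one plausible reading of that idea, not the paper's actual formulation: the chord is the segment from the first to the last baseline point, and every point is stored as a normalized position along the chord plus a signed perpendicular offset. The function names and dictionary keys are hypothetical.

```python
import math


def encode_chord_frame(points):
    """Encode a baseline polyline relative to its first-to-last chord."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("first and last points must differ")
    ux, uy = dx / length, dy / length   # unit vector along the chord
    nx, ny = -uy, ux                    # unit normal to the chord
    offsets = []
    for x, y in points:
        rx, ry = x - x0, y - y0
        t = (rx * ux + ry * uy) / length  # normalized position along chord
        d = rx * nx + ry * ny             # signed perpendicular offset
        offsets.append((t, d))
    return {
        "anchor": (x0, y0),               # position
        "angle": math.atan2(dy, dx),      # orientation
        "length": length,                 # extent
        "offsets": offsets,               # local geometry
    }


def decode_chord_frame(frame):
    """Reconstruct the polyline from its chord-frame encoding."""
    x0, y0 = frame["anchor"]
    ux, uy = math.cos(frame["angle"]), math.sin(frame["angle"])
    nx, ny = -uy, ux
    length = frame["length"]
    return [
        (x0 + t * length * ux + d * nx, y0 + t * length * uy + d * ny)
        for t, d in frame["offsets"]
    ]


# Hypothetical slightly curved baseline.
baseline = [(0.0, 0.0), (5.0, 1.0), (10.0, 0.0)]
frame = encode_chord_frame(baseline)
restored = decode_chord_frame(frame)
```

Under this reading, translating, rotating, or scaling a line changes only the anchor, angle, and length, while the offsets describe curvature on their own. That separation is one reason such a representation may be convenient for a regression head.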
9. GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs
Abstract:True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.
中文摘要
摘要:真正的通用智能不仅需要物理世界的模型,还需要社会世界的模型:能够推断个体心理状态如何相互作用并结晶为群体层面的结果。尽管在个体层面的心智理论(ToM)推理方面取得了显著进展,但现有的多模态大型语言模型在这一更广泛的任务上仍然失败。集体行为从社会紧张、从众动态和结构性约束中非线性地产生,这意味着仅仅汇总个体意图无法恢复集体行为。我们提出了GroupToM-Bench,这是第一个针对群体层面心智理论的多模态基准,其构建围绕一条因果链,从微观层面的BDI状态(信念、欲望、意图)、中观层面的群体紧张和结构性约束,到宏观层面的结果预测和机制性归因。为了探测这一完整链条,我们开发了一个七级认知审计框架。实验显示,当前模型与人类基准之间存在差距,突显了处理社会结构和非线性集体动态的失败。
Authors: Weidong Tang, Jierui Li, Yueling Hou, Zihan Mei, Can Zhang, Xinyan Wan, Zhiyuan Liang, Pengfei Zhou, Yang You, Wangbo Zhao
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04184.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04184
Published: 2026-06-04T02:31:15.881Z
10. Spatial Artifact Coherence Determines Codec Robustness in Patch-Based rPPG
Abstract:Remote photoplethysmography (rPPG) achieves low heart-rate error on uncompressed benchmarks yet is deployed over compressed video channels in telehealth, neonatal ICU, and driver fatigue applications. No prior work identifies the physical quantity determining when spatial decomposition outperforms global-projection methods under codec compression. We propose Spatial Artifact Coherence (SAC), defined as the ratio of off-diagonal to diagonal energy in the 4x4 inter-patch Green-channel covariance matrix (bandpass 0.75-2.5 Hz), and the PatchPCA algorithm family (four codec-aware rPPG algorithms). We evaluate 280 subjects across three public datasets, 11 codec degradation variants (MPEG-4, H.265, H.264, JPEG, chroma subsampling), and 13 algorithms via Wilcoxon tests (BH-FDR, q < 0.05, 904 tests). SAC explains 93.8% of between-variant variance in PCA advantage (r = +0.969), with zero overlap between codec families: non-MPEG-4 variants cluster at SAC 0.10-0.18 with 84-90% PCA win rates, while MPEG-4 variants cluster at SAC 0.48-0.59 with 61% win rate and a 5.8x reduction in mean improvement. Within subjects, 78% confirm the expected pattern (p < 10^-22, dz = 0.73). Within-variant subject-level SAC correlation is r = +0.099, confirming SAC classifies codec families rather than predicting individual outcomes. MPEG-4’s effect is structural (macroblock DCT geometry, not noise amplitude), governed by source codec state, not resolution. P-Hybrid is identified as the most deployment-robust algorithm. Two necessary operating conditions for PatchPCA advantage are established: SAC < 0.30 and low-to-moderate motion, directly ruling out raw-to-MPEG-4 transcoding pipelines. SAC provides a physically grounded metric for codec-aware rPPG algorithm selection in clinical remote monitoring systems.
中文摘要
摘要:远程光电容描写(rPPG)在无压缩基准测试上实现低心率误差,同时在远程医疗、新生儿重症监护和驾驶员疲劳应用中通过压缩视频通道部署。此前没有研究确定空间分解在编解码器压缩下何时优于全局投影方法的物理量。我们提出了空间伪影相干性(SAC),定义为4x4补丁间Green通道协方差矩阵中离对角能量与对角能量的比值(带通0.75-2.5 Hz),以及PatchPCA算法家族(四个编码器感知rPPG算法)。我们通过Wilcoxon检验(BH-FDR,q < 0.05,904次测试)评估了280名受试者,涵盖11种编解码器降解变体(MPEG-4、H.265、H.264、JPEG、色度子采样)。SAC解释了PCA优势中变体间93.8%的差异(r = +0.969),编解码器家族间零重叠:非MPEG-4变体聚集在SAC 0.10-0.18,PCA胜率为84-90%;MPEG-4变体聚集在SAC 0.48-0.59,胜率61%,平均提升减少5.8倍。在受试者中,78%的人确认了预期模式(p < 10^-22,dz = 0.73)。变量内受试者层面SAC相关性为r = +0.099,证实SAC是分类编解码家族,而非预测个体结局。MPEG-4的效应是结构性的(宏块DCT几何体,而非噪声幅度),受源编解码器状态控制,而非分辨率。P-Hybrid被认为是部署最稳健的算法。PatchPCA优势设定了两个必要操作条件:SAC < 0.30和低至中等运动,直接排除了原始转MPEG-4转码管道。SAC为临床远程监测系统中基于编解码器的rPPG算法选择提供了物理基础的指标。
Authors: Achraf Ben Ahmed
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2606.04198.pdf
CoolPaper URL: https://papers.cool/arxiv/2606.04198
Published: 2026-06-04T02:31:15.881Z
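Note on entry 10 above: the abstract defines Spatial Artifact Coherence (SAC) as the ratio of off-diagonal to diagonal energy in a 4x4 inter-patch Green-channel covariance matrix. The sketch below is a rough illustration under stated assumptions: the inputs are four per-patch Green-channel time series that have already been bandpass filtered (the 0.75-2.5 Hz filter is omitted), and "energy" is taken to be the sum of squared matrix entries. The paper may normalize differently, so values from this sketch are not comparable to the thresholds it reports. The function names are hypothetical.

```python
def covariance_matrix(signals):
    """Sample covariance matrix of equal-length 1-D signals."""
    n = len(signals[0])
    if n < 2 or any(len(s) != n for s in signals):
        raise ValueError("signals must share a length of at least 2")
    means = [sum(s) / n for s in signals]
    centered = [[v - m for v in s] for s, m in zip(signals, means)]
    k = len(signals)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
         for j in range(k)]
        for i in range(k)
    ]


def spatial_artifact_coherence(patch_signals):
    """Off-diagonal over diagonal energy of the inter-patch covariance.

    Assumption: energy means the sum of squared entries.
    """
    cov = covariance_matrix(patch_signals)
    k = len(cov)
    diag = sum(cov[i][i] ** 2 for i in range(k))
    off = sum(cov[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
    if diag == 0.0:
        raise ValueError("all patch signals are constant")
    return off / diag


# Hypothetical extremes: mutually uncorrelated patches vs. identical patches.
uncorrelated = [
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
]
identical = [uncorrelated[0]] * 4
sac_low = spatial_artifact_coherence(uncorrelated)
sac_high = spatial_artifact_coherence(identical)
```

With these definitions, fully uncorrelated patches give a ratio of 0 and four identical patches give 3 (twelve off-diagonal entries over four diagonal ones). Artifacts that are coherent across patches therefore push the ratio up, which matches the abstract's intuition that spatially coherent codec artifacts erode the advantage of patch-wise decomposition.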