ArXiv Domain 2026-05-28

Data source: ArXiv Domain

LLM Domain Papers

1. ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

Abstract:Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.

Authors: Zhipeng Bian, Jieming Zhu, Qijiong Liu, Wang Lin, Guohao Cai, Zhaocheng Du, Jiacheng Sun, Zhou Zhao, Zhenhua Dong

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27374.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27374

Published: 2026-05-28T02:19:07.630Z

2. LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

Abstract:Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model’s own over-optimization. To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and an evolutionary sampling module, which employs LLM-based crossover and mutation to constrain the model’s actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.

Authors: Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, Hang Su

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27375.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27375

Published: 2026-05-28T02:19:07.630Z

3. Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

Abstract:While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
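
The inter-utterance interpolation described in this abstract (a direction vector between two contrastive style prompts, followed by simple interpolation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 3-dimensional toy embeddings, and the `alpha` parameter are all assumptions made for the example, and a real system would take the embeddings from the TTS model's prompt encoder.

```python
from typing import List

Vector = List[float]


def style_direction(emb_a: Vector, emb_b: Vector) -> Vector:
    """Direction from style prompt A to the contrastive style prompt B."""
    return [b - a for a, b in zip(emb_a, emb_b)]


def interpolate_style(base: Vector, direction: Vector, alpha: float) -> Vector:
    """Move the base embedding a fraction `alpha` along the direction."""
    return [x + alpha * d for x, d in zip(base, direction)]


# Toy embeddings, made up for illustration only.
low_pitch = [0.0, 1.0, 2.0]
high_pitch = [1.0, 1.0, 0.0]

direction = style_direction(low_pitch, high_pitch)
for alpha in (0.0, 0.5, 1.0):
    # alpha=0.0 reproduces low_pitch, alpha=1.0 reproduces high_pitch.
    print(alpha, interpolate_style(low_pitch, direction, alpha))
```

Intermediate values of `alpha` give embeddings between the two prompts, which is what would allow a continuous change in an attribute such as pitch or speed.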

Authors: Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27376.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27376

Published: 2026-05-28T02:19:07.630Z

4. RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge

Abstract:We present RAG-Coding, an agentic method for automated ICD-10-CM coding. RAG-Coding orchestrates four large language model (LLM) agents and grounds their coding decisions in external knowledge sources (e.g., the official coding tabular list and guidelines). By retrieving and cross-referencing relevant knowledge in these sources, the agents enhance coding accuracy and ensure clinical compliance. On the MDACE dataset, RAG-Coding outperforms the best LLM-based baseline by 8-13% in micro-F1 and 2-8% in macro-F1 across multiple LLM backbones. Compared to the state-of-the-art pretrained language model method, PLM-ICD, RAG-Coding exhibits higher micro recall (+11%), while PLM-ICD exhibits higher micro precision (+6%), yielding comparable micro- and macro-F1. Ablations show stepwise gains, highlighting the importance of incorporating external knowledge. We also release MDACE-2025, updating the original dataset with expert re-annotations under the latest 2025 ICD-10-CM guidelines. This update features more fine-grained code labels and enables evaluation against current clinical standards.
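
Because the abstract reports both micro-F1 and macro-F1, a short worked example of how the two differ for multi-label code assignment may help. This is a generic sketch, not the paper's evaluation script: the three toy documents and their code sets are invented, and the macro average here is taken over every code that appears in either the gold or the predicted sets, which is one common convention and an assumption on my part.

```python
from collections import Counter
from typing import List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in g & p:
            tp[code] += 1  # predicted and correct
        for code in p - g:
            fp[code] += 1  # predicted but not in gold
        for code in g - p:
            fn[code] += 1  # in gold but missed
    labels = set(tp) | set(fp) | set(fn)
    # Micro: pool counts over all codes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per code, then average with equal weight per code.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro


# Invented toy data: one set of ICD-10-CM codes per document.
gold = [{"E11.9", "I10"}, {"I10"}, {"J45.909"}]
pred = [{"E11.9", "I10"}, {"I10", "E11.9"}, set()]

micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")  # micro-F1=0.750 macro-F1=0.556
```

The missed rare code pulls macro-F1 down much more than micro-F1, which is why the two numbers can move differently across systems.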

Authors: Yidong Gan, David D. Nguyen, Yang Lin, Peter Zhong, Thanh Vu, Long Duong, Yuan-Fang Li

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27377.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27377

Published: 2026-05-28T02:19:07.630Z

5. OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

Abstract:Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models’ multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at this https URL.

Authors: Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27378.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27378

Published: 2026-05-28T02:19:07.630Z

6. BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

Abstract:Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on predominantly English aliases in the KB, leading to poor generalization to unseen non-English mentions and limited context-aware disambiguation. We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and candidate, eliminating the need for supervised training. Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2, with especially large gains for low-resource languages, e.g., +21.6 on Turkish, +22.1 on Korean, +30.8 on Thai, and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.
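
The two-stage shape described in this abstract (alias-based candidate retrieval, then context-aware ranking) can be sketched with toy stand-ins. Everything concrete below is an assumption for illustration: the three-entry `KB` and its `C00x` identifiers are invented, character-trigram overlap stands in for the SapBERT-based retriever, and the `scorer` callable stands in for the LLM ranker. None of it reflects the paper's actual models or prompts.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy knowledge base: identifier -> multilingual aliases.
KB: Dict[str, List[str]] = {
    "C001": ["heart attack", "myocardial infarction", "Herzinfarkt"],
    "C002": ["heart failure", "cardiac failure", "Herzinsuffizienz"],
    "C003": ["headache", "cephalalgia", "Kopfschmerz"],
}


def trigrams(text: str) -> Set[str]:
    """Character trigrams of the lowercased, padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def alias_similarity(mention: str, alias: str) -> float:
    """Jaccard overlap of character trigrams (stand-in for an embedding score)."""
    a, b = trigrams(mention), trigrams(alias)
    return len(a & b) / len(a | b)


def retrieve(mention: str, k: int) -> List[str]:
    """Stage 1: score each KB entry by its best-matching alias, keep the top k."""
    best = {
        cid: max(alias_similarity(mention, alias) for alias in aliases)
        for cid, aliases in KB.items()
    }
    return sorted(best, key=best.get, reverse=True)[:k]


def rerank(context: str, candidates: List[str],
           scorer: Callable[[str, str], float]) -> str:
    """Stage 2: return the candidate that the context-aware scorer prefers."""
    return max(candidates, key=lambda cid: scorer(context, cid))


candidates = retrieve("Herzinfarkt", k=2)
print(candidates[0])  # C001, because the mention matches one of its aliases exactly
```

In a real pipeline the `scorer` would wrap a call to a pre-trained LLM that sees the mention context and each candidate's description; here it is left as a parameter so the retrieval and ranking stages stay decoupled.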

Authors: Yi Wang, Corina Dima, Liangyu Zhong, Steffen Staab

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27380.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27380

Published: 2026-05-28T02:19:07.630Z

7. Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Abstract:Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.

Authors: Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, Xiaoyu Shen

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27383.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27383

Published: 2026-05-28T02:19:07.630Z

8. From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

Abstract:Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT-style checkpoints, circumventing the need for massive pre-training. Furthermore, we introduce Elastic Horizons, an entropy-driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state-of-the-art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at this https URL.
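
The abstract does not spell out how Elastic Horizons maps entropy to a denoising stride, so the following is only a guess at the general shape of such a mechanism. The assumptions are mine: entropy is the Shannon entropy of the model's predicted token distributions, it is normalized by the maximum possible entropy, and confident (low-entropy) regions get a larger stride while uncertain regions get a smaller one. The function names and the `min_stride`/`max_stride` parameters are hypothetical.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def elastic_stride(dists: List[Sequence[float]],
                   min_stride: int = 1, max_stride: int = 8) -> int:
    """Pick a stride from the mean normalized entropy of a local window.

    Assumed mapping: normalized entropy 0 -> max_stride, 1 -> min_stride.
    """
    vocab_size = len(dists[0])
    max_entropy = math.log(vocab_size)
    mean_entropy = sum(shannon_entropy(d) for d in dists) / len(dists)
    # Clamp to [0, 1] to absorb floating-point rounding.
    normalized = min(1.0, max(0.0, mean_entropy / max_entropy))
    return round(max_stride - normalized * (max_stride - min_stride))


confident = [[1.0, 0.0, 0.0, 0.0]] * 3      # model is certain: take a long stride
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3  # model is unsure: take a short stride
print(elastic_stride(confident), elastic_stride(uncertain))  # 8 1
```

A fixed schedule would use the same stride everywhere; the point of an entropy-driven rule is that the stride varies with how much information each region of the sequence carries.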

Authors: Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27387.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27387

Published: 2026-05-28T02:19:07.630Z

9. Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

Abstract:Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the “thick descriptions” (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest—validated through human-AI collaboration—our diagnosis reveals a persistent “realism gap”: steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.

Authors: Nuan Wen, Xuezhe Ma

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27388.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27388

Published: 2026-05-28T02:19:07.630Z

10. EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

Abstract:Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this overhead, they suffer from precipitous drops in acceptance rate in specialized domains or topic-switching scenarios due to their inability to capture dynamic distribution shifts. To address this, we introduce EvoSpec, a framework that enables real-time evolution of the draft model through dynamic vocabulary and parameter adaptation. Unlike static or purely retrieval-based approaches, EvoSpec employs a context-aware mechanism that retrieves critical long-tail tokens via efficient semantic and statistical indexing. Furthermore, we propose a lightweight online alignment strategy utilizing curriculum learning to continually minimize the distributional gap between the draft and target models. Extensive evaluations across specialized domains (coding, law, and medicine) confirm that EvoSpec overcomes the limitations of static baselines. On EAGLE-3, it achieves a 1.13x speedup in these settings over the state-of-the-art static baseline FR-Spec, with 27% lower memory overhead than standard online adaptation.
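
For readers unfamiliar with the draft-then-verify paradigm this abstract builds on, here is a minimal greedy sketch of one speculative decoding step. It illustrates the generic paradigm only, not EvoSpec's vocabulary or parameter adaptation. The two toy "models" are lookup functions over fixed token lists, invented for the example, and verification is written as sequential calls for clarity, whereas a real system would check all drafted tokens in a single target forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]


def speculative_step(prefix: List[str], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[str]:
    """Return the tokens emitted by one draft-then-verify step (greedy variant)."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # Verify phase: keep drafted tokens while they match the target's choice.
    accepted: List[str] = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch, then stop
            return accepted
        accepted.append(token)

    # Every drafted token was accepted: the target adds one more token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models: each returns the next token of a fixed sentence by position.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]
DRAFT = ["the", "cat", "sat", "in", "the", "mat"]


def target_next(ctx: List[str]) -> str:
    return TARGET[len(ctx)] if len(ctx) < len(TARGET) else "<eos>"


def draft_next(ctx: List[str]) -> str:
    return DRAFT[len(ctx)] if len(ctx) < len(DRAFT) else "<eos>"


print(speculative_step([], draft_next, target_next, k=4))
# ['the', 'cat', 'sat', 'on']: three drafted tokens accepted, the fourth corrected
```

The acceptance rate mentioned in the abstract corresponds to how many drafted tokens survive the verify phase; a draft model that drifts from the target in a new domain gets fewer tokens accepted per step.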

Authors: Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2605.27390.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27390

Published: 2026-05-28T02:19:07.630Z

Agent Domain Papers

1. Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

Abstract:As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the tasks of conceptualising from detecting human values, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

Authors: Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27373.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27373

Published: 2026-05-28T02:25:24.054Z

2. Soro: A Lightweight Foundation Model and Chatbot for Tajik

Abstract:We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

Authors: Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shaymardonov, Shuhratjon Khalitbekov, Bonu Boboeva

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27379.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27379

Published: 2026-05-28T02:25:24.054Z

3. On the Origin of Synthetic Information by Means of Steganographic Inheritance

Abstract:The origin of species has been the mystery of mysteries in natural science. By analogy, the origin of synthetic information, we suggest, is the mystery of mysteries in information science. The question carries a moral weight that a technical account can neither fully resolve nor responsibly ignore, as its impact on truth, trust, and human intellect extends deep into the broader economy and society. The very power of artificial intelligence makes the evolutionary lineage of synthetic information grow ever harder to trace, for a sufficiently capable model may generate offspring that bear little resemblance, at either the structural or signal level, to the parent source from which they were derived. As in genetics, two individuals may share the same phenotype mirroring each other in outward appearance, yet differ fundamentally in their genotype. We propose, by means of steganography, a mechanism analogous to heredity. At the moment an offspring is reproduced, a projector derives a trait from the parent, and a steganographic encoder invisibly hides it within the offspring. This trait persists throughout the offspring’s life cycle in a cyber ecosystem. When parentage is queried, a steganographic decoder extracts the trait from the offspring and compares it against the traits of candidate parents in a reference pool, thereby nominating the most likely one. A theoretical analysis characterises phylogenetic accuracy as a function of projector and stegosystem properties, whilst empirical evaluations across multiple projectors and stegosystems demonstrate the viability of the proposed methodology under a broad spectrum of processing operations and semantic modifications. We envision a cyber ecosystem in which synthetic information, endowed with hidden yet traceable lineage traits, branches from a simple beginning into endless forms that have been, and are being, evolved.
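
The parentage query described in this abstract (extract a hidden trait from the offspring, compare it against the traits of candidate parents, nominate the most likely one) can be sketched as a nearest-neighbour lookup. The assumptions here are mine: traits are short bit strings, the comparison is Hamming distance, and the reference pool and the extracted trait are hard-coded toy values. The projector and the steganographic encoder and decoder themselves are outside the scope of this sketch.

```python
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("traits must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nominate_parent(extracted_trait: str, reference_pool: Dict[str, str]) -> str:
    """Return the candidate whose trait is closest to the extracted one."""
    return min(reference_pool,
               key=lambda name: hamming(extracted_trait, reference_pool[name]))


# Hypothetical reference pool: candidate parent -> trait derived by a projector.
pool = {
    "parent_a": "10110010",
    "parent_b": "01001101",
    "parent_c": "11110000",
}

# Trait decoded from the offspring; one bit of parent_a's trait was corrupted
# by processing applied to the offspring after embedding.
decoded = "10110110"

print(nominate_parent(decoded, pool))  # parent_a (distance 1, versus 7 and 3)
```

Tolerating a few flipped bits is what would let the lineage survive the processing operations and modifications the abstract mentions, provided the traits of different parents are far enough apart.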

Authors: Ching-Chun Chang, Isao Echizen

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27551.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27551

Published: 2026-05-28T02:25:24.054Z

4. DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Abstract:Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach utilizes a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an “Observability Paradox”: providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.

Authors: Shijie Cao, Yuan Yuan, Jing Liu

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27566.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27566

Published: 2026-05-28T02:25:24.054Z

5. Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Abstract:Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model’s internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing
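
The external Bayesian loop mentioned in this abstract can be illustrated with a single belief update over a handful of candidate graphs. This is a generic Bayes-rule sketch, not the A-CBO algorithm: the two-variable candidate set, the yes/no query format, and the `oracle_accuracy` noise model are assumptions made for the example, and the oracle's answer is hard-coded rather than obtained from a language model.

```python
from typing import Dict, FrozenSet, Tuple

Edge = Tuple[str, str]
Graph = FrozenSet[Edge]


def has_directed_path(graph: Graph, src: str, dst: str) -> bool:
    """True if dst is reachable from src along directed edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for a, b in graph:
            if a == node:
                if b == dst:
                    return True
                stack.append(b)
    return False


def update_beliefs(beliefs: Dict[Graph, float], cause: str, effect: str,
                   oracle_says_effect: bool,
                   oracle_accuracy: float = 0.9) -> Dict[Graph, float]:
    """One Bayes update after asking: does intervening on `cause` change `effect`?"""
    unnormalized = {}
    for graph, prior in beliefs.items():
        agrees = has_directed_path(graph, cause, effect) == oracle_says_effect
        likelihood = oracle_accuracy if agrees else 1.0 - oracle_accuracy
        unnormalized[graph] = prior * likelihood
    total = sum(unnormalized.values())
    return {g: w / total for g, w in unnormalized.items()}


x_to_y: Graph = frozenset({("X", "Y")})
y_to_x: Graph = frozenset({("Y", "X")})
no_edge: Graph = frozenset()

prior = {g: 1 / 3 for g in (x_to_y, y_to_x, no_edge)}
posterior = update_beliefs(prior, "X", "Y", oracle_says_effect=True)
print(round(posterior[x_to_y], 3))  # 0.818
```

Observational data alone cannot separate X→Y from Y→X here, but one interventional answer already shifts most of the belief onto a single graph; repeating such updates with well-chosen queries is what would concentrate the posterior.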

中文摘要

摘要：因果发现是科学推理的基石，但大语言模型是否能可靠地执行它仍然是一个悬而未决的问题。最近的基准测试显示，即使是经过微调的模型在简单因果图上也会出现性能瓶颈，并且随着复杂性增加而退化，但它们失败的原因尚未确定。我们证明这种失败是根本性的：监督微调、直接偏好优化和上下文学习都会生成无法区分生成相似观测数据的因果图的预测器，而任何试图做到这一点的方法都要求模型的内部表示无限增长，从而违反了这些方法能够工作的基本条件。我们将其形式化为核阻碍定理，确立了这一限制是学习范式固有的，而非由任何特定模型或数据集引起的。我们提出了主动因果贝叶斯优化（Agentic Causal Bayesian Optimization, A-CBO），其中冻结的大语言模型作为干预预言机回答关于干预效应的特定查询，而外部贝叶斯循环在对候选图的信念上进行对数轮次的集中。由于决策在阻碍适用的空间之外操作，A-CBO可证明会收敛，同时基础模型保持不变。在Corr2Cause上，A-CBO在无需任何训练的情况下匹配微调基线。在Extended Corr2Cause，一个扩展到24个变量、具有18K测试样本的新基准上，A-CBO显著优于微调和偏好优化，且这种优势正在增长

Authors: Amartya Roy, Sonali Parbhoo

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27567.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27567

Published: 2026-05-28T02:25:24.054Z

6. RULER: Representation-Level Verification of Machine Unlearning

Abstract:Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model’s internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.

中文摘要

摘要：机器取消学习旨在从已部署的模型中移除特定训练记录的影响，而无需从头重新训练。目前的协议通过成员推断、保留精度和遗忘集合精度在输出层验证这一点，但模型可以在满足这三个条件的同时，仍在其中间表示中编码被遗忘的记录。我们引入了RULER，一组表示层级的验证指标。神谕比较指标M2衡量遗忘集合记录是否占据与未包含这些记录重新训练的模型中相同的表示位置。无需神谕的指标M4仅通过未学习模型的内部相似性结构检测残留，而无需重新训练。四种近似取消学习方法都通过了输出层评估，但在线性混合效应模型下，M2在12种条件中有10种检测到显著残留（p<0.05），随着遗忘比例增加，效应量也增大。第五种方法“坏老师”（Bad Teacher）即使采用不同的遗忘机制，也显示出相同的残留。M4在表格数据、图像、临床文本和面部识别场景中作为预取消学习诊断：它能够检测面部识别模型中的身份级记忆，而在测试的任何方法中都无法完全消除这一信号。

Authors: Georgina Cosma, Axel Finke

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27569.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27569

Published: 2026-05-28T02:25:24.054Z

7. LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Abstract:Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations from other sequences. In this paper, we propose LaneRoPE to enable coordination and collaboration among $N>1$ sequences at generation time. LaneRoPE involves two key ideas: (a) an inter-sequence attention mask to make sampling of sequences dependent on one another; and (b) a RoPE extension that injects positional information that captures relative positions between tokens, both within and outside a particular sequence. We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains under limited generated sequence length. Importantly, since LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces a negligible overhead at inference time, it is appealing to rapidly incorporate parallel reasoning into existing LLM inference pipelines.
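The abstract does not spell out the inter-sequence attention mask, so the following is only a minimal sketch of one possible layout, not the authors' construction. It assumes (hypothetically) that the prompt is followed by generated tokens stored round by round, one token per lane per round, and that a token may see its own lane causally and other lanes only at strictly earlier rounds. The function name `lane_mask` and its parameters are illustrative.

```python
def lane_mask(prompt_len, num_lanes, steps):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumed token layout: prompt tokens first, then `steps` rounds of generated
    tokens, each round holding one token per lane in lane order.
    """
    total = prompt_len + num_lanes * steps

    def lane_and_step(idx):
        # Map a generated-token index to (lane, round).
        offset = idx - prompt_len
        return offset % num_lanes, offset // num_lanes

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < prompt_len:
                # Prompt keys: causal inside the prompt, visible to every generated token.
                mask[q][k] = k <= q
            elif q >= prompt_len:
                q_lane, q_step = lane_and_step(q)
                k_lane, k_step = lane_and_step(k)
                # Own lane: causal. Other lanes: strictly earlier rounds only.
                mask[q][k] = k_step < q_step or (k_step == q_step and k_lane == q_lane)
    return mask
```

With `lane_mask(2, 2, 2)`, indices 2 and 3 are the first-round tokens of lanes 0 and 1. They cannot see each other, but both second-round tokens (indices 4 and 5) can see both of them, which is what makes the lanes depend on one another. The RoPE extension described in the abstract is not modeled here.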

中文摘要

摘要：并行大语言模型（LLM）测试时的扩展技术（例如，best-of-$N$）需要针对相同的输入提示生成 $N>1$ 个序列。这些方法在利用批量生成 $N$ 个序列的计算效率的同时能够提高准确性。然而，批次中的每个序列传统上都是独立生成的，因此无法重用其他序列中的中间生成结果、计算或观察。在本文中，我们提出了 LaneRoPE，以在生成时实现 $N>1$ 个序列之间的协作与协调。LaneRoPE 包含两个核心思想：（a）序列间注意力掩码，使序列采样相互依赖；（b）RoPE 扩展，在注入令牌的位置编码时捕捉序列内及序列间的相对位置。我们在数学推理任务上评估了该方法，并获得了有希望的结果：LaneRoPE 使序列之间能够协作，在生成序列长度有限的情况下带来了额外的准确性提升。重要的是，由于 LaneRoPE 只需对底层 LLM 架构进行最小改动即可实现协调，并且在推理时引入的开销可忽略不计，因此它对于快速将并行推理整合到现有 LLM 推理流程中非常有吸引力。

Authors: Gabriele Cesa, Thomas Hehn, Aleix Torres-Camps, Àlex Batlle Casellas, Jordi Ros-Giralt, Arash Behboodi, Tribhuvanesh Orekondy

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27570.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27570

Published: 2026-05-28T02:25:24.054Z

8. Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

Abstract:Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.
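To make the idea of a contract-driven design with typed intermediate artifacts more concrete, here is a minimal sketch. Every name in it (`Hypothesis`, `CompiledQuery`, `ValidationReport`, `validate`, the deny-list) is hypothetical and chosen for illustration; the abstract does not describe the actual artifact schema, and no Kafka or Flink API is used.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Hypothesis:
    """Artifact emitted by a hypothetical hypothesis-generation agent."""
    hypothesis_id: str
    statement: str


@dataclass(frozen=True)
class CompiledQuery:
    """Artifact emitted by a hypothetical compiler agent."""
    hypothesis_id: str  # lineage: which hypothesis this query came from
    sql: str


@dataclass(frozen=True)
class ValidationReport:
    """Artifact emitted by a hypothetical validator before anything executes."""
    hypothesis_id: str
    approved: bool
    reasons: Tuple[str, ...]


# Illustrative deny-list; a real validator would parse the statement properly.
FORBIDDEN_KEYWORDS = ("DROP", "DELETE", "INSERT", "UPDATE", "ALTER")


def validate(query: CompiledQuery) -> ValidationReport:
    """Approve only statements that start with SELECT and contain no denied keyword."""
    words = query.sql.upper().replace(";", " ").split()
    reasons = []
    if not words or words[0] != "SELECT":
        reasons.append("statement must start with SELECT")
    reasons.extend(f"forbidden keyword: {w}" for w in FORBIDDEN_KEYWORDS if w in words)
    return ValidationReport(query.hypothesis_id, not reasons, tuple(reasons))
```

Because each stage only accepts and returns a fixed, immutable type, a downstream agent can reject a malformed artifact early, and the shared `hypothesis_id` gives a simple lineage trail from hypothesis to query to validation result.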

中文摘要

摘要：现代分析系统本质上是被动的，需要用户在日益复杂且持续演变的数据上定义查询。在实时流环境中，这种模式会失效，因为潜在见解的空间过于庞大，无法手动枚举。我们提出了一种面向实时数据流的自主见解发现的多代理架构。该系统实现了一个持续的发现循环，代理生成假设，将其编译为可执行分析，验证生成的产物，并生成可视化和可部署的应用程序。该架构利用 Apache Kafka 进行事件驱动的协调，使用 Apache Flink 进行流处理，并利用大型语言模型实现专用代理。一个关键贡献是基于类型化中间产物的契约驱动设计，实现了模块化、可观测性、数据流程追踪以及动态生成分析的更安全执行。通过零售、金融和公共数据的使用案例，我们展示了该架构如何支持从查询驱动分析向主动、发现驱动系统的转变。

Authors: Gaetano Rossiello, Dharmashankar Subramanian

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27571.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27571

Published: 2026-05-28T02:25:24.054Z

9. Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Abstract:As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

中文摘要

摘要：随着各组织朝向 AI 代理的生产部署发展，这些代理执行非确定性工作流、维护状态会话，并且经常具有对内部服务的特权访问权，工程挑战从构建单个代理转向以适当的隔离、治理和安全方式大规模运营它们。在本文中，我们介绍了 Agyn，一个围绕为代理工作负载量身定制的三大关键原则设计的开源平台：在 Kubernetes 上的信号驱动、有状态的无服务器运行时；用于代理和测试定义的 Terraform 提供程序；以及基于零信任和最小权限原则的安全模型。Agyn 与代理无关、与模型无关，并且与云平台无关。

Authors: Nikita Benkovich, Vitalii Valkov

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27575.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27575

Published: 2026-05-28T02:25:24.054Z

10. You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention

Abstract:A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability. The same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts. We argue that this variability belongs in the dynamic latent state of the person, and that human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed. We define a state as the time-indexed weighting vector over the dimensions that govern how an individual’s biology, physiology, and neuropsychology process the next event into a decision and an outcome. The relationship between state, decision, and outcome is causal rather than correlational. The weighting vector is dynamic at sub-daily timescales. The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent. Taken together, these claims imply that the outcome of a given event is controllable, conditionally, on the state-trajectory at the time of intervention. We motivate the framework with six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) and a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026). We derive seven testable predictions, list six operational requirements for state-aware systems, and discuss implications for digital health, education, AI personalisation, and personal agency.

中文摘要

摘要：对于行为科学和面向人类的人工智能而言，一个核心难题是个体内变异性的持续存在。相同的个体，在遭遇相同的可观察输入时，在不同场合会产生不同的结果，而不同个体产生的结果差异也无法被任何可观察的协变量完全预测。我们认为，这种变异性属于个体的动态潜在状态中，人类的结果可以通过针对状态及其在决策形成时刻的权重进行干预，从而以精确且可操作的方式进行控制。我们将状态定义为随时间索引的权重向量，涵盖支配个体的生物学、生理学和神经心理学如何将下一事件处理为决策和结果的各个维度。状态、决策和结果之间的关系是因果性的，而非相关性的。权重向量在每天的子时间尺度上是动态变化的。通过意识通道可报告结果，该通道是一个狭窄的注意力瓶颈，其内容本身依赖于状态。综合这些观点，可以推断，特定事件的结果在干预时的状态轨迹条件下是可控的。我们通过六条已确立的证据线索（因果推断、预测处理、稳态调节、注意力瓶颈、时间生物学、计算精神病学）以及一项为期24个月、涵盖超过200,000名同意参与者、跨四个职业角色的行为平台观测数据（研究时间段为2023至2026年）来说明该框架。我们提出七个可检验的预测，列出了六个针对状态感知系统的操作要求，并讨论了其对数字健康、教育、人工智能个性化以及个人能动性的影响。

Authors: Suraj Biswas, Saurav Gupta, Pritam Mukherjee

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27580.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27580

Published: 2026-05-28T02:25:24.054Z

Evaluation Domain Papers

1. Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

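Abstract (translated from Chinese): As intelligent systems become more autonomous, the scientific community has focused on building decision-making mechanisms that incorporate ethical and moral considerations rather than traditional utility-maximization models. A key part of this goal is assessing how well such decisions align with human values. A promising line of research toward that end is to develop methods based on large language models (LLMs) that identify human values in text, whether explicit or implicit, enabling end-to-end identification. This paper proposes an LLM-based architecture for detecting human values in text and quantifying their intensity, avoiding the reliance of earlier methods on a specific value theory or on complex prompt engineering. The architecture comprises three coordinated modules: one generates structured value specifications from the foundational texts of any theoretical framework; one annotates text using those specifications; and one assigns a graded level of support or opposition based on rhetorical and semantic evidence. This modular approach separates the task of conceptualizing human values from the task of detecting them, yielding a scalable and reproducible pipeline driven by value specifications that can be adapted to different theories. The architecture was instantiated with several LLMs and evaluated on the ValueEval dataset. Experiments show good detection performance and support the generality of the pipeline.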

Authors: Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27373.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27373

Published: 2026-05-28T02:31:43.081Z

2. Soro: A Lightweight Foundation Model and Chatbot for Tajik

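Abstract (translated from Chinese): We introduce Soro, a family of conversational large language models (LLMs) built for Tajik and intended for practical deployment under the limited compute and network conditions in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continued pretraining on a curated corpus of 1.9 billion tokens comprising filtered web text, PDF documents, and curriculum-aligned educational material, followed by supervised instruction tuning on 40,000 teacher-style Tajik examples. To allow rigorous evaluation despite the limited Tajik coverage of standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, language proficiency, and school and university entrance examinations, and release it on Hugging Face. On these Tajik benchmarks, Soro improves substantially over Gemma 3 baselines of the same size while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most of the Tajik gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and a planned rollout at scale in Tajikistan's schools.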

Authors: Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shaymardonov, Shuhratjon Khalitbekov, Bonu Boboeva

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27379.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27379

Published: 2026-05-28T02:31:43.081Z

3. On the Origin of Synthetic Information by Means of Steganographic Inheritance

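Abstract (translated from Chinese): The origin of species has long been the ultimate mystery of the natural sciences. By analogy, we propose that the origin of synthetic information is the ultimate mystery of the information sciences. The question carries moral weight that a technical account can neither fully resolve nor responsibly ignore, because its consequences for truth, trust, and human wisdom reach into the wider economy and society. The very power of artificial intelligence makes the evolutionary lineage of synthetic information increasingly hard to trace, since a sufficiently capable model may produce offspring that bear little resemblance, in structure or signal, to the parent from which they derive. The same holds in genetics: two individuals may share a phenotype, mirroring each other in appearance, yet differ fundamentally in genotype. We propose a mechanism, analogous to inheritance, that is realized through steganography. At the moment an offspring is generated, a projector derives a trait from the parent, and a steganographic encoder hides it imperceptibly in the offspring. The trait persists in the network ecosystem throughout the offspring's lifetime. When parentage is queried, a steganographic decoder extracts the trait from the offspring and compares it with the traits of candidate parents in a reference pool to infer the most likely parent. A theoretical analysis characterizes phylogenetic accuracy as a function of the properties of the projector and the steganographic system, and an empirical evaluation across multiple projectors and steganographic systems demonstrates the feasibility of the method under a wide range of processing operations and semantic modifications. We envision a network ecosystem in which synthetic information carries hidden but traceable lineage traits and, from so simple a beginning, diverges into endless forms that have been, and are being, evolved.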

Authors: Ching-Chun Chang, Isao Echizen

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27551.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27551

Published: 2026-05-28T02:31:43.081Z

4. DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

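Abstract (translated from Chinese): Progress in neural combinatorial optimization for the dynamic flexible job shop scheduling problem (DFJSP) is currently held back by a methodological tension: static benchmarks encourage overfitting to the benchmark, while uncalibrated generators obscure algorithmic capability with stochastic noise. To address this, we propose DynaSchedBench, a diagnostic framework for DFJSP that strictly controls the instance generation process. Rather than relying on parameter sampling, our approach uses a Sequential Event-Space Calibrator (SESC) that computes a novel Scheduling Stress Index (SSI) to stratify instances by difficulty. We show that SESC is significantly more computationally efficient than evolutionary baselines while reliably converging to target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, enabling rigorous testing of reactive and lookahead policies. In this calibrated environment, we identify key limitations of scheduling agents based on large language models (LLMs). Specifically, in step-by-step online decision-making for dynamic scheduling, we find an "observability paradox": giving agents oracle access to complete structural information can degrade policy performance relative to concise information. Moreover, although tool augmentation and refinement strategies incur substantial token overhead, they do not reliably improve performance, and most LLM agents cannot consistently outperform strong scheduling baselines; they behave more like robust heuristic approximators than superior optimizers.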

LLM Analysis

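Q: What problem does this paper try to solve?

The paper addresses an evaluation crisis in neural combinatorial optimization (NCO) for the dynamic flexible job shop scheduling problem (DFJSP). It targets the following methodological tensions and technical bottlenecks:

1. The overfitting trap of static benchmarks

Widely used static benchmark sets, such as the Taillard and DMU instance sets, are fixed, finite, and deterministic. Agents trained on such fixed sets tend to memorize the structure of specific instances rather than learn generalizable policies, which produces "benchmark overfitting". This static evaluation paradigm is badly mismatched with DFJSP, which by nature involves handling a continuous stream of random events.

2. Stochastic noise from uncalibrated generators

Current dynamic scheduling instance generation relies on uncalibrated procedural sampling, so instance difficulty emerges as a by-product of the random seed and cannot be controlled. The result is a high-variance "stochastic fog" in which it is hard to tell whether a performance gain reflects real innovation or merely a favorable sequence of dynamic events.

3. No diagnostic framework with difficulty stratification

The field lacks a mechanism for systematically controlling instance difficulty and stratifying instances from "underloaded" to "critical" regimes. Without it, phase-transition behavior in solver performance cannot be mapped reliably, and emerging LLM-based agents cannot be compared rigorously with traditional solvers.

4. Capability limits and paradoxes of LLM-based scheduling agents

Using the proposed calibrated environment, the paper further reveals key limitations of LLM-based scheduling agents in step-by-step online decision-making:

Observability paradox: providing complete structural prior information (oracle access) actually lowers policy performance relative to a concise statistical summary.
Diminishing returns from reasoning strategies: expensive strategies such as tool augmentation and reflection do not reliably improve performance.
Heuristic approximation rather than optimization: most current LLMs can only approach strong heuristic baselines rather than surpass them.

Solution overview

To meet these challenges, the paper proposes DynaSchedBench, a diagnostic framework based on event-space calibration. It uses the Sequential Event-Space Calibrator (SESC) and the Scheduling Stress Index (SSI) to control instance generation strictly, which supports rigorous testing of reactive and lookahead-based policies.

Q: What related work is there?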

Authors: Shijie Cao, Yuan Yuan, Jing Liu

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27566.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27566

Published: 2026-05-28T02:31:43.081Z

5. Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

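Abstract (translated from Chinese): Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, and the advantage keeps growing.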
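The external Bayesian loop over candidate graphs can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the A-CBO algorithm: the four candidate graphs, the yes/no query "does intervening on X change Y?", the fixed oracle reliability of 0.9, and the balanced-split query rule are all hypothetical, and the frozen language model is replaced by a stand-in that answers from the true graph.

```python
from itertools import product

# Hypothetical candidate DAGs over three variables, each a set of directed edges.
CANDIDATES = {
    "chain":    {("A", "B"), ("B", "C")},
    "fork":     {("B", "A"), ("B", "C")},
    "collider": {("A", "B"), ("C", "B")},
    "reverse":  {("C", "B"), ("B", "A")},
}


def descendants(edges, node):
    """Return all nodes reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for src, dst in edges:
            if src == cur and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def predicts_effect(edges, target, outcome):
    """Under this graph, does intervening on `target` change `outcome`?"""
    return outcome in descendants(edges, target)


def update(belief, query, answer, reliability=0.9):
    """Bayes update: graphs that agree with the answer get weight `reliability`."""
    target, outcome = query
    posterior = {}
    for name, prior in belief.items():
        agrees = predicts_effect(CANDIDATES[name], target, outcome) == answer
        posterior[name] = prior * (reliability if agrees else 1.0 - reliability)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}


def best_query(belief):
    """Pick the intervention query whose predicted answer splits belief most evenly."""
    nodes = ["A", "B", "C"]

    def imbalance(query):
        yes = sum(p for n, p in belief.items() if predicts_effect(CANDIDATES[n], *query))
        return abs(yes - 0.5)

    return min(((t, o) for t, o in product(nodes, nodes) if t != o), key=imbalance)


def run(true_graph, rounds=6):
    """Run the loop with a stand-in oracle that answers from the true graph."""
    belief = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}
    for _ in range(rounds):
        query = best_query(belief)
        answer = predicts_effect(CANDIDATES[true_graph], *query)
        belief = update(belief, query, answer)
    return belief
```

Choosing the query that splits the current belief most evenly is what lets each round rule out roughly half of the remaining probability mass, which is the intuition behind needing only logarithmically many rounds. The model answering the queries is never updated; only the belief dictionary changes.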

Authors: Amartya Roy, Sonali Parbhoo

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27567.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27567

Published: 2026-05-28T02:31:43.081Z

6. RULER: Representation-Level Verification of Machine Unlearning

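Abstract (translated from Chinese): Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model's internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.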

Authors: Georgina Cosma, Axel Finke

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27569.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27569

Published: 2026-05-28T02:31:43.081Z

7. LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

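Abstract (translated from Chinese): Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations from other sequences. In this paper, we propose LaneRoPE to enable coordination and collaboration among $N>1$ sequences at generation time. LaneRoPE involves two key ideas: (a) an inter-sequence attention mask to make sampling of sequences dependent on one another; and (b) a RoPE extension that injects positional information that captures relative positions between tokens, both within and outside a particular sequence. We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains under limited generated sequence length. Importantly, since LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces a negligible overhead at inference time, it is appealing to rapidly incorporate parallel reasoning into existing LLM inference pipelines.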

Authors: Gabriele Cesa, Thomas Hehn, Aleix Torres-Camps, Àlex Batlle Casellas, Jordi Ros-Giralt, Arash Behboodi, Tribhuvanesh Orekondy

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27570.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27570

Published: 2026-05-28T02:31:43.081Z

8. Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

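Abstract (translated from Chinese): Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.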

Authors: Gaetano Rossiello, Dharmashankar Subramanian

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27571.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27571

Published: 2026-05-28T02:31:43.081Z

9. Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

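Abstract (translated from Chinese): As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.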

Authors: Nikita Benkovich, Vitalii Valkov

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27575.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27575

Published: 2026-05-28T02:31:43.081Z

10. You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention

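Abstract (translated from Chinese): A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability. The same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts. We argue that this variability belongs in the dynamic latent state of the person, and that human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed. We define a state as the time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome. The relationship between state, decision, and outcome is causal rather than correlational. The weighting vector is dynamic at sub-daily timescales. The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent. Taken together, these claims imply that the outcome of a given event is controllable, conditionally, on the state-trajectory at the time of intervention. We motivate the framework with six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) and a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026). We derive seven testable predictions, list six operational requirements for state-aware systems, and discuss implications for digital health, education, AI personalisation, and personal agency.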

Authors: Suraj Biswas, Saurav Gupta, Pritam Mukherjee

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2605.27580.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27580

Published: 2026-05-28T02:31:43.081Z

VLM Domain Papers

1. From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop & Competition

Abstract:The 10th Affective & Behavior Analysis in-the-Wild (ABAW) Workshop and Competition, held at CVPR 2026, continues to advance research on modelling, analysis, understanding of human affect and behavior in real-world, unconstrained environments. The workshop maintains its dual structure, comprising both a competition and a paper track. The ABAW Competition introduces a diverse set of challenges targeting key aspects of affective and behavioral understanding, including continuous affect (valence-arousal) estimation, discrete affect (expression and action unit) recognition, as well as more complex behavior analysis tasks, such as emotional mimicry intensity estimation, ambivalence/hesitancy recognition and fine-grained violence detection. These challenges are built upon large-scale in-the-wild datasets, providing comprehensive benchmarks for state-of-the-art approaches. In parallel, the paper track presents a wide range of contributions spanning pose, motion & behavior estimation, affect modelling & multimodal learning, benchmarks, datasets & evaluation protocols, fairness, robustness & deployment. Overall, the 10th ABAW Workshop and Competition continues to serve as a key platform for benchmarking, collaboration and innovation, shaping the development of next-generation multimodal, human-centered AI systems.

中文摘要

摘要：第十届野外情感与行为分析（ABAW）研讨会及竞赛在 CVPR 2026 举办，持续推动对人类情感与行为在真实、非受控环境中的建模、分析与理解的研究。研讨会保持双轨结构，包括竞赛和论文轨。ABAW 竞赛引入了一系列多样化挑战，针对情感与行为理解的关键方面，包括连续情感（愉快-唤醒）估计、离散情感（表情和动作单元）识别，以及更复杂的行为分析任务，如情绪模仿强度估计、矛盾/犹豫识别和细粒度暴力检测。这些挑战建立在大型野外数据集基础上，为最先进方法提供了全面的评测基准。同时，论文轨展示了广泛的研究成果，涵盖姿态、运动与行为估计，情感建模与多模态学习，基准、数据集与评估协议，公平性、鲁棒性与部署等方面。总体而言，第十届 ABAW 研讨会及竞赛继续作为基准测试、协作与创新的重要平台，推动新一代以人为中心的多模态 AI 系统的发展。

Authors: Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Stefanos Zafeiriou, Irene Kotsia, Eric Granger, Marco Pedersoli, Simon Bacon, Jens Madsen, Soufiane Belharbi, Muhammad Haseeb Aslam, Chunchang Shao, Guanyu Hu

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27451.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27451

Published: 2026-05-28T02:38:07.085Z

2. Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent

Abstract:Bridge inspection in Japan requires mandatory visual assessments every five years, yet qualitative damage ratings (levels a-e) assigned by different engineers exhibit significant inter-rater variability — a critical barrier to consistent infrastructure management. The aging of skilled engineers further threatens inspection capacity. This paper presents a methodology for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). We fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, then evaluate on a fixed test set of 800 images. The model outputs natural language descriptions identifying structural members and damage patterns, from which a rule-based scoring engine calculates a five-level repair priority index. A progressive training study (1k/2k/3k/4k samples) reveals that 2k training samples achieve near-optimal validation loss in only 2.9 hours of training; beyond 2k, validation loss improves by no more than 0.2% per doubling of training samples, exhibiting clear diminishing returns. Furthermore, semantic similarity on the held-out test set peaks at 3k (0.6909) and degrades at 4k (0.6739), indicating that quality-curated mid-scale data outperforms larger but noisier corpora. Inference optimization combining this http URL() and batch processing (batch_size=8) achieves 10.06 seconds per image — a 70.2% reduction over the unoptimized baseline. Our approach contributes to data governance in bridge inspection, reduces inter-rater variability, and provides AI-assisted triage to augment expert engineers in inspection workflows. Furthermore, we introduce a two-stage Quality Guard using a fine-tuned Swallow-8B SLM to reject low-quality VLM outputs before priority scoring, preventing spurious scores from damaged or unrecognised images.
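The reported inference figures imply an unoptimized baseline that the abstract does not state directly. The short calculation below derives it from the two reported numbers (10.06 seconds per image and a 70.2% reduction). The variable names are illustrative, and the test-set estimate assumes, hypothetically, that the 800 test images are processed at that average per-image rate.

```python
# Reported in the abstract.
optimized_s = 10.06   # seconds per image after optimization
reduction = 0.702     # relative reduction versus the unoptimized baseline
test_images = 800     # size of the fixed test set

# optimized = baseline * (1 - reduction)  =>  baseline = optimized / (1 - reduction)
baseline_s = optimized_s / (1.0 - reduction)
speedup = baseline_s / optimized_s

# Assumption: the average per-image rate holds across the whole test set.
test_set_minutes = test_images * optimized_s / 60.0

print(f"implied baseline: {baseline_s:.2f} s/image")       # 33.76
print(f"speedup: {speedup:.2f}x")                          # 3.36
print(f"test set at optimized rate: {test_set_minutes:.1f} min")  # 134.1
```

So a 70.2% reduction corresponds to roughly a 3.4x speedup, and the implied baseline of about 33.8 seconds per image would take well over seven hours for the same 800 images.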

中文摘要

摘要：在日本，桥梁检查要求每五年进行一次强制性的目视评估，但不同工程师分配的定性损伤等级（a-e 级）存在显著的评分间差异——这是实现基础设施管理一致性的关键障碍。熟练工程师的老龄化进一步威胁到检查能力。本文提出了一种使用微调视觉-语言模型（VLMs）自动理解桥梁损伤并进行维修优先级评分的方法。我们使用 QLoRA 微调 LLaVA-1.5-7B，训练数据为最多 4,000 对桥梁损伤图像与检查文本记录，并在固定的 800 张图像测试集上进行评估。模型输出自然语言描述，识别结构构件和损伤模式，随后通过基于规则的评分引擎计算五级维修优先指数。渐进式训练研究（1k/2k/3k/4k 样本）显示，使用 2k 训练样本在仅 2.9 小时训练后即可达到接近最优的验证损失；超过 2k 时，每翻倍训练样本验证损失最多仅改善 0.2%，显示出明显的收益递减。此外，在保留测试集上的语义相似性在 3k 时达到峰值（0.6909），在 4k 时下降（0.6739），表明经过严格筛选的中等规模数据比更大但噪声更多的数据集效果更好。结合该 http URL() 和批处理（batch_size=8）的推理优化使每张图像处理时间达到 10.06 秒——比未优化基线减少 70.2%。我们的方法有助于桥梁检查的数据治理，降低评分间差异，并提供 AI 辅助的分级功能，增强专家工程师在检查工作流程中的能力。此外，我们引入了两阶段质量保护机制，利用微调的 Swallow-8B SLM 拒绝低质量 VLM 输出后再进行优先级评分，从而防止来自损坏或无法识别图像的错误评分。

Authors: Takato Yasuno

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27452.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27452

Published: 2026-05-28T02:38:07.085Z

3. Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Abstract:Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We categorize attention structures of Transformer into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. Heterogenous attention structure is the foundation for Transformer models to achieve more complex functions and integrate more modal information. Whether for research purposes or policy requirements, the interpretation of Transformer models with heterogenous attention structures is an important task. The fusion of information from different sources brings new challenges. Our work mainly includes two parts: method and experimentation. In terms of method, we propose an interpretation method for Transformer models with heterogenous attention structures. In terms of experimentation, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models, conduct semantic interpretation and logical interpretation.

中文摘要

摘要：Transformer显著推动了人工智能的发展，也同样推动了智能体的发展。我们根据输入信息的来源将Transformer的注意力结构分为两类：同质注意力结构和异质注意力结构。异质注意力结构以协同注意力为典型例子，用于处理来自不同来源的信息。异质注意力结构是Transformer模型实现更复杂功能和整合更多模态信息的基础。无论是出于研究目的还是政策需求，对具有异质注意力结构的Transformer模型进行解释都是一项重要任务。来自不同来源的信息融合带来了新的挑战。我们的工作主要包括两部分：方法和实验。在方法方面，我们提出了一种用于具有异质注意力结构的Transformer模型的解释方法。在实验方面，基于我们的实验分析范式，我们解释了代表性模型的运行机制，进行了语义解释和逻辑解释。

Authors: Yongjin Cui, Xiaohui Fan, Huajun Chen

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27458.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27458

Published: 2026-05-28T02:38:07.085Z

4. D$^2$Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation

Abstract:Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D$^2$Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration. First, we introduce a Depth-Aware Turbulence Synthesis protocol that incorporates scene depth into the phase-to-space formulation. This generates physically consistent, depth-dependent degradations and provides a crucial intermediate tilt supervision signal for disentangled learning. Building upon this simulation engine, D$^2$Turb decomposes restoration into two interactive stages: texture deblurring and geometric rectification. The texture deblurring stage employs a deblurring backbone to recover fine-grained details while preserving geometric distortion for the subsequent rectification stage. To mitigate the information fragmentation commonly observed in cascaded designs, we further propose an Adaptive Structural Prior Injection (ASPI) mechanism that dynamically transfers deep structural representations from the deblurring module to guide dense flow prediction for spatial unwarping. Extensive experiments demonstrate that D$^2$Turb achieves state-of-the-art performance on both synthetic and real-world datasets, with consistent improvements in both texture recovery and geometric fidelity. Our code and pre-trained models are publicly available at this https URL.

中文摘要

摘要：单帧大气湍流抑制本质上是病态问题，因为空间变化模糊与非刚性几何畸变相耦合。现有在平场模拟上训练的端到端方法通常难以平衡纹理恢复与几何校正。为克服这一限制，我们提出了 D$^2$Turb，这是一个将物理基础模拟与显式解耦恢复相结合的统一框架。首先，我们引入了深度感知湍流合成（Depth-Aware Turbulence Synthesis）协议，该协议将场景深度纳入相位到空间的表述中。这产生了物理一致的、依赖深度的退化，并为解耦学习提供了关键的中间倾斜监督信号。在此模拟引擎的基础上，D$^2$Turb将恢复分解为两个交互阶段：纹理去模糊和几何校正。纹理去模糊阶段采用去模糊主干网络以恢复细粒度细节，同时保留几何畸变以供后续校正阶段使用。为了缓解级联设计中常见的信息碎片化问题，我们进一步提出了自适应结构先验注入（Adaptive Structural Prior Injection, ASPI）机制，该机制动态地将深层结构表示从去模糊模块传递，以指导空间展开的稠密流预测。大量实验证明，D$^2$Turb在合成和真实数据集上均实现了最先进的性能，在纹理恢复和几何保真度上均有一致提升。我们的代码和预训练模型已在此 https URL 公共提供。

Authors: Zixiao Hu, Tianyu Li, Guoqing Wang, Wei Li, Guoguo Xin, Xun Liu, Peng Wang

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27460.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27460

Published: 2026-05-28T02:38:07.085Z

5. Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

Abstract:AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at this https URL.

中文摘要

摘要：AR智能眼镜需要持续的行为上下文来提供主动辅助，但它们最实用的始终开启传感器——头戴式惯性测量单元（IMU）——只能检测步行或站立等运动原语。我们超越运动原语进行行为级识别，定义了五类既平衡AR应用需求又考虑传感器可观测性的类别。为此，我们构建了一个包含16万样本的Ego4D数据集，采用四层质量保证框架涵盖8种活动场景，并提出了HiT-HAR，一个拥有70.3万参数的分层模型，在五类动作和八类场景识别上均优于先前的头戴IMU模型。我们进一步通过每类可分性分析绘制了头戴IMU的可观测性前沿，识别出哪些行为类别可可靠观测（移动）、哪些受益于时间上下文（物体搬运、任务操作），以及场景相关信号重叠仍然构成挑战的领域。我们的结果表明，利用时间上下文和场景结构的架构选择优于单纯扩大模型规模。代码和数据集可通过此网址公开获取。

Authors: Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27464.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27464

Published: 2026-05-28T02:38:07.085Z

6. AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

Abstract:The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.

中文摘要

摘要：视觉变换器（ViTs）中自注意力的二次成本构成了实际部署的基本瓶颈，这激发了关于令牌减少的活跃研究。在现有方法中，令牌合并（ToMe）已经成为一种优雅的无需训练的解决方案；然而，其设计基于一个未明言的前提——令牌平等，这与自注意力众所周知的非均匀性相悖，并在激进压缩下导致高显著性令牌的信息损失。我们通过 AdaMerge 解决了这一限制，这是一种基于两个互补机制的令牌合并框架。首先，显著性加权相似性利用按列的特征亲和度中心性作为令牌重要性的代理，并将得到的显著性分数纳入二分匹配评分，确保关键令牌对合并表示的贡献更强。其次，自适应合并强度使用预先计算的各层相似性统计数据，根据输入特定的冗余动态调整每层的减少数量。在使用 ViT-B/16 的 ImageNet-1k 上，AdaMerge 在所有 FLOPs 匹配的情况下始终优于 ToMe、PiToMe 和 DSM。随着压缩率增加，准确率差距单调扩大：在 13.4G FLOPs 操作点，AdaMerge 仅维持 Top-1 精度下降 -1.06%，而 PiToMe 为 -1.45%，DSM 为 -4.62%。据我们所知，AdaMerge 是首个将显著性加权相似性和自适应每层减少结合到单一无需训练的令牌合并框架中的方法，推动了 ViT 加速的准确率-FLOPs 帕累托前沿。

Authors: Semi Lee, Hyejin Go, Hyesong Choi

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27465.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27465

Published: 2026-05-28T02:38:07.085Z
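
The AdaMerge abstract above describes salience-aware token merging only in words, so the following is a rough sketch of the general idea rather than the authors' method. It implements a ToMe-style bipartite matching in which the merged token is a salience-weighted average of its members. Everything specific is an assumption made for illustration: salience is taken as a softmax over each token's mean cosine similarity to the other tokens, the match score is plain cosine similarity, and the helper names `salience_scores` and `merge_tokens` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def salience_scores(tokens):
    """Assumed salience proxy: softmax over each token's mean similarity to the rest."""
    n = len(tokens)
    if n < 2:
        return [1.0] * n
    centrality = [
        sum(cosine(tokens[i], tokens[j]) for i in range(n) if i != j) / (n - 1)
        for j in range(n)
    ]
    exps = [math.exp(c) for c in centrality]
    total = sum(exps)
    return [e / total for e in exps]


def merge_tokens(tokens, r):
    """Merge up to r tokens using bipartite matching and salience-weighted averaging."""
    n = len(tokens)
    if n < 2 or r <= 0:
        return [list(t) for t in tokens]
    sal = salience_scores(tokens)
    a_idx = list(range(0, n, 2))  # candidates to be merged away
    b_idx = list(range(1, n, 2))  # destinations that are kept
    edges = []
    for i in a_idx:
        # Each token in set A proposes its most similar token in set B.
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        edges.append((cosine(tokens[i], tokens[j]), i, j))
    edges.sort(reverse=True)
    chosen = edges[: min(r, len(edges))]
    removed = {i for _, i, _ in chosen}
    members = {}
    for _, i, j in chosen:
        members.setdefault(j, []).append(i)
    merged = []
    for k in range(n):
        if k in removed:
            continue
        group = [k] + members.get(k, [])
        weight = sum(sal[g] for g in group)
        dim = len(tokens[k])
        # More salient tokens contribute more to the merged representation.
        merged.append(
            [sum(sal[g] * tokens[g][d] for g in group) / weight for d in range(dim)]
        )
    return merged


if __name__ == "__main__":
    demo = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    print(len(merge_tokens(demo, 2)))
```

The sketch omits the adaptive per-layer reduction count mentioned in the abstract, and it does not track merged-token sizes as ToMe does; both would be needed for a faithful comparison.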

7. Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

Abstract:Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-resource and non-Latin writing systems, leaving open how well existing models generalise beyond the Latin domain. Cyrillic, particularly Ukrainian, lacks both large-scale writer-labeled datasets and empirical evidence of such generalisation. To address this gap, we construct a Ukrainian handwritten word dataset of 126,177 images from 308 writers using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters. We retrain DiffusionPen, a MobileNetV2 triplet-loss style encoder with a CANINE-conditioned latent diffusion U-Net, on this dataset without architectural modification, testing direct transfer from Latin to Cyrillic. We evaluate cross-domain style transfer in three settings: cross-lingual transfer from IAM English samples, zero-shot transfer to an early 20th-century Ukrainian manuscript, and few-shot imitation of contemporary writers. The model produces legible, style-consistent word images, indicating that few-shot latent diffusion models generalize beyond the Latin-script domain. We release the dataset, trained models, and evaluation protocol as a reproducible benchmark for writer-aware Cyrillic HTG, providing a foundation for extending stylized HTG to other underrepresented writing systems.

中文摘要

摘要：基于书写者风格的手写文本生成（HTG）已在拉丁字母书写体系中得到广泛研究，但对于低资源和非拉丁字母书写体系仍然研究不足，因此现有模型在拉丁域之外的泛化能力仍不清楚。西里尔字母，特别是乌克兰语，既缺乏大规模带作者标签的数据集，也缺乏此类泛化的实证证据。为填补这一空白，我们构建了一个包含126,177张图像、来自308名作者的乌克兰手写单词数据集，采用连通组件分割、质量筛选，并针对乌克兰语中表现不足的字符进行有针对性的过采样。我们在该数据集上重新训练了DiffusionPen，这是一种基于MobileNetV2三元组损失的风格编码器，搭配CANINE条件的潜在扩散U-Net，训练过程中未对架构进行修改，并测试了从拉丁字母到西里尔字母的直接迁移。我们在三种设置下评估跨域风格迁移：IAM英文样本的跨语言迁移、对20世纪初乌克兰手稿的零-shot迁移，以及对当代作者的少样本模仿。模型生成的单词图像清晰、风格一致，表明少样本潜在扩散模型能够在拉丁字母书写体系之外进行泛化。我们发布了数据集、训练模型和评估协议，作为可重复的作者感知西里尔字母HTG基准，为将风格化HTG扩展到其他低资源书写体系提供基础。

Authors: Andrii Ahitoliev, Pavlo Berezin

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27487.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27487

Published: 2026-05-28T02:38:07.085Z

8. Representation-Conditioned Diffusion Models for Guided Training Data Generation

Abstract:Data availability remains a critical bottleneck in many deep learning applications. Large-scale datasets are often expensive to collect, curate and annotate, which can limit the scalability and applicability of supervised learning methods. In this work, we evaluate the classification performance of models trained on synthetic image datasets produced by generative deep learning. In particular, we use latent diffusion models conditioned on learned representations from DINOv2, DINOv3, and CLIP. Our results demonstrate that this representation-conditioned formulation outperforms class-conditioned generation by a large margin (+10.76 p.p. top-1 accuracy on ImageNet100) by improving sample quality and mode coverage. Furthermore, by scaling the size of the synthetic dataset, we are able to outperform a classifier trained on the real data (+2.0 p.p. top-1 accuracy). We also demonstrate how generated images can be used for augmentation purposes, outperforming classical augmentation methods, and how the conditioning space can be used for sample filtering to further improve training value. Collectively, these findings highlight that representation-conditioned diffusion models provide a promising approach for augmenting, complementing, or potentially replacing real-world datasets in large-scale visual learning tasks.

中文摘要

摘要：数据的可用性仍然是许多深度学习应用中的关键瓶颈。大规模数据集通常收集、整理和标注成本高昂，这可能限制监督学习方法的可扩展性和适用性。在本研究中，我们评估了使用生成式深度学习生成的合成图像数据集训练的模型的分类性能。具体而言，我们使用基于从 DINOv2、DINOv3 和 CLIP 学到的表示进行条件化的潜在扩散模型。我们的结果表明，这种表示条件化的生成方法在很大程度上显著优于类别条件化生成（在 ImageNet100 上 top-1 准确率提高了 +10.76 个百分点），通过提高样本质量和模式覆盖。此外，通过扩大合成数据集的规模，我们能够超过在真实数据上训练的分类器（top-1 准确率提高 +2.0 个百分点）。我们还展示了生成图像如何用于数据增强，优于传统的数据增强方法，以及条件化空间如何用于样本筛选以进一步提高训练价值。总体而言，这些发现表明，表示条件化扩散模型为在大规模视觉学习任务中增强、补充或潜在替代真实数据集提供了一种有前景的方法。

Authors: Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27495.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27495

Published: 2026-05-28T02:38:07.085Z

9. Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Abstract:Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit screening coverage. Mobile dermoscopy clinical decision support systems (CDSS) offer a promising approach, with model interpretability and standardised patient routing remaining key barriers to adoption. Aim. To develop a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and to conduct a preliminary single-centre prospective clinical validation of the Melanoscope AI CDSS in Russian outpatient practice. Material and methods. Two-stage cascade classification of dermoscopic images; attention map visualisation (attention rollout for ViT and Swin; Grad-CAM for ConvNeXt and EfficientNetV2); quantitative IoU-based agreement assessment between activation maps and expert annotations; prospective single-centre validation across four “Melanoma Day” sessions (Orel, Russia, June 2025 - April 2026). Results. On 176 patients: agreement with expert assessment 88.6%; no false negatives among 5 malignant lesions (95% CI: 47.8-100.0%); specificity 88.3%. Three melanomas and two basal cell carcinomas were histologically confirmed; six dysplastic naevi placed under follow-up. Mean IoU (n=180): ViT - 0.69; Swin - 0.64; ConvNeXt - 0.53; EfficientNetV2 - 0.51. Routing thresholds: P<0.15 / 0.15-0.50 / >=0.50. Conclusion. No false negatives were observed; specificity was 88.3%, supporting screening use. The integrated cascade classification, attention map visualisation with IoU assessment, and three-zone routing provide reproducible, interpretable clinical decision support adaptable to varying resource levels.

中文摘要

摘要：引言。恶性皮肤病变的早期发现对预后至关重要，但俄罗斯各地区皮肤科医生短缺限制了筛查覆盖率。移动皮肤镜临床决策支持系统（CDSS）提供了一种有前景的方案，而模型可解释性和标准化患者分流仍是采用的关键障碍。目的。开发用于级联深度学习模型的定量可解释性评估方法和三区域患者分流算法，并在俄罗斯门诊实践中对Melanoscope AI CDSS进行初步单中心前瞻性临床验证。材料与方法。皮肤镜图像的两阶段级联分类；注意力图可视化（ViT和Swin使用attention rollout；ConvNeXt和EfficientNetV2使用Grad-CAM）；基于IoU的定量一致性评估，比较激活图与专家标注；在四次“黑色素瘤日”活动（俄罗斯奥廖尔，2025年6月-2026年4月）中进行前瞻性单中心验证。结果。176名患者中：与专家评估一致率为88.6%；5例恶性病变中无假阴性（95% CI：47.8-100.0%）；特异性为88.3%。3例黑色素瘤和2例基底细胞癌经组织学证实；6例发育异常痣随访处理。平均IoU（n=180）：ViT - 0.69；Swin - 0.64；ConvNeXt - 0.53；EfficientNetV2 - 0.51。分流阈值：P<0.15 / 0.15-0.50 / >=0.50。结论。未观察到假阴性；特异性为88.3%，支持用于筛查。整合的级联分类、带IoU评估的注意力图可视化及三区域分流提供了可重复、可解释的临床决策支持，适应不同资源水平。

Authors: Elena Sergeevna Kozachok, Sergey Sergeevich Seregin

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27561.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27561

Published: 2026-05-28T02:38:07.085Z
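
The routing thresholds and the confidence interval quoted in the Melanoscope AI abstract above can be checked with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names, the zone labels, and the reading of P as a predicted malignancy probability are assumptions, and only the cut-offs 0.15 and 0.50 and the 5-of-5 count come from the abstract. The interval is assumed to be an exact (Clopper-Pearson) interval, whose lower limit has the closed form (alpha/2)^(1/n) when all n cases are detected.

```python
def route_patient(p_malignant: float) -> str:
    """Map a predicted probability to one of three routing zones.

    The cut-offs 0.15 and 0.50 are from the abstract; the zone names are hypothetical.
    """
    if not 0.0 <= p_malignant <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_malignant < 0.15:
        return "low"
    if p_malignant < 0.50:
        return "intermediate"
    return "high"


def exact_lower_limit_all_detected(n: int, alpha: float = 0.05) -> float:
    """Lower limit of the two-sided exact interval when all n cases are detected."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (alpha / 2.0) ** (1.0 / n)


if __name__ == "__main__":
    for p in (0.05, 0.15, 0.49, 0.50):
        print(p, route_patient(p))
    # Five malignant lesions, none missed.
    print(round(100 * exact_lower_limit_all_detected(5), 1))
```

Under these assumptions the lower limit for 5 of 5 detected lesions comes out at about 47.8%, which matches the interval given in the abstract and shows how wide the uncertainty remains with so few malignant cases.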

10. What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

Abstract:Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model’s output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.

中文摘要

摘要：视频生成模型越来越多地被用作世界模拟器，用于驾驶和机器人操作等任务。在这些场景中，重要的不是单个视频是否看起来正确，而是当输入变化时模型的输出是否发生变化。我们通过给模型提供两个描述相同场景但一个物理细节不同的提示，并检查这两个视频是否按照物理规律出现差异来进行测试。提示之间的措辞差异在设计上很小，因为只改变了一个变量，但正确的物理差异却不同。一个忽略这一点的模型仍然可能生成两个各自看起来合理的视频，而现有的基准评测一次只评分一个视频，无法检测到这种失败。我们引入了 What-If World，包含 319 对这样的提示对，基于 nuScenes 和 DROID 的真实帧构建，按六种驾驶和操作共享的物理变量的分类法组织。每对提示都用 APEO 评分，这是一个四部分的评分标准，检查每个视频是否遵循提示（Adherence）、是否物理一致（Physics）、是否保留共享场景（Environment）以及是否以正确的差异结束（Outcome）。在九个最先进的模型中，没有系统的配对得分超过 52%，开源模型的得分集中在约 28%。每个测试模型在大量因果干预上都失败，表明在这些模型能够可靠支持基于动作的仿真或基于模型的规划之前，还有很大的改进空间。在模型表现良好的情况下，性能似乎与干预的视觉显著性相关，而不是其潜在物理的可解性。一些视觉上不明显的干预得分低至 14.2%，而视觉上显著的干预得分可达到 40.4%。

Authors: Kunlin Cai, Rui Song, Jinghuai Zhang, Kaiyuan Zhang, Pranav Bodapati, Alicia Yu, Fnu Suya, Mohammad Rostami, Jiaqi Ma, Yuan Tian

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2605.27589.pdf

CoolPaper URL: https://papers.cool/arxiv/2605.27589

Published: 2026-05-28T02:38:07.085Z
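
The What-If World abstract above names the four APEO checks but does not say how they are combined into the paired score. The sketch below is therefore only one plausible reading, shown to illustrate why per-video scoring can miss a causal failure: it assumes a pair passes only if all four checks hold and that the benchmark score is the fraction of passing pairs. The class `PairJudgement` and both scoring functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairJudgement:
    """Hypothetical record of the four APEO checks for one prompt pair."""

    adherence: bool    # each video follows its own prompt
    physics: bool      # each video is physically consistent
    environment: bool  # the shared scene is preserved across the pair
    outcome: bool      # the two videos end in the correct difference

    def passes(self) -> bool:
        # Assumption: a pair counts only when all four checks hold.
        return self.adherence and self.physics and self.environment and self.outcome


def paired_score(judgements: List[PairJudgement]) -> float:
    """Fraction of prompt pairs that pass every check (assumed aggregation)."""
    if not judgements:
        raise ValueError("need at least one judged pair")
    return sum(j.passes() for j in judgements) / len(judgements)


def single_video_score(judgements: List[PairJudgement]) -> float:
    """Stand-in for per-video benchmarks: looks only at physical plausibility."""
    if not judgements:
        raise ValueError("need at least one judged pair")
    return sum(j.physics for j in judgements) / len(judgements)


if __name__ == "__main__":
    judged = [
        PairJudgement(True, True, True, True),
        # Both videos look plausible, but they do not diverge as physics predicts.
        PairJudgement(True, True, True, False),
    ]
    print(single_video_score(judged), paired_score(judged))
```

In this toy case the per-video view rates both pairs as fine while the paired view rejects the second, which is the kind of gap the benchmark is designed to expose.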