ArXiv Domain 2026-06-03

Data source: ArXiv Domain

LLM Domain Papers

1. IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

Abstract:Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often limited in scale, contextual diversity, or multilingual coverage, restricting their utility for modern language models. We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata. Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval. Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension. Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks
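The hybrid retrieval idea in the abstract can be pictured with a small sketch of context-to-idiom retrieval. This is not the paper's pipeline: the three-entry index, the bag-of-words stand-in for a dense encoder, and the mixing weight `alpha` are all assumptions made for illustration, and the reranking stage is left out.

```python
import math
from collections import Counter

# Hypothetical mini-index: idiom -> gloss (not taken from IdiomX).
IDIOMS = {
    "spill the beans": "reveal a secret by accident or too early",
    "kick the bucket": "die",
    "break the ice": "ease tension and start a conversation",
}

def tokens(text):
    return text.lower().split()

def lexical_score(query, doc):
    """Jaccard overlap between token sets (stand-in for a sparse retriever)."""
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / len(q | d) if q | d else 0.0

def cosine_score(query, doc):
    """Cosine over term-count vectors (stand-in for a dense encoder)."""
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def retrieve(context, alpha=0.5, top_k=2):
    """Rank idioms by a weighted mix of the two scores over idiom + gloss text."""
    scored = []
    for idiom, gloss in IDIOMS.items():
        doc = f"{idiom} {gloss}"
        score = alpha * lexical_score(context, doc) + (1 - alpha) * cosine_score(context, doc)
        scored.append((score, idiom))
    scored.sort(reverse=True)
    return [idiom for _, idiom in scored[:top_k]]

ranked = retrieve("she could not keep quiet and chose to reveal the secret too early")
```

In a real system the two scorers would be a sparse index and a trained multilingual encoder; only the score-mixing step is meant to carry over.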

LLM Analysis

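Q: What problem does this paper try to solve?

This paper addresses the systematic challenges that idiomatic language understanding poses for natural language processing (NLP). The core problems are:

1. Non-compositional and context-dependent idiom semantics

Idiomatic expressions (such as "kick the bucket" or "spill the beans") have non-compositional semantics: their overall meaning cannot be inferred directly from the literal meanings of the constituent words. Models therefore need contextual and cultural reasoning that goes beyond literal analysis, and existing systems often fail at:

Distinguishing literal from figurative usage (literal vs. figurative usage disambiguation)
Inferring an idiom's meaning from implicit semantic cues in the context
Handling differences in semantic mapping across cultures and languages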

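2. Limitations of existing benchmarks and data resources

Resources such as VNC-Tokens, SemEval-2013 Task 5b, PIE, and MAGPIE already exist, but existing datasets share the following bottlenecks:

Limited scale: few idioms and few contextual instances are covered
Insufficient contextual diversity: too little variation in context to support robust model evaluation
Missing multilingual coverage: predominantly monolingual, with no cross-lingually aligned semantic representations
Single-task focus: most support only one task (such as idiom detection) and offer no unified evaluation framework for retrieval, cross-lingual mapping, and semantic interpretation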

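3. No progressive evaluation framework from recognition to interpretation

Existing work mostly treats idiom understanding as an isolated classification problem (detection) and lacks systematic evaluation of semantic retrieval, cross-lingual alignment, and interpretable semantic grounding. This holds back model development in application scenarios such as:

Idiom handling in machine translation
Multilingual semantic search
Explainable idiom teaching in language-learning systems
Figurative reasoning in human-computer interaction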

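Solution overview

To address these challenges, the paper proposes IdiomX, a large-scale multilingual benchmark, which tackles the problems by:

Building a trilingually aligned resource (English, Arabic, French) with 190K+ contextual instances covering 12K+ idioms
Designing a unified four-task evaluation framework: idiom detection → context-to-idiom retrieval → cross-lingual idiom retrieval → idiom semantic interpretation
Establishing a reproducible multi-stage data construction pipeline that combines lexical resource extraction, large-scale normalization, controlled LLM enrichment, and structured validation

Through this work, the paper recasts idiom understanding as one that runs from surface disambiguation to deep semantic retrieval and interpretation (deep semantic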

Authors: Ayman Ali Sharara

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02584.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02584

Published: 2026-06-03T02:12:57.338Z

2. Greener Than Humans? Environmental Attitudes in Large Language Models

Abstract:Large language models (LLMs) are increasingly used in sustainability-related decision support, reporting, and public communication, yet little systematic evidence exists on the environmental attitudes embedded in their outputs. This paper develops a benchmark for evaluating environmental cognition, affect, and behavioural recommendations in LLMs and applies it to 31 widely used proprietary and open-weight models. Drawing on questions from established environmental awareness surveys and additional sustainability-related behavioural measures, we compare LLM responses 1) among models and 2) between models and human survey benchmarks from Germany. We assess their robustness across prompting conditions. We find that many LLMs align more closely with environmentally progressive attitudes than the average survey respondent, exhibiting higher levels of environmental affect and cognition and recommending behaviours associated with substantial potential CO2 reductions. At the same time, we observe no systematic relationship between sustainability-oriented responses and model origin, size, or release context. However, models exhibit contextual sensitivity, controlled by persona-based prompting and show sycophantic shifts mirroring user-specified ideological positions, which raises concerns about steerability and normative reliability in real-world deployments. Our findings provide a reusable evaluation framework for assessing sustainability-related value alignment in LLMs and highlight the importance of governance, transparency, and critical oversight as AI systems become increasingly embedded in sustainability transformations and public decision-making.

Authors: Stefanie Kunkel, Tilman Hartwig, Marcus Voss, Emma K. Schütt, Angelika Gellrich

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02741.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02741

Published: 2026-06-03T02:12:57.338Z

3. On the Persistent Effects of Lexicality in Large Language Mod

Abstract:Representations extracted from large language models (LLMs) play an important role in many downstream applications. However, the structure of these representations is often influenced by lexical overlap rather than semantic content. Our understanding of the relationship between this lexical influence and semantic content, and its implications for downstream tasks, remains limited. In this work, we investigate representations to quantify the effect of lexical overlap relative to semantic content. We consider several adversarial semantic stress tests and further connect our findings to the information theory perspective. We find that lexical influence extends across the depth of models, consistently across architectures, training regimes, and objective functions, including the models trained for semantic similarity. Moreover, we observe a mid-depth region in which both lexical and semantic signals degrade simultaneously, indicating a transitional regime where representations are poor for both surface form and meaning. We further demonstrate the effect of lexical influence on downstream uses of LLMs using summarization and model editing as a case study.

Authors: Hammad Rizwan, Muhammad Umair Haider, Nishant Subramani, Mona T. Diab, A.B. Siddique, Hassan Sajjad

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02750.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02750

Published: 2026-06-03T02:12:57.338Z

4. Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

Abstract:When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

Authors: Vera Neplenbroek, Gabriele Sarti, Arianna Bisazza, Raquel Fernández

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02776.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02776

Published: 2026-06-03T02:12:57.338Z

5. Do Value Vectors in Deep Layers Need Context from the Residual Stream?

Abstract:The success of the transformer architecture as the backbone of modern LLMs is in large part due to its use of attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and thereby produces context-dependent query, key, and value vectors. However, we find that model performance meaningfully improves when deeper layers learn only a context-free value vector to preserve the original token information, without drawing on any context from the residual stream. When the model has access to this context-free value vector, adding back the context-dependent component provides little additional benefit for aggregate benchmark performance. Such context-free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations on the key design choices for such context-free value vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention by learning a lookup table of token-specific value vectors for each of the last third of layers. Across 135M and 780M models, BoV improves validation loss over standard attention and, at 780M, the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector with less compute and memory.
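A minimal sketch of the idea, assuming a single causal attention head: queries and keys are still projected from the residual stream, while values are read from a per-token lookup table. The toy sizes, the random weights, and the names `value_bank` and `bov_attention` are illustrative assumptions, not the authors' implementation (which applies a table to each of the last third of layers).

```python
import math
import random

random.seed(0)
VOCAB, D = 5, 4  # toy sizes, chosen for illustration

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

W_q, W_k = rand_matrix(D, D), rand_matrix(D, D)   # context-dependent projections
value_bank = rand_matrix(VOCAB, D)                # context-free: one value row per token id

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def bov_attention(token_ids, residual):
    """Causal single-head attention; values are looked up by token id,
    so they never read the residual stream."""
    queries = [matvec(W_q, h) for h in residual]
    keys = [matvec(W_k, h) for h in residual]
    values = [value_bank[t] for t in token_ids]
    out = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(D) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * values[j][c] for j, w in enumerate(weights)) for c in range(D)])
    return out

token_ids = [2, 0, 2]
residual = rand_matrix(len(token_ids), D)  # stands in for the deep-layer residual stream
outputs = bov_attention(token_ids, residual)
```

Because the values depend only on token ids, they need no per-sequence projection or value cache, which is the saving the abstract points to. The first position attends only to itself, so its output is exactly the table row for token 2.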

Authors: Muyu He, Yuchen Liu, Qingya Huang, Li Zhang

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02780.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02780

Published: 2026-06-03T02:12:57.338Z

6. Translating Classical Poetry into Modern Prose

Abstract:We introduce Padyam2Gadyam, a dataset for the task of poem-to-prose translation from 13th-17th Century Telugu Classical Poetry to contemporary Telugu and English prose. The dataset consists of 600 poems and their human-verified Telugu and English prose translations. We evaluated 5 contemporary Large Language Models (LLMs) on their ability to do poem-to-prose translation into Telugu and English. Our results indicate that while there are differences across LLMs, their overall performance leaves substantial room for improvement in both languages. Through qualitative analysis, we discuss the capabilities and limitations of contemporary MT evaluation approaches for this task.

Authors: Chalamalasetti Kranti, Sowmya Vajjala

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02806.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02806

Published: 2026-06-03T02:12:57.338Z

7. Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

Abstract:Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
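The guided-review claim can be illustrated with a toy simulation that counts how many instances must be reviewed before a dataset reaches 90% correct labels. Everything here is assumed for illustration: the synthetic labels, the noisy `suspicion` score that stands in for an LLM-based error estimate, and the rule that a review always fixes a wrong label. It is not the authors' framework.

```python
import random

def reviews_needed(is_wrong, order, target=0.90):
    """Count how many instances must be reviewed, in the given order,
    before the share of correct labels reaches `target`.
    Reviewing an instance is assumed to fix it if it was wrong."""
    n = len(is_wrong)
    correct = n - sum(is_wrong)
    reviewed = 0
    for idx in order:
        if correct / n >= target:
            break
        reviewed += 1
        if is_wrong[idx]:
            correct += 1
    return reviewed

random.seed(1)
n = 200
# Toy data: roughly 39% wrong labels, echoing the FOLIO error rate in the abstract.
is_wrong = [random.random() < 0.39 for _ in range(n)]
# Hypothetical LLM suspicion score: noisy, but higher on average for wrong labels.
suspicion = [random.gauss(1.0 if w else 0.0, 0.5) for w in is_wrong]

guided = sorted(range(n), key=lambda i: suspicion[i], reverse=True)
unguided = list(range(n))
random.shuffle(unguided)

guided_cost = reviews_needed(is_wrong, guided)
unguided_cost = reviews_needed(is_wrong, unguided)
```

Comparing `guided_cost / n` with `unguided_cost / n` gives the toy counterpart of the review fractions quoted in the abstract; the actual gap depends on how well the scorer separates wrong from correct labels.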

Authors: Andrea Brunello, Cristian Curaba, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02837.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02837

Published: 2026-06-03T02:12:57.338Z

8. Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

Abstract:How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek’s economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.
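A rough sketch of one auction-driven round may help make the mechanism concrete. The abstract does not give the bidding rule, payment flow, or replacement policy, so everything below is an assumption: agents bid a fixed fraction of wealth scaled by a hypothetical `skill`, the winner's payment simply leaves the economy, and bankrupt agents are replaced by random newcomers. Mutation of wealthy agents is omitted.

```python
import random
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    skill: float          # hypothetical: chance that acting earns a reward
    wealth: float = 10.0

    def bid(self):
        # Assumed bidding rule: a fixed fraction of current wealth, scaled by skill.
        return 0.2 * self.wealth * self.skill

def run_round(agents, rng, reward=3.0):
    """One auction round: the highest bidder buys the right to act,
    pays its bid, and keeps any environment reward it earns."""
    winner = max(agents, key=lambda a: a.bid())
    price = winner.bid()
    winner.wealth -= price
    if rng.random() < winner.skill:
        winner.wealth += reward
    return winner, price

def replace_bankrupt(agents, rng, floor=1.0):
    """Agents below the wealth floor are swapped for fresh random ones (exploration)."""
    for i, agent in enumerate(agents):
        if agent.wealth < floor:
            agents[i] = Agent(name=f"new{i}", skill=rng.random())

rng = random.Random(0)
population = [Agent("a", 0.9), Agent("b", 0.5), Agent("c", 0.1)]
for _ in range(20):
    run_round(population, rng)
    replace_bankrupt(population, rng)
```

The point of the sketch is only that wealth acts as a local credit signal: agents whose actions earn rewards can keep bidding, while the rest fall below the floor and are replaced.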

Authors: Zhenting Qi, Huangyuan Su, Ao Qu, Chenyu Wang, Yu Yao, Han Zheng, Kushal Chattopadhyay, Guowei Xu, Zihan Wang, Weirui Ye, Vijay Janapa Reddi, Ju Li, Paul Pu Liang, Himabindu Lakkaraju, Sham Kakade, Yilun Du

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02859.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02859

Published: 2026-06-03T02:12:57.338Z

9. Adaptive Latent Agentic Reasoning

Abstract:Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent’s actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

Authors: Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02871.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02871

Published: 2026-06-03T02:12:57.338Z

10. Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

Abstract:Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and αNLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination ≤ 1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls (n=20) shows no functional link between geometry and reasoning mode (p=0.286). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.
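The residualization step can be shown on toy data. In the sketch below, the "hidden states" are 2-D points whose only class signal is a per-source offset, mirroring the situation where each benchmark is also the only source of its reasoning type. The data, the nearest-centroid classifier used as a stand-in for a linear probe, and all names are assumptions for illustration.

```python
import random

random.seed(0)
OFFSETS = {"deductive": (0.0, 0.0), "inductive": (5.0, 0.0), "abductive": (0.0, 5.0)}

# Toy "hidden states": 2-D points whose only class signal is a per-source offset.
data = [(label, [ox + random.gauss(0, 0.5), oy + random.gauss(0, 0.5)])
        for label, (ox, oy) in OFFSETS.items() for _ in range(30)]

def centroids(points):
    sums = {}
    for label, x in points:
        total, count = sums.get(label, ([0.0, 0.0], 0))
        sums[label] = ([total[0] + x[0], total[1] + x[1]], count + 1)
    return {label: [t[0] / c, t[1] / c] for label, (t, c) in sums.items()}

def probe_accuracy(points):
    """Nearest-centroid probe, scored on its own training points for brevity."""
    cents = centroids(points)
    def predict(x):
        return min(cents, key=lambda k: (x[0] - cents[k][0]) ** 2 + (x[1] - cents[k][1]) ** 2)
    return sum(predict(x) == label for label, x in points) / len(points)

def residualize(points):
    """Regressing features on one-hot source identity and keeping the residual
    is the same as subtracting each source's mean."""
    cents = centroids(points)
    return [(label, [x[0] - cents[label][0], x[1] - cents[label][1]]) for label, x in points]

acc_raw = probe_accuracy(data)
acc_resid = probe_accuracy(residualize(data))
```

With three balanced classes, chance is 1/3, which matches the 33.3% baseline quoted in the abstract. Once the source means are removed, the toy probe has nothing left to separate the classes with.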

Authors: Subramanyam Sahoo, Vinija Jain, Aman Chadha, Divya Chaudhary

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2606.02907.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02907

Published: 2026-06-03T02:12:57.338Z

Agent Domain Papers

1. Visual Graph Scaffolds for Structural Reasoning in Large Language Models

Abstract:Graphs have been used to enhance large language models (LLMs) for structured reasoning, mostly as external knowledge sources provided to models at test time. In this paper, we take a different view: the value of graphs for LLMs lies not only in supplying information, but also in organizing reasoning. Inspired by how humans use graph-structured mind maps to organize branching and converging thoughts, we ask whether graphs can serve as an internal form of reasoning assistance. We study this question on multi-hop question answering tasks, where teacher-provided reasoning traces are rewritten as graph mind maps and used to guide a student model. Our experiments reveal a clear modality gap. When graph structures are flattened into text, their benefits become limited once direct answer hints are removed. Under this abstract guidance setting, both reasoning efficiency and answer quality degrade substantially. In contrast, visual graph guidance remains effective without direct answer clues, and its advantage persists after supervised fine-tuning and KL-based distillation. The above findings support the claim that graphs should be studied not only as external knowledge structures for LLMs, but also as visual scaffolds for organizing reasoning.

Authors: Runlin Lei, Xiaokui Xiao, Zhewei Wei

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02673.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02673

Published: 2026-06-03T02:19:27.851Z

2. AURA: Action-Gated Memory for Robot Policies at Constant VRAM

Abstract:The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of horizon, while a KV-cache grows to 6,061 times larger at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the best O(1) baseline in accuracy while using 5.19-6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. Budget-matched random and periodic schedules do not recover this gain, isolating the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous rather than a guarantee.
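Two things in the abstract lend themselves to a quick sketch: the memory arithmetic, and the rule of writing only when the next action would change. The byte figures come from the abstract; the scalar memory, the sign-based `policy`, and the hand-written gate are assumptions for illustration, whereas the paper's gate is learned against a closed-loop action-error signal.

```python
STATE_BYTES = 4_224          # constant AURA-Mem state size from the abstract
KV_RATIO_AT_100K = 6_061     # KV-cache size relative to that state at 100,000 steps

kv_bytes = STATE_BYTES * KV_RATIO_AT_100K   # implied KV-cache footprint
kv_bytes_per_step = kv_bytes / 100_000      # implied growth per step

def policy(memory):
    """Toy policy (an assumption): act on the sign of the remembered signal."""
    return "left" if memory < 0 else "right"

def run_gated(observations, blend=0.5):
    """Write to memory only when the update would flip the next action."""
    memory, writes = 0.0, 0
    for obs in observations:
        candidate = (1 - blend) * memory + blend * obs
        if policy(candidate) != policy(memory):   # action-surprise gate
            memory = candidate
            writes += 1
    return memory, writes

observations = [0.2, 0.3, -0.9, -0.4, -0.1, 0.8, 0.7]
final_memory, n_writes = run_gated(observations)
```

On these seven observations the gate writes twice, where an always-write memory would write seven times. The implied KV-cache size is about 25.6 MB at 100,000 steps, or roughly 256 bytes of growth per step.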

Authors: Josef Chen

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02775.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02775

Published: 2026-06-03T02:19:27.851Z

3. Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

Abstract:Watershed networks exhibit convergent topologies in which multiple tributaries merge into downstream channels, integrating diverse upstream hydrological processes. In ungauged basins, the absence of direct observations increases uncertainty and limits the ability to anticipate extreme events. This study evaluates whether an encoder-only Transformer provides an advantage over an LSTM for upstream streamflow inference under limited hydrologic information, using retrospective simulations from the NOAA National Water Model (NWM). In both the upstream-only and combined configurations, the LSTM showed stronger overall performance than the Transformer model. Incorporating downstream information further boosted performance for all models, increasing median NNSE by more than 60%. Rather than treating this as a leaderboard-style comparison, we interpret the experiments as a test of architectural inductive bias for hydrologic sequence inference. The results indicate that recurrent memory remains better aligned with this upstream reconstruction task than an encoder-only Transformer, while downstream hydrologic context provides a strong auxiliary constraint that substantially improves prediction skill across architectures.
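The abstract reports gains in NNSE without defining it. Assuming it denotes the normalized Nash-Sutcliffe efficiency commonly used in hydrology, NNSE = 1 / (2 - NSE), the metric can be computed as in this short sketch (function names and the sample series are illustrative):

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus squared error over observed variance."""
    mean_obs = sum(observed) / len(observed)
    err = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    var = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - err / var

def nnse(observed, simulated):
    """Normalized NSE, which maps NSE from (-inf, 1] into (0, 1]."""
    return 1.0 / (2.0 - nse(observed, simulated))

observed = [1.0, 2.0, 3.0, 4.0]
perfect = nnse(observed, observed)       # perfect simulation
mean_only = nnse(observed, [2.5] * 4)    # always predicting the observed mean
```

Under this definition a perfect simulation scores 1.0 and a simulation that always predicts the observed mean scores 0.5, which gives a reference point for reading a relative gain in median NNSE.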

Authors: Taye Akinrele, James Halgren, Noorbakhsh Amiri Golilarz, Sudip Mittal, Shahram Rahimi

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02791.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02791

Published: 2026-06-03T02:19:27.851Z

4. BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

Abstract:Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce BehaviorBench, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. BehaviorBench reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: Belief prediction, which predicts a user’s final revealed stance and confidence in a market, and Trade prediction, which predicts the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Personalization improves Belief prediction more consistently than Trade prediction, model rankings change across task layers and metrics, and different history interfaces expose different failure modes. BehaviorBench provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.

Authors: Liangwei Yang, Jielin Qiu, Zixiang Chen, Ming Zhu, Juntao Tan, Zhiwei Liu, Wenting Zhao, Zhujun Lan, Akshara Prabhakar, Silvio Savarese, Huan Wang, Shelby Heinecke

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02798.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02798

Published: 2026-06-03T02:19:27.851Z

5. ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

Abstract:Large language models (LLMs) exhibit strong natural-language reasoning abilities for clinical decision support, but struggle to effectively model structured longitudinal electronic health records (EHRs). In contrast, EHR foundation models can learn predictive patient representations, yet lack interpretable language-based reasoning. To bridge this gap, we propose ChatHealthAI, a multimodal reasoning framework that aligns structured EHR representations from a pretrained EHR foundation model with the semantic space of a frozen LLM through a task-aware resampler. By integrating longitudinal patient representations with refined clinical event descriptions, ChatHealthAI enables clinically grounded natural-language reasoning while maintaining accurate patient prediction. We evaluated ChatHealthAI on three clinical predictive tasks from the EHRSHOT benchmark. Results show that ChatHealthAI improves reasoning quality and interpretability while preserving competitive predictive performance. These findings highlight the potential of integrating EHR foundation models with pretrained LLMs for interpretable clinical prediction.

中文摘要

摘要：大型语言模型（LLMs）在临床决策支持方面展示了强大的自然语言推理能力，但在有效建模结构化纵向电子健康记录（EHRs）方面存在困难。相比之下，EHR基础模型能够学习患者的预测表示，但缺乏可解释的基于语言的推理。为弥合这一差距，我们提出了ChatHealthAI，一种多模态推理框架，它通过任务感知重采样器，将预训练EHR基础模型的结构化EHR表示与冻结的LLM的语义空间对齐。通过整合纵向患者表示与精炼的临床事件描述，ChatHealthAI能够在保持准确患者预测的同时，实现基于临床的自然语言推理。我们在EHRSHOT基准上的三个临床预测任务中对ChatHealthAI进行了评估。结果显示，ChatHealthAI在提高推理质量和可解释性的同时，保持了具有竞争力的预测性能。这些发现突显了将EHR基础模型与预训练LLM集成用于可解释临床预测的潜力。

Authors: Bo-Hong Wang, Baicheng Peng, Ruilin Wang, Jun Bai, Ziyang Song, Yue Li

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02802.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02802

Published: 2026-06-03T02:19:27.851Z

6. Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Abstract:Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent’s prediction loss converges quickly while the worker agents’ temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

中文摘要

摘要：从纵向电子健康记录（EHRs）中建模患者轨迹需要对稀疏、噪声多且长上下文的多模态序列进行推理。现有基于大型语言模型（LLM）的多智能体系统虽然解决了上下文长度问题，但对患者的处理是孤立进行的，未能反映临床医生如何利用类似过往病例的累积经验。我们提出了Traj-Evolve，一种具有两种互补演化机制的自我演化多智能体系统。首先，经验池（ExPool）作为非参数记忆，通过索引拒绝采样的推理轨迹来检索相似患者作为少样本上下文。其次，通过奖励排序微调的多智能体强化学习（MARL）在参数上优化智能体之间及智能体与记忆的协作。留一交叉检索策略将两者统一，使训练时和推理时的行为在检索增强下保持一致。在利用最多五年多模态EHR数据的肺癌预测任务中，Traj-Evolve在总体人群和具有挑战性的从未吸烟人群中均优于9个强基线方法。对演化动态的分析揭示了三项关键发现：（1）扩展ExPool后，最优检索从多样化样本转向特定样本；（2）在MARL下，管理智能体的预测损失迅速收敛，而工作智能体的时间推理继续从更多已验证患者中受益；（3）两种机制在预测风险方面互为补充，其中ExPool提高了特异性，而MARL提高了敏感性。

Authors: Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02812.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02812

Published: 2026-06-03T02:19:27.851Z

7. An Exploration of Collision-based Enemy Morphology Generation

Abstract:Despite a great deal of prior research into Procedural Content Generation (PCG), relatively little prior work has explored generating enemies for video games. In particular, there is almost no work on generating enemy morphologies, the basic body plan or collision information for in-game enemies, despite the existence of related morphology generation work in robotics. In this paper, we explore three different novel approaches to generate enemy morphologies based on player collision information. We found that each approach provides different strengths and weaknesses, but all had equivalent or better performance than an evolutionary baseline adapted from prior robotics morphology work.

中文摘要

摘要：尽管之前已经有大量关于程序内容生成（PCG）的研究，但针对视频游戏敌人的生成却相对较少。尤其是在生成敌人形态学方面几乎没有相关工作，即游戏中敌人的基本身体结构或碰撞信息，尽管在机器人学中已有相关的形态学生成研究。本文中，我们探索了三种不同的新方法来基于玩家碰撞信息生成敌人形态。我们发现，每种方法都有不同的优缺点，但所有方法的表现都与改编自之前机器人形态学研究的进化基线相当或更好。

Authors: Johor Jara Gonzalez, Matthew Guzdial

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02832.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02832

Published: 2026-06-03T02:19:27.851Z

8. Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

Abstract:Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: “Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?” To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at this https URL.

中文摘要

摘要：大型推理模型（LRMs）通过在测试时增加计算量生成显式的中间推理过程，从而提高性能，但“更长的推理总是有益”的假设仍未得到充分检验。尽管近期的证据表明额外的推理可能导致模型过度思考，我们提出问题：“一旦模型得出了正确答案，进一步的推理是精炼解决方案，还是使其偏离？”为了研究正确性之后的动态，我们引入了一种基于推理充分性的前缀级轨迹评估协议，定义了模型首次生成正确答案所需的最小推理预算。这使我们能够区分冗长的过度思考（额外的推理是多余但无害的）与有害的过度思考（持续的推理破坏了已经正确的轨迹）。从多模态基准开始，我们发现许多被认为是推理密集的实例实际上所需推理惊人地少。此外，在首次正确前缀处停止比标准推理可以将准确率提高最多21%，这表明当前模型的限制不仅在于推理能力，还在于无法在恰当时间停止。此外，尽管像早停这样的常见效率策略能显著减少冗长的过度思考（最多50%），但它们无法缓解有害的过度思考。故障分析显示，正确性偏离主要由逻辑漂移和视觉重新解释驱动。最后，我们表明我们的发现可以推广到仅语言推理基准，突出了有害过度思考作为更广泛的可靠性风险。代码可通过此 https URL 获取。

Authors: Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02835.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02835

Published: 2026-06-03T02:19:27.851Z

9. Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

Abstract:The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents - executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks - from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.

中文摘要

摘要：大型语言模型（LLM）的兴起使得具有复杂推理和工具使用能力的自主人工智能成为可能；然而，由于嵌入式微控制器的严格内存和能量限制，在普适计算环境中部署这种自主性仍然具有挑战性。现有框架通常假设服务器级资源或持续连接，这在深度嵌入式系统中存在空白。本文提出了一种用于嵌入式智能体系统的模块化参考架构，弥合了确定性实时控制与自主智能之间的差距。我们引入了一种分层设计，将执行高度压缩神经网络和基于规则逻辑以实现低延迟、隐私关键任务的设备端智能体，与利用小型语言模型（SLM）进行更高层次推理和规划的云增强智能体解耦。一个关键贡献是整合了跨层治理层，确保对分布式自主设备群体的可观测性、策略执行和安全性。本文并非仅展示纯经验性基准，而是分析了架构设计原则及在资源受限环境下关于延迟、能量和可靠执行的权衡。

Authors: Marcus Rüb, Michael Gerhards

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02862.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02862

Published: 2026-06-03T02:19:27.851Z

10. Don’t Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

Abstract:AI-Driven Research Systems (ADRS) — systems coupling LLMs with automated evaluation to discover algorithms, proofs, and designs — are being optimized and adopted across domains, but the tools to analyze them have not kept pace. ADRS performance depends on component interactions that are poorly understood, expensive to explore, and (as we show) not well captured by standard convergence guarantees. These guarantees rely on structural assumptions that do not hold under the ADRS process we formalize. We introduce GAMBLe, a framework that decomposes ADRS behavior into four parameters (generator $G$, assessor $\mathcal{A}$, discovery mechanism $\mathcal{M}$, budget $B$) and one compositional object, the effective landscape $L_{\text{eff}} = \mathcal{A} \circ G$, which reveals that distinct generator-assessor pairs induce structurally different per-problem optimization landscapes. We exercise the framework on 760+ replicated runs (>46,000 iterations) spanning generators from single LLMs to dynamically-adaptive ensembles, mechanisms from greedy selection to co-evolutionary meta-search, and three NP-hard problems whose assessors range from continuous scoring to cliff functions. The experiments reveal no total ordering of generators or mechanisms: frontier models can underperform open-source alternatives and the simplest mechanism sometimes outperforms state-of-the-art meta-search. Results show that even under limited budgets (60 iterations per run), the right component choices can improve performance by 13-67% and search efficiency by 6-39x.

中文摘要

摘要：人工智能驱动的研究系统（ADRS）——将大型语言模型与自动化评估结合以发现算法、证明和设计的系统——正在各领域被优化和采用，但用于分析这些系统的工具尚未跟上进度。ADRS的性能依赖于理解不足、探索成本高昂且（如我们所示）未能被标准收敛保证很好地捕捉的组件交互。这些保证依赖于我们正式提出的ADRS流程中不成立的结构性假设。我们引入了GAMBLe，这是一个将ADRS行为分解为四个参数（生成器$G$，评估器$\mathcal{A}$，发现机制$\mathcal{M}$，预算$B$）和一个组合对象——有效景观$L_{\text{eff}} = \mathcal{A} \circ G$，揭示了不同的生成器-评估器对会在每个问题上引入结构上不同的优化景观。我们在760+次重复运行（>46,000次迭代）上进行了该框架的演练，涵盖了从单个大型语言模型到动态自适应集合的生成器，从贪婪选择到共进化元搜索的机制，以及三个NP难问题，其评估对象涵盖连续评分到悬崖函数。实验未显示生成元或机制的完全排序：前沿模型可能表现不及开源替代方案，最简单的机制有时甚至优于最先进的元搜索。结果显示，即使在预算有限（每次迭代60次）下，选择合适的组件也能提升性能13-67%，搜索效率提升6-39倍。

Authors: Marquita Ellis, Paul Castro

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02863.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02863

Published: 2026-06-03T02:19:27.851Z

Evaluation Domain Papers

1. Visual Graph Scaffolds for Structural Reasoning in Large Language Models

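Abstract:Graphs have been used to strengthen structured reasoning in large language models (LLMs), mainly as external knowledge sources supplied to the model at test time. This paper takes a different view: the value of graphs for LLMs lies not only in providing information but also in organizing logical reasoning. Inspired by how humans use graph-structured mind maps to organize branching and converging lines of thought, we ask whether graphs can serve as an internal reasoning aid. We study this question on multi-hop question answering, where teacher-provided reasoning traces are rewritten as graph mind maps and used to guide a student model. Our experiments reveal a clear modality gap. When the graph structure is flattened into text, its benefit is limited once direct answer hints are removed; under this abstract-guidance setting, both reasoning efficiency and answer quality drop substantially. In contrast, visual graph guidance remains effective even without direct answer cues, and its advantage persists after supervised fine-tuning and KL-based distillation. These findings support the view that graphs should be studied not only as external knowledge structures for LLMs but also as visual scaffolds for organizing reasoning.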

LLM Analysis

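Q: What problem does this paper try to solve?

The core question this paper addresses is whether graph structure can serve as an internal tool for organizing reasoning, helping large language models (LLMs) reason in a structured way, rather than acting only as a source for external knowledge retrieval.

Specifically, the paper investigates the following key questions:

1. A role for graphs beyond external knowledge bases

Existing work mostly treats graphs as external support structures for retrieving evidence, verifying answers, or organizing memory the model may lack. This paper explores another possibility: whether graphs can act as cognitive scaffolds, much as humans use mind maps, organizing branching, convergence, hierarchy, and local relations to support the model's internal reasoning process.

2. Visual graphs vs. textual reasoning guidance

The paper compares two ways of passing a teacher model's successful reasoning process to a student model:

Visual graph guidance: the reasoning trace is rendered as a graph-structured mind map, using the visual modality to preserve topology;
Textual guidance: the same reasoning structure is flattened into linear text.

The study focuses on the abstract guidance setting, in which guidance may not contain answer hints, key facts, or intermediate conclusions and may only describe general reasoning strategies and structural relations. Under this setting, the paper examines whether a student model can genuinely use graph structure as a reasoning scaffold rather than as a shortcut to the answer.

3. Whether reasoning structure can be internalized

The paper further studies whether this graph-based structured reasoning guidance can be internalized by the student model, becoming an intrinsic capability through supervised fine-tuning (SFT) or KL-based distillation rather than serving only as a test-time prompting aid.

In short, the study aims to verify whether, in settings that demand complex structural reasoning, a topology-preserving visual graph transmits and organizes the reasoning process more effectively than linearized text, so that a weaker student model can learn and imitate the structured thinking of a stronger teacher model.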

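Q: What related work is there?

According to Section 2 (Related Work) and the introduction, related work falls into three main directions:

1. Graphs as external knowledge structures

This line of work treats graphs as external support for retrieving evidence, verifying answers, or organizing memory the model lacks:

GraphRAG (Edge et al., 2024): builds an entity-relation graph from a corpus, supporting retrieval of local facts and corpus-level structure
G-Retriever (He et al., 2024): retrieves a compact, reasoning-relevant subgraph before generation
ToG-2 (Ma et al., 2025): alternates knowledge-graph traversal with textual context retrieval for deep multi-hop reasoning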
GNN-RAG (Mavromatis

Authors: Runlin Lei, Xiaokui Xiao, Zhewei Wei

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02673.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02673

Published: 2026-06-03T02:25:58.567Z

2. AURA: Action-Gated Memory for Robot Policies at Constant VRAM

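Abstract:The KV cache is the right memory for a datacenter but the wrong memory for a robot. A datacenter batches many short requests at inference time and resets them, amortizing the attention cache across requests. An embodied agent instead runs one long, non-resetting task on bandwidth-constrained edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) is designed for this regime. It wraps a frozen vision-language-action backbone with a fixed-size recurrent memory and a learned gate that writes only when the current observation would change the next action: a memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of the time step, whereas a KV cache grows to 6,061 times that size at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the accuracy of the best O(1) baseline while writing 5.19-6.13x less, and up to 9.19x less on simpler configurations. Budget-matched random and periodic schedules do not recover the gain, isolating the advantage to the action-surprise signal. On LIBERO-Long with a trained closed-loop OpenVLA-OFT 7B panel (n=60 episodes per arm), the gate does not affect success rate: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds the always-write KV arm (0.217), while writing 7.0x less with a fixed memory footprint. We also instantiate an approximate-information-state value-loss upper bound as a methodological demonstration; at this scale the bound is vacuous rather than a guarantee.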

LLM Analysis

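Q: What problem does this paper try to solve?

This work addresses the memory wall that robot policies face on edge hardware, specifically the fundamental mismatch between the standard Transformer KV cache and the deployment of embodied agents.

Core problem: linear KV-cache growth vs. continuous robot operation

In datacenter large language model (LLM) serving, the KV cache is appropriate because:

Requests are batched (batch-N) and sessions reset, so cache cost is amortized across many short requests

For embodied agents such as robots, however:

The system runs a single, endless, non-resetting episode (the batch-1 regime)
The standard KV cache grows linearly with the number of control steps $T$: $\text{Cache size} \propto O(T)$
On bandwidth-constrained edge hardware, reading the ever-growing cache, rather than compute, becomes the binding constraint on latency

Hardware constraints and economics

The work stresses that memory writes (rather than compute) have become the bottleneck for deploying physical AI at scale:

High-bandwidth memory (HBM) is scarce: capacity at the three major suppliers (Micron, SK Hynix) is sold out for 2026, with capital expenditure above $45 billion
DRAM prices are surging: Q1 2026 contract prices jumped 90-95% quarter over quarter
Flash endurance is limited: the emerging high-bandwidth flash (HBF) standard is bounded by finite program/erase cycles, so write-minimizing algorithms directly extend service life

Functional requirements of robot memory

A robot does not need to reconstruct every frame of past observation (a reconstruction objective); it only needs a compressed state sufficient to choose the next action:

Every autoregressive inference step triggers a memory write: the policy reads the compressed world state → chooses an action → writes the updated state vector back to high-bandwidth memory
These writes consume scarce, expensive memory bandwidth and become the dominant cost of physical-AI deployment

Direction of the AURA-Mem solution

To meet these challenges, the work argues that three sub-problems must be solved together:

Constant-space constraint: memory must occupy constant space, independent of episode length (O(1) VRAM)
Write sparsity: writes must be sparse to limit memory writes per second (writes/sec) and thereby bound bandwidth cost
Action-utility alignment: the memory must be trained against a closed-loop action objective rather than a generic reconstruction loss, so that the retained state reflects action utility rather than token-level fidelity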
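The three requirements above (constant space, sparse writes, action-aligned gating) can be illustrated with a toy sketch. This is not the AURA-Mem implementation: the scalar state, `ema_update`, `threshold_policy`, and the hand-written gate rule are all hypothetical stand-ins, and the paper's gate is learned rather than computed by comparing actions. Only the 4,224-byte state size and the 6,061x growth factor come from the abstract.

```python
from dataclasses import dataclass

STATE_BYTES = 4_224          # fixed inference-state size quoted in the abstract
KV_GROWTH_AT_100K = 6_061    # KV cache size relative to that state at 100,000 steps
kv_bytes_at_100k = STATE_BYTES * KV_GROWTH_AT_100K  # 25,601,664 bytes


def ema_update(state: float, obs: float, alpha: float = 0.5) -> float:
    """Hypothetical recurrent update: exponential moving average of observations."""
    return (1 - alpha) * state + alpha * obs


def threshold_policy(state: float) -> int:
    """Hypothetical policy: a binary action read off the memory state."""
    return 1 if state >= 0.5 else 0


@dataclass
class GatedMemory:
    """Fixed-size memory that is overwritten only when the write changes the action."""
    state: float = 0.0
    writes: int = 0

    def step(self, obs: float) -> int:
        candidate = ema_update(self.state, obs)
        keep_action = threshold_policy(self.state)
        write_action = threshold_policy(candidate)
        # Hand-written gate (stand-in for the learned gate): write only if the
        # candidate state would lead to a different next action.
        if write_action != keep_action:
            self.state = candidate
            self.writes += 1
            return write_action
        return keep_action


def run(observations):
    """Return the action sequence and the number of memory writes."""
    mem = GatedMemory()
    actions = [mem.step(o) for o in observations]
    return actions, mem.writes
```

On a stream of 15 observations that switches value twice, `run([0.0] * 5 + [1.0] * 5 + [0.0] * 5)` performs two writes where an always-write memory would perform 15. The skipped updates leave the stored state stale, which is the trade-off a trained gate has to manage.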

The proposed AURA-Mem (Action-Utility Recurrent Adaptive Memory) meets these goals through action-gated memory: it writes only when the current observation would change the next action, so that at 100,000 steps

Authors: Josef Chen

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02775.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02775

Published: 2026-06-03T02:25:58.567Z

3. Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

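Abstract:River basin networks exhibit a convergent topology in which multiple tributaries merge into downstream channels, integrating diverse upstream hydrologic processes. In ungauged basins, the lack of direct observations increases uncertainty and limits the ability to predict extreme events. This study evaluates whether an encoder-only Transformer outperforms an LSTM for upstream streamflow inference under limited hydrologic information, using retrospective simulations from the National Oceanic and Atmospheric Administration (NOAA) National Water Model (NWM). Across the upstream-only and combined configurations, the LSTM shows stronger overall performance than the Transformer in both. Adding downstream information further improves all models, raising median NNSE by more than 60%. Rather than treating this as a leaderboard-style comparison, we interpret the experiments as a test of architectural inductive bias in hydrologic sequence inference. The results indicate that recurrent memory remains better suited than an encoder-only Transformer to this upstream reconstruction task, while downstream hydrologic context provides a strong auxiliary constraint that substantially improves predictive skill across architectures.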

LLM Analysis

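Q: What problem does this paper try to solve?

This paper addresses upstream streamflow prediction in the Prediction in Ungauged Basins (PUB) setting, focusing on a comparative analysis of architectural inductive bias. Its core research goals are as follows:

1. Core scientific question

For upstream streamflow inference in ungauged basins, the paper evaluates how the predictive ability of long short-term memory (LSTM) networks and encoder-only Transformers differs when information is limited. Without direct observations, a model must rely on meteorological forcing data and static basin attributes to reconstruct upstream hydrologic processes, which places specific demands on a sequence architecture's ability to capture temporal dependencies.

2. Testing the research hypotheses

The paper frames its analysis around two research questions:

RQ1 (architectural fit): when only local upstream meteorological information is available (the upstream-only configuration), is the LSTM's recurrent memory better suited than the self-attention-based Transformer to capturing hydrologic dynamics with lagged response and state-dependent dynamics?
RQ2 (gain from network context): does adding downstream hydrologic context (streamflow and meteorological data observed at downstream gauges) as a network-level constraint improve upstream streamflow reconstruction across architectures, and how do the architectures differ in their response to this auxiliary information?

3. Methodological goals

Build a comparative experimental setting of upstream-downstream basin pairs from NOAA National Water Model (NWM) retrospective simulation data;
Quantify the performance difference between the upstream-only and combined upstream-downstream configurations, to separate the relative contributions of information availability and architectural inductive bias to predictive skill;
Go beyond a leaderboard-style comparison and explain the physical meaning of the architectural differences from the perspective of hydrologic information dynamics.

In short, the study tries to determine which of the recurrent and attention architectures better matches the underlying generative mechanism of hydrologic sequences, under both scarce hydrologic information and rich network context, to give a principled basis for choosing deep learning models for prediction in ungauged basins.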
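The abstract reports skill as median NNSE without defining it. A minimal sketch follows, assuming the commonly used definitions: Nash-Sutcliffe efficiency NSE = 1 - Σ(obs - sim)² / Σ(obs - mean(obs))², and normalized NSE = 1 / (2 - NSE), which maps the range (-∞, 1] onto (0, 1]. Whether the paper uses exactly this normalization is an assumption, and the function names are illustrative.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches predicting the mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


def nnse(obs, sim):
    """Normalized NSE, bounded in (0, 1]; 0.5 corresponds to NSE = 0."""
    return 1.0 / (2.0 - nse(obs, sim))
```

A simulation that always predicts the observed mean scores NSE = 0 and NNSE = 0.5, so values above 0.5 indicate skill beyond that trivial baseline.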

Q: What related work is there?

Authors: Taye Akinrele, James Halgren, Noorbakhsh Amiri Golilarz, Sudip Mittal, Shahram Rahimi

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02791.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02791

Published: 2026-06-03T02:25:58.567Z

4. BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

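Abstract:Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce BehaviorBench, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. BehaviorBench reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: Belief prediction, which predicts a user’s final revealed stance and confidence in a market, and Trade prediction, which predicts the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Personalization improves Belief prediction more consistently than Trade prediction, model rankings change across task layers and metrics, and different history interfaces expose different failure modes. BehaviorBench provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.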

LLM Analysis

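Q: What problem does this paper try to solve?

This paper addresses the lack of an evaluation benchmark for personalized decision modeling grounded in real-world behavioral traces. Specifically, it targets the following challenges:

1. Limitations of existing evaluation data

Gap between simulation and reality: existing user-understanding benchmarks typically rely on simulated users or model-generated behavior, but recent work warns that model-based simulations can diverge systematically from human behavior.
Lack of real behavioral traces: real preferences, beliefs, and decision tendencies are often implicit in behavioral sequences (repeated choices, avoidance, revision, accumulation, or abandonment) rather than stated directly. Existing benchmarks are mostly built on explicit persona descriptions, dialogue histories, or annotated profiles, which makes it hard to evaluate a system's ability to infer a user's future decisions from implicit behavior.

2. Evaluation needs of personalized decision modeling

Decision prediction at two levels of abstraction: the paper distinguishes two levels of behavioral target:
Belief prediction: predicts the user's final stance (YES/NO) and confidence in a given market, reflecting relatively stable revealed preferences.
Trade prediction: predicts the buy/sell direction and amount of an individual trade, reflecting local behavior shaped by timing and market context.
Comparing history representations: the paper evaluates four history interfaces to test how models use prior behavioral evidence:
No personalization (target context only)
Direct recent history (DirectGen)
Generated structured user profile (ProfileGen)
Evidence from similar wallets retrieved from a disjoint support pool (RetrievalGen)

3. Data construction challenges

The paper reconstructs wallet-level decision histories from public prediction-market and on-chain records, building a benchmark with 2,000 evaluation wallets, 141,445 Belief instances, and 1,485,972 Trade instances, and ensures that the support pool used for retrieval evaluation is disjoint from the test set to avoid identity leakage.

In short, the paper aims to provide an evaluation setting based on real-world behavioral traces (rather than simulated data) for testing whether personalized methods can use behavioral evidence to predict individual decisions, and for revealing how different history representations perform at different decision levels (stable beliefs vs. local trades).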

Q: What related work is there?

Authors: Liangwei Yang, Jielin Qiu, Zixiang Chen, Ming Zhu, Juntao Tan, Zhiwei Liu, Wenting Zhao, Zhujun Lan, Akshara Prabhakar, Silvio Savarese, Huan Wang, Shelby Heinecke

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02798.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02798

Published: 2026-06-03T02:25:58.567Z

5. ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

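Abstract:Large language models (LLMs) exhibit strong natural-language reasoning abilities for clinical decision support, but struggle to effectively model structured longitudinal electronic health records (EHRs). In contrast, EHR foundation models can learn predictive patient representations, yet lack interpretable language-based reasoning. To bridge this gap, we propose ChatHealthAI, a multimodal reasoning framework that aligns structured EHR representations from a pretrained EHR foundation model with the semantic space of a frozen LLM through a task-aware resampler. By integrating longitudinal patient representations with refined clinical event descriptions, ChatHealthAI enables clinically grounded natural-language reasoning while maintaining accurate patient prediction. We evaluated ChatHealthAI on three clinical predictive tasks from the EHRSHOT benchmark. Results show that ChatHealthAI improves reasoning quality and interpretability while preserving competitive predictive performance. These findings highlight the potential of integrating EHR foundation models with pretrained LLMs for interpretable clinical prediction.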

LLM Analysis

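Q: What problem does this paper try to solve?

This paper addresses the difficulty of integrating structured longitudinal electronic health record (EHR) representations with natural-language clinical reasoning. Specifically:

Core problem

Clinical prediction currently has two complementary but separate technical routes, leaving a significant methodological gap:

Limitations of large language models (LLMs)
LLMs have strong natural-language reasoning ability but struggle to model structured longitudinal EHR data. Serializing EHR events directly into text input leads to:

Context limit exceedance
Loss of temporal structure and clinical context
Failure to capture latent longitudinal temporal patterns

Limitations of EHR foundation models
EHR foundation models such as CLMBR-T-Base and Med-BERT can learn predictive patient representations from large-scale structured clinical trajectories, but their outputs are usually latent embeddings or risk scores, with no interpretable natural-language reasoning.

Key challenge: misaligned representation spaces

EHR embeddings and LLM token embeddings come from heterogeneous input spaces and training objectives, so they are not naturally aligned in a shared representation space. EHR embeddings therefore cannot be fed directly to an LLM as effective input, with the result that:

A frozen LLM struggles to interpret longitudinal EHR representations
A simple linear projection cannot establish a semantically meaningful connection
Clinical reasoning lacks substantive evidence grounded in the patient trajectory

Goal of the solution

The paper proposes an explicit alignment mechanism (a task-aware resampler) that maps the structured longitudinal representations learned by an EHR foundation model into the semantic space of a frozen LLM, generating interpretable natural-language reasoning grounded in clinical evidence while preserving predictive accuracy.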

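Q: What related work is there?

The paper's related work falls into two main directions:

1. EHR Representation Learning

Prior work focuses mainly on learning predictive patient embeddings from structured EHR data:

CEHR-BERT (Pang et al., 2021): a Transformer-based EHR foundation model that learns longitudinal patient representations through large-scale pretraining
Med-BERT (Rasmy et al., 2021): a contextual embedding model pretrained on structured electronic health records for disease prediction
CLMBR-T-Base (Wornow et al., 2023): an EHR foundation model pretrained on 2.57 million de-identified EHRs that provides patient-level embeddings
EHRSHOT (Wornow et al., 2023): a standardized few-shot benchmark for evaluating EHR foundation models

Limitation: these models focus on predictive representation learning and can output latent embeddings or risk scores, but they do not explicitly support generating natural-language clinical reasoning.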

Authors: Bo-Hong Wang, Baicheng Peng, Ruilin Wang, Jun Bai, Ziyang Song, Yue Li

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02802.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02802

Published: 2026-06-03T02:25:58.567Z

6. Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

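Abstract:Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent’s prediction loss converges quickly while the worker agents’ temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.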

Authors: Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02812.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02812

Published: 2026-06-03T02:25:58.567Z

7. An Exploration of Collision-based Enemy Morphology Generation

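Abstract:Despite a great deal of prior research into Procedural Content Generation (PCG), relatively little prior work has explored generating enemies for video games. In particular, there is almost no work on generating enemy morphologies, the basic body plan or collision information for in-game enemies, despite the existence of related morphology generation work in robotics. In this paper, we explore three different novel approaches to generate enemy morphologies based on player collision information. We found that each approach provides different strengths and weaknesses, but all had equivalent or better performance than an evolutionary baseline adapted from prior robotics morphology work.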

LLM Analysis

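Q: What problem does this paper try to solve?

This paper addresses the automatic generation of enemy morphologies in video games, in particular generating enemy body structures from player-enemy collision interaction data so that they "gate" specific player mechanics.

Specifically, the work addresses the following core problems:

Core problem definition: in the absence of existing techniques for enemy morphology generation, how can a new generation method produce enemy morphologies with the following properties:

The enemy can be reliably defeated when the player has a specific additional mechanic (such as double jump or teleport)
The enemy cannot be defeated when the player has only the basic movement mechanics (moving left and right, jumping)

Key challenges

Domain difference: morphology generation in robotics optimizes the agent's own locomotion performance (speed, stability), whereas a game enemy must be optimized for its interaction with the player, that is, how it structures the actions the player must take to defeat it
Interaction-driven: unlike fixed enemies hand-crafted by designers, morphology optimization must be driven by actual player-enemy interaction trajectories (collision information, win/loss outcomes)
Mechanic gating: generate enemy morphologies that can only be defeated by using a specific game mechanic (vertical teleport, horizontal teleport, double jump), for teaching that mechanic or controlling access to parts of the game world

Technical formulation: the work represents an enemy morphology as a 4×4 discrete grid in which each cell holds one of three collision types (vulnerable / lethal / empty), and explores how to optimize these grids using interaction data from reinforcement learning, A* search, and neural networks to achieve the gating behavior above.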
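The 4×4 grid with three collision types described above can be written down directly. The sketch below is illustrative only: the `Cell` enum, `random_morphology`, `mutate`, and `render` are hypothetical names, and the single-cell mutation is a generic search move, not one of the paper's three approaches. It does show why search is needed: the grid admits 3^16 distinct morphologies.

```python
import random
from enum import Enum


class Cell(Enum):
    EMPTY = 0       # no collision
    VULNERABLE = 1  # touching this cell damages the enemy
    LETHAL = 2      # touching this cell kills the player


GRID = 4  # the morphology is a GRID x GRID array of cells


def random_morphology(rng: random.Random):
    """Sample a morphology uniformly at random."""
    cells = list(Cell)
    return [[rng.choice(cells) for _ in range(GRID)] for _ in range(GRID)]


def mutate(grid, rng: random.Random):
    """Return a copy of the grid with exactly one cell changed to another type."""
    r, c = rng.randrange(GRID), rng.randrange(GRID)
    options = [cell for cell in Cell if cell is not grid[r][c]]
    child = [row[:] for row in grid]
    child[r][c] = rng.choice(options)
    return child


def search_space_size() -> int:
    """Number of distinct morphologies: 3 choices for each of 16 cells."""
    return len(Cell) ** (GRID * GRID)


def render(grid) -> str:
    """Text view of a morphology: '.' empty, 'v' vulnerable, 'X' lethal."""
    glyph = {Cell.EMPTY: ".", Cell.VULNERABLE: "v", Cell.LETHAL: "X"}
    return "\n".join("".join(glyph[cell] for cell in row) for row in grid)
```

`search_space_size()` evaluates to 43,046,721, small enough to enumerate offline but large relative to the cost of evaluating each candidate against a simulated player.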

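Q: What related work is there?

According to Section 2 (Related Work), related work falls into two main areas:

1. Enemy Generation

Work in this area concentrates on three aspects, all of which assume that the enemy morphology (collision volume / body structure) is fixed by a designer:

Behaviour Generation

Studies how to generate an enemy's behavior policy rather than its body structure:

Reinforcement learning methods: use RL to train enemy agents that adapt to the game situation
Gutiérrez-Sánchez et al.: combine RL with behavior trees to automatically test stealth-AI variants
Merrick & Maher: motivated-RL agents that learn evolving behavior patterns
Nämerforslund: uses Unity ML-Agents to create adaptive hostile characters
Maurya et al.: RL-based optimization of NPC behavior
Large language model (LLM) methods: use an LLM as a high-level controller to generate context-sensitive behavior
Jennings & Hartmann's GROMIT prototype: uses GPT-4 to synthesize Unity behavior code from natural-language prompts
Hassan & Aboulhassan: combine player-behavior analysis with LLM-driven recommendations to adjust hostile character types

Feature Generation

Uses search algorithms (evolutionary algorithms, Monte Carlo search, and others) to optimize an enemy's numeric parameters (health, damage, speed, weapon configuration, and so on) without changing its collision morphology.

Mechanic Generation (Mechan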

Authors: Johor Jara Gonzalez, Matthew Guzdial

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02832.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02832

Published: 2026-06-03T02:25:58.567Z

8. Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

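Abstract:Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: “Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?” To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning by up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at this https URL.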

LLM Analysis

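Q: What problem does this paper try to solve?

This paper addresses harmful overthinking in Large Reasoning Models (LRMs). Specifically, it focuses on the following core problems:

1. Distinguishing two forms of overthinking

Existing work mostly treats overthinking as an efficiency problem (overlong reasoning traces waste compute), but the paper argues this view is incomplete. It proposes distinguishing:

Verbose overthinking: the model keeps reasoning after reaching the correct answer but still ends with the correct answer, so only compute is wasted
Harmful overthinking: the model keeps reasoning after reaching the correct answer, and its final answer changes from correct to incorrect

2. Evaluating reasoning sufficiency

The paper questions the assumption that longer reasoning is always better and proposes a prefix-level trajectory evaluation protocol to study:

The minimum reasoning budget a model needs to first produce the correct answer (reasoning sufficiency)
The position at which the correct answer first appears (first correct index)
Whether continued reasoning causes the trajectory to deviate from correctness

3. Quantifying and characterizing harmful overthinking

The paper seeks to quantify and characterize the following phenomena:

Many problems considered reasoning-intensive can in fact be solved with very few reasoning steps
Stopping at the first correct prefix (Optimal Length) improves accuracy over standard reasoning behavior (Actual Length) by up to 21%
Common efficiency strategies (such as early stopping) reduce verbose overthinking (by up to 50%) but do not mitigate harmful overthinking
Failures from harmful overthinking are driven mainly by logical drift and visual reinterpretation rather than by calculation errors

4. Verifying generality across modalities

The paper also verifies that harmful overthinking is not limited to multimodal reasoning and appears on language-only reasoning benchmarks (such as AIME2025 and GPQA) as well, indicating a broad reliability risk for LRMs.

In short, the core problem is that current LRMs are limited not only by their ability to reason but also by their inability to stop reasoning at the right time, so a model that is already correct often "thinks too much" and ends up wrong.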
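The distinction between verbose and harmful overthinking can be made concrete with a small sketch. It assumes, hypothetically, that each trajectory has already been reduced to a list of booleans, one per reasoning prefix, saying whether the answer extracted after that prefix is correct; how prefixes are cut and answers extracted is not specified here, and the function and label names are illustrative rather than the paper's.

```python
def first_correct_index(prefix_correct):
    """Index of the first prefix whose answer is correct, or None if there is none."""
    for i, ok in enumerate(prefix_correct):
        if ok:
            return i
    return None


def classify(prefix_correct):
    """Label one trajectory by what happens after the first correct prefix."""
    idx = first_correct_index(prefix_correct)
    if idx is None:
        return "never_correct"
    if not prefix_correct[-1]:
        return "harmful_overthinking"   # was correct, final answer is wrong
    if idx < len(prefix_correct) - 1:
        return "verbose_overthinking"   # kept reasoning, stayed correct at the end
    return "no_overthinking"            # first correct prefix is the full trace


def accuracies(trajectories):
    """Return (accuracy of the full trace, accuracy if stopped at first correct prefix)."""
    n = len(trajectories)
    actual = sum(1 for t in trajectories if t[-1]) / n
    stop_at_first = sum(1 for t in trajectories if first_correct_index(t) is not None) / n
    return actual, stop_at_first
```

The gap between the two accuracies returned by `accuracies` is exactly the share of trajectories labeled `harmful_overthinking`. Stopping at the first correct prefix is an oracle, since it needs the ground truth, so it is an upper bound for diagnosis rather than a deployable stopping rule.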

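Q: What related work is there?

According to Section 5 (Related Work) and the introduction, related work concentrates on three directions:

1. Test-Time Scaling and Reasoning

Recent reasoning models, through test-time scaling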

Authors: Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02835.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02835

Published: 2026-06-03T02:25:58.567Z

9. Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

LLM Analysis

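Q: What problem does this paper try to solve?

This paper addresses the challenge of deploying agentic AI systems on resource-constrained embedded microcontrollers (MCUs), specifically bridging the gap between traditional embedded real-time control and modern autonomous agents driven by large language models (LLMs).

The core problem breaks down into the following aspects:

1. Conflict between resource constraints and intelligent agents

Modern agentic AI systems rely on large language models (LLMs) for complex reasoning, planning, and tool use, and usually assume a cloud environment or high-performance datacenter with ample memory, compute, and energy. Deeply embedded devices in pervasive computing environments (microcontrollers, sensor nodes) have strict resource limits (typically under 512 KB of RAM and limited flash storage) and run on battery power or harvested energy. This "memory wall" makes it technically infeasible to run an agent with full reasoning capability on a standard microcontroller.

2. An architectural disconnect

Existing frameworks usually assume server-class resources or continuous network connectivity, which means:

Traditional embedded development prioritizes deterministic control loops, static firmware, and hard real-time guarantees
Modern agentic AI assumes dynamic context, ample history memory, and flexible tool use

No effective architectural bridge connects the two, leaving a gap for deploying autonomous agents on edge devices.

3. Unified management of heterogeneous hardware

Real deployments span a continuum from severely resource-constrained MCUs (such as ESP32 and STM32) to edge gateways with gigabytes of memory (such as Raspberry Pi and NVIDIA Jetson). The paper argues for a unified architecture that can:

Run local small language models (SLMs) on high-end gateways for autonomous reasoning (Flavor A: autonomous gateway agent)
Act as an intelligent interface on low-end MCUs, offloading complex reasoning to the cloud (Flavor B: tethered MCU agent)
Support hybrid deployment strategies that let system designers dynamically balance latency, privacy, and reasoning depth

4. Governance and safety of distributed agents

As edge devices gain semi-autonomous decision-making, the lack of cross-device observability, policy enforcement, and safety mechanisms becomes a key obstacle. The paper stresses the need for a cross-cutting Governance Layer that keeps the operation of distributed device fleets traceable, compliant, and safe, and prevents "black-box" AI from executing dangerous actions in the physical world.

5. Bridging semantics and communication protocols

Modern agents typically use verbose text protocols (such as the JSON-based Model Context Protocol, MCP) for tool calls, whereas embedded devices use light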

Authors: Marcus Rüb, Michael Gerhards

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02862.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02862

Published: 2026-06-03T02:25:58.567Z

10. Don’t Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

LLM Analysis

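Q: What problem does this paper try to solve?

This paper addresses the lack of a systematic analytical framework for AI-Driven Research Systems (ADRS). Specifically, it targets the following core challenges:

1. Missing understanding of component interactions

ADRS (such as FunSearch and AlphaEvolve) couple large language models (LLMs) with automated evaluation to discover algorithms, proofs, or designs, and their performance depends on complex interactions among the generator, the assessor, and the discovery mechanism. Existing work lacks a systematic understanding of these interactions:

The choice of components significantly affects system performance, but there is no theoretical guidance for predicting which combination will do better on a given problem;
The system's optimization landscape is determined jointly by the generator and the assessor, that is, $L_{\text{eff}} = \mathcal{A} \circ G$, but this relationship had not previously been formalized.

2. Failure of standard analysis tools

The paper proves that traditional convergence guarantees and analysis tools do not apply to ADRS:

Non-Markovianity: the best-score process $s^*_t$ is not a Markov process (Theorem 2). The full state $(D_t, M_t)$ is Markov, but it evolves in a growing, high-dimensional space, violating the assumptions of standard optimization theory about fixed-dimensional state spaces, stationary objectives, or stable transition operators;
Path dependence: because context construction depends on the history, initial conditions can persistently influence the trajectory, so runs with the same configuration produce a multimodal distribution of final scores, and low-dimensional progress indicators (such as the current best score $s^*_t$) are insufficient to predict future behavior.

3. Practical difficulties in diagnosis and optimization

In practice, researchers face the following difficulties:

Bottlenecks cannot be identified: a system may be limited by generator capability (G-limited), assessor signal granularity (A-limited), search-mechanism efficiency (M-limited), or compute budget (Budget-limited), but there is no method for diagnosing which component is the binding constraint without exhaustive ablation experiments;
Components are interdependent: the generator and the mechanism interact strongly ($G \times \mathcal{M}$ interaction), so a frontier model that is best on one problem may be overtaken by an open-source alternative on another, and a simple capability ranking cannot predict ADRS performance.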
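The four parameters (generator, assessor, mechanism, budget) and the composed landscape $L_{\text{eff}} = \mathcal{A} \circ G$ can be illustrated with a toy loop. Everything below is a hypothetical stand-in: the integer "candidates", the increment generator, the two assessors, and the greedy mechanism are chosen only to show how the same generator and mechanism behave differently under a continuous assessor and a cliff assessor. It is not the GAMBLe implementation.

```python
def effective_landscape(generator, assessor):
    """L_eff = A ∘ G: the score of whatever the generator proposes from a context."""
    return lambda context: assessor(generator(context))


def greedy_search(generator, assessor, budget, seed):
    """Greedy mechanism: keep a proposal only if it strictly improves the best score."""
    best, best_score = seed, assessor(seed)
    history = [best_score]
    for _ in range(budget):
        candidate = generator(best)   # the context here is just the current best
        score = assessor(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, history


def increment(x):
    """Toy generator: propose the next integer."""
    return x + 1


def continuous_assessor(x, target=5):
    """Graded feedback: closer to the target scores higher."""
    return -abs(x - target)


def cliff_assessor(x, target=5):
    """All-or-nothing feedback: 1 at the target, 0 everywhere else."""
    return 1 if x == target else 0
```

With a budget of 10 from seed 0, greedy search reaches the target under `continuous_assessor` but never leaves the seed under `cliff_assessor`, because no single proposal improves the score. The generator and mechanism are identical in both runs; only the assessor, and therefore the effective landscape, has changed.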

4. The proposed solution: the GAMBLe framework

To address these challenges, the paper introduces GAMBLe (Generator, Assessor, Mechanism,

Authors: Marquita Ellis, Paul Castro

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2606.02863.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02863

Published: 2026-06-03T02:25:58.567Z

VLM Domain Papers

1. COD10K-C: Benchmarking Robustness of Camouflaged Object Detection Under Natural Image Corruptions

Abstract:Camouflaged object detection has improved substantially, but most standard benchmarks evaluate models only on clean images. This is not realistic because real cameras often capture blur, sensor noise, weather effects, and compression artifacts. We present COD10K-C, a corruption robustness benchmark based on COD10K. It includes 8 corruption types and 5 severity levels, giving 40 conditions and 81,040 evaluation pairs in total. We evaluate three popular camouflaged object detection models, SINet-v2, PFNet, and ZoomNet, as well as a lightweight model called RobustCODLite. All models show clear performance drops on corrupted images. Motion blur and Gaussian blur cause the largest drops, with SINet-v2 losing 18.5 Dice points under motion blur. Brightness and fog are less harmful. RobustCODLite uses corruption augmentation, a frequency-prior branch, and an uncertainty-consistency loss. It retains 92.3% of its clean Dice score under corruption, compared with 87.7% for SINet-v2, 84.8% for ZoomNet, and 84.1% for PFNet. On the hardest corruptions, RobustCODLite matches or outperforms models that perform better on clean data. We will release the COD10K-C GitHub repository to support future research in robust camouflaged object detection.
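The figures in this abstract can be checked with a few lines. The sketch below assumes the standard Dice coefficient on binary masks, 2|P∩G| / (|P| + |G|), and assumes "retention" means mean Dice over corrupted conditions divided by clean Dice; the abstract does not spell out either definition, and the function names are illustrative.

```python
def dice(pred, gt, eps=1e-8):
    """Dice coefficient for two flat binary masks (sequences of 0/1)."""
    intersection = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return (2 * intersection + eps) / (total + eps)


def retention(clean_dice, corrupted_dices):
    """Share of the clean score kept on average across corrupted conditions."""
    return (sum(corrupted_dices) / len(corrupted_dices)) / clean_dice


# Benchmark size arithmetic from the abstract.
CORRUPTION_TYPES = 8
SEVERITY_LEVELS = 5
conditions = CORRUPTION_TYPES * SEVERITY_LEVELS   # 40 conditions
images_per_condition = 81_040 // conditions       # 2,026 pairs per condition
```

The 81,040 evaluation pairs divide evenly into 40 conditions of 2,026 images each, which is consistent with every corruption being applied to the same fixed test set.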

中文摘要

摘要：伪装对象检测已大幅改进，但大多数标准基准测试仅基于干净图像评估模型。这不现实，因为真实相机经常捕捉到模糊、传感器噪点、天气效果和压缩伪影。我们介绍COD10K-C，这是一个基于COD10K的腐败鲁棒性基准。它包含8种腐败类型和5个严重程度等级，提供40种条件和81,040对评估。我们评估了三种流行的伪装物体检测模型：SINet-v2、PFNet和ZoomNet，以及一种名为RobustCODLite的轻量级模型。所有模型在损坏的图像上都明显显示性能下降。运动模糊和高斯模糊导致的下降最大，SINet-v2在运动模糊下损失了18.5骰子点。明亮和雾气的危害较小。RobustCODLite 使用了腐败增强、频率先验分支以及不确定性-一致性丢失。在损坏情况下，它保持了92.3%的纯净骰子得分，而SINet-v2为87.7%，ZoomNet为84.8%，PFNet为84.1%。在最严重的损坏中，RobustCODLite 能匹配甚至超越在干净数据下表现更好的模型。我们将发布COD10K-C的GitHub仓库，以支持未来在稳健伪装物体检测方面的研究。

Authors: Arafat Hossain Sayem

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02603.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02603

Published: 2026-06-03T02:36:48.001Z

2. AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

Abstract:Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: this https URL

中文摘要

摘要：视听说话人追踪旨在通过利用听觉和视觉线索来定位和追踪活跃的说话者，从而实现细粒度、以人为中心的场景理解。这一能力对于智能视频编辑、监控和人机交互等现实应用至关重要。然而，现有的数据集在很大程度上局限于简单或同质化的视听场景，并且标注粗略。这种过于简化的设置会导致评估偏向静态的视听共现，而不是在复杂动态场景中对稳健的时空建模和跨模态推理进行严格评估。为了解决这些局限性，我们引入了 AVTrack，一个为动态真实世界场景设计的人本视听实例分割（AVIS）数据集。AVTrack 具有多样化且具有挑战性的条件，包括摄像机运动、视觉遮挡和位置变化。在 AVTrack 上对代表性的 AVIS 方法进行评估显示性能大幅下降，从而将 AVTrack 建立为复杂环境中稳健的人本视听场景理解的挑战性基准。我们还提供了一个简单但有效的基线，以促进未来的研究。项目网站：此 https URL

Authors: Yaoting Wang, Yun Zhou, Zipei Zhang, Henghui Ding

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02724.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02724

Published: 2026-06-03T02:36:48.001Z

3. Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

Abstract:Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2 to 10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy. These results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone. The code and data are available at this https URL.

中文摘要

摘要：空间推理对于机器人技术、自主系统和具身人工智能至关重要，但现代视觉-语言模型（VLMs）在度量距离查询上仍不可靠。一个常见的假设是，不同视角下的一致预测反映了模型具备几何依据。我们检验了这一假设，结果发现情况恰恰相反：领先的VLMs即使在答案错误时，也经常给出视角不变且一致的答案，这表明预测与特定视角的视觉证据之间耦合较弱。我们提出了ViewDiag，一个基于Hypersim、ScanNet和KITTI360构建的受控多视角评估协议，包含80个场景中的176条物体对轨迹，每条轨迹有2到10个视角。该协议从三个维度评估模型：度量精度、分布集中性，以及用于检测内部崩溃、区分决策崩溃与表征崩溃的潜在特征探针。在各种模型中，我们观察到一个一致的模式：预测稳定性高，但误差显著，集中在一致性强而精度低的区域。这些结果对将跨视角一致性用作几何理解代理指标的常见做法提出了挑战。相反，我们表明，稳定的预测可能反映的是先验驱动的崩溃，而非对证据敏感的推理。ViewDiag提供了一个受控的基准和诊断框架，用于在精度之外评估空间VLMs。代码和数据见此 https URL。
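
The "consistent yet wrong" regime described above can be illustrated with a minimal sketch. The two statistics here (coefficient of variation across views, mean relative error against a single per-track reference distance) and both thresholds are assumptions chosen for illustration; they are not ViewDiag's actual metrics.

```python
from statistics import mean, pstdev


def track_diagnostics(preds, truth):
    """preds: per-view predicted distances (metres) for one object-pair track.
    truth: reference distance for that track (assumed identical across views)."""
    rel_err = mean(abs(p - truth) / truth for p in preds)
    spread = pstdev(preds) / mean(preds)  # coefficient of variation across views
    return rel_err, spread


def consistent_yet_wrong(preds, truth, max_spread=0.05, min_err=0.25):
    """Flag tracks whose predictions are stable across views but far from truth.
    Both thresholds are hypothetical."""
    rel_err, spread = track_diagnostics(preds, truth)
    return spread <= max_spread and rel_err >= min_err
```

A track with predictions `[2.0, 2.0, 2.1, 2.0]` against a reference of 3.5 m would be flagged, since the answers barely move across views yet miss by roughly 40%.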

Authors: S Divakar Bhat, Toshihiko Yamasaki

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02742.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02742

Published: 2026-06-03T02:36:48.001Z

4. Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records

Abstract:Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine-readable boundaries. We introduce Plan2Map, a 208-case multimodal benchmark for document-grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and boundary annotations; the reference GeoJSON is held out for scoring. We propose GeoPlanAgent, a document-grounded, geospatial-tool-in-the-loop system that decomposes the task into evidence extraction, localisation, map registration, boundary segmentation, projection, and verification. On Plan2Map, GeoPlanAgent achieves 0.736 mean IoU and 0.904 median IoU, with 67.8% of predictions at or above 0.8 IoU, substantially outperforming direct VLM-to-GeoJSON baselines. Diagnostic analysis shows that direct VLM prediction remains unreliable, while remaining errors are concentrated in localisation and map registration, and supervised boundary segmentation substantially improves pixel-level mask quality. Plan2Map provides a concrete testbed for multimodal geospatial reconstruction from public planning records. Project page: this https URL.

中文摘要

摘要：规划记录定义了地理区域的限制，但其源文件通常只提供间接的空间证据，而非机器可读的边界。我们提出了 Plan2Map，这是一个包含 208 个案例的多模态基准，用于从英国规划记录中进行基于文档的地理空间边界重建。在仅提供源规划文件的情况下，系统必须从公告文本、附表、地图图版、地图标签和边界注释中重建有效的地理空间边界；参考 GeoJSON 不对系统公开，仅用于评分。我们提出了 GeoPlanAgent，这是一个以文档为基础、地理空间工具在环的系统，将任务分解为证据提取、定位、地图配准、边界分割、投影和验证。在 Plan2Map 上，GeoPlanAgent 实现了 0.736 的平均 IoU 和 0.904 的中位 IoU，67.8% 的预测 IoU 达到或超过 0.8，大幅优于直接 VLM 到 GeoJSON 的基线方法。诊断分析表明，直接 VLM 预测仍然不可靠，剩余错误主要集中在定位和地图配准环节，而有监督的边界分割显著提高了像素级掩模的质量。Plan2Map 为从公共规划记录进行多模态地理空间重建提供了一个具体的测试平台。项目页面：此 https URL。
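
The three headline statistics (mean IoU, median IoU, share at or above 0.8) can be sketched as follows. The sketch assumes boundaries have already been rasterised to sets of grid cells; a real pipeline scoring GeoJSON polygons would use a geometry library instead, and the helper names are illustrative.

```python
from statistics import mean, median


def rect_cells(x0, y0, x1, y1):
    """Grid cells covered by an axis-aligned rectangle (toy stand-in for a boundary)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def iou(pred_cells, ref_cells):
    """Intersection over union of two cell sets; 0.0 when both are empty."""
    union = pred_cells | ref_cells
    return len(pred_cells & ref_cells) / len(union) if union else 0.0


def summarise(ious, threshold=0.8):
    """Mean, median, and the share of cases at or above the threshold."""
    return {
        "mean_iou": mean(ious),
        "median_iou": median(ious),
        "share_at_or_above": sum(i >= threshold for i in ious) / len(ious),
    }
```

A median well above the mean, as in the reported 0.904 versus 0.736, is what one would expect when most cases score high and a minority fail badly: `summarise([0.9, 0.95, 0.2])` gives a median of 0.9 against a mean near 0.68.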

Authors: Fabian Degen, Oishi Deb, Jindong Gu, Junchi Yu, Samuele Marro, Philip Torr, Jialin Yu

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02747.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02747

Published: 2026-06-03T02:36:48.001Z

5. MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

Abstract:Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator’s ego-motion and the visible subject’s spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World-State Alignment, a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging that the shared 3D environment and physical events remain well-aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.

中文摘要

摘要：视频世界模型是具身人工智能和元宇宙的基础生成技术，但现有方法本质上仅限于单个智能体从单一视角观察。将这些模型扩展到多智能体环境会带来两个关键挑战：数据稀缺（在通用开放域场景中采集协调的多视角录像成本过高）和世界状态对齐（独立生成的视频流无法保证共享的物理环境和事件在各视角中一致演化）。为了解决这些挑战，我们提出了MetaWorld，这是一种新颖的框架，可以直接利用单视角视频将多智能体视频世界模型扩展到开放域环境。首先，我们引入了单目世界状态展开（MWSU），将单目视频显式分解为拍摄者的自我运动和可见主体的空间轨迹。这种相机-轨迹分解能够自然地提取共享3D空间内的同步多智能体运动数据，完全无需多摄像机设置。其次，为了实现精确的视觉控制，我们开发了主体感知世界生成器，以每个智能体的身份图像为条件实现外观驱动的模拟。最后，为保证两个视角均基于相同的物理现实，我们提出了世界状态对齐（WSA），这是一种插入视频DiT每个Transformer层的逐帧跨分支交叉注意力机制。通过联合同步去噪过程，WSA同时约束静态几何一致性和动态运动一致性，促使共享的3D环境和物理事件在两个自我中心视角中保持良好对齐。大量实验表明，MetaWorld在跨视角一致性和身份保真度方面表现优越，建立了一个高度可扩展的、物理驱动的多智能体视频世界建模范式。

Authors: Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02753.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02753

Published: 2026-06-03T02:36:48.001Z

6. From Local Training to Large-Scale Mapping: A Comparative Assessment of Machine Learning and Deep Learning for Transferable Satellite-Derived Bathymetry

Abstract:Satellite-derived bathymetry (SDB) from multispectral imagery is cost-effective but scales poorly across regions, especially in optically complex coastal environments. We evaluate machine learning and deep learning for transferable SDB over the 0-20 m depth range using Sentinel-2 imagery. A Random Forest baseline and four CNNs (ResNet-50, ResNet-101, EfficientNet-B4, ConvNeXt-Large) are trained on Pratas Island and selected Great Barrier Reef regions, then evaluated on spatially independent intra- and cross-regional test areas. Preserving spatial continuity during training, by keeping contiguous reef blocks rather than random patches, is the single most impactful design choice; we further introduce a Smooth Weight Function (SWF)-weighted RMSE loss that emphasizes near-surface depths. With these choices, intra-regional RMSE ranges from 1.15 to 1.92 m over 0-20 m and is as low as 0.26 m for depths <= 3 m. Random Forest degrades sharply under cross-regional transfer (RMSE 1.53 m -> 2.99-3.78 m), while the deep models stay more robust (2.46-2.98 m). On the public MagicBathyNet aerial-RGB benchmark (0-16 m) the proposed networks reach 0.19-0.22 m RMSE, outperforming a U-Net baseline and a task-specific transformer architecture with substantially fewer parameters. We further exploit multi-temporal repeat imagery: training on it broadens diversity, and median-aggregating predictions across passes at inference reduces noise from changing sun angles, atmospheric conditions, water properties, and tides. We release optimized architectures and pretrained weights to enable scalable transfer to new sites.

中文摘要

摘要：基于卫星的多光谱影像测深（SDB）具有成本效益，但在区域间的推广性较差，尤其是在光学复杂的沿海环境中。我们评估了机器学习和深度学习在0-20米水深范围内使用Sentinel-2影像进行可迁移SDB的效果。我们在普拉塔斯岛和选定的大堡礁区域训练了一个随机森林基线模型和四个卷积神经网络（ResNet-50、ResNet-101、EfficientNet-B4、ConvNeXt-Large），然后在空间独立的区域内和跨区域测试区进行评估。在训练过程中保持空间连续性（保留连续的礁块而不是随机块）是影响最大的设计选择；我们进一步引入了平滑权重函数（SWF）加权的均方根误差（RMSE）损失，以强调近表层水深。采用这些策略后，区域内RMSE在0-20米范围内为1.15到1.92米，对于<=3米的水深最低可达0.26米。随机森林在跨区域迁移时性能急剧下降（RMSE从1.53米升至2.99-3.78米），而深度模型则更加稳健（2.46-2.98米）。在公开的MagicBathyNet航空RGB基准（0-16米）上，提出的网络达到0.19-0.22米RMSE，优于U-Net基线模型和任务特定的Transformer架构，同时参数显著更少。我们进一步利用了多时相重复影像：在训练中使用多时相数据能够增加多样性，在推理时通过跨成像时间的预测中值聚合减少了因太阳角度变化、大气条件、水体属性和潮汐变化带来的噪声。我们发布了优化的架构和预训练权重，以实现向新站点的可扩展迁移。
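
Two ideas from this abstract, a depth-weighted RMSE and median aggregation across repeat passes, can be sketched in a few lines. The abstract does not give the form of the Smooth Weight Function, so `placeholder_weight` below is a hypothetical stand-in that merely favours shallow depths; the other names are illustrative as well.

```python
import math
from statistics import median


def placeholder_weight(depth_m, scale_m=5.0):
    """Hypothetical smooth weight: larger near the surface, decaying towards 1.
    This is NOT the paper's SWF, whose form is not stated in the abstract."""
    return 1.0 + math.exp(-depth_m / scale_m)


def weighted_rmse(pred, true, weight_fn=placeholder_weight):
    """RMSE with per-sample weights computed from the reference depth."""
    weights = [weight_fn(t) for t in true]
    sq = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, true))
    return math.sqrt(sq / sum(weights))


def median_aggregate(passes):
    """passes: per-pass depth predictions, each a list over the same pixels.
    Returns the per-pixel median across passes."""
    return [median(pixel) for pixel in zip(*passes)]
```

With three passes such as `[[2.0, 10.0], [2.2, 10.4], [5.0, 9.8]]`, the per-pixel median suppresses the outlying 5.0 m estimate in the first pixel, which is the kind of noise reduction the abstract attributes to aggregating over sun angle, atmosphere, water properties, and tide.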

Authors: Hsiao-Jou Hsu, Joachim Moortgat

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02764.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02764

Published: 2026-06-03T02:36:48.001Z

7. GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

Abstract:Vision-language models (VLMs) for autonomous driving have shown promising performance, but their ability to handle region-specific traffic rules remains underexplored, raising uncertainties about their deployment across diverse global settings. We therefore introduce GeoDrive-Bench, a novel benchmark that enables the systematic investigation of VLMs’ geo-culturally grounded driving reasoning. We curated 5,053 human-validated multiple-choice QA pairs across six countries covering diverse driving cultures. Specifically, we emphasize four driving tasks: perception, prediction, planning, and region reasoning. Each question requires models to infer the correct driving behavior from visual evidence and local traffic conventions without explicit country labels. Beyond evaluation, we further design a distillation algorithm that injects region-specific traffic-rule knowledge into the internal representations of VLMs, enabling models to better align visual scene understanding with local driving policies. Experiments on nine state-of-the-art VLMs show substantial performance variations across geo-driving cultures for each task, while our proposed baseline models exhibit improved geo-cultural reasoning across regions. These results suggest that current VLMs still lack robust region-aware driving intelligence and highlight GeoDrive-Bench as a diagnostic and training-oriented testbed for deployable autonomous driving foundation models.

中文摘要

摘要：用于自动驾驶的视觉-语言模型（VLMs）已显示出有希望的表现，但其处理特定地区交通规则的能力仍未充分探索，这对其在全球不同环境中的部署带来了不确定性。因此，我们提出了GeoDrive-Bench，这是一种新型基准，用于系统地研究VLMs在地理文化背景下的驾驶推理能力。我们整理了涵盖六个国家、涵盖多样驾驶文化的5,053条人工验证的多项选择问答对。具体而言，我们强调四项驾驶任务：感知、预测、规划和地区推理。每个问题都要求模型从视觉证据和地方交通惯例中推断出正确的驾驶行为，而无需明确的国家标签。除了评估之外，我们还设计了一种蒸馏算法，将特定地区的交通规则知识注入VLMs的内部表示，使模型能够更好地将视觉场景理解与本地驾驶策略对齐。在对九种最先进的VLMs进行实验时，每项任务在不同地理驾驶文化中表现出显著差异，而我们提出的基线模型在各地区展示了改进的地理文化推理能力。这些结果表明，当前的VLMs仍缺乏稳健的区域感知驾驶智能，并凸显GeoDrive-Bench作为可用于诊断和训练的可部署自动驾驶基础模型测试平台的重要性。

Authors: Yingzi Ma, Chaowei Xiao, Ming Jiang

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02774.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02774

Published: 2026-06-03T02:36:48.001Z

8. Diagnosis of Human Object Interaction Detectors for Real World Educational Applications

Abstract:Human-object interaction (HOI) recognition is critical for automatically analyzing student behavior in complex educational environments. Although state-of-the-art (SOTA) HOI detectors perform well on benchmark datasets, their performance often degrades when deployed in real-world training environments due to domain-specific objects, occlusions, and complex visual conditions. In this paper, we introduce a diagnosis-driven framework that integrates a triplet-level HOI error taxonomy with error-factor attribution analysis for real-world educational video data. We study this problem in the context of Critical Care Air Transport Team (CCATT) mixed-reality medical training. Based on an analysis of HOI failure modes and their causes, we develop a diagnosis-informed refinement strategy for adapting pretrained HOI models to the target domain. Experiments on the CCATT dataset show that this approach improves the macro-F1 score of a pretrained CDN model from 48.6 to 90.2 through targeted refinement guided by diagnosed error factors. These results highlight the value of detailed diagnostic analysis for informing targeted adaptation of HOI models in real-world educational environments.

中文摘要

摘要：人-物互动（HOI）识别对于在复杂教育环境中自动分析学生行为至关重要。尽管最先进（SOTA）的HOI检测器在基准数据集上表现良好，但由于特定领域的物体、遮挡以及复杂的视觉条件，它们在实际培训环境中的性能往往会下降。本文提出了一种诊断驱动框架，将三元组级HOI错误分类与错误因素归因分析相结合，用于现实教育视频数据。我们在重症护理航空运输团队（CCATT）混合现实医疗训练的背景下研究该问题。基于对HOI失败模式及其原因的分析，我们制定了一种基于诊断的优化策略，将预训练HOI模型适应目标领域。在CCATT数据集上的实验表明，通过针对性地根据诊断出的错误因素进行优化，该方法将预训练CDN模型的宏F1分数从48.6提升至90.2。这些结果突显了详细诊断分析在指导HOI模型在真实教育环境中进行针对性适应方面的价值。

Authors: Divya Mereddy, Ashwin Tudur Sadashiva, Marcos Quinones-Grueiro, Gautam Biswas

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02789.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02789

Published: 2026-06-03T02:36:48.001Z

9. Cosmos 3: Omnimodal World Models for Physical AI

Abstract:We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License (this https URL) at this https URL and this https URL. The project website is available at this https URL.

中文摘要

摘要：我们介绍了 Cosmos 3，这是一个全模态世界模型家族，旨在在统一的混合 Transformer（mixture-of-transformers）架构中联合处理和生成语言、图像、视频、音频和动作序列。通过支持高度灵活的输入输出配置，Cosmos 3 无缝整合了物理人工智能的关键模态，有效地将视觉-语言模型、视频生成器、世界模拟器和世界动作模型纳入单一框架。我们的评估表明，Cosmos 3 在多种理解和生成任务中建立了新的最先进水平，展示了全模态世界模型作为具身智能体的可扩展通用骨干的潜力。我们的后训练 Cosmos 3 模型在技术报告撰写时，被 Artificial Analysis 评为最佳开源文本到图像和图像到视频模型，并被 RoboArena 评为最佳策略模型。为了加速物理人工智能的开放研究和部署，我们在 Linux 基金会的 OpenMDW-1.1 许可（此 https URL）下提供代码、模型检查点、精选合成数据集和评估基准，见此 https URL 和此 https URL。项目网站可在此 https URL 访问。

Authors: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02800.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02800

Published: 2026-06-03T02:36:48.001Z

10. Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging

Abstract:Evaluating vision-language models (VLMs) on medical images requires benchmarks that are clinically grounded, scalable, and controlled for evaluation confounds. Existing public benchmarks are limited in scale, manually annotated, or potentially leaked into VLM pretraining corpora. We present an automated agent-driven pipeline that generates multiple-choice VQA datasets directly from paired private radiology reports and 3D oncology imaging, producing two complementary question types: RADS-style questions deterministically derived from clinician-defined reporting schemas, and radiology report-derived questions generated by an LLM from radiologist findings and verified against the source report. Applied to four in-house cancer cohorts, the pipeline yields an instance-contamination-controlled benchmark without per-question human annotation. Zero-shot evaluation of six VLMs reveals no dominant model and substantial headroom across all cells. A blind ablation reveals that visual reliance is highly dataset-specific: liver Report-derived questions genuinely require the image, while Lung CT is essentially solvable without it - the leading closed model exceeds its sighted accuracy on Lung CT when blinded - indicating that even private clinical data does not guarantee a contamination-controlled read of visual capability. The pipeline is released as an open agent skill for in-house redeployment.

中文摘要

摘要：在医学影像上评估视觉-语言模型（VLMs）需要临床基础、可扩展且可控评估混杂因素的基准。现有的公开基准在规模上有限，需要人工标注，或者可能泄露到 VLM 预训练语料中。我们提出了一种自动化的代理驱动流水线，能够直接从配对的私有放射学报告和三维肿瘤影像生成多项选择视觉问答（VQA）数据集，并产生两种互补的问题类型：RADS 风格问题，从临床医生定义的报告结构中确定性生成；以及从放射科医生发现生成并根据源报告验证的大型语言模型（LLM）生成的放射学报告衍生问题。应用于四个内部癌症队列，该流水线生成了一个实例污染可控的基准，而无需逐题人工标注。对六个 VLM 进行零样本评估显示没有占优模型，并且各项指标均有显著提升空间。盲消融研究表明视觉依赖高度依赖数据集：肝脏报告衍生的问题确实需要影像，而肺部 CT 基本可以在没有影像的情况下解决——在盲测条件下，领先的封闭模型在肺部 CT 上的准确率超过其有视觉条件下的准确率——表明即使是私有临床数据也不能保证对视觉能力进行污染控制的评估。该流水线作为一个开放的代理技能发布，可用于内部再次部署。

Authors: Bo Liu, Hanxue Gu, Xiangru Li, Zheren Zhu, Jacob Ellison, Kang Wang, Janine M. Lupo, Yang Yang, Hui Lin

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2606.02809.pdf

CoolPaper URL: https://papers.cool/arxiv/2606.02809

Published: 2026-06-03T02:36:48.001Z