HuggingFace Papers 2026-06-01

Data source: HuggingFace Papers

Latest Papers

1. LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Abstract: Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents’ ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%–69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at this https URL.

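Abstract (translated from Chinese)

Abstract: Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks and do not test an agent's ability to track evolving analytical context over long horizons. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis in which agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks derived from real-world Kaggle notebooks, covering 2,225 turns across six domains, including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches an average accuracy of only 48.45%, that performance drops by nearly 47 percentage points from early to late turns, and that long-horizon errors account for 52% to 69% of failures. Further analysis shows that adding agent steps does not necessarily improve performance, which indicates that the key bottleneck is maintaining a correct analytical state rather than increasing the interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at this https URL.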

LLM Analysis

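Q: What problem does this paper try to solve?

This paper addresses a key gap in existing benchmarks for data-analysis agents: the lack of long-horizon interaction evaluation.

Specifically, the paper examines the following core problems:

1. Coverage blind spots in benchmarks

Most existing data-analysis benchmarks focus on isolated or short interactive tasks and typically evaluate independent requests in resettable environments. Real-world data analysis, however, is inherently iterative and persistent: analytical scope, metric definitions, assumptions, and intermediate results accumulate and change across many turns. Existing benchmarks do not adequately test an agent's ability to manage evolving analytical states across long dependency chains.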

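2. The complexity of analytical state management

The paper identifies six key state-evolution patterns in long-horizon data analysis that existing benchmarks do not adequately cover:

Initial: establish reusable analytical objects (e.g., cohorts, metric rules)
Inheritance: reuse the most recent valid analytical state
Update: override a previous definition or rule and establish a new default state
Counterfactual: introduce a temporary alternative assumption that applies only to the current turn
Rollback: answer a question from an earlier anchored version rather than the latest state
Composition: combine two or more explicit state operations

These patterns require agents to maintain, revise, restore, and compose analytical states that span dozens of turns, a capability that existing benchmarks do not evaluate systematically.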
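
The six patterns can be made concrete with a small versioned state store. The sketch below is illustrative only: `AnalysisState`, its method names, and the cohort/metric fields are hypothetical and are not taken from the LongDS code. It shows how an update changes the default state, while a counterfactual and a rollback read from the history without changing it.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisState:
    """Hypothetical store for versioned analytical definitions."""
    versions: list = field(default_factory=list)  # history of default states
    anchors: dict = field(default_factory=dict)   # anchor name -> version index

    def initial(self, **definitions):
        # Initial: establish the first reusable definitions.
        self.versions.append(dict(definitions))

    def inherit(self):
        # Inheritance: reuse the most recent valid state (returned as a copy).
        return dict(self.versions[-1])

    def update(self, **changes):
        # Update: override earlier definitions; the result becomes the default.
        self.versions.append({**self.versions[-1], **changes})

    def counterfactual(self, **changes):
        # Counterfactual: temporary view for one turn; history is untouched.
        return {**self.versions[-1], **changes}

    def anchor(self, name):
        # Remember the current version under a name so it can be restored.
        self.anchors[name] = len(self.versions) - 1

    def rollback(self, name):
        # Rollback: answer from an earlier anchored version, not the latest.
        return dict(self.versions[self.anchors[name]])


state = AnalysisState()
state.initial(cohort="all_users", metric="mean_revenue")
state.anchor("v0")
state.update(cohort="active_users")                      # new default state
what_if = state.counterfactual(metric="median_revenue")  # this turn only
current = state.inherit()
restored = state.rollback("v0")
# Composition: a rollback combined with a one-turn counterfactual override.
composed = {**restored, "metric": "median_revenue"}
```

After the counterfactual turn, `current` still uses `mean_revenue`, which is exactly the distinction an agent has to track: a temporary assumption must not leak into later turns.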

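3. Performance bottlenecks of current agents

Using the LongDS benchmark, which contains 68 tasks and 2,225 turns with an average dependency span of 11.3 turns, the paper shows empirically that current LLM agents have severe limitations:

Even the best-performing model (Gemini-3.1-Pro) reaches an average accuracy of only 48.45%;
Performance drops by nearly 47 percentage points from early to late turns within a task;
52%–69% of failures are attributed to long-horizon errors (including cascading errors and state-management errors) rather than isolated coding or reasoning errors;
Adding agent steps does not necessarily improve performance, which indicates that the bottleneck is analytical state maintenance rather than an insufficient interaction budget.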
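
To make the early-to-late drop concrete, here is a minimal sketch of how such a gap could be computed from per-turn correctness flags. The 50/50 split between "early" and "late" turns and the example results are assumptions made for illustration; the paper's exact bucketing is not given in this summary.

```python
def accuracy(flags):
    """Percentage of correct turns in a list of booleans."""
    return 100.0 * sum(flags) / len(flags)


def early_late_drop(turn_results, split=0.5):
    """Early-turn accuracy minus late-turn accuracy, in percentage points.

    turn_results: booleans in turn order (True = correct answer).
    split: fraction of turns counted as "early" (an assumption of this sketch).
    """
    if len(turn_results) < 2:
        raise ValueError("need at least two turns to compare early and late")
    # Keep at least one turn on each side of the cut.
    cut = min(len(turn_results) - 1, max(1, int(len(turn_results) * split)))
    return accuracy(turn_results[:cut]) - accuracy(turn_results[cut:])


# Made-up results for one 10-turn task: errors cluster in the later turns.
results = [True, True, True, False, True, False, False, True, False, False]
drop = early_late_drop(results)
```

For these invented results the early half is 80% correct and the late half 20%, so `drop` is 60 percentage points.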

In short, the paper

Authors: Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang

PDF URL: https://arxiv.org/pdf/2605.30434.pdf

Arxiv URL: https://arxiv.org/abs/2605.30434

Arxiv ID: 2605.30434

CoolPaper URL: https://papers.cool/arxiv/2605.30434

Published: 2026-06-01T01:58:41.394Z

Updated: 2026-06-01T01:58:41.394Z

2. Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Abstract: Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, while alternatives propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, $\pi_0$, and $\pi_{0.5}$. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy-timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.

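Abstract (translated from Chinese)

Abstract: Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain prone to failures during execution, which undermines their reliability in real-world settings. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, or propagate trajectory-level labels uniformly to every timestep, which obscures localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and produces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate it on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, $\pi_0$, and $\pi_{0.5}$. Our method achieves state-of-the-art multi-task failure detection performance under conformal prediction, with a practical trade-off between accuracy and timeliness, and generalizes well to both seen and unseen tasks.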

LLM Analysis

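Q: What problem does this paper try to solve?

This paper addresses failure detection for Vision-Language-Action (VLA) models during real-time execution, focusing on the following core challenges:

1. Core problem: runtime failure detection for VLA models

VLA models enable robots to follow natural language instructions and generalize across diverse tasks, but in real deployments they remain vulnerable to execution failures (e.g., failed grasps, action deviations). Failures that are not detected in time can cascade into item damage, task failure, or safety hazards. Reliably detecting execution failures at runtime is therefore critical for the robust deployment of embodied systems.

2. Key limitations of existing methods

The paper identifies shortcomings of existing methods along two key dimensions:

High supervision cost:
Step-level failure annotations are very expensive to obtain, because experts must pinpoint the timestamp at which an error occurs in long, stochastic trajectories.
Recent work (e.g., SAFE) sidesteps this by assigning the trajectory-level failure label uniformly to all timesteps, but this mislabels the normal actions that precede the failure as failures, which introduces substantial label noise and limits detector effectiveness.
Limited computational practicality:
Methods based on action resampling or on judgments from an external vision-language model (VLM) incur heavy inference overhead and cannot meet real-time deployment requirements.
Most prior methods are designed for specific tasks and adapt poorly to the diverse, open-world scenarios that modern generalist VLA models face.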

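3. The proposed solution

To address these challenges, the paper proposes the Hide-and-Seek framework, which recasts VLA failure detection as a coarsely supervised learning problem:

Core idea: automatically discover the most failure-indicative action moments from trajectory-level supervision alone (success/failure labels), without any step-level annotation.
Technical approach:
Inter-trajectory contrastive: ensure that the most failure-indicative step in a failed trajectory scores higher than the most failure-like step in a successful trajectory.
Intra-trajectory contrastive: within a failed trajectory, build a temporally structured failure signal by encouraging the average score after the failure occurs to exceed the average score before it.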
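
The two contrastive objectives can be sketched as hinge losses over per-step failure scores. This is a minimal illustration, not the paper's loss: the hinge form, the margins, and the choice of the highest-scoring step as a stand-in for the unknown failure onset are all assumptions of this sketch.

```python
def inter_trajectory_loss(fail_scores, success_scores, margin=1.0):
    """Hinge loss: the top step score of a failed rollout should beat the
    top step score of a successful rollout by at least `margin`."""
    return max(0.0, margin - (max(fail_scores) - max(success_scores)))


def intra_trajectory_loss(fail_scores, margin=0.5):
    """Hinge loss: the mean score from the pivot onward should exceed the
    mean score before it by at least `margin`.

    The pivot is the highest-scoring step, used here as a proxy for the
    failure onset because no step-level labels are available (assumption).
    """
    pivot = max(range(len(fail_scores)), key=fail_scores.__getitem__)
    before, after = fail_scores[:pivot], fail_scores[pivot:]
    if not before:  # pivot at step 0: nothing earlier to contrast against
        return 0.0
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return max(0.0, margin - (mean_after - mean_before))


# Made-up per-step failure scores for one failed and one successful rollout.
failed = [0.1, 0.2, 0.9, 0.8]
succeeded = [0.1, 0.3, 0.2]
inter = inter_trajectory_loss(failed, succeeded)
intra = intra_trajectory_loss(failed)
```

Both terms use only the trajectory-level label to decide which rollout is "failed", so early normal steps in a failed rollout are never forced toward a failure label.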

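4. Evaluation and validation

The framework is validated in the LIBERO and VLABench simulation environments and on a real robotic platform (using representative VLA policies such as OpenVLA, π0, and π0.5), and achieves:

Under conformal prediction, a practical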
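
As background for the conformal-prediction step, the sketch below shows a standard split-conformal threshold applied to failure scores. It is not taken from the paper: the function names are hypothetical, and it assumes each calibration value is the maximum step score of one successful rollout.

```python
import math


def conformal_threshold(calib_scores, alpha=0.1):
    """Split-conformal quantile over scores from successful rollouts."""
    n = len(calib_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calib_scores)[rank - 1]


def first_alarm(step_scores, threshold):
    """Index of the first step whose score exceeds the threshold, else None."""
    for t, score in enumerate(step_scores):
        if score > threshold:
            return t
    return None


# Made-up calibration scores (one per successful rollout) and a test rollout.
calibration = [0.10, 0.20, 0.15, 0.30, 0.25, 0.35, 0.40, 0.22, 0.18]
threshold = conformal_threshold(calibration, alpha=0.25)
alarm_step = first_alarm([0.1, 0.2, 0.5, 0.9], threshold)
```

A smaller `alpha` raises the threshold, which reduces false alarms on successful rollouts but delays detection; this is the accuracy-timeliness trade-off in its simplest form.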

Authors: Seongheon Park, Wendi Li, Changdae Oh, Samuel Yeh, Zsolt Kira, Michael Hagenow, Sharon Li

PDF URL: https://arxiv.org/pdf/2605.30834.pdf

Arxiv URL: https://arxiv.org/abs/2605.30834

Arxiv ID: 2605.30834

CoolPaper URL: https://papers.cool/arxiv/2605.30834

Published: 2026-06-01T01:59:17.157Z

Updated: 2026-06-01T01:59:17.157Z

3. Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Abstract: Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.

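Abstract (translated from Chinese)

Abstract: Most text-driven 3D indoor scene generation methods produce rooms from object-centric prompts, asking which furniture should be placed rather than how the space will be used. In real interior design, however, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We present Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs that describe who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on a large language model to produce the final scene directly, Function2Scene iteratively evaluates and refines the layout through a tool-augmented check-and-repair loop that combines geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. In experiments on 30 professionally written interior-design cases, Function2Scene produces layouts that satisfy functional requirements better than recent LLM-based scene synthesis baselines, and our results are preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.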
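
The check-and-repair idea can be illustrated with a single geometric constraint. The sketch below is a toy under stated assumptions, not Function2Scene's pipeline: furniture is reduced to intervals along one wall (in centimetres), the only check is a minimum clearance between neighbours, and the repair simply pushes the right-hand item. The names `Box`, `check`, `repair`, and `check_and_repair` are hypothetical, and the LLM-based and VLM-based checks are omitted.

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    x: int      # left edge along the wall, in centimetres
    width: int  # extent along the wall, in centimetres


def check(layout, min_clearance):
    """Geometric check: neighbouring pairs with less than the required gap."""
    boxes = sorted(layout, key=lambda box: box.x)
    return [
        (left, right)
        for left, right in zip(boxes, boxes[1:])
        if right.x - (left.x + left.width) < min_clearance
    ]


def repair(violation, min_clearance):
    """Naive repair: push the right-hand box until the clearance is met."""
    left, right = violation
    right.x = left.x + left.width + min_clearance


def check_and_repair(layout, min_clearance=80, max_rounds=10):
    """Alternate checking and repairing until clean or out of rounds."""
    for _ in range(max_rounds):
        violations = check(layout, min_clearance)
        if not violations:
            return True
        repair(violations[0], min_clearance)
    return not check(layout, min_clearance)


bed = Box("bed", 0, 200)
desk = Box("desk", 230, 120)
wardrobe = Box("wardrobe", 360, 100)
ok = check_and_repair([bed, desk, wardrobe])
```

Fixing one violation per round matters here: moving the desk creates a new conflict with the wardrobe, which only the next check can detect. The sketch ignores room bounds, so a real loop would also need a check that items stay inside the room.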

Authors: Ruiqi Wang, Qimin Chen, Daniel Ritchie, Angel X. Chang, Manolis Savva, Kai Wang, Hao Zhang

PDF URL: https://arxiv.org/pdf/2605.30819.pdf

Arxiv URL: https://arxiv.org/abs/2605.30819

Arxiv ID: 2605.30819

CoolPaper URL: https://papers.cool/arxiv/2605.30819

Published: 2026-06-01T01:59:53.332Z

Updated: 2026-06-01T01:59:53.332Z