ArXiv Domain 2026-03-03
Data source: ArXiv Domain
LLM Domain Papers
1. DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training dat ...
ArXiv Domain 2026-03-08
Data source: ArXiv Domain
LLM Domain Papers
1. RoboPocket: Improve Robot Policies Instantly with Your Phone
Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy’s weaknesses, leading to inefficient coverage of critical state distributions. C ...
ArXiv Domain 2026-03-09
Data source: ArXiv Domain
LLM Domain Papers
1. RoboPocket: Improve Robot Policies Instantly with Your Phone
Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy’s weaknesses, leading to inefficient coverage of critical state distributions. C ...
ArXiv Domain 2026-03-13
Data source: ArXiv Domain
LLM Domain Papers
1. COMIC: Agentic Sketch Comedy Generation
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction ...
ArXiv Domain 2026-03-22
Data source: ArXiv Domain
LLM Domain Papers
1. NavTrust: Benchmarking Trustworthiness for Embodied Navigation
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world setting ...
ArXiv Domain 2026-04-19
Data source: ArXiv Domain
LLM Domain Papers
1. MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are ...