ArXiv Domain 2026-02-11
Data source: ArXiv Domain
LLM Domain Papers
1. Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving
Out of distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed loop co ...
ArXiv Domain 2026-02-27
Data source: ArXiv Domain
LLM Domain Papers
1. Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by e ...
ArXiv Domain 2026-03-03
Data source: ArXiv Domain
LLM Domain Papers
1. DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training dat ...
ArXiv Domain 2026-03-08
Data source: ArXiv Domain
LLM Domain Papers
1. RoboPocket: Improve Robot Policies Instantly with Your Phone
Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy’s weaknesses, leading to inefficient coverage of critical state distributions. C ...
ArXiv Domain 2026-03-09
Data source: ArXiv Domain
LLM Domain Papers
1. RoboPocket: Improve Robot Policies Instantly with Your Phone
Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy’s weaknesses, leading to inefficient coverage of critical state distributions. C ...
ArXiv Domain 2026-03-13
Data source: ArXiv Domain
LLM Domain Papers
1. COMIC: Agentic Sketch Comedy Generation
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction ...
ArXiv Domain 2026-03-22
Data source: ArXiv Domain
LLM Domain Papers
1. NavTrust: Benchmarking Trustworthiness for Embodied Navigation
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world setting ...