Home
Content
  • Paper
  • LLMs
  • Jupyter
  • Algorithm
  • PLs
Daily
  • Github
  • HotNews
  • HF
  • Arxiv
Archives
Categories
About
37.2° Blog
HuggingFace Papers 2026-01-24
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience. The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intricate causal dynamics inherent in long-horizon computer tasks. In this work ...
HuggingFace Papers 2026-01-25
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience. The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intricate causal dynamics inherent in long-horizon computer tasks. In this work ...
HuggingFace Papers 2026-01-26
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience. The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intricate causal dynamics inherent in long-horizon computer tasks. In this work ...
HuggingFace Papers 2026-01-27
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. LongCat-Flash-Thinking-2601 Technical Report. We introduce LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model with superior agentic reasoning capability. LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on a wide range of agentic benchmarks, including agentic search, agentic tool use, and tool-integrated reasoning. Beyond benchmark performance, the model demonstrates ...
HuggingFace Papers 2026-01-28
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs. Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them, which is essential for a wide range of data-centric applications. Driven by (i) rising demands for application-ready data (e.g., for analytics, visualization, decision-making), (ii) increasingly powerful LLM techniques, and (iii) the emergence of i ...
HuggingFace Papers 2026-01-29
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security. The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally c ...
HuggingFace Papers 2026-01-30
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation. Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities ...
HuggingFace Papers 2026-02-01
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives. Autonomous scientific discovery with large language model (LLM)-based agents has recently made substantial progress, demonstrating the ability to automate end-to-end research workflows. However, existing systems largely rely on runtime-centric execution paradigms, repeatedly reading, summarizing, and reasoning over large volumes of scientific literature ...
HuggingFace Papers 2026-02-02
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives. Autonomous scientific discovery with large language model (LLM)-based agents has recently made substantial progress, demonstrating the ability to automate end-to-end research workflows. However, existing systems largely rely on runtime-centric execution paradigms, repeatedly reading, summarizing, and reasoning over large volumes of scientific literature ...
HuggingFace Papers 2026-02-03
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas. Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable lo ...
HuggingFace Papers 2026-02-04
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. Green-VLA: Staged Vision-Language-Action Model for Generalist Robots. We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) ...
HuggingFace Papers 2026-02-05
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding. Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational c ...
HuggingFace Papers 2026-02-08
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty. Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and ...
HuggingFace Papers 2026-02-07
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty. LLM Analysis Q: What problem does this paper try to solve? The paper targets two blind spots in existing benchmarks for large language model (LLM) agents and addresses the following core problems. Missing reliability under real-world uncertainty: existing benchmarks mostly evaluate task-completion rates under idealized conditions of complete information and full tool availability, overlooking that in real scenarios (such as in-car voice assistants) user requests often involve missing tools or insufficient parameter granularity, and environment queries that return incomplete data, making requests inherently unsatisfiable or highly ambiguous. In such cases the agent must be able to recognize what it cannot do and to proactively disambiguate, rather than keep generating plausible-looking hallucinated results. Absent consistency evaluation: existing metrics only measure at-least-one-success (Pass@k) and cannot reveal whether an agent consistently follows policy and consistently recognizes its own capability boundaries across multiple turns and runs. The paper proposes Pass^k (all k attempts succeed) as the primary metric to quantify deployment-level consistency. A systematic evaluation gap for new task types: two real-world failure modes are introduced and ...
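The Pass@k versus Pass^k distinction described in this entry can be illustrated with a small simulation. This is a minimal sketch, not CAR-bench's own harness; the 80% per-attempt success rate and k = 8 are arbitrary assumptions for illustration:

```python
import random

def pass_at_k(attempts):
    # Pass@k: the task counts as solved if at least one of k attempts succeeds
    return any(attempts)

def pass_all_k(attempts):
    # Pass^k: the task counts as solved only if all k attempts succeed,
    # capturing deployment-level consistency rather than best-case luck
    return all(attempts)

random.seed(0)
# Toy agent: 80% chance of success on each independent attempt, k = 8
trials = [[random.random() < 0.8 for _ in range(8)] for _ in range(2000)]

p_at_k = sum(map(pass_at_k, trials)) / len(trials)    # near 1 - 0.2**8
p_all_k = sum(map(pass_all_k, trials)) / len(trials)  # near 0.8**8, about 0.17
print(f"Pass@8 = {p_at_k:.3f}, Pass^8 = {p_all_k:.3f}")
```

The gap between the two numbers is exactly what the entry says Pass@k hides: an agent that looks nearly perfect under Pass@8 completes all eight runs only about 17% of the time.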
HuggingFace Papers 2026-02-10
Created 2019-06-18 | AI
Data source: HuggingFace Papers Latest Papers. 1. F-GRPO: Don’t Let Your Policy Learn the Obvious and Forget the Rare. Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating ...
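The group-sampling advantage estimate that this entry refers to can be sketched as follows. This is a generic GRPO-style normalization, not F-GRPO's actual algorithm, and the group contents are made-up illustrations:

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO-style advantage: each trajectory's reward is normalized
    # against the mean and std of its own sampled group
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        # Uniform rewards (all failures or all successes): no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# A small group that missed the rare correct trajectory yields zero signal,
# which is the bias toward already-likely behavior described above
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
# A group that did catch one rare success assigns it a positive advantage
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

With small groups, the all-zero case above occurs often on hard prompts, so the rare-correct behavior never receives a gradient push; that is the failure mode the paper's title alludes to.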
avatar
Firefly
A firefly flying freely in the AI domain.
Articles
550
Tags
23
Categories
15
Announcement
Welcome to My Personal Blog!
If not, please visit the Gitee mirror.
Recent Posts
Retrieval-Augmented LLMs (2024-01-13)
LLMs Open Course - 6. Large Models for Text Understanding and Generation (2024-01-10)
LLMs Open Course - 5. Efficient Training & Model Compression (2024-01-07)
Categories
  • AI (209)
  • Cython (1)
  • DSA (24)
  • GitHub (125)
  • HotNews (125)
Tags
DSA, RL, Transformer, LLMs, PaperReading, DeepLearning, CV, GPT, PL, domain, github, hf, hot_news, ArXiv, Domain, AI, GitHubTrending, HuggingFacePapers, HotNews, leetcode, algo
Archives
  • January 2024 (5)
  • December 2023 (14)
  • November 2023 (26)
  • October 2023 (1)
  • September 2023 (4)
©2023 - 2026 By Firefly