37.2° Blog

HuggingFace Papers 2026-02-01

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives. Autonomous scientific discovery with large language model (LLM)-based agents has recently made substantial progress, demonstrating the ability to automate end-to-end research workflows. However, existing systems largely rely on runtime-centric execution paradigms, repeatedly reading, summarizing, and reasoning over large volumes of scientific literatur ...

HuggingFace Papers 2026-02-02

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives. Autonomous scientific discovery with large language model (LLM)-based agents has recently made substantial progress, demonstrating the ability to automate end-to-end research workflows. However, existing systems largely rely on runtime-centric execution paradigms, repeatedly reading, summarizing, and reasoning over large volumes of scientific literatur ...

HuggingFace Papers 2026-02-03

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas. Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable lo ...

HuggingFace Papers 2026-02-04

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. Green-VLA: Staged Vision-Language-Action Model for Generalist Robots. We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) ...

HuggingFace Papers 2026-02-05

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding. Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational c ...

HuggingFace Papers 2026-02-07

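Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty. LLM Analysis. Q: What problem does this paper try to solve? The paper targets two major blind spots in existing evaluation benchmarks for large language model (LLM) agents and addresses the following core problems. Missing reliability under real-world uncertainty: existing benchmarks mostly evaluate task completion rate under ideal conditions where information is complete and every tool is available, overlooking that in real scenarios (such as in-car voice assistants) user requests often run into missing tools or insufficient parameter granularity, or environment queries that return incomplete data, which leaves the request inherently unsatisfiable or highly ambiguous. In these cases the agent needs the ability to "know what it cannot do" and to "actively disambiguate", rather than continuing to generate plausible-looking hallucinated results. Absence of consistency evaluation: existing metrics only measure "at least one success" (Pass@k) and cannot reveal whether an agent stably follows policies and stably recognizes the boundaries of its own capabilities across multiple turns and repeated runs. The paper proposes Pass^k (all k attempts succeed) as the primary metric to quantify deployment-level consistency. A gap in systematic evaluation of new task types: it introduces two kinds of real-world failure modes and ...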
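The contrast between Pass@k ("at least one success") and Pass^k ("all k attempts succeed") in the excerpt above can be illustrated with a small sketch. It is only a minimal empirical reading of those two definitions and not the paper's evaluation code; the function names `pass_at_k` and `pass_hat_k` and the sample outcomes are hypothetical.

```python
from typing import Sequence


def pass_at_k(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where at least one of the k attempts succeeded."""
    return sum(any(task) for task in trials) / len(trials)


def pass_hat_k(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where all k attempts succeeded (Pass^k)."""
    return sum(all(task) for task in trials) / len(trials)


# Hypothetical outcomes: 4 tasks, k = 3 attempts each (made-up data).
trials = [
    [True, True, True],     # consistent success
    [True, False, True],    # succeeds sometimes, but not reliably
    [False, False, False],  # never succeeds
    [True, True, True],     # consistent success
]

print(pass_at_k(trials))   # 3 of 4 tasks have at least one success
print(pass_hat_k(trials))  # 2 of 4 tasks succeed on every attempt
```

The second task counts toward Pass@k but not toward Pass^k, which is the kind of inconsistency the excerpt says an "at least one success" metric hides.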

HuggingFace Papers 2026-02-08

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty. Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains, such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and ...

HuggingFace Papers 2026-02-09

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty. Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains, such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and ...

HuggingFace Papers 2026-02-10

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. F-GRPO: Don’t Let Your Policy Learn the Obvious and Forget the Rare. Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating ...

HuggingFace Papers 2026-02-11

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining. Financial markets are noisy and non-stationary, making alpha mining highly sensitive to noise in backtesting results and sudden market regime shifts. While recent agentic frameworks improve alpha mining automation, they often lack controllable multi-round search and reliable reuse of validated experience. To address these challenges, we propose QuantaAlpha, an evolutionary alpha mining fra ...

HuggingFace Papers 2026-02-12

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration. As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induce ...

HuggingFace Papers 2026-02-13

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters. We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 ...

HuggingFace Papers 2026-02-14

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies. The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate ...

HuggingFace Papers 2026-02-16

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies. The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate ...

HuggingFace Papers 2026-02-17

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs. The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we intro ...