37.2° Blog
ArXiv Domain 2025-08-05
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models. Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant autoregressive large language models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length alloc ...
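The static-length constraint is easiest to see in a toy denoising loop. Below is a minimal, self-contained sketch (the dummy vocabulary and names like `denoise_step` are hypothetical, and random choices stand in for a real diffusion LLM's predictions) showing how a fixed canvas forces the model to commit to a generation length before it knows where `<eos>` will land:

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]
MASK = "<mask>"

def denoise_step(tokens):
    """Toy stand-in for one denoising step: commit a prediction at one
    masked position. (A real DLLM would score all masked positions with
    a transformer and unmask the most confident ones in parallel.)"""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    if not masked:
        return tokens, True
    i = random.choice(masked)         # pretend this is the highest-confidence slot
    tokens[i] = random.choice(VOCAB)  # pretend this is the model's prediction
    return tokens, False

def generate(length=8, max_steps=100):
    """Fixed-length parallel denoising: the canvas size is chosen up
    front, which is exactly the static-length constraint the paper targets."""
    tokens = [MASK] * length
    for _ in range(max_steps):
        tokens, done = denoise_step(tokens)
        if done:
            break
    # Everything denoised after the first <eos> is wasted computation.
    return tokens[: tokens.index("<eos>") + 1] if "<eos>" in tokens else tokens

print(generate())
```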
ArXiv Domain 2025-08-06
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. Test Set Quality in Multilingual LLM Evaluation. Several multilingual benchmark datasets have been developed semi-automatically in the recent past to measure progress and understand the state of the art in the multilingual capabilities of Large Language Models. However, little attention has been paid to the quality of these datasets themselves, despite previous work identifying errors in even fully human-annotated test sets. I ...
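As a concrete illustration of the kind of dataset errors at stake, here is a small sketch of mechanical quality checks on a multiple-choice test set. The checks (duplicates, empty fields, labels outside the answer space) are illustrative assumptions of mine, not the paper's methodology:

```python
from collections import Counter

def quality_report(examples):
    """Cheap sanity checks that often surface errors in
    semi-automatically built test sets."""
    issues = []
    texts = Counter(ex["question"] for ex in examples)
    for i, ex in enumerate(examples):
        if not ex["question"].strip():
            issues.append((i, "empty question"))
        if texts[ex["question"]] > 1:
            issues.append((i, "duplicate question"))
        if ex["answer"] not in ex["choices"]:
            issues.append((i, "answer not among choices"))
    return issues

test_set = [
    {"question": "2+2?", "choices": ["3", "4"], "answer": "4"},
    {"question": "2+2?", "choices": ["3", "4"], "answer": "5"},  # bad label
]
for idx, problem in quality_report(test_set):
    print(f"example {idx}: {problem}")
```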
ArXiv Domain 2025-08-07
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward. Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers; it also serves as the reward model that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex ...
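The regex-matching baseline the abstract contrasts against looks roughly like the sketch below. The patterns are illustrative, not CompassVerifier itself; the second call shows the brittleness that motivates a learned verifier:

```python
import re

def extract_answer(output: str) -> str | None:
    """Rule-based answer extraction of the kind most evaluation
    frameworks use (illustrative patterns only)."""
    patterns = [
        r"answer is\s*:?\s*([A-D]|-?\d+(?:\.\d+)?)",
        r"\\boxed\{([^}]*)\}",
    ]
    for pat in patterns:
        m = re.search(pat, output, flags=re.IGNORECASE)
        if m:
            return m.group(1).strip()
    return None

def verify(output: str, gold: str) -> bool:
    pred = extract_answer(output)
    return pred is not None and pred.lower() == gold.lower()

print(verify("Reasoning... so the answer is: 42", "42"))  # True
print(verify("I think it's forty-two", "42"))             # False: regex misses it
```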
ArXiv Domain 2025-08-08
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay. The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines on previously learned tasks. ...
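At the batching level, replay can be sketched as reserving a fraction of each fine-tuning batch for samples drawn from a fixed general pool. This is a generic replay recipe under the assumption that GeRe operates in this spirit; the paper's actual selection of general samples is not reproduced here:

```python
import random

def mixed_batches(domain_data, general_pool, batch_size=8, replay_ratio=0.25):
    """Replay-style batching: each fine-tuning batch reserves some slots
    for 'general' samples so the model keeps rehearsing general skills."""
    n_replay = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    random.shuffle(domain_data)
    for i in range(0, len(domain_data), n_new):
        new = domain_data[i : i + n_new]
        if not new:
            break
        replay = random.sample(general_pool, k=min(n_replay, len(general_pool)))
        yield new + replay

domain = [f"domain-{i}" for i in range(20)]
general = [f"general-{i}" for i in range(100)]
for batch in mixed_batches(domain, general):
    print(batch)
```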
ArXiv Domain 2025-08-09
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations. Large Language Models (LLMs) have begun to demonstrate the ability to persuade humans, yet our understanding of how this dynamic unfolds is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we ...
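A linear probe in this sense is just a linear classifier fitted on frozen hidden states. The sketch below uses synthetic vectors in place of real LLM activations and a made-up label ("persuasion succeeded") to show the generic recipe, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64

# Stand-in for hidden states extracted at some layer; in practice these
# come from forward passes of a frozen LLM over real dialogue turns.
hidden_states = rng.normal(size=(500, d_model))
labels = (hidden_states[:, :4].sum(axis=1) > 0).astype(int)  # synthetic signal

# The probe itself: one linear classifier, cheap to train and inspect.
probe = LogisticRegression(max_iter=1000).fit(hidden_states[:400], labels[:400])
print("probe accuracy:", probe.score(hidden_states[400:], labels[400:]))
```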
ArXiv Domain 2025-08-10
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations. Large Language Models (LLMs) have begun to demonstrate the ability to persuade humans, yet our understanding of how this dynamic unfolds is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we ...
ArXiv Domain 2025-08-11
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations. Large Language Models (LLMs) have begun to demonstrate the ability to persuade humans, yet our understanding of how this dynamic unfolds is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we ...
ArXiv Domain 2025-08-12
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning. Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, a framework that accelerates inference by directly pruning less cri ...
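Dynamic token pruning can be sketched as dropping low-importance positions between layers so that later layers see a shorter sequence. The importance score below is a random placeholder; SlimInfer's actual pruning criterion is not reproduced here:

```python
import numpy as np

def prune_hidden_states(hidden, importance, keep_ratio=0.5):
    """Generic dynamic-pruning sketch: keep only the top-scoring
    fraction of token positions, preserving their original order."""
    seq_len = hidden.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])  # top-k indices, in order
    return hidden[keep], keep

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 8))  # (tokens, d_model)
importance = rng.random(16)        # e.g. attention mass each token receives
pruned, kept_idx = prune_hidden_states(hidden, importance)
print(f"kept {pruned.shape[0]}/{hidden.shape[0]} tokens:", kept_idx)
```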
ArXiv Domain 2025-08-13
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. Jinx: Unlimited LLMs for Probing Alignment Failures. Unlimited, or so-called helpful-only, language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model's, this indicates alignment failures that require further attention. Despite th ...
ArXiv Domain 2025-08-14
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows. Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realisti ...
ArXiv Domain 2025-08-15
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression. Transformer-based Large Language Models rely critically on the KV cache to handle extended contexts efficiently during the decode phase. Yet the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy co ...
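A two-stage scheme of this general shape might pair permanent eviction with per-step sparse selection, as in the sketch below. This is a generic illustration of KV-cache compression, not RocketKV's actual algorithm; the `scores` array is a stand-in for whatever importance signal the method uses:

```python
import numpy as np

def compress_kv(keys, values, scores, evict_ratio=0.5):
    """Stage 1 (coarse): permanently drop the least-attended entries."""
    k = int(len(scores) * (1 - evict_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return keys[keep], values[keep]

def sparse_attend(query, keys, values, top_k=4):
    """Stage 2 (fine): each decode step attends only to the top-k
    remaining entries for the current query."""
    sims = keys @ query
    top = np.argsort(sims)[-top_k:]
    weights = np.exp(sims[top] - sims[top].max())
    weights /= weights.sum()
    return weights @ values[top]

rng = np.random.default_rng(0)
keys, values = rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
scores = rng.random(64)  # e.g. historical attention mass per cached token
keys, values = compress_kv(keys, values, scores)
print(sparse_attend(rng.normal(size=8), keys, values).shape)  # (8,)
```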
ArXiv Domain 2025-08-16
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks. Large Language Models (LLMs) have significantly advanced the state of the art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, de ...
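The LLM-as-a-judge setup being benchmarked amounts to prompting a model to pick the better of two candidate responses. Below is a minimal pairwise-judging sketch, with a stub in place of a real model call and a prompt template of my own invention, not CodeJudgeBench's protocol:

```python
JUDGE_TEMPLATE = """You are a code-review judge. Given a programming task and two
candidate solutions, answer with exactly "A" or "B" for the better one.

Task:
{task}

Solution A:
{a}

Solution B:
{b}

Better solution:"""

def pairwise_judge(task, sol_a, sol_b, llm):
    """Pairwise LLM-as-a-judge: `llm` is any callable prompt -> completion."""
    verdict = llm(JUDGE_TEMPLATE.format(task=task, a=sol_a, b=sol_b)).strip()
    return sol_a if verdict.upper().startswith("A") else sol_b

fake_llm = lambda prompt: "A"  # stub; replace with a real model call
best = pairwise_judge("reverse a string", "s[::-1]", "''.join(reversed(s))", fake_llm)
print(best)
```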
ArXiv Domain 2025-08-17
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks. Large Language Models (LLMs) have significantly advanced the state of the art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, de ...
ArXiv Domain 2025-08-18
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks. Large Language Models (LLMs) have significantly advanced the state of the art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, de ...
ArXiv Domain 2025-08-19
Created: 2019-06-18 | AI
Data source: ArXiv Domain. LLM Domain Papers:
1. Controlling Multimodal LLMs via Reward-guided Decoding. As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them to diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves buildi ...
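Reward-guided decoding, in its generic form, rescores candidate continuations by combining the language model's log-probability with an external reward. The sketch below is that generic form, not the paper's specific construction; the grounding reward is a hypothetical stand-in:

```python
import math

def reward_guided_step(candidates, lm_logprobs, reward_fn, alpha=1.0):
    """One step of reward-guided decoding: rescore each candidate by
    LM log-probability plus a weighted reward, then pick the argmax."""
    scored = [(lp + alpha * reward_fn(c), c) for c, lp in zip(candidates, lm_logprobs)]
    return max(scored)[1]

# Hypothetical grounding reward: prefer continuations mentioning the target object.
reward = lambda text: 1.0 if "cat" in text else 0.0
candidates = ["a cat on the mat", "a dog on the mat"]
lm_logprobs = [math.log(0.3), math.log(0.7)]  # the LM alone prefers "dog"
print(reward_guided_step(candidates, lm_logprobs, reward))  # reward flips it to "cat"
```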