37.2° Blog

HuggingFace Papers 2026-04-24

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model. We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model ena ...

HuggingFace Papers 2026-04-25

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics. Comprehensive understanding of time series remains a significant challenge for Large Language Models (LLMs). Current research is hindered by fragmented task definitions and benchmarks with inherent ambiguities, precluding rigorous evaluation and the development of unified Time Series Reasoning Models (TSRMs). To bridge this gap, we formalize Time Series Reasoni ...

HuggingFace Papers 2026-04-26

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics. Comprehensive understanding of time series remains a significant challenge for Large Language Models (LLMs). Current research is hindered by fragmented task definitions and benchmarks with inherent ambiguities, precluding rigorous evaluation and the development of unified Time Series Reasoning Models (TSRMs). To bridge this gap, we formalize Time Series Reasoni ...

HuggingFace Papers 2026-04-27

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics. Comprehensive understanding of time series remains a significant challenge for Large Language Models (LLMs). Current research is hindered by fragmented task definitions and benchmarks with inherent ambiguities, precluding rigorous evaluation and the development of unified Time Series Reasoning Models (TSRMs). To bridge this gap, we formalize Time Series Reasoni ...

HuggingFace Papers 2026-04-28

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond. As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We intro ...

HuggingFace Papers 2026-04-29

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. World-R1: Reinforcing 3D Constraints for Text-to-Video Generation. Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To fa ...

HuggingFace Papers 2026-04-30

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents. Abstract: We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around th ...

HuggingFace Papers 2026-05-01

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents. Abstract: We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around th ...

HuggingFace Papers 2026-05-05

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors. Abstract: Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverage ...

HuggingFace Papers 2026-05-12

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs. Abstract: Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledg ...

HuggingFace Papers 2026-05-13

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. Qwen-Image-2.0 Technical Report. Abstract: We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex ...

HuggingFace Papers 2026-05-19

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence. Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage, a critical risk in high-stakes do ...

HuggingFace Papers 2026-05-27

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning. Abstract: Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices ...

HuggingFace Papers 2026-05-28

Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. ResearchMath-14K: Scaling Research-Level Mathematics via Agents. Abstract: The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a ...

HuggingFace Papers 2026-03-29

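Created 2019-06-18 | AI

Data source: HuggingFace Papers. Latest Papers: 1. PixelSmile: Toward Fine-Grained Facial Expression Editing. LLM Analysis. Q: What problem does this paper try to solve? The paper targets the core challenges of fine-grained facial expression editing, specifically the following key problems: 1. Structured confusion caused by semantic overlap. Facial expressions lie on a continuous semantic manifold and inherently overlap (for example, fear and surprise share the "wide eyes, open mouth" features, while anger and disgust share the "furrowed brow, negative affect" features). Existing training methods based on discrete class labels (one-hot labels) force continuous expressions into rigid boundaries, with the result that: generative models learn entangled representations in latent space; editing one emotion unintentionally triggers features of another (for example, surprise features leak in when editing fear); human annotators, classifiers, and generative models all show systematic cross-category confusion. 2. Lack of continuous, fine-grained expression control. Existing methods rely mainly on discrete labels or coarse reference signals and cannot capture the subtle structure of human emotion, with the result that: the continuous variation of expression intensity cannot be precisely controlled; smooth, linear transitions between semantically adjacent expressions are hard to achieve; large-intensity edits are prone to identity drift (identi ...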