Articles: 550 | Tags: 23 | Categories: 15

Home
Content
  • Paper
  • LLMs
  • Jupyter
  • Algorithm
  • PLs
Daily
  • Github
  • HotNews
  • HF
  • Arxiv
Archives
Categories
About
37.2° Blog
HuggingFace Papers 2026-02-27
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation. Modeling long sequences of user behaviors has emerged as a critical frontier in generative recommendation. However, existing solutions face a dilemma: linear attention mechanisms achieve efficiency at the cost of retrieval precision due to limited state capacity, while softmax attention suffers from prohibitive computational overhead. To address this challen ...
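The efficiency/capacity trade-off that this abstract describes can be made concrete with a generic sketch of the two attention families. This is an illustration of the standard formulations only, not HyTRec's actual architecture; the positive feature map `phi` is an assumption chosen for simplicity.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Full softmax attention: every pair of positions interacts, O(n^2) time/memory."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def linear_attention(q, k, v):
    """Kernelized linear attention: a fixed-size running state replaces the n x n
    score matrix, giving O(n) cost but bounded state ("memory") capacity."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6      # positive feature map (an assumption)
    qp, kp = phi(q), phi(k)
    state = kp.T @ v          # (d, d_v) summary; its size does not grow with sequence length
    norm = kp.sum(axis=0)     # (d,) normalizer accumulated the same way
    return (qp @ state) / (qp @ norm)[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))              # toy sequence: n=16, d=8
out_softmax = softmax_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Because the linear variant compresses all keys and values into one `(d, d_v)` state, retrieval precision degrades as the sequence grows — the dilemma the paper targets.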
HuggingFace Papers 2026-03-04
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. From Scale to Speed: Adaptive Test-Time Scaling for Image Editing. Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: in ...
HuggingFace Papers 2026-03-05
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. Utonia: Toward One Encoder for All Point Clouds. We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite th ...
HuggingFace Papers 2026-03-10
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders. Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via ...
HuggingFace Papers 2026-03-11
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. Lost in Stories: Consistency Bugs in Long Story Generation by LLMs. What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot ...
HuggingFace Papers 2026-03-12
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing. Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe tha ...
HuggingFace Papers 2026-03-13
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. OpenClaw-RL: Train Any Agent Simply by Talking. Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a policy can learn from all of them simultaneously. Personal conversations, termin ...
HuggingFace Papers 2026-03-21
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding. While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we ...
HuggingFace Papers 2026-03-25
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models. Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling ...
HuggingFace Papers 2026-03-28
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. PixelSmile: Toward Fine-Grained Facial Expression Editing. Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a dif ...
HuggingFace Papers 2026-03-29
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. PixelSmile: Toward Fine-Grained Facial Expression Editing. LLM Analysis. Q: What problem does this paper try to solve? The paper targets the core challenges of fine-grained facial expression editing, specifically: 1. Structured confusion caused by semantic overlap. Facial expressions lie on a continuous semantic manifold and inherently overlap (e.g., fear and surprise share "wide eyes, open mouth" features; anger and disgust share "furrowed brows, negative affect" features). Existing training methods based on discrete one-hot category labels force continuous expressions into rigid boundaries, so that: generative models learn entangled representations in latent space; editing one emotion unintentionally triggers features of another (e.g., editing fear mixes in surprise features); human annotators, classifiers, and generative models all exhibit systematic cross-category confusion. 2. Lack of continuous, fine-grained expression control. Existing methods rely mainly on discrete labels or coarse reference signals and cannot capture the fine structure of human affect, so that: expression intensity cannot be varied continuously and precisely; smooth, linear transitions between semantically adjacent expressions are hard to achieve; large-intensity edits are prone to identity drift (identi ...
HuggingFace Papers 2026-03-30
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. PixelSmile: Toward Fine-Grained Facial Expression Editing. LLM Analysis. Q: What problem does this paper try to solve? The paper targets the core challenges of fine-grained facial expression editing, specifically: 1. Structured confusion caused by semantic overlap. Facial expressions lie on a continuous semantic manifold and inherently overlap (e.g., fear and surprise share "wide eyes, open mouth" features; anger and disgust share "furrowed brows, negative affect" features). Existing training methods based on discrete one-hot category labels force continuous expressions into rigid boundaries, so that: generative models learn entangled representations in latent space; editing one emotion unintentionally triggers features of another (e.g., editing fear mixes in surprise features); human annotators, classifiers, and generative models all exhibit systematic cross-category confusion. 2. Lack of continuous, fine-grained expression control. Existing methods rely mainly on discrete labels or coarse reference signals and cannot capture the fine structure of human affect, so that: expression intensity cannot be varied continuously and precisely; smooth, linear transitions between semantically adjacent expressions are hard to achieve; large-intensity edits are prone to identity drift (identi ...
HuggingFace Papers 2026-03-31
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models. Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring ...
HuggingFace Papers 2026-04-04
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models. Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases w ...
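The data-reweighting idea mentioned in this abstract can be sketched with a minimal, hypothetical multiplicative update that shifts sampling weight toward high-loss examples. This is a generic illustration of the concept, not DataFlex's actual algorithm; the function name and learning rate are assumptions.

```python
import math

def reweight(weights, losses, lr=0.5):
    """One hypothetical reweighting step: multiplicatively upweight high-loss
    samples, then renormalize so the weights stay a probability distribution."""
    scaled = [w * math.exp(lr * l) for w, l in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

weights = [0.25, 0.25, 0.25, 0.25]   # start uniform over four training samples
losses = [0.1, 0.1, 0.1, 2.0]        # one sample is much harder than the rest
weights = reweight(weights, losses)
```

After one step, the hard sample receives the largest share of the sampling budget; a unified framework's value is in letting such selection, mixture, and reweighting policies plug into one training loop.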
HuggingFace Papers 2026-04-06
Created 2019-06-18 | AI
Data source: HuggingFace Papers. Latest Papers: 1. A Simple Baseline for Streaming Video Understanding. Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and on ...
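The sliding-window baseline this abstract describes is simple enough to sketch directly: keep only the most recent N frames and hand each window to a VLM. The generator below is a hedged stand-in — `stream_frames` and the omitted VLM call are illustrative names, not SimpleStream's actual interface.

```python
from collections import deque

def stream_frames(frames, window_size=4):
    """Sliding-window streaming baseline: at every timestep, retain only the
    most recent `window_size` frames. Each yielded window is what would be
    passed to an off-the-shelf VLM (the model call itself is omitted here)."""
    window = deque(maxlen=window_size)   # older frames fall out automatically
    for t, frame in enumerate(frames):
        window.append(frame)
        yield t, list(window)            # stand-in for vlm(list(window), question)

steps = list(stream_frames(range(10), window_size=4))
final_t, final_window = steps[-1]
```

`deque(maxlen=N)` gives the whole memory mechanism in one line, which is precisely the paper's point: this trivial policy is a strong baseline against elaborate streaming memories.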
Firefly
A firefly flying freely in the AI domain.
©2023 - 2026 By Firefly