HuggingFace Papers 2025-07-15
Data source: HuggingFace Papers
Latest Papers
1. SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data, VidChatBench, to serve as a benchmark for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
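As a rough illustration of how the branch/quality structure described above could be consumed downstream, here is a minimal Python sketch. The record fields (`branch`, `quality_tier`, etc.) and the filtering helper are assumptions for illustration only, not the released data-processing code or schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one SpeakerVid-5M clip; field names are
# illustrative assumptions, not the official schema.
@dataclass
class ClipRecord:
    clip_id: str
    branch: str        # "dialogue", "single", "listening", or "multi_turn"
    quality_tier: str  # "pretrain" (large-scale) or "sft" (curated, high quality)
    duration_sec: float

def select_sft_subset(records: List[ClipRecord], branch: str) -> List[ClipRecord]:
    """Keep only curated SFT clips of the requested interaction branch."""
    return [r for r in records if r.branch == branch and r.quality_tier == "sft"]

if __name__ == "__main__":
    corpus = [
        ClipRecord("c0001", "dialogue", "sft", 12.4),
        ClipRecord("c0002", "listening", "pretrain", 8.1),
        ClipRecord("c0003", "dialogue", "pretrain", 15.0),
    ]
    sft_dialogue = select_sft_subset(corpus, "dialogue")
    print(f"{len(sft_dialogue)} curated dialogue clips selected")
```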
Authors: Youliang Zhang,Zhaoyang Li,Duomin Wang,Jiahe Zhang,Deyu Zhou,Zixin Yin,Xili Dai,Gang Yu,Xiu Li
Categories: cs.CV,eess.AS
PDF URL: https://arxiv.org/pdf/2507.09862.pdf
Arxiv URL: https://arxiv.org/abs/2507.09862
Arxiv ID: 2507.09862
Published: 2025-07-14T02:22:47Z
Updated: 2025-07-14T02:22:47.000Z
2. Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
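A minimal sketch of the kind of leakage-free arithmetic generator the abstract describes; the operator set, difficulty knobs, and output format below are illustrative assumptions, not the exact RandomCalculation recipe.

```python
import random
import operator

# Toy generator of fully synthetic arithmetic problems of arbitrary length.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_problem(num_ops: int, max_operand: int, seed=None):
    """Build one problem with num_ops chained operations and its exact answer."""
    rng = random.Random(seed)
    expr = str(rng.randint(1, max_operand))
    value = int(expr)
    for _ in range(num_ops):
        op = rng.choice(list(OPS))
        operand = rng.randint(1, max_operand)
        # Evaluate left-to-right so the reference answer is unambiguous.
        value = OPS[op](value, operand)
        expr = f"({expr} {op} {operand})"
    return {"question": f"Compute {expr}.", "answer": value}

if __name__ == "__main__":
    sample = make_problem(num_ops=5, max_operand=20, seed=0)
    print(sample["question"], "->", sample["answer"])
```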
Authors: Mingqi Wu,Zhihao Zhang,Qiaole Dong,Zhiheng Xi,Jun Zhao,Senjie Jin,Xiaoran Fan,Yuhao Zhou,Yanwei Fu,Qin Liu,Songyang Zhang,Qi Zhang
Categories: cs.LG,cs.AI,cs.CL
PDF URL: https://arxiv.org/pdf/2507.10532.pdf
Arxiv URL: https://arxiv.org/abs/2507.10532
Arxiv ID: 2507.10532
Published: 2025-07-14T17:55:15Z
Updated: 2025-07-14T17:55:15.000Z
3. EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.
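To make the trajectory structure concrete, here is a hypothetical sketch of one task step and a zero-shot success-rate tally; the field names and the success criterion are assumptions for illustration, not the released schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical layout of one step in an EmRACE-3K-style trajectory.
@dataclass
class TrajectoryStep:
    observation_path: str   # first-person RGB frame
    instruction: str        # high-level language instruction
    action: str             # grounded action taken at this step
    rationale: str          # natural-language intent behind the action

@dataclass
class Task:
    task_id: str
    steps: List[TrajectoryStep]
    success: bool           # whether the episode reached its goal

def success_rate(tasks: List[Task]) -> float:
    """Fraction of episodes that reached their goal (zero-shot evaluation)."""
    return sum(t.success for t in tasks) / max(len(tasks), 1)

if __name__ == "__main__":
    demo = [Task("t1", [], True), Task("t2", [], False), Task("t3", [], False)]
    print(f"success rate: {success_rate(demo):.1%}")
```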
Authors: Mingxian Lin,Wei Huang,Yitang Li,Chengjie Jiang,Kui Wu,Fangwei Zhong,Shengju Qian,Xin Wang,Xiaojuan Qi
Categories: cs.CV,cs.AI,cs.CL
PDF URL: https://arxiv.org/pdf/2507.10548.pdf
Arxiv URL: https://arxiv.org/abs/2507.10548
Arxiv ID: 2507.10548
Published: 2025-07-14T17:59:46Z
Updated: 2025-07-14T17:59:46.000Z
4. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
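The per-token recursion idea can be sketched in a few lines: a single shared block is applied repeatedly, a lightweight router decides which tokens stay active at each recursion step, and inactive tokens are simply carried forward. The routing rule, shared block, and threshold below are toy assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_block(x):
    """Stand-in for the shared stack of layers (a fixed map reused every step)."""
    W = np.eye(x.shape[-1]) * 0.9
    return np.tanh(x @ W)

def router_scores(x):
    """Toy router: one scalar per token deciding whether it needs more depth."""
    return 1.0 / (1.0 + np.exp(-x.mean(axis=-1)))

def mixture_of_recursions(x, max_depth=3, keep_threshold=0.5):
    active = np.ones(x.shape[0], dtype=bool)  # all tokens start active
    for _ in range(max_depth):
        if not active.any():
            break
        # Only still-active tokens pass through the shared block again, so
        # attention/compute (and KV caching) would be restricted to this subset.
        x[active] = shared_block(x[active])
        # Router drops tokens that no longer need further recursion depth.
        active = active & (router_scores(x) > keep_threshold)
    return x

tokens = rng.normal(size=(8, 16))   # 8 tokens, hidden size 16
out = mixture_of_recursions(tokens)
print(out.shape)
```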
Authors: Sangmin Bae,Yujin Kim,Reza Bayat,Sungnyun Kim,Jiyoun Ha,Tal Schuster,Adam Fisch,Hrayr Harutyunyan,Ziwei Ji,Aaron Courville,Se-Young Yun
Categories: cs.CL,cs.LG
PDF URL: https://arxiv.org/pdf/2507.10524.pdf
Arxiv URL: https://arxiv.org/abs/2507.10524
Arxiv ID: 2507.10524
Published: 2025-07-14T17:49:00Z
Updated: 2025-07-14T17:49:00.000Z
5. REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly and perpetual creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Key mechanistic insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
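In spirit, the stress test amounts to packing several independent questions into one prompt and scoring each answer separately. A minimal sketch follows; the prompt template and answer-parsing convention are assumptions, not REST's exact protocol.

```python
from typing import List

def build_rest_prompt(questions: List[str]) -> str:
    """Concatenate several problems into a single prompt, asking for numbered answers."""
    numbered = "\n".join(f"Problem {i + 1}: {q}" for i, q in enumerate(questions))
    return (
        "Solve all of the following problems. "
        f"Give your answers as 'Answer i: ...' lines.\n{numbered}"
    )

def score_rest(response: str, references: List[str]) -> float:
    """Fraction of problems answered correctly within the single shared response."""
    correct = 0
    for i, ref in enumerate(references):
        marker = f"Answer {i + 1}:"
        if marker in response:
            answer = response.split(marker, 1)[1].splitlines()[0].strip()
            correct += int(answer == ref)
    return correct / len(references)

if __name__ == "__main__":
    qs = ["What is 2 + 2?", "What is 10 * 3?"]
    print(build_rest_prompt(qs))
    fake_response = "Answer 1: 4\nAnswer 2: 30"
    print(score_rest(fake_response, ["4", "30"]))  # 1.0
```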
Authors: Zhuoshi Pan,Qizhi Pei,Yu Li,Qiyao Sun,Zinan Tang,H. Vicky Zhao,Conghui He,Lijun Wu
Categories: cs.CL
PDF URL: https://arxiv.org/pdf/2507.10541.pdf
Arxiv URL: https://arxiv.org/abs/2507.10541
Arxiv ID: 2507.10541
Published: 2025-07-14T17:58:47Z
Updated: 2025-07-14T17:58:47.000Z
6. LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers
Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors, limiting their reliability in knowledge-intensive tasks. While decoding-time strategies offer a promising, training-free and efficient solution, existing methods typically treat token-level and layer-level signals in isolation, overlooking the joint dynamics between them. In this work, we introduce a token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation. Through empirical attention analysis, we identify two key patterns: punctuation tokens receive dominant attention in early layers, while conceptual tokens govern semantic reasoning in intermediate layers. By selectively suppressing attention to these token types at their respective depths, we induce controlled factual degradation and derive contrastive signals to guide the final factual decoding. Our method requires no additional training or model modification, and experiments demonstrate that it consistently improves factuality across multiple LLMs and various benchmarks.
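The contrastive step can be illustrated as combining logits from a normal forward pass with logits from a deliberately degraded pass (attention to punctuation suppressed in early layers, to conceptual tokens in intermediate layers). The combination rule and coefficient below are generic contrastive-decoding assumptions, not the paper's exact formulation.

```python
import numpy as np

def contrastive_logits(base_logits, degraded_logits, alpha=1.0):
    """
    Boost tokens the full model prefers relative to the factually-degraded
    pass: score = base + alpha * (base - degraded). The degraded pass is
    assumed to come from suppressing attention to specific token types at
    their respective layer depths.
    """
    base = np.asarray(base_logits, dtype=float)
    degraded = np.asarray(degraded_logits, dtype=float)
    return base + alpha * (base - degraded)

def greedy_next_token(base_logits, degraded_logits, alpha=1.0):
    return int(np.argmax(contrastive_logits(base_logits, degraded_logits, alpha)))

if __name__ == "__main__":
    base = [2.0, 1.8, 0.5]       # full model slightly favors token 0
    degraded = [1.9, 0.2, 0.4]   # degraded pass loses token 1 the most
    print(greedy_next_token(base, degraded))  # 1: contrast shifts the preference
```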
Authors: Jingze Zhu,Yongliang Wu,Wenbo Zhu,Jiawang Cao,Yanqiang Zheng,Jiawei Chen,Xu Yang,Bernt Schiele,Jonas Fischer,Xinting Hu
Categories: cs.AI
PDF URL: https://arxiv.org/pdf/2507.04404.pdf
Arxiv URL: https://arxiv.org/abs/2507.04404
Arxiv ID: 2507.04404
Published: 2025-07-06T14:35:43Z
Updated: 2025-07-06T14:35:43.000Z
7. CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluation. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with a margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model delivers judgment accuracy competitive with significantly larger models such as DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency, to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
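One of the core ideas, rejection sampling guided by verifiable rewards, can be sketched as keeping only sampled judgments whose verdict matches a checkable ground-truth label. The sampling interface and verdict parsing below are illustrative assumptions, not the paper's training pipeline.

```python
import random
from typing import Callable, List

def toy_judge_sampler(prompt: str, rng: random.Random) -> str:
    """Stand-in for sampling a judgment (with reasoning) from the judge model."""
    verdict = rng.choice(["A", "B"])
    return f"Reasoning about '{prompt[:20]}...'. Final verdict: {verdict}"

def rejection_sample(prompt: str, gold_verdict: str, n_samples: int = 8,
                     sampler: Callable[[str, random.Random], str] = toy_judge_sampler,
                     seed: int = 0) -> List[str]:
    """Keep only sampled judgments whose final verdict matches the verifiable label."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        judgment = sampler(prompt, rng)
        if judgment.strip().endswith(f"Final verdict: {gold_verdict}"):
            kept.append(judgment)  # accepted: verdict agrees with the verifiable reward
    return kept

if __name__ == "__main__":
    accepted = rejection_sample("Which response better answers the question?", "A")
    print(f"kept {len(accepted)} of 8 sampled judgments")
```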
Authors: Taolin Zhang,Maosong Cao,Alexander Lam,Songyang Zhang,Kai Chen
Categories: cs.CL,cs.AI
PDF URL: https://arxiv.org/pdf/2507.09104.pdf
Arxiv URL: https://arxiv.org/abs/2507.09104
Arxiv ID: 2507.09104
Published: 2025-07-12T01:34:24Z
Updated: 2025-07-12T01:34:24.000Z
8. MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in one second. MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising their time-varying motion. This allows, for the first time, the unified modeling of appearance, geometry and motion, and enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework. By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering speedups of several orders of magnitude.
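A toy sketch of the scene representation described above: a pixel-aligned grid of Gaussian primitives whose centers carry per-timestep motion, from which scene flow falls out by differencing. The attribute set and array shapes are illustrative assumptions, not the MoVieS layout.

```python
import numpy as np

# Toy pixel-aligned grid of Gaussian primitives with time-varying motion.
H, W, T = 4, 4, 3          # tiny image grid and number of timesteps

positions = np.zeros((H, W, 3))          # one 3D Gaussian center per pixel
colors    = np.random.rand(H, W, 3)      # per-primitive RGB
scales    = np.full((H, W, 3), 0.01)     # per-axis Gaussian scales
motion    = np.zeros((H, W, T, 3))       # per-timestep displacement of each center

def centers_at_time(t: int) -> np.ndarray:
    """Gaussian centers advected to timestep t; differencing these gives scene flow."""
    return positions + motion[:, :, t, :]

flow_0_to_1 = centers_at_time(1) - centers_at_time(0)  # zero-shot scene-flow estimate
print(flow_0_to_1.shape)  # (4, 4, 3)
```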
Authors: Chenguo Lin,Yuchen Lin,Panwang Pan,Yifan Yu,Honglei Yan,Katerina Fragkiadaki,Yadong Mu
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2507.10065.pdf
Arxiv URL: https://arxiv.org/abs/2507.10065
Arxiv ID: 2507.10065
Published: 2025-07-14T08:49:57Z
Updated: 2025-07-14T08:49:57.000Z
9. DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design
We present DreamPoster, a Text-to-Image generation framework that intelligently synthesizes high-quality posters from user-provided images and text prompts while maintaining content fidelity and supporting flexible resolution and layout outputs. Specifically, DreamPoster is built upon our T2I model, Seedream3.0, to uniformly process different poster generation types. For dataset construction, we propose a systematic data annotation pipeline that precisely annotates textual content and typographic hierarchy information within poster images, while employing comprehensive methodologies to construct paired datasets comprising source materials (e.g., raw graphics/text) and their corresponding final poster outputs. Additionally, we implement a progressive training strategy that enables the model to hierarchically acquire multi-task generation capabilities while maintaining high-quality generation. Evaluations on our testing benchmarks demonstrate DreamPoster's superiority over existing methods, achieving a high usability rate of 88.55%, compared to GPT-4o (47.56%) and SeedEdit3.0 (25.96%). DreamPoster will be available online in Jimeng and other ByteDance apps.
Authors: Xiwei Hu,Haokun Chen,Zhongqi Qi,Hui Zhang,Dexiang Hong,Jie Shao,Xinglong Wu
Categories: cs.CV
PDF URL: https://arxiv.org/pdf/2507.04218.pdf
Arxiv URL: https://arxiv.org/abs/2507.04218
Arxiv ID: 2507.04218
Published: 2025-07-06T03:06:45Z
Updated: 2025-07-06T03:06:45.000Z
10. A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning
Enhancing the mathematical reasoning of Large Language Models (LLMs) is a pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a systematic methodology for combining them to maximize both accuracy and efficiency remains largely unexplored. This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference (GRPO). We posit that these methods play complementary, not competing, roles: a prolonged SFT phase first pushes the model’s accuracy to its limits, after which a GRPO phase dramatically improves token efficiency while preserving this peak performance. Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO in this framework is to optimize solution length. The efficacy of our recipe is rigorously validated through top-tier performance on challenging benchmarks, including a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO). This work provides the community with a battle-tested blueprint for developing state-of-the-art mathematical reasoners that are both exceptionally accurate and practically efficient. To ensure full reproducibility and empower future research, we will open-source our entire framework, including all code, model checkpoints, and training configurations at https://github.com/analokmaus/kaggle-aimo2-fast-math-r1.
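The recipe itself reduces to two sequential stages; a hypothetical configuration sketch is below. Only the roughly 10-epoch SFT phase and the GRPO stage aimed at shorter solutions come from the abstract; every other field and value is an illustrative assumption.

```python
# Hypothetical two-stage training configuration mirroring the recipe above:
# a prolonged SFT phase for peak accuracy, then GRPO to cut solution length.
RECIPE = {
    "stage_1_sft": {
        "objective": "next-token cross-entropy on worked math solutions",
        "epochs": 10,                 # extended SFT is reported as crucial
        "learning_rate": 1e-5,        # assumed value for illustration
        "goal": "maximize accuracy",
    },
    "stage_2_grpo": {
        "objective": "group-relative policy optimization (GRPO)",
        "reward": "answer correctness, with pressure toward shorter solutions",
        "goal": "improve token efficiency while preserving stage-1 accuracy",
    },
}

def describe(recipe: dict) -> None:
    """Print a one-line summary of each training stage."""
    for stage, cfg in recipe.items():
        print(f"{stage}: {cfg['goal']} via {cfg['objective']}")

if __name__ == "__main__":
    describe(RECIPE)
```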
Authors: Hiroshi Yoshihara,Taiki Yamaguchi,Yuichi Inoue
Categories: cs.LG,cs.AI
PDF URL: https://arxiv.org/pdf/2507.08267.pdf
Arxiv URL: https://arxiv.org/abs/2507.08267
Arxiv ID: 2507.08267
Published: 2025-07-11T02:26:01Z
Updated: 2025-07-11T02:26:01.000Z
11. Favicon Trojans: Executable Steganography Via Ico Alpha Channel Exploitation
This paper presents a novel method of executable steganography that uses the alpha transparency layer of ICO image files to embed and deliver self-decompressing JavaScript payloads within web browsers. By targeting the least significant bit (LSB) of non-transparent alpha-channel values, the proposed method successfully conceals compressed JavaScript code inside a favicon image without affecting visual fidelity. Global web traffic loads 294 billion favicons daily and consumes 0.9 petabytes of network bandwidth. A proof-of-concept implementation demonstrates that a 64x64 ICO image can embed up to 512 bytes uncompressed, or 0.8 kilobytes when using lightweight two-fold compression. On page load, a browser fetches the favicon as part of standard behavior, allowing an embedded loader script to extract and execute the payload entirely in memory using native JavaScript APIs and canvas pixel access. This creates a two-stage covert channel requiring no additional network or user requests. Testing across multiple browsers in both desktop and mobile environments confirms successful and silent execution of the embedded script. We evaluate the threat model, relate it to polymorphic phishing attacks that evade favicon-based detection, and analyze evasion of content security policies and antivirus scanners. We map nine example MITRE ATT&CK Framework objectives to single-line JavaScript payloads that execute arbitrarily from ICO files. Existing steganalysis and sanitization defenses are discussed, highlighting limitations in detecting or neutralizing alpha-channel exploits. The results demonstrate a stealthy and reusable attack surface that blurs traditional boundaries between static images and executable content. Because modern browsers report only silent errors when ICO files fail to load, this attack surface offers an interesting example of required web behaviors that in turn compromise security.
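The embedding primitive described above is ordinary LSB steganography restricted to non-transparent alpha values. Below is a minimal round-trip sketch using NumPy arrays as a stand-in for decoded ICO pixel data; it illustrates the general mechanism only, not the paper's loader script or compression scheme.

```python
import numpy as np

def embed_lsb_alpha(rgba: np.ndarray, payload: bytes) -> np.ndarray:
    """Hide payload bits in the LSB of non-transparent alpha values (illustrative only)."""
    img = rgba.copy()
    alpha = img[..., 3].ravel()
    slots = np.flatnonzero(alpha > 0)          # only non-transparent pixels carry bits
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    if bits.size > slots.size:
        raise ValueError("payload too large for this image")
    idx = slots[:bits.size]
    alpha[idx] = (alpha[idx] & 0xFE) | bits    # overwrite the least significant bit
    img[..., 3] = alpha.reshape(img[..., 3].shape)
    return img

def extract_lsb_alpha(rgba: np.ndarray, n_bytes: int) -> bytes:
    """Recover n_bytes from the alpha-channel LSBs (mirror of the embedder above)."""
    alpha = rgba[..., 3].ravel()
    slots = np.flatnonzero(alpha > 0)[: n_bytes * 8]
    bits = (alpha[slots] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()

if __name__ == "__main__":
    cover = np.full((64, 64, 4), 255, dtype=np.uint8)   # opaque 64x64 RGBA stand-in
    secret = b"demo payload"
    stego = embed_lsb_alpha(cover, secret)
    assert extract_lsb_alpha(stego, len(secret)) == secret
    print("round-trip ok")
```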
Authors: David Noever,Forrest McKee
Categories: cs.CR
PDF URL: https://arxiv.org/pdf/2507.09074.pdf
Arxiv URL: https://arxiv.org/abs/2507.09074
Arxiv ID: 2507.09074
Published: 2025-07-11T23:29:04Z
Updated: 2025-07-11T23:29:04.000Z
12. From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We make our datasets publicly available.
Authors: Seokhee Hong,Sunkyoung Kim,Guijin Son,Soyeon Kim,Yeonjung Hong,Jinsik Lee
Categories: cs.CL,cs.AI
PDF URL: https://arxiv.org/pdf/2507.08924.pdf
Arxiv URL: https://arxiv.org/abs/2507.08924
Arxiv ID: 2507.08924
Published: 2025-07-11T17:56:32Z
Updated: 2025-07-11T17:56:32.000Z