Data source: HuggingFace Papers

Latest Papers

1. Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.
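
The abstract's pipeline can be pictured as an auto-encoder whose bottleneck is regularized by a frozen text-to-image decoder. Below is a minimal PyTorch sketch of that training idea under stated assumptions: every module is a tiny stand-in (the real system uses a pretrained vision encoder and a pretrained diffusion decoder), and all shapes, names, and losses are illustrative only, not the authors' implementation.

```python
# Minimal sketch of the VLV training idea: an image encoder produces continuous
# "language" embeddings, a FROZEN text-to-image decoder reconstructs the image from
# them, and only the encoder side is updated. All modules are illustrative stand-ins.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):          # stand-in for a pretrained vision backbone
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))
    def forward(self, x):
        return self.net(x)

class FrozenT2IDecoder(nn.Module):       # stand-in for a frozen diffusion decoder
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Linear(dim, 3 * 64 * 64)
        for p in self.parameters():
            p.requires_grad = False      # freezing creates the information bottleneck
    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

encoder, decoder = VisionEncoder(), FrozenT2IDecoder()
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

images = torch.rand(8, 3, 64, 64)        # unlabeled images only -- no captions needed
z = encoder(images)                      # continuous caption-like embeddings
recon = decoder(z)
loss = nn.functional.mse_loss(recon, images)
loss.backward()                          # gradients flow only into the encoder
opt.step()
```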

Authors: Tiezheng Zhang,Yitong Li,Yu-cheng Chou,Jieneng Chen,Alan Yuille,Chen Wei,Junfei Xiao

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2507.07104.pdf

Arxiv URL: https://arxiv.org/abs/2507.07104

Arxiv ID: 2507.07104

Published: 2025-07-09T17:59:04Z

Updated: 2025-07-09T17:59:04.000Z


2. EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.

Authors: LG AI Research,Kyunghoon Bae,Eunbi Choi,Kibong Choi,Stanley Jungkyu Choi,Yemuk Choi,Kyubeen Han,Seokhee Hong,Junwon Hwang,Taewan Hwang,Joonwon Jang,Hyojin Jeon,Kijeong Jeon,Gerrard Jeongwon Jo,Hyunjik Jo,Jiyeon Jung,Euisoon Kim,Hyosang Kim,Jihoon Kim,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Yongil Kim,Youchul Kim,Edward Hwayoung Lee,Gwangho Lee,Haeju Lee,Honglak Lee,Jinsik Lee,Kyungmin Lee,Sangha Park,Young Min Paik,Yongmin Park,Youngyong Park,Sanghyun Seo,Sihoon Yang,Heuiyeen Yeen,Sihyuk Yi,Hyeongu Yun

Categories: cs.CL,cs.AI

PDF URL: https://arxiv.org/pdf/2507.11407.pdf

Arxiv URL: https://arxiv.org/abs/2507.11407

Arxiv ID: 2507.11407

Published: 2025-07-15T15:24:51Z

Updated: 2025-07-15T15:24:51.000Z


3. Scaling Laws for Optimal Data Mixtures

Large foundation models are typically trained on data from multiple domains, with the data mixture (the proportion of each domain used) playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and used to estimate the performance at larger scales and unseen domain weights. The scaling laws allow us to derive the optimal domain weights for any target domain under a given training budget ($N$,$D$), providing a principled alternative to costly trial-and-error methods.
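
The abstract leaves the exact functional form of the scaling law unspecified; the sketch below assumes a simple Chinchilla-style power law with a mixture-dependent term purely for illustration. It shows the workflow described: fit the law's parameters on a few small-scale runs, then optimize the domain weights $h$ on the probability simplex for a larger budget ($N$,$D$). It is not the paper's actual law.

```python
# Fit a parametric loss model L(N, D, h) on small-scale runs, then pick the domain
# weights h that minimize predicted loss at a larger budget. Functional form is assumed.
import numpy as np
from scipy.optimize import minimize

def predicted_loss(params, N, D, h):
    E, a, alpha, b, beta = params[:5]
    c = params[5:]                      # one mixture coefficient per domain (placeholder form)
    return E + a / N**alpha + b / D**beta + np.dot(c, h)

def fit(runs):
    """runs: list of (N, D, h, observed_loss) from a few small-scale training runs."""
    def sse(p):
        return sum((predicted_loss(p, N, D, h) - L) ** 2 for N, D, h, L in runs)
    n_domains = len(runs[0][2])
    p0 = np.concatenate([[1.0, 1.0, 0.3, 1.0, 0.3], np.zeros(n_domains)])
    return minimize(sse, p0, method="Nelder-Mead").x

def optimal_mixture(params, N, D, n_domains):
    """Minimize predicted loss over the probability simplex of domain weights."""
    cons = ({"type": "eq", "fun": lambda h: h.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * n_domains
    h0 = np.full(n_domains, 1.0 / n_domains)
    res = minimize(lambda h: predicted_loss(params, N, D, h), h0,
                   bounds=bounds, constraints=cons)
    return res.x

# toy usage with 3 domains and made-up small-scale runs
runs = [(1e7, 1e9, np.array([0.5, 0.3, 0.2]), 3.2),
        (1e7, 1e9, np.array([0.2, 0.5, 0.3]), 3.4),
        (2e7, 2e9, np.array([0.4, 0.4, 0.2]), 3.0),
        (2e7, 2e9, np.array([0.3, 0.2, 0.5]), 3.1)]
params = fit(runs)
print(optimal_mixture(params, 1e9, 1e11, n_domains=3))
```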

Authors: Mustafa Shukor,Louis Bethune,Dan Busbridge,David Grangier,Enrico Fini,Alaaeldin El-Nouby,Pierre Ablin

Categories: cs.LG

PDF URL: https://arxiv.org/pdf/2507.09404.pdf

Arxiv URL: https://arxiv.org/abs/2507.09404

Arxiv ID: 2507.09404

Published: 2025-07-12T21:16:08Z

Updated: 2025-07-12T21:16:08.000Z


4. Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.

Authors: Yilun Zhao,Chengye Wang,Chuhan Li,Arman Cohan

Categories: cs.CL,cs.CV

PDF URL: https://arxiv.org/pdf/2507.10787.pdf

Arxiv URL: https://arxiv.org/abs/2507.10787

Arxiv ID: 2507.10787

Published: 2025-07-14T20:35:25Z

Updated: 2025-07-14T20:35:25.000Z


5. OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique

Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy: the first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting finetuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
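
The title's "test-time scaling via self-critique" amounts to sampling several candidate solutions and letting a critique model pick among them. A generic best-of-n sketch follows; the two callables are placeholders for the finetuned generation and critique models, not the released models' API.

```python
# Generic test-time scaling via self-critique: sample n candidates from a generator,
# score each with a critique model, return the highest-scoring one.
from typing import Callable, List

def best_of_n(problem: str,
              generate_fn: Callable[[str], str],
              critique_fn: Callable[[str, str], float],
              n: int = 8) -> str:
    candidates: List[str] = [generate_fn(problem) for _ in range(n)]
    scores = [critique_fn(problem, c) for c in candidates]
    return candidates[scores.index(max(scores))]   # keep the solution the critic likes most

# toy usage with stand-in models: longer "solutions" get higher critique scores
solution = best_of_n(
    "reverse a string",
    generate_fn=lambda q: f"def solve(s): return s[::-1]  # attempt for: {q}",
    critique_fn=lambda q, c: float(len(c)),
    n=4,
)
print(solution)
```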

Authors: Wasi Uddin Ahmad,Somshubra Majumdar,Aleksander Ficek,Sean Narenthiran,Mehrzad Samadi,Jocelyn Huang,Siddhartha Jain,Vahid Noroozi,Boris Ginsburg

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.09075.pdf

Arxiv URL: https://arxiv.org/abs/2507.09075

Arxiv ID: 2507.09075

Published: 2025-07-11T23:35:54Z

Updated: 2025-07-11T23:35:54.000Z


6. AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs

Large language models (LLMs) have demonstrated powerful problem-solving capabilities, in particular when organized in multi-agent systems. However, the advent of such systems also raises several questions about the ability of a complex network of agents to effectively self-organize and collaborate. While measuring performance on standard reasoning benchmarks indicates how well multi-agent systems can solve reasoning tasks, it is unclear whether these systems are able to leverage their topology effectively. Here, we propose AgentsNet, a new benchmark for multi-agent reasoning. Drawing inspiration from classical problems in distributed systems and graph theory, AgentsNet measures the ability of multi-agent systems to collaboratively form strategies for problem-solving, self-organization, and effective communication given a network topology. We evaluate a variety of baseline methods on AgentsNet, including homogeneous networks of agents that first have to agree on basic protocols for organization and communication. We find that some frontier LLMs already demonstrate strong performance for small networks but begin to fall off as the network grows. While existing multi-agent benchmarks cover at most 2-5 agents, AgentsNet is practically unlimited in size and can scale with new generations of LLMs. As such, we also probe frontier models in a setup with up to 100 agents.
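
To make the topology constraint concrete, here is a toy sketch of agents exchanging information only with their graph neighbors, using a trivial greedy-coloring heuristic in place of an LLM call. It illustrates the benchmark's setting (distributed-computing-style problems on a network), not AgentsNet's actual harness.

```python
# Topology-constrained coordination: each agent only sees its neighbors' current state.
# `agent_step` is a placeholder for a real LLM call.
import networkx as nx

def agent_step(agent_id, inbox, state):
    """Placeholder policy: pick the smallest color not used by any neighbor."""
    taken = {msg["color"] for msg in inbox if msg["color"] is not None}
    return {"color": next(c for c in range(len(inbox) + 1) if c not in taken)}

def run_rounds(graph, n_rounds=2):
    states = {v: {"color": None} for v in graph.nodes}
    for _ in range(n_rounds):
        for v in graph.nodes:                      # sequential updates so the toy heuristic settles
            inbox = [states[u] for u in graph.neighbors(v)]
            states[v] = agent_step(v, inbox, states[v])
    return states

g = nx.erdos_renyi_graph(10, 0.3, seed=0)          # the network topology given to the agents
print(run_rounds(g))                               # a proper coloring agreed on locally
```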

Authors: Florian Grötschla,Luis Müller,Jan Tönshoff,Mikhail Galkin,Bryan Perozzi

Categories: cs.MA,cs.LG

PDF URL: https://arxiv.org/pdf/2507.08616.pdf

Arxiv URL: https://arxiv.org/abs/2507.08616

Arxiv ID: 2507.08616

Published: 2025-07-11T14:13:22Z

Updated: 2025-07-11T14:13:22.000Z


7. Token-based Audio Inpainting via Discrete Diffusion

Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches, including waveform- and spectrogram-based diffusion models, have shown promising results for short gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio. We evaluate the method on the MusicNet dataset using both objective and perceptual metrics across gap durations of up to 300 ms, and further evaluate it on the MTG dataset with gap durations extended to 500 ms. Experimental results demonstrate that our method achieves competitive or superior performance compared to existing baselines, particularly for longer gaps, offering a robust solution for restoring degraded musical recordings. Audio examples of our proposed method can be found at https://iftach21.github.io/
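
Conceptually, token-based inpainting with an absorbing-state discrete diffusion process replaces the gap with mask tokens and unmasks them over several reverse steps. The sketch below illustrates that mechanism with a random stand-in for the trained denoiser and omits the audio tokenizer; it is not the paper's implementation.

```python
# Fill a masked gap in a token sequence over several reverse steps, committing the
# most confident predictions first. `denoiser` is a random stand-in for a trained model.
import torch

VOCAB, MASK = 1024, 1024                     # codec vocabulary size + a dedicated mask id

def inpaint(tokens, gap, denoiser, steps=8):
    tokens = tokens.clone()
    tokens[gap] = MASK                       # corrupt: absorb the gap into the mask state
    for s in range(steps):
        logits = denoiser(tokens)            # (seq_len, VOCAB) predictions for every position
        conf, pred = logits.softmax(-1).max(-1)
        masked = tokens == MASK
        if not masked.any():
            break
        k = max(1, int(masked.sum() * (s + 1) / steps))   # unmask a growing fraction of the gap
        idx = torch.nonzero(masked).squeeze(1)
        keep = idx[conf[idx].topk(min(k, len(idx))).indices]
        tokens[keep] = pred[keep]
    return tokens

seq = torch.randint(0, VOCAB, (200,))        # pretend these came from an audio tokenizer
gap = slice(80, 120)                         # the missing segment, in token space
denoiser = lambda t: torch.randn(t.shape[0], VOCAB)   # placeholder network
restored = inpaint(seq, gap, denoiser)
print((restored == MASK).sum().item())       # 0 -> every gap token has been filled
```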

Authors: Tali Dror,Iftach Shoham,Moshe Buchris,Oren Gal,Haim Permuter,Gilad Katz,Eliya Nachmani

Categories: cs.SD,cs.AI,cs.IT,cs.LG,eess.AS,math.IT

PDF URL: https://arxiv.org/pdf/2507.08333.pdf

Arxiv URL: https://arxiv.org/abs/2507.08333

Arxiv ID: 2507.08333

Published: 2025-07-11T06:25:49Z

Updated: 2025-07-11T06:25:49.000Z


8. LLMalMorph: On The Feasibility of Generating Variant Malware using Large-Language-Models

Large Language Models (LLMs) have transformed software development and automated code generation. Motivated by these advancements, this paper explores the feasibility of using LLMs to modify malware source code to generate variants. We introduce LLMalMorph, a semi-automated framework that leverages the semantic and syntactic code comprehension of LLMs to generate new malware variants. LLMalMorph extracts function-level information from the malware source code and employs custom-engineered prompts coupled with strategically defined code transformations to guide the LLM in generating variants without resource-intensive fine-tuning. To evaluate LLMalMorph, we collected 10 diverse Windows malware samples of varying types, complexity, and functionality and generated 618 variants. Our thorough experiments demonstrate that it is possible to reduce the detection rates of antivirus engines against these malware variants to some extent while preserving malware functionality. In addition, despite not optimizing against any Machine Learning (ML)-based malware detectors, several variants also achieved notable attack success rates against an ML-based malware classifier. We also discuss the limitations of current LLM capabilities in generating malware variants from source code and assess where this emerging technology stands in the broader context of malware variant generation.

Authors: Md Ajwad Akil,Adrian Shuai Li,Imtiaz Karim,Arun Iyengar,Ashish Kundu,Vinny Parla,Elisa Bertino

Categories: cs.CR

PDF URL: https://arxiv.org/pdf/2507.09411.pdf

Arxiv URL: https://arxiv.org/abs/2507.09411

Arxiv ID: 2507.09411

Published: 2025-07-12T22:11:10Z

Updated: 2025-07-12T22:11:10.000Z


9. UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of omnimodal datasets and of lightweight, capable models hampers progress in fine-grained, multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three-stage human-in-the-loop pipeline covering audio-only, visual-only, and joint audio-visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross-modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner (3B), a 3B-parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy, supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.
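
The second training stage relies on Group Relative Policy Optimization (GRPO), whose core step is normalizing each sampled response's reward against its own group rather than a learned value model. A minimal sketch of that advantage computation only (the sampling, reward model, and policy update around it are omitted):

```python
# Group-relative advantages: normalize rewards within each prompt's group of samples.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled captions/responses."""
    mean = rewards.mean(dim=1, keepdim=True)       # per-prompt group mean
    std = rewards.std(dim=1, keepdim=True)         # per-prompt group spread
    return (rewards - mean) / (std + eps)          # normalized advantage per sample

# toy usage: 2 prompts, 4 sampled responses each
rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9],
                        [0.1, 0.1, 0.3, 0.0]])
print(group_relative_advantages(rewards))
```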

Authors: Peiran Wu,Yunze Liu,Zhengdong Zhu,Enmin Zhou,Shawn Shen

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2507.11336.pdf

Arxiv URL: https://arxiv.org/abs/2507.11336

Arxiv ID: 2507.11336

Published: 2025-07-15T14:08:29Z

Updated: 2025-07-15T14:08:29.000Z


10. Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Large language models (LLMs) exhibit cognitive biases, systematic tendencies toward irrational decision-making similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear whether these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce cross-tuning, swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.
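
A schematic of the cross-tuning design, assuming placeholder finetune and bias-measurement routines: every backbone is finetuned on every instruction dataset across several seeds, and the resulting bias profiles are compared by whether runs share a backbone or share finetuning data. Names and numbers below are illustrative, not the paper's setup.

```python
# Cross-tuning grid + a simple similarity readout over bias profiles.
from itertools import product
import numpy as np

def cross_tuning(backbones, datasets, seeds, finetune, measure_biases):
    profiles = {}                                    # (backbone, dataset, seed) -> bias vector
    for b, d, s in product(backbones, datasets, seeds):
        model = finetune(b, d, seed=s)
        profiles[(b, d, s)] = measure_biases(model)
    return profiles

def mean_similarity(profiles, same_key):
    """Average correlation over pairs that share the component at `same_key` (0=backbone, 1=dataset)."""
    keys = list(profiles)
    sims = [np.corrcoef(profiles[a], profiles[b])[0, 1]
            for i, a in enumerate(keys) for b in keys[i + 1:]
            if a[same_key] == b[same_key]]
    return float(np.mean(sims))

# toy usage with stand-ins: bias vectors over 30 biases drawn at random
rng = np.random.default_rng(0)
profiles = cross_tuning(["backbone_A", "backbone_B"], ["instr_data_1", "instr_data_2"], [0, 1],
                        finetune=lambda b, d, seed: (b, d, seed),
                        measure_biases=lambda m: rng.random(30))
print(mean_similarity(profiles, same_key=0),   # pairs sharing a pretrained backbone
      mean_similarity(profiles, same_key=1))   # pairs sharing finetuning data
```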

Authors: Itay Itzhak,Yonatan Belinkov,Gabriel Stanovsky

Categories: cs.CL,cs.AI,cs.LG

PDF URL: https://arxiv.org/pdf/2507.07186.pdf

Arxiv URL: https://arxiv.org/abs/2507.07186

Arxiv ID: 2507.07186

Published: 2025-07-09T18:01:14Z

Updated: 2025-07-09T18:01:14.000Z


11. Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning

Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet a pressing challenge remains: how can we trust these agents, especially in zero-shot settings with no fine-tuning? We introduce a novel modular agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions in visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint is extensible to diagnostics, biology, and other trust-critical domains. All models, prompts, results, and system components, including the complete software source code, are openly released to support reproducibility, transparency, and community benchmarking on GitHub: https://github.com/Applied-AI-Research-Lab/Orchestrator-Agent-Trust
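
One of the calibration signals the orchestrator uses, the Expected Calibration Error (ECE), can be computed with standard equal-width confidence binning; a sketch follows (the paper's OCR and CCC metrics, and its exact ECE variant, are not reproduced here).

```python
# Expected Calibration Error with equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted max-probabilities in [0,1]; correct: 0/1 per prediction."""
    confidences, correct = np.asarray(confidences), np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap          # weight by the fraction of samples in the bin
    return ece

# toy usage: an overconfident agent (high confidence, mediocre accuracy) scores worse
print(expected_calibration_error([0.95, 0.9, 0.92, 0.6], [1, 0, 0, 1]))
print(expected_calibration_error([0.7, 0.6, 0.65, 0.8], [1, 0, 1, 1]))
```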

Authors: Konstantinos I. Roumeliotis,Ranjan Sapkota,Manoj Karkee,Nikolaos D. Tselikas

Categories: cs.AI,cs.CL

PDF URL: https://arxiv.org/pdf/2507.10571.pdf

Arxiv URL: https://arxiv.org/abs/2507.10571

Arxiv ID: 2507.10571

Published: 2025-07-09T16:39:29Z

Updated: 2025-07-09T16:39:29.000Z


12. Taming generative video models for zero-shot optical flow extraction

Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on the real-world TAP-Vid DAVIS dataset (16.6% relative improvement in endpoint error) and the synthetic TAP-Vid Kubric dataset (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
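
KL-tracing, as described, perturbs a location in the first frame, rolls the frozen predictor forward on both the clean and perturbed inputs, and reads the flow vector from where the two predictive distributions diverge most. The sketch below shows that readout with a random stand-in for the video model, so the recovered "flow" is meaningless; only the mechanics are illustrated.

```python
# KL-tracing readout: per-location KL divergence between perturbed and clean
# next-frame predictive distributions; the argmax location gives the tracer's destination.
import torch

def kl_trace_flow(frame, src_yx, predict_dist, eps=1e-8):
    perturbed = frame.clone()
    perturbed[:, src_yx[0], src_yx[1]] += 1.0          # inject a small localized tracer
    p = predict_dist(frame)                            # (H, W, K) clean predictive dist
    q = predict_dist(perturbed)                        # (H, W, K) perturbed predictive dist
    kl = (q * ((q + eps) / (p + eps)).log()).sum(-1)   # KL(q || p) at every location
    dst = torch.nonzero(kl == kl.max())[0]             # where the tracer shows up next frame
    return (dst - torch.tensor(src_yx)).tolist()       # flow = displacement (dy, dx)

H, W, K = 32, 32, 16
frame = torch.rand(3, H, W)
def predict_dist(x):                                   # placeholder "video model"
    torch.manual_seed(int(x.sum() * 1000) % 10_000)    # make the output depend on the input
    return torch.softmax(torch.randn(H, W, K), dim=-1)

print(kl_trace_flow(frame, (10, 12), predict_dist))
```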

Authors: Seungwoo Kim,Khai Loong Aw,Klemen Kotar,Cristobal Eyzaguirre,Wanhee Lee,Yunong Liu,Jared Watrous,Stefan Stojanov,Juan Carlos Niebles,Jiajun Wu,Daniel L. K. Yamins

Categories: cs.CV

PDF URL: https://arxiv.org/pdf/2507.09082.pdf

Arxiv URL: https://arxiv.org/abs/2507.09082

Arxiv ID: 2507.09082

Published: 2025-07-11T23:59:38Z

Updated: 2025-07-11T23:59:38.000Z


13. BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval, an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom (“bring-your-own”) KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5 percentage points while showing better generalization to custom KGs. The BYOKG-RAG framework is open-sourced at https://github.com/awslabs/graphrag-toolkit.
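
The iterative artifact-then-retrieve loop can be sketched as follows, with llm_propose, graph_retrieve, and llm_answer as placeholders for the framework's components (this is not the graphrag-toolkit API).

```python
# Iterative KGQA loop: the LLM proposes graph artifacts, graph tools link them to the KG
# and return context, and the LLM refines its proposals before answering.
def byokg_rag_loop(question, kg, llm_propose, graph_retrieve, llm_answer, max_iters=3):
    context = []
    for _ in range(max_iters):
        artifacts = llm_propose(question, context)        # entities, paths, queries, ...
        retrieved = graph_retrieve(kg, artifacts)         # link artifacts, pull subgraph context
        if not retrieved or retrieved[-1] in context:     # stop when nothing new comes back
            break
        context.extend(retrieved)
    return llm_answer(question, context)

# toy usage with trivial stand-ins over a dict-shaped "knowledge graph"
kg = {"Paris": [("capital_of", "France")], "France": [("continent", "Europe")]}
answer = byokg_rag_loop(
    "Which country is Paris the capital of?",
    kg,
    llm_propose=lambda q, ctx: ["Paris"] + [dst for _, dst in ctx],
    graph_retrieve=lambda g, arts: [e for a in arts for e in g.get(a, [])],
    llm_answer=lambda q, ctx: next((dst for rel, dst in ctx if rel == "capital_of"), None),
)
print(answer)   # France
```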

Authors: Costas Mavromatis,Soji Adeshina,Vassilis N. Ioannidis,Zhen Han,Qi Zhu,Ian Robinson,Bryan Thompson,Huzefa Rangwala,George Karypis

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.04127.pdf

Arxiv URL: https://arxiv.org/abs/2507.04127

Arxiv ID: 2507.04127

Published: 2025-07-05T18:47:14Z

Updated: 2025-07-05T18:47:14.000Z