数据来源:ArXiv Domain

LLM Domain Papers

1. One Token to Fool LLM-as-a-Judge

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., “:” or “.”) or reasoning openers like “Thought process:” and “Let’s solve this problem step by step.” can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

Authors: Yulai Zhao, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, Dong Yu

Categories: cs.LG, cs.CL

PDF URL: https://arxiv.org/pdf/2507.08794v1.pdf

Published: 2025-07-11T17:55:22Z


2. Large Language Models in Mental Health Care: a Scoping Review

Objectieve:This review aims to deliver a comprehensive analysis of Large Language Models (LLMs) utilization in mental health care, evaluating their effectiveness, identifying challenges, and exploring their potential for future application. Materials and Methods: A systematic search was performed across multiple databases including PubMed, Web of Science, Google Scholar, arXiv, medRxiv, and PsyArXiv in November 2023. The review includes all types of original research, regardless of peer-review status, published or disseminated between October 1, 2019, and December 2, 2023. Studies were included without language restrictions if they employed LLMs developed after T5 and directly investigated research questions within mental health care settings. Results: Out of an initial 313 articles, 34 were selected based on their relevance to LLMs applications in mental health care and the rigor of their reported outcomes. The review identified various LLMs applications in mental health care, including diagnostics, therapy, and enhancing patient engagement. Key challenges highlighted were related to data availability and reliability, the nuanced handling of mental states, and effective evaluation methods. While LLMs showed promise in improving accuracy and accessibility, significant gaps in clinical applicability and ethical considerations were noted. Conclusion: LLMs hold substantial promise for enhancing mental health care. For their full potential to be realized, emphasis must be placed on developing robust datasets, development and evaluation frameworks, ethical guidelines, and interdisciplinary collaborations to address current limitations.

Authors: Yining Hua, Fenglin Liu, Kailai Yang, Zehan Li, Hongbin Na, Yi-han Sheu, Peilin Zhou, Lauren V. Moran, Sophia Ananiadou, David A. Clifton, Andrew Beam, John Torous

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2401.02984v3.pdf

Published: 2024-01-01T17:35:52Z


3. Weak-to-Strong Jailbreaking on Large Language Models

Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference time attack for aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack’s key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model’s decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong

Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2401.17256v4.pdf

Published: 2024-01-30T18:48:37Z


4. Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework

The conventional BIM authoring process typically requires designers to master complex and tedious modeling commands in order to materialize their design intentions within BIM authoring tools. This additional cognitive burden complicates the design process and hinders the adoption of BIM and model-based design in the AEC (Architecture, Engineering, and Construction) industry. To facilitate the expression of design intentions more intuitively, we propose Text2BIM, an LLM-based multi-agent framework that can generate 3D building models from natural language instructions. This framework orchestrates multiple LLM agents to collaborate and reason, transforming textual user input into imperative code that invokes the BIM authoring tool’s APIs, thereby generating editable BIM models with internal layouts, external envelopes, and semantic information directly in the software. Furthermore, a rule-based model checker is introduced into the agentic workflow, utilizing predefined domain knowledge to guide the LLM agents in resolving issues within the generated models and iteratively improving model quality. Extensive experiments were conducted to compare and analyze the performance of three different LLMs under the proposed framework. The evaluation results demonstrate that our approach can effectively generate high-quality, structurally rational building models that are aligned with the abstract concepts specified by user input. Finally, an interactive software prototype was developed to integrate the framework into the BIM authoring software Vectorworks, showcasing the potential of modeling by chatting. The code is available at: https://github.com/dcy0577/Text2BIM

Authors: Changyu Du, Sebastian Esser, Stavros Nousias, André Borrmann

Categories: cs.AI, cs.CL, cs.SE

PDF URL: https://arxiv.org/pdf/2408.08054v2.pdf

Published: 2024-08-15T09:48:45Z


5. Red Teaming Large Language Models for Healthcare

We present the design process and findings of the pre-conference workshop at the Machine Learning for Healthcare Conference (2024) entitled Red Teaming Large Language Models for Healthcare, which took place on August 15, 2024. Conference participants, comprising a mix of computational and clinical expertise, attempted to discover vulnerabilities — realistic clinical prompts for which a large language model (LLM) outputs a response that could cause clinical harm. Red-teaming with clinicians enables the identification of LLM vulnerabilities that may not be recognised by LLM developers lacking clinical expertise. We report the vulnerabilities found, categorise them, and present the results of a replication study assessing the vulnerabilities across all LLMs provided.

Authors: Vahid Balazadeh, Michael Cooper, David Pellow, Atousa Assadi, Jennifer Bell, Mark Coatsworth, Kaivalya Deshpande, Jim Fackler, Gabriel Funingana, Spencer Gable-Cook, Anirudh Gangadhar, Abhishek Jaiswal, Sumanth Kaja, Christopher Khoury, Amrit Krishnan, Randy Lin, Kaden McKeen, Sara Naimimohasses, Khashayar Namdar, Aviraj Newatia, Allan Pang, Anshul Pattoo, Sameer Peesapati, Diana Prepelita, Bogdana Rakova, Saba Sadatamin, Rafael Schulman, Ajay Shah, Syed Azhar Shah, Syed Ahmar Shah, Babak Taati, Balagopal Unnikrishnan, Iñigo Urteaga, Stephanie Williams, Rahul G Krishnan

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2505.00467v2.pdf

Published: 2025-05-01T11:43:27Z


6. A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLM, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLM’s, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. In case of models incorporated with reasoning capabilities, the Deepseek-R1 shows its superiority. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLM and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

Authors: Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.08621v1.pdf

Published: 2025-07-11T14:23:40Z


7. The AI Language Proficiency Monitor — Tracking the Progress of LLMs on Multilingual Benchmarks

To ensure equitable access to the benefits of large language models (LLMs), it is essential to evaluate their capabilities across the world’s languages. We introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance. In addition to ranking models, the platform offers descriptive insights such as a global proficiency map and trends over time. By complementing and extending prior multilingual benchmarks, our work aims to foster transparency, inclusivity, and progress in multilingual AI. The system is available at https://huggingface.co/spaces/fair-forward/evals-for-every-language.

Authors: David Pomerenke, Jonas Nothnagel, Simon Ostermann

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.08538v1.pdf

Published: 2025-07-11T12:38:02Z


8. Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop

Latent Dirichlet Allocation (LDA) is a prominent generative probabilistic model used for uncovering abstract topics within document collections. In this paper, we explore the effectiveness of augmenting topic models with Large Language Models (LLMs) through integration into two key phases: Initialization and Post-Correction. Since the LDA is highly dependent on the quality of its initialization, we conduct extensive experiments on the LLM-guided topic clustering for initializing the Gibbs sampling algorithm. Interestingly, the experimental results reveal that while the proposed initialization strategy improves the early iterations of LDA, it has no effect on the convergence and yields the worst performance compared to the baselines. The LLM-enabled post-correction, on the other hand, achieved a promising improvement of 5.86% in the coherence evaluation. These results highlight the practical benefits of the LLM-in-the-loop approach and challenge the belief that LLMs are always the superior text mining alternative.

Authors: Mengze Hong, Chen Jason Zhang, Di Jiang

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.08498v1.pdf

Published: 2025-07-11T11:20:39Z


9. A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel, among whose responses the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions, while stressing goal-directedness: dialogue game based evaluation. While the utility of this approach has been shown by several projects, its adoption has been held back by the lack of a mature, easily re-usable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one’s own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made targeted tests.

Authors: David Schlangen, Sherzod Hakimov, Jonathan Jordan, Philipp Sadler

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.08491v1.pdf

Published: 2025-07-11T11:16:01Z


10. Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME’24, AMC’23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.

Authors: Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao

Categories: cs.LG, cs.AI, cs.CL

PDF URL: https://arxiv.org/pdf/2507.06892v3.pdf

Published: 2025-07-09T14:29:45Z


Agent Domain Papers

1. Upgrade or Switch: Do We Need a Next-Gen Trusted Architecture for the Internet of AI Agents?

The emerging Internet of AI Agents challenges existing web infrastructure designed for human-scale, reactive interactions. Unlike traditional web resources, autonomous AI agents initiate actions, maintain persistent state, spawn sub-agents, and negotiate directly with peers: demanding millisecond-level discovery, instant credential revocation, and cryptographic behavioral proofs that exceed current DNS/PKI capabilities. This paper analyzes whether to upgrade existing infrastructure or implement purpose-built index architectures for autonomous agents. We identify critical failure points: DNS propagation (24-48 hours vs. required milliseconds), certificate revocation unable to scale to trillions of entities, and IPv4/IPv6 addressing inadequate for agent-scale routing. We evaluate three approaches: (1) Upgrade paths, (2) Switch options, (3) Hybrid index/registries. Drawing parallels to dialup-to-broadband transitions, we find that agent requirements constitute qualitative, and not incremental, changes. While upgrades offer compatibility and faster deployment, clean-slate solutions provide better performance but require longer for adoption. Our analysis suggests hybrid approaches will emerge, with centralized indexes for critical agents and federated meshes for specialized use cases.

Authors: Ramesh Raskar, Pradyumna Chari, Jared James Grogan, Mahesh Lambe, Robert Lincourt, Raghu Bala, Aditi Joshi, Abhishek Singh, Ayush Chopra, Rajesh Ranjan, Shailja Gupta, Dimitris Stripelis, Maria Gorskikh, Sichao Wang

Categories: cs.NI, cs.AI, cs.MA

PDF URL: https://arxiv.org/pdf/2506.12003v2.pdf

Published: 2025-06-13T17:55:38Z


2. Introspection of Thought Helps AI Agents

AI Agents rely on Large Language Models (LLMs) and Multimodal-LLMs (MLLMs) to perform interpretation and inference in text and image tasks without post-training, where LLMs and MLLMs play the most critical role and determine the initial ability and limitations of AI Agents. Usually, AI Agents utilize sophisticated prompt engineering and external reasoning framework to obtain a promising interaction with LLMs, e.g., Chain-of-Thought, Iteration of Thought and Image-of-Thought. However, they are still constrained by the inherent limitations of LLM in understanding natural language, and the iterative reasoning process will generate a large amount of inference cost. To this end, we propose a novel AI Agent Reasoning Framework with Introspection of Thought (INoT) by designing a new LLM-Read code in prompt. It enables LLM to execute programmatic dialogue reasoning processes following the code in prompt. Therefore, self-denial and reflection occur within LLM instead of outside LLM, which can reduce token cost effectively. Through our experiments on six benchmarks for three different tasks, the effectiveness of INoT is verified, with an average improvement of 7.95\% in performance, exceeding the baselines. Furthermore, the token cost of INoT is lower on average than the best performing method at baseline by 58.3\%. In addition, we demonstrate the versatility of INoT in image interpretation and inference through verification experiments.

Authors: Haoran Sun, Shaoning Zeng

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2507.08664v1.pdf

Published: 2025-07-11T15:03:17Z


3. DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images

Common knowledge indicates that the process of constructing image datasets usually depends on the time-intensive and inefficient method of manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are obviously more valuable comparing to artificially intelligence generated data, particularly in constructing image datasets. For this reason, we propose a novel method for auto-constructing datasets from real-world images by a multiagent collaborative system, named as DatasetAgent. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted, including expanding existing datasets and creating new ones from scratch, on a variety of open-source datasets. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.

Authors: Haoran Sun, Haoyu Bian, Shaoning Zeng, Yunbo Rao, Xu Xu, Lin Mei, Jianping Gou

Categories: cs.CV, cs.AI

PDF URL: https://arxiv.org/pdf/2507.08648v1.pdf

Published: 2025-07-11T14:51:33Z


4. Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

We present a multi-agent system for automation of scientific research tasks, cmbagent (https://github.com/CMBAgents/cmbagent). The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.

Authors: Licong Xu, Milind Sarkar, Anto I. Lonappan, Íñigo Zubeldia, Pablo Villanueva-Domingo, Santiago Casas, Christian Fidler, Chetana Amancharla, Ujjwal Tiwari, Adrian Bayer, Chadi Ait Ekioui, Miles Cranmer, Adrian Dimitrov, James Fergusson, Kahaan Gandhi, Sven Krippendorf, Andrew Laverick, Julien Lesgourgues, Antony Lewis, Thomas Meier, Blake Sherwin, Kristen Surrao, Francisco Villaescusa-Navarro, Chi Wang, Xueqing Xu, Boris Bolliet

Categories: cs.AI, astro-ph.IM, cs.CL, cs.MA

PDF URL: https://arxiv.org/pdf/2507.07257v2.pdf

Published: 2025-07-09T20:03:30Z


5. Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework

The conventional BIM authoring process typically requires designers to master complex and tedious modeling commands in order to materialize their design intentions within BIM authoring tools. This additional cognitive burden complicates the design process and hinders the adoption of BIM and model-based design in the AEC (Architecture, Engineering, and Construction) industry. To facilitate the expression of design intentions more intuitively, we propose Text2BIM, an LLM-based multi-agent framework that can generate 3D building models from natural language instructions. This framework orchestrates multiple LLM agents to collaborate and reason, transforming textual user input into imperative code that invokes the BIM authoring tool’s APIs, thereby generating editable BIM models with internal layouts, external envelopes, and semantic information directly in the software. Furthermore, a rule-based model checker is introduced into the agentic workflow, utilizing predefined domain knowledge to guide the LLM agents in resolving issues within the generated models and iteratively improving model quality. Extensive experiments were conducted to compare and analyze the performance of three different LLMs under the proposed framework. The evaluation results demonstrate that our approach can effectively generate high-quality, structurally rational building models that are aligned with the abstract concepts specified by user input. Finally, an interactive software prototype was developed to integrate the framework into the BIM authoring software Vectorworks, showcasing the potential of modeling by chatting. The code is available at: https://github.com/dcy0577/Text2BIM

Authors: Changyu Du, Sebastian Esser, Stavros Nousias, André Borrmann

Categories: cs.AI, cs.CL, cs.SE

PDF URL: https://arxiv.org/pdf/2408.08054v2.pdf

Published: 2024-08-15T09:48:45Z


6. Agentic Large Language Models for Conceptual Systems Engineering and Design

Early-stage engineering design involves complex, iterative reasoning, yet existing large language model (LLM) workflows struggle to maintain task continuity and generate executable models. We evaluate whether a structured multi-agent system (MAS) can more effectively manage requirements extraction, functional decomposition, and simulator code generation than a simpler two-agent system (2AS). The target application is a solar-powered water filtration system as described in a cahier des charges. We introduce the Design-State Graph (DSG), a JSON-serializable representation that bundles requirements, physical embodiments, and Python-based physics models into graph nodes. A nine-role MAS iteratively builds and refines the DSG, while the 2AS collapses the process to a Generator-Reflector loop. Both systems run a total of 60 experiments (2 LLMs - Llama 3.3 70B vs reasoning-distilled DeepSeek R1 70B x 2 agent configurations x 3 temperatures x 5 seeds). We report a JSON validity, requirement coverage, embodiment presence, code compatibility, workflow completion, runtime, and graph size. Across all runs, both MAS and 2AS maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (less than 20\%). Code compatibility peaked at 100\% under specific 2AS settings but averaged below 50\% for MAS. Only the reasoning-distilled model reliably flagged workflow completion. Powered by DeepSeek R1 70B, the MAS generated more granular DSGs (average 5-6 nodes) whereas 2AS mode-collapsed. Structured multi-agent orchestration enhanced design detail. Reasoning-distilled LLM improved completion rates, yet low requirements and fidelity gaps in coding persisted.

Authors: Soheyl Massoudi, Mark Fuge

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2507.08619v1.pdf

Published: 2025-07-11T14:19:05Z


7. To Trade or Not to Trade: An Agentic Approach to Estimating Market Risk Improves Trading Decisions

Large language models (LLMs) are increasingly deployed in agentic frameworks, in which prompts trigger complex tool-based analysis in pursuit of a goal. While these frameworks have shown promise across multiple domains including in finance, they typically lack a principled model-building step, relying instead on sentiment- or trend-based analysis. We address this gap by developing an agentic system that uses LLMs to iteratively discover stochastic differential equations for financial time series. These models generate risk metrics which inform daily trading decisions. We evaluate our system in both traditional backtests and using a market simulator, which introduces synthetic but causally plausible price paths and news events. We find that model-informed trading strategies outperform standard LLM-based agents, improving Sharpe ratios across multiple equities. Our results show that combining LLMs with agentic model discovery enhances market risk estimation and enables more profitable trading decisions.

Authors: Dimitrios Emmanoulopoulos, Ollie Olby, Justin Lyon, Namid R. Stillman

Categories: q-fin.ST, cs.AI, cs.CE, cs.MA, q-fin.CP, 68T42, 65C05, 68T01, 60H10, I.2.11; I.2.0; I.2.1; I.2.3; I.2.4; I.2.8

PDF URL: https://arxiv.org/pdf/2507.08584v1.pdf

Published: 2025-07-11T13:29:32Z


8. StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked to perform essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse control, supports all major operating systems, and enables the parallel execution of multiple environment instances, making it particularly well-suited for evaluating the most capable foundation agents, powered by multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLMs agents demonstrate substantial limitations, with the best-performing model, GPT-4.1, achieving only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research towards robust, open-ended agents in complex production-living environments.

Authors: Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, Bo An

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2507.07445v2.pdf

Published: 2025-07-10T05:48:28Z


9. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, Sławek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendonça, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina Xinyi Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D’Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria Ćurko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, Brendan McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D’olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel Wright, Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O’Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, Jörg Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, Sébastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei Jerry Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Olšák, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug DeCarlo, Andrew Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bošnjak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Guía, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Mikołaj Rybiński, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, Artiom Myaskovsky, Vikas Raunak, Blanca Huergo, Behnam Neyshabur, Jon Clark, Ye Zhang, Shankar Krishnan, Eden Cohen, Dinesh Tewari, James Lottes, Yumeya Yamamori, Hui Elena Li, Mohamed Elhawaty, Ada Maksutaj Oflazer, Adrià Recasens, Sheryl Luo, Duy Nguyen, Taylor Bos, Kalyan Andra, Ana Salazar, Ed Chi, Jeongwoo Ko, Matt Ginsberg, Anders Andreassen, Anian Ruoss, Todor Davchev, Elnaz Davoodi, Chenxi Liu, Min Kim, Santiago Ontanon, Chi Ming To, Dawei Jia, Rosemary Ke, Jing Wang, Anna Korsun, Moran Ambar, Ilya Kornakov, Irene Giannoumis, Toni Creswell, Denny Zhou, Yi Su, Ishaan Watts, Aleksandr Zaks, Evgenii Eltyshev, Ziqiang Feng, Sidharth Mudgal, Alex Kaskasoli, Juliette Love, Kingshuk Dasgupta, Sam Shleifer, Richard Green, Sungyong Seo, Chansoo Lee, Dale Webster, Prakash Shroff, Ganna Raboshchuk, Isabel Leal, James Manyika, Sofia Erell, Daniel Murphy, Zhisheng Xiao, Anton Bulyenov, Julian Walker, Mark Collier, Matej Kastelic, Nelson George, Sushant Prakash, Sailesh Sidhwani, Alexey Frolov, Steven Hansen, Petko Georgiev, Tiberiu Sosea, Chris Apps, Aishwarya Kamath, David Reid, Emma Cooney, Charlotte Magister, Oriana Riva, Alec Go, Pu-Chin Chen, Sebastian Krause, Nir Levine, Marco Fornoni, Ilya Figotin, Nick Roy, Parsa Mahmoudieh, Vladimir Magay, Mukundan Madhavan, Jin Miao, Jianmo Ni, Yasuhisa Fujii, Ian Chou, George Scrivener, Zak Tsai, Siobhan Mcloughlin, Jeremy Selier, Sandra Lefdal, Jeffrey Zhao, Abhijit Karmarkar, Kushal Chauhan, Shivanker Goel, Zhaoyi Zhang, Vihan Jain, Parisa Haghani, Mostafa Dehghani, Jacob Scott, Erin Farnese, Anastasija Ilić, Steven Baker, Julia Pawar, Li Zhong, Josh Camp, Yoel Zeldes, Shravya Shetty, Anand Iyer, Vít Listík, Jiaxian Guo, Luming Tang, Mark Geller, Simon Bucher, Yifan Ding, Hongzhi Shi, Carrie Muir, Dominik Grewe, Ramy Eskander, Octavio Ponce, Boqing Gong, Derek Gasaway, Samira Khan, Umang Gupta, Angelos Filos, Weicheng Kuo, Klemen Kloboves, Jennifer Beattie, Christian Wright, Leon Li, Alicia Jin, Sandeep Mariserla, Miteyan Patel, Jens Heitkaemper, Dilip Krishnan, Vivek Sharma, David Bieber, Christian Frank, John Lambert, Paul Caron, Martin Polacek, Mai Giménez, Himadri Choudhury, Xing Yu, Sasan Tavakkol, Arun Ahuja, Franz Och, Rodolphe Jenatton, Wojtek Skut, Bryan Richter, David Gaddy, Andy Ly, Misha Bilenko, Megh Umekar, Ethan Liang, Martin Sevenich, Mandar Joshi, Hassan Mansoor, Rebecca Lin, Sumit Sanghai, Abhimanyu Singh, Xiaowei Li, Sudheendra Vijayanarasimhan, Zaheer Abbas, Yonatan Bitton, Hansa Srinivasan, Manish Reddy Vuyyuru, Alexander Frömmgen, Yanhua Sun, Ralph Leith, Alfonso Castaño, DJ Strouse, Le Yan, Austin Kyker, Satish Kambala, Mary Jasarevic, Thibault Sellam, Chao Jia, Alexander Pritzel, Raghavender R, Huizhong Chen, Natalie Clay, Sudeep Gandhe, Sean Kirmani, Sayna Ebrahimi, Hannah Kirkwood, Jonathan Mallinson, Chao Wang, Adnan Ozturel, Kuo Lin, Shyam Upadhyay, Vincent Cohen-Addad, Sean Purser-haskell, Yichong Xu, Ebrahim Songhori, Babi Seal, Alberto Magni, Almog Gueta, Tingting Zou, Guru Guruganesh, Thais Kagohara, Hung Nguyen, Khalid Salama, Alejandro Cruzado Ruiz, Justin Frye, Zhenkai Zhu, Matthias Lochbrunner, Simon Osindero, Wentao Yuan, Lisa Lee, Aman Prasad, Lam Nguyen Thiet, Daniele Calandriello, Victor Stone, Qixuan Feng, Han Ke, Maria Voitovich, Geta Sampemane, Lewis Chiang, Ling Wu, Alexander Bykovsky, Matt Young, Luke Vilnis, Ishita Dasgupta, Aditya Chawla, Qin Cao, Bowen Liang, Daniel Toyama, Szabolcs Payrits, Anca Stefanoiu, Dimitrios Vytiniotis, Ankesh Anand, Tianxiao Shen, Blagoj Mitrevski, Michael Tschannen, Sreenivas Gollapudi, Aishwarya P S, José Leal, Zhe Shen, Han Fu, Wei Wang, Arvind Kannan, Doron Kukliansky, Sergey Yaroshenko, Svetlana Grant, Umesh Telang, David Wood, Alexandra Chronopoulou, Alexandru Ţifrea, Tao Zhou, Tony Tu\’ân Nguy~ên, Muge Ersoy, Anima Singh, Meiyan Xie, Emanuel Taropa, Woohyun Han, Eirikur Agustsson, Andrei Sozanschi, Hui Peng, Alex Chen, Yoel Drori, Efren Robles, Yang Gao, Xerxes Dotiwalla, Ying Chen, Anudhyan Boral, Alexei Bendebury, John Nham, Chris Tar, Luis Castro, Jiepu Jiang, Canoee Liu, Felix Halim, Jinoo Baek, Andy Wan, Jeremiah Liu, Yuan Cao, Shengyang Dai, Trilok Acharya, Ruoxi Sun, Fuzhao Xue, Saket Joshi, Morgane Lustman, Yongqin Xian, Rishabh Joshi, Deep Karkhanis, Nora Kassner, Jamie Hall, Xiangzhuo Ding, Gan Song, Gang Li, Chen Zhu, Yana Kulizhskaya, Bin Ni, Alexey Vlaskin, Solomon Demmessie, Lucio Dery, Salah Zaiem, Yanping Huang, Cindy Fan, Felix Gimeno, Ananth Balashankar, Koji Kojima, Hagai Taitelbaum, Maya Meng, Dero Gharibian, Sahil Singla, Wei Chen, Ambrose Slone, Guanjie Chen, Sujee Rajayogam, Max Schumacher, Suyog Kotecha, Rory Blevins, Qifei Wang, Mor Hazan Taege, Alex Morris, Xin Liu, Fayaz Jamil, Richard Zhang, Pratik Joshi, Ben Ingram, Tyler Liechty, Ahmed Eleryan, Scott Baird, Alex Grills, Gagan Bansal, Shan Han, Kiran Yalasangi, Shawn Xu, Majd Al Merey, Isabel Gao, Felix Weissenberger, Igor Karpov, Robert Riachi, Ankit Anand, Gautam Prasad, Kay Lamerigts, Reid Hayes, Jamie Rogers, Mandy Guo, Ashish Shenoy, Qiong Q Hu, Kyle He, Yuchen Liu, Polina Zablotskaia, Sagar Gubbi, Yifan Chang, Jay Pavagadhi, Kristian Kjems, Archita Vadali, Diego Machado, Yeqing Li, Renshen Wang, Dipankar Ghosh, Aahil Mehta, Dana Alon, George Polovets, Alessio Tonioni, Nate Kushman, Joel D’sa, Lin Zhuo, Allen Wu, Rohin Shah, John Youssef, Jiayu Ye, Justin Snyder, Karel Lenc, Senaka Buthpitiya, Matthew Tung, Jichuan Chang, Tao Chen, David Saxton, Jenny Lee, Lydia Lihui Zhang, James Qin, Prabakar Radhakrishnan, Maxwell Chen, Piotr Ambroszczyk, Metin Toksoz-Exley, Yan Zhong, Nitzan Katz, Brendan O’Donoghue, Tamara von Glehn, Adi Gerzi Rosenthal, Aga Świetlik, Xiaokai Zhao, Nick Fernando, Jinliang Wei, Jieru Mei, Sergei Vassilvitskii, Diego Cedillo, Pranjal Awasthi, Hui Zheng, Koray Kavukcuoglu, Itay Laish, Joseph Pagadora, Marc Brockschmidt, Christopher A. Choquette-Choo, Arunkumar Byravan, Yifeng Lu, Xu Chen, Mia Chen, Kenton Lee, Rama Pasumarthi, Sijal Bhatnagar, Aditya Shah, Qiyin Wu, Zhuoyuan Chen, Zack Nado, Bartek Perz, Zixuan Jiang, David Kao, Ganesh Mallya, Nino Vieillard, Lantao Mei, Sertan Girgin, Mandy Jordan, Yeongil Ko, Alekh Agarwal, Yaxin Liu, Yasemin Altun, Raoul de Liedekerke, Anastasios Kementsietsidis, Daiyi Peng, Dangyi Liu, Utku Evci, Peter Humphreys, Austin Tarango, Xiang Deng, Yoad Lewenberg, Kevin Aydin, Chengda Wu, Bhavishya Mittal, Tsendsuren Munkhdalai, Kleopatra Chatziprimou, Rodrigo Benenson, Uri First, Xiao Ma, Jinning Li, Armand Joulin, Hamish Tomlinson, Tingnan Zhang, Milad Nasr, Zhi Hong, Michaël Sander, Lisa Anne Hendricks, Anuj Sharma, Andrew Bolt, Eszter Vértes, Jiri Simsa, Tomer Levinboim, Olcan Sercinoglu, Divyansh Shukla, Austin Wu, Craig Swanson, Danny Vainstein, Fan Bu, Bo Wang, Ryan Julian, Charles Yoon, Sergei Lebedev, Antonious Girgis, Bernd Bandemer, David Du, Todd Wang, Xi Chen, Ying Xiao, Peggy Lu, Natalie Ha, Vlad Ionescu, Simon Rowe, Josip Matak, Federico Lebron, Andreas Steiner, Lalit Jain, Manaal Faruqui, Nicolas Lacasse, Georgie Evans, Neesha Subramaniam, Dean Reich, Giulia Vezzani, Aditya Pandey, Joe Stanton, Tianhao Zhou, Liam McCafferty, Henry Griffiths, Verena Rieser, Soheil Hassas Yeganeh, Eleftheria Briakou, Lu Huang, Zichuan Wei, Liangchen Luo, Erik Jue, Gabby Wang, Victor Cotruta, Myriam Khan, Jongbin Park, Qiuchen Guo, Peiran Li, Rong Rong, Diego Antognini, Anastasia Petrushkina, Chetan Tekur, Eli Collins, Parul Bhatia, Chester Kwak, Wenhu Chen, Arvind Neelakantan, Immanuel Odisho, Sheng Peng, Vincent Nallatamby, Vaibhav Tulsyan, Fabian Pedregosa, Peng Xu, Raymond Lin, Yulong Wang, Emma Wang, Sholto Douglas, Reut Tsarfaty, Elena Gribovskaya, Renga Aravamudhan, Manu Agarwal, Mara Finkelstein, Qiao Zhang, Elizabeth Cole, Phil Crone, Sarmishta Velury, Anil Das, Chris Sauer, Luyao Xu, Danfeng Qin, Chenjie Gu, Dror Marcus, CJ Zheng, Wouter Van Gansbeke, Sobhan Miryoosefi, Haitian Sun, YaGuang Li, Charlie Chen, Jae Yoo, Pavel Dubov, Alex Tomala, Adams Yu, Paweł Wesołowski, Alok Gunjan, Eddie Cao, Jiaming Luo, Nikhil Sethi, Arkadiusz Socala, Laura Graesser, Tomas Kocisky, Arturo BC, Minmin Chen, Edward Lee, Sophie Wang, Weize Kong, Qiantong Xu, Nilesh Tripuraneni, Yiming Li, Xinxin Yu, Allen Porter, Paul Voigtlaender, Biao Zhang, Arpi Vezer, Sarah York, Qing Wei, Geoffrey Cideron, Mark Kurzeja, Seungyeon Kim, Benny Li, Angéline Pouget, Hyo Lee, Kaspar Daugaard, Yang Li, Dave Uthus, Aditya Siddhant, Paul Cavallaro, Sriram Ganapathy, Maulik Shah, Rolf Jagerman, Jeff Stanway, Piermaria Mendolicchio, Li Xiao, Kayi Lee, Tara Thompson, Shubham Milind Phal, Jason Chase, Sun Jae Lee, Adrian N Reyes, Disha Shrivastava, Zhen Qin, Roykrong Sukkerd, Seth Odoom, Lior Madmoni, John Aslanides, Jonathan Herzig, Elena Pochernina, Sheng Zhang, Parker Barnes, Daisuke Ikeda, Qiujia Li, Shuo-yiin Chang, Shakir Mohamed, Jim Sproch, Richard Powell, Bidisha Samanta, Domagoj Ćevid, Anton Kovsharov, Shrestha Basu Mallick, Srinivas Tadepalli, Anne Zheng, Kareem Ayoub, Andreas Noever, Christian Reisswig, Zhuo Xu, Junhyuk Oh, Martin Matysiak, Tim Blyth, Shereen Ashraf, Julien Amelot, Boone Severson, Michele Bevilacqua, Motoki Sano, Ethan Dyer, Ofir Roval, Anu Sinha, Yin Zhong, Sagi Perel, Tea Sabolić, Johannes Mauerer, Willi Gierke, Mauro Verzetti, Rodrigo Cabrera, Alvin Abdagic, Steven Hemingray, Austin Stone, Jong Lee, Farooq Ahmad, Karthik Raman, Lior Shani, Jonathan Lai, Orhan Firat, Nathan Waters, Eric Ge, Mo Shomrat, Himanshu Gupta, Rajeev Aggarwal, Tom Hudson, Bill Jia, Simon Baumgartner, Palak Jain, Joe Kovac, Junehyuk Jung, Ante Žužul, Will Truong, Morteza Zadimoghaddam, Songyou Peng, Marco Liang, Rachel Sterneck, Balaji Lakshminarayanan, Machel Reid, Oliver Woodman, Tong Zhou, Jianling Wang, Vincent Coriou, Arjun Narayanan, Jay Hoover, Yenai Ma, Apoorv Jindal, Clayton Sanford, Doug Reid, Swaroop Ramaswamy, Alex Kurakin, Roland Zimmermann, Yana Lunts, Dragos Dena, Zalán Borsos, Vered Cohen, Shujian Zhang, Will Grathwohl, Robert Dadashi, Morgan Redshaw, Joshua Kessinger, Julian Odell, Silvano Bonacina, Zihang Dai, Grace Chen, Ayush Dubey, Pablo Sprechmann, Mantas Pajarskas, Wenxuan Zhou, Niharika Ahuja, Tara Thomas, Martin Nikoltchev, Matija Kecman, Bharath Mankalale, Andrey Ryabtsev, Jennifer She, Christian Walder, Jiaming Shen, Lu Li, Carolina Parada, Sheena Panthaplackel, Okwan Kwon, Matt Lawlor, Utsav Prabhu, Yannick Schroecker, Marc’aurelio Ranzato, Pete Blois, Iurii Kemaev, Ting Yu, Dmitry Lepikhin, Hao Xiong, Sahand Sharifzadeh, Oleaser Johnson, Jeremiah Willcock, Rui Yao, Greg Farquhar, Sujoy Basu, Hidetoshi Shimokawa, Nina Anderson, Haiguang Li, Khiem Pham, Yizhong Liang, Sebastian Borgeaud, Alexandre Moufarek, Hideto Kazawa, Blair Kutzman, Marcin Sieniek, Sara Smoot, Ruth Wang, Natalie Axelsson, Nova Fallen, Prasha Sundaram, Yuexiang Zhai, Varun Godbole, Petros Maniatis, Alek Wang, Ilia Shumailov, Santhosh Thangaraj, Remi Crocker, Nikita Gupta, Gang Wu, Phil Chen, Gellért Weisz, Celine Smith, Mojtaba Seyedhosseini, Boya Fang, Xiyang Luo, Roey Yogev, Zeynep Cankara, Andrew Hard, Helen Ran, Rahul Sukthankar, George Necula, Gaël Liu, Honglong Cai, Praseem Banzal, Daniel Keysers, Sanjay Ghemawat, Connie Tao, Emma Dunleavy, Aditi Chaudhary, Wei Li, Maciej Mikuła, Chen-Yu Lee, Tiziana Refice, Krishna Somandepalli, Alexandre Fréchette, Dan Bahir, John Karro, Keith Rush, Sarah Perrin, Bill Rosgen, Xiaomeng Yang, Clara Huiyi Hu, Mahmoud Alnahlawi, Justin Mao-Jones, Roopal Garg, Hoang Nguyen, Bat-Orgil Batsaikhan, Iñaki Iturrate, Anselm Levskaya, Avi Singh, Ashyana Kachra, Tony Lu, Denis Petek, Zheng Xu, Mark Graham, Lukas Zilka, Yael Karov, Marija Kostelac, Fangyu Liu, Yaohui Guo, Weiyue Wang, Bernd Bohnet, Emily Pitler, Tony Bruguier, Keisuke Kinoshita, Chrysovalantis Anastasiou, Nilpa Jha, Ting Liu, Jerome Connor, Phil Wallis, Philip Pham, Eric Bailey, Shixin Li, Heng-Tze Cheng, Sally Ma, Haiqiong Li, Akanksha Maurya, Kate Olszewska, Manfred Warmuth, Christy Koh, Dominik Paulus, Siddhartha Reddy Jonnalagadda, Enrique Piqueras, Ali Elqursh, Geoff Brown, Hadar Shemtov, Loren Maggiore, Fei Xia, Ryan Foley, Beka Westberg, George van den Driessche, Livio Baldini Soares, Arjun Kar, Michael Quinn, Siqi Zuo, Jialin Wu, Kyle Kastner, Anna Bortsova, Aijun Bai, Ales Mikhalap, Luowei Zhou, Jennifer Brennan, Vinay Ramasesh, Honglei Zhuang, John Maggs, Johan Schalkwyk, Yuntao Xu, Hui Huang, Andrew Howard, Sasha Brown, Linting Xue, Gloria Shen, Brian Albert, Neha Jha, Daniel Zheng, Varvara Krayvanova, Spurthi Amba Hombaiah, Olivier Lacombe, Gautam Vasudevan, Dan Graur, Tian Xie, Meet Gandhi, Bangju Wang, Dustin Zelle, Harman Singh, Dahun Kim, Sébastien Cevey, Victor Ungureanu, Natasha Noy, Fei Liu, Annie Xie, Fangxiaoyu Feng, Katerina Tsihlas, Daniel Formoso, Neera Vats, Quentin Wellens, Yinan Wang, Niket Kumar Bhumihar, Samrat Ghosh, Matt Hoffman, Tom Lieber, Oran Lang, Kush Bhatia, Tom Paine, Aroonalok Pyne, Ronny Votel, Madeleine Clare Elish, Benoit Schillings, Alex Panagopoulos, Haichuan Yang, Adam Raveret, Zohar Yahav, Shuang Liu, Dalia El Badawy, Nishant Agrawal, Mohammed Badawi, Mahdi Mirzazadeh, Carla Bromberg, Fan Ye, Chang Liu, Tatiana Sholokhova, George-Cristian Muraru, Gargi Balasubramaniam, Jonathan Malmaud, Alen Carin, Danilo Martins, Irina Jurenka, Pankil Botadra, Dave Lacey, Richa Singh, Mariano Schain, Dan Zheng, Isabelle Guyon, Victor Lavrenko, Seungji Lee, Xiang Zhou, Demis Hassabis, Jeshwanth Challagundla, Derek Cheng, Nikhil Mehta, Matthew Mauger, Michela Paganini, Pushkar Mishra, Kate Lee, Zhang Li, Lexi Baugher, Ondrej Skopek, Max Chang, Amir Zait, Gaurav Menghani, Lizzetth Bellot, Guangxing Han, Jean-Michel Sarr, Sharat Chikkerur, Himanshu Sahni, Rohan Anil, Arun Narayanan, Chandu Thekkath, Daniele Pighin, Hana Strejček, Marko Velic, Fred Bertsch, Manuel Tragut, Keran Rong, Alicia Parrish, Kai Bailey, Jiho Park, Isabela Albuquerque, Abhishek Bapna, Rajesh Venkataraman, Alec Kosik, Johannes Griesser, Zhiwei Deng, Alek Andreev, Qingyun Dou, Kevin Hui, Fanny Wei, Xiaobin Yu, Lei Shu, Avia Aharon, David Barker, Badih Ghazi, Sebastian Flennerhag, Chris Breaux, Yuchuan Liu, Matthew Bilotti, Josh Woodward, Uri Alon, Stephanie Winkler, Tzu-Kuo Huang, Kostas Andriopoulos, João Gabriel Oliveira, Penporn Koanantakool, Berkin Akin, Michael Wunder, Cicero Nogueira dos Santos, Mohammad Hossein Bateni, Lin Yang, Dan Horgan, Beer Changpinyo, Keyvan Amiri, Min Ma, Dayeong Lee, Lihao Liang, Anirudh Baddepudi, Tejasi Latkar, Raia Hadsell, Jun Xu, Hairong Mu, Michael Han, Aedan Pope, Snchit Grover, Frank Kim, Ankit Bhagatwala, Guan Sun, Yamini Bansal, Amir Globerson, Alireza Nazari, Samira Daruki, Hagen Soltau, Jane Labanowski, Laurent El Shafey, Matt Harvey, Yanif Ahmad, Elan Rosenfeld, William Kong, Etienne Pot, Yi-Xuan Tan, Aurora Wei, Victoria Langston, Marcel Prasetya, Petar Veličković, Richard Killam, Robin Strudel, Darren Ni, Zhenhai Zhu, Aaron Archer, Kavya Kopparapu, Lynn Nguyen, Emilio Parisotto, Hussain Masoom, Sravanti Addepalli, Jordan Grimstad, Hexiang Hu, Joss Moore, Avinatan Hassidim, Le Hou, Mukund Raghavachari, Jared Lichtarge, Adam R. Brown, Hilal Dib, Natalia Ponomareva, Justin Fu, Yujing Zhang, Altaf Rahman, Joana Iljazi, Edouard Leurent, Gabriel Dulac-Arnold, Cosmo Du, Chulayuth Asawaroengchai, Larry Jin, Ela Gruzewska, Ziwei Ji, Benigno Uria, Daniel De Freitas, Paul Barham, Lauren Beltrone, Víctor Campos, Jun Yan, Neel Kovelamudi, Arthur Nguyen, Elinor Davies, Zhichun Wu, Zoltan Egyed, Kristina Toutanova, Nithya Attaluri, Hongliang Fei, Peter Stys, Siddhartha Brahma, Martin Izzard, Siva Velusamy, Scott Lundberg, Vincent Zhuang, Kevin Sequeira, Adam Santoro, Ehsan Amid, Ophir Aharoni, Shuai Ye, Mukund Sundararajan, Lijun Yu, Yu-Cheng Ling, Stephen Spencer, Hugo Song, Josip Djolonga, Christo Kirov, Sonal Gupta, Alessandro Bissacco, Clemens Meyer, Mukul Bhutani, Andrew Dai, Weiyi Wang, Siqi Liu, Ashwin Sreevatsa, Qijun Tan, Maria Wang, Lucy Kim, Yicheng Wang, Alex Irpan, Yang Xiao, Stanislav Fort, Yifan He, Alex Gurney, Bryan Gale, Yue Ma, Monica Roy, Viorica Patraucean, Taylan Bilal, Golnaz Ghiasi, Anahita Hosseini, Melvin Johnson, Zhuowan Li, Yi Tay, Benjamin Beyret, Katie Millican, Josef Broder, Mayank Lunayach, Danny Swisher, Eugen Vušak, David Parkinson, MH Tessler, Adi Mayrav Gilady, Richard Song, Allan Dafoe, Yves Raimond, Masa Yamaguchi, Itay Karo, Elizabeth Nielsen, Kevin Kilgour, Mike Dusenberry, Rajiv Mathews, Jiho Choi, Siyuan Qiao, Harsh Mehta, Sahitya Potluri, Chris Knutsen, Jialu Liu, Tat Tan, Kuntal Sengupta, Keerthana Gopalakrishnan, Abodunrinwa Toki, Mencher Chiang, Mike Burrows, Grace Vesom, Zafarali Ahmed, Ilia Labzovsky, Siddharth Vashishtha, Preeti Singh, Ankur Sharma, Ada Ma, Jinyu Xie, Pranav Talluri, Hannah Forbes-Pollard, Aarush Selvan, Joel Wee, Loic Matthey, Tom Funkhouser, Parthasarathy Gopavarapu, Lev Proleev, Cheng Li, Matt Thomas, Kashyap Kolipaka, Zhipeng Jia, Ashwin Kakarla, Srinivas Sunkara, Joan Puigcerver, Suraj Satishkumar Sheth, Emily Graves, Chen Wang, Sadh MNM Khan, Kai Kang, Shyamal Buch, Fred Zhang, Omkar Savant, David Soergel, Kevin Lee, Linda Friso, Xuanyi Dong, Rahul Arya, Shreyas Chandrakaladharan, Connor Schenck, Greg Billock, Tejas Iyer, Anton Bakalov, Leslie Baker, Alex Ruiz, Angad Chandorkar, Trieu Trinh, Matt Miecnikowski, Yanqi Zhou, Yangsibo Huang, Jiazhong Nie, Ali Shah, Ashish Thapliyal, Sam Haves, Lun Wang, Uri Shaham, Patrick Morris-Suzuki, Soroush Radpour, Leonard Berrada, Thomas Strohmann, Chaochao Yan, Jingwei Shen, Sonam Goenka, Tris Warkentin, Petar Dević, Dan Belov, Albert Webson, Madhavi Yenugula, Puranjay Datta, Jerry Chang, Nimesh Ghelani, Aviral Kumar, Vincent Perot, Jessica Lo, Yang Song, Herman Schmit, Jianmin Chen, Vasilisa Bashlovkina, Xiaoyue Pan, Diana Mincu, Paul Roit, Isabel Edkins, Andy Davis, Yujia Li, Ben Horn, Xinjian Li, Pradeep Kumar S, Eric Doi, Wanzheng Zhu, Sri Gayatri Sundara Padmanabhan, Siddharth Verma, Jasmine Liu, Heng Chen, Mihajlo Velimirović, Malcolm Reynolds, Priyanka Agrawal, Nick Sukhanov, Abhinit Modi, Siddharth Goyal, John Palowitch, Nima Khajehnouri, Wing Lowe, David Klinghoffer, Sharon Silver, Vinh Tran, Candice Schumann, Francesco Piccinno, Xi Liu, Mario Lučić, Xiaochen Yang, Sandeep Kumar, Ajay Kannan, Ragha Kotikalapudi, Mudit Bansal, Fabian Fuchs, Mohammad Javad Hosseini, Abdelrahman Abdelhamed, Dawn Bloxwich, Tianhe Yu, Ruoxin Sang, Gregory Thornton, Karan Gill, Yuchi Liu, Virat Shejwalkar, Jason Lin, Zhipeng Yan, Kehang Han, Thomas Buschmann, Michael Pliskin, Zhi Xing, Susheel Tatineni, Junlin Zhang, Sissie Hsiao, Gavin Buttimore, Marcus Wu, Zefei Li, Geza Kovacs, Legg Yeung, Tao Huang, Aaron Cohen, Bethanie Brownfield, Averi Nowak, Mikel Rodriguez, Tianze Shi, Hado van Hasselt, Kevin Cen, Deepanway Ghoshal, Kushal Majmundar, Weiren Yu, Warren Weilun Chen, Danila Sinopalnikov, Hao Zhang, Vlado Galić, Di Lu, Zeyu Zheng, Maggie Song, Gary Wang, Gui Citovsky, Swapnil Gawde, Isaac Galatzer-Levy, David Silver, Ivana Balazevic, Dipanjan Das, Kingshuk Majumder, Yale Cong, Praneet Dutta, Dustin Tran, Hui Wan, Junwei Yuan, Daniel Eppens, Alanna Walton, Been Kim, Harry Ragan, James Cobon-Kerr, Lu Liu, Weijun Wang, Bryce Petrini, Jack Rae, Rakesh Shivanna, Yan Xiong, Chace Lee, Pauline Coquinot, Yiming Gu, Lisa Patel, Blake Hechtman, Aviel Boag, Orion Jankowski, Alex Wertheim, Alex Lee, Paul Covington, Hila Noga, Sam Sobell, Shanthal Vasanth, William Bono, Chirag Nagpal, Wei Fan, Xavier Garcia, Kedar Soparkar, Aybuke Turker, Nathan Howard, Sachit Menon, Yuankai Chen, Vikas Verma, Vladimir Pchelin, Harish Rajamani, Valentin Dalibard, Ana Ramalho, Yang Guo, Kartikeya Badola, Seojin Bang, Nathalie Rauschmayr, Julia Proskurnia, Sudeep Dasari, Xinyun Chen, Mikhail Sushkov, Anja Hauth, Pauline Sho, Abhinav Singh, Bilva Chandra, Allie Culp, Max Dylla, Olivier Bachem, James Besley, Heri Zhao, Timothy Lillicrap, Wei Wei, Wael Al Jishi, Ning Niu, Alban Rrustemi, Raphaël Lopez Kaufman, Ryan Poplin, Jewel Zhao, Minh Truong, Shikhar Bharadwaj, Ester Hlavnova, Eli Stickgold, Cordelia Schmid, Georgi Stephanov, Zhaoqi Leng, Frederick Liu, Léonard Hussenot, Shenil Dodhia, Juliana Vicente Franco, Lesley Katzen, Abhanshu Sharma, Sarah Cogan, Zuguang Yang, Aniket Ray, Sergi Caelles, Shen Yan, Ravin Kumar, Daniel Gillick, Renee Wong, Joshua Ainslie, Jonathan Hoech, Séb Arnold, Dan Abolafia, Anca Dragan, Ben Hora, Grace Hu, Alexey Guseynov, Yang Lu, Chas Leichner, Jinmeng Rao, Abhimanyu Goyal, Nagabhushan Baddi, Daniel Hernandez Diaz, Tim McConnell, Max Bain, Jake Abernethy, Qiqi Yan, Rylan Schaeffer, Paul Vicol, Will Thompson, Montse Gonzalez Arenas, Mathias Bellaiche, Pablo Barrio, Stefan Zinke, Riccardo Patana, Pulkit Mehta, JK Kearns, Avraham Ruderman, Scott Pollom, David D’Ambrosio, Cath Hope, Yang Yu, Andrea Gesmundo, Kuang-Huei Lee, Aviv Rosenberg, Yiqian Zhou, Yaoyiran Li, Drew Garmon, Yonghui Wu, Safeen Huda, Gil Fidel, Martin Baeuml, Jian Li, Phoebe Kirk, Rhys May, Tao Tu, Sara Mc Carthy, Toshiyuki Fukuzawa, Miranda Aperghis, Chih-Kuan Yeh, Toshihiro Yoshino, Bo Li, Austin Myers, Kaisheng Yao, Ben Limonchik, Changwan Ryu, Rohun Saxena, Alex Goldin, Ruizhe Zhao, Rocky Rhodes, Tao Zhu, Divya Tyam, Heidi Howard, Nathan Byrd, Hongxu Ma, Yan Wu, Ryan Mullins, Qingze Wang, Aida Amini, Sebastien Baur, Yiran Mao, Subhashini Venugopalan, Will Song, Wen Ding, Paul Collins, Sashank Reddi, Megan Shum, Andrei Rusu, Luisa Zintgraf, Kelvin Chan, Sheela Goenka, Mathieu Blondel, Michael Collins, Renke Pan, Marissa Giustina, Nikolai Chinaev, Christian Schuler, Ce Zheng, Jonas Valfridsson, Alyssa Loo, Alex Yakubovich, Jamie Smith, Tao Jiang, Rich Munoz, Gabriel Barcik, Rishabh Bansal, Mingyao Yang, Yilun Du, Pablo Duque, Mary Phuong, Alexandra Belias, Kunal Lad, Zeyu Liu, Tal Schuster, Karthik Duddu, Jieru Hu, Paige Kunkle, Matthew Watson, Jackson Tolins, Josh Smith, Denis Teplyashin, Garrett Bingham, Marvin Ritter, Marco Andreetto, Divya Pitta, Mohak Patel, Shashank Viswanadha, Trevor Strohman, Catalin Ionescu, Jincheng Luo, Yogesh Kalley, Jeremy Wiesner, Dan Deutsch, Derek Lockhart, Peter Choy, Rumen Dangovski, Chawin Sitawarin, Cat Graves, Tanya Lando, Joost van Amersfoort, Ndidi Elue, Zhouyuan Huo, Pooya Moradi, Jean Tarbouriech, Henryk Michalewski, Wenting Ye, Eunyoung Kim, Alex Druinsky, Florent Altché, Xinyi Chen, Artur Dwornik, Da-Cheng Juan, Rivka Moroshko, Horia Toma, Jarrod Kahn, Hai Qian, Maximilian Sieb, Irene Cai, Roman Goldenberg, Praneeth Netrapalli, Sindhu Raghuram, Yuan Gong, Lijie Fan, Evan Palmer, Yossi Matias, Valentin Gabeur, Shreya Pathak, Tom Ouyang, Don Metzler, Geoff Bacon, Srinivasan Venkatachary, Sridhar Thiagarajan, Alex Cullum, Eran Ofek, Vytenis Sakenas, Mohamed Hammad, Cesar Magalhaes, Mayank Daswani, Oscar Chang, Ashok Popat, Ruichao Li, Komal Jalan, Yanhan Hou, Josh Lipschultz, Antoine He, Wenhao Jia, Pier Giuseppe Sessa, Prateek Kolhar, William Wong, Sumeet Singh, Lukas Haas, Jay Whang, Hanna Klimczak-Plucińska, Georges Rotival, Grace Chung, Yiqing Hua, Anfal Siddiqui, Nicolas Serrano, Dongkai Chen, Billy Porter, Libin Bai, Keshav Shivam, Sho Arora, Partha Talukdar, Tom Cobley, Sangnie Bhardwaj, Evgeny Gladchenko, Simon Green, Kelvin Guu, Felix Fischer, Xiao Wu, Eric Wang, Achintya Singhal, Tatiana Matejovicova, James Martens, Hongji Li, Roma Patel, Elizabeth Kemp, Jiaqi Pan, Lily Wang, Blake JianHang Chen, Jean-Baptiste Alayrac, Navneet Potti, Erika Gemzer, Eugene Ie, Kay McKinney, Takaaki Saeki, Edward Chou, Pascal Lamblin, SQ Mah, Zach Fisher, Martin Chadwick, Jon Stritar, Obaid Sarvana, Andrew Hogue, Artem Shtefan, Hadi Hashemi, Yang Xu, Jindong Gu, Sharad Vikram, Chung-Ching Chang, Sabela Ramos, Logan Kilpatrick, Weijuan Xi, Jenny Brennan, Yinghao Sun, Abhishek Jindal, Ionel Gog, Dawn Chen, Felix Wu, Jason Lee, Sudhindra Kopalle, Srinadh Bhojanapalli, Oriol Vinyals, Natan Potikha, Burcu Karagol Ayan, Yuan Yuan, Michael Riley, Piotr Stanczyk, Sergey Kishchenko, Bing Wang, Dan Garrette, Antoine Yang, Vlad Feinberg, CJ Carey, Javad Azizi, Viral Shah, Erica Moreira, Chongyang Shi, Josh Feldman, Elizabeth Salesky, Thomas Lampe, Aneesh Pappu, Duhyeon Kim, Jonas Adler, Avi Caciularu, Brian Walker, Yunhan Xu, Yochai Blau, Dylan Scandinaro, Terry Huang, Sam El-Husseini, Abhishek Sinha, Lijie Ren, Taylor Tobin, Patrik Sundberg, Tim Sohn, Vikas Yadav, Mimi Ly, Emily Xue, Jing Xiong, Afzal Shama Soudagar, Sneha Mondal, Nikhil Khadke, Qingchun Ren, Ben Vargas, Stan Bileschi, Sarah Chakera, Cindy Wang, Boyu Wang, Yoni Halpern, Joe Jiang, Vikas Sindhwani, Petre Petrov, Pranavaraj Ponnuramu, Sanket Vaibhav Mehta, Yu Watanabe, Betty Chan, Matheus Wisniewski, Trang Pham, Jingwei Zhang, Conglong Li, Dario de Cesare, Art Khurshudov, Alex Vasiloff, Melissa Tan, Zoe Ashwood, Bobak Shahriari, Maryam Majzoubi, Garrett Tanzer, Olga Kozlova, Robin Alazard, James Lee-Thorp, Nguyet Minh Phu, Isaac Tian, Junwhan Ahn, Andy Crawford, Lauren Lax, Yuan Shangguan, Iftekhar Naim, David Ross, Oleksandr Ferludin, Tongfei Guo, Andrea Banino, Hubert Soyer, Xiaoen Ju, Dominika Rogozińska, Ishaan Malhi, Marcella Valentine, Daniel Balle, Apoorv Kulshreshtha, Maciej Kula, Yiwen Song, Sophia Austin, John Schultz, Roy Hirsch, Arthur Douillard, Apoorv Reddy, Michael Fink, Summer Yue, Khyatti Gupta, Adam Zhang, Norman Rink, Daniel McDuff, Lei Meng, András György, Yasaman Razeghi, Ricky Liang, Kazuki Osawa, Aviel Atias, Matan Eyal, Tyrone Hill, Nikolai Grigorev, Zhengdong Wang, Nitish Kulkarni, Rachel Soh, Ivan Lobov, Zachary Charles, Sid Lall, Kazuma Hashimoto, Ido Kessler, Victor Gomes, Zelda Mariet, Danny Driess, Alessandro Agostini, Canfer Akbulut, Jingcao Hu, Marissa Ikonomidis, Emily Caveness, Kartik Audhkhasi, Saurabh Agrawal, Ioana Bica, Evan Senter, Jayaram Mudigonda, Kelly Chen, Jingchen Ye, Xuanhui Wang, James Svensson, Philipp Fränken, Josh Newlan, Li Lao, Eva Schnider, Sami Alabed, Joseph Kready, Jesse Emond, Afief Halumi, Tim Zaman, Chengxi Ye, Naina Raisinghani, Vilobh Meshram, Bo Chang, Ankit Singh Rawat, Axel Stjerngren, Sergey Levi, Rui Wang, Xiangzhu Long, Mitchelle Rasquinha, Steven Hand, Aditi Mavalankar, Lauren Agubuzu, Sudeshna Roy, Junquan Chen, Jarek Wilkiewicz, Hao Zhou, Michal Jastrzebski, Qiong Hu, Agustin Dal Lago, Ramya Sree Boppana, Wei-Jen Ko, Jennifer Prendki, Yao Su, Zhi Li, Eliza Rutherford, Girish Ramchandra Rao, Ramona Comanescu, Adrià Puigdomènech, Qihang Chen, Dessie Petrova, Christine Chan, Vedrana Milutinovic, Felipe Tiengo Ferreira, Chin-Yi Cheng, Ming Zhang, Tapomay Dey, Sherry Yang, Ramesh Sampath, Quoc Le, Howard Zhou, Chu-Cheng Lin, Hoi Lam, Christine Kaeser-Chen, Kai Hui, Dean Hirsch, Tom Eccles, Basil Mustafa, Shruti Rijhwani, Morgane Rivière, Yuanzhong Xu, Junjie Wang, Xinyang Geng, Xiance Si, Arjun Khare, Cheolmin Kim, Vahab Mirrokni, Kamyu Lee, Khuslen Baatarsukh, Nathaniel Braun, Lisa Wang, Pallavi LV, Richard Tanburn, Yonghao Zhu, Fangda Li, Setareh Ariafar, Dan Goldberg, Ken Burke, Daniil Mirylenka, Meiqi Guo, Olaf Ronneberger, Hadas Natalie Vogel, Liqun Cheng, Nishita Shetty, Johnson Jia, Thomas Jimma, Corey Fry, Ted Xiao, Martin Sundermeyer, Ryan Burnell, Yannis Assael, Mario Pinto, JD Chen, Rohit Sathyanarayana, Donghyun Cho, Jing Lu, Rishabh Agarwal, Sugato Basu, Lucas Gonzalez, Dhruv Shah, Meng Wei, Dre Mahaarachchi, Rohan Agrawal, Tero Rissa, Yani Donchev, Ramiro Leal-Cavazos, Adrian Hutter, Markus Mircea, Alon Jacovi, Faruk Ahmed, Jiageng Zhang, Shuguang Hu, Bo-Juen Chen, Jonni Kanerva, Guillaume Desjardins, Andrew Lee, Nikos Parotsidis, Asier Mujika, Tobias Weyand, Jasper Snoek, Jo Chick, Kai Chen, Paul Chang, Ethan Mahintorabi, Zi Wang, Tolly Powell, Orgad Keller, Abhirut Gupta, Claire Sha, Kanav Garg, Nicolas Heess, Ágoston Weisz, Cassidy Hardin, Bartek Wydrowski, Ben Coleman, Karina Zainullina, Pankaj Joshi, Alessandro Epasto, Terry Spitz, Binbin Xiong, Kai Zhao, Arseniy Klimovskiy, Ivy Zheng, Johan Ferret, Itay Yona, Waleed Khawaja, Jean-Baptiste Lespiau, Maxim Krikun, Siamak Shakeri, Timothee Cour, Bonnie Li, Igor Krivokon, Dan Suh, Alex Hofer, Jad Al Abdallah, Nikita Putikhin, Oscar Akerlund, Silvio Lattanzi, Anurag Kumar, Shane Settle, Himanshu Srivastava, Folawiyo Campbell-Ajala, Edouard Rosseel, Mihai Dorin Istin, Nishanth Dikkala, Anand Rao, Nick Young, Kate Lin, Dhruva Bhaswar, Yiming Wang, Jaume Sanchez Elias, Kritika Muralidharan, James Keeling, Dayou Du, Siddharth Gopal, Gregory Dibb, Charles Blundell, Manolis Delakis, Jacky Liang, Marco Tulio Ribeiro, Georgi Karadzhov, Guillermo Garrido, Ankur Bapna, Jiawei Cao, Adam Sadovsky, Pouya Tafti, Arthur Guez, Coline Devin, Yixian Di, Jinwei Xing, Chuqiao Joyce Xu, Hanzhao Lin, Chun-Te Chu, Sameera Ponda, Wesley Helmholz, Fan Yang, Yue Gao, Sara Javanmardi, Wael Farhan, Alex Ramirez, Ricardo Figueira, Khe Chai Sim, Yuval Bahat, Ashwin Vaswani, Liangzhe Yuan, Gufeng Zhang, Leland Rechis, Hanjun Dai, Tayo Oguntebi, Alexandra Cordell, Eugénie Rives, Kaan Tekelioglu, Naveen Kumar, Bing Zhang, Aurick Zhou, Nikolay Savinov, Andrew Leach, Alex Tudor, Sanjay Ganapathy, Yanyan Zheng, Mirko Rossini, Vera Axelrod, Arnaud Autef, Yukun Zhu, Zheng Zheng, Mingda Zhang, Baochen Sun, Jie Ren, Nenad Tomasev, Nithish Kannen, Amer Sinha, Charles Chen, Louis O’Bryan, Alex Pak, Aditya Kusupati, Weel Yang, Deepak Ramachandran, Patrick Griffin, Seokhwan Kim, Philipp Neubeck, Craig Schiff, Tammo Spalink, Mingyang Ling, Arun Nair, Ga-Young Joung, Linda Deng, Avishkar Bhoopchand, Lora Aroyo, Tom Duerig, Jordan Griffith, Gabe Barth-Maron, Jake Ades, Alex Haig, Ankur Taly, Yunting Song, Paul Michel, Dave Orr, Dean Weesner, Corentin Tallec, Carrie Grimes Bostock, Paul Niemczyk, Andy Twigg, Mudit Verma, Rohith Vallu, Henry Wang, Marco Gelmi, Kiranbir Sodhia, Aleksandr Chuklin, Omer Goldman, Jasmine George, Liang Bai, Kelvin Zhang, Petar Sirkovic, Efrat Nehoran, Golan Pundak, Jiaqi Mu, Alice Chen, Alex Greve, Paulo Zacchello, David Amos, Heming Ge, Eric Noland, Colton Bishop, Jeffrey Dudek, Youhei Namiki, Elena Buchatskaya, Jing Li, Dorsa Sadigh, Masha Samsikova, Dan Malkin, Damien Vincent, Robert David, Rob Willoughby, Phoenix Meadowlark, Shawn Gao, Yan Li, Raj Apte, Amit Jhindal, Stein Xudong Lin, Alex Polozov, Zhicheng Wang, Tomas Mery, Anirudh GP, Varun Yerram, Sage Stevens, Tianqi Liu, Noah Fiedel, Charles Sutton, Matthew Johnson, Xiaodan Song, Kate Baumli, Nir Shabat, Muqthar Mohammad, Hao Liu, Marco Selvi, Yichao Zhou, Mehdi Hafezi Manshadi, Chu-ling Ko, Anthony Chen, Michael Bendersky, Jorge Gonzalez Mendez, Nisarg Kothari, Amir Zandieh, Yiling Huang, Daniel Andor, Ellie Pavlick, Idan Brusilovsky, Jitendra Harlalka, Sally Goldman, Andrew Lampinen, Guowang Li, Asahi Ushio, Somit Gupta, Lei Zhang, Chuyuan Kelly Fu, Madhavi Sewak, Timo Denk, Jed Borovik, Brendan Jou, Avital Zipori, Prateek Jain, Junwen Bai, Thang Luong, Jonathan Tompson, Alice Li, Li Liu, George Powell, Jiajun Shen, Alex Feng, Grishma Chole, Da Yu, Yinlam Chow, Tongxin Yin, Eric Malmi, Kefan Xiao, Yash Pande, Shachi Paul, Niccolò Dal Santo, Adil Dostmohamed, Sergio Guadarrama, Aaron Phillips, Thanumalayan Sankaranarayana Pillai, Gal Yona, Amin Ghafouri, Preethi Lahoti, Benjamin Lee, Dhruv Madeka, Eren Sezener, Simon Tokumine, Adrian Collister, Nicola De Cao, Richard Shin, Uday Kalra, Parker Beak, Emily Nottage, Ryo Nakashima, Ivan Jurin, Vikash Sehwag, Meenu Gaba, Junhao Zeng, Kevin R. McKee, Fernando Pereira, Tamar Yakar, Amayika Panda, Arka Dhar, Peilin Zhong, Daniel Sohn, Mark Brand, Lars Lowe Sjoesund, Viral Carpenter, Sharon Lin, Shantanu Thakoor, Marcus Wainwright, Ashwin Chaugule, Pranesh Srinivasan, Muye Zhu, Bernett Orlando, Jack Weber, Ayzaan Wahid, Gilles Baechler, Apurv Suman, Jovana Mitrović, Gabe Taubman, Honglin Yu, Helen King, Josh Dillon, Cathy Yip, Dhriti Varma, Tomas Izo, Levent Bolelli, Borja De Balle Pigem, Julia Di Trapani, Fotis Iliopoulos, Adam Paszke, Nishant Ranka, Joe Zou, Francesco Pongetti, Jed McGiffin, Alex Siegman, Rich Galt, Ross Hemsley, Goran Žužić, Victor Carbune, Tao Li, Myle Ott, Félix de Chaumont Quitry, David Vilar Torres, Yuri Chervonyi, Tomy Tsai, Prem Eruvbetine, Samuel Yang, Matthew Denton, Jake Walker, Slavica Andačić, Idan Heimlich Shtacher, Vittal Premachandran, Harshal Tushar Lehri, Cip Baetu, Damion Yates, Lampros Lamprou, Mariko Iinuma, Ioana Mihailescu, Ben Albrecht, Shachi Dave, Susie Sargsyan, Bryan Perozzi, Lucas Manning, Chiyuan Zhang, Denis Vnukov, Igor Mordatch, Raia Hadsell Wolfgang Macherey, Ryan Kappedal, Jim Stephan, Aditya Tripathi, Klaus Macherey, Jun Qian, Abhishek Bhowmick, Shekoofeh Azizi, Rémi Leblond, Shiva Mohan Reddy Garlapati, Timothy Knight, Matthew Wiethoff, Wei-Chih Hung, Anelia Angelova, Georgios Evangelopoulos, Pawel Janus, Dimitris Paparas, Matthew Rahtz, Ken Caluwaerts, Vivek Sampathkumar, Daniel Jarrett, Shadi Noghabi, Antoine Miech, Chak Yeung, Geoff Clark, Henry Prior, Fei Zheng, Jean Pouget-Abadie, Indro Bhattacharya, Kalpesh Krishna, Will Bishop, Zhe Yuan, Yunxiao Deng, Ashutosh Sathe, Kacper Krasowiak, Ciprian Chelba, Cho-Jui Hsieh, Kiran Vodrahalli, Buhuang Liu, Thomas Köppe, Amr Khalifa, Lubo Litchev, Pichi Charoenpanit, Reed Roberts, Sachin Yadav, Yasumasa Onoe, Desi Ivanov, Megha Mohabey, Vighnesh Birodkar, Nemanja Rakićević, Pierre Sermanet, Vaibhav Mehta, Krishan Subudhi, Travis Choma, Will Ng, Luheng He, Kathie Wang, Tasos Kementsietsidis, Shane Gu, Mansi Gupta, Andrew Nystrom, Mehran Kazemi, Timothy Chung, Nacho Cano, Nikhil Dhawan, Yufei Wang, Jiawei Xia, Trevor Yacovone, Eric Jia, Mingqing Chen, Simeon Ivanov, Ashrith Sheshan, Sid Dalmia, Paweł Stradomski, Pengcheng Yin, Salem Haykal, Congchao Wang, Dennis Duan, Neslihan Bulut, Greg Kochanski, Liam MacDermed, Namrata Godbole, Shitao Weng, Jingjing Chen, Rachana Fellinger, Ramin Mehran, Daniel Suo, Hisham Husain, Tong He, Kaushal Patel, Joshua Howland, Randall Parker, Kelvin Nguyen, Sharath Maddineni, Chris Rawles, Mina Khan, Shlomi Cohen-Ganor, Amol Mandhane, Xinyi Wu, Chenkai Kuang, Iulia Comşa, Ramya Ganeshan, Hanie Sedghi, Adam Bloniarz, Nuo Wang Pierse, Anton Briukhov, Petr Mitrichev, Anita Gergely, Serena Zhan, Allan Zhou, Nikita Saxena, Eva Lu, Josef Dean, Ashish Gupta, Nicolas Perez-Nieves, Renjie Wu, Cory McLean, Wei Liang, Disha Jindal, Anton Tsitsulin, Wenhao Yu, Kaiz Alarakyia, Tom Schaul, Piyush Patil, Peter Sung, Elijah Peake, Hongkun Yu, Feryal Behbahani, JD Co-Reyes, Alan Ansell, Sean Sun, Clara Barbu, Jonathan Lee, Seb Noury, James Allingham, Bilal Piot, Mohit Sharma, Christopher Yew, Ivan Korotkov, Bibo Xu, Demetra Brady, Goran Petrovic, Shibl Mourad, Claire Cui, Aditya Gupta, Parker Schuh, Saarthak Khanna, Anna Goldie, Abhinav Arora, Vadim Zubov, Amy Stuart, Mark Epstein, Yun Zhu, Jianqiao Liu, Yury Stuken, Ziyue Wang, Karolis Misiunas, Dee Guo, Ashleah Gill, Ale Hartman, Zaid Nabulsi, Aurko Roy, Aleksandra Faust, Jason Riesa, Ben Withbroe, Mengchao Wang, Marco Tagliasacchi, Andreea Marzoca, James Noraky, Serge Toropov, Malika Mehrotra, Bahram Raad, Sanja Deur, Steve Xu, Marianne Monteiro, Zhongru Wu, Yi Luan, Sam Ritter, Nick Li, Håvard Garnes, Yanzhang He, Martin Zlocha, Jifan Zhu, Matteo Hessel, Will Wu, Spandana Raj Babbula, Chizu Kawamoto, Yuanzhen Li, Mehadi Hassen, Yan Wang, Brian Wieder, James Freedman, Yin Zhang, Xinyi Bai, Tianli Yu, David Reitter, XiangHai Sheng, Mateo Wirth, Aditya Kini, Dima Damen, Mingcen Gao, Rachel Hornung, Michael Voznesensky, Brian Roark, Adhi Kuncoro, Yuxiang Zhou, Rushin Shah, Anthony Brohan, Kuangyuan Chen, James Wendt, David Rim, Paul Kishan Rubenstein, Jonathan Halcrow, Michelle Liu, Ty Geri, Yunhsuan Sung, Jane Shapiro, Shaan Bijwadia, Chris Duvarney, Christina Sorokin, Paul Natsev, Reeve Ingle, Pramod Gupta, Young Maeng, Ndaba Ndebele, Kexin Zhu, Valentin Anklin, Katherine Lee, Yuan Liu, Yaroslav Akulov, Shaleen Gupta, Guolong Su, Flavien Prost, Tianlin Liu, Vitaly Kovalev, Pol Moreno, Martin Scholz, Sam Redmond, Zongwei Zhou, Alex Castro-Ros, André Susano Pinto, Dia Kharrat, Michal Yarom, Rachel Saputro, Jannis Bulian, Ben Caine, Ji Liu, Abbas Abdolmaleki, Shariq Iqbal, Tautvydas Misiunas, Mikhail Sirotenko, Shefali Garg, Guy Bensky, Huan Gui, Xuezhi Wang, Raphael Koster, Mike Bernico, Da Huang, Romal Thoppilan, Trevor Cohn, Ben Golan, Wenlei Zhou, Andrew Rosenberg, Markus Freitag, Tynan Gangwani, Vincent Tsang, Anand Shukla, Xiaoqi Ren, Minh Giang, Chi Zou, Andre Elisseeff, Charline Le Lan, Dheeru Dua, Shuba Lall, Pranav Shyam, Frankie Garcia, Sarah Nguyen, Michael Guzman, AJ Maschinot, Marcello Maggioni, Ming-Wei Chang, Karol Gregor, Lotte Weerts, Kumaran Venkatesan, Bogdan Damoc, Leon Liu, Jan Wassenberg, Lewis Ho, Becca Roelofs, Majid Hadian, François-Xavier Aubet, Yu Liang, Sami Lachgar, Danny Karmon, Yong Cheng, Amelio Vázquez-Reina, Angie Chen, Zhuyun Dai, Andy Brock, Shubham Agrawal, Chenxi Pang, Peter Garst, Mariella Sanchez-Vargas, Ivor Rendulic, Aditya Ayyar, Andrija Ražnatović, Olivia Ma, Roopali Vij, Neha Sharma, Ashwin Balakrishna, Bingyuan Liu, Ian Mackinnon, Sorin Baltateanu, Petra Poklukar, Gabriel Ibagon, Colin Ji, Hongyang Jiao, Isaac Noble, Wojciech Stokowiec, Zhihao Li, Jeff Dean, David Lindner, Mark Omernick, Kristen Chiafullo, Mason Dimarco, Vitor Rodrigues, Vittorio Selo, Garrett Honke, Xintian Cindy Wu, Wei He, Adam Hillier, Anhad Mohananey, Vihari Piratla, Chang Ye, Chase Malik, Sebastian Riedel, Samuel Albanie, Zi Yang, Kenny Vassigh, Maria Bauza, Sheng Li, Yiqing Tao, Nevan Wichers, Andrii Maksai, Abe Ittycheriah, Ross Mcilroy, Bryan Seybold, Noah Goodman, Romina Datta, Steven M. Hernandez, Tian Shi, Yony Kochinski, Anna Bulanova, Ken Franko, Mikita Sazanovich, Nicholas FitzGerald, Praneeth Kacham, Shubha Srinivas Raghvendra, Vincent Hellendoorn, Alexander Grushetsky, Julian Salazar, Angeliki Lazaridou, Jason Chang, Jan-Thorsten Peter, Sushant Kafle, Yann Dauphin, Abhishek Rao, Filippo Graziano, Izhak Shafran, Yuguo Liao, Tianli Ding, Geng Yan, Grace Chu, Zhao Fu, Vincent Roulet, Gabriel Rasskin, Duncan Williams, Shahar Drath, Alex Mossin, Raphael Hoffmann, Jordi Orbay, Francesco Bertolini, Hila Sheftel, Justin Chiu, Siyang Xue, Yuheng Kuang, Ferjad Naeem, Swaroop Nath, Nana Nti, Phil Culliton, Kashyap Krishnakumar, Michael Isard, Pei Sun, Ayan Chakrabarti, Nathan Clement, Regev Cohen, Arissa Wongpanich, GS Oh, Ashwin Murthy, Hao Zheng, Jessica Hamrick, Oskar Bunyan, Suhas Ganesh, Nitish Gupta, Roy Frostig, John Wieting, Yury Malkov, Pierre Marcenac, Zhixin Lucas Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim Põder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Taïga, Kareem Mohamed, Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrzębski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bražinskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav Žanić, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.06261v2.pdf

Published: 2025-07-07T17:36:04Z


10. The Dark Side of LLMs Agent-based Attacks for Complete Computer Takeover

The rapid adoption of Large Language Model (LLM) agents and multi-agent systems enables unprecedented capabilities in natural language processing and generation. However, these systems have introduced unprecedented security vulnerabilities that extend beyond traditional prompt injection attacks. This paper presents the first comprehensive evaluation of LLM agents as attack vectors capable of achieving complete computer takeover through the exploitation of trust boundaries within agentic AI systems where autonomous entities interact and influence each other. We demonstrate that adversaries can leverage three distinct attack surfaces - direct prompt injection, RAG backdoor attacks, and inter-agent trust exploitation - to coerce popular LLMs (including GPT-4o, Claude-4 and Gemini-2.5) into autonomously installing and executing malware on victim machines. Our evaluation of 17 state-of-the-art LLMs reveals an alarming vulnerability hierarchy: while 41.2% of models succumb to direct prompt injection, 52.9% are vulnerable to RAG backdoor attacks, and a critical 82.4% can be compromised through inter-agent trust exploitation. Notably, we discovered that LLMs which successfully resist direct malicious commands will execute identical payloads when requested by peer agents, revealing a fundamental flaw in current multi-agent security models. Our findings demonstrate that only 5.9% of tested models (1/17) proved resistant to all attack vectors, with the majority exhibiting context-dependent security behaviors that create exploitable blind spots. Our findings also highlight the need to increase awareness and research on the security risks of LLMs, showing a paradigm shift in cybersecurity threats, where AI tools themselves become sophisticated attack vectors.

Authors: Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, Angelo Furfaro

Categories: cs.CR, cs.AI

PDF URL: https://arxiv.org/pdf/2507.06850v3.pdf

Published: 2025-07-09T13:54:58Z


NLP Domain Papers

1. REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives

This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user “steering” queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset’s quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.

Authors: Kun Su, Krishna Sayana, Hubert Pham, James Pine, Yuri Vasilevski, Raghavendra Vasudeva, Marialena Kyriakidi, Liam Hebert, Ambarish Jash, Anushya Subbiah, Sukhdeep Sodhi

Categories: cs.CL, cs.AI, cs.IR, cs.LG

PDF URL: https://arxiv.org/pdf/2503.11924v2.pdf

Published: 2025-03-14T23:47:46Z


2. State-Inference-Based Prompting for Natural Language Trading with Game NPCs

Large Language Models enable dynamic game interactions but struggle with rule-governed trading systems. Current implementations suffer from rule violations, such as item hallucinations and calculation errors, that erode player trust. Here, State-Inference-Based Prompting (SIBP) enables reliable trading through autonomous dialogue state inference and context-specific rule adherence. The approach decomposes trading into six states within a unified prompt framework, implementing context-aware item referencing and placeholder-based price calculations. Evaluation across 100 trading dialogues demonstrates >97% state compliance, >95% referencing accuracy, and 99.7% calculation precision. SIBP maintains computational efficiency while outperforming baseline approaches, establishing a practical foundation for trustworthy NPC interactions in commercial games.

Authors: Minkyung Kim, Junsik Kim, Hwidong Bae, Woongcheol Yang, Sangdon Park, Sohee Bae

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2507.07203v1.pdf

Published: 2025-07-09T18:24:47Z


3. Adaptive Elicitation of Latent Information Using Natural Language

Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.

Authors: Jimmy Wang, Thomas Zollo, Richard Zemel, Hongseok Namkoong

Categories: cs.CL, cs.AI, cs.LG

PDF URL: https://arxiv.org/pdf/2504.04204v2.pdf

Published: 2025-04-05T15:18:55Z


4. Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of dynamical system computational model starting from a corpus of document relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses system modeling language diagrams (SysML) to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the SySML diagrams automated generation, such as: list of key nouns; list of extracted relationships; list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems from SysML diagrams are then obtained via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.

Authors: Matthew Anderson Hendricks, Alice Cicirello

Categories: cs.CL, cs.AI, cs.CE

PDF URL: https://arxiv.org/pdf/2507.06803v1.pdf

Published: 2025-07-09T12:44:49Z


5. AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework

Answering natural language (NL) questions about tables, known as Tabular Question Answering (TQA), is crucial because it allows users to quickly and efficiently extract meaningful insights from structured data, effectively bridging the gap between human language and machine-readable formats. Many of these tables are derived from web sources or real-world scenarios, which require meticulous data preparation (or data prep) to ensure accurate responses. However, preparing such tables for NL questions introduces new requirements that extend beyond traditional data preparation. This question-ware data preparation involves specific tasks such as column derivation and filtering tailored to particular questions, as well as question-aware value normalization or conversion, highlighting the need for a more nuanced approach in this context. Because each of the above tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AutoPrep, a large language model (LLM)-based multiagent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AutoPrep performs data prep through three key components. Planner: Determines a logical plan, outlining a sequence of high-level operations. Programmer: Translates this logical plan into a physical plan by generating the corresponding low-level code. Executor: Executes the generated code to process the table. To support this multi-agent framework, we design a novel Chain-ofClauses reasoning mechanism for high-level operation suggestion, and a tool-augmented method for low-level code generation.

Authors: Meihao Fan, Ju Fan, Nan Tang, Lei Cao, Guoliang Li, Xiaoyong Du

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2412.10422v4.pdf

Published: 2024-12-10T11:03:49Z


6. MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

Recent advances in natural language processing (NLP) have been driven bypretrained language models like BERT, RoBERTa, T5, and GPT. Thesemodels excel at understanding complex texts, but biomedical literature, withits domain-specific terminology, poses challenges that models likeWord2Vec and bidirectional long short-term memory (Bi-LSTM) can’t fullyaddress. GPT and T5, despite capturing context, fall short in tasks needingbidirectional understanding, unlike BERT. Addressing this, we proposedMedicalBERT, a pretrained BERT model trained on a large biomedicaldataset and equipped with domain-specific vocabulary that enhances thecomprehension of biomedical terminology. MedicalBERT model is furtheroptimized and fine-tuned to address diverse tasks, including named entityrecognition, relation extraction, question answering, sentence similarity, anddocument classification. Performance metrics such as the F1-score,accuracy, and Pearson correlation are employed to showcase the efficiencyof our model in comparison to other BERT-based models such as BioBERT,SciBERT, and ClinicalBERT. MedicalBERT outperforms these models onmost of the benchmarks, and surpasses the general-purpose BERT model by5.67% on average across all the tasks evaluated respectively. This work alsounderscores the potential of leveraging pretrained BERT models for medicalNLP tasks, demonstrating the effectiveness of transfer learning techniques incapturing domain-specific information. (PDF) MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model. Available from: https://www.researchgate.net/publication/392489050_MedicalBERT_enhancing_biomedical_natural_language_processing_using_pretrained_BERT-based_model [accessed Jul 06 2025].

Authors: K. Sahit Reddy, N. Ragavenderan, Vasanth K., Ganesh N. Naik, Vishalakshi Prabhu, Nagaraja G. S

Categories: cs.CL, cs.AI, cs.LG

PDF URL: https://arxiv.org/pdf/2507.08013v1.pdf

Published: 2025-07-06T03:38:05Z


7. Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data

Current trends in pre-training Large Language Models (LLMs) primarily focus on the scaling of model and dataset size. While the quality of pre-training data is considered an important factor for training powerful LLMs, it remains a nebulous concept that has not been rigorously characterized. To this end, we propose a formalization of one key aspect of data quality — measuring the variability of natural language data — specifically via a measure we call the diversity coefficient. Our empirical analysis shows that the proposed diversity coefficient aligns with the intuitive properties of diversity and variability, e.g., it increases as the number of latent concepts increases. Then, we measure the diversity coefficient of publicly available pre-training datasets and demonstrate that their formal diversity is high compared to theoretical lower and upper bounds. Finally, we conduct a comprehensive set of controlled interventional experiments with GPT-2 and LLaMAv2 that demonstrate the diversity coefficient of pre-training data characterizes useful aspects of downstream model evaluation performance — totaling 44 models of various sizes (51M to 7B parameters). We conclude that our formal notion of diversity is an important aspect of data quality that captures variability and causally leads to improved evaluation performance.

Authors: Brando Miranda, Alycia Lee, Sudharsan Sundar, Allison Casasola, Rylan Schaeffer, Elyas Obbad, Sanmi Koyejo

Categories: cs.CL, cs.AI, cs.LG, cs.NE

PDF URL: https://arxiv.org/pdf/2306.13840v4.pdf

Published: 2023-06-24T02:25:56Z


8. Natural language processing for African languages

Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and lack of labeled datasets makes it hard to evaluate performance outside high-resource languages like English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts. To address the under-representation of the African languages in NLP research, we developed large scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.

Authors: David Ifeoluwa Adelani

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.00297v1.pdf

Published: 2025-06-30T22:26:36Z


9. Unveiling Privacy Policy Complexity: An Exploratory Study Using Graph Mining, Machine Learning, and Natural Language Processing

Privacy policy documents are often lengthy, complex, and difficult for non-expert users to interpret, leading to a lack of transparency regarding the collection, processing, and sharing of personal data. As concerns over online privacy grow, it is essential to develop automated tools capable of analyzing privacy policies and identifying potential risks. In this study, we explore the potential of interactive graph visualizations to enhance user understanding of privacy policies by representing policy terms as structured graph models. This approach makes complex relationships more accessible and enables users to make informed decisions about their personal data (RQ1). We also employ graph mining algorithms to identify key themes, such as User Activity and Device Information, using dimensionality reduction techniques like t-SNE and PCA to assess clustering effectiveness. Our findings reveal that graph-based clustering improves policy content interpretability. It highlights patterns in user tracking and data sharing, which supports forensic investigations and identifies regulatory non-compliance. This research advances AI-driven tools for auditing privacy policies by integrating interactive visualizations with graph mining. Enhanced transparency fosters accountability and trust.

Authors: Vijayalakshmi Ramasamy, Seth Barrett, Gokila Dorai, Jessica Zumbach

Categories: cs.CR, cs.AI

PDF URL: https://arxiv.org/pdf/2507.02968v1.pdf

Published: 2025-06-30T14:55:57Z


10. StepProof: Step-by-step verification of natural language mathematical proofs

Interactive theorem provers (ITPs) are powerful tools for the formal verification of mathematical proofs down to the axiom level. However, their lack of a natural language interface remains a significant limitation. Recent advancements in large language models (LLMs) have enhanced the understanding of natural language inputs, paving the way for autoformalization - the process of translating natural language proofs into formal proofs that can be verified. Despite these advancements, existing autoformalization approaches are limited to verifying complete proofs and lack the capability for finer, sentence-level verification. To address this gap, we propose StepProof, a novel autoformalization method designed for granular, step-by-step verification. StepProof breaks down complete proofs into multiple verifiable subproofs, enabling sentence-level verification. Experimental results demonstrate that StepProof significantly improves proof success rates and efficiency compared to traditional methods. Additionally, we found that minor manual adjustments to the natural language proofs, tailoring them for step-level verification, further enhanced StepProof’s performance in autoformalization.

Authors: Xiaolin Hu, Qinghua Zhou, Bogdan Grechuk, Ivan Y. Tyukin

Categories: cs.LO, cs.AI

PDF URL: https://arxiv.org/pdf/2506.10558v2.pdf

Published: 2025-06-12T10:31:23Z


AI Domain Papers

1. Drowning in Documents: Consequences of Scaling Reranker Inference

Rerankers, typically cross-encoders, are computationally intensive but are frequently used because they are widely assumed to outperform cheaper initial IR systems. We challenge this assumption by measuring reranker performance for full retrieval, not just re-scoring first-stage retrieval. To provide a more robust evaluation, we prioritize strong first-stage retrieval using modern dense embeddings and test rerankers on a variety of carefully chosen, challenging tasks, including internally curated datasets to avoid contamination, and out-of-domain ones. Our empirical results reveal a surprising trend: the best existing rerankers provide initial improvements when scoring progressively more documents, but their effectiveness gradually declines and can even degrade quality beyond a certain limit. We hope that our findings will spur future research to improve reranking.

Authors: Mathew Jacob, Erik Lindgren, Matei Zaharia, Michael Carbin, Omar Khattab, Andrew Drozdov

Categories: cs.IR, cs.CL, cs.LG

PDF URL: https://arxiv.org/pdf/2411.11767v2.pdf

Published: 2024-11-18T17:46:32Z


2. The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model’s representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed to alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps’ complexity and accuracy. Together, these results suggest an answer to our title’s question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.

Authors: Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel

Categories: cs.LG

PDF URL: https://arxiv.org/pdf/2507.08802v1.pdf

Published: 2025-07-11T17:59:55Z


3. Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.

Authors: Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang

Categories: cs.CV, cs.AI, cs.MM

PDF URL: https://arxiv.org/pdf/2507.08801v1.pdf

Published: 2025-07-11T17:59:42Z


4. NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.

Authors: Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng

Categories: cs.CV, cs.AI, cs.CL, cs.HC, cs.LG

PDF URL: https://arxiv.org/pdf/2507.08800v1.pdf

Published: 2025-07-11T17:59:40Z


5. KV Cache Steering for Inducing Reasoning in Small Language Models

We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.

Authors: Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.08799v1.pdf

Published: 2025-07-11T17:59:36Z


6. Filter Equivariant Functions: A symmetric account of length-general extrapolation on lists

What should a function that extrapolates beyond known input/output examples look like? This is a tricky question to answer in general, as any function matching the outputs on those examples can in principle be a correct extrapolant. We argue that a “good” extrapolant should follow certain kinds of rules, and here we study a particularly appealing criterion for rule-following in list functions: that the function should behave predictably even when certain elements are removed. In functional programming, a standard way to express such removal operations is by using a filter function. Accordingly, our paper introduces a new semantic class of functions — the filter equivariant functions. We show that this class contains interesting examples, prove some basic theorems about it, and relate it to the well-known class of map equivariant functions. We also present a geometric account of filter equivariants, showing how they correspond naturally to certain simplicial structures. Our highlight result is the amalgamation algorithm, which constructs any filter-equivariant function’s output by first studying how it behaves on sublists of the input, in a way that extrapolates perfectly.

Authors: Owen Lewis, Neil Ghani, Andrew Dudzik, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković

Categories: cs.PL, cs.LG

PDF URL: https://arxiv.org/pdf/2507.08796v1.pdf

Published: 2025-07-11T17:57:16Z


7. One Token to Fool LLM-as-a-Judge

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., “:” or “.”) or reasoning openers like “Thought process:” and “Let’s solve this problem step by step.” can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

Authors: Yulai Zhao, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, Dong Yu

Categories: cs.LG, cs.CL

PDF URL: https://arxiv.org/pdf/2507.08794v1.pdf

Published: 2025-07-11T17:55:22Z


8. Optimistic Exploration for Risk-Averse Constrained Reinforcement Learning

Risk-averse Constrained Reinforcement Learning (RaCRL) aims to learn policies that minimise the likelihood of rare and catastrophic constraint violations caused by an environment’s inherent randomness. In general, risk-aversion leads to conservative exploration of the environment which typically results in converging to sub-optimal policies that fail to adequately maximise reward or, in some cases, fail to achieve the goal. In this paper, we propose an exploration-based approach for RaCRL called Optimistic Risk-averse Actor Critic (ORAC), which constructs an exploratory policy by maximising a local upper confidence bound of the state-action reward value function whilst minimising a local lower confidence bound of the risk-averse state-action cost value function. Specifically, at each step, the weighting assigned to the cost value is increased or decreased if it exceeds or falls below the safety constraint value. This way the policy is encouraged to explore uncertain regions of the environment to discover high reward states whilst still satisfying the safety constraints. Our experimental results demonstrate that the ORAC approach prevents convergence to sub-optimal policies and improves significantly the reward-cost trade-off in various continuous control tasks such as Safety-Gymnasium and a complex building energy management environment CityLearn.

Authors: James McCarthy, Radu Marinescu, Elizabeth Daly, Ivana Dusparic

Categories: cs.LG, cs.AI

PDF URL: https://arxiv.org/pdf/2507.08793v1.pdf

Published: 2025-07-11T17:54:54Z


9. Learning-aided Bigraph Matching Approach to Multi-Crew Restoration of Damaged Power Networks Coupled with Road Transportation Networks

The resilience of critical infrastructure networks (CINs) after disruptions, such as those caused by natural hazards, depends on both the speed of restoration and the extent to which operational functionality can be regained. Allocating resources for restoration is a combinatorial optimal planning problem that involves determining which crews will repair specific network nodes and in what order. This paper presents a novel graph-based formulation that merges two interconnected graphs, representing crew and transportation nodes and power grid nodes, into a single heterogeneous graph. To enable efficient planning, graph reinforcement learning (GRL) is integrated with bigraph matching. GRL is utilized to design the incentive function for assigning crews to repair tasks based on the graph-abstracted state of the environment, ensuring generalization across damage scenarios. Two learning techniques are employed: a graph neural network trained using Proximal Policy Optimization and another trained via Neuroevolution. The learned incentive functions inform a bipartite graph that links crews to repair tasks, enabling weighted maximum matching for crew-to-task allocations. An efficient simulation environment that pre-computes optimal node-to-node path plans is used to train the proposed restoration planning methods. An IEEE 8500-bus power distribution test network coupled with a 21 square km transportation network is used as the case study, with scenarios varying in terms of numbers of damaged nodes, depots, and crews. Results demonstrate the approach’s generalizability and scalability across scenarios, with learned policies providing 3-fold better performance than random policies, while also outperforming optimization-based solutions in both computation time (by several orders of magnitude) and power restored.

Authors: Nathan Maurer, Harshal Kaushik, Roshni Anna Jacob, Jie Zhang, Souma Chowdhury

Categories: cs.LG

PDF URL: https://arxiv.org/pdf/2506.19703v2.pdf

Published: 2025-06-24T15:12:45Z


10. Exploring Efficient Quantification of Modeling Uncertainties with Differentiable Physics-Informed Machine Learning Architectures

Quantifying and propagating modeling uncertainties is crucial for reliability analysis, robust optimization, and other model-based algorithmic processes in engineering design and control. Now, physics-informed machine learning (PIML) methods have emerged in recent years as a new alternative to traditional computational modeling and surrogate modeling methods, offering a balance between computing efficiency, modeling accuracy, and interpretability. However, their ability to predict and propagate modeling uncertainties remains mostly unexplored. In this paper, a promising class of auto-differentiable hybrid PIML architectures that combine partial physics and neural networks or ANNs (for input transformation or adaptive parameter estimation) is integrated with Bayesian Neural networks (replacing the ANNs); this is done with the goal to explore whether BNNs can successfully provision uncertainty propagation capabilities in the PIML architectures as well, further supported by the auto-differentiability of these architectures. A two-stage training process is used to alleviate the challenges traditionally encountered in training probabilistic ML models. The resulting BNN-integrated PIML architecture is evaluated on an analytical benchmark problem and flight experiments data for a fixed-wing RC aircraft, with prediction performance observed to be slightly worse or at par with purely data-driven ML and original PIML models. Moreover, Monte Carlo sampling of probabilistic BNN weights was found to be most effective in propagating uncertainty in the BNN-integrated PIML architectures.

Authors: Manaswin Oddiraju, Bharath Varma Penumatsa, Divyang Amin, Michael Piedmonte, Souma Chowdhury

Categories: cs.LG

PDF URL: https://arxiv.org/pdf/2506.18247v2.pdf

Published: 2025-06-23T02:32:20Z