数据来源:ArXiv Domain

LLM Domain Papers

1. Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond “common sense”, rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.

中文摘要

幽默作为一种复杂的语言形式,源于生活的方方面面,而现有的计算幽默研究几乎只关注基于双关语的简短笑话。在这项工作中,我们研究了大型语言模型(LLMs)解释幽默的能力是否取决于特定的幽默形式。我们在简单双关语和需要了解现实世界实体和事件的更复杂的时事幽默上比较了各个模型。在此过程中,我们整理了一个包含600个笑话的数据集,分为4种笑话类型,并手动编写高质量的解释。这些笑话包括异形双关语和同形双关语、当代网络幽默以及时事笑话;对时事笑话的理解依赖于超越"常识"的推理,而是植根于有关新闻事件和流行文化的世界知识。使用这个数据集,我们比较了一系列LLM准确、全面地解释不同类型笑话的零样本能力,确定了幽默解释任务中的关键研究空白。我们发现,没有一个被测模型(包括推理模型)能够可靠地为所有笑话类型生成充分的解释,这进一步凸显了大多数计算幽默研究对过于简单的笑话形式的狭隘关注。

Authors: Tyler Loakman, William Thorne, Chenghua Lin

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.13335v1.pdf

Published: 2025-07-17T17:51:20Z
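
The zero-shot evaluation setup described in this abstract (prompting a model to explain jokes from four types and comparing the explanations to hand-written ones) can be pictured with a minimal harness like the sketch below. The generate() stub, the prompt wording, and the dataset fields are illustrative assumptions, not the authors' actual setup.

    # Minimal sketch of a zero-shot joke-explanation harness.
    # generate() stands in for whatever LLM API is used; the prompt wording
    # and dataset fields are placeholders, not the paper's data or template.

    JOKE_TYPES = ["heterographic_pun", "homographic_pun", "internet_humour", "topical"]

    def generate(prompt: str) -> str:
        """Placeholder for an LLM call (e.g. a chat-completion endpoint)."""
        raise NotImplementedError

    def explain_joke(joke: str) -> str:
        prompt = (
            "Explain why the following joke is funny. "
            "Identify the wordplay or the real-world knowledge it relies on.\n\n"
            f"Joke: {joke}\nExplanation:"
        )
        return generate(prompt)

    def run(dataset):  # dataset: list of {"type": ..., "joke": ..., "gold_explanation": ...}
        outputs = []
        for item in dataset:
            outputs.append({**item, "model_explanation": explain_joke(item["joke"])})
        return outputs  # explanations are then judged against the gold ones, per joke type
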


2. A Survey of Context Engineering for Large Language Models

The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

中文摘要

大型语言模型(LLM)的性能从根本上取决于推理过程中提供的上下文信息。本综述介绍了上下文工程,这是一门超越简单提示设计的正式学科,涵盖对LLM信息载荷的系统性优化。我们提出了一个全面的分类法,将上下文工程分解为其基础组件,以及将它们集成到智能系统中的复杂实现。我们首先研究基础组件:上下文检索与生成、上下文处理和上下文管理。然后,我们探索这些组件如何在架构上集成以构建复杂的系统实现:检索增强生成(RAG)、记忆系统与工具集成推理,以及多智能体系统。通过对1300多篇研究论文的系统分析,我们的综述不仅为该领域建立了技术路线图,还揭示了一个关键的研究空白:模型能力之间存在根本性的不对称。虽然当前模型在高级上下文工程的增强下,在理解复杂上下文方面表现出非凡的能力,但在生成同样复杂的长篇输出方面却存在明显的局限性。弥合这一差距是未来研究的首要任务。最终,本综述为推进上下文感知人工智能的研究人员和工程师提供了一个统一的框架。

Authors: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.13334v1.pdf

Published: 2025-07-17T17:50:36Z
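
One of the "sophisticated system implementations" named in this survey, retrieval-augmented generation, can be sketched as a small context-engineering pipeline: retrieve, manage the context window under a budget, then generate. The lexical scoring function, the token budget, and the generate() stub below are illustrative assumptions, not any method from the survey.

    # Minimal sketch of a retrieve -> manage-context -> generate pipeline.
    from collections import Counter

    def score(query: str, doc: str) -> float:
        """Toy lexical relevance: bag-of-words overlap (a real system would use embeddings)."""
        q, d = Counter(query.lower().split()), Counter(doc.lower().split())
        return sum((q & d).values())

    def build_context(query: str, corpus: list[str], token_budget: int = 256) -> str:
        """Context retrieval + management: rank documents and pack them under a budget."""
        ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
        picked, used = [], 0
        for doc in ranked:
            n = len(doc.split())
            if used + n > token_budget:
                break
            picked.append(doc)
            used += n
        return "\n\n".join(picked)

    def generate(prompt: str) -> str:      # stub so the sketch is self-contained
        return "<model output>"

    def answer(query: str, corpus: list[str]) -> str:
        context = build_context(query, corpus)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return generate(prompt)
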


3. The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgement. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues, as there is a large and growing number of models to evaluate, making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. Another alternative is the use of public arenas, such as the popular LM Arena, on which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption, and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that, for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.

中文摘要

大型语言模型的评估是一项复杂的任务,目前已经提出了几种方法。最常见的是使用自动基准测试,让LLM回答不同主题的多项选择题。然而,这种方法有一定的局限性,最令人担忧的是与人类判断的相关性较差。另一种方法是让人类评估LLM。这带来了可扩展性问题:需要评估的模型数量庞大且不断增长,通过招募多名评估人员并让他们对模型的回答进行排名来开展传统研究,既不切实际,成本也很高。还有一种方法是使用公共竞技场,例如流行的LM Arena,任何用户都可以在其中就任何问题自由评估模型,并对两个模型的回答进行排名,结果再汇总为模型排名。LLM的一个日益重要的方面是它们的能耗,因此,评估能耗意识如何影响人类选择模型的决策是有意义的。在本文中,我们提出了GEA(生成式能源竞技场),这是一个在评估过程中纳入模型能耗信息的竞技场。我们还介绍了GEA的初步结果,表明对于大多数问题,当用户了解能耗时,他们更偏好更小、更节能的模型。这表明,对于大多数用户交互而言,更复杂、性能最强的模型所带来的额外成本和能耗,并没有带来足以证明其使用合理的响应感知质量提升。

Authors: Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego

Categories: cs.AI, cs.CL

PDF URL: https://arxiv.org/pdf/2507.13302v1.pdf

Published: 2025-07-17T17:11:14Z
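
Arenas like the one described above aggregate pairwise user votes into a model ranking. A standard way to do that is an Elo-style update over the vote stream; the sketch below illustrates that idea. The vote tuples, starting ratings, and K-factor are assumptions, and GEA's exact aggregation (and how energy information is displayed) is not claimed here.

    # Sketch: turning pairwise arena votes into a ranking with Elo updates.
    def expected(r_a: float, r_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
        ra, rb = ratings[winner], ratings[loser]
        ea = expected(ra, rb)
        ratings[winner] = ra + k * (1 - ea)
        ratings[loser] = rb - k * (1 - ea)

    # votes: (model_a, model_b, choice); users saw responses plus energy estimates
    votes = [("small-efficient", "large-flagship", "a"),
             ("large-flagship", "small-efficient", "b"),
             ("small-efficient", "large-flagship", "a")]

    ratings = {"small-efficient": 1000.0, "large-flagship": 1000.0}
    for a, b, choice in votes:
        winner, loser = (a, b) if choice == "a" else (b, a)
        update(ratings, winner, loser)

    print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
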


4. AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.

中文摘要

我们介绍了AbGen,这是第一个旨在评估LLM为科学研究设计消融实验能力的基准。AbGen由来自807篇NLP论文的1500个专家标注示例组成。在此基准测试中,LLM的任务是根据给定的研究背景,为指定的模块或流程生成详细的消融实验设计。我们对DeepSeek-R1-0528和o4-mini等领先LLM的评估表明,这些模型与人类专家在消融实验设计的重要性、忠实性和合理性方面存在显著的性能差距。此外,我们证明目前的自动评估方法对该任务并不可靠,因为它们与人工评估相比显示出明显的差异。为了更深入地研究这一点,我们开发了AbGen-Eval,这是一个元评估基准,旨在评估常用自动评估系统在衡量LLM完成该任务表现方面的可靠性。我们在AbGen-Eval上研究了多种LLM-as-Judge系统,为未来针对复杂科学任务开发更有效、更可靠的基于LLM的评估系统的研究提供了见解。

Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.13300v1.pdf

Published: 2025-07-17T17:09:22Z
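
The meta-evaluation idea behind AbGen-Eval, comparing an automated judge's scores against human scores for the same designs, can be illustrated with a small agreement computation. The scores below are invented, and Kendall's tau is shown as one standard agreement statistic, not necessarily the exact metric used in the paper.

    # Sketch: measuring agreement between an LLM-as-Judge and human experts.
    from itertools import combinations

    def kendall_tau(x: list[float], y: list[float]) -> float:
        concordant = discordant = 0
        for i, j in combinations(range(len(x)), 2):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
        pairs = len(x) * (len(x) - 1) / 2
        return (concordant - discordant) / pairs

    human_scores = [4.5, 2.0, 3.5, 1.0, 5.0]   # hypothetical expert ratings of designs
    judge_scores = [4.0, 3.5, 3.0, 1.5, 4.5]   # hypothetical automated-judge ratings

    print(f"Kendall tau = {kendall_tau(human_scores, judge_scores):.2f}")
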


5. QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.

中文摘要

强化学习(RL)已成为训练大型语言推理模型(LLM)的关键组成部分。然而,最近的研究质疑它在改进多步推理方面的有效性,特别是在难题上。为了应对这一挑战,我们通过问题增强提出了一种简单而有效的策略:在训练过程中引入部分解答,以降低问题难度,并提供信息量更大的学习信号。我们的方法QuestA在数学推理任务的强化学习训练中应用时,不仅提高了pass@1,还提高了pass@k,尤其是在标准强化学习难以取得进展的问题上。这使得DeepScaleR和OpenMath Nemotron等强大的开源模型能够持续改进,进一步增强了它们的推理能力。我们使用1.5B参数模型在数学基准测试中取得了新的最先进成果:AIME24为67.1%(+5.3%),AIME25为59.5%(+10.0%),HMMT25为35.5%(+4.0%)。此外,我们提供了QuestA提高样本效率的理论解释,为通过RL扩展推理能力提供了一条实用且可推广的途径。

Authors: Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, Jingzhao Zhang

Categories: cs.CL, cs.AI, 68T50

PDF URL: https://arxiv.org/pdf/2507.13266v1.pdf

Published: 2025-07-17T16:21:47Z
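
Two pieces referenced in this abstract can be made concrete with a short sketch: the standard unbiased pass@k estimator (the statistic pass@1 and pass@k numbers are typically computed with), and the question-augmentation idea of prepending a partial solution to reduce difficulty. The prompt format below is illustrative, not QuestA's actual template.

    # Sketch: pass@k estimation and partial-solution question augmentation.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """n samples drawn, c correct; probability that at least one of k draws is correct."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    def augment_question(problem: str, partial_solution: str) -> str:
        """Reduce difficulty by giving the model a verified prefix of a solution."""
        return (f"{problem}\n\nHere is the beginning of a correct solution:\n"
                f"{partial_solution}\n\nContinue from here and finish the solution.")

    print(pass_at_k(n=16, c=2, k=1))   # 0.125
    print(pass_at_k(n=16, c=2, k=8))   # much higher with 8 attempts
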


6. Automating Steering for Safe Multimodal Large Language Models

Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model’s internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.

中文摘要

多模态大型语言模型(MLLM)的最新进展释放了强大的跨模态推理能力,但也引发了新的安全问题,特别是在面对对抗性多模态输入时。为了提高MLLM在推理过程中的安全性,我们引入了一种模块化、自适应的推理时干预技术AutoSteer,无需对底层模型进行任何微调。AutoSteer包含三个核心组件:(1)新颖的安全意识评分(SAS),可自动识别模型内部各层中与安全最相关的区分;(2)经过训练的自适应安全探测器,用于从中间表示估计产生有毒输出的可能性;以及(3)轻量级的拒绝头(Refusal Head),在检测到安全风险时选择性地干预以调节生成。在LLaVA-OV和Chameleon上进行的多种安全关键基准测试表明,AutoSteer显著降低了文本、视觉和跨模态威胁的攻击成功率(ASR),同时保持了通用能力。这些发现将AutoSteer定位为一个实用、可解释且有效的框架,用于更安全地部署多模态人工智能系统。

Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng

Categories: cs.CL, cs.AI, cs.IR, cs.LG, cs.MM

PDF URL: https://arxiv.org/pdf/2507.13255v1.pdf

Published: 2025-07-17T16:04:55Z
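
The inference-time intervention pattern described above (a prober scores an intermediate representation for risk, and a refusal path is taken when the score crosses a threshold) can be pictured with the sketch below. The probe weights, feature vector, and threshold are invented placeholders; in AutoSteer the layer is chosen via the SAS score and the prober is learned.

    # Sketch: probe an intermediate representation, refuse if the risk is high.
    import math

    def probe_toxicity(hidden: list[float], weights: list[float], bias: float) -> float:
        """Logistic probe over one intermediate-layer representation."""
        z = sum(h * w for h, w in zip(hidden, weights)) + bias
        return 1.0 / (1.0 + math.exp(-z))

    def guarded_generate(hidden: list[float], weights: list[float], bias: float,
                         threshold: float = 0.5) -> str:
        risk = probe_toxicity(hidden, weights, bias)
        if risk >= threshold:
            # Refusal path: steer generation toward a safe refusal instead of the raw output.
            return "I can't help with that request."
        return "<continue normal generation>"

    hidden_state = [0.3, -1.2, 0.8, 0.05]           # stand-in for a layer activation
    probe_w, probe_b = [0.9, -0.4, 1.1, 0.2], -0.3  # stand-in for trained prober parameters
    print(guarded_generate(hidden_state, probe_w, probe_b))
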


7. ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs

Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose ConTextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.

中文摘要

非结构化临床数据可以作为独特而丰富的信息来源,为临床实践提供有意义的信息。从这些数据中提取最相关的上下文,对于发挥其在患者护理中实现最佳和及时决策的真正潜力至关重要。虽然之前的研究已经探索了多种临床文本摘要方法,但大多数先前的研究要么统一处理所有输入标记,要么依赖基于启发式的过滤器,这可能会忽视细微的临床线索,并且无法优先考虑对决策至关重要的信息。在这项研究中,我们提出了ConTextual,这是一个新颖的框架,将保留上下文的标记过滤方法与特定领域知识图谱(KG)相结合,用于上下文增强。通过保留特定上下文中的重要标记并用结构化知识加以丰富,ConTextual同时提高了语言连贯性和临床保真度。我们在两个公共基准数据集上的大量实证评估表明,ConTextual始终优于其他基线。我们提出的方法凸显了标记级过滤和结构化检索在增强语言和临床完整性方面的互补作用,并为提高临床文本生成的精度提供了可扩展的解决方案。

Authors: Fahmida Liza Piya, Rahmatollah Beheshti

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2504.16394v3.pdf

Published: 2025-04-23T03:42:46Z
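
The two ingredients named in this abstract, keeping the most important tokens when compressing a note and enriching the kept text with knowledge-graph facts, can be sketched as below. The importance scores, the tiny knowledge graph, and the example note are invented for illustration; ConTextual's actual filtering method and KG are learned/curated.

    # Sketch: importance-based token filtering followed by knowledge-graph augmentation.
    CLINICAL_KG = {  # hypothetical (entity -> related facts) mapping
        "metformin": ["metformin treats type 2 diabetes"],
        "hba1c": ["HbA1c reflects average blood glucose over ~3 months"],
    }

    def filter_tokens(tokens: list[str], importance: dict[str, float], keep_ratio: float = 0.5):
        """Keep the highest-importance tokens while preserving their original order."""
        k = max(1, int(len(tokens) * keep_ratio))
        keep = set(sorted(tokens, key=lambda t: importance.get(t, 0.0), reverse=True)[:k])
        return [t for t in tokens if t in keep]

    def augment_with_kg(tokens: list[str]) -> list[str]:
        facts = [fact for t in tokens for fact in CLINICAL_KG.get(t.lower(), [])]
        return tokens + ["[KG]"] + facts if facts else tokens

    note = "patient on metformin , hba1c improved , denies chest pain".split()
    scores = {"metformin": 0.9, "hba1c": 0.95, "improved": 0.6, "chest": 0.7, "pain": 0.7}
    print(augment_with_kg(filter_tokens(note, scores)))
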


8. HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

Analogies test a model’s ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

中文摘要

类比测试模型推断概念之间隐含关系的能力,使其成为评估推理能力的关键基准。虽然大型语言模型(LLM)在英语推理方面得到了广泛评估,但它们在印度语言上的能力仍未得到充分研究,这限制了我们对这些模型能否跨语言泛化的理解。为了填补这一空白,我们引入了一个新的印地语类比测试集(HATS),其中包括405道来自印度政府考试的多项选择题。我们使用多种提示策略对最先进的多语言LLM进行基准测试,并引入了一种利用类比推理认知理论的接地思维链(grounded Chain of Thought)方法。这种方法提高了模型在印地语类比问题上的表现。我们的实验表明,无论采用何种提示策略,模型在英语提示下表现最佳。我们的测试集填补了评估LLM印地语推理能力所缺乏的关键资源。

Authors: Ashray Gupta, Rohan Joseph, Sunny Rai

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.13238v1.pdf

Published: 2025-07-17T15:47:49Z
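
A relation-first ("grounded") chain-of-thought prompt for a multiple-choice analogy item, in the spirit of the approach described above, might be constructed as in the sketch below. The example item and wording are placeholders, not drawn from HATS.

    # Sketch: building a relation-first analogy prompt for a multiple-choice item.
    def analogy_prompt(a: str, b: str, c: str, options: list[str]) -> str:
        opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
        return (
            f"{a} : {b} :: {c} : ?\n{opts}\n\n"
            "First state the relation between the first pair, then apply the same "
            "relation to the third term, and finally answer with a single option letter."
        )

    print(analogy_prompt("bird", "nest", "bee", ["hive", "honey", "flower", "wing"]))
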


9. Enhancing Cross-task Transfer of Large Language Models via Activation Steering

Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model’s internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.

中文摘要

大型语言模型(LLM)在通过提示利用预训练知识方面表现出了令人印象深刻的能力,但它们经常难以完成未见过的任务,特别是在数据稀缺的情况下。虽然跨任务上下文学习为跨任务迁移知识提供了一种直接的解决方案,但它在鲁棒性、可扩展性和效率方面仍然面临严峻挑战。本文研究了在不更新参数或扩展输入的情况下,能否通过潜在空间引导实现跨任务迁移。通过分析LLM潜在空间中的激活模式,我们观察到由上下文示例引起的增强激活在不同任务中具有一致的模式。受这些发现的启发,我们提出了CAST,这是一种新颖的跨任务激活引导迁移(Cross-task Activation Steering Transfer)框架,通过操纵模型的内部激活状态来实现有效迁移。我们的方法首先从高资源任务中选择有影响力且多样化的样本,然后利用它们经对比表示增强的激活,使LLM适应低资源任务。在跨领域和跨语言迁移设置上的大量实验表明,我们的方法优于有竞争力的基线,并表现出更好的可扩展性和更低的计算成本。

Authors: Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.13236v1.pdf

Published: 2025-07-17T15:47:22Z
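
The activation-steering idea behind CAST can be sketched as follows: derive a steering direction from the difference between activations obtained with and without in-context examples on a high-resource task, then add that direction (scaled) to the hidden state when running the low-resource task. The vectors below are toy stand-ins, and where the layer is hooked and how samples are selected are exactly the parts the paper contributes, not shown here.

    # Sketch: contrastive steering vector and its application to a hidden state.
    def mean(vectors: list[list[float]]) -> list[float]:
        n = len(vectors)
        return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

    def steering_vector(acts_with_icl, acts_without_icl) -> list[float]:
        m_with, m_without = mean(acts_with_icl), mean(acts_without_icl)
        return [a - b for a, b in zip(m_with, m_without)]

    def steer(hidden: list[float], direction: list[float], alpha: float = 1.0) -> list[float]:
        return [h + alpha * d for h, d in zip(hidden, direction)]

    acts_icl = [[0.9, 0.1, 0.4], [1.1, 0.0, 0.5]]     # activations with in-context examples
    acts_plain = [[0.4, 0.1, 0.1], [0.6, 0.2, 0.2]]   # activations on the bare prompt
    v = steering_vector(acts_icl, acts_plain)
    print(steer([0.5, 0.3, 0.2], v, alpha=0.8))
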


10. A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans

The following paper introduces a general linguistic creativity test for humans and Large Language Models (LLMs). The test consists of various tasks aimed at assessing their ability to generate new original words and phrases based on word formation processes (derivation and compounding) and on metaphorical language use. We administered the test to 24 humans and to an equal number of LLMs, and we automatically evaluated their answers using OCSAI tool for three criteria: Originality, Elaboration, and Flexibility. The results show that LLMs not only outperformed humans in all the assessed criteria, but did better in six out of the eight test tasks. We then computed the uniqueness of the individual answers, which showed some minor differences between humans and LLMs. Finally, we performed a short manual analysis of the dataset, which revealed that humans are more inclined towards E(extending)-creativity, while LLMs favor F(ixed)-creativity.

中文摘要

本文介绍了一种针对人类和大型语言模型(LLMs)的通用语言创造力测试。该测试包括多种任务,旨在评估其基于构词过程(派生和复合)以及隐喻性语言使用生成新颖原创单词和短语的能力。我们对24名人类和同等数量的LLM进行了测试,并使用OCSAI工具按三个标准自动评估了他们的答案:原创性、精细化和灵活性。结果表明,LLM不仅在所有评估标准上都优于人类,而且在八项测试任务中的六项中表现更好。然后,我们计算了个体答案的独特性,结果显示人类和LLM之间存在一些细微差异。最后,我们对数据集进行了简短的人工分析,结果表明,人类更倾向于E(扩展型)创造力,而LLM更偏向F(固定型)创造力。

Authors: Anca Dinu, Andra-Maria Florescu, Alina Resceanu

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.12039v2.pdf

Published: 2025-07-16T08:56:19Z


Agent Domain Papers

1. V-Max: A Reinforcement Learning Framework for Autonomous Driving

Learning-based decision-making has the potential to enable generalizable Autonomous Driving (AD) policies, reducing the engineering overhead of rule-based approaches. Imitation Learning (IL) remains the dominant paradigm, benefiting from large-scale human demonstration datasets, but it suffers from inherent limitations such as distribution shift and imitation gaps. Reinforcement Learning (RL) presents a promising alternative, yet its adoption in AD remains limited due to the lack of standardized and efficient research frameworks. To this end, we introduce V-Max, an open research framework providing all the necessary tools to make RL practical for AD. V-Max is built on Waymax, a hardware-accelerated AD simulator designed for large-scale experimentation. We extend it using ScenarioNet’s approach, enabling the fast simulation of diverse AD datasets.

中文摘要

基于学习的决策有可能实现可泛化的自动驾驶(AD)策略,减少基于规则的方法的工程开销。模仿学习(IL)仍然是主流范式,受益于大规模的人类演示数据集,但它存在分布偏移和模仿差距等固有局限。强化学习(RL)提供了一种有前景的替代方案,但由于缺乏标准化且高效的研究框架,其在AD中的应用仍然有限。为此,我们介绍了V-Max,这是一个开放的研究框架,提供了使RL在AD中切实可用所需的全部工具。V-Max建立在Waymax之上,后者是一个为大规模实验设计的硬件加速AD模拟器。我们采用ScenarioNet的方法对其进行了扩展,实现了对多样化AD数据集的快速仿真。

Authors: Valentin Charraut, Waël Doulazmi, Thomas Tournaire, Thibault Buhet

Categories: cs.LG, cs.AI, cs.RO

PDF URL: https://arxiv.org/pdf/2503.08388v3.pdf

Published: 2025-03-11T12:53:24Z
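
A framework like this ultimately wraps the familiar RL interaction loop: reset an episode from a logged driving scenario, step a policy, collect rewards. The sketch below uses a toy placeholder environment and random policy, not the actual Waymax/V-Max API.

    # Generic sketch of the RL rollout loop such a framework standardizes.
    import random

    class ToyDrivingEnv:
        def reset(self):
            self.t = 0
            return [0.0, 0.0]                       # toy observation
        def step(self, action):
            self.t += 1
            obs = [self.t * 0.1, action]
            reward = 1.0 - abs(action)              # toy reward: prefer small steering
            done = self.t >= 10
            return obs, reward, done

    def rollout(env, policy):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            obs, r, done = env.step(policy(obs))
            total += r
        return total

    policy = lambda obs: random.uniform(-0.2, 0.2)  # stand-in for a learned policy
    print(rollout(ToyDrivingEnv(), policy))
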


2. Black Box Deployed — Functional Criteria for Artificial Moral Agents in the LLM Era

The advancement of powerful yet opaque large language models (LLMs) necessitates a fundamental revision of the philosophical criteria used to evaluate artificial moral agents (AMAs). Pre-LLM frameworks often relied on the assumption of transparent architectures, which LLMs defy due to their stochastic outputs and opaque internal states. This paper argues that traditional ethical criteria are pragmatically obsolete for LLMs due to this mismatch. Engaging with core themes in the philosophy of technology, this paper proffers a revised set of ten functional criteria to evaluate LLM-based artificial moral agents: moral concordance, context sensitivity, normative integrity, metaethical awareness, system resilience, trustworthiness, corrigibility, partial transparency, functional autonomy, and moral imagination. These guideposts, applied to what we term “SMA-LLS” (Simulating Moral Agency through Large Language Systems), aim to steer AMAs toward greater alignment and beneficial societal integration in the coming years. We illustrate these criteria using hypothetical scenarios involving an autonomous public bus (APB) to demonstrate their practical applicability in morally salient contexts.

中文摘要

强大但不透明的大型语言模型(LLM)的发展,要求对用于评估人工道德主体(AMA)的哲学标准进行根本性的修订。LLM出现之前的框架通常依赖于架构透明的假设,而LLM由于其随机输出和不透明的内部状态而打破了这一假设。本文认为,由于这种不匹配,传统的伦理标准对LLM而言在实践上已经过时。本文结合技术哲学的核心主题,提出了一套修订后的十项功能标准来评估基于LLM的人工道德主体:道德一致性、情境敏感性、规范完整性、元伦理意识、系统弹性、可信度、可纠正性、部分透明度、功能自主性和道德想象力。这些指导方针适用于我们所称的"SMA-LLS"(通过大型语言系统模拟道德能动性),旨在引导AMA在未来几年实现更好的对齐和有益的社会融合。我们通过涉及自动驾驶公共巴士(APB)的假设场景来说明这些标准,以展示它们在道德攸关情境下的实际适用性。

Authors: Matthew E. Brophy

Categories: cs.AI, 68T27, 03B42, I.2.0; I.2.9; K.4.1

PDF URL: https://arxiv.org/pdf/2507.13175v1.pdf

Published: 2025-07-17T14:39:29Z


The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.

中文摘要

静态基准与现实世界法律实践的动态性质之间的差距,是推进法律智能的关键障碍。为此,我们介绍了J1-ENVS,这是第一个为基于LLM的智能体量身定制的交互式动态法律环境。在法律专家的指导下,它包含来自中国法律实践的六个代表性场景,覆盖三个环境复杂度层次。我们进一步介绍了J1-EVAL,这是一个细粒度的评估框架,旨在评估不同法律熟练程度下的任务表现和程序合规性。对17个LLM智能体的大量实验表明,虽然许多模型展示了扎实的法律知识,但它们在动态环境中难以完成程序性执行。即使是SOTA模型GPT-4o,整体表现也不足60%。这些发现凸显了实现动态法律智能的持续挑战,并为指导未来研究提供了宝贵的见解。

Authors: Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Yun Song, Zhongyu Wei

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2507.04037v2.pdf

Published: 2025-07-05T13:31:21Z


4. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.

中文摘要

现代语言智能体必须在长时程、多轮交互中运行:检索外部信息、根据观察进行调整,并回答相互依赖的查询。然而,大多数LLM系统依赖完整上下文提示,无论相关性如何都附加所有历史回合。这会导致内存无限增长、计算成本增加,以及在分布外输入长度上的推理性能下降。我们介绍了MEM1,这是一个端到端的强化学习框架,使智能体能够在长的多轮任务中以恒定的内存运行。在每一轮,MEM1都会更新一个紧凑的共享内部状态,同时支持记忆整合和推理。该状态将先前的记忆与来自环境的新观察相结合,同时有策略地丢弃不相关或冗余的信息。为了支持在更真实、更具组合性的环境中进行训练,我们提出了一种简单而有效且可扩展的方法,通过将现有数据集组合成任意复杂的任务序列来构建多轮环境。跨三个领域(包括内部检索QA、开放域网络QA和多轮网络购物)的实验表明,在16目标多跳QA任务上,MEM1-7B相比Qwen2.5-14B-Instruct性能提高了3.5倍,同时内存使用降低了3.7倍,并能泛化到训练范围之外。我们的结果表明,推理驱动的记忆整合有望成为训练长时程交互智能体的现有方案的可扩展替代,同时优化效率和性能。

Authors: Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang

Categories: cs.CL, cs.AI, cs.IR

PDF URL: https://arxiv.org/pdf/2506.15841v2.pdf

Published: 2025-06-18T19:44:46Z
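
The constant-memory interaction pattern MEM1 trains for can be sketched as an agent loop that, instead of appending every past turn, rewrites one compact internal state from (previous state, new observation) at each step. The consolidate() and act() stubs below are hand-written placeholders; in MEM1 this compression behaviour is learned end-to-end with RL.

    # Sketch: constant-memory agent loop with a rewritten internal state per turn.
    def consolidate(state: str, observation: str, max_chars: int = 200) -> str:
        """Placeholder: merge old state and new observation, then truncate to a fixed budget."""
        merged = f"{state} | {observation}".strip(" |")
        return merged[-max_chars:]                   # bounded memory, oldest detail dropped

    def act(state: str) -> str:
        return f"next_query_based_on({state[:30]}...)"  # stand-in for the model's action

    observations = [f"doc snippet {i}" for i in range(50)]
    state = ""
    for obs in observations:
        state = consolidate(state, obs)              # memory stays bounded regardless of turns
        action = act(state)
    print(len(state), action)
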


5. Coral Protocol: Open Infrastructure Connecting The Internet of Agents

Coral Protocol is an open and decentralized collaboration infrastructure that enables communication, coordination, trust and payments for The Internet of Agents. It addresses the growing need for interoperability in a world where organizations are deploying multiple specialized AI agents that must work together across domains and vendors. As a foundational platform for multi-agent AI ecosystems, Coral establishes a common language and coordination framework allowing any agent to participate in complex workflows with others. Its design emphasizes broad compatibility, security, and vendor neutrality, ensuring that agent interactions are efficient and trustworthy. In particular, Coral introduces standardized messaging formats for agent communication, a modular coordination mechanism for orchestrating multi-agent tasks, and secure team formation capabilities for dynamically assembling trusted groups of agents. Together, these innovations position Coral Protocol as a cornerstone of the emerging “Internet of Agents,” unlocking new levels of automation, collective intelligence, and business value through open agent collaboration.

中文摘要

Coral Protocol是一个开放和去中心化的协作基础设施,为代理互联网提供通信、协调、信任和支付。它解决了在一个组织正在部署多个专业人工智能代理的世界中日益增长的互操作性需求,这些代理必须跨域和供应商协同工作。作为多智能体人工智能生态系统的基础平台,Coral建立了一个通用的语言和协调框架,允许任何智能体与其他智能体一起参与复杂的工作流程。它的设计强调广泛的兼容性、安全性和供应商中立性,确保代理交互高效可靠。特别是,Coral引入了用于代理通信的标准化消息格式、用于编排多代理任务的模块化协调机制,以及用于动态组装可信代理组的安全团队组建功能。这些创新共同将Coral协议定位为新兴的“代理互联网”的基石,通过开放的代理协作,将自动化、集体智能和商业价值提升到新的水平。

Authors: Roman J. Georgio, Caelum Forder, Suman Deb, Andri Rahimov, Peter Carroll, Önder Gürcan

Categories: cs.MA, cs.AI

PDF URL: https://arxiv.org/pdf/2505.00749v2.pdf

Published: 2025-04-30T22:17:13Z
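
To picture the "standardized messaging formats for agent communication" the abstract mentions, the sketch below defines a hypothetical message envelope and serializes it to JSON. The field names and schema are invented for illustration and are not Coral Protocol's actual message format.

    # Illustrative sketch of an inter-agent message envelope (hypothetical schema).
    import json
    from dataclasses import dataclass, asdict, field
    from uuid import uuid4

    @dataclass
    class AgentMessage:
        sender: str
        recipient: str
        performative: str                 # e.g. "request", "inform", "propose"
        payload: dict
        thread_id: str = field(default_factory=lambda: str(uuid4()))

    msg = AgentMessage(
        sender="planner-agent",
        recipient="billing-agent",
        performative="request",
        payload={"task": "quote", "currency": "USD"},
    )
    print(json.dumps(asdict(msg), indent=2))
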


6. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

中文摘要

在本报告中,我们介绍Gemini 2.X模型家族:Gemini 2.5 Pro和Gemini 2.5 Flash,以及我们早期的Gemini 2.0 Flash和Flash-Lite模型。Gemini 2.5 Pro是我们迄今为止能力最强的模型,在前沿编码和推理基准上实现了SoTA性能。除了出色的编码和推理能力外,Gemini 2.5 Pro还是一个擅长多模态理解的思维模型,现在能够处理长达3小时的视频内容。其独特的长上下文、多模态和推理能力可以组合起来,解锁新的智能体工作流程。Gemini 2.5 Flash以极低的计算和延迟开销提供出色的推理能力,Gemini 2.0 Flash和Flash-Lite则以低延迟和低成本提供高性能。总的来说,Gemini 2.X这一代模型覆盖了模型能力与成本的完整帕累托前沿,允许用户探索复杂智能体问题求解的可能性边界。

Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, Sławek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendonça, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina, Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D’Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria Ćurko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, Brendan 
McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D’olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel Wright, Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O’Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, Jörg Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, Sébastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei, Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Olšák, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug DeCarlo, Andrew 
Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bošnjak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Guía, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Mikołaj Rybiński, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, Artiom Myaskovsky, Vikas Raunak, Blanca Huergo, Behnam Neyshabur, Jon Clark, Ye Zhang, Shankar Krishnan, Eden Cohen, Dinesh Tewari, James Lottes, Yumeya Yamamori, Hui, Li, Mohamed Elhawaty, Ada Maksutaj Oflazer, Adrià Recasens, Sheryl Luo, Duy Nguyen, Taylor Bos, Kalyan Andra, Ana Salazar, Ed Chi, Jeongwoo Ko, Matt Ginsberg, Anders Andreassen, Anian Ruoss, Todor Davchev, Elnaz Davoodi, Chenxi Liu, Min Kim, Santiago Ontanon, Chi Ming To, Dawei Jia, Rosemary Ke, Jing Wang, Anna Korsun, Moran Ambar, Ilya Kornakov, Irene Giannoumis, Toni Creswell, Denny Zhou, Yi Su, Ishaan Watts, Aleksandr Zaks, Evgenii Eltyshev, Ziqiang Feng, Sidharth Mudgal, Alex Kaskasoli, Juliette Love, Kingshuk Dasgupta, Sam Shleifer, Richard Green, Sungyong Seo, Chansoo Lee, Dale Webster, Prakash Shroff, Ganna Raboshchuk, Isabel Leal, James Manyika, Sofia Erell, Daniel Murphy, Zhisheng Xiao, Anton Bulyenov, Julian Walker, Mark Collier, Matej Kastelic, Nelson George, Sushant Prakash, Sailesh Sidhwani, Alexey Frolov, Steven Hansen, Petko Georgiev, Tiberiu Sosea, Chris Apps, Aishwarya Kamath, David Reid, Emma Cooney, Charlotte Magister, Oriana Riva, Alec Go, Pu-Chin Chen, Sebastian Krause, Nir Levine, Marco Fornoni, Ilya Figotin, Nick Roy, Parsa Mahmoudieh, Vladimir Magay, Mukundan Madhavan, Jin Miao, Jianmo Ni, Yasuhisa Fujii, Ian Chou, George Scrivener, Zak Tsai, Siobhan Mcloughlin, Jeremy Selier, Sandra Lefdal, Jeffrey Zhao, Abhijit Karmarkar, Kushal Chauhan, Shivanker Goel, Zhaoyi Zhang, Vihan Jain, Parisa Haghani, Mostafa Dehghani, Jacob Scott, Erin Farnese, Anastasija Ilić, Steven Baker, Julia Pawar, Li Zhong, Josh Camp, Yoel Zeldes, Shravya Shetty, Anand Iyer, Vít Listík, Jiaxian Guo, Luming Tang, Mark Geller, Simon Bucher, Yifan Ding, Hongzhi Shi, Carrie Muir, Dominik Grewe, Ramy Eskander, Octavio Ponce, Boqing Gong, Derek Gasaway, Samira Khan, Umang Gupta, Angelos Filos, Weicheng Kuo, Klemen Kloboves, Jennifer Beattie, Christian Wright, Leon Li, Alicia Jin, Sandeep Mariserla, Miteyan Patel, Jens Heitkaemper, Dilip Krishnan, Vivek Sharma, David Bieber, Christian Frank, John Lambert, Paul Caron, Martin Polacek, Mai Giménez, Himadri Choudhury, Xing Yu, Sasan Tavakkol, Arun Ahuja, Franz Och, Rodolphe Jenatton, Wojtek Skut, Bryan Richter, David Gaddy, Andy Ly, Misha Bilenko, Megh Umekar, Ethan Liang, Martin Sevenich, Mandar Joshi, Hassan Mansoor, Rebecca Lin, Sumit Sanghai, Abhimanyu Singh, Xiaowei Li, Sudheendra Vijayanarasimhan, Zaheer Abbas, Yonatan Bitton, Hansa Srinivasan, Manish Reddy Vuyyuru, Alexander Frömmgen, Yanhua Sun, Ralph Leith, Alfonso Castaño, DJ Strouse, Le Yan, Austin Kyker, Satish Kambala, Mary Jasarevic, Thibault Sellam, Chao Jia, Alexander Pritzel, Raghavender R, Huizhong Chen, Natalie Clay, Sudeep Gandhe, Sean Kirmani, Sayna 
Ebrahimi, Hannah Kirkwood, Jonathan Mallinson, Chao Wang, Adnan Ozturel, Kuo Lin, Shyam Upadhyay, Vincent Cohen-Addad, Sean Purser-haskell, Yichong Xu, Ebrahim Songhori, Babi Seal, Alberto Magni, Almog Gueta, Tingting Zou, Guru Guruganesh, Thais Kagohara, Hung Nguyen, Khalid Salama, Alejandro Cruzado Ruiz, Justin Frye, Zhenkai Zhu, Matthias Lochbrunner, Simon Osindero, Wentao Yuan, Lisa Lee, Aman Prasad, Lam Nguyen Thiet, Daniele Calandriello, Victor Stone, Qixuan Feng, Han Ke, Maria Voitovich, Geta Sampemane, Lewis Chiang, Ling Wu, Alexander Bykovsky, Matt Young, Luke Vilnis, Ishita Dasgupta, Aditya Chawla, Qin Cao, Bowen Liang, Daniel Toyama, Szabolcs Payrits, Anca Stefanoiu, Dimitrios Vytiniotis, Ankesh Anand, Tianxiao Shen, Blagoj Mitrevski, Michael Tschannen, Sreenivas Gollapudi, Aishwarya P S, José Leal, Zhe Shen, Han Fu, Wei Wang, Arvind Kannan, Doron Kukliansky, Sergey Yaroshenko, Svetlana Grant, Umesh Telang, David Wood, Alexandra Chronopoulou, Alexandru Ţifrea, Tao Zhou, Tony, Nguy~ên, Muge Ersoy, Anima Singh, Meiyan Xie, Emanuel Taropa, Woohyun Han, Eirikur Agustsson, Andrei Sozanschi, Hui Peng, Alex Chen, Yoel Drori, Efren Robles, Yang Gao, Xerxes Dotiwalla, Ying Chen, Anudhyan Boral, Alexei Bendebury, John Nham, Chris Tar, Luis Castro, Jiepu Jiang, Canoee Liu, Felix Halim, Jinoo Baek, Andy Wan, Jeremiah Liu, Yuan Cao, Shengyang Dai, Trilok Acharya, Ruoxi Sun, Fuzhao Xue, Saket Joshi, Morgane Lustman, Yongqin Xian, Rishabh Joshi, Deep Karkhanis, Nora Kassner, Jamie Hall, Xiangzhuo Ding, Gan Song, Gang Li, Chen Zhu, Yana Kulizhskaya, Bin Ni, Alexey Vlaskin, Solomon Demmessie, Lucio Dery, Salah Zaiem, Yanping Huang, Cindy Fan, Felix Gimeno, Ananth Balashankar, Koji Kojima, Hagai Taitelbaum, Maya Meng, Dero Gharibian, Sahil Singla, Wei Chen, Ambrose Slone, Guanjie Chen, Sujee Rajayogam, Max Schumacher, Suyog Kotecha, Rory Blevins, Qifei Wang, Mor Hazan Taege, Alex Morris, Xin Liu, Fayaz Jamil, Richard Zhang, Pratik Joshi, Ben Ingram, Tyler Liechty, Ahmed Eleryan, Scott Baird, Alex Grills, Gagan Bansal, Shan Han, Kiran Yalasangi, Shawn Xu, Majd Al Merey, Isabel Gao, Felix Weissenberger, Igor Karpov, Robert Riachi, Ankit Anand, Gautam Prasad, Kay Lamerigts, Reid Hayes, Jamie Rogers, Mandy Guo, Ashish Shenoy, Qiong, Hu, Kyle He, Yuchen Liu, Polina Zablotskaia, Sagar Gubbi, Yifan Chang, Jay Pavagadhi, Kristian Kjems, Archita Vadali, Diego Machado, Yeqing Li, Renshen Wang, Dipankar Ghosh, Aahil Mehta, Dana Alon, George Polovets, Alessio Tonioni, Nate Kushman, Joel D’sa, Lin Zhuo, Allen Wu, Rohin Shah, John Youssef, Jiayu Ye, Justin Snyder, Karel Lenc, Senaka Buthpitiya, Matthew Tung, Jichuan Chang, Tao Chen, David Saxton, Jenny Lee, Lydia Lihui Zhang, James Qin, Prabakar Radhakrishnan, Maxwell Chen, Piotr Ambroszczyk, Metin Toksoz-Exley, Yan Zhong, Nitzan Katz, Brendan O’Donoghue, Tamara von Glehn, Adi Gerzi Rosenthal, Aga Świetlik, Xiaokai Zhao, Nick Fernando, Jinliang Wei, Jieru Mei, Sergei Vassilvitskii, Diego Cedillo, Pranjal Awasthi, Hui Zheng, Koray Kavukcuoglu, Itay Laish, Joseph Pagadora, Marc Brockschmidt, Christopher A. 
Choquette-Choo, Arunkumar Byravan, Yifeng Lu, Xu Chen, Mia Chen, Kenton Lee, Rama Pasumarthi, Sijal Bhatnagar, Aditya Shah, Qiyin Wu, Zhuoyuan Chen, Zack Nado, Bartek Perz, Zixuan Jiang, David Kao, Ganesh Mallya, Nino Vieillard, Lantao Mei, Sertan Girgin, Mandy Jordan, Yeongil Ko, Alekh Agarwal, Yaxin Liu, Yasemin Altun, Raoul de Liedekerke, Anastasios Kementsietsidis, Daiyi Peng, Dangyi Liu, Utku Evci, Peter Humphreys, Austin Tarango, Xiang Deng, Yoad Lewenberg, Kevin Aydin, Chengda Wu, Bhavishya Mittal, Tsendsuren Munkhdalai, Kleopatra Chatziprimou, Rodrigo Benenson, Uri First, Xiao Ma, Jinning Li, Armand Joulin, Hamish Tomlinson, Tingnan Zhang, Milad Nasr, Zhi Hong, Michaël Sander, Lisa Anne Hendricks, Anuj Sharma, Andrew Bolt, Eszter Vértes, Jiri Simsa, Tomer Levinboim, Olcan Sercinoglu, Divyansh Shukla, Austin Wu, Craig Swanson, Danny Vainstein, Fan Bu, Bo Wang, Ryan Julian, Charles Yoon, Sergei Lebedev, Antonious Girgis, Bernd Bandemer, David Du, Todd Wang, Xi Chen, Ying Xiao, Peggy Lu, Natalie Ha, Vlad Ionescu, Simon Rowe, Josip Matak, Federico Lebron, Andreas Steiner, Lalit Jain, Manaal Faruqui, Nicolas Lacasse, Georgie Evans, Neesha Subramaniam, Dean Reich, Giulia Vezzani, Aditya Pandey, Joe Stanton, Tianhao Zhou, Liam McCafferty, Henry Griffiths, Verena Rieser, Soheil Hassas Yeganeh, Eleftheria Briakou, Lu Huang, Zichuan Wei, Liangchen Luo, Erik Jue, Gabby Wang, Victor Cotruta, Myriam Khan, Jongbin Park, Qiuchen Guo, Peiran Li, Rong Rong, Diego Antognini, Anastasia Petrushkina, Chetan Tekur, Eli Collins, Parul Bhatia, Chester Kwak, Wenhu Chen, Arvind Neelakantan, Immanuel Odisho, Sheng Peng, Vincent Nallatamby, Vaibhav Tulsyan, Fabian Pedregosa, Peng Xu, Raymond Lin, Yulong Wang, Emma Wang, Sholto Douglas, Reut Tsarfaty, Elena Gribovskaya, Renga Aravamudhan, Manu Agarwal, Mara Finkelstein, Qiao Zhang, Elizabeth Cole, Phil Crone, Sarmishta Velury, Anil Das, Chris Sauer, Luyao Xu, Danfeng Qin, Chenjie Gu, Dror Marcus, CJ Zheng, Wouter Van Gansbeke, Sobhan Miryoosefi, Haitian Sun, YaGuang Li, Charlie Chen, Jae Yoo, Pavel Dubov, Alex Tomala, Adams Yu, Paweł Wesołowski, Alok Gunjan, Eddie Cao, Jiaming Luo, Nikhil Sethi, Arkadiusz Socala, Laura Graesser, Tomas Kocisky, Arturo BC, Minmin Chen, Edward Lee, Sophie Wang, Weize Kong, Qiantong Xu, Nilesh Tripuraneni, Yiming Li, Xinxin Yu, Allen Porter, Paul Voigtlaender, Biao Zhang, Arpi Vezer, Sarah York, Qing Wei, Geoffrey Cideron, Mark Kurzeja, Seungyeon Kim, Benny Li, Angéline Pouget, Hyo Lee, Kaspar Daugaard, Yang Li, Dave Uthus, Aditya Siddhant, Paul Cavallaro, Sriram Ganapathy, Maulik Shah, Rolf Jagerman, Jeff Stanway, Piermaria Mendolicchio, Li Xiao, Kayi Lee, Tara Thompson, Shubham Milind Phal, Jason Chase, Sun Jae Lee, Adrian N Reyes, Disha Shrivastava, Zhen Qin, Roykrong Sukkerd, Seth Odoom, Lior Madmoni, John Aslanides, Jonathan Herzig, Elena Pochernina, Sheng Zhang, Parker Barnes, Daisuke Ikeda, Qiujia Li, Shuo-yiin Chang, Shakir Mohamed, Jim Sproch, Richard Powell, Bidisha Samanta, Domagoj Ćevid, Anton Kovsharov, Shrestha Basu Mallick, Srinivas Tadepalli, Anne Zheng, Kareem Ayoub, Andreas Noever, Christian Reisswig, Zhuo Xu, Junhyuk Oh, Martin Matysiak, Tim Blyth, Shereen Ashraf, Julien Amelot, Boone Severson, Michele Bevilacqua, Motoki Sano, Ethan Dyer, Ofir Roval, Anu Sinha, Yin Zhong, Sagi Perel, Tea Sabolić, Johannes Mauerer, Willi Gierke, Mauro Verzetti, Rodrigo Cabrera, Alvin Abdagic, Steven Hemingray, Austin Stone, Jong Lee, Farooq Ahmad, Karthik Raman, Lior Shani, Jonathan Lai, Orhan Firat, Nathan Waters, Eric Ge, Mo 
Shomrat, Himanshu Gupta, Rajeev Aggarwal, Tom Hudson, Bill Jia, Simon Baumgartner, Palak Jain, Joe Kovac, Junehyuk Jung, Ante Žužul, Will Truong, Morteza Zadimoghaddam, Songyou Peng, Marco Liang, Rachel Sterneck, Balaji Lakshminarayanan, Machel Reid, Oliver Woodman, Tong Zhou, Jianling Wang, Vincent Coriou, Arjun Narayanan, Jay Hoover, Yenai Ma, Apoorv Jindal, Clayton Sanford, Doug Reid, Swaroop Ramaswamy, Alex Kurakin, Roland Zimmermann, Yana Lunts, Dragos Dena, Zalán Borsos, Vered Cohen, Shujian Zhang, Will Grathwohl, Robert Dadashi, Morgan Redshaw, Joshua Kessinger, Julian Odell, Silvano Bonacina, Zihang Dai, Grace Chen, Ayush Dubey, Pablo Sprechmann, Mantas Pajarskas, Wenxuan Zhou, Niharika Ahuja, Tara Thomas, Martin Nikoltchev, Matija Kecman, Bharath Mankalale, Andrey Ryabtsev, Jennifer She, Christian Walder, Jiaming Shen, Lu Li, Carolina Parada, Sheena Panthaplackel, Okwan Kwon, Matt Lawlor, Utsav Prabhu, Yannick Schroecker, Marc’aurelio Ranzato, Pete Blois, Iurii Kemaev, Ting Yu, Dmitry Lepikhin, Hao Xiong, Sahand Sharifzadeh, Oleaser Johnson, Jeremiah Willcock, Rui Yao, Greg Farquhar, Sujoy Basu, Hidetoshi Shimokawa, Nina Anderson, Haiguang Li, Khiem Pham, Yizhong Liang, Sebastian Borgeaud, Alexandre Moufarek, Hideto Kazawa, Blair Kutzman, Marcin Sieniek, Sara Smoot, Ruth Wang, Natalie Axelsson, Nova Fallen, Prasha Sundaram, Yuexiang Zhai, Varun Godbole, Petros Maniatis, Alek Wang, Ilia Shumailov, Santhosh Thangaraj, Remi Crocker, Nikita Gupta, Gang Wu, Phil Chen, Gellért Weisz, Celine Smith, Mojtaba Seyedhosseini, Boya Fang, Xiyang Luo, Roey Yogev, Zeynep Cankara, Andrew Hard, Helen Ran, Rahul Sukthankar, George Necula, Gaël Liu, Honglong Cai, Praseem Banzal, Daniel Keysers, Sanjay Ghemawat, Connie Tao, Emma Dunleavy, Aditi Chaudhary, Wei Li, Maciej Mikuła, Chen-Yu Lee, Tiziana Refice, Krishna Somandepalli, Alexandre Fréchette, Dan Bahir, John Karro, Keith Rush, Sarah Perrin, Bill Rosgen, Xiaomeng Yang, Clara Huiyi Hu, Mahmoud Alnahlawi, Justin Mao-Jones, Roopal Garg, Hoang Nguyen, Bat-Orgil Batsaikhan, Iñaki Iturrate, Anselm Levskaya, Avi Singh, Ashyana Kachra, Tony Lu, Denis Petek, Zheng Xu, Mark Graham, Lukas Zilka, Yael Karov, Marija Kostelac, Fangyu Liu, Yaohui Guo, Weiyue Wang, Bernd Bohnet, Emily Pitler, Tony Bruguier, Keisuke Kinoshita, Chrysovalantis Anastasiou, Nilpa Jha, Ting Liu, Jerome Connor, Phil Wallis, Philip Pham, Eric Bailey, Shixin Li, Heng-Tze Cheng, Sally Ma, Haiqiong Li, Akanksha Maurya, Kate Olszewska, Manfred Warmuth, Christy Koh, Dominik Paulus, Siddhartha Reddy Jonnalagadda, Enrique Piqueras, Ali Elqursh, Geoff Brown, Hadar Shemtov, Loren Maggiore, Fei Xia, Ryan Foley, Beka Westberg, George van den Driessche, Livio Baldini Soares, Arjun Kar, Michael Quinn, Siqi Zuo, Jialin Wu, Kyle Kastner, Anna Bortsova, Aijun Bai, Ales Mikhalap, Luowei Zhou, Jennifer Brennan, Vinay Ramasesh, Honglei Zhuang, John Maggs, Johan Schalkwyk, Yuntao Xu, Hui Huang, Andrew Howard, Sasha Brown, Linting Xue, Gloria Shen, Brian Albert, Neha Jha, Daniel Zheng, Varvara Krayvanova, Spurthi Amba Hombaiah, Olivier Lacombe, Gautam Vasudevan, Dan Graur, Tian Xie, Meet Gandhi, Bangju Wang, Dustin Zelle, Harman Singh, Dahun Kim, Sébastien Cevey, Victor Ungureanu, Natasha Noy, Fei Liu, Annie Xie, Fangxiaoyu Feng, Katerina Tsihlas, Daniel Formoso, Neera Vats, Quentin Wellens, Yinan Wang, Niket Kumar Bhumihar, Samrat Ghosh, Matt Hoffman, Tom Lieber, Oran Lang, Kush Bhatia, Tom Paine, Aroonalok Pyne, Ronny Votel, Madeleine Clare Elish, Benoit Schillings, Alex Panagopoulos, Haichuan Yang, Adam 
Raveret, Zohar Yahav, Shuang Liu, Dalia El Badawy, Nishant Agrawal, Mohammed Badawi, Mahdi Mirzazadeh, Carla Bromberg, Fan Ye, Chang Liu, Tatiana Sholokhova, George-Cristian Muraru, Gargi Balasubramaniam, Jonathan Malmaud, Alen Carin, Danilo Martins, Irina Jurenka, Pankil Botadra, Dave Lacey, Richa Singh, Mariano Schain, Dan Zheng, Isabelle Guyon, Victor Lavrenko, Seungji Lee, Xiang Zhou, Demis Hassabis, Jeshwanth Challagundla, Derek Cheng, Nikhil Mehta, Matthew Mauger, Michela Paganini, Pushkar Mishra, Kate Lee, Zhang Li, Lexi Baugher, Ondrej Skopek, Max Chang, Amir Zait, Gaurav Menghani, Lizzetth Bellot, Guangxing Han, Jean-Michel Sarr, Sharat Chikkerur, Himanshu Sahni, Rohan Anil, Arun Narayanan, Chandu Thekkath, Daniele Pighin, Hana Strejček, Marko Velic, Fred Bertsch, Manuel Tragut, Keran Rong, Alicia Parrish, Kai Bailey, Jiho Park, Isabela Albuquerque, Abhishek Bapna, Rajesh Venkataraman, Alec Kosik, Johannes Griesser, Zhiwei Deng, Alek Andreev, Qingyun Dou, Kevin Hui, Fanny Wei, Xiaobin Yu, Lei Shu, Avia Aharon, David Barker, Badih Ghazi, Sebastian Flennerhag, Chris Breaux, Yuchuan Liu, Matthew Bilotti, Josh Woodward, Uri Alon, Stephanie Winkler, Tzu-Kuo Huang, Kostas Andriopoulos, João Gabriel Oliveira, Penporn Koanantakool, Berkin Akin, Michael Wunder, Cicero Nogueira dos Santos, Mohammad Hossein Bateni, Lin Yang, Dan Horgan, Beer Changpinyo, Keyvan Amiri, Min Ma, Dayeong Lee, Lihao Liang, Anirudh Baddepudi, Tejasi Latkar, Raia Hadsell, Jun Xu, Hairong Mu, Michael Han, Aedan Pope, Snchit Grover, Frank Kim, Ankit Bhagatwala, Guan Sun, Yamini Bansal, Amir Globerson, Alireza Nazari, Samira Daruki, Hagen Soltau, Jane Labanowski, Laurent El Shafey, Matt Harvey, Yanif Ahmad, Elan Rosenfeld, William Kong, Etienne Pot, Yi-Xuan Tan, Aurora Wei, Victoria Langston, Marcel Prasetya, Petar Veličković, Richard Killam, Robin Strudel, Darren Ni, Zhenhai Zhu, Aaron Archer, Kavya Kopparapu, Lynn Nguyen, Emilio Parisotto, Hussain Masoom, Sravanti Addepalli, Jordan Grimstad, Hexiang Hu, Joss Moore, Avinatan Hassidim, Le Hou, Mukund Raghavachari, Jared Lichtarge, Adam R. 
Brown, Hilal Dib, Natalia Ponomareva, Justin Fu, Yujing Zhang, Altaf Rahman, Joana Iljazi, Edouard Leurent, Gabriel Dulac-Arnold, Cosmo Du, Chulayuth Asawaroengchai, Larry Jin, Ela Gruzewska, Ziwei Ji, Benigno Uria, Daniel De Freitas, Paul Barham, Lauren Beltrone, Víctor Campos, Jun Yan, Neel Kovelamudi, Arthur Nguyen, Elinor Davies, Zhichun Wu, Zoltan Egyed, Kristina Toutanova, Nithya Attaluri, Hongliang Fei, Peter Stys, Siddhartha Brahma, Martin Izzard, Siva Velusamy, Scott Lundberg, Vincent Zhuang, Kevin Sequeira, Adam Santoro, Ehsan Amid, Ophir Aharoni, Shuai Ye, Mukund Sundararajan, Lijun Yu, Yu-Cheng Ling, Stephen Spencer, Hugo Song, Josip Djolonga, Christo Kirov, Sonal Gupta, Alessandro Bissacco, Clemens Meyer, Mukul Bhutani, Andrew Dai, Weiyi Wang, Siqi Liu, Ashwin Sreevatsa, Qijun Tan, Maria Wang, Lucy Kim, Yicheng Wang, Alex Irpan, Yang Xiao, Stanislav Fort, Yifan He, Alex Gurney, Bryan Gale, Yue Ma, Monica Roy, Viorica Patraucean, Taylan Bilal, Golnaz Ghiasi, Anahita Hosseini, Melvin Johnson, Zhuowan Li, Yi Tay, Benjamin Beyret, Katie Millican, Josef Broder, Mayank Lunayach, Danny Swisher, Eugen Vušak, David Parkinson, MH Tessler, Adi Mayrav Gilady, Richard Song, Allan Dafoe, Yves Raimond, Masa Yamaguchi, Itay Karo, Elizabeth Nielsen, Kevin Kilgour, Mike Dusenberry, Rajiv Mathews, Jiho Choi, Siyuan Qiao, Harsh Mehta, Sahitya Potluri, Chris Knutsen, Jialu Liu, Tat Tan, Kuntal Sengupta, Keerthana Gopalakrishnan, Abodunrinwa Toki, Mencher Chiang, Mike Burrows, Grace Vesom, Zafarali Ahmed, Ilia Labzovsky, Siddharth Vashishtha, Preeti Singh, Ankur Sharma, Ada Ma, Jinyu Xie, Pranav Talluri, Hannah Forbes-Pollard, Aarush Selvan, Joel Wee, Loic Matthey, Tom Funkhouser, Parthasarathy Gopavarapu, Lev Proleev, Cheng Li, Matt Thomas, Kashyap Kolipaka, Zhipeng Jia, Ashwin Kakarla, Srinivas Sunkara, Joan Puigcerver, Suraj Satishkumar Sheth, Emily Graves, Chen Wang, Sadh MNM Khan, Kai Kang, Shyamal Buch, Fred Zhang, Omkar Savant, David Soergel, Kevin Lee, Linda Friso, Xuanyi Dong, Rahul Arya, Shreyas Chandrakaladharan, Connor Schenck, Greg Billock, Tejas Iyer, Anton Bakalov, Leslie Baker, Alex Ruiz, Angad Chandorkar, Trieu Trinh, Matt Miecnikowski, Yanqi Zhou, Yangsibo Huang, Jiazhong Nie, Ali Shah, Ashish Thapliyal, Sam Haves, Lun Wang, Uri Shaham, Patrick Morris-Suzuki, Soroush Radpour, Leonard Berrada, Thomas Strohmann, Chaochao Yan, Jingwei Shen, Sonam Goenka, Tris Warkentin, Petar Dević, Dan Belov, Albert Webson, Madhavi Yenugula, Puranjay Datta, Jerry Chang, Nimesh Ghelani, Aviral Kumar, Vincent Perot, Jessica Lo, Yang Song, Herman Schmit, Jianmin Chen, Vasilisa Bashlovkina, Xiaoyue Pan, Diana Mincu, Paul Roit, Isabel Edkins, Andy Davis, Yujia Li, Ben Horn, Xinjian Li, Pradeep Kumar S, Eric Doi, Wanzheng Zhu, Sri Gayatri Sundara Padmanabhan, Siddharth Verma, Jasmine Liu, Heng Chen, Mihajlo Velimirović, Malcolm Reynolds, Priyanka Agrawal, Nick Sukhanov, Abhinit Modi, Siddharth Goyal, John Palowitch, Nima Khajehnouri, Wing Lowe, David Klinghoffer, Sharon Silver, Vinh Tran, Candice Schumann, Francesco Piccinno, Xi Liu, Mario Lučić, Xiaochen Yang, Sandeep Kumar, Ajay Kannan, Ragha Kotikalapudi, Mudit Bansal, Fabian Fuchs, Mohammad Javad Hosseini, Abdelrahman Abdelhamed, Dawn Bloxwich, Tianhe Yu, Ruoxin Sang, Gregory Thornton, Karan Gill, Yuchi Liu, Virat Shejwalkar, Jason Lin, Zhipeng Yan, Kehang Han, Thomas Buschmann, Michael Pliskin, Zhi Xing, Susheel Tatineni, Junlin Zhang, Sissie Hsiao, Gavin Buttimore, Marcus Wu, Zefei Li, Geza Kovacs, Legg Yeung, Tao Huang, Aaron Cohen, Bethanie 
Brownfield, Averi Nowak, Mikel Rodriguez, Tianze Shi, Hado van Hasselt, Kevin Cen, Deepanway Ghoshal, Kushal Majmundar, Weiren Yu, Warren, Chen, Danila Sinopalnikov, Hao Zhang, Vlado Galić, Di Lu, Zeyu Zheng, Maggie Song, Gary Wang, Gui Citovsky, Swapnil Gawde, Isaac Galatzer-Levy, David Silver, Ivana Balazevic, Dipanjan Das, Kingshuk Majumder, Yale Cong, Praneet Dutta, Dustin Tran, Hui Wan, Junwei Yuan, Daniel Eppens, Alanna Walton, Been Kim, Harry Ragan, James Cobon-Kerr, Lu Liu, Weijun Wang, Bryce Petrini, Jack Rae, Rakesh Shivanna, Yan Xiong, Chace Lee, Pauline Coquinot, Yiming Gu, Lisa Patel, Blake Hechtman, Aviel Boag, Orion Jankowski, Alex Wertheim, Alex Lee, Paul Covington, Hila Noga, Sam Sobell, Shanthal Vasanth, William Bono, Chirag Nagpal, Wei Fan, Xavier Garcia, Kedar Soparkar, Aybuke Turker, Nathan Howard, Sachit Menon, Yuankai Chen, Vikas Verma, Vladimir Pchelin, Harish Rajamani, Valentin Dalibard, Ana Ramalho, Yang Guo, Kartikeya Badola, Seojin Bang, Nathalie Rauschmayr, Julia Proskurnia, Sudeep Dasari, Xinyun Chen, Mikhail Sushkov, Anja Hauth, Pauline Sho, Abhinav Singh, Bilva Chandra, Allie Culp, Max Dylla, Olivier Bachem, James Besley, Heri Zhao, Timothy Lillicrap, Wei Wei, Wael Al Jishi, Ning Niu, Alban Rrustemi, Raphaël Lopez Kaufman, Ryan Poplin, Jewel Zhao, Minh Truong, Shikhar Bharadwaj, Ester Hlavnova, Eli Stickgold, Cordelia Schmid, Georgi Stephanov, Zhaoqi Leng, Frederick Liu, Léonard Hussenot, Shenil Dodhia, Juliana Vicente Franco, Lesley Katzen, Abhanshu Sharma, Sarah Cogan, Zuguang Yang, Aniket Ray, Sergi Caelles, Shen Yan, Ravin Kumar, Daniel Gillick, Renee Wong, Joshua Ainslie, Jonathan Hoech, Séb Arnold, Dan Abolafia, Anca Dragan, Ben Hora, Grace Hu, Alexey Guseynov, Yang Lu, Chas Leichner, Jinmeng Rao, Abhimanyu Goyal, Nagabhushan Baddi, Daniel Hernandez Diaz, Tim McConnell, Max Bain, Jake Abernethy, Qiqi Yan, Rylan Schaeffer, Paul Vicol, Will Thompson, Montse Gonzalez Arenas, Mathias Bellaiche, Pablo Barrio, Stefan Zinke, Riccardo Patana, Pulkit Mehta, JK Kearns, Avraham Ruderman, Scott Pollom, David D’Ambrosio, Cath Hope, Yang Yu, Andrea Gesmundo, Kuang-Huei Lee, Aviv Rosenberg, Yiqian Zhou, Yaoyiran Li, Drew Garmon, Yonghui Wu, Safeen Huda, Gil Fidel, Martin Baeuml, Jian Li, Phoebe Kirk, Rhys May, Tao Tu, Sara Mc Carthy, Toshiyuki Fukuzawa, Miranda Aperghis, Chih-Kuan Yeh, Toshihiro Yoshino, Bo Li, Austin Myers, Kaisheng Yao, Ben Limonchik, Changwan Ryu, Rohun Saxena, Alex Goldin, Ruizhe Zhao, Rocky Rhodes, Tao Zhu, Divya Tyam, Heidi Howard, Nathan Byrd, Hongxu Ma, Yan Wu, Ryan Mullins, Qingze Wang, Aida Amini, Sebastien Baur, Yiran Mao, Subhashini Venugopalan, Will Song, Wen Ding, Paul Collins, Sashank Reddi, Megan Shum, Andrei Rusu, Luisa Zintgraf, Kelvin Chan, Sheela Goenka, Mathieu Blondel, Michael Collins, Renke Pan, Marissa Giustina, Nikolai Chinaev, Christian Schuler, Ce Zheng, Jonas Valfridsson, Alyssa Loo, Alex Yakubovich, Jamie Smith, Tao Jiang, Rich Munoz, Gabriel Barcik, Rishabh Bansal, Mingyao Yang, Yilun Du, Pablo Duque, Mary Phuong, Alexandra Belias, Kunal Lad, Zeyu Liu, Tal Schuster, Karthik Duddu, Jieru Hu, Paige Kunkle, Matthew Watson, Jackson Tolins, Josh Smith, Denis Teplyashin, Garrett Bingham, Marvin Ritter, Marco Andreetto, Divya Pitta, Mohak Patel, Shashank Viswanadha, Trevor Strohman, Catalin Ionescu, Jincheng Luo, Yogesh Kalley, Jeremy Wiesner, Dan Deutsch, Derek Lockhart, Peter Choy, Rumen Dangovski, Chawin Sitawarin, Cat Graves, Tanya Lando, Joost van Amersfoort, Ndidi Elue, Zhouyuan Huo, Pooya Moradi, Jean Tarbouriech, Henryk 
Michalewski, Wenting Ye, Eunyoung Kim, Alex Druinsky, Florent Altché, Xinyi Chen, Artur Dwornik, Da-Cheng Juan, Rivka Moroshko, Horia Toma, Jarrod Kahn, Hai Qian, Maximilian Sieb, Irene Cai, Roman Goldenberg, Praneeth Netrapalli, Sindhu Raghuram, Yuan Gong, Lijie Fan, Evan Palmer, Yossi Matias, Valentin Gabeur, Shreya Pathak, Tom Ouyang, Don Metzler, Geoff Bacon, Srinivasan Venkatachary, Sridhar Thiagarajan, Alex Cullum, Eran Ofek, Vytenis Sakenas, Mohamed Hammad, Cesar Magalhaes, Mayank Daswani, Oscar Chang, Ashok Popat, Ruichao Li, Komal Jalan, Yanhan Hou, Josh Lipschultz, Antoine He, Wenhao Jia, Pier Giuseppe Sessa, Prateek Kolhar, William Wong, Sumeet Singh, Lukas Haas, Jay Whang, Hanna Klimczak-Plucińska, Georges Rotival, Grace Chung, Yiqing Hua, Anfal Siddiqui, Nicolas Serrano, Dongkai Chen, Billy Porter, Libin Bai, Keshav Shivam, Sho Arora, Partha Talukdar, Tom Cobley, Sangnie Bhardwaj, Evgeny Gladchenko, Simon Green, Kelvin Guu, Felix Fischer, Xiao Wu, Eric Wang, Achintya Singhal, Tatiana Matejovicova, James Martens, Hongji Li, Roma Patel, Elizabeth Kemp, Jiaqi Pan, Lily Wang, Blake JianHang Chen, Jean-Baptiste Alayrac, Navneet Potti, Erika Gemzer, Eugene Ie, Kay McKinney, Takaaki Saeki, Edward Chou, Pascal Lamblin, SQ Mah, Zach Fisher, Martin Chadwick, Jon Stritar, Obaid Sarvana, Andrew Hogue, Artem Shtefan, Hadi Hashemi, Yang Xu, Jindong Gu, Sharad Vikram, Chung-Ching Chang, Sabela Ramos, Logan Kilpatrick, Weijuan Xi, Jenny Brennan, Yinghao Sun, Abhishek Jindal, Ionel Gog, Dawn Chen, Felix Wu, Jason Lee, Sudhindra Kopalle, Srinadh Bhojanapalli, Oriol Vinyals, Natan Potikha, Burcu Karagol Ayan, Yuan Yuan, Michael Riley, Piotr Stanczyk, Sergey Kishchenko, Bing Wang, Dan Garrette, Antoine Yang, Vlad Feinberg, CJ Carey, Javad Azizi, Viral Shah, Erica Moreira, Chongyang Shi, Josh Feldman, Elizabeth Salesky, Thomas Lampe, Aneesh Pappu, Duhyeon Kim, Jonas Adler, Avi Caciularu, Brian Walker, Yunhan Xu, Yochai Blau, Dylan Scandinaro, Terry Huang, Sam El-Husseini, Abhishek Sinha, Lijie Ren, Taylor Tobin, Patrik Sundberg, Tim Sohn, Vikas Yadav, Mimi Ly, Emily Xue, Jing Xiong, Afzal Shama Soudagar, Sneha Mondal, Nikhil Khadke, Qingchun Ren, Ben Vargas, Stan Bileschi, Sarah Chakera, Cindy Wang, Boyu Wang, Yoni Halpern, Joe Jiang, Vikas Sindhwani, Petre Petrov, Pranavaraj Ponnuramu, Sanket Vaibhav Mehta, Yu Watanabe, Betty Chan, Matheus Wisniewski, Trang Pham, Jingwei Zhang, Conglong Li, Dario de Cesare, Art Khurshudov, Alex Vasiloff, Melissa Tan, Zoe Ashwood, Bobak Shahriari, Maryam Majzoubi, Garrett Tanzer, Olga Kozlova, Robin Alazard, James Lee-Thorp, Nguyet Minh Phu, Isaac Tian, Junwhan Ahn, Andy Crawford, Lauren Lax, Yuan Shangguan, Iftekhar Naim, David Ross, Oleksandr Ferludin, Tongfei Guo, Andrea Banino, Hubert Soyer, Xiaoen Ju, Dominika Rogozińska, Ishaan Malhi, Marcella Valentine, Daniel Balle, Apoorv Kulshreshtha, Maciej Kula, Yiwen Song, Sophia Austin, John Schultz, Roy Hirsch, Arthur Douillard, Apoorv Reddy, Michael Fink, Summer Yue, Khyatti Gupta, Adam Zhang, Norman Rink, Daniel McDuff, Lei Meng, András György, Yasaman Razeghi, Ricky Liang, Kazuki Osawa, Aviel Atias, Matan Eyal, Tyrone Hill, Nikolai Grigorev, Zhengdong Wang, Nitish Kulkarni, Rachel Soh, Ivan Lobov, Zachary Charles, Sid Lall, Kazuma Hashimoto, Ido Kessler, Victor Gomes, Zelda Mariet, Danny Driess, Alessandro Agostini, Canfer Akbulut, Jingcao Hu, Marissa Ikonomidis, Emily Caveness, Kartik Audhkhasi, Saurabh Agrawal, Ioana Bica, Evan Senter, Jayaram Mudigonda, Kelly Chen, Jingchen Ye, Xuanhui Wang, James Svensson, 
Philipp Fränken, Josh Newlan, Li Lao, Eva Schnider, Sami Alabed, Joseph Kready, Jesse Emond, Afief Halumi, Tim Zaman, Chengxi Ye, Naina Raisinghani, Vilobh Meshram, Bo Chang, Ankit Singh Rawat, Axel Stjerngren, Sergey Levi, Rui Wang, Xiangzhu Long, Mitchelle Rasquinha, Steven Hand, Aditi Mavalankar, Lauren Agubuzu, Sudeshna Roy, Junquan Chen, Jarek Wilkiewicz, Hao Zhou, Michal Jastrzebski, Qiong Hu, Agustin Dal Lago, Ramya Sree Boppana, Wei-Jen Ko, Jennifer Prendki, Yao Su, Zhi Li, Eliza Rutherford, Girish Ramchandra Rao, Ramona Comanescu, Adrià Puigdomènech, Qihang Chen, Dessie Petrova, Christine Chan, Vedrana Milutinovic, Felipe Tiengo Ferreira, Chin-Yi Cheng, Ming Zhang, Tapomay Dey, Sherry Yang, Ramesh Sampath, Quoc Le, Howard Zhou, Chu-Cheng Lin, Hoi Lam, Christine Kaeser-Chen, Kai Hui, Dean Hirsch, Tom Eccles, Basil Mustafa, Shruti Rijhwani, Morgane Rivière, Yuanzhong Xu, Junjie Wang, Xinyang Geng, Xiance Si, Arjun Khare, Cheolmin Kim, Vahab Mirrokni, Kamyu Lee, Khuslen Baatarsukh, Nathaniel Braun, Lisa Wang, Pallavi LV, Richard Tanburn, Yuvein, Zhu, Fangda Li, Setareh Ariafar, Dan Goldberg, Ken Burke, Daniil Mirylenka, Meiqi Guo, Olaf Ronneberger, Hadas Natalie Vogel, Liqun Cheng, Nishita Shetty, Johnson Jia, Thomas Jimma, Corey Fry, Ted Xiao, Martin Sundermeyer, Ryan Burnell, Yannis Assael, Mario Pinto, JD Chen, Rohit Sathyanarayana, Donghyun Cho, Jing Lu, Rishabh Agarwal, Sugato Basu, Lucas Gonzalez, Dhruv Shah, Meng Wei, Dre Mahaarachchi, Rohan Agrawal, Tero Rissa, Yani Donchev, Ramiro Leal-Cavazos, Adrian Hutter, Markus Mircea, Alon Jacovi, Faruk Ahmed, Jiageng Zhang, Shuguang Hu, Bo-Juen Chen, Jonni Kanerva, Guillaume Desjardins, Andrew Lee, Nikos Parotsidis, Asier Mujika, Tobias Weyand, Jasper Snoek, Jo Chick, Kai Chen, Paul Chang, Ethan Mahintorabi, Zi Wang, Tolly Powell, Orgad Keller, Abhirut Gupta, Claire Sha, Kanav Garg, Nicolas Heess, Ágoston Weisz, Cassidy Hardin, Bartek Wydrowski, Ben Coleman, Karina Zainullina, Pankaj Joshi, Alessandro Epasto, Terry Spitz, Binbin Xiong, Kai Zhao, Arseniy Klimovskiy, Ivy Zheng, Johan Ferret, Itay Yona, Waleed Khawaja, Jean-Baptiste Lespiau, Maxim Krikun, Siamak Shakeri, Timothee Cour, Bonnie Li, Igor Krivokon, Dan Suh, Alex Hofer, Jad Al Abdallah, Nikita Putikhin, Oscar Akerlund, Silvio Lattanzi, Anurag Kumar, Shane Settle, Himanshu Srivastava, Folawiyo Campbell-Ajala, Edouard Rosseel, Mihai Dorin Istin, Nishanth Dikkala, Anand Rao, Nick Young, Kate Lin, Dhruva Bhaswar, Yiming Wang, Jaume Sanchez Elias, Kritika Muralidharan, James Keeling, Dayou Du, Siddharth Gopal, Gregory Dibb, Charles Blundell, Manolis Delakis, Jacky Liang, Marco Tulio Ribeiro, Georgi Karadzhov, Guillermo Garrido, Ankur Bapna, Jiawei Cao, Adam Sadovsky, Pouya Tafti, Arthur Guez, Coline Devin, Yixian Di, Jinwei Xing, Chuqiao, Xu, Hanzhao Lin, Chun-Te Chu, Sameera Ponda, Wesley Helmholz, Fan Yang, Yue Gao, Sara Javanmardi, Wael Farhan, Alex Ramirez, Ricardo Figueira, Khe Chai Sim, Yuval Bahat, Ashwin Vaswani, Liangzhe Yuan, Gufeng Zhang, Leland Rechis, Hanjun Dai, Tayo Oguntebi, Alexandra Cordell, Eugénie Rives, Kaan Tekelioglu, Naveen Kumar, Bing Zhang, Aurick Zhou, Nikolay Savinov, Andrew Leach, Alex Tudor, Sanjay Ganapathy, Yanyan Zheng, Mirko Rossini, Vera Axelrod, Arnaud Autef, Yukun Zhu, Zheng Zheng, Mingda Zhang, Baochen Sun, Jie Ren, Nenad Tomasev, Nithish Kannen, Amer Sinha, Charles Chen, Louis O’Bryan, Alex Pak, Aditya Kusupati, Weel Yang, Deepak Ramachandran, Patrick Griffin, Seokhwan Kim, Philipp Neubeck, Craig Schiff, Tammo Spalink, Mingyang Ling, Arun 
Nair, Ga-Young Joung, Linda Deng, Avishkar Bhoopchand, Lora Aroyo, Tom Duerig, Jordan Griffith, Gabe Barth-Maron, Jake Ades, Alex Haig, Ankur Taly, Yunting Song, Paul Michel, Dave Orr, Dean Weesner, Corentin Tallec, Carrie Grimes Bostock, Paul Niemczyk, Andy Twigg, Mudit Verma, Rohith Vallu, Henry Wang, Marco Gelmi, Kiranbir Sodhia, Aleksandr Chuklin, Omer Goldman, Jasmine George, Liang Bai, Kelvin Zhang, Petar Sirkovic, Efrat Nehoran, Golan Pundak, Jiaqi Mu, Alice Chen, Alex Greve, Paulo Zacchello, David Amos, Heming Ge, Eric Noland, Colton Bishop, Jeffrey Dudek, Youhei Namiki, Elena Buchatskaya, Jing Li, Dorsa Sadigh, Masha Samsikova, Dan Malkin, Damien Vincent, Robert David, Rob Willoughby, Phoenix Meadowlark, Shawn Gao, Yan Li, Raj Apte, Amit Jhindal, Stein Xudong Lin, Alex Polozov, Zhicheng Wang, Tomas Mery, Anirudh GP, Varun Yerram, Sage Stevens, Tianqi Liu, Noah Fiedel, Charles Sutton, Matthew Johnson, Xiaodan Song, Kate Baumli, Nir Shabat, Muqthar Mohammad, Hao Liu, Marco Selvi, Yichao Zhou, Mehdi Hafezi Manshadi, Chu-ling Ko, Anthony Chen, Michael Bendersky, Jorge Gonzalez Mendez, Nisarg Kothari, Amir Zandieh, Yiling Huang, Daniel Andor, Ellie Pavlick, Idan Brusilovsky, Jitendra Harlalka, Sally Goldman, Andrew Lampinen, Guowang Li, Asahi Ushio, Somit Gupta, Lei Zhang, Chuyuan Kelly Fu, Madhavi Sewak, Timo Denk, Jed Borovik, Brendan Jou, Avital Zipori, Prateek Jain, Junwen Bai, Thang Luong, Jonathan Tompson, Alice Li, Li Liu, George Powell, Jiajun Shen, Alex Feng, Grishma Chole, Da Yu, Yinlam Chow, Tongxin Yin, Eric Malmi, Kefan Xiao, Yash Pande, Shachi Paul, Niccolò Dal Santo, Adil Dostmohamed, Sergio Guadarrama, Aaron Phillips, Thanumalayan Sankaranarayana Pillai, Gal Yona, Amin Ghafouri, Preethi Lahoti, Benjamin Lee, Dhruv Madeka, Eren Sezener, Simon Tokumine, Adrian Collister, Nicola De Cao, Richard Shin, Uday Kalra, Parker Beak, Emily Nottage, Ryo Nakashima, Ivan Jurin, Vikash Sehwag, Meenu Gaba, Junhao Zeng, Kevin R. 
McKee, Fernando Pereira, Tamar Yakar, Amayika Panda, Arka Dhar, Peilin Zhong, Daniel Sohn, Mark Brand, Lars Lowe Sjoesund, Viral Carpenter, Sharon Lin, Shantanu Thakoor, Marcus Wainwright, Ashwin Chaugule, Pranesh Srinivasan, Muye Zhu, Bernett Orlando, Jack Weber, Ayzaan Wahid, Gilles Baechler, Apurv Suman, Jovana Mitrović, Gabe Taubman, Honglin Yu, Helen King, Josh Dillon, Cathy Yip, Dhriti Varma, Tomas Izo, Levent Bolelli, Borja De Balle Pigem, Julia Di Trapani, Fotis Iliopoulos, Adam Paszke, Nishant Ranka, Joe Zou, Francesco Pongetti, Jed McGiffin, Alex Siegman, Rich Galt, Ross Hemsley, Goran Žužić, Victor Carbune, Tao Li, Myle Ott, Félix de Chaumont Quitry, David Vilar Torres, Yuri Chervonyi, Tomy Tsai, Prem Eruvbetine, Samuel Yang, Matthew Denton, Jake Walker, Slavica Andačić, Idan Heimlich Shtacher, Vittal Premachandran, Harshal Tushar Lehri, Cip Baetu, Damion Yates, Lampros Lamprou, Mariko Iinuma, Ioana Mihailescu, Ben Albrecht, Shachi Dave, Susie Sargsyan, Bryan Perozzi, Lucas Manning, Chiyuan Zhang, Denis Vnukov, Igor Mordatch, Raia Hadsell Wolfgang Macherey, Ryan Kappedal, Jim Stephan, Aditya Tripathi, Klaus Macherey, Jun Qian, Abhishek Bhowmick, Shekoofeh Azizi, Rémi Leblond, Shiva Mohan Reddy Garlapati, Timothy Knight, Matthew Wiethoff, Wei-Chih Hung, Anelia Angelova, Georgios Evangelopoulos, Pawel Janus, Dimitris Paparas, Matthew Rahtz, Ken Caluwaerts, Vivek Sampathkumar, Daniel Jarrett, Shadi Noghabi, Antoine Miech, Chak Yeung, Geoff Clark, Henry Prior, Fei Zheng, Jean Pouget-Abadie, Indro Bhattacharya, Kalpesh Krishna, Will Bishop, Zhe Yuan, Yunxiao Deng, Ashutosh Sathe, Kacper Krasowiak, Ciprian Chelba, Cho-Jui Hsieh, Kiran Vodrahalli, Buhuang Liu, Thomas Köppe, Amr Khalifa, Lubo Litchev, Pichi Charoenpanit, Reed Roberts, Sachin Yadav, Yasumasa Onoe, Desi Ivanov, Megha Mohabey, Vighnesh Birodkar, Nemanja Rakićević, Pierre Sermanet, Vaibhav Mehta, Krishan Subudhi, Travis Choma, Will Ng, Luheng He, Kathie Wang, Tasos Kementsietsidis, Shane Gu, Mansi Gupta, Andrew Nystrom, Mehran Kazemi, Timothy Chung, Nacho Cano, Nikhil Dhawan, Yufei Wang, Jiawei Xia, Trevor Yacovone, Eric Jia, Mingqing Chen, Simeon Ivanov, Ashrith Sheshan, Sid Dalmia, Paweł Stradomski, Pengcheng Yin, Salem Haykal, Congchao Wang, Dennis Duan, Neslihan Bulut, Greg Kochanski, Liam MacDermed, Namrata Godbole, Shitao Weng, Jingjing Chen, Rachana Fellinger, Ramin Mehran, Daniel Suo, Hisham Husain, Tong He, Kaushal Patel, Joshua Howland, Randall Parker, Kelvin Nguyen, Sharath Maddineni, Chris Rawles, Mina Khan, Shlomi Cohen-Ganor, Amol Mandhane, Xinyi Wu, Chenkai Kuang, Iulia Comşa, Ramya Ganeshan, Hanie Sedghi, Adam Bloniarz, Nuo Wang Pierse, Anton Briukhov, Petr Mitrichev, Anita Gergely, Serena Zhan, Allan Zhou, Nikita Saxena, Eva Lu, Josef Dean, Ashish Gupta, Nicolas Perez-Nieves, Renjie Wu, Cory McLean, Wei Liang, Disha Jindal, Anton Tsitsulin, Wenhao Yu, Kaiz Alarakyia, Tom Schaul, Piyush Patil, Peter Sung, Elijah Peake, Hongkun Yu, Feryal Behbahani, JD Co-Reyes, Alan Ansell, Sean Sun, Clara Barbu, Jonathan Lee, Seb Noury, James Allingham, Bilal Piot, Mohit Sharma, Christopher Yew, Ivan Korotkov, Bibo Xu, Demetra Brady, Goran Petrovic, Shibl Mourad, Claire Cui, Aditya Gupta, Parker Schuh, Saarthak Khanna, Anna Goldie, Abhinav Arora, Vadim Zubov, Amy Stuart, Mark Epstein, Yun Zhu, Jianqiao Liu, Yury Stuken, Ziyue Wang, Karolis Misiunas, Dee Guo, Ashleah Gill, Ale Hartman, Zaid Nabulsi, Aurko Roy, Aleksandra Faust, Jason Riesa, Ben Withbroe, Mengchao Wang, Marco Tagliasacchi, Andreea Marzoca, James Noraky, Serge 
Toropov, Malika Mehrotra, Bahram Raad, Sanja Deur, Steve Xu, Marianne Monteiro, Zhongru Wu, Yi Luan, Sam Ritter, Nick Li, Håvard Garnes, Yanzhang He, Martin Zlocha, Jifan Zhu, Matteo Hessel, Will Wu, Spandana Raj Babbula, Chizu Kawamoto, Yuanzhen Li, Mehadi Hassen, Yan Wang, Brian Wieder, James Freedman, Yin Zhang, Xinyi Bai, Tianli Yu, David Reitter, XiangHai Sheng, Mateo Wirth, Aditya Kini, Dima Damen, Mingcen Gao, Rachel Hornung, Michael Voznesensky, Brian Roark, Adhi Kuncoro, Yuxiang Zhou, Rushin Shah, Anthony Brohan, Kuangyuan Chen, James Wendt, David Rim, Paul Kishan Rubenstein, Jonathan Halcrow, Michelle Liu, Ty Geri, Yunhsuan Sung, Jane Shapiro, Shaan Bijwadia, Chris Duvarney, Christina Sorokin, Paul Natsev, Reeve Ingle, Pramod Gupta, Young Maeng, Ndaba Ndebele, Kexin Zhu, Valentin Anklin, Katherine Lee, Yuan Liu, Yaroslav Akulov, Shaleen Gupta, Guolong Su, Flavien Prost, Tianlin Liu, Vitaly Kovalev, Pol Moreno, Martin Scholz, Sam Redmond, Zongwei Zhou, Alex Castro-Ros, André Susano Pinto, Dia Kharrat, Michal Yarom, Rachel Saputro, Jannis Bulian, Ben Caine, Ji Liu, Abbas Abdolmaleki, Shariq Iqbal, Tautvydas Misiunas, Mikhail Sirotenko, Shefali Garg, Guy Bensky, Huan Gui, Xuezhi Wang, Raphael Koster, Mike Bernico, Da Huang, Romal Thoppilan, Trevor Cohn, Ben Golan, Wenlei Zhou, Andrew Rosenberg, Markus Freitag, Tynan Gangwani, Vincent Tsang, Anand Shukla, Xiaoqi Ren, Minh Giang, Chi Zou, Andre Elisseeff, Charline Le Lan, Dheeru Dua, Shuba Lall, Pranav Shyam, Frankie Garcia, Sarah Nguyen, Michael Guzman, AJ Maschinot, Marcello Maggioni, Ming-Wei Chang, Karol Gregor, Lotte Weerts, Kumaran Venkatesan, Bogdan Damoc, Leon Liu, Jan Wassenberg, Lewis Ho, Becca Roelofs, Majid Hadian, François-Xavier Aubet, Yu Liang, Sami Lachgar, Danny Karmon, Yong Cheng, Amelio Vázquez-Reina, Angie Chen, Zhuyun Dai, Andy Brock, Shubham Agrawal, Chenxi Pang, Peter Garst, Mariella Sanchez-Vargas, Ivor Rendulic, Aditya Ayyar, Andrija Ražnatović, Olivia Ma, Roopali Vij, Neha Sharma, Ashwin Balakrishna, Bingyuan Liu, Ian Mackinnon, Sorin Baltateanu, Petra Poklukar, Gabriel Ibagon, Colin Ji, Hongyang Jiao, Isaac Noble, Wojciech Stokowiec, Zhihao Li, Jeff Dean, David Lindner, Mark Omernick, Kristen Chiafullo, Mason Dimarco, Vitor Rodrigues, Vittorio Selo, Garrett Honke, Xintian, Wu, Wei He, Adam Hillier, Anhad Mohananey, Vihari Piratla, Chang Ye, Chase Malik, Sebastian Riedel, Samuel Albanie, Zi Yang, Kenny Vassigh, Maria Bauza, Sheng Li, Yiqing Tao, Nevan Wichers, Andrii Maksai, Abe Ittycheriah, Ross Mcilroy, Bryan Seybold, Noah Goodman, Romina Datta, Steven M. 
Hernandez, Tian Shi, Yony Kochinski, Anna Bulanova, Ken Franko, Mikita Sazanovich, Nicholas FitzGerald, Praneeth Kacham, Shubha Srinivas Raghvendra, Vincent Hellendoorn, Alexander Grushetsky, Julian Salazar, Angeliki Lazaridou, Jason Chang, Jan-Thorsten Peter, Sushant Kafle, Yann Dauphin, Abhishek Rao, Filippo Graziano, Izhak Shafran, Yuguo Liao, Tianli Ding, Geng Yan, Grace Chu, Zhao Fu, Vincent Roulet, Gabriel Rasskin, Duncan Williams, Shahar Drath, Alex Mossin, Raphael Hoffmann, Jordi Orbay, Francesco Bertolini, Hila Sheftel, Justin Chiu, Siyang Xue, Yuheng Kuang, Ferjad Naeem, Swaroop Nath, Nana Nti, Phil Culliton, Kashyap Krishnakumar, Michael Isard, Pei Sun, Ayan Chakrabarti, Nathan Clement, Regev Cohen, Arissa Wongpanich, GS Oh, Ashwin Murthy, Hao Zheng, Jessica Hamrick, Oskar Bunyan, Suhas Ganesh, Nitish Gupta, Roy Frostig, John Wieting, Yury Malkov, Pierre Marcenac, Zhixin, Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim Põder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Taïga, Kareem Mohamed, Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrzębski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bražinskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav Žanić, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn, Urvashi Khandelwal, Frederik Benzing, Arthur Conmy, Andrey Simanovsky, Françoise Beaufays, Eugene Weinstein, Tongzhou Chen, Luke Leonhard, Bhuvana Ramabhadran

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.06261v3.pdf

Published: 2025-07-07T17:36:04Z


7. MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

The rapid rise of Large Language Model (LLM)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.

中文摘要

基于大型语言模型(LLM)的智能代理的快速崛起,凸显了对强大、可扩展的评估框架的需求。现有方法依赖于静态基准和劳动密集型的数据收集,限制了实际评估。我们介绍MCPEval,这是一个开源的、基于模型上下文协议(MCP)的框架,可以在不同领域自动化端到端任务生成和LLM代理的深度评估。MCPEval标准化了指标,与原生代理工具无缝集成,并消除了构建评估管道的手动工作。在五个现实世界领域的实证结果表明,它能有效揭示细微的、特定领域的性能差异。我们公开发布MCPEval(https://github.com/SalesforceAIResearch/MCPEval),以促进可复现和标准化的LLM代理评估。
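
For readers who want a concrete picture of what an automated agent-evaluation loop of this kind looks like, here is a minimal, self-contained Python sketch. All names (EvalTask, run_agent, score_response) are illustrative placeholders rather than MCPEval's actual API, and the exact-match metric merely stands in for the framework's standardized metrics.

```python
# Hypothetical sketch of an automated agent-evaluation loop in the spirit of the
# abstract above. EvalTask, run_agent and score_response are illustrative
# placeholders, not MCPEval's real API.
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvalTask:
    domain: str      # e.g. "finance", "healthcare"
    prompt: str      # auto-generated task description
    reference: str   # expected final answer or tool-call result

def run_agent(task: EvalTask) -> str:
    """Stand-in for invoking an LLM agent through its native (e.g. MCP) tools."""
    return "42"

def score_response(task: EvalTask, response: str) -> float:
    """Toy metric: exact match; a real pipeline would use richer, standardized metrics."""
    return float(response.strip() == task.reference.strip())

def evaluate(tasks):
    by_domain = {}
    for task in tasks:
        by_domain.setdefault(task.domain, []).append(score_response(task, run_agent(task)))
    return {domain: mean(scores) for domain, scores in by_domain.items()}

if __name__ == "__main__":
    demo = [EvalTask("math", "What is 6 * 7?", "42"),
            EvalTask("math", "What is 2 + 2?", "4")]
    print(evaluate(demo))   # {'math': 0.5}
```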

Authors: Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Huan Wang, Shelby Heinecke, Silvio Savarese, Caiming Xiong

Categories: cs.AI, cs.CL

PDF URL: https://arxiv.org/pdf/2507.12806v1.pdf

Published: 2025-07-17T05:46:27Z


8. Imitating Mistakes in a Learning Companion AI Agent for Online Peer Learning

In recent years, peer learning has gained attention as a method that promotes spontaneous thinking among learners, and its effectiveness has been confirmed by numerous studies. This study aims to develop an AI Agent as a learning companion that enables peer learning anytime and anywhere. However, peer learning between humans has various limitations and is not always effective: effective peer learning requires companions at the same proficiency level. In this study, we assume that peers at the same proficiency level as the learner make the same mistakes as the learner does, and we focus on English composition as a specific example to validate this approach.

中文摘要

近年来,同伴学习作为一种促进学习者自发思考的方法受到了关注,其有效性已得到众多研究的证实。本研究旨在开发一种AI Agent作为学习伴侣,使同伴能够随时随地学习。然而,人类之间的同伴学习有各种局限性,并不总是有效的。有效的同伴学习需要具有相同熟练程度的同伴。在这项研究中,我们假设与学习者具有相同熟练程度的同龄人与学习者犯的错误相同,并以英语作文为例来验证这种方法。

Authors: Sosui Moribe, Taketoshi Ushiama

Categories: cs.AI, cs.MA

PDF URL: https://arxiv.org/pdf/2507.12801v1.pdf

Published: 2025-07-17T05:37:07Z


9. VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts in conducting advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience, ranging from none to expert, demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

中文摘要

传统上,文本分析需要自然语言处理(NLP)或文本分析方面的专业知识,这对入门级分析师来说是一个障碍。大型语言模型(LLM)的最新进展通过实现更易访问和自动化的文本分析(例如主题检测、摘要、信息提取等)改变了NLP的格局。我们介绍VIDEE,这是一个支持入门级数据分析师使用智能代理进行高级文本分析的系统。VIDEE实例化了一个由三个阶段组成的人类代理协作工作流:(1)分解,它结合了一个人在循环中的蒙特卡洛树搜索算法,以支持具有人类反馈的生成推理,(2)执行,它生成了一个可执行的文本分析管道,以及(3)评估,它集成了基于LLM的评估和可视化,以支持用户对执行结果的验证。我们进行了两个定量实验来评估VIDEE的有效性并分析常见的代理错误。一项涉及具有不同NLP和文本分析经验水平的参与者(从无到专家)的用户研究展示了该系统的可用性,并揭示了不同的用户行为模式。研究结果确定了人机协作的设计意义,验证了VIDEE对非专家用户的实用性,并为智能文本分析系统的未来改进提供了信息。

Authors: Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma

Categories: cs.CL, cs.AI, cs.HC

PDF URL: https://arxiv.org/pdf/2506.21582v2.pdf

Published: 2025-06-17T05:24:58Z


10. Autonomy for Older Adult-Agent Interaction

As the global population ages, artificial intelligence (AI)-powered agents have emerged as potential tools to support older adults’ caregiving. Prior research has explored agent autonomy by identifying key interaction stages in task processes and defining the agent’s role at each stage. However, ensuring that agents align with older adults’ autonomy preferences remains a critical challenge. Drawing on interdisciplinary conceptualizations of autonomy, this paper examines four key dimensions of autonomy for older adults: decision-making autonomy, goal-oriented autonomy, control autonomy, and social responsibility autonomy. This paper then proposes the following research directions: (1) Addressing social responsibility autonomy, which concerns the ethical and social implications of agent use in communal settings; (2) Operationalizing agent autonomy from the task perspective; and (3) Developing autonomy measures.

中文摘要

随着全球人口老龄化,人工智能(AI)驱动的代理已成为支持老年人护理的潜在工具。先前的研究通过识别任务过程中的关键交互阶段并定义代理在每个阶段的角色来探索代理自主性。然而,确保代理人符合老年人的自主偏好仍然是一个关键挑战。基于跨学科的自主概念,本文探讨了老年人自主的四个关键维度:决策自主、目标导向自主、控制自主和社会责任自主。然后,本文提出了以下研究方向:(1)解决社会责任自治问题,这涉及在公共环境中使用代理人的伦理和社会影响;(2)从任务的角度实现代理自治;(3)制定自治措施。

Authors: Jiaxin An

Categories: cs.HC, cs.AI

PDF URL: https://arxiv.org/pdf/2507.12767v1.pdf

Published: 2025-07-17T03:46:13Z


AI Domain Papers

1. VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, substantially adopt unsupervised learning paradigms, whereas they struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of visual language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potentials for video understanding.

中文摘要

最近的研究表明,选择信息丰富且相关的视频帧可以显著提高视频大语言模型(video LLM)的性能。当前的方法,如减少帧间冗余、采用单独的模型进行图像文本相关性评估或利用时间视频基础进行事件定位,基本上采用了无监督学习范式,而它们很难解决长视频理解中的复杂场景。我们提出了视频的指令时间接地(VideoITG),其特点是根据用户指令进行定制的帧采样。VideoITG的核心是VidThinker管道,这是一个明确模仿人类注释过程的自动注释框架。首先,它根据指令生成详细的剪辑级字幕;然后,通过指令引导推理检索相关视频片段;最后,它执行细粒度的帧选择,以确定信息量最大的视觉证据。利用VidThinker,我们构建了VideoITG-40K数据集,其中包含40K个视频和500K个指示的时间基础注释。然后,我们设计了一个即插即用的VideoITG模型,该模型利用视频LLM的视觉语言对齐和推理能力,以判别的方式进行有效的帧选择。结合视频LLM,VideoITG在多个多模态视频理解基准测试中实现了一致的性能改进,显示了其在视频理解方面的优势和巨大潜力。
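
The three VidThinker stages described above lend themselves to a simple pipeline sketch. The Python below is purely illustrative: the captioner and the word-overlap relevance score are stand-ins for the Video-LLM calls used in the paper, and all function names are assumptions rather than the released implementation.

```python
# Toy sketch of the three VidThinker-style stages: clip captioning,
# instruction-guided retrieval, and frame selection. All names are assumptions.
from typing import List

def caption_clips(clips: List[str], instruction: str) -> List[str]:
    # A real system would prompt a Video-LLM per clip, conditioned on the instruction.
    return [f"caption of {clip} with respect to '{instruction}'" for clip in clips]

def retrieve_segments(captions: List[str], instruction: str, k: int = 2) -> List[int]:
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    ranked = sorted(range(len(captions)), key=lambda i: -overlap(captions[i], instruction))
    return sorted(ranked[:k])

def select_frames(segment_ids: List[int], frames_per_clip: int = 4, budget: int = 6) -> List[int]:
    # Uniformly subsample frame indices from the retrieved clips up to the budget.
    candidates = [c * frames_per_clip + f for c in segment_ids for f in range(frames_per_clip)]
    step = max(1, len(candidates) // budget)
    return candidates[::step][:budget]

clips = [f"clip_{i}" for i in range(8)]
instruction = "when does the dog catch the frisbee"
segments = retrieve_segments(caption_clips(clips, instruction), instruction)
print(select_frames(segments))
```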

Authors: Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Li, Jose M. Alvarez, Lei Zhang, Zhiding Yu

Categories: cs.CV, cs.AI

PDF URL: https://arxiv.org/pdf/2507.13353v1.pdf

Published: 2025-07-17T17:59:59Z


2. Hierarchical Rectified Flow Matching with Mini-Batch Couplings

Flow matching has emerged as a compelling generative modeling approach that is widely used across domains. To generate data via a flow matching model, an ordinary differential equation (ODE) is numerically solved via forward integration of the modeled velocity field. To better capture the multi-modality that is inherent in typical velocity fields, hierarchical flow matching was recently introduced. It uses a hierarchy of ODEs that are numerically integrated when generating data. This hierarchy of ODEs captures the multi-modal velocity distribution just like vanilla flow matching is capable of modeling a multi-modal data distribution. While this hierarchy makes it possible to model multi-modal velocity distributions, the complexity of the modeled distribution remains identical across levels of the hierarchy. In this paper, we study how to gradually adjust the complexity of the distributions across different levels of the hierarchy via mini-batch couplings. We show the benefits of mini-batch couplings in hierarchical rectified flow matching via compelling results on synthetic and imaging data. Code is available at https://riccizz.github.io/HRF_coupling.

中文摘要

流匹配已成为一种引人注目的生成建模方法,广泛应用于各个领域。为了通过流动匹配模型生成数据,通过建模速度场的正向积分对常微分方程(ODE)进行数值求解。为了更好地捕捉典型速度场中固有的多模态,最近引入了分层流匹配。它使用在生成数据时进行数值积分的ODE层次结构。这种ODE层次结构捕获了多模态速度分布,就像香草流匹配能够对多模态数据分布进行建模一样。虽然这种层次结构能够对多模态速度分布进行建模,但建模分布的复杂性在层次结构的各个级别上保持不变。本文研究了如何通过小批量耦合逐步调整层次结构不同级别分布的复杂性。我们通过合成和成像数据的令人信服的结果,展示了小批量耦合在分层整流流匹配中的优势。代码可在以下网址获得https://riccizz.github.io/HRF_coupling.
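
To make the idea of a mini-batch coupling concrete, the sketch below shows a generic rectified-flow-matching training step in which noise and data samples are re-paired inside each mini-batch by solving an assignment problem before the usual velocity-matching loss is computed. This only illustrates the general technique; the paper's hierarchical, level-wise adjustment of distribution complexity is not reproduced. Assumes torch and scipy.

```python
# Generic sketch of one rectified-flow-matching training step with a mini-batch
# coupling (minimum-cost re-pairing of noise and data inside the batch).
import torch
from scipy.optimize import linear_sum_assignment

def minibatch_coupling(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    """Re-pair noise x0 with data x1 by a minimum-cost assignment (squared Euclidean)."""
    cost = torch.cdist(x0.flatten(1), x1.flatten(1)).pow(2).cpu().numpy()
    _, col = linear_sum_assignment(cost)
    return x1[torch.as_tensor(col)]

def flow_matching_loss(model, x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                  # source (noise) samples
    x1 = minibatch_coupling(x0, x1)            # coupled data samples
    t = torch.rand(x1.shape[0], 1)             # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1               # point on the straight path
    target_v = x1 - x0                         # rectified-flow velocity target
    pred_v = model(torch.cat([xt, t], dim=1))  # v_theta(x_t, t)
    return (pred_v - target_v).pow(2).mean()

# Toy usage: a small MLP velocity field on 2-D data.
model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x1 = torch.randn(128, 2) + torch.tensor([4.0, 0.0])   # stand-in data distribution
opt.zero_grad()
loss = flow_matching_loss(model, x1)
loss.backward()
opt.step()
print(float(loss))
```

Compared with random pairing, the assignment step tends to produce straighter, less crossing transport paths within the batch, which is the basic effect the coupling is meant to exploit.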

Authors: Yichi Zhang, Yici Yan, Alex Schwing, Zhizhen Zhao

Categories: cs.CV, cs.LG

PDF URL: https://arxiv.org/pdf/2507.13350v1.pdf

Published: 2025-07-17T17:59:56Z


3. VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.

中文摘要

视觉语言模型(VLM)的最新进展通过增加视觉标记的数量来提高性能,视觉标记的长度通常比文本标记长得多。然而,我们观察到,大多数现实世界的场景不需要如此大量的视觉标记。虽然在一小部分OCR相关任务中性能显著下降,但模型在大多数其他一般VQA任务中仍能准确执行,分辨率仅为1/4。因此,我们建议动态处理具有不同分辨率的不同样本,并提出了一种新的视觉令牌压缩范式,即VisionThink。它从降采样图像开始,巧妙地决定它是否足以解决问题。否则,模型可以输出一个特殊的令牌来请求更高分辨率的图像。与使用固定修剪比率或阈值压缩令牌的现有高效VLM方法相比,VisionThink自主决定是否逐案压缩令牌。因此,它在OCR相关任务上表现出强大的细粒度视觉理解能力,同时在更简单的任务上节省了大量的视觉符号。我们采用强化学习,并提出LLM作为判断策略,将RL成功应用于一般VQA任务。此外,我们精心设计了一个奖励函数和惩罚机制,以实现稳定合理的图像调整调用率。大量的实验证明了我们的方法的优越性、效率和有效性。我们的代码可在https://github.com/dvlab-research/VisionThink.
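
The control flow the abstract describes (answer from a downsampled image, emit a special token to request the full-resolution image, and penalize unnecessary requests in the RL reward) can be sketched as follows. The token name, penalty value, and the toy VLM and judge stubs are hypothetical placeholders, not VisionThink's actual choices.

```python
# Illustrative sketch of the decide-then-upscale control flow and reward shape.
RESIZE_TOKEN = "<request_high_res>"   # hypothetical special token

def answer_with_dynamic_resolution(vlm, image, question):
    low_res = downsample(image, factor=4)        # start from a 1/4-resolution input
    reply = vlm(low_res, question)
    used_high_res = RESIZE_TOKEN in reply
    if used_high_res:                            # the model asked for more pixels
        reply = vlm(image, question)
    return reply, used_high_res

def reward(reply, gold, used_high_res, judge, penalty=0.1):
    # LLM-as-Judge scores correctness; a small penalty discourages resize calls
    # that are not actually needed for a correct answer.
    return judge(reply, gold) - (penalty if used_high_res else 0.0)

# Toy stand-ins so the sketch runs end to end.
def downsample(image, factor): return image[::factor]
def vlm(image, question): return "cat" if len(image) > 8 else RESIZE_TOKEN
def judge(reply, gold): return float(gold in reply)

reply, hi_res = answer_with_dynamic_resolution(vlm, list(range(16)), "what animal is shown?")
print(reply, hi_res, reward(reply, "cat", hi_res, judge))   # cat True 0.9
```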

Authors: Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia

Categories: cs.CV, cs.AI, cs.CL, cs.LG

PDF URL: https://arxiv.org/pdf/2507.13348v1.pdf

Published: 2025-07-17T17:59:55Z


4. Imbalance in Balance: Online Concept Balancing in Generation Models

In visual generation tasks, the responses and combinations of complex concepts often lack stability and are error-prone, which remains an under-explored area. In this paper, we attempt to explore the causal factors for poor concept responses through elaborately designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. In our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few lines of code.

中文摘要

在视觉生成任务中,复杂概念的响应和组合往往缺乏稳定性,容易出错,这仍然是一个探索不足的领域。本文试图通过精心设计的实验来探索概念反应不佳的原因。我们还设计了一个概念均衡损耗函数(IMBA损耗)来解决这个问题。我们提出的方法是在线的,消除了离线数据集处理的需要,并且需要最少的代码更改。在我们新提出的复杂概念基准Inert CompBench和另外两个公共测试集中,我们的方法显著提高了基线模型的概念响应能力,并仅用少量代码就产生了极具竞争力的结果。

Authors: Yukai Shi, Jiarong Ou, Rui Chen, Haotian Yang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai

Categories: cs.CV, cs.AI

PDF URL: https://arxiv.org/pdf/2507.13345v1.pdf

Published: 2025-07-17T17:59:47Z


5. DeFine: Decision-Making with Analogical Reasoning over Factor Profiles

LLMs are ideal for decision-making thanks to their ability to reason over long contexts. However, challenges arise when processing speech transcripts that describe complex scenarios, as they are verbose and include repetition, hedging, and vagueness. For example, during a company's earnings call, an executive might project a positive revenue outlook to reassure investors, despite uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in new situations. Our framework separates the tasks of quantifying uncertainty and incorporating it into LLM decision-making. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.

中文摘要

LLM凭借其在长上下文上的推理能力,是决策任务的理想选择。然而,在处理描述复杂场景的语音转录文本时会出现挑战,因为它们很冗长,包含重复、含糊其辞和模糊表述。例如,在公司财报电话会议期间,尽管未来盈利存在不确定性,但高管可能会给出积极的收入前景,以安抚投资者。LLM在做出决策时系统地考虑这种不确定性至关重要。在本文中,我们介绍了DeFine,这是一个从复杂场景构建概率因子画像的模块化框架。然后,它将这些画像与类比推理相结合,利用过去类似经验的见解来指导LLM在新情况下做出关键决策。我们的框架将量化不确定性与将其纳入LLM决策这两项任务分开。这种方法在咨询和财务审议等领域特别有用,在这些领域,在不确定性下做出决策至关重要。

Authors: Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2410.01772v2.pdf

Published: 2024-10-02T17:29:34Z


6. Latent Policy Steering with Embodiment-Agnostic Pretrained World Models

Learning visuomotor policies via imitation has proven effective across a wide range of robotic domains. However, the performance of these policies is heavily dependent on the number of training demonstrations, which requires expensive data collection in the real world. In this work, we aim to reduce data collection efforts when learning visuomotor robot policies by leveraging existing or cost-effective data from a wide range of embodiments, such as public robot datasets and the datasets of humans playing with objects (human data from play). Our approach leverages two key insights. First, we use optic flow as an embodiment-agnostic action representation to train a World Model (WM) across multi-embodiment datasets, and finetune it on a small amount of robot data from the target embodiment. Second, we develop a method, Latent Policy Steering (LPS), to improve the output of a behavior-cloned policy by searching in the latent space of the WM for better action sequences. In real world experiments, we observe significant improvements in the performance of policies trained with a small amount of data (over 50% relative improvement with 30 demonstrations and over 20% relative improvement with 50 demonstrations) by combining the policy with a WM pretrained on two thousand episodes sampled from the existing Open X-embodiment dataset across different robots or a cost-effective human dataset from play.

中文摘要

通过模仿学习视觉运动策略已被证明在广泛的机器人领域是有效的。然而,这些策略的性能在很大程度上取决于训练演示的数量,这需要在现实世界中进行昂贵的数据收集。在这项工作中,我们的目标是通过利用来自多种具身形态的现有或低成本数据(例如公共机器人数据集以及人类摆弄物体的数据集,即来自游玩的人类数据)来减少学习视觉运动机器人策略时的数据收集工作。我们的方法利用了两个关键见解。首先,我们使用光流作为与具身形态无关的动作表示,在多具身形态数据集上训练世界模型(WM),并在目标具身形态的少量机器人数据上对其进行微调。其次,我们开发了一种方法,即潜在策略引导(LPS),通过在WM的潜在空间中搜索更好的动作序列来改进行为克隆策略的输出。在现实世界的实验中,我们观察到,将用少量数据训练的策略与一个WM相结合后,策略性能显著提高(30次演示时相对提升超过50%,50次演示时相对提升超过20%);该WM预先在从现有Open X-Embodiment数据集中跨不同机器人采样的两千个回合上,或在低成本的人类游玩数据集上进行了预训练。
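
One simple way to realize "search the world model's latent space for better action sequences" is a random-shooting search around the behavior-cloned plan, scored by the world model; the sketch below illustrates that generic pattern with numpy stubs. The actual LPS procedure, optic-flow action representation, and pretrained world model are not reproduced here.

```python
# Generic random-shooting sketch: perturb the behavior-cloned plan and keep the
# candidate the world model scores highest. Policy, world model and scoring are stubs.
import numpy as np

rng = np.random.default_rng(0)

def bc_policy_plan(obs, horizon=8, action_dim=2):
    """Stand-in for the nominal action sequence of a behavior-cloned policy."""
    return np.zeros((horizon, action_dim))

def world_model_score(obs, actions):
    """Stand-in for rolling out a pretrained world model and scoring the predicted
    outcome. Here: prefer plans whose net displacement reaches the point (1, 1)."""
    return -np.linalg.norm(actions.sum(axis=0) - np.array([1.0, 1.0]))

def latent_policy_steering(obs, n_candidates=64, noise_scale=0.2):
    base = bc_policy_plan(obs)
    candidates = base + noise_scale * rng.standard_normal((n_candidates, *base.shape))
    scores = np.array([world_model_score(obs, c) for c in candidates])
    return candidates[int(scores.argmax())]

best_plan = latent_policy_steering(obs=None)
print(best_plan.shape, best_plan.sum(axis=0))   # the selected plan drifts toward the goal
```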

Authors: Yiqi Wang, Mrinal Verghese, Jeff Schneider

Categories: cs.RO, cs.AI, cs.LG

PDF URL: https://arxiv.org/pdf/2507.13340v1.pdf

Published: 2025-07-17T17:57:57Z


7. Training Transformers with Enforced Lipschitz Constants

Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods — weight decay and spectral normalization — allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon’s update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^264. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.

中文摘要

神经网络通常对输入和权重扰动高度敏感。这种敏感性与诸如易受对抗性例子、发散训练和过拟合等病理学有关。为了解决这些问题,过去的研究着眼于完全从Lipschitz组件构建神经网络。然而,这些技术还没有成熟到研究人员已经训练出一种现代架构的程度,例如在初始化后强制执行Lipschitz证书的变压器。为了探索这一差距,我们首先开发和基准测试新的、计算高效的工具,用于维护范数约束的权重矩阵。应用这些工具,我们能够在整个训练过程中强制执行Lipschitz边界来训练变压器模型。我们发现优化器动态很重要:从AdamW切换到Muon改进了标准方法——权重衰减和谱归一化——使模型能够在较低的Lipschitz界限下达到相同的性能。受Muon具有固定谱范数的更新的启发,我们共同设计了一种权重约束方法,改善了MLP和2M参数变换器上Lipschitz与性能的权衡。我们对莎士比亚文本的2-Lipschitz变换器达到了60%的验证准确率。扩展到145M参数,我们的10 Lipschitz转换器在互联网文本上的准确率达到21%。然而,为了与NanoGPT基线验证准确率39.4%相匹配,我们的Lipschitz上限增加到10^264。尽管如此,我们的Lipschitz变压器在没有层范数、QK范数和logit-tanh软封顶等稳定性措施的情况下进行训练。
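
As background for the norm-constrained weight matrices mentioned above, the following is a standard spectral-norm projection: estimate a weight matrix's largest singular value with power iteration and rescale so the layer's Lipschitz constant stays below a chosen bound. This is the generic technique, not the paper's Muon-co-designed constraint method. Assumes torch.

```python
# Standard spectral-norm projection via power iteration (generic background code).
import torch

@torch.no_grad()
def project_spectral_norm(weight: torch.Tensor, max_norm: float = 1.0, iters: int = 20):
    """Rescale `weight` in place so that sigma_max(weight) <= max_norm."""
    v = torch.randn(weight.shape[1])
    for _ in range(iters):                               # power iteration on W^T W
        u = torch.nn.functional.normalize(weight @ v, dim=0)
        v = torch.nn.functional.normalize(weight.t() @ u, dim=0)
    sigma = torch.dot(u, weight @ v)                     # estimated top singular value
    if sigma > max_norm:
        weight.mul_(max_norm / sigma)
    return weight

W = torch.randn(256, 128)
project_spectral_norm(W, max_norm=1.0)
print(torch.linalg.matrix_norm(W, ord=2))                # roughly <= 1.0 after projection
```

In a Lipschitz-constrained training loop, a projection of this kind would be applied to each weight matrix after every optimizer step, and the product of the per-layer bounds (for 1-Lipschitz activations) yields the certificate for the whole network.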

Authors: Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, Phillip Isola

Categories: cs.LG

PDF URL: https://arxiv.org/pdf/2507.13338v1.pdf

Published: 2025-07-17T17:55:00Z


8. FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human — or superhuman — expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly expressive framework of Monadic Second-Order (MSO) logic on graphs, paving the way toward automatic problem generation at scale; ideal for building RL environments. Third, many of our problems are intimately related to the frontier of theoretical computer science, and to central conjectures therein, such as the Strong Exponential Time Hypothesis (SETH). As such, any significant algorithmic progress on our dataset, beyond known results, could carry profound theoretical implications. Remarkably, state-of-the-art models like OpenAI’s o3 fail entirely on FormulaOne, solving less than 1% of the questions, even when given 10 attempts and explanatory fewshot examples — highlighting how far they remain from expert-level understanding in some domains. To support further research, we additionally curate FormulaOne-Warmup, offering a set of simpler tasks, from the same distribution. We release the full corpus along with a comprehensive evaluation framework.

中文摘要

前沿人工智能模型展示了强大的知识广度。但是,它们离真正的人类或超人的专业知识有多近呢?真正的专家可以解决最困难的问题,突破科学理解的界限。为了阐明前沿模型能力的局限性,我们避开了人为的竞争性编程难题,而是专注于现实生活中的研究问题。我们构建了公式一,这是一个位于图论、逻辑和算法交叉点的基准,所有这些都在前沿模型的训练分布范围内。我们的问题要求极高,需要一系列推理步骤。数据集有三个关键属性。首先,它具有商业利益,涉及实际的大规模优化问题,例如路由、调度和网络设计中出现的问题。其次,它是从图上的一元二阶(MSO)逻辑的高度表达性框架中生成的,为大规模自动生成问题铺平了道路;非常适合构建RL环境。第三,我们的许多问题都与理论计算机科学的前沿以及其中的核心猜想密切相关,例如强指数时间假说(SETH)。因此,我们数据集上任何重大的算法进步,除了已知的结果之外,都可能带来深远的理论意义。值得注意的是,像OpenAI的o3这样的最先进的模型在FormulaOne上完全失败了,即使给出了10次尝试和解释性的少数例子,也只解决了不到1%的问题,这突显了它们在某些领域距离专家级理解还有多远。为了支持进一步的研究,我们还策划了Formula One Warmup,提供了一组来自同一发行版的更简单的任务。我们发布了完整的语料库以及一个全面的评估框架。

Authors: Gal Beniamini, Yuval Dor, Alon Vinnikov, Shir Granot Peled, Or Weinstein, Or Sharir, Noam Wies, Tomer Nussbaum, Ido Ben Shaul, Tomer Zekharya, Yoav Levine, Shai Shalev-Shwartz, Amnon Shashua

Categories: cs.AI, cs.CC, math.LO

PDF URL: https://arxiv.org/pdf/2507.13337v1.pdf

Published: 2025-07-17T17:53:55Z


9. Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond “common sense”, rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.

中文摘要

幽默作为一种复杂的语言形式,源于生活的方方面面,而现有的计算幽默研究几乎只关注基于双关语的简短笑话。在这项工作中,我们研究了大型语言模型(LLMs)解释幽默的能力是否取决于特定的幽默形式。我们比较了简单双关语和需要了解现实世界实体和事件的更复杂的主题幽默的模型。在此过程中,我们整理了一个包含600个笑话的数据集,分为4种笑话类型,并手动编写高质量的解释。这些笑话包括异性恋和同性恋双关语、当代网络幽默和时事笑话,在这些笑话中,理解依赖于超越“常识”的推理,而不是植根于有关新闻事件和流行文化的世界知识。使用这个数据集,我们比较了一系列LLM准确、全面地解释不同类型笑话的零样本能力,确定了幽默解释任务中的关键研究空白。我们发现,没有一个测试模型(股份有限公司推理模型)能够可靠地生成所有笑话类型的充分解释,这进一步突出了大多数计算幽默作品对过于简单的笑话形式的狭隘关注。

Authors: Tyler Loakman, William Thorne, Chenghua Lin

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.13335v1.pdf

Published: 2025-07-17T17:51:20Z


10. A Survey of Context Engineering for Large Language Models

The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

中文摘要

大型语言模型(LLM)的性能从根本上取决于推理过程中提供的上下文信息。本调查介绍了上下文工程,这是一门超越简单提示设计的正式学科,涵盖了LLM信息有效载荷的系统优化。我们提出了一个全面的分类法,将上下文工程分解为其基本组件以及将它们集成到智能系统中的复杂实现。我们首先研究了基础组件:上下文检索和生成、上下文处理和上下文管理。然后,我们探索这些组件是如何在架构上集成以创建复杂的系统实现的:检索增强生成(RAG)、存储系统和工具集成推理以及多代理系统。通过对1300多篇研究论文的系统分析,我们的调查不仅为该领域建立了技术路线图,还揭示了一个关键的研究差距:模型能力之间存在根本的不对称。虽然当前的模型在高级上下文工程的增强下,在理解复杂上下文方面表现出了非凡的能力,但在生成同样复杂的长篇输出方面却存在明显的局限性。解决这一差距是未来研究的首要任务。最终,这项调查为研究人员和工程师推进情境感知人工智能提供了一个统一的框架。

Authors: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.13334v1.pdf

Published: 2025-07-17T17:50:36Z


Evaluation Domain Papers

1. Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark

The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluations. Most notably, the benchmark utilizes different image indices from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching processes to obtain accurate ground-truth (GT) annotations for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, collectively undermining reliable evaluations across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release these refined annotations as an open-source resource, thereby promoting consistent quantitative evaluations and facilitating future advancements in human pose-aware multimodal reasoning.

中文摘要

基于推理的姿态估计(RPE)基准已成为姿态感知多模态大型语言模型(MLLM)广泛采用的评估标准。尽管具有重要意义,但我们发现了阻碍公平和一致的定量评估的关键再现性和基准质量问题。最值得注意的是,该基准使用了与原始3DPW数据集不同的图像索引,迫使研究人员进行繁琐且容易出错的手动匹配过程,以获得定量指标(如MPJPE、PA-MPJPE)的准确地面实况(GT)注释。此外,我们的分析揭示了几个固有的基准质量限制,包括显著的图像冗余、场景不平衡、过于简单的姿势和模糊的文本描述,共同破坏了对不同场景的可靠评估。为了减轻人工劳动并提高可重复性,我们通过细致的视觉匹配仔细改进了GT注释,并将这些改进后的注释作为开源资源公开发布,从而促进了一致的定量评估,并促进了人类姿势感知多模态推理的未来发展。
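
For reference, the two quantitative metrics named above have standard definitions, sketched below in numpy (generic reference code, not the benchmark's official evaluation script). MPJPE is the mean Euclidean distance between predicted and ground-truth joints; PA-MPJPE computes the same error after a Procrustes (similarity) alignment of the prediction to the ground truth.

```python
# Reference implementations of MPJPE and PA-MPJPE on (J, 3) joint arrays.
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    p = pred - pred.mean(axis=0)
    g = gt - gt.mean(axis=0)
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                            # optimal rotation
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + gt.mean(axis=0)
    return mpjpe(aligned, gt)

gt = np.random.rand(24, 3) * 1000                 # e.g. 24 joints in millimetres
pred = gt + np.random.randn(24, 3) * 30
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))        # the aligned error is typically lower
```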

Authors: Junsu Kim, Naeun Kim, Jaeho Lee, Incheol Park, Dongyoon Han, Seungryul Baek

Categories: cs.CV, cs.AI

PDF URL: https://arxiv.org/pdf/2507.13314v1.pdf

Published: 2025-07-17T17:33:11Z


2. The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

The evaluation of large language models is a complex task, for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues, as the large and growing number of models to evaluate makes it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. Another alternative is the use of public arenas, such as the popular LM Arena, on which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption; therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model into the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.

中文摘要

大型语言模型的评估是一项复杂的任务,已经提出了几种方法。最常见的是使用自动基准测试,法学硕士必须回答不同主题的多项选择题。然而,这种方法有一定的局限性,最令人担忧的是与人类的相关性较差。另一种方法是让人类评估LLM。这带来了可扩展性问题,因为有大量且不断增长的模型需要评估,因此基于招募多名评估人员并让他们对模型的反应进行排名来运行传统研究是不切实际的(而且成本很高)。另一种方法是使用公共场所,如流行的LM场所,任何用户都可以在其中自由评估任何问题的模型,并对两个模型的回答进行排名。然后将结果细化为模型排名。LLM的一个越来越重要的方面是它们的能量消耗,因此,评估能量意识如何影响人类选择模型的决策是有意义的。在本文中,我们提出了GEA,即生成能源竞技场,这是一个在评估过程中整合模型能耗信息的竞技场。还介绍了GEA的初步结果,表明对于大多数问题,当用户意识到能耗时,他们更喜欢更小、更节能的模型。这表明,对于大多数用户交互,更复杂和性能最高的模型所产生的额外成本和能量并没有提高感知到的响应质量,从而证明其使用是合理的。
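
Arena-style leaderboards are commonly built by aggregating pairwise user votes with an Elo-type update; the abstract does not specify GEA's exact aggregation, so the sketch below only illustrates that standard scheme, with the energy information assumed to be shown alongside each model's reply.

```python
# Generic Elo-style aggregation of pairwise arena votes (illustrative only).
from collections import defaultdict

def elo_update(ratings, model_a, model_b, winner, k=32.0):
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    score_a = 1.0 if winner == model_a else 0.0
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

ratings = defaultdict(lambda: 1000.0)
# Each vote: (model_a, model_b, model preferred by the user). In an energy-aware
# arena, the interface would additionally display each model's energy cost per reply.
votes = [("small-efficient", "large-flagship", "small-efficient"),
         ("small-efficient", "large-flagship", "large-flagship"),
         ("small-efficient", "large-flagship", "small-efficient")]
for a, b, w in votes:
    elo_update(ratings, a, b, w)
print(dict(ratings))
```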

Authors: Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego

Categories: cs.AI, cs.CL

PDF URL: https://arxiv.org/pdf/2507.13302v1.pdf

Published: 2025-07-17T17:11:14Z


3. AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.

中文摘要

我们介绍了AbGen,这是第一个旨在评估LLM在设计科学研究消融研究方面能力的基准。AbGen由来自807篇NLP论文的1500个专家注释示例组成。在此基准测试中,LLM的任务是根据给定的研究背景,为指定的模块或过程生成详细的消融研究设计。我们对DeepSeek-R1-0528和o4-mini等领先LLM的评估突显了这些模型与人类专家在消融研究设计的重要性、可信度和合理性方面存在显著的性能差距。此外,我们证明,目前的自动评估方法对我们的任务来说并不可靠,因为与人工评估相比,它们显示出明显的差异。为了更好地研究这一点,我们开发了AbGen Eval,这是一个元评估基准,旨在评估常用自动评估系统在测量LLM任务绩效方面的可靠性。我们在AbGen Eval上研究了各种LLM作为法官系统,为未来为复杂科学任务开发更有效、更可靠的基于LLM的评估系统的研究提供了见解。
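
The meta-evaluation step described above boils down to measuring how well automated judge scores track human judgments across a set of outputs. A minimal sketch of that computation, with toy scores rather than AbGen-Eval data, is given below (assumes scipy).

```python
# Toy sketch of judge-vs-human agreement; the scores below are made up.
from scipy.stats import pearsonr, spearmanr

human_scores = [4.5, 3.0, 2.0, 4.0, 1.5]   # e.g. expert ratings of ablation designs
judge_scores = [4.0, 3.5, 3.0, 3.5, 2.5]   # scores from an automated evaluator

print("Pearson r:", round(pearsonr(human_scores, judge_scores)[0], 3))
print("Spearman rho:", round(spearmanr(human_scores, judge_scores)[0], 3))
# Low correlations indicate the automated evaluator is unreliable for the task.
```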

Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.13300v1.pdf

Published: 2025-07-17T17:09:22Z


4. Evaluating Reinforcement Learning Algorithms for Navigation in Simulated Robotic Quadrupeds: A Comparative Study Inspired by Guide Dog Behaviour

Robots are increasingly integrated across industries, particularly in healthcare. However, many valuable applications for quadrupedal robots remain overlooked. This research explores the effectiveness of three reinforcement learning algorithms in training a simulated quadruped robot for autonomous navigation and obstacle avoidance. The goal is to develop a robotic guide dog simulation capable of path following and obstacle avoidance, with long-term potential for real-world assistance to guide dogs and visually impaired individuals. It also seeks to expand research into medical ‘pets’, including robotic guide and alert dogs. A comparative analysis of thirteen related research papers shaped key evaluation criteria, including collision detection, pathfinding algorithms, sensor usage, robot type, and simulation platforms. The study focuses on sensor inputs, collision frequency, reward signals, and learning progression to determine which algorithm best supports robotic navigation in complex environments. Custom-made environments were used to ensure fair evaluation of all three algorithms under controlled conditions, allowing consistent data collection. Results show that Proximal Policy Optimization (PPO) outperformed Deep Q-Network (DQN) and Q-learning across all metrics, particularly in average and median steps to goal per episode. By analysing these results, this study contributes to robotic navigation, AI and medical robotics, offering insights into the feasibility of AI-driven quadruped mobility and its role in assistive robotics.

中文摘要

机器人越来越多地融入各个行业,特别是在医疗保健领域。然而,四足机器人的许多有价值的应用仍然被忽视。本研究探讨了三种强化学习算法在训练模拟四足机器人自主导航和避障方面的有效性。目标是开发一种能够进行路径跟踪和避障的机器人导盲犬模拟,具有在现实世界中为导盲犬和视障人士提供长期帮助的潜力。它还寻求扩大对医疗“宠物”的研究,包括机器人引导和警戒犬。对13篇相关研究论文的比较分析形成了关键评估标准,包括碰撞检测、寻路算法、传感器使用、机器人类型和仿真平台。该研究侧重于传感器输入、碰撞频率、奖励信号和学习进度,以确定哪种算法最能支持复杂环境中的机器人导航。定制环境用于确保在受控条件下对所有三种算法进行公平评估,从而实现一致的数据收集。结果表明,近端策略优化(PPO)在所有指标上都优于深度Q网络(DQN)和Q学习,特别是在每集目标的平均和中值步数方面。通过分析这些结果,这项研究为机器人导航、人工智能和医疗机器人技术做出了贡献,为人工智能驱动的四足动物移动的可行性及其在辅助机器人中的作用提供了见解。
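
For reference, the tabular Q-learning baseline compared in the study uses the standard update Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)); the sketch below runs it on a toy corridor environment rather than the paper's simulated quadruped.

```python
# Tabular Q-learning on a toy corridor (illustrative baseline, not the paper's setup).
import random
from collections import defaultdict

def q_learning(env_step, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < epsilon:                      # epsilon-greedy exploration
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s2, r, done = env_step(s, a)
            best_next = max(Q[(s2, act)] for act in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next * (not done) - Q[(s, a)])
            s = s2
    return Q

# Toy corridor: action 1 moves right, action 0 moves left; reaching state 5 gives +1.
def corridor(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), 5)
    return s2, (1.0 if s2 == 5 else -0.01), s2 == 5

Q = q_learning(corridor, n_actions=2)
print(max(range(2), key=lambda act: Q[(0, act)]))   # learned greedy action at the start: 1
```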

Authors: Emma M. A. Harrison

Categories: cs.RO, cs.AI, cs.LG

PDF URL: https://arxiv.org/pdf/2507.13277v1.pdf

Published: 2025-07-17T16:38:14Z


5. HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

Analogies test a model’s ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

中文摘要

类比测试模型推断概念之间隐含关系的能力,使其成为评估推理能力的关键基准。虽然大型语言模型(LLM)在英语推理方面得到了广泛的评估,但它们在印度语中的能力仍然没有得到充分的研究,这限制了我们对这些模型是否跨语言泛化的理解。为了解决这一差距,我们引入了一个新的印地语类比测试集(HATS),其中包括405道来自印度政府考试的多项选择题。我们使用各种提示策略对最先进的多语言LLM进行基准测试,并引入了一种基于类比推理认知理论的思维链方法。这种方法提高了印地语类比问题的模型性能。我们的实验表明,无论采用何种提示策略,模型在英语提示下表现最佳。我们的测试集解决了缺乏评估印地语LLM推理能力的关键资源的问题。
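
A grounded chain-of-thought prompt for an analogy item can be assembled as in the sketch below. The instruction wording and the sample Hindi item are illustrative assumptions, not the paper's actual templates or test questions.

```python
# Sketch of assembling a grounded chain-of-thought prompt for an analogy MCQ.
def grounded_cot_prompt(a, b, c, options):
    steps = ("1. State the relation between the first pair ({a} : {b}).\n"
             "2. Apply the same relation to {c}.\n"
             "3. Check each option against that relation and pick the best match.").format(a=a, b=b, c=c)
    opts = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (f"Solve the analogy {a} : {b} :: {c} : ?\n"
            f"Reason step by step:\n{steps}\n\nOptions:\n{opts}\nAnswer:")

print(grounded_cot_prompt("पक्षी", "घोंसला", "मधुमक्खी",
                          ["छत्ता", "जाल", "बिल", "गुफा"]))
```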

Authors: Ashray Gupta, Rohan Joseph, Sunny Rai

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.13238v1.pdf

Published: 2025-07-17T15:47:49Z


6. SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues, e.g. SWE-bench reports 32.67% of successful patches involve direct solution leakage and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power in state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.

中文摘要

大型语言模型(LLM)在软件工程中的快速发展揭示了现有基准测试的关键局限性,特别是广泛使用的SWE-bench数据集。最近的研究发现了严重的数据污染问题,例如SWE-bench报告称,32.67%的成功补丁涉及直接的解决方案泄漏,31.08%的补丁因测试用例不足而通过。我们介绍了SWE-MERA,这是一个动态的、不断更新的基准测试,旨在通过自动收集真实世界的GitHub问题和严格的质量验证来解决这些基本挑战。我们的方法实现了一个可靠的管道,在确保质量的同时最大限度地降低污染风险,从而产生了大约10000个潜在任务,目前有300个样本可用。使用Aider编码代理进行的评估表明,该基准对最先进的模型具有很强的判别能力。我们报告了十几个最新LLM在2024年9月至2025年6月期间收集的任务上的评估性能。

Authors: Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh

Categories: cs.SE, cs.AI, cs.CL

PDF URL: https://arxiv.org/pdf/2507.11059v2.pdf

Published: 2025-07-15T07:52:33Z


7. Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

Despite the outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluate the object-related hallucinations. However, the potential hallucination on the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy that, we design a unified framework to measure the object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations via (object, relation, object) triplets extracted from LVLMs’ responses, making it easily generalizable to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE.

中文摘要

尽管在视觉语言推理方面表现出色,但大型视觉语言模型(LVLM)可能会生成给定图像中不存在的幻觉内容。大多数现有的LVLM幻觉基准仅限于评估与物体相关的幻觉。然而,关于两个对象之间关系的潜在幻觉,即关系幻觉,仍然缺乏研究。为了弥补这一点,我们设计了一个统一的框架来同时测量LVLM中的对象幻觉和关系幻觉。我们框架的核心思想是通过从LVLM的回复中提取的(对象、关系、对象)三元组来评估幻觉,使其易于推广到不同的视觉语言任务。基于我们的框架,我们进一步提出了Tri-HE,这是一个新的三元组级幻觉评估基准,可用于同时研究对象幻觉和关系幻觉。通过在Tri-HE上的全面评估,我们观察到在现有LVLM中,关系幻觉问题甚至比对象幻觉更严重,凸显了一个此前被忽视的、关乎LVLM可靠性的问题。此外,基于我们的研究结果,我们设计了一种简单的免训练方法,有效地缓解了LVLM的幻觉。我们用于复现实验的数据集和代码已公开发布于:https://github.com/wujunjie1998/Tri-HE.
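The triplet-level idea can be illustrated with a small sketch (a simplification, not the Tri-HE code): compare the (object, relation, object) triplets extracted from a response against a reference set and treat unmatched triplets as hallucinated; extraction and matching are reduced here to exact string equality.

```python
# Minimal sketch (assumption, not the Tri-HE implementation): score a response by the
# share of its (object, relation, object) triplets that are not supported by a
# reference triplet set for the image.
from typing import Set, Tuple

Triplet = Tuple[str, str, str]  # (subject object, relation, target object)

def hallucination_rate(predicted: Set[Triplet], reference: Set[Triplet]) -> float:
    """Fraction of predicted triplets with no exact match in the reference set."""
    if not predicted:
        return 0.0
    unsupported = {t for t in predicted if t not in reference}
    return len(unsupported) / len(predicted)

# Hypothetical example: the relation "riding" is unsupported, so 1 of 2 triplets counts
# as hallucinated.
reference = {("man", "standing next to", "bicycle"), ("bicycle", "parked on", "street")}
predicted = {("man", "riding", "bicycle"), ("bicycle", "parked on", "street")}
print(hallucination_rate(predicted, reference))  # 0.5
```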

Authors: Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung

Categories: cs.CV, cs.AI, cs.CL, cs.LG

PDF URL: https://arxiv.org/pdf/2410.23114v4.pdf

Published: 2024-10-30T15:25:06Z


The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.

中文摘要

静态基准与现实世界法律实践的动态性质之间的差距,是推进法律智能的关键障碍。为此,我们介绍了J1-ENVS,这是第一个为基于LLM的智能体量身定制的交互式动态法律环境。在法律专家的指导下,它包含来自中国法律实践的六个代表性场景,涵盖三个环境复杂性层次。我们进一步介绍了J1-EVAL,这是一个细粒度的评估框架,旨在评估不同法律熟练程度下的任务表现和程序合规性。对17个LLM智能体的广泛实验表明,虽然许多模型展示了扎实的法律知识,但它们在动态环境中难以完成程序性执行。即使是SOTA模型GPT-4o,其整体表现也不足60%。这些发现凸显了实现动态法律智能的持续挑战,并为指导未来研究提供了宝贵的见解。

Authors: Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Yun Song, Zhongyu Wei

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2507.04037v2.pdf

Published: 2025-07-05T13:31:21Z


9. MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.

中文摘要

LLM的进步增强了软件工程中的任务自动化;然而,目前的评估主要集中在自然语言任务上,忽略了代码质量。大多数基准测试将高级推理置于可执行代码和实际性能之上,在理解生产中与这些模型相关的真实功能和风险方面存在差距。为了解决这个问题,我们提出了MERA代码,这是MERA基准系列的新成员,专门用于评估俄语最新代码生成LLM的代码。该基准包括跨越8种编程语言的11个评估任务。我们提出的评估方法具有一个分类法,概述了模型完成这些任务所需的实用编码技能。该基准包括一个供用户进行MERA评估的开源代码库、一个与各种编程环境兼容的评分系统,以及一个具有排行榜和提交系统的平台。我们评估开放式LLM和前沿API模型,分析它们在非英语语言的实际编码任务方面的局限性。我们正在公开发布MERA,以指导未来的研究,预测模型开发中的突破性特征,并规范评估程序。
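One common way to realize execution-based code scoring of the kind described above is to run a candidate solution together with its tests in a subprocess and record pass/fail per task; the sketch below is an assumed illustration in Python, not the MERA Code scoring system.

```python
# Minimal sketch (assumption, not the MERA Code scorer): execution-based check that
# a generated Python function passes its unit tests, yielding a pass/fail per task.
import subprocess
import sys
import tempfile
import textwrap

def passes_tests(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run candidate solution plus tests in a subprocess; True if the exit code is 0."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Hypothetical task: the model's code and the hidden tests are both placeholders.
candidate = "def add(a, b):\n    return a + b\n"
tests = textwrap.dedent("""
    assert add(1, 2) == 3
    assert add(-1, 1) == 0
""")
print(passes_tests(candidate, tests))  # True
```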

Authors: Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Adamenko Pavel, Rodion Levichev, Dmitrii Vorobev, Dmitrii Salikhov, Aidar Valeev, Alena Pestova, Maria Dziuba, Ilseyar Alimova, Artem Zavgorodnev, Aleksandr Medvedev, Stanislav Moiseev, Elena Bruches, Daniil Grebenkin, Roman Derunets, Vikulov Vladimir, Anton Emelyanov, Dmitrii Babaev, Vladimir V. Ivanov, Valentin Malykh, Alena Fenogenova

Categories: cs.SE, cs.AI, cs.CL

PDF URL: https://arxiv.org/pdf/2507.12284v2.pdf

Published: 2025-07-16T14:31:33Z


10. Benchmarking Sub-Genre Classification For Mainstage Dance Music

Music classification, a cornerstone of music information retrieval, supports a wide array of applications. To address the lack of comprehensive datasets and effective methods for sub-genre classification in mainstage dance music, we introduce a novel benchmark featuring a new dataset and baseline. Our dataset expands the scope of sub-genres to reflect the diversity of recent mainstage live sets performed by leading DJs at global music festivals, capturing the vibrant and rapidly evolving electronic dance music (EDM) scene that engages millions of fans worldwide. We employ a continuous soft labeling approach to accommodate tracks blending multiple sub-genres, preserving their inherent complexity. Experiments demonstrate that even state-of-the-art multimodal large language models (MLLMs) struggle with this task, while our specialized baseline models achieve high accuracy. This benchmark supports applications such as music recommendation, DJ set curation, and interactive multimedia systems, with video demos provided. Our code and data are all open-sourced at https://github.com/Gariscat/housex-v2.git.

中文摘要

音乐分类是音乐信息检索的基石,支持广泛的应用。为了解决主舞台舞蹈音乐中缺乏全面数据集和有效子流派分类方法的问题,我们引入了一种新的基准,该基准具有新的数据集和基线。我们的数据集扩展了子流派的范围,以反映全球音乐节上领先DJ最近在主舞台现场表演的多样性,捕捉到了全球数百万粉丝参与的充满活力和快速发展的电子舞曲(EDM)场景。我们采用连续的软标签方法来适应混合多个子流派的曲目,保持其固有的复杂性。实验表明,即使是最先进的多模态大型语言模型(MLLM)也难以完成这项任务,而我们的专用基线模型却能达到很高的精度。该基准测试支持音乐推荐、DJ集策划和交互式多媒体系统等应用程序,并提供视频演示。我们的代码和数据均已开源于 https://github.com/Gariscat/housex-v2.git 。
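Continuous soft labels of the kind described above can be trained against directly, for example with cross-entropy over a target probability distribution rather than a single hard class; the sketch below is an assumed illustration in PyTorch, not the paper's baseline, and all dimensions and label values are placeholders.

```python
# Minimal sketch (assumption, not the paper's baseline): train a classifier against
# continuous soft sub-genre labels by minimizing cross-entropy with a probability
# distribution as the target instead of a single hard class.
import torch
import torch.nn.functional as F

NUM_SUBGENRES = 8   # placeholder number of sub-genres
FEATURE_DIM = 128   # placeholder audio-embedding size

model = torch.nn.Linear(FEATURE_DIM, NUM_SUBGENRES)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(4, FEATURE_DIM)                               # stand-in audio features
soft_labels = torch.softmax(torch.randn(4, NUM_SUBGENRES), dim=-1)   # e.g. 0.6 house / 0.4 techno

logits = model(features)
loss = F.cross_entropy(logits, soft_labels)  # PyTorch accepts probability targets (>=1.10)
loss.backward()
optimizer.step()
print(float(loss))
```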

Authors: Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li

Categories: cs.SD, cs.AI, cs.MM, H.5.5; I.2.1

PDF URL: https://arxiv.org/pdf/2409.06690v2.pdf

Published: 2024-09-10T17:54:00Z