数据来源: ArXiv Domain

LLM Domain Papers

1. Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics

Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web browsing capabilities, enabling real-time information retrieval and multi-step reasoning over live web content. While prior studies have demonstrated LLMs' ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web-browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity. While this capability holds promise for computational social science in the post-API era, it also raises risks of misuse, particularly in information operations and targeted advertising, underscoring the need for safeguards. We recommend that LLM providers restrict this capability in public-facing applications, while preserving controlled access for verified research purposes.

中文摘要

大型语言模型(LLM)传统上依赖静态训练数据,其知识被限制在固定的快照上。然而,最近的进展为LLM配备了网络浏览功能,实现了实时信息检索和对实时网络内容的多步推理。虽然之前的研究已经证明了LLM访问和分析网站的能力,但它们直接检索和分析社交媒体数据的能力仍未得到探索。在这里,我们评估具备网络浏览能力的LLM是否可以仅凭用户名推断社交媒体用户的人口统计属性。使用48个X(推特)账户的合成数据集和1384名国际参与者的调查数据集,我们表明这些模型可以访问社交媒体内容,并以合理的准确性预测用户的人口统计数据。对合成数据集的分析进一步揭示了LLM如何解析和解释社交媒体资料,这可能会对活动极少的账户引入性别和政治偏见。虽然这种能力为后API时代的计算社会科学带来了希望,但它也带来了滥用风险,尤其是在信息操作和定向广告方面,凸显了建立保障措施的必要性。我们建议LLM提供商在面向公众的应用程序中限制此功能,同时为经过验证的研究目的保留受控访问。

Authors: Meysam Alizadeh, Fabrizio Gilardi, Zeynab Samei, Mohsen Mosleh

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.12372v1.pdf

Published: 2025-07-16T16:21:01Z


2. Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate

Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, contributing to more natural interactions with complex systems. However, they face challenges such as ambiguity in user requests. To address these challenges, this paper introduces and evaluates a multi-agent debate framework designed to enhance ambiguity detection and resolution capabilities beyond those of single models. The framework consists of three LLM architectures (Llama3-8B, Gemma2-9B, and Mistral-7B variants) and a dataset with diverse ambiguities. The debate framework markedly enhanced the performance of Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus. While models responded differently to collaborative strategies, these findings underscore the debate framework's value as a targeted method for augmenting LLM capabilities. This work offers important insights for developing more robust and adaptive language understanding systems by showing how structured debates can lead to improved clarity in interactive systems.

中文摘要

大型语言模型(LLM)在理解和生成人类语言方面表现出了显著的能力,有助于与复杂系统进行更自然的交互。然而,它们面临着用户请求存在歧义等挑战。为了应对这些挑战,本文介绍并评估了一个多智能体辩论框架,旨在实现超越单个模型的歧义检测和消解能力。该框架由三种LLM架构(Llama3-8B、Gemma2-9B和Mistral-7B变体)和一个包含多种歧义的数据集组成。辩论框架显著提高了Llama3-8B和Mistral-7B变体相对于各自基线的表现,由Mistral-7B主导的辩论取得了76.7%的显著成功率,并被证明对复杂歧义和高效达成共识特别有效。尽管不同模型对协作策略的反应各异,但这些发现强调了辩论框架作为增强LLM能力的有针对性方法的价值。这项工作通过展示结构化辩论如何提高交互系统的清晰度,为开发更强大和自适应的语言理解系统提供了重要的见解。

Authors: Ana Davila, Jacinto Colan, Yasuhisa Hasegawa

Categories: cs.CL, cs.HC

PDF URL: https://arxiv.org/pdf/2507.12370v1.pdf

Published: 2025-07-16T16:15:25Z
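As a rough illustration of the multi-agent debate loop the abstract describes, here is a minimal sketch. `query_model` is a hypothetical inference call (not from the paper), and the prompts and stopping rule are invented for illustration:

```python
# Minimal multi-agent debate sketch for ambiguity detection, assuming a
# hypothetical `query_model` backend for Llama3-8B / Gemma2-9B / Mistral-7B.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical LLM call; wire up a real inference backend here."""
    raise NotImplementedError

def debate(request: str, models: list[str], rounds: int = 3) -> str:
    transcript: list[tuple[str, str]] = []
    for _ in range(rounds):
        for model in models:
            prompt = (
                f"User request: {request}\n"
                f"Debate so far: {transcript}\n"
                "Is the request ambiguous? If so, state the ambiguity and a "
                "clarifying question; otherwise reply AGREE with the consensus."
            )
            transcript.append((model, query_model(model, prompt)))
        # stop early once every debater agrees in the latest round
        if all("AGREE" in answer for _, answer in transcript[-len(models):]):
            break
    # the lead model (first in the list) summarizes the consensus
    return query_model(models[0],
                       f"Summarize the consensus of this debate: {transcript}")
```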


3. Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different types supplies missing premises and expands context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.

中文摘要

检索增强生成(RAG)通过注入外部知识来提高大型语言模型(LLM)的真实性,但它在需要多步推理的问题上存在不足;相反,纯粹以推理为导向的方法往往会产生幻觉或误置事实依据。这项综述在统一的推理-检索视角下综合了这两条研究脉络。我们首先梳理了高级推理如何优化RAG的每个阶段(推理增强RAG)。然后,我们展示了检索到的不同类型的知识如何为复杂推理提供缺失的前提并扩展上下文(RAG增强推理)。最后,我们重点介绍新兴的协同RAG-推理框架,其中(代理式)LLM迭代地交织搜索和推理,以在知识密集型基准测试中实现最先进的性能。我们对方法、数据集和开放性挑战进行了分类,并概述了通往更深入的RAG-推理系统的研究途径,这些系统更有效、多模态自适应、值得信赖且以人为本。该资源集可在 https://github.com/DavidZWZ/Awesome-RAG-Reasoning 获取。

Authors: Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.09477v2.pdf

Published: 2025-07-13T03:29:41Z


4. Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization

Large Language Models (LLMs) have become widely used across diverse NLP tasks and domains, demonstrating their adaptability and effectiveness. In the realm of Electronic Design Automation (EDA), LLMs show promise for tasks like Register-Transfer Level (RTL) code generation and summarization. However, despite the proliferation of LLMs for general code-related tasks, there’s a dearth of research focused on evaluating and refining these models for hardware description languages (HDLs), notably VHDL. In this study, we evaluate the performance of existing code LLMs for VHDL code generation and summarization using various metrics and two datasets — VHDL-Eval and VHDL-Xform. The latter, an in-house dataset, aims to gauge LLMs’ understanding of functionally equivalent code. Our findings reveal consistent underperformance of these models across different metrics, underscoring a significant gap in their suitability for this domain. To address this challenge, we propose Chain-of-Descriptions (CoDes), a novel approach to enhance the performance of LLMs for VHDL code generation and summarization tasks. CoDes involves generating a series of intermediate descriptive steps based on: (i) the problem statement for code generation, and (ii) the VHDL code for summarization. These steps are then integrated with the original input prompt (problem statement or code) and provided as input to the LLMs to generate the final output. Our experiments demonstrate that the CoDes approach significantly surpasses the standard prompting strategy across various metrics on both datasets. This method not only improves the quality of VHDL code generation and summarization but also serves as a framework for future research aimed at enhancing code LLMs for VHDL.

中文摘要

大型语言模型(LLM)已被广泛应用于各种NLP任务和领域,证明了其适应性和有效性。在电子设计自动化(EDA)领域,LLM在寄存器传输级(RTL)代码生成和摘要等任务上显示出潜力。然而,尽管面向一般代码相关任务的LLM大量涌现,但针对硬件描述语言(HDL),尤其是VHDL,评估和改进这些模型的研究仍然匮乏。在这项研究中,我们使用多种指标和两个数据集(VHDL-Eval和VHDL-Xform)评估了现有代码LLM在VHDL代码生成和摘要方面的性能。后者是一个内部数据集,旨在衡量LLM对功能等效代码的理解。我们的研究结果显示,这些模型在不同指标上的表现一直不佳,突显出它们在该领域的适用性存在显著差距。为了应对这一挑战,我们提出了描述链(CoDes),这是一种提高LLM在VHDL代码生成和摘要任务中性能的新方法。CoDes涉及基于以下内容生成一系列中间描述步骤:(i)用于代码生成的问题陈述,以及(ii)用于摘要的VHDL代码。然后,这些步骤与原始输入提示(问题陈述或代码)整合,作为LLM的输入以生成最终输出。我们的实验表明,CoDes方法在两个数据集的各种指标上都明显优于标准提示策略。该方法不仅提高了VHDL代码生成和摘要的质量,而且为未来旨在增强VHDL代码LLM的研究提供了框架。

Authors: Prashanth Vijayaraghavan, Apoorva Nitsure, Charles Mackin, Luyao Shi, Stefano Ambrogio, Arvind Haran, Viresh Paruthi, Ali Elzein, Dan Coops, David Beymer, Tyler Baldwin, Ehsan Degan

Categories: cs.CL, cs.AI, cs.AR

PDF URL: https://arxiv.org/pdf/2507.12308v1.pdf

Published: 2025-07-16T15:05:30Z
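A minimal sketch of the two-stage CoDes prompting flow as the abstract describes it, with a hypothetical `llm` call standing in for any code LLM; the exact wording of the step prompts is invented here:

```python
# CoDes sketch: generate intermediate descriptive steps, then fold them back
# into the original prompt for the final generation or summary.

def llm(prompt: str) -> str:
    """Hypothetical code-LLM call; replace with a real backend."""
    raise NotImplementedError

def codes_generate(problem_statement: str) -> str:
    # Stage 1: elicit intermediate descriptive steps from the problem statement.
    steps = llm("List the intermediate design steps (entity, ports, signals, "
                f"processes) for this VHDL task:\n{problem_statement}")
    # Stage 2: combine the steps with the original prompt and generate code.
    return llm(f"{problem_statement}\n\nFollow these steps:\n{steps}\n\n"
               "Now write the complete VHDL code.")

def codes_summarize(vhdl_code: str) -> str:
    # Summarization mirrors this: describe the code step by step, then
    # summarize with those descriptions in context.
    steps = llm(f"Describe, step by step, what this VHDL code does:\n{vhdl_code}")
    return llm(f"{vhdl_code}\n\nGiven these step descriptions:\n{steps}\n\n"
               "Write a concise summary.")
```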


5. Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding

Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection, content moderation, etc. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized and comprehensive benchmarks for evaluating the existing anomaly detection methods on text data limits rigorous comparison and development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (Llama-2, Llama-3, Mistral, OpenAI (small, ada, large)); (3) multi-domain text datasets (news, social media, scientific publications); (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strong low-rank characteristics in cross-model performance matrices, which enables an efficient strategy for rapid model evaluation (or embedding evaluation) and selection in practical applications. Furthermore, by open-sourcing our benchmark toolkit, which includes all embeddings from different models and the code, at https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark, this work provides a foundation for future research in robust and scalable text anomaly detection systems.

中文摘要

文本异常检测是自然语言处理(NLP)中的一项关键任务,其应用涵盖欺诈检测、错误信息识别、垃圾邮件检测和内容审核等。尽管大型语言模型(LLM)和异常检测算法取得了重大进展,但缺乏用于在文本数据上评估现有异常检测方法的标准化、全面的基准,限制了创新方法的严格比较和发展。这项工作进行了一项全面的实证研究,并引入了一个文本异常检测基准,利用来自各种预训练语言模型的嵌入,覆盖广泛的文本数据集。我们的工作通过结合(1)早期语言模型(GloVe、BERT);(2)多个LLM(Llama-2、Llama-3、Mistral、OpenAI(small、ada、large));(3)多领域文本数据集(新闻、社交媒体、科学出版物);(4)综合评价指标(AUROC、AUPRC),系统地评估了基于嵌入的文本异常检测的有效性。我们的实验揭示了一个关键的经验见解:嵌入质量显著决定异常检测效果,而基于深度学习的方法在利用LLM衍生的嵌入时,并未表现出优于传统浅层算法(如KNN、Isolation Forest)的性能优势。此外,我们在跨模型性能矩阵中观察到很强的低秩特征,这为在实际应用中快速评估(或嵌入评估)和选择模型提供了一种有效的策略。此外,通过开源我们的基准工具包(包括来自不同模型的所有嵌入和代码,地址为 https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark),这项工作为未来研究鲁棒且可扩展的文本异常检测系统奠定了基础。

Authors: Feng Xiao, Jicong Fan

Categories: cs.CL, cs.AI, cs.LG

PDF URL: https://arxiv.org/pdf/2507.12295v1.pdf

Published: 2025-07-16T14:47:41Z
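The core evaluation recipe, a shallow detector over precomputed embeddings scored with AUROC/AUPRC, can be sketched as follows; the random arrays below are placeholders for real LLM embeddings:

```python
# Embedding-based text anomaly detection sketch: shallow detector + AUROC/AUPRC.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 768))                      # embeddings of normal texts
X_test = np.vstack([rng.normal(size=(180, 768)),            # normal test texts
                    rng.normal(loc=2.0, size=(20, 768))])   # anomalous test texts
y_test = np.array([0] * 180 + [1] * 20)

detector = IsolationForest(random_state=0).fit(X_train)
scores = -detector.score_samples(X_test)                    # higher = more anomalous
print("AUROC:", roc_auc_score(y_test, scores))
print("AUPRC:", average_precision_score(y_test, scores))
```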


6. Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge

Novelty is a crucial criterion in the peer review process for evaluating academic papers. Traditionally, it is judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it is unclear whether unique citations truly measure novelty. Large language models (LLMs) possess a wealth of knowledge, while human experts possess judgment abilities that LLMs lack. Therefore, our research integrates the knowledge and abilities of LLMs and human experts to address the limitations of novelty assessment. One of the most common types of novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and an LLM to assist pretrained language models (PLMs, e.g., BERT) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer review reports and use the LLM to summarize the methodology section of the academic paper, which are then used to fine-tune PLMs. In addition, we have designed a text-guided fusion module with novel Sparse-Attention to better integrate human and LLM knowledge. We compared the proposed method with a large number of baselines. Extensive experiments demonstrate that our method achieves superior performance.

中文摘要

新颖性是同行评审过程中评估学术论文的关键标准。传统上,它由专家判断或通过独特的参考文献组合来衡量。这两种方法都有局限性:专家的知识有限,而组合方法的有效性不确定。此外,尚不清楚独特引用是否真的能衡量新颖性。大型语言模型(LLM)拥有丰富的知识,而人类专家拥有LLM所不具备的判断能力。因此,我们的研究整合了LLM和人类专家的知识和能力,以解决新颖性评估的局限性。学术论文中最常见的新颖性类型之一是引入新方法。在本文中,我们建议利用人类知识和LLM来辅助预训练语言模型(PLM,如BERT等)预测论文的方法新颖性。具体来说,我们从同行评审报告中提取与学术论文新颖性相关的句子,并使用LLM总结学术论文的方法论部分,然后用它们来微调PLM。此外,我们还设计了一个具有新颖稀疏注意力(Sparse-Attention)的文本引导融合模块,以更好地整合人类和LLM的知识。我们将所提出的方法与大量基线进行了比较。大量实验表明,我们的方法取得了优异的性能。

Authors: Wenqing Wu, Chengzhi Zhang, Yi Zhao

Categories: cs.CL, cs.AI, cs.DL, cs.HC

PDF URL: https://arxiv.org/pdf/2507.11330v2.pdf

Published: 2025-07-15T14:03:55Z


7. Measuring Spiritual Values and Bias of Large Language Models

Large language models (LLMs) have become an integral tool for users from various backgrounds. LLMs, trained on vast corpora, reflect the linguistic and cultural nuances embedded in their pre-training data. However, the values and perspectives inherent in this data can influence the behavior of LLMs, leading to potential biases. As a result, the use of LLMs in contexts involving spiritual or moral values necessitates careful consideration of these underlying biases. Our work starts with verifying our hypothesis by testing the spiritual values of popular LLMs. Experimental results show that LLMs' spiritual values are quite diverse, as opposed to the stereotype of atheists or secularists. We then investigate how different spiritual values affect LLMs in social-fairness scenarios (e.g., hate speech identification). Our findings reveal that different spiritual values indeed lead to different sensitivity to different hate target groups. Furthermore, we propose to continue pre-training LLMs on spiritual texts, and empirical results demonstrate the effectiveness of this approach in mitigating spiritual bias.

中文摘要

大型语言模型(LLM)已成为来自不同背景的用户不可或缺的工具。在庞大语料库上训练的LLM反映了其预训练数据中嵌入的语言和文化细微差别。然而,这些数据中固有的价值观和观点会影响LLM的行为,从而导致潜在的偏见。因此,在涉及精神或道德价值观的情境中使用LLM需要仔细考虑这些潜在的偏见。我们的工作始于通过测试流行LLM的精神价值观来验证我们的假设。实验结果表明,与无神论者或世俗主义者的刻板印象相反,LLM的精神价值观非常多样化。然后,我们研究了不同的精神价值观如何影响LLM在社会公平场景(例如仇恨言论识别)中的表现。我们的研究结果表明,不同的精神价值观确实会导致对不同仇恨目标群体的不同敏感性。此外,我们建议在精神类文本上继续对LLM进行预训练,实证结果证明了这种方法在减轻精神偏见方面的有效性。

Authors: Songyuan Liu, Ziyang Zhang, Runze Yan, Wei Wu, Carl Yang, Jiaying Lu

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2410.11647v2.pdf

Published: 2024-10-15T14:33:23Z


8. Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training

As large language models (LLMs) become increasingly prevalent, concerns about their reliability, particularly due to hallucinations - factually inaccurate or irrelevant outputs - have grown. Our research investigates the relationship between the uncertainty in training dynamics and the emergence of hallucinations. Using models from the Pythia suite and several hallucination detection metrics, we analyze hallucination trends and identify significant variance during training. To address this, we propose Sensitivity Dropout (SenD), a novel training protocol designed to reduce hallucination variance during training by deterministically dropping embedding indices with significant variability. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at 2x speed. This metric is integrated into our training protocol, allowing SenD to be both computationally scalable and effective at reducing hallucination variance. SenD improves the test-time reliability of Pythia and Meta's Llama models by up to 17% and enhances factual accuracy in Wikipedia, Medical, Legal, and Coding domains without affecting downstream task performance.

中文摘要

随着大型语言模型(LLM)的日益普及,人们对其可靠性的担忧也在增加,尤其是由于幻觉(事实上不准确或无关的输出)。我们的研究调查了训练动态的不确定性与幻觉出现之间的关系。使用Pythia套件中的模型和几个幻觉检测指标,我们分析了幻觉趋势,并发现训练过程中存在显著的方差。为了解决这个问题,我们提出了敏感性丢弃(Sensitivity Dropout, SenD),这是一种新的训练协议,旨在通过确定性地丢弃具有显著变异性的嵌入索引来减少训练过程中的幻觉方差。此外,我们开发了一种无监督的幻觉检测指标,即高效特征分数(EES),它以2倍的速度近似传统的EigenScore。该指标已集成到我们的训练协议中,使SenD在计算上具有可扩展性,并能有效减少幻觉方差。SenD将Pythia和Meta的Llama模型的测试时可靠性提高了多达17%,并在不影响下游任务性能的情况下提高了维基百科、医学、法律和编码领域的事实准确性。

Authors: Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany, Golnoosh Farnadi

Categories: cs.AI, cs.CL, math.SP

PDF URL: https://arxiv.org/pdf/2410.15460v4.pdf

Published: 2024-10-20T18:18:23Z
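A rough sketch of the variance-based dropping idea from the abstract, assuming per-dimension variability is measured across recent training snapshots; the actual SenD protocol and the EES metric are specified in the paper:

```python
# Illustration of deterministically dropping the most variable embedding
# indices; `drop_frac` and the snapshot window are assumptions, not the
# paper's settings.
import numpy as np

def send_mask(embedding_snapshots: np.ndarray, drop_frac: float = 0.01) -> np.ndarray:
    """embedding_snapshots: (n_steps, hidden_dim) values of one embedding row."""
    variability = embedding_snapshots.var(axis=0)
    k = max(1, int(drop_frac * variability.size))
    most_variable = np.argsort(variability)[-k:]   # indices with largest variance
    mask = np.ones(variability.size)
    mask[most_variable] = 0.0                      # deterministically dropped
    return mask

snapshots = np.random.default_rng(1).normal(size=(8, 512))
masked_embedding = snapshots[-1] * send_mask(snapshots)
```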


9. Improving Contextual ASR via Multi-grained Fusion with Large Language Models

While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities. Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases. However, these methods operate at different granularities and have their own limitations. In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that elegantly combines ASR’s acoustic information with LLM’s rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Ablation studies further confirm that the token-level and phrase-level components both contribute significantly to the performance gains, complementing each other in our joint multi-grained framework. The code and models will be publicly available at https://github.com/.

中文摘要

虽然端到端的自动语音识别(ASR)模型在转录一般语音方面表现出了令人印象深刻的性能,但它们往往难以准确识别与上下文相关的关键词,如专有名词或用户特定的实体。之前的方法已经探索了利用文本模态中的关键词词典来改进关键词识别,要么通过指导逐词元生成的词元级融合,要么通过支持直接复制关键词短语的短语级融合。然而,这些方法在不同的粒度上运行,各有其局限性。在本文中,我们提出了一种新颖的多粒度融合方法,联合利用词元级和短语级融合与大型语言模型(LLM)相结合的优势。我们的方法采用了一种后期融合策略,将ASR的声学信息与LLM丰富的上下文知识巧妙结合,在细粒度词元精度和整体短语级理解之间取得平衡。在中文和英文数据集上的实验表明,我们的方法在关键词相关指标上达到了最先进的性能,同时在非关键词文本上保持了很高的准确性。消融研究进一步证实,词元级和短语级组件都对性能提升做出了重大贡献,在我们的联合多粒度框架中相辅相成。代码和模型将在 https://github.com/ 公开。

Authors: Shilin Zhou, Zhenghua Li

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.12252v1.pdf

Published: 2025-07-16T13:59:32Z
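One plausible reading of the two fusion granularities, sketched as a single hypothesis-scoring function; the interpolation weight and keyword bonus below are illustrative assumptions, not the paper's parameters:

```python
# Late-fusion scoring sketch: token-level interpolation of ASR and LLM
# log-probabilities, plus a phrase-level bonus for dictionary keywords.
def fuse(asr_logp: float, llm_logp: float, hypothesis: str,
         keywords: set[str], lam: float = 0.3, bonus: float = 2.0) -> float:
    score = (1 - lam) * asr_logp + lam * llm_logp            # token-level fusion
    score += bonus * sum(1 for k in keywords if k in hypothesis)  # phrase-level boost
    return score

# Usage: rescore n-best ASR hypotheses and keep the highest-scoring one.
best = max([("call Zhenghua now", -4.1, -3.2), ("call Jenna now", -3.9, -3.8)],
           key=lambda h: fuse(h[1], h[2], h[0], {"Zhenghua"}))
```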


10. Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?

Human reasoning involves different strategies, each suited to specific problems. Prior work shows that large language models (LLMs) tend to favor a single reasoning strategy, potentially limiting their effectiveness in diverse reasoning challenges. In this work, we investigate whether prompting can control LLMs' reasoning strategies and assess its impact on logical problem-solving. While our experiments show that no single strategy consistently improves accuracy, performance could be enhanced if models could adaptively choose the optimal strategy. We propose methods to guide LLMs in strategy selection, highlighting new ways to refine their reasoning abilities.

中文摘要

人类的推理涉及不同的策略,每种策略都适合特定的问题。先前的研究表明,大型语言模型(LLMs)倾向于采用单一的推理策略,这可能会限制它们在各种推理挑战中的有效性。在这项工作中,我们研究了提示是否可以控制LLM推理策略,并评估了它对逻辑问题解决的影响。虽然我们的实验表明,没有单一的策略能够始终如一地提高准确性,但如果模型能够自适应地选择最优策略,性能可以得到提高。我们提出了指导LLM进行策略选择的方法,强调了提高其推理能力的新方法。

Authors: Yanjian Zhang, Guillaume Wisniewski, Nadi Tomeh, Thierry Charnois

Categories: cs.CL

PDF URL: https://arxiv.org/pdf/2507.11423v2.pdf

Published: 2025-07-15T15:47:47Z


Agent Domain Papers

1. Dynamic Risk Assessments for Offensive Cybersecurity Agents

Foundation models are increasingly becoming better autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU Hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40% relative to the baseline — without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.

中文摘要

基础模型正日益成为更好的自主程序员,这增加了它们也可以自动化危险的攻击性网络操作的可能性。目前的前沿模型审计调查了此类代理的网络安全风险,但大多数都没有考虑到现实世界中对手可用的自由度。特别是,有了强大的验证者和财务激励,进攻性网络安全代理可以接受潜在对手的迭代改进。我们认为,评估应考虑网络安全背景下的扩展威胁模型,强调对手在固定计算预算内,在有状态和无状态环境中可能拥有的不同自由度。我们发现,即使计算预算相对较小(在我们的研究中为8个H100 GPU小时),对手也可以在没有任何外部帮助的情况下,将代理在InterCode CTF上的网络安全能力提高40%以上。这些结果突显了以动态方式评估代理网络安全风险的必要性,描绘了一幅更具代表性的风险图景。

Authors: Boyi Wei, Benedikt Stroebl, Jiacen Xu, Joie Zhang, Zhou Li, Peter Henderson

Categories: cs.CR, cs.AI

PDF URL: https://arxiv.org/pdf/2505.18384v3.pdf

Published: 2025-05-23T21:18:59Z


2. Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different types supplies missing premises and expands context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.

中文摘要

检索增强生成(RAG)通过注入外部知识来提高大型语言模型(LLM)的真实性,但它在需要多步推理的问题上存在不足;相反,纯粹以推理为导向的方法往往会产生幻觉或误置事实依据。这项综述在统一的推理-检索视角下综合了这两条研究脉络。我们首先梳理了高级推理如何优化RAG的每个阶段(推理增强RAG)。然后,我们展示了检索到的不同类型的知识如何为复杂推理提供缺失的前提并扩展上下文(RAG增强推理)。最后,我们重点介绍新兴的协同RAG-推理框架,其中(代理式)LLM迭代地交织搜索和推理,以在知识密集型基准测试中实现最先进的性能。我们对方法、数据集和开放性挑战进行了分类,并概述了通往更深入的RAG-推理系统的研究途径,这些系统更有效、多模态自适应、值得信赖且以人为本。该资源集可在 https://github.com/DavidZWZ/Awesome-RAG-Reasoning 获取。

Authors: Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.09477v2.pdf

Published: 2025-07-13T03:29:41Z


3. From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents

The concept of the Web of Agents (WoA), which transforms the static, document-centric Web into an environment of autonomous agents acting on users' behalf, has attracted growing interest as large language models (LLMs) become more capable. However, research in this area is still fragmented across different communities. Contemporary surveys catalog the latest LLM-powered frameworks, while the rich histories of Multi-Agent Systems (MAS) and the Semantic Web are often treated as separate, legacy domains. This fragmentation obscures the intellectual lineage of modern systems and hinders a holistic understanding of the field's trajectory. We present the first comprehensive evolutionary overview of the WoA. We show that modern protocols like A2A and MCP are direct evolutionary responses to the well-documented limitations of earlier standards such as FIPA and OWL-based semantic agents. To systematize this analysis, we introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism). This framework provides a unified analytical lens for comparing agent architectures across all generations, revealing a clear line of descent where others have seen a disconnect. Our analysis identifies a paradigm shift in the 'locus of intelligence': from being encoded in external data (Semantic Web) or the platform (MAS) to being embedded within the agent's core model (LLM). This shift is foundational to modern Agentic AI, enabling the scalable and adaptive systems the WoA has long envisioned. We conclude that while new protocols are essential, they are insufficient for building a robust, open, trustworthy ecosystem. Finally, we argue that the next research frontier lies in solving persistent socio-technical challenges, and we map out a new agenda focused on decentralized identity, economic models, security, and governance for the emerging WoA.

中文摘要

代理网络(Web of Agents, WoA)的概念将静态的、以文档为中心的网络转变为由代表用户行事的自主代理构成的环境,随着大型语言模型(LLM)能力的提升,它引起了越来越多的兴趣。然而,这一领域的研究仍然分散在不同的社区。当代综述对最新的LLM驱动框架进行了编目,而多代理系统(MAS)和语义网的丰富历史通常被视为相互独立的遗留领域。这种碎片化掩盖了现代系统的知识谱系,阻碍了对该领域发展轨迹的全面理解。我们首次对WoA进行了全面的演进概述。我们表明,像A2A和MCP这样的现代协议,是对FIPA标准和基于OWL的语义代理等早期标准的有据可查的局限性的直接演进回应。为了使这一分析系统化,我们引入了一个四轴分类法(语义基础、通信范式、智能所在、发现机制)。该框架提供了一个统一的分析视角,用于比较各个世代的代理架构,在他人眼中的断裂之处揭示出一条清晰的传承脉络。我们的分析确定了"智能所在"的范式转变:从编码在外部数据(语义网)或平台(MAS)中,到嵌入在代理的核心模型(LLM)中。这一转变是现代Agentic AI的基础,实现了WoA长期以来设想的可扩展和自适应系统。我们得出结论,虽然新协议必不可少,但它们不足以建立一个强大、开放、值得信赖的生态系统。最后,我们认为下一个研究前沿在于解决持续存在的社会技术挑战,并为新兴的WoA制定了一个新的议程,重点关注去中心化身份、经济模型、安全和治理。

Authors: Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, Radu State

Categories: cs.AI, cs.CL, cs.CR, cs.HC, cs.MA, I.2.11; I.2.7; C.2.4; K.6.5; I.2.4

PDF URL: https://arxiv.org/pdf/2507.10644v2.pdf

Published: 2025-07-14T16:47:19Z


4. Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty

User prompts for generative AI models are often underspecified, leading to a misalignment between the user intent and models’ understanding. As a result, users commonly have to painstakingly refine their prompts. We study this alignment problem in text-to-image (T2I) generation and propose a prototype for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their uncertainty about user intent as an understandable and editable belief graph. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach using two agents, one with a ground truth intent (an image) while the other tries to ask as few questions as possible to align with the ground truth. We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014) and DesignBench, a benchmark we curated with strong artistic and design elements. Experiments over the three datasets demonstrate the proposed T2I agents’ ability to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard T2I generation. Moreover, we conducted human studies and observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow, highlighting the effectiveness of our approach. Code and DesignBench can be found at https://github.com/google-deepmind/proactive_t2i_agents.

中文摘要

生成式人工智能模型的用户提示往往描述不充分,导致用户意图与模型理解之间的不一致。因此,用户通常必须费力地反复改进他们的提示。我们研究了文本到图像(T2I)生成中的这一对齐问题,并提出了一个主动式T2I代理的原型,该原型配备了一个界面,用于(1)在不确定时主动提出澄清问题,以及(2)将其对用户意图的不确定性表示为可理解且可编辑的信念图。我们为这类代理构建了简单的原型,并提出了一种新的可扩展的自动化评估方法:使用两个代理,其中一个持有真实意图(一张图像),另一个则尝试提出尽可能少的问题来与真实意图对齐。我们在三个图像-文本数据集上进行了实验:ImageInWords(Garg等人,2024)、COCO(Lin等人,2014)和DesignBench(我们策划的一个具有强烈艺术和设计元素的基准)。在三个数据集上的实验表明,所提出的T2I代理能够提出信息丰富的问题并获取关键信息,从而实现成功的对齐,其VQAScore(Lin等人,2024)至少比标准T2I生成高出2倍。此外,我们进行了人类研究,观察到至少90%的受试者认为这些代理及其信念图对他们的T2I工作流程有帮助,突显了我们方法的有效性。代码和DesignBench可以在以下网址找到:https://github.com/google-deepmind/proactive_t2i_agents。

Authors: Meera Hahn, Wenjun Zeng, Nithish Kannen, Rich Galt, Kartikeya Badola, Been Kim, Zi Wang

Categories: cs.AI, cs.CV, cs.LG

PDF URL: https://arxiv.org/pdf/2412.06771v2.pdf

Published: 2024-12-09T18:56:32Z
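The two-agent evaluation loop described in the abstract might look like the following sketch; `llm` is a hypothetical model call and the prompts are invented for illustration:

```python
# Two-agent alignment sketch: one agent asks clarifying questions, the other
# answers from a ground-truth image description.

def llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real backend."""
    raise NotImplementedError

def align_with_ground_truth(gt_description: str, max_questions: int = 5) -> str:
    belief = ""
    for _ in range(max_questions):
        q = llm(f"Current belief about the target image: {belief}\n"
                "Ask one clarifying question, or reply DONE if confident.")
        if "DONE" in q:
            break
        a = llm(f"Ground-truth image description: {gt_description}\n"
                f"Answer briefly: {q}")
        belief += f"\nQ: {q}\nA: {a}"
    # The fewer questions needed before DONE, the better the asking agent.
    return llm(f"Write a final T2I prompt consistent with:\n{belief}")
```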


5. Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes

For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources rely on modular, rule-based systems or LLMs with instruction tuning and constrained decoding. Since they frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions.

中文摘要

对于临床数据集成和医疗保健服务,HL7 FHIR标准已确立为实现复杂健康数据之间互操作性的理想格式。之前将自由形式的临床记录自动转换为结构化FHIR资源的尝试,依赖于模块化、基于规则的系统,或采用指令微调和约束解码的LLM。由于它们经常存在泛化能力有限和结构不符合规范的问题,我们提出了一个由LLM代理、代码执行和医疗术语数据库工具驱动的端到端框架来解决这些问题。我们的解决方案名为Infherno,旨在遵守FHIR文档模式,并在从非结构化文本预测FHIR资源方面与人类基线相比具有竞争力。该实现提供了一个支持自定义和合成数据以及本地和专有模型的前端,支持临床数据集成流程和跨机构的互操作性。

Authors: Johann Frei, Nils Feldhus, Lisa Raithel, Roland Roller, Alexander Meyer, Frank Kramer

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.12261v1.pdf

Published: 2025-07-16T14:06:51Z


6. Truth Sleuth and Trend Bender: AI Agents to fact-check YouTube videos and influence opinions

Misinformation poses a significant threat in today's digital world, often spreading rapidly through platforms like YouTube. This paper introduces a novel approach to combating misinformation by developing an AI-powered system that not only fact-checks claims made in YouTube videos but also actively engages users in the comment section and challenges misleading narratives. Our system comprises two main agents: Truth Sleuth and Trend Bender. Truth Sleuth extracts claims from a YouTube video, uses a Retrieval-Augmented Generation (RAG) approach - drawing on sources like Wikipedia, Google Search, Google FactCheck - to accurately assess their veracity and generates a nuanced and comprehensive report. Through rigorous prompt engineering, Trend Bender leverages this report along with a curated corpus of relevant articles to generate insightful and persuasive comments designed to stimulate a productive debate. With a carefully set up self-evaluation loop, this agent is able to iteratively improve its style and refine its output. We demonstrate the system's capabilities through experiments on established benchmark datasets and a real-world deployment on YouTube, showcasing its potential to engage users and potentially influence perspectives. Our findings highlight the high accuracy of our fact-checking agent, and confirm the potential of AI-driven interventions in combating misinformation and fostering a more informed online space.

中文摘要

在当今的数字世界中,虚假信息构成了重大威胁,往往通过YouTube等平台迅速传播。本文介绍了一种打击虚假信息的新方法,即开发一个人工智能驱动的系统,该系统不仅可以对YouTube视频中的声明进行事实核查,还可以在评论区积极与用户互动并挑战误导性叙述。我们的系统由两个主要代理组成:Truth Sleuth和Trend Bender。Truth Sleuth从YouTube视频中提取声明,使用检索增强生成(RAG)方法,利用维基百科、谷歌搜索、Google FactCheck等来源,准确评估其真实性,并生成一份细致入微、全面的报告。通过严格的提示工程,Trend Bender利用这份报告以及精心策划的相关文章语料库,生成富有洞察力和说服力的评论,旨在激发富有成效的辩论。通过精心设置的自我评估循环,该代理能够迭代地改进其风格并完善其输出。我们通过在既定基准数据集上的实验和在YouTube上的真实部署来展示该系统的能力,展示了其吸引用户并潜在影响观点的潜力。我们的研究结果突显了我们的事实核查代理的高准确性,并证实了人工智能驱动的干预措施在打击虚假信息和营造更知情的网络空间方面的潜力。

Authors: Cécile Logé, Rehan Ghori

Categories: cs.CL, cs.AI, cs.CY

PDF URL: https://arxiv.org/pdf/2507.10577v2.pdf

Published: 2025-07-11T10:08:05Z


7. A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM conditions on; and (2) Output level, which covers methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. …

中文摘要

推理是一种基本的认知过程,能够进行逻辑推断、问题解决和决策。随着大型语言模型(LLM)的快速发展,推理已成为区分高级人工智能系统与驱动聊天机器人的传统模型的关键能力。在这项综述中,我们沿着两个正交维度对现有方法进行了分类:(1)机制(Regimes),定义了实现推理的阶段(在推理时或通过专门的训练);(2)架构(Architectures),决定了推理过程中涉及的组件,区分独立的LLM与包含外部工具的代理式复合系统,以及多代理协作。在每个维度中,我们分析了两个关键视角:(1)输入层面,侧重于构建LLM所依赖的高质量提示的技术;(2)输出层面,即细化多个采样候选以提高推理质量的方法。这种分类提供了对LLM推理发展格局的系统理解,突出了新兴趋势,如从推理扩展到学习推理的转变(例如DeepSeek-R1),以及向代理式工作流的过渡(例如OpenAI Deep Research、Manus Agent)。此外,我们还涵盖了广泛的学习算法,从监督微调到PPO和GRPO等强化学习,以及推理器和验证器的训练。我们还研究了代理式工作流的关键设计,从生成器-评估器和LLM辩论等既定模式到最近的创新。…

Authors: Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, Shafiq Joty

Categories: cs.AI, cs.CL

PDF URL: https://arxiv.org/pdf/2504.09037v2.pdf

Published: 2025-04-12T01:27:49Z


8. Robust Planning for Autonomous Vehicles with Diffusion-Based Failure Samplers

High-risk traffic zones such as intersections are a major cause of collisions. This study leverages deep generative models to enhance the safety of autonomous vehicles in an intersection context. We train a 1000-step denoising diffusion probabilistic model to generate collision-causing sensor noise sequences for an autonomous vehicle navigating a four-way intersection based on the current relative position and velocity of an intruder. Using the generative adversarial architecture, the 1000-step model is distilled into a single-step denoising diffusion model which demonstrates fast inference speed while maintaining similar sampling quality. We demonstrate one possible application of the single-step model in building a robust planner for the autonomous vehicle. The planner uses the single-step model to efficiently sample potential failure cases based on the currently measured traffic state to inform its decision-making. Through simulation experiments, the robust planner demonstrates significantly lower failure rate and delay rate compared with the baseline Intelligent Driver Model controller.

中文摘要

交叉口等高风险交通区域是碰撞的主要原因。这项研究利用深度生成模型来提高十字路口环境中自动驾驶汽车的安全性。我们训练了一个1000步去噪扩散概率模型,根据入侵者的当前相对位置和速度,为在四向交叉口行驶的自动驾驶汽车生成碰撞引起的传感器噪声序列。使用生成对抗架构,将1000步模型提取为单步去噪扩散模型,该模型在保持相似采样质量的同时具有快速的推理速度。我们展示了单步模型在为自动驾驶汽车构建鲁棒规划器中的一种可能应用。规划者使用单步模型根据当前测量的交通状态有效地对潜在故障案例进行采样,以告知其决策。通过仿真实验,与基线智能驾驶员模型控制器相比,鲁棒规划器的故障率和延迟率显著降低。

Authors: Juanran Wang, Marc R. Schlichting, Mykel J. Kochenderfer

Categories: cs.RO, cs.AI

PDF URL: https://arxiv.org/pdf/2507.11991v1.pdf

Published: 2025-07-16T07:43:55Z


NLP Domain Papers

1. NLP Meets the World: Toward Improving Conversations With the Public About Natural Language Processing Research

Recent developments in large language models (LLMs) have been accompanied by rapidly growing public interest in natural language processing (NLP). This attention is reflected by major news venues, which sometimes invite NLP researchers to share their knowledge and views with a wide audience. Recognizing the opportunities of the present, for both the research field and for individual researchers, this paper shares recommendations for communicating with a general audience about the capabilities and limitations of NLP. These recommendations cover three themes: vague terminology as an obstacle to public understanding, unreasonable expectations as obstacles to sustainable growth, and ethical failures as obstacles to continued support. Published NLP research and popular news coverage are cited to illustrate these themes with examples. The recommendations promote effective, transparent communication with the general public about NLP, in order to strengthen public understanding and encourage support for research.

中文摘要

随着大型语言模型(LLM)的最新发展,公众对自然语言处理(NLP)的兴趣也在迅速增长。这种关注反映在主要的新闻场所,有时会邀请NLP研究人员与广大受众分享他们的知识和观点。认识到当前对研究领域和个体研究人员的机遇,本文分享了与普通受众就NLP的能力和局限性进行沟通的建议。这些建议涵盖三个主题:模糊的术语是公众理解的障碍,不合理的期望是可持续增长的障碍,道德失败是持续支持的障碍。引用已发表的NLP研究和流行的新闻报道来举例说明这些主题。这些建议促进了与公众就NLP进行有效、透明的沟通,以加强公众的理解,鼓励对研究的支持。

Authors: Shomir Wilson

Categories: cs.CY, cs.AI, cs.CL

PDF URL: https://arxiv.org/pdf/2507.10559v2.pdf

Published: 2025-07-02T15:50:09Z


2. Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024, respectively.

中文摘要

具有数值反馈(如标量奖励)的强化学习(RL)的最新进展显著增强了大型语言模型(LLM)的复杂推理能力。尽管取得了这一成功,但我们发现了仅使用数值反馈的强化学习面临的三个关键挑战:性能平台期、自我反思的有效性有限以及持续性失败。然后,我们证明,即使在出现性能平台期之后,经强化学习微调的模型也可以通过利用批评形式的自然语言反馈,对持续失败的问题生成正确的改进。基于这一见解,我们提出了Critique-GRPO,这是一个集成自然语言和数值反馈以进行有效策略优化的在线强化学习框架。Critique-GRPO使LLM能够在保持探索的同时,同时从初始回答和批评引导的自我改进中学习。此外,我们使用一个整形函数来放大从正确的(尤其是不熟悉的)改进中的学习,并惩罚不正确的改进。对Qwen2.5-7B-Base、Qwen2.5-Math-7B-Base和Qwen3-8B的大量实验表明,Critique-GRPO在八项具有挑战性的数学、STEM和一般推理任务中始终优于监督学习和基于RL的微调方法,在Qwen2.5-7B-Base和Qwen3-8B上分别将平均pass@1分数提高了约4.4%和3.8%。值得注意的是,Critique-GRPO通过自我批评和弱到强泛化实现了有效的自我提升,相比GRPO取得了持续的收益,例如在AIME 2024上pass@1分别提升了16.7%和10.0%。

Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2506.03106v3.pdf

Published: 2025-06-03T17:39:02Z
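A toy version of the shaping idea the abstract mentions, boosting correct but unfamiliar refinements and penalizing incorrect ones; the actual shaping function used by Critique-GRPO is defined in the paper:

```python
# Reward-shaping sketch: amplify the learning signal for correct refinements
# the model found unlikely (low log-probability, i.e. "unfamiliar"), and
# penalize incorrect refinements. Weights here are illustrative assumptions.
import math

def shaped_weight(correct: bool, seq_logprob: float) -> float:
    familiarity = math.exp(seq_logprob)     # in (0, 1]; low means unfamiliar
    if correct:
        return 1.0 + (1.0 - familiarity)    # boost unfamiliar successes
    return -1.0                             # penalize incorrect refinements

print(shaped_weight(True, -5.0))   # unfamiliar success -> weight near 2.0
print(shaped_weight(True, -0.1))   # familiar success  -> weight near 1.1
```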


3. Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of dynamical system computational models, starting from a corpus of documents relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses System Modeling Language (SysML) diagrams to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the automated SysML diagram generation, such as: the list of key nouns; the list of extracted relationships; the list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems are then obtained from the SysML diagrams via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. Its applicability is shown via an end-to-end example, from text to model, of a simple pendulum, showing improved performance compared to results yielded by LLMs alone.

中文摘要

本文通过提出一种利用领域和专家知识的策略,从与目标动态系统相关的文档语料库和描述特定系统的输入文档出发,自动生成动态系统计算模型,从而有助于加快工程动态系统的设计和部署。该策略分五个步骤实施,至关重要的是,它使用系统建模语言(SysML)图来提取有关组件的依赖关系、属性和操作的准确信息。在特定任务中采用自然语言处理(NLP)策略和大型语言模型(LLM)来改进SysML图自动生成的中间输出,例如:关键名词列表;提取的关系列表;关键短语和关键关系列表;块属性值;块关系;以及BDD图生成。通过不同的案例研究说明了自动SysML图生成的适用性。然后通过代码生成和计算模型生成步骤,从SysML图中获得复杂动力系统的计算模型。在代码生成步骤中,NLP策略用于摘要,而LLM仅用于验证。所提出的方法不限于特定的系统、领域或计算软件。通过一个从文本到模型的单摆端到端示例展示了该方法的适用性,与仅使用LLM得出的结果相比,性能有所提高。

Authors: Matthew Anderson Hendricks, Alice Cicirello

Categories: cs.CL, cs.AI, cs.CE

PDF URL: https://arxiv.org/pdf/2507.06803v2.pdf

Published: 2025-07-09T12:44:49Z


4. A Taxonomy for Design and Evaluation of Prompt-Based Natural Language Explanations

Effective AI governance requires structured approaches for stakeholders to access and verify AI system behavior. With the rise of large language models, Natural Language Explanations (NLEs) are now key to articulating model behavior, which necessitates a focused examination of their characteristics and governance implications. We draw on Explainable AI (XAI) literature to create an updated XAI taxonomy, adapted to prompt-based NLEs, across three dimensions: (1) Context, including task, data, audience, and goals; (2) Generation and Presentation, covering generation methods, inputs, interactivity, outputs, and forms; and (3) Evaluation, focusing on content, presentation, and user-centered properties, as well as the setting of the evaluation. This taxonomy provides a framework for researchers, auditors, and policymakers to characterize, design, and enhance NLEs for transparent AI systems.

中文摘要

有效的人工智能治理需要为利益相关者提供结构化的方法来访问和验证人工智能系统的行为。随着大型语言模型的兴起,自然语言解释(NLE)如今已成为阐明模型行为的关键,这需要对其特征和治理影响进行重点考察。我们借鉴可解释人工智能(XAI)文献,创建了一个适用于基于提示的NLE的更新版XAI分类法,涵盖三个维度:(1)上下文,包括任务、数据、受众和目标;(2)生成与呈现,涵盖生成方法、输入、交互性、输出和形式;(3)评估,侧重于内容、呈现和以用户为中心的属性,以及评估的设置。该分类法为研究人员、审计人员和政策制定者提供了一个框架,用于为透明的人工智能系统刻画、设计和增强NLE。

Authors: Isar Nejadgholi, Mona Omidyeganeh, Marc-Antoine Drouin, Jonathan Boisvert

Categories: cs.CL, cs.AI

PDF URL: https://arxiv.org/pdf/2507.10585v1.pdf

Published: 2025-07-11T12:52:19Z


5. State-Inference-Based Prompting for Natural Language Trading with Game NPCs

Large Language Models enable dynamic game interactions but struggle with rule-governed trading systems. Current implementations suffer from rule violations, such as item hallucinations and calculation errors, that erode player trust. Here, State-Inference-Based Prompting (SIBP) enables reliable trading through autonomous dialogue state inference and context-specific rule adherence. The approach decomposes trading into six states within a unified prompt framework, implementing context-aware item referencing and placeholder-based price calculations. Evaluation across 100 trading dialogues demonstrates >97% state compliance, >95% referencing accuracy, and 99.7% calculation precision. SIBP maintains computational efficiency while outperforming baseline approaches, establishing a practical foundation for trustworthy NPC interactions in commercial games.

中文摘要

大型语言模型支持动态的游戏交互,但在受规则约束的交易系统中表现不佳。当前的实现存在违反规则的问题,例如物品幻觉和计算错误,这会削弱玩家的信任。在这里,基于状态推理的提示(SIBP)通过自主的对话状态推理和特定于上下文的规则遵守来实现可靠的交易。该方法在统一的提示框架内将交易分解为六个状态,实现了上下文感知的物品引用和基于占位符的价格计算。对100个交易对话的评估表明,状态合规性>97%,引用准确率>95%,计算精度达99.7%。SIBP在保持计算效率的同时优于基线方法,为商业游戏中值得信赖的NPC交互奠定了实用基础。

Authors: Minkyung Kim, Junsik Kim, Hwidong Bae, Woongcheol Yang, Sangdon Park, Sohee Bae

Categories: cs.AI

PDF URL: https://arxiv.org/pdf/2507.07203v1.pdf

Published: 2025-07-09T18:24:47Z
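The six-state decomposition and placeholder-based price calculation could be sketched as follows; the state names and the template mechanism here are invented for illustration, not taken from the paper:

```python
# SIBP-style sketch: an explicit dialogue state machine plus placeholder-based
# arithmetic, so the engine (not the LLM) computes prices.
from enum import Enum, auto

class TradeState(Enum):
    GREETING = auto()
    ITEM_INQUIRY = auto()
    PRICE_QUOTE = auto()
    NEGOTIATION = auto()
    CONFIRMATION = auto()
    CLOSING = auto()

def quote_price(unit_price: int, qty: int) -> str:
    # The LLM would emit the placeholder expression "{unit_price} * {qty}";
    # the game engine substitutes real numbers and performs the arithmetic.
    template = "{unit_price} * {qty}"
    total = unit_price * qty
    return f"That will be {template.format(unit_price=unit_price, qty=qty)} = {total} gold."

print(quote_price(5, 3))  # "That will be 5 * 3 = 15 gold."
```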


ArXiv Papers 趋势报告

生成时间: 2025-07-17 23:12:37

数据总量: 17 条

1. Aime: Towards Fully-Autonomous Multi-Agent Framework

Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) are emerging as a powerful paradigm for solving complex, multifaceted problems. However, the potential of these systems is often constrained by the prevalent plan-and-execute framework, which suffers from critical limitations: rigid plan execution, static agent capabilities, and inefficient communication. These weaknesses hinder their adaptability and robustness in dynamic environments. This paper introduces Aime, a novel multi-agent framework designed to overcome these challenges through dynamic, reactive planning and execution. Aime replaces the conventional static workflow with a fluid and adaptive architecture. Its core innovations include: (1) a Dynamic Planner that continuously refines the overall strategy based on real-time execution feedback; (2) an Actor Factory that implements Dynamic Actor instantiation, assembling specialized agents on-demand with tailored tools and knowledge; and (3) a centralized Progress Management Module that serves as a single source of truth for coherent, system-wide state awareness. We empirically evaluated Aime on a diverse suite of benchmarks spanning general reasoning (GAIA), software engineering (SWE-bench Verified), and live web navigation (WebVoyager). The results demonstrate that Aime consistently outperforms even highly specialized state-of-the-art agents in their respective domains. Its superior adaptability and task success rate establish Aime as a more resilient and effective foundation for multi-agent collaboration.

🔗 查看详情

元数据: {"arxivId":"2507.11988v1","authors":"Yexuan Shi, Mingyu Wang, Yunxiang Cao, Hongjie Lai, Junjian Lan, Xin Han, Yu Wang, Jie Geng, Zhenan Li, Zihao Xia, Xiang Chen, Chen Li, Jian Xu, Wenbo Duan, Yuanshuo Zhu","categories":"cs.AI","published":"2025-07-16T07:38:28Z","pdfUrl":"https://arxiv.org/pdf/2507.11988v1.pdf","abstractUrl":"https://arxiv.org/abs/2507.11988v1","rank":9}


2. macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only, covering web-use or Windows, Linux, and Android environments, but not macOS. macOS is a major OS with distinctive GUI patterns and exclusive applications. To bridge the gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation on six GUI agents reveals a dramatic gap: proprietary computer-use agents lead at above 30% success rate, while open-source lightweight research models lag at below 5%, highlighting the need for macOS domain adaptation. Multilingual benchmarks also expose common weaknesses, especially in Arabic, with a 28.8% average degradation compared to English. Results from safety benchmarking also highlight that deception attacks are more general and demand immediate attention. macOSWorld is available at https://github.com/showlab/macosworld.

🔗 查看详情

元数据: {"arxivId":"2506.04135v3","authors":"Pei Yang, Hai Ci, Mike Zheng Shou","categories":"cs.AI","published":"2025-06-04T16:26:56Z","pdfUrl":"https://arxiv.org/pdf/2506.04135v3.pdf","abstractUrl":"https://arxiv.org/abs/2506.04135v3","rank":10}


3. A quantum semantic framework for natural language processing

Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. In this work, we argue this property imposes fundamental limitations on Large Language Models (LLMs) and other modern NLP systems, precisely because they operate within natural language itself. Using Kolmogorov complexity, we demonstrate that as an expression’s complexity grows, the amount of contextual information required to reliably resolve its ambiguity explodes combinatorially. The computational intractability of recovering a single intended meaning for complex or ambiguous text therefore suggests that the classical view that linguistic forms possess intrinsic meaning in and of themselves is conceptually inadequate. We argue instead that meaning is dynamically actualized through an observer-dependent interpretive act, a process whose non-deterministic nature is most appropriately described by a non-classical, quantum-like logic. To test this hypothesis, we conducted a semantic Bell inequality test using diverse LLM agents. Our experiments yielded average CHSH expectation values from 1.2 to 2.8, with several runs producing values (e.g., 2.3-2.4) in significant violation of the classical boundary ($|S|\leq2$), demonstrating that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.

🔗 查看详情

元数据: {"arxivId":"2506.10077v2","authors":"Christopher J. Agostino, Quan Le Thien, Molly Apsel, Denizhan Pak, Elina Lesyk, Ashabari Majumdar","categories":"cs.CL, cs.AI, cs.IR, cs.IT, math.IT","published":"2025-06-11T18:00:30Z","pdfUrl":"https://arxiv.org/pdf/2506.10077v2.pdf","abstractUrl":"https://arxiv.org/abs/2506.10077v2","rank":4}
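For reference, the CHSH statistic the abstract cites, in one common sign convention; the correlation values below are made up for illustration:

```python
# CHSH value from four correlation estimates E(a,b); any classical (local,
# non-contextual) model satisfies |S| <= 2, so |S| > 2 signals contextuality.
def chsh(e_ab: float, e_abp: float, e_apb: float, e_apbp: float) -> float:
    return e_ab + e_abp + e_apb - e_apbp

s = chsh(0.7, 0.65, 0.6, -0.45)   # |S| = 2.4, in the range the paper reports
print(abs(s), "violates classical bound" if abs(s) > 2 else "within bound")
```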


4. Natural Language-based Assessment of L2 Oral Proficiency using LLMs

Natural language-based assessment (NLA) is an approach to second language assessment that uses instructions - expressed in the form of can-do descriptors - originally intended for human examiners, aiming to determine whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available S&I Corpus in a zero-shot setting. Our results show that this approach - relying solely on textual information - achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.

🔗 查看详情

元数据: {"arxivId":"2507.10200v1","authors":"Stefano Bannò, Rao Ma, Mengjie Qian, Siyuan Tang, Kate Knill, Mark Gales","categories":"eess.AS, cs.AI, cs.CL","published":"2025-07-14T12:13:50Z","pdfUrl":"https://arxiv.org/pdf/2507.10200v1.pdf","abstractUrl":"https://arxiv.org/abs/2507.10200v1","rank":5}


5. REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives

This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user “steering” queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset’s quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.

🔗 查看详情

元数据: {"arxivId":"2503.11924v2","authors":"Kun Su, Krishna Sayana, Hubert Pham, James Pine, Yuri Vasilevski, Raghavendra Vasudeva, Marialena Kyriakidi, Liam Hebert, Ambarish Jash, Anushya Subbiah, Sukhdeep Sodhi","categories":"cs.CL, cs.AI, cs.IR, cs.LG","published":"2025-03-14T23:47:46Z","pdfUrl":"https://arxiv.org/pdf/2503.11924v2.pdf","abstractUrl":"https://arxiv.org/abs/2503.11924v2","rank":7}


6. Adaptive Elicitation of Latent Information Using Natural Language

Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.


Metadata: {"arxivId":"2504.04204v2","authors":"Jimmy Wang, Thomas Zollo, Richard Zemel, Hongseok Namkoong","categories":"cs.CL, cs.AI, cs.LG","published":"2025-04-05T15:18:55Z","pdfUrl":"https://arxiv.org/pdf/2504.04204v2.pdf","abstractUrl":"https://arxiv.org/abs/2504.04204v2","rank":9}


7. AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework

Answering natural language (NL) questions about tables, known as Tabular Question Answering (TQA), is crucial because it allows users to quickly and efficiently extract meaningful insights from structured data, effectively bridging the gap between human language and machine-readable formats. Many of these tables are derived from web sources or real-world scenarios, which require meticulous data preparation (or data prep) to ensure accurate responses. However, preparing such tables for NL questions introduces new requirements that extend beyond traditional data preparation. This question-aware data preparation involves specific tasks such as column derivation and filtering tailored to particular questions, as well as question-aware value normalization or conversion, highlighting the need for a more nuanced approach in this context. Because each of the above tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AutoPrep, a large language model (LLM)-based multi-agent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AutoPrep performs data prep through three key components. Planner: Determines a logical plan, outlining a sequence of high-level operations. Programmer: Translates this logical plan into a physical plan by generating the corresponding low-level code. Executor: Executes the generated code to process the table. To support this multi-agent framework, we design a novel Chain-of-Clauses reasoning mechanism for high-level operation suggestion, and a tool-augmented method for low-level code generation.
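
A minimal sketch of the Planner/Programmer/Executor division of labor follows, with the two LLM agents replaced by hard-coded stubs; the plan text and generated code are illustrative assumptions, not AutoPrep's actual prompts or outputs.

```python
import pandas as pd

def planner(question: str, table: pd.DataFrame) -> list[str]:
    # Stub: a real planner would emit a logical plan via the
    # Chain-of-Clauses reasoning mechanism.
    return ["normalize column 'price' to float", "filter rows where price < 10"]

def programmer(plan: list[str]) -> str:
    # Stub: a real programmer agent would generate this code with
    # tool-augmented LLM prompting, conditioned on the plan.
    return (
        "df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)\n"
        "df = df[df['price'] < 10]"
    )

def executor(code: str, df: pd.DataFrame) -> pd.DataFrame:
    scope = {"df": df.copy()}
    exec(code, {}, scope)  # execute generated data-prep code on the table
    return scope["df"]

table = pd.DataFrame({"item": ["pen", "lamp"], "price": ["$2.50", "$19.99"]})
plan = planner("Which items cost less than $10?", table)
prepared = executor(programmer(plan), table)
print(prepared)  # only the rows relevant to the question remain
```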


Metadata: {"arxivId":"2412.10422v4","authors":"Meihao Fan, Ju Fan, Nan Tang, Lei Cao, Guoliang Li, Xiaoyong Du","categories":"cs.CL, cs.AI","published":"2024-12-10T11:03:49Z","pdfUrl":"https://arxiv.org/pdf/2412.10422v4.pdf","abstractUrl":"https://arxiv.org/abs/2412.10422v4","rank":10}


8. Language Models Improve When Pretraining Data Matches Target Tasks

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores this sample by similarity to benchmarks, then trains a lightweight classifier to predict these scores for the full corpus. We compare data selection methods by training over 500 models spanning $10^{19}$ to $10^{22}$ FLOPs and fitting scaling laws to them. From this, we find that simply aligning pretraining data to evaluation benchmarks using BETR achieves a 2.1x compute multiplier over DCLM-Baseline (4.7x over unfiltered data) and improves performance on 9 out of 10 tasks across all scales. BETR also generalizes well: when targeting a diverse set of benchmarks disjoint from our evaluation suite, it still matches or outperforms baselines. Our scaling analysis further reveals a clear trend: larger models require less aggressive filtering. Overall, our findings show that directly matching pretraining data to target tasks precisely shapes model capabilities and highlight that optimal selection strategies must adapt to model scale.
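
The recipe is simple enough to sketch end to end. The version below substitutes TF-IDF for the paper's learned embedding space and a ridge regressor for the lightweight classifier, so it is a toy illustration of the idea rather than the actual BETR pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

benchmark_examples = [
    "What is the capital of France?",
    "Solve for x: 2x + 3 = 7",
]
pretrain_sample = [
    "Paris has been the capital of France since the Middle Ages.",
    "Recipe: whisk two eggs with a pinch of salt.",
    "To solve a linear equation, isolate the variable on one side.",
]

vec = TfidfVectorizer().fit(benchmark_examples + pretrain_sample)
B = vec.transform(benchmark_examples).toarray()
D = vec.transform(pretrain_sample).toarray()

# Score each sampled document by its max cosine similarity to any
# benchmark example (one simple choice of similarity aggregation).
def cos_sim(a, b):
    return a @ b.T / (np.linalg.norm(a, axis=1, keepdims=True)
                      * np.linalg.norm(b, axis=1) + 1e-9)

scores = cos_sim(D, B).max(axis=1)

# Train a lightweight model on (document, score) pairs so scoring can be
# applied cheaply to the full corpus without comparing every document
# against every benchmark example.
clf = Ridge(alpha=1.0).fit(D, scores)

full_corpus = ["The Seine flows through Paris.", "My cat sleeps all day."]
pred = clf.predict(vec.transform(full_corpus).toarray())
keep = [doc for doc, s in zip(full_corpus, pred) if s > pred.mean()]
print(keep)  # documents most aligned with the benchmarks are retained
```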


Metadata: {"arxivId":"2507.12466v1","authors":"David Mizrahi, Anders Boesen Lindbo Larsen, Jesse Allardice, Suzie Petryk, Yuri Gorokhov, Jeffrey Li, Alex Fang, Josh Gardner, Tom Gunter, Afshin Dehghan","categories":"cs.CL, cs.LG","published":"2025-07-16T17:59:45Z","pdfUrl":"https://arxiv.org/pdf/2507.12466v1.pdf","abstractUrl":"https://arxiv.org/abs/2507.12466v1","rank":1}


9. CytoSAE: Interpretable Cell Embeddings for Hematology

Sparse autoencoders (SAEs) emerged as a promising tool for mechanistic interpretability of transformer-based foundation models. Very recently, SAEs were also adopted for the visual domain, enabling the discovery of visual concepts and their patch-wise attribution to tokens in the transformer model. While a growing number of foundation models emerged for medical imaging, tools for explaining their inferences are still lacking. In this work, we show the applicability of SAEs for hematology. We propose CytoSAE, a sparse autoencoder which is trained on over 40,000 peripheral blood single-cell images. CytoSAE generalizes to diverse and out-of-domain datasets, including bone marrow cytology, where it identifies morphologically relevant concepts which we validated with medical experts. Furthermore, we demonstrate scenarios in which CytoSAE can generate patient-specific and disease-specific concepts, enabling the detection of pathognomonic cells and localized cellular abnormalities at the patch level. We quantified the effect of concepts on a patient-level AML subtype classification task and show that CytoSAE concepts reach performance comparable to the state-of-the-art, while offering explainability on the sub-cellular level. Source code and model weights are available at https://github.com/dynamical-inference/cytosae.
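
The underlying mechanism, an overcomplete autoencoder with a sparsity penalty trained on frozen foundation-model embeddings, can be sketched in a few lines of PyTorch; dimensions and hyperparameters below are illustrative, not CytoSAE's actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_hidden=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse concept activations
        x_hat = self.decoder(z)           # reconstruction of the embedding
        return x_hat, z

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

# Stand-in for a batch of single-cell patch embeddings from a frozen
# vision foundation model.
x = torch.randn(32, 768)

opt.zero_grad()
x_hat, z = sae(x)
loss = ((x_hat - x) ** 2).mean() + l1_coeff * z.abs().mean()
loss.backward()
opt.step()

# After training, each hidden unit of z can be inspected as a candidate
# morphological concept by looking at the image patches that most
# strongly activate it.
print(float(loss))
```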


Metadata: {"arxivId":"2507.12464v1","authors":"Muhammed Furkan Dasdelen, Hyesu Lim, Michele Buck, Katharina S. Götze, Carsten Marr, Steffen Schneider","categories":"cs.CV, cs.LG, q-bio.QM","published":"2025-07-16T17:59:32Z","pdfUrl":"https://arxiv.org/pdf/2507.12464v1.pdf","abstractUrl":"https://arxiv.org/abs/2507.12464v1","rank":2}


10. Diffused Responsibility: Analyzing the Energy Consumption of Generative Text-to-Audio Diffusion Models

Text-to-audio models have recently emerged as a powerful technology for generating sound from textual descriptions. However, their high computational demands raise concerns about energy consumption and environmental impact. In this paper, we conduct an analysis of the energy usage of 7 state-of-the-art text-to-audio diffusion-based generative models, evaluating to what extent variations in generation parameters affect energy consumption at inference time. We also aim to identify an optimal balance between audio quality and energy consumption by considering Pareto-optimal solutions across all selected models. Our findings provide insights into the trade-offs between performance and environmental impact, contributing to the development of more efficient generative audio models.
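
Identifying the Pareto-optimal quality/energy trade-offs reduces to a standard dominance filter, sketched below on made-up measurements; the numbers are purely illustrative.

```python
# Higher quality and lower energy are better.
runs = [
    {"model": "A", "steps": 50,  "quality": 0.78, "energy_wh": 3.1},
    {"model": "A", "steps": 25,  "quality": 0.74, "energy_wh": 1.6},
    {"model": "B", "steps": 100, "quality": 0.81, "energy_wh": 6.4},
    {"model": "C", "steps": 50,  "quality": 0.70, "energy_wh": 2.9},
]

def is_dominated(r, others):
    # r is dominated if some other run is at least as good on both axes
    # and strictly better on one.
    return any(
        o["quality"] >= r["quality"] and o["energy_wh"] <= r["energy_wh"]
        and (o["quality"] > r["quality"] or o["energy_wh"] < r["energy_wh"])
        for o in others
    )

pareto = [r for r in runs if not is_dominated(r, runs)]
for r in sorted(pareto, key=lambda r: r["energy_wh"]):
    print(r)
# Run C is dominated by A at 25 steps (higher quality, lower energy) and
# drops out; the survivors trace the quality/energy trade-off curve.
```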


Metadata: {"arxivId":"2505.07615v2","authors":"Riccardo Passoni, Francesca Ronchini, Luca Comanducci, Romain Serizel, Fabio Antonacci","categories":"eess.AS, cs.AI, cs.LG, cs.SD","published":"2025-05-12T14:36:47Z","pdfUrl":"https://arxiv.org/pdf/2505.07615v2.pdf","abstractUrl":"https://arxiv.org/abs/2505.07615v2","rank":3}


11. Interpreting Radiologist’s Intention from Eye Movements in Chest X-ray Diagnosis

Radiologists rely on eye movements to navigate and interpret medical images. A trained radiologist possesses knowledge about the potential diseases that may be present in the images and, when searching, follows a mental checklist to locate them using their gaze. This is a key observation, yet existing models fail to capture the underlying intent behind each fixation. In this paper, we introduce a deep learning-based approach, RadGazeIntent, designed to model this behavior: having an intention to find something and actively searching for it. Our transformer-based architecture processes both the temporal and spatial dimensions of gaze data, transforming fine-grained fixation features into coarse, meaningful representations of diagnostic intent to interpret radiologists’ goals. To capture the nuances of radiologists’ varied intention-driven behaviors, we process existing medical eye-tracking datasets to create three intention-labeled subsets: RadSeq (Systematic Sequential Search), RadExplore (Uncertainty-driven Exploration), and RadHybrid (Hybrid Pattern). Experimental results demonstrate RadGazeIntent’s ability to predict which findings radiologists are examining at specific moments, outperforming baseline methods across all intention-labeled datasets.
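
The modeling idea, a transformer over a temporal sequence of fixation features that predicts an intent label per fixation, can be sketched as follows; feature and label dimensions are assumptions for illustration, not RadGazeIntent's actual architecture.

```python
import torch
import torch.nn as nn

class GazeIntentModel(nn.Module):
    def __init__(self, n_features=4, d_model=64, n_intents=5):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_intents)

    def forward(self, fixations):          # (batch, seq_len, n_features)
        h = self.encoder(self.embed(fixations))
        return self.head(h)                # per-fixation intent logits

model = GazeIntentModel()
# Batch of 2 scanpaths, 10 fixations each: x, y, duration, pupil size.
fixations = torch.randn(2, 10, 4)
logits = model(fixations)
print(logits.shape)  # torch.Size([2, 10, 5])
```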


Metadata: {"arxivId":"2507.12461v1","authors":"Trong-Thang Pham, Anh Nguyen, Zhigang Deng, Carol C. Wu, Hien Van Nguyen, Ngan Le","categories":"cs.CV, cs.AI","published":"2025-07-16T17:58:35Z","pdfUrl":"https://arxiv.org/pdf/2507.12461v1.pdf","abstractUrl":"https://arxiv.org/abs/2507.12461v1","rank":4}


12. Cost-aware Stopping for Bayesian Optimization

In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions is an important practical consideration. While several adaptive stopping rules have been proposed, in the cost-aware setting they lack guarantees ensuring they stop before incurring excessive function evaluation costs. We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora’s Box Gittins Index (PBGI) and log expected improvement per cost. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with these two acquisition functions. In experiments on synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, we show that combining our stopping rule with the PBGI acquisition function consistently matches or outperforms other acquisition-function/stopping-rule pairs in terms of cost-adjusted simple regret, a metric capturing trade-offs between solution quality and cumulative evaluation cost.
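
The flavor of such a rule can be conveyed with a toy sketch: stop when the best expected improvement per unit evaluation cost across candidates falls below a threshold. This simplification is for intuition only; the paper's rule is tied to PBGI and log expected improvement per cost, with formal guarantees this toy does not provide.

```python
import math

def expected_improvement(mu, sigma, best):
    # EI for a maximization problem under a Gaussian posterior belief.
    if sigma == 0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

# Posterior beliefs (mean, sd) and evaluation cost for each candidate.
candidates = [
    {"mu": 0.90, "sigma": 0.05, "cost": 4.0},
    {"mu": 0.85, "sigma": 0.15, "cost": 1.0},
    {"mu": 0.70, "sigma": 0.30, "cost": 0.5},
]
best_observed = 0.88
threshold = 0.01  # stop when no candidate is worth its cost

scores = [expected_improvement(c["mu"], c["sigma"], best_observed) / c["cost"]
          for c in candidates]
best_idx = max(range(len(scores)), key=scores.__getitem__)
if scores[best_idx] < threshold:
    print("Stop: no candidate justifies its evaluation cost.")
else:
    print(f"Evaluate candidate {best_idx} (EI/cost = {scores[best_idx]:.4f})")
```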


Metadata: {"arxivId":"2507.12453v1","authors":"Qian Xie, Linda Cai, Alexander Terenin, Peter I. Frazier, Ziv Scully","categories":"cs.LG","published":"2025-07-16T17:54:14Z","pdfUrl":"https://arxiv.org/pdf/2507.12453v1.pdf","abstractUrl":"https://arxiv.org/abs/2507.12453v1","rank":5}


13. TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and τ-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with a plug-and-play framework for future research.
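
The two-step structure is straightforward to skeleton: rubric scoring per turn along the three dimensions, then pairwise dialogue-level comparison. In the sketch below the LLM judge is a stub and the prompts are assumptions, not TD-EVAL's actual templates.

```python
TURN_DIMENSIONS = [
    "conversation cohesion",
    "backend knowledge consistency",
    "policy compliance",
]

def judge(prompt: str) -> float:
    # Stub: a real implementation would call an LLM and parse its score.
    return 4.0

def score_turn(context: str, response: str) -> dict:
    return {
        dim: judge(f"Rate the response for {dim} (1-5).\n"
                   f"Context: {context}\nResponse: {response}")
        for dim in TURN_DIMENSIONS
    }

def arena_compare(dialogue_a: str, dialogue_b: str) -> str:
    # Dialogue-level pairwise comparison (the "TOD Agent Arena" step).
    verdict = judge(f"Which dialogue better completes the task?\n"
                    f"A: {dialogue_a}\nB: {dialogue_b}")
    return "A" if verdict >= 3.0 else "B"

turn_scores = score_turn("User: Book a table for two at 7pm.",
                         "Agent: Done - table for two at 7pm at Luigi's.")
print(turn_scores)
print("Arena winner:", arena_compare("dialogue A text", "dialogue B text"))
```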


Metadata: {"arxivId":"2504.19982v2","authors":"Emre Can Acikgoz, Carl Guo, Suvodip Dey, Akul Datta, Takyoung Kim, Gokhan Tur, Dilek Hakkani-Tür","categories":"cs.CL, cs.AI","published":"2025-04-28T16:57:17Z","pdfUrl":"https://arxiv.org/pdf/2504.19982v2.pdf","abstractUrl":"https://arxiv.org/abs/2504.19982v2","rank":6}


14. Dynamic Risk Assessments for Offensive Cybersecurity Agents

Foundation models are increasingly becoming better autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU Hours in our study), adversaries can improve an agent’s cybersecurity capability on InterCode CTF by more than 40% relative to the baseline, without any external assistance. These results highlight the need to evaluate agents’ cybersecurity risk in a dynamic manner, painting a more representative picture of risk.


Metadata: {"arxivId":"2505.18384v3","authors":"Boyi Wei, Benedikt Stroebl, Jiacen Xu, Joie Zhang, Zhou Li, Peter Henderson","categories":"cs.CR, cs.AI","published":"2025-05-23T21:18:59Z","pdfUrl":"https://arxiv.org/pdf/2505.18384v3.pdf","abstractUrl":"https://arxiv.org/abs/2505.18384v3","rank":7}


15. MARS: Unleashing the Power of Variance Reduction for Training Large Models

Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin. The implementation of MARS is available at https://github.com/AGI-Arena/MARS.
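
The central update, a scaled stochastic recursive-momentum correction applied before an AdamW-style preconditioned step, can be sketched on a toy objective. Constants and the noise model below are illustrative assumptions; the authoritative implementation is the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr, beta1, beta2, gamma, wd, eps = 10, 1e-2, 0.9, 0.99, 0.025, 0.01, 1e-8
x = rng.normal(size=dim)
m = np.zeros(dim); v = np.zeros(dim); prev_x = x.copy()

def stoch_grad(x, noise_seed):
    g_rng = np.random.default_rng(noise_seed)
    return 2 * x + 0.1 * g_rng.normal(size=dim)  # noisy grad of ||x||^2

for t in range(1, 201):
    g = stoch_grad(x, t)
    g_prev = stoch_grad(prev_x, t)  # same noise sample, previous iterate
    # Variance-reduction correction (scaled recursive momentum): the
    # shared noise cancels in (g - g_prev), reducing gradient variance.
    c = g + gamma * (beta1 / (1 - beta1)) * (g - g_prev)
    prev_x = x.copy()
    # AdamW-style preconditioned update on the corrected gradient.
    m = beta1 * m + (1 - beta1) * c
    v = beta2 * v + (1 - beta2) * c * c
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * x)

print(f"final loss: {float(x @ x):.6f}")
```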


Metadata: {"arxivId":"2411.10438v3","authors":"Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu","categories":"cs.LG, math.OC, stat.ML","published":"2024-11-15T18:57:39Z","pdfUrl":"https://arxiv.org/pdf/2411.10438v3.pdf","abstractUrl":"https://arxiv.org/abs/2411.10438v3","rank":8}


16. S2WTM: Spherical Sliced-Wasserstein Autoencoder for Topic Modeling

Modeling latent representations in a hyperspherical space has proven effective for capturing directional similarities in high-dimensional text data, benefiting topic modeling. Variational autoencoder-based neural topic models (VAE-NTMs) commonly adopt the von Mises-Fisher prior to encode hyperspherical structure. However, VAE-NTMs often suffer from posterior collapse, where the KL divergence term in the objective function diminishes sharply, leading to ineffective latent representations. To mitigate this issue while modeling hyperspherical structure in the latent space, we propose the Spherical Sliced Wasserstein Autoencoder for Topic Modeling (S2WTM). S2WTM employs a prior distribution supported on the unit hypersphere and leverages the Spherical Sliced-Wasserstein distance to align the aggregated posterior distribution with the prior. Experimental results demonstrate that S2WTM outperforms state-of-the-art topic models, generating more coherent and diverse topics while improving performance on downstream tasks.
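
The distance at the heart of the method can be approximated in a short numpy sketch: project both samples onto random great circles and compare the resulting one-dimensional angle distributions. Note the sketch uses a plain sorted coupling rather than exact circular optimal transport, so it is an approximation of the true spherical sliced-Wasserstein distance.

```python
import numpy as np

def sample_great_circle(d, rng):
    # Orthonormal pair (u, v) spanning a random 2-plane -> a great circle.
    a = rng.normal(size=(d, 2))
    q, _ = np.linalg.qr(a)
    return q[:, 0], q[:, 1]

def ssw_distance(X, Y, n_slices=50, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_slices):
        u, v = sample_great_circle(X.shape[1], rng)
        # Geodesic projection: angular coordinate on the great circle.
        ax = np.arctan2(X @ v, X @ u)
        ay = np.arctan2(Y @ v, Y @ u)
        total += np.mean(np.abs(np.sort(ax) - np.sort(ay)))
    return total / n_slices

def normalize(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = normalize(rng.normal(size=(256, 10)))        # aggregated posterior
Y = normalize(rng.normal(size=(256, 10)) + 0.5)  # shifted "prior"
print(ssw_distance(X, X))  # ~0 for identical samples
print(ssw_distance(X, Y))  # larger for mismatched distributions
```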


Metadata: {"arxivId":"2507.12451v1","authors":"Suman Adhya, Debarshi Kumar Sanyal","categories":"cs.CL, cs.AI, cs.LG","published":"2025-07-16T17:47:45Z","pdfUrl":"https://arxiv.org/pdf/2507.12451v1.pdf","abstractUrl":"https://arxiv.org/abs/2507.12451v1","rank":9}


17. Navigating the Social Welfare Frontier: Portfolios for Multi-objective Reinforcement Learning

In many real-world applications of reinforcement learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $\alpha$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.
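
The sensitivity to $p$ that motivates portfolios is easy to see numerically: the sketch below computes generalized $p$-mean welfare for two hypothetical policies and shows the preferred policy flipping as $p$ varies.

```python
import numpy as np

def p_mean(utilities, p):
    u = np.asarray(utilities, dtype=float)
    if p == -np.inf:
        return u.min()                      # Egalitarian welfare
    if p == 0:
        return np.exp(np.mean(np.log(u)))   # Nash welfare (geometric mean)
    return np.mean(u ** p) ** (1.0 / p)     # p = 1 gives Utilitarian

policy_a = [0.9, 0.9, 0.2]   # high average, poor worst-off stakeholder
policy_b = [0.6, 0.6, 0.6]   # perfectly equal

for p in [-np.inf, -2, 0, 1]:
    wa, wb = p_mean(policy_a, p), p_mean(policy_b, p)
    better = "A" if wa > wb else "B"
    print(f"p={p}: welfare(A)={wa:.3f}, welfare(B)={wb:.3f} -> {better}")
# Policy A wins for utilitarian p=1 but loses for egalitarian p=-inf,
# which is why a single policy rarely suffices across the whole p-range.
```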


Metadata: {"arxivId":"2502.09724v2","authors":"Cheol Woo Kim, Jai Moondra, Shresth Verma, Madeleine Pollack, Lingkai Kong, Milind Tambe, Swati Gupta","categories":"cs.LG","published":"2025-02-13T19:13:55Z","pdfUrl":"https://arxiv.org/pdf/2502.09724v2.pdf","abstractUrl":"https://arxiv.org/abs/2502.09724v2","rank":10}