ArXiv Domain 2026-06-14
Data source: ArXiv Domain
LLM Domain Papers
1. EDEN: A Large-Scale Corpus of Clinical Notes for Italian
Abstract: We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million clinical notes fully anonymized, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six ...
ArXiv Domain 2026-06-16
Data source: ArXiv Domain
LLM Domain Papers
1. The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplement ...
ArXiv Domain 2026-06-21
Data source: ArXiv Domain
LLM Domain Papers
1. Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation
Abstract: Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability g ...
ArXiv Domain 2026-06-22
Data source: ArXiv Domain
LLM Domain Papers
1. Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation
Abstract: Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability g ...