CEU eTD Collection (2025); Zheng, Ying: LLM Citation Hallucination Evaluation: A Multi-Metric Analysis Using the HAGRID Dataset

CEU Electronic Theses and Dissertations, 2025
Author Zheng, Ying
Title LLM Citation Hallucination Evaluation: A Multi-Metric Analysis Using the HAGRID Dataset
Summary Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like responses across diverse domains. However, their tendency to produce hallucinated content, particularly fabricated or unsupported citations, poses a significant threat to their credibility in factual applications. Hallucination, especially citation-related hallucination, is a well-documented challenge in natural language generation (Ji et al., 2023). This project evaluates citation hallucination in LLM-generated answers by implementing a retrieval-augmented generation (RAG) framework (Lewis et al., 2021) and assessing model output using four complementary citation alignment metrics: retrieval recall, answer–citation recall, TF-IDF keyword coverage, and semantic similarity (Xu et al., 2025). Using the HAGRID dataset (HAGRID, n.d.), I analyzed 1,922 QA samples and found that 65.0% of the generated answers exhibit hallucination under a custom multi-metric definition combining citation, semantic, and lexical errors (see Section 2.5 for details), with retrieval failure identified as the primary source. I further classified hallucination types and visualized overlapping failure patterns to reveal compound risks. My findings highlight the importance of robust retrieval and multi-metric evaluation for reducing citation hallucination. Recommendations include enhancing embedding models and integrating re-ranking mechanisms to improve source grounding and citation accuracy.
Supervisor de la Rubia, Eduardo Arino; Böjte, Berta Eszter
Department Economics MSc
Full text https://www.etd.ceu.edu/2025/zheng_ying.pdf

© 2007-2025, Central European University