CEU eTD Collection

CEU Electronic Theses and Dissertations, 2025

Author	Zheng, Ying
Title	LLM Citation Hallucination Evaluation: A Multi-Metric Analysis Using the HAGRID Dataset
Summary	Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like responses across diverse domains. However, their tendency to produce hallucinated content, particularly fabricated or unsupported citations, poses a significant threat to their credibility in factual applications. Hallucination, especially citation-related hallucination, is a well-documented challenge in natural language generation (Ji et al., 2023). This project evaluates citation hallucination in LLM-generated answers by implementing a retrieval-augmented generation (RAG) framework (Lewis et al., 2021) and assessing model output with four complementary citation alignment metrics: retrieval recall, answer–citation recall, TF-IDF keyword coverage, and semantic similarity (Xu et al., 2025). Using the HAGRID dataset (HAGRID, n.d.), I analyzed 1,922 QA samples and found that 65.0% of the generated answers exhibit hallucination under a custom multi-metric definition combining citation, semantic, and lexical errors (see Section 2.5 for details), with retrieval failure identified as the primary source. I further classified hallucination types and visualized overlapping failure patterns to reveal compound risks. My findings highlight the importance of robust retrieval and multi-metric evaluation for reducing citation hallucination. Recommendations include enhancing embedding models and integrating re-ranking mechanisms to improve source grounding and citation accuracy.
Supervisor	de la Rubia, Eduardo Arino; Böjte, Berta Eszter
Department	Economics MSc
Full text	https://www.etd.ceu.edu/2025/zheng_ying.pdf

CEU eTD Collection (2025); Zheng, Ying: LLM Citation Hallucination Evaluation: A Multi-Metric Analysis Using the HAGRID Dataset