LLM-based NLG Evaluation: Current Status and Challenges
Computational Linguistics (CL) 2024
A comprehensive survey of LLM-based NLG evaluation methods covering ~100 works across 9 tasks, revealing critical biases (position, verbosity, self-enhancement) and charting future directions for more reliable automated text evaluation.
Evaluating natural language generation (NLG) is a vital but challenging problem in natural language processing. Traditional evaluation metrics, which mainly capture content overlap (e.g., n-gram overlap) between system outputs and references, are far from satisfactory, while large language models (LLMs) such as ChatGPT have demonstrated great potential for NLG evaluation in recent years. Various LLM-based automatic evaluation methods have been proposed, including metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human-LLM collaborative evaluation. In this survey, we give a taxonomy of LLM-based NLG evaluation methods and discuss the pros and cons of each. Finally, we discuss several open problems in this area and point out future research directions.
The Evaluation Problem
How do you know if machine-generated text is good? This seemingly simple question has puzzled NLP researchers for decades. Traditional metrics like BLEU and ROUGE count matching n-grams -- but a perfectly valid paraphrase might score terribly, while grammatical nonsense could score well.
With the rise of large language models, a new paradigm has emerged: using LLMs themselves as evaluators. Models like GPT-4 can read text and provide quality judgments that often correlate better with human assessments than any traditional metric. But this promising approach comes with its own challenges.
A Comprehensive Survey
This survey provides the first comprehensive review of LLM-based NLG evaluation. We analyze methods across diverse tasks -- summarization, dialogue, machine translation, creative writing -- to understand what works, what doesn't, and why.
The landscape is rich and rapidly evolving. Reference-based methods compare outputs to gold standards. Reference-free methods evaluate quality directly. Comparative methods rank outputs against each other. Each approach has strengths and weaknesses that depend on the task and evaluation criteria.
Four Evaluation Paradigms
Our taxonomy organizes the rapidly growing field into four coherent paradigms, each with distinct trade-offs between cost, accuracy, and flexibility.
LLM-Derived Metrics
Extract evaluation signals from internal LLM representations -- embeddings, attention patterns, and generation probabilities -- without explicit prompting.
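As a rough illustration, one family of LLM-derived metrics scores a hypothesis by how probable the model finds it given the source, in the spirit of generation-probability metrics such as BARTScore. The sketch below is a minimal version of that idea using Hugging Face transformers; the specific checkpoint is an illustrative assumption, not a recommendation from the survey.

```python
# Minimal sketch of a generation-probability metric: score the hypothesis by
# the (negated) mean token cross-entropy the model assigns to it when
# conditioned on the source text. Higher is better.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-large-cnn"  # illustrative choice of checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

def prob_score(source: str, hypothesis: str) -> float:
    src = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(hypothesis, return_tensors="pt", truncation=True)["input_ids"]
    with torch.no_grad():
        out = model(**src, labels=labels)
    # out.loss is the mean cross-entropy over hypothesis tokens;
    # negate it so that more probable hypotheses get higher scores.
    return -out.loss.item()

print(prob_score("The cat sat on the mat.", "A cat is sitting on a mat."))
```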
Prompting LLMs
Design prompts that instruct LLMs to score, compare, rank, or analyze errors in generated text. Five distinct methods: scoring, comparison, ranking, Boolean QA, and error analysis.
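A minimal sketch of the scoring variant is shown below: the evaluator is asked to return a single integer rating for one quality dimension. The prompt wording, the coherence criterion, and the model name are illustrative assumptions; real protocols typically add detailed rubrics, few-shot examples, or chain-of-thought instructions.

```python
# Minimal sketch of prompt-based scoring with an instruction-following LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are a summarization judge.
Source document:
{source}

Candidate summary:
{summary}

Rate the summary's coherence on a 1-5 scale (5 = best).
Reply with only the integer score."""

def llm_score(source: str, summary: str, model: str = "gpt-4o-mini") -> int:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variance
        messages=[{"role": "user",
                   "content": PROMPT.format(source=source, summary=summary)}],
    )
    return int(resp.choices[0].message.content.strip())
```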
Fine-tuning LLMs
Train specialized evaluation models (PandaLM, Prometheus, TIGERScore, and others) on human judgment data for more reliable and consistent assessments.
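To make the fine-tuning setup concrete, the sketch below shows what a single supervised training instance for a rubric-based evaluator might look like (in the spirit of Prometheus-style feedback-then-score outputs). The field names and JSONL layout are illustrative assumptions rather than the format used by any particular model.

```python
# Minimal sketch of one training instance for a fine-tuned evaluator.
import json

example = {
    "instruction": (
        "Evaluate the response against the rubric. "
        "Give feedback, then a score from 1 to 5."
    ),
    "input": {
        "question": "Explain why the sky is blue.",
        "response": "The sky is blue because of Rayleigh scattering ...",
        "rubric": "Factual accuracy and completeness of the explanation.",
        "reference_answer": "Shorter wavelengths of sunlight are scattered more ...",
    },
    # Target text the evaluator is trained to generate, typically distilled
    # from human judgments or from a stronger LLM's judgments.
    "output": "Feedback: The response correctly identifies Rayleigh scattering ... Score: 4",
}

with open("evaluator_sft.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```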
Human-LLM Collaboration
Combine the scalability of LLM evaluation with the reliability of human judgment, leveraging the strengths of both approaches.
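One possible collaboration pattern is triage: the LLM scores every item cheaply, and only items where repeated runs disagree are escalated to a human annotator. The sketch below assumes user-supplied `llm_score` and `human_score` callables and an arbitrary disagreement threshold; it is one illustrative workflow, not the survey's prescribed protocol.

```python
# Minimal sketch of LLM-first triage with human escalation.
from statistics import mean, pstdev

def triage(items, llm_score, human_score, n_runs=3, max_std=0.5):
    results = []
    for item in items:
        scores = [llm_score(item) for _ in range(n_runs)]
        if pstdev(scores) <= max_std:
            # Runs agree closely: keep the cheap LLM judgment.
            results.append((item, mean(scores), "llm"))
        else:
            # Runs disagree: route the item to a human annotator.
            results.append((item, human_score(item), "human"))
    return results
```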
The Hidden Biases
Our analysis revealed troubling biases in LLM evaluators. Position bias: models prefer options based on presentation order. Verbosity bias: longer outputs get higher scores regardless of quality. Self-enhancement bias: models favor their own outputs. Left unaddressed, these biases can silently corrupt evaluation results; a common mitigation for position bias is sketched after the list below.
Key Challenges Identified
- Position Bias: LLMs show preference based on the order of presented options
- Verbosity Bias: Longer outputs often receive higher scores regardless of quality
- Self-Enhancement Bias: Models tend to favor their own outputs
- Inconsistency: Evaluations can vary across multiple runs
- Hallucination: LLM evaluators may generate plausible but incorrect assessments
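A widely used mitigation for position bias in pairwise comparison is to judge both presentation orders and only accept verdicts that survive the swap. The sketch below assumes a hypothetical `judge(a, b)` helper that returns "A" or "B" for the option it prefers; the tie-on-disagreement rule is one common choice, not the only one.

```python
# Minimal sketch of order-swapped pairwise judging to reduce position bias.
def debiased_compare(judge, output_1: str, output_2: str) -> str:
    first = judge(output_1, output_2)   # output_1 shown in position A
    second = judge(output_2, output_1)  # output_2 shown in position A
    if first == "A" and second == "B":
        return "output_1"
    if first == "B" and second == "A":
        return "output_2"
    return "tie"  # verdict flipped with the order -> treat as no preference
```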
The Path Forward
Despite these challenges, LLM-based evaluation represents a significant advance over traditional metrics. The key is understanding the failure modes and designing systems that mitigate them. Calibration techniques, ensemble methods, and fine-tuned evaluators all show promise.
This survey serves as both a comprehensive reference for current methods and a roadmap for future research. As LLMs continue to improve, so too will their potential as evaluators -- but only if we remain vigilant about their limitations.
Survey Scope
- Evaluation Paradigms: Reference-based, reference-free, and comparative evaluation approaches.
- Task Coverage: Summarization, dialogue, translation, question answering, creative writing, and more.
- LLM Types: GPT-4, Claude, Llama, and specialized evaluation models.
- Evaluation Dimensions: Fluency, coherence, relevance, factuality, and task-specific criteria.