LLM-based NLG Evaluation: Current Status and Challenges

Mingqi Gao, Xinyu Hu, Jie Ruan, Xunjian Yin, Xiaojun Wan

Computational Linguistics (CL) 2024

TL;DR

A comprehensive survey of LLM-based NLG evaluation methods covering ~100 works across 9 tasks, revealing critical biases (position, verbosity, self-enhancement) and charting future directions for more reliable automated text evaluation.

Evaluating natural language generation (NLG) is a vital but challenging problem in natural language processing. Traditional evaluation metrics that mainly capture content overlap (e.g., n-gram overlap) between system outputs and references are far from satisfactory, while large language models (LLMs) such as ChatGPT have demonstrated great potential for NLG evaluation in recent years. Various LLM-based automatic evaluation methods have been proposed, including metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human-LLM collaborative evaluation. In this survey, we first give a taxonomy of LLM-based NLG evaluation methods and discuss the pros and cons of each. We then discuss several open problems in this area and point out future research directions.

Figure 1. Taxonomy of LLM-based NLG evaluation methods. The survey organizes approaches into four paradigms: LLM-derived metrics, prompting LLMs, fine-tuning LLMs, and human-LLM collaborative evaluation.

The Evaluation Problem

How do you know if machine-generated text is good? This seemingly simple question has puzzled NLP researchers for decades. Traditional metrics like BLEU and ROUGE count matching n-grams -- but a perfectly valid paraphrase might score terribly, while grammatical nonsense could score well.

With the rise of large language models, a new paradigm has emerged: using LLMs themselves as evaluators. Models like GPT-4 can read text and provide quality judgments that often correlate better with human assessments than any traditional metric. But this promising approach comes with its own challenges.

A Comprehensive Survey

This survey provides the first comprehensive review of LLM-based NLG evaluation. We analyzed methods across diverse tasks -- summarization, dialogue, machine translation, creative writing -- to understand what works, what doesn't, and why.

The landscape is rich and rapidly evolving. Reference-based methods compare outputs to gold standards. Reference-free methods evaluate quality directly. Comparative methods rank outputs against each other. Each approach has strengths and weaknesses that depend on the task and evaluation criteria.

Figure 2. An example of prompting LLMs to evaluate summary consistency, illustrating the structured prompt design with role definition, task instructions, evaluation criteria, and model output.

Four Evaluation Paradigms

Our taxonomy organizes the rapidly growing field into four coherent paradigms, each with distinct trade-offs between cost, accuracy, and flexibility.

1. LLM-Derived Metrics

Extract evaluation signals from internal LLM representations -- embeddings, attention patterns, and generation probabilities -- without explicit prompting.
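As a minimal sketch of the generation-probability family (BARTScore and GPTScore, for instance, score a candidate by the log-likelihood an LLM assigns to it given the source), the model call is abstracted away here: the function simply takes a list of per-token log-probabilities, which in practice would come from a real LLM's forward pass.

```python
def generation_prob_score(token_logprobs):
    """Length-normalized log-likelihood of a candidate text.

    In a BARTScore/GPTScore-style metric, `token_logprobs` holds the
    log-probabilities an LLM assigns to each token of the candidate,
    conditioned on the source (and optionally a reference).
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    return sum(token_logprobs) / len(token_logprobs)

# Toy illustration: under the same model, a fluent candidate receives a
# higher (less negative) average log-probability than a disfluent one.
fluent = [-0.2, -0.5, -0.3, -0.4]
disfluent = [-2.1, -3.0, -1.8, -2.5]
assert generation_prob_score(fluent) > generation_prob_score(disfluent)
```

Length normalization matters: without dividing by the token count, longer candidates would be penalized simply for having more terms in the sum.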

2. Prompting LLMs

Design prompts that instruct LLMs to score, compare, rank, or analyze errors in generated text. Five distinct methods: scoring, comparison, ranking, Boolean QA, and error analysis.
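The scoring variant can be sketched as a prompt builder that mirrors the structure in Figure 2: a role definition, task instructions, a criterion definition, and the texts to judge. The wording below is illustrative, not a canonical template from the survey, and the helper name is hypothetical.

```python
def build_scoring_prompt(source, summary, criterion="consistency", scale=(1, 5)):
    """Assemble a structured evaluation prompt for an LLM judge.

    Sections, in order: role, task instruction, criterion definition,
    the texts under evaluation, and the required answer format.
    """
    lo, hi = scale
    return "\n".join([
        "You are an expert evaluator of text summarization.",          # role
        f"Rate the {criterion} of the summary on a scale of {lo} to {hi}.",  # task
        f"{criterion.capitalize()} measures whether every fact in the "
        "summary is supported by the source document.",                # criterion
        "",
        f"Source document:\n{source}",
        f"Summary:\n{summary}",
        "",
        f"Answer with a single integer between {lo} and {hi}.",        # output format
    ])

prompt = build_scoring_prompt("The cat sat on the mat.", "A cat sat on a mat.")
```

The resulting string would be sent to any chat-style LLM API; constraining the answer format (a single integer) makes the judge's output easy to parse and aggregate.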

3. Fine-tuning LLMs

Train specialized evaluation models (PandaLM, Prometheus, TIGERScore, and others) on human judgment data for more reliable and consistent assessments.

4. Human-LLM Collaboration

Combine the scalability of LLM evaluation with the reliability of human judgment, leveraging the strengths of both approaches.

The Hidden Biases

Our analysis revealed troubling biases in LLM evaluators. Position bias: models prefer options based on presentation order. Verbosity bias: longer outputs get higher scores regardless of quality. Self-enhancement bias: models favor their own outputs. These biases can silently corrupt evaluation results if not carefully addressed.
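A common mitigation for position bias is to query a pairwise judge in both presentation orders and only accept a winner when the two verdicts agree. The sketch below assumes a generic `judge` callable (a stand-in for an LLM API call) that returns `"first"` or `"second"`.

```python
def debiased_comparison(judge, source, cand_a, cand_b):
    """Run a pairwise judge in both orders; report a tie on disagreement.

    `judge(source, first_shown, second_shown)` returns "first" or "second".
    A verdict that flips when the order is swapped signals position bias.
    """
    v1 = judge(source, cand_a, cand_b)  # A shown first
    v2 = judge(source, cand_b, cand_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # verdict depended on presentation order

# A judge that always prefers whichever candidate appears first is fully
# position-biased, so the swap test can never declare a winner.
always_first = lambda src, a, b: "first"
assert debiased_comparison(always_first, "src", "text A", "text B") == "tie"
```

This doubles the number of judge calls, but it converts a silent bias into an explicit tie that downstream analysis can count and report.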

The Path Forward

Despite these challenges, LLM-based evaluation represents a significant advance over traditional metrics. The key is understanding the failure modes and designing systems that mitigate them. Calibration techniques, ensemble methods, and fine-tuned evaluators all show promise.
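One simple instance of the ensemble idea is median aggregation over several judges (different LLMs, or the same LLM under different prompts): a single judge's inflated score, as might arise from verbosity or self-enhancement bias, moves the median far less than it moves the mean. The callables below are hypothetical stand-ins for real judge queries.

```python
import statistics

def ensemble_score(judges, text):
    """Aggregate multiple evaluators' scores with the median.

    Each judge is a callable returning a numeric score; the median is
    robust to one outlier verdict, unlike the arithmetic mean.
    """
    scores = [judge(text) for judge in judges]
    return statistics.median(scores)

# One inflated score (e.g., a judge favoring its own output) shifts the
# mean from 3.0 to ~3.67 but leaves the median at 3.
judges = [lambda t: 3, lambda t: 3, lambda t: 5]
assert ensemble_score(judges, "some generated output") == 3
```

With more judges, trimmed means or per-judge calibration against a small human-annotated set are natural extensions of the same idea.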

This survey serves as both a comprehensive reference for current methods and a roadmap for future research. As LLMs continue to improve, so too will their potential as evaluators -- but only if we remain vigilant about their limitations.

Survey Scope

Evaluation Paradigms: Reference-based, reference-free, and comparative evaluation approaches.

Task Coverage: Summarization, dialogue, translation, question answering, creative writing, and more.

LLM Types: GPT-4, Claude, Llama, and specialized evaluation models.

Evaluation Dimensions: Fluency, coherence, relevance, factuality, and task-specific criteria.

Citation

@article{gao2024llm,
  title={LLM-based NLG Evaluation: Current Status and Challenges},
  author={Gao, Mingqi and Hu, Xinyu and Ruan, Jie and Yin, Xunjian and Wan, Xiaojun},
  journal={Computational Linguistics},
  pages={1--44},
  year={2024},
  publisher={MIT Press},
  doi={10.1162/coli_a_00540}
}