Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability
EMNLP 2024
Themis is a specialized language model for reference-free NLG evaluation that matches or exceeds GPT-4's evaluation performance while providing interpretable, multi-criteria assessments with textual explanations alongside numerical scores.
The evaluation of natural language generation (NLG) tasks is a critical and challenging problem in NLP. While large language models (LLMs) have been employed for automatic evaluation, existing LLM-based metrics lack flexibility and interpretability. In this paper, we introduce Themis, a reference-free NLG evaluation language model built on NLG-Eval, a large-scale NLG evaluation corpus. We also propose two novel training paradigms, multi-perspective consistency verification and rating-oriented preference alignment, to further enhance the evaluation capabilities of Themis. Experiments demonstrate that Themis achieves superior evaluation performance on various NLG tasks, even surpassing GPT-4 in certain scenarios, while exhibiting strong generalization to unseen tasks and providing interpretable evaluation results.
Why "Themis"?
In Greek mythology, Themis is the goddess of justice and divine order. Our model aspires to bring fair, balanced judgment to NLG evaluation -- assessing quality intrinsically rather than through rigid comparison to reference outputs.
Beyond Reference Comparison
Traditional NLG evaluation requires reference outputs -- gold standard texts that system outputs are compared against. But references are expensive to create, may not cover the space of valid outputs, and force evaluation into a narrow comparison framework.
What if we could evaluate generated text directly, without references? This is the promise of reference-free evaluation: judge quality intrinsically, not by comparison to a specific target.
Building the Evaluator
Creating a reliable reference-free evaluator requires both large-scale training data and novel training paradigms. We tackle this challenge in two phases.
Constructing NLG-Eval
We build a large-scale evaluation corpus spanning 8 NLG task categories with quality annotations from both human experts and GPT-4, providing the foundation for training an evaluation-specific language model.
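For concreteness, here is a minimal sketch of what one corpus entry might look like. The field names and schema are our own illustration, not the actual NLG-Eval format:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """One hypothetical NLG-Eval entry; fields are illustrative only."""
    task: str        # e.g. "summarization", "dialogue"
    source: str      # input the system was conditioned on
    output: str      # generated text to be judged
    criterion: str   # e.g. "coherence", "fluency"
    rating: float    # quality annotation
    annotator: str   # "human" or "gpt-4"

record = EvalRecord(
    task="summarization",
    source="Article text ...",
    output="Model-generated summary ...",
    criterion="coherence",
    rating=4.0,
    annotator="human",
)
```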
Multi-Perspective Consistency Verification
We ensure the model's judgments are internally consistent across different evaluation perspectives -- if a text scores highly on individual criteria such as fluency, its overall rating should be consistent with those judgments.
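As a toy illustration of the consistency idea (our own simplification, not the paper's exact verification procedure), a candidate training example could be kept only when its overall rating coheres with its per-criterion scores:

```python
def consistent(ratings: dict[str, float], overall: float, tol: float = 1.0) -> bool:
    """Toy check: the overall score should stay within `tol` of the
    mean of the per-criterion scores. Illustrative, not the paper's rule."""
    mean = sum(ratings.values()) / len(ratings)
    return abs(overall - mean) <= tol

# Keep a candidate training example only if its judgments cohere.
sample = {"fluency": 4.5, "coherence": 4.0, "relevance": 3.5}
keep = consistent(sample, overall=4.0)   # True: 4.0 matches the mean of 4.0
```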
Rating-Oriented Preference Alignment
We calibrate the model's scores to match human quality intuitions through preference-based training, producing an evaluator that is both accurate and well-calibrated.
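One plausible shape for such preference training is a DPO-style pairwise objective in which pairs are ordered by their quality ratings. The sketch below is our illustration of that family of losses, not the exact objective used by Themis:

```python
import math

def preference_loss(logp_preferred: float, logp_rejected: float,
                    ref_preferred: float, ref_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO-style pairwise loss: push the policy to favor the evaluation
    that annotators rated higher, relative to a frozen reference model.
    Illustrative only; not the paper's exact formulation."""
    margin = beta * ((logp_preferred - ref_preferred)
                     - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```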
Flexibility in Evaluation
One of Themis's key strengths is flexibility. It can evaluate against multiple criteria -- fluency, coherence, relevance, factuality -- weighted according to what matters for a particular application. It provides textual explanations alongside numerical scores, helping users understand why a particular output received its rating.
This interpretability is crucial for practical use. A score without explanation is hard to act on; understanding why something was rated poorly points the way to improvement.
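As a sketch of criterion weighting on the consumer side (the function and weights below are hypothetical, not a Themis API), per-criterion scores can be combined however an application demands:

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores with application-specific weights.
    Criterion names and the aggregation rule are illustrative."""
    total = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total

scores = {"fluency": 4.5, "coherence": 4.0, "factuality": 3.0}
weights = {"fluency": 1.0, "coherence": 1.0, "factuality": 2.0}  # factuality matters most here
print(weighted_score(scores, weights))  # 3.625
```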
Interpretable by Design
Unlike black-box metrics that produce a single number, Themis generates natural language rationales for its evaluations. Users see not just "3.5 / 5" but a detailed explanation of strengths and weaknesses across multiple quality dimensions -- making the evaluation actionable.
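In practice this means the raw model output can be split into a usable rationale and a machine-readable score. The response format and parsing logic below are illustrative assumptions, not the model's fixed output schema:

```python
import re

# A hypothetical Themis-style response: free-text rationale ending in a rating.
response = (
    "The summary covers the main event but omits the second key finding, "
    "and one date is inconsistent with the source. Rating: 3.5"
)

match = re.search(r"Rating:\s*([0-9]+(?:\.[0-9]+)?)", response)
score = float(match.group(1)) if match else None
rationale = response[: match.start()].strip() if match else response
print(score)      # 3.5
print(rationale)  # the explanation, usable as actionable feedback
```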
Performance Comparison (Spearman Correlation with Human Judgments)
| Model | Summarization | Dialogue | Data-to-Text | Average |
|---|---|---|---|---|
| BARTScore | 0.312 | 0.285 | 0.341 | 0.313 |
| UniEval | 0.423 | 0.371 | 0.402 | 0.399 |
| GPT-3.5 | 0.401 | 0.389 | 0.378 | 0.389 |
| GPT-4 | 0.468 | 0.445 | 0.451 | 0.455 |
| Themis | 0.487 | 0.462 | 0.473 | 0.474 |
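For reference, the numbers above are Spearman rank correlations between metric scores and human ratings, computable with scipy.stats.spearmanr. The data below is made up purely to show the computation:

```python
from scipy.stats import spearmanr

# Toy example: rank correlation between human ratings and metric scores.
human  = [4.0, 2.5, 3.0, 5.0, 1.5]
metric = [3.8, 2.0, 3.2, 4.6, 1.9]

rho, p_value = spearmanr(human, metric)
print(f"Spearman rho = {rho:.3f}")  # 1.000 here: the two rankings agree exactly
```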
Key Results
- Superior Performance: Outperforms all baselines on NLG evaluation across summarization, dialogue, and data-to-text tasks
- GPT-4 Competitive: Matches or exceeds GPT-4's evaluation capabilities despite being a smaller, specialized model
- Strong Generalization: Works well on unseen tasks without task-specific training, demonstrating robust transfer
- Interpretable Outputs: Provides natural language explanations alongside numerical scores for actionable feedback
A New Standard for NLG Evaluation
Themis demonstrates that specialized evaluation models can match or exceed general-purpose LLMs like GPT-4 on evaluation tasks. Training specifically for evaluation with carefully designed training paradigms yields a model that is more reliable, more interpretable, and more practical for production use.
As NLG systems become more sophisticated, having equally sophisticated evaluation becomes essential. Themis provides a foundation for rigorous, reference-free quality assessment that goes beyond simple numerical scores to offer actionable, human-aligned judgments.