Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability
EMNLP 2024
Themis is a specialized language model for reference-free NLG evaluation that matches or exceeds GPT-4's evaluation performance while providing interpretable, multi-criteria assessments with textual explanations alongside numerical scores.
The evaluation of natural language generation (NLG) tasks is a critical and challenging problem in NLP. While large language models (LLMs) have been employed for automatic evaluation, existing LLM-based metrics lack flexibility and interpretability. In this paper, we introduce Themis, a reference-free NLG evaluation language model built on NLG-Eval, a large-scale NLG evaluation corpus. We also propose two novel training paradigms, multi-perspective consistency verification and rating-oriented preference alignment, to further enhance the evaluation capabilities of Themis. Experiments demonstrate that Themis achieves superior evaluation performance on various NLG tasks, even surpassing GPT-4 in certain scenarios, while exhibiting strong generalization to unseen tasks and providing interpretable evaluation results.
Why "Themis"?
In Greek mythology, Themis is the goddess of justice and divine order. Our model aspires to bring fair, balanced judgment to NLG evaluation -- assessing quality intrinsically rather than through rigid comparison to reference outputs.
Beyond Reference Comparison
Traditional NLG evaluation requires reference outputs -- gold standard texts that system outputs are compared against. But references are expensive to create, may not cover the space of valid outputs, and force evaluation into a narrow comparison framework.
What if we could evaluate generated text directly, without references? This is the promise of reference-free evaluation: judge quality intrinsically, not by comparison to a specific target.
Building the Evaluator
Creating a reliable reference-free evaluator requires both large-scale training data and novel training paradigms. We tackle this challenge in two phases.
Constructing NLG-Eval
We build a large-scale evaluation corpus spanning 8 NLG task categories with quality annotations from both human experts and GPT-4, providing the foundation for training an evaluation-specific language model.
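For concreteness, here is a minimal sketch of what one corpus entry might look like. The field names and schema are our own illustration, not the actual NLG-Eval format:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """One hypothetical NLG-Eval entry; fields are illustrative only."""
    task: str        # e.g. "summarization", "dialogue"
    source: str      # input the system was conditioned on
    output: str      # generated text to be judged
    criterion: str   # e.g. "coherence", "fluency"
    rating: float    # quality annotation
    annotator: str   # "human" or "gpt-4"

record = EvalRecord(
    task="summarization",
    source="Article text ...",
    output="Model-generated summary ...",
    criterion="coherence",
    rating=4.0,
    annotator="human",
)
```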
Multi-Perspective Consistency Verification
We ensure the model's judgments are internally consistent across different evaluation perspectives -- if a text scores highly on individual criteria such as fluency, its overall rating should be consistent with those judgments.
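As a toy illustration of the consistency idea (our own simplification, not the paper's exact verification procedure), a candidate training example could be kept only when its overall rating coheres with its per-criterion scores:

```python
def consistent(ratings: dict[str, float], overall: float, tol: float = 1.0) -> bool:
    """Toy check: the overall score should stay within `tol` of the
    mean of the per-criterion scores. Illustrative, not the paper's rule."""
    mean = sum(ratings.values()) / len(ratings)
    return abs(overall - mean) <= tol

# Keep a candidate training example only if its judgments cohere.
sample = {"fluency": 4.5, "coherence": 4.0, "relevance": 3.5}
keep = consistent(sample, overall=4.0)   # True: 4.0 matches the mean of 4.0
```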
Rating-Oriented Preference Alignment
We calibrate the model's scores to match human quality intuitions through preference-based training, producing an evaluator that is both accurate and well-calibrated.
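One plausible shape for such preference training is a DPO-style pairwise objective in which pairs are ordered by their quality ratings. The sketch below is our illustration of that family of losses, not the exact objective used by Themis:

```python
import math

def preference_loss(logp_preferred: float, logp_rejected: float,
                    ref_preferred: float, ref_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO-style pairwise loss: push the policy to favor the evaluation
    that annotators rated higher, relative to a frozen reference model.
    Illustrative only; not the paper's exact formulation."""
    margin = beta * ((logp_preferred - ref_preferred)
                     - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```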
Flexibility in Evaluation
One of Themis's key strengths is flexibility. It can evaluate against multiple criteria -- fluency, coherence, relevance, factuality -- weighted according to what matters for a particular application. It provides textual explanations alongside numerical scores, helping users understand why a particular output received its rating.
This interpretability is crucial for practical use. A score without explanation is hard to act on; understanding why something was rated poorly points the way to improvement.
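As a sketch of criterion weighting on the consumer side (the function and weights below are hypothetical, not a Themis API), per-criterion scores can be combined however an application demands:

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores with application-specific weights.
    Criterion names and the aggregation rule are illustrative."""
    total = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total

scores = {"fluency": 4.5, "coherence": 4.0, "factuality": 3.0}
weights = {"fluency": 1.0, "coherence": 1.0, "factuality": 2.0}  # factuality matters most here
print(weighted_score(scores, weights))  # 3.625
```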
Interpretable by Design
Unlike black-box metrics that produce a single number, Themis generates natural language rationales for its evaluations. Users see not just "3.5 / 5" but a detailed explanation of strengths and weaknesses across multiple quality dimensions -- making the evaluation actionable.
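In practice this means the raw model output can be split into a usable rationale and a machine-readable score. The response format and parsing logic below are illustrative assumptions, not the model's fixed output schema:

```python
import re

# A hypothetical Themis-style response: free-text rationale ending in a rating.
response = (
    "The summary covers the main event but omits the second key finding, "
    "and one date is inconsistent with the source. Rating: 3.5"
)

match = re.search(r"Rating:\s*([0-9]+(?:\.[0-9]+)?)", response)
score = float(match.group(1)) if match else None
rationale = response[: match.start()].strip() if match else response
print(score)      # 3.5
print(rationale)  # the explanation, usable as actionable feedback
```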
Performance Comparison (Spearman Correlation with Human Judgments)
| Model | Summarization | Dialogue | Data-to-Text | Average |
|---|---|---|---|---|
| BARTScore | 0.312 | 0.285 | 0.341 | 0.313 |
| UniEval | 0.423 | 0.371 | 0.402 | 0.399 |
| GPT-3.5 | 0.401 | 0.389 | 0.378 | 0.389 |
| GPT-4 | 0.468 | 0.445 | 0.451 | 0.455 |
| Themis | 0.487 | 0.462 | 0.473 | 0.474 |
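For reference, the numbers above are Spearman rank correlations between metric scores and human ratings, computable with scipy.stats.spearmanr. The data below is made up purely to show the computation:

```python
from scipy.stats import spearmanr

# Toy example: rank correlation between human ratings and metric scores.
human  = [4.0, 2.5, 3.0, 5.0, 1.5]
metric = [3.8, 2.0, 3.2, 4.6, 1.9]

rho, p_value = spearmanr(human, metric)
print(f"Spearman rho = {rho:.3f}")  # 1.000 here: the two rankings agree exactly
```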
Key Results
- Superior Performance: Outperforms all baselines on NLG evaluation across summarization, dialogue, and data-to-text tasks
- GPT-4 Competitive: Matches or exceeds GPT-4's evaluation capabilities despite being a smaller, specialized model
- Strong Generalization: Works well on unseen tasks without task-specific training, demonstrating robust transfer
- Interpretable Outputs: Provides natural language explanations alongside numerical scores for actionable feedback
A New Standard for NLG Evaluation
Themis demonstrates that specialized evaluation models can match or exceed general-purpose LLMs like GPT-4 on evaluation tasks. Training specifically for evaluation with carefully designed training paradigms yields a model that is more reliable, more interpretable, and more practical for production use.
As NLG systems become more sophisticated, having equally sophisticated evaluation becomes essential. Themis provides a foundation for rigorous, reference-free quality assessment that goes beyond simple numerical scores to offer actionable, human-aligned judgments.