Exploring Context-Aware Evaluation Metrics for Machine Translation
EMNLP 2023 Findings
Sentence-level MT metrics miss discourse-level errors. Cont-COMET extends the COMET framework with preceding and subsequent context, plus a content selection method to extract relevant information, improving both system-level and segment-level evaluation on WMT benchmarks.
Previous studies on machine translation evaluation have mostly focused on the quality of individual sentences, overlooking the important role of contextual information. Although the WMT Metrics Shared Tasks have incorporated context into the human annotation of translation quality since 2019, existing metrics and methods still do not take advantage of it. In this paper, we propose Cont-COMET, a context-aware machine translation evaluation metric built upon the effective COMET framework. Our approach considers both the preceding and subsequent context of the sentence to be evaluated, and trains the metric to align with the setting used during human annotation. We also introduce a content selection method to extract and utilize the most relevant contextual information. Experiments with Cont-COMET on the official WMT test framework show improvements in both system-level and segment-level assessment.
The Sentence in Isolation
Machine translation evaluation has traditionally focused on individual sentences. Take a sentence, translate it, compare it to a reference, compute a score. This works for simple cases, but real translation happens in context -- in documents where sentences connect, where pronouns refer back, and where terminology must stay consistent.
A translation that is perfect in isolation might be terrible in context. "She went to the bank" could be correct or wrong depending on whether the previous sentence talked about rivers or money. Sentence-level metrics cannot capture this. Since 2019, WMT Metrics Shared Tasks have incorporated context in human annotations, yet the automatic metrics themselves have not caught up.
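For concreteness, this is what sentence-level scoring looks like with the open-source unbabel-comet library (API per comet 2.x; the checkpoint name is one public sentence-level model, used here purely for illustration):

```python
from comet import download_model, load_from_checkpoint

# Download and load a standard sentence-level COMET checkpoint
# (the name is illustrative; any COMET-style model works the same way).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each segment is scored in isolation: source, hypothesis, reference.
data = [{
    "src": "Sie ging zur Bank.",
    "mt":  "She went to the bank.",
    "ref": "She went to the riverbank.",
}]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one score per segment, with no document context
```

Nothing in this pipeline can tell whether "bank" should have been "riverbank" -- the disambiguating sentence lives outside the scored segment.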
The Gap
Human annotators at WMT already evaluate translations with surrounding context in view. But the automatic metrics they are compared against still score sentences one at a time -- creating a fundamental mismatch between how quality is judged and how it is measured.
Discourse Matters
We investigated how to incorporate document context into MT evaluation. The challenge is knowing what context helps and how to use it. Do we simply concatenate surrounding sentences? Weight them by proximity? Build more sophisticated representations of document structure?
Different strategies suit different purposes. Catching coreference errors requires attention to pronouns and their antecedents. Measuring lexical consistency needs document-wide vocabulary tracking. Evaluating coherence requires understanding discourse relations.
Context Window
Gather the preceding and subsequent sentences around the target sentence to form a context window for evaluation.
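A minimal sketch of the windowing step, assuming the document is a flat list of sentences and the window size is symmetric (both are our illustrative choices, not details fixed by the paper):

```python
def context_window(sentences, idx, before=2, after=2):
    """Return (preceding, target, subsequent) for the sentence at idx.

    `sentences` is the document as an ordered list of strings;
    the window is clipped at document boundaries.
    """
    preceding = sentences[max(0, idx - before):idx]
    subsequent = sentences[idx + 1:idx + 1 + after]
    return preceding, sentences[idx], subsequent

doc = ["They walked along the river.",
       "She went to the bank.",
       "The water was calm."]
prev_ctx, target, next_ctx = context_window(doc, 1, before=1, after=1)
```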
Content Selection
Apply a content selection method to extract the most relevant contextual information, filtering noise from less useful surrounding text.
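The exact selection criterion is not spelled out here; as one plausible instantiation, the sketch below ranks context sentences by embedding similarity to the target and keeps the top k. The sentence-transformers library and the model name are our assumptions, not the paper's setup:

```python
from sentence_transformers import SentenceTransformer, util

# Encoder choice is an assumption for illustration only.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_context(target, candidates, k=2):
    """Keep the k context sentences most similar to the target."""
    embs = encoder.encode([target] + candidates, convert_to_tensor=True)
    sims = util.cos_sim(embs[0], embs[1:])[0]
    ranked = sorted(zip(candidates, sims.tolist()),
                    key=lambda pair: pair[1], reverse=True)
    return [sent for sent, _ in ranked[:k]]
```

The point of any such filter is the same: drop surrounding text that contributes noise rather than signal before it ever reaches the encoder.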
Context-Aware Encoding
Extend the COMET framework to encode source, hypothesis, reference, and their contexts jointly, producing context-sensitive representations.
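One straightforward way to realize joint encoding is to splice the selected context around each field with a separator before it reaches the encoder; the separator token and layout below are assumptions, not the paper's exact input format:

```python
SEP = " </s> "  # separator token; the right choice depends on the encoder

def with_context(prev_ctx, sentence, next_ctx):
    """Concatenate selected context around a sentence for joint encoding."""
    parts = [" ".join(prev_ctx), sentence, " ".join(next_ctx)]
    return SEP.join(p for p in parts if p)  # skip empty sides of the window

src_in = with_context(["Sie standen am Fluss."], "Sie ging zur Bank.", [])
hyp_in = with_context(["They stood by the river."], "She went to the bank.", [])
ref_in = with_context(["They stood by the river."], "She went to the riverbank.", [])
# src_in / hyp_in / ref_in then replace the bare fields in the usual
# COMET-style triple, so the encoder sees context-sensitive inputs.
```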
Aligned Training
Train the metric to match WMT human annotations that were produced with context visible, aligning the metric's behavior with how humans actually judge translations.
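Training then reduces to regressing the metric's prediction onto human scores that were collected with context visible. A minimal PyTorch-style sketch with a toy stand-in model and synthetic data (all names and dimensions are ours):

```python
import torch
import torch.nn as nn

# Toy stand-in: the real model encodes context-augmented (src, hyp, ref)
# triples; here we fake the encoded features as 24-dim vectors.
model = nn.Sequential(nn.Linear(24, 16), nn.Tanh(), nn.Linear(16, 1))
criterion = nn.MSELoss()  # regress onto in-context human judgments
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

features = torch.randn(32, 24)  # placeholder for encoded triples
human = torch.rand(32)          # placeholder context-aware annotations

for _ in range(10):
    pred = model(features).squeeze(-1)
    loss = criterion(pred, human)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```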
Context Integration Strategies
We systematically compared different ways of incorporating context: fixed windows of surrounding sentences, weighted attention to relevant context, and learned representations that capture discourse structure. Each approach has trade-offs between effectiveness and computational cost. The content selection method proved essential -- blindly including all surrounding text can introduce noise that hurts rather than helps.
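As an example of the middle option, weighting context by proximity can be as simple as exponential decay over sentence distance when pooling context representations; the decay rate and mean pooling here are illustrative assumptions, not the paper's mechanism:

```python
import math
import torch

def proximity_pooled(ctx_embs, distances, decay=0.5):
    """Weighted average of context sentence embeddings.

    ctx_embs: (n_ctx, dim) tensor of context sentence embeddings.
    distances: distance in sentences from the target (1 = adjacent);
    nearer sentences receive exponentially larger weights.
    """
    weights = torch.tensor([math.exp(-decay * d) for d in distances])
    weights = weights / weights.sum()
    return (weights.unsqueeze(-1) * ctx_embs).sum(dim=0)

pooled = proximity_pooled(torch.randn(4, 768), [2, 1, 1, 2])
```

A fixed window is cheaper but treats all context equally; learned representations are the most flexible but the most expensive to train, which is the trade-off noted above.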
What Context Reveals
Our experiments showed that context-aware metrics catch errors that sentence-level metrics miss. Pronoun translation errors, inconsistent terminology, and coherence breaks all become visible when evaluation considers the broader document. Context-aware automatic metrics also correlate better with human ratings of document-level quality.
But context is not free -- it adds computational overhead and can sometimes introduce noise if the context itself is poorly translated. The key is knowing when context helps and how much to use. Our content selection approach addresses this by identifying and extracting the most informative parts of the surrounding text.
Evaluation on WMT Metrics Shared Task
| Metric | Context | System-Level Correlation | Segment-Level Correlation |
|---|---|---|---|
| COMET (baseline) | No | Baseline | Baseline |
| Cont-COMET (full context) | Yes | Improved | Improved |
| Cont-COMET (content selection) | Selected | Best | Best |
Key Findings
- Discourse Sensitivity: Context-aware metrics better capture coreference and coherence errors invisible to sentence-level evaluation
- Human Correlation: Improved alignment with human document-level judgments at both system and segment levels
- Content Selection Matters: Selectively extracting relevant context outperforms blindly concatenating all surrounding text
- Practical Guidelines: Recommendations for when and how to incorporate context into MT evaluation pipelines
Toward Document-Level Quality
As machine translation matures, document-level quality becomes increasingly important. This work provides a foundation for evaluation metrics that see beyond the sentence -- measuring not just whether individual sentences are translated correctly, but whether they work together as coherent documents.
Cont-COMET demonstrates that aligning metric training with the way humans actually evaluate -- in context -- is a simple but effective principle. The content selection mechanism ensures that this context is used wisely, not wastefully. Together, these ideas point toward a future where MT evaluation truly reflects the document-level quality that matters to real users.