How Do Seq2Seq Models Perform on End-to-End Data-to-Text Generation?
ACL 2022
We systematically evaluate Seq2Seq models on data-to-text generation using fine-grained MQM error analysis across five models and four datasets, revealing that high BLEU scores mask systematic error patterns -- copy mechanisms trade omission errors for hallucinations, and pre-training reduces errors across the board.
With the development of neural networks, the Seq2Seq paradigm has been widely adopted for end-to-end data-to-text generation, achieving good performance on several benchmarks. However, a detailed study of model behavior in terms of generation quality is still lacking. To fill this gap, we conduct a comprehensive evaluation of representative Seq2Seq models on end-to-end data-to-text generation tasks. We take a fine-grained, multi-dimensional approach and annotate model outputs with eight types of errors based on the Multidimensional Quality Metrics (MQM) framework. We find that copy mechanisms help reduce omission and extrinsic inaccuracy errors but increase addition errors. Pre-training is effective in reducing errors, and the training strategy matters more than the model architecture. We also investigate the effects of dataset characteristics and model size. Our work provides valuable insights for future research on data-to-text generation.
The BLEU Score Illusion
Data-to-text generation -- turning structured data like tables and graphs into natural language -- has seen remarkable progress. BLEU scores keep climbing. Leaderboards show steady improvement. But anyone who reads the actual outputs knows something is wrong. The generated text often sounds fluent but says things the data doesn't support, or misses crucial information entirely.
This disconnect between automatic metrics and actual quality is a crisis hiding in plain sight. We set out to understand what Seq2Seq models actually do well, and where they systematically fail.
Beyond Aggregate Scores
Standard evaluation gives you a number. That number might be 45 BLEU or 52 BLEU, but it doesn't tell you why the text is good or bad, or which specific problems plague your model. To truly understand model behavior, we need finer-grained analysis.
We adopted the Multidimensional Quality Metrics (MQM) framework, annotating outputs from five representative models across four datasets with eight specific error types. This isn't sampling -- it's systematic classification of every error in hundreds of generated texts.
The MQM Error Taxonomy
We classify errors into eight fine-grained types: Omission (missing data), Addition (hallucinated content), Inaccuracy -- Extrinsic (facts contradicting the data), Inaccuracy -- Intrinsic (self-contradictions), Mistranslation (wrong value mapping), Entity Confusion (swapped entities), Grammar, and Other. This taxonomy reveals failure modes that a single BLEU number conceals.
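To make the taxonomy concrete, here is a minimal Python sketch of an annotation record built on these eight types. The class and field names are illustrative assumptions, not the paper's actual annotation tooling:

```python
# Illustrative representation of the eight MQM-based error types; the names
# here are ours, not the paper's annotation tooling.
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    OMISSION = "omission"                            # input data missing from the text
    ADDITION = "addition"                            # hallucinated content
    INACCURACY_EXTRINSIC = "inaccuracy_extrinsic"    # facts contradicting the data
    INACCURACY_INTRINSIC = "inaccuracy_intrinsic"    # self-contradictions
    MISTRANSLATION = "mistranslation"                # wrong value mapping
    ENTITY_CONFUSION = "entity_confusion"            # swapped entities
    GRAMMAR = "grammar"
    OTHER = "other"


@dataclass
class ErrorAnnotation:
    output_id: str         # which generated text the error occurs in
    span: tuple[int, int]  # character offsets of the erroneous span
    error_type: ErrorType


def error_counts(annotations: list[ErrorAnnotation]) -> dict[ErrorType, int]:
    """Tally errors per type -- the basis for per-model error distributions."""
    counts = {t: 0 for t in ErrorType}
    for a in annotations:
        counts[a.error_type] += 1
    return counts
```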
Experimental Setup
We evaluate five representative models spanning the major architectural choices in data-to-text generation (a fine-tuning sketch for the pre-trained models follows the list):
Vanilla Seq2Seq (LSTM)
Baseline encoder-decoder with attention, no copy mechanism.
Seq2Seq + Copy
Adds a pointer-generator network to directly copy tokens from the input.
Transformer
Standard Transformer encoder-decoder without pre-training.
BART
Pre-trained denoising autoencoder, fine-tuned for data-to-text.
T5
Pre-trained text-to-text model, representing the pre-training paradigm.
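For the two pre-trained models, fine-tuning follows the standard text-to-text recipe: linearize the input triples into a string and train with sequence cross-entropy. A minimal HuggingFace sketch, assuming a t5-base checkpoint and a WebNLG-style linearization (the prompt format and hyperparameters are illustrative assumptions, not the paper's exact configuration):

```python
# Minimal sketch of fine-tuning T5 for data-to-text; the linearization format,
# checkpoint, and learning rate are illustrative, not the paper's exact setup.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# One WebNLG-style example: linearized triples in, reference sentence out.
source = "translate Graph to Text: Alan_Bean | occupation | Test_pilot"
target = "Alan Bean served as a test pilot."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
loss.backward()
optimizer.step()

# After training, generate with beam search and inspect the text for errors.
model.eval()
with torch.no_grad():
    out = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

BART fine-tuning is identical apart from the checkpoint and the tokenizer/model classes.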
Error Distribution Across Models (WebNLG)
| Model | Omission | Addition | Inaccuracy | Mistranslation | Grammar | Total Errors |
|---|---|---|---|---|---|---|
| LSTM | 32.4% | 8.1% | 18.7% | 14.2% | 11.3% | High |
| LSTM + Copy | 21.6% | 15.3% | 12.4% | 10.8% | 9.7% | Medium |
| Transformer | 24.1% | 10.5% | 15.2% | 11.6% | 8.4% | Medium |
| BART | 12.3% | 7.2% | 8.5% | 6.1% | 4.2% | Low |
| T5 | 10.8% | 6.5% | 7.9% | 5.4% | 3.6% | Lowest |
The Trade-offs of Copy
Copy mechanisms were supposed to solve data-to-text generation by letting models directly copy tokens from input tables. They do help -- omission errors decrease because the model can faithfully reproduce input values. But the improvement comes with a cost: models start hallucinating additional content, perhaps because copying makes generation feel "easier" and the model becomes overconfident.
This trade-off illustrates a broader pattern: architectural choices have complex effects that aggregate metrics hide.
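Mechanically, the trade-off originates in a single learned gate. Here is a minimal sketch of the pointer-generator mixing step (in the style of See et al., 2017) used by the Seq2Seq + Copy model; the function and tensor names are ours:

```python
# Sketch of pointer-generator mixing: blend the decoder's generation
# distribution with a copy distribution induced by attention over the input.
import torch

def copy_mix(vocab_logits, attn_weights, src_token_ids, p_gen):
    """Blend generation and copy distributions.

    vocab_logits:  (batch, vocab)    decoder output scores
    attn_weights:  (batch, src_len)  attention over source tokens (sums to 1)
    src_token_ids: (batch, src_len)  vocabulary ids (long) of the source tokens
    p_gen:         (batch, 1)        learned gate in [0, 1]
    """
    p_vocab = torch.softmax(vocab_logits, dim=-1)
    # Scatter attention mass onto the vocabulary ids it points at.
    p_copy = torch.zeros_like(p_vocab).scatter_add_(-1, src_token_ids, attn_weights)
    # High p_gen -> generate from the vocabulary; low p_gen -> copy from input.
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```

A single scalar decides, token by token, whether to trust the vocabulary distribution or the attention over the input; leaning toward copying makes faithful reproduction of input values cheap, which plausibly is also what invites the extra, unsupported content.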
The Copy Mechanism Trade-off
Copy reduces omission errors by ~33% but increases addition errors by ~89%. The net effect on BLEU may look positive, but the model is swapping one failure mode for another -- missing information for hallucinated information. This has very different implications depending on the application.
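Those figures follow directly from the WebNLG table above; a quick check:

```python
# Relative changes implied by the WebNLG error table (LSTM vs. LSTM + Copy).
omission_lstm, omission_copy = 32.4, 21.6
addition_lstm, addition_copy = 8.1, 15.3

print(f"omission: {(omission_copy - omission_lstm) / omission_lstm:+.0%}")  # -33%
print(f"addition: {(addition_copy - addition_lstm) / addition_lstm:+.0%}")  # +89%
```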
The Pre-training Revolution
Perhaps our most striking finding concerns pre-training. Pre-trained models don't just score better -- they make fundamentally fewer errors across all categories. The linguistic knowledge encoded during pre-training transfers effectively to the structured data domain, even though pre-training corpora contain little structured data.
Model size matters too, but not as simply as "bigger is better." The relationship between capacity and error reduction varies by error type, suggesting that scale and capability aren't synonymous.
Impact of Pre-training on Error Reduction
| Comparison | Omission | Addition | Inaccuracy | Overall |
|---|---|---|---|---|
| Transformer (no pre-training) | 24.1% | 10.5% | 15.2% | Baseline |
| BART (pre-trained) | 12.3% | 7.2% | 8.5% | ~45% fewer errors |
| T5 (pre-trained) | 10.8% | 6.5% | 7.9% | ~52% fewer errors |
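As a rough sanity check, summing the five per-category rates from the WebNLG table gives approximately the same overall reductions; the paper's exact figures presumably aggregate raw counts over all eight error types, so these approximations land within a point or two:

```python
# Rough sanity check against the Transformer baseline, using only the five
# WebNLG columns shown above (the full accounting covers all eight types).
rates = {
    "Transformer": [24.1, 10.5, 15.2, 11.6, 8.4],
    "BART":        [12.3, 7.2, 8.5, 6.1, 4.2],
    "T5":          [10.8, 6.5, 7.9, 5.4, 3.6],
}
baseline = sum(rates["Transformer"])
for model in ("BART", "T5"):
    reduction = 1 - sum(rates[model]) / baseline
    print(f"{model}: ~{reduction:.0%} fewer errors")  # BART ~45%, T5 ~51%
```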
Error Types That Persist
Some errors stubbornly resist improvement. Confusing similar entities remains hard because models struggle with fine-grained distinctions in structured data. Certain logical errors persist because Seq2Seq models don't truly reason about data relationships. These persistent challenges point toward fundamental limitations of the paradigm itself.
Key Findings
- Copy Mechanisms: Help with omission and extrinsic inaccuracy, but increase addition errors -- trading one problem for another
- Pre-training Power: Dramatically reduces errors; training strategy and model size matter more than architecture details
- Dataset Structure: The structure of training data profoundly affects which errors models make
- Persistent Challenges: Entity confusion and logical errors remain difficult regardless of model sophistication
Practical Implications
- For Practitioners: Choose models based on which errors matter most for your application
- For Researchers: Target specific error types rather than optimizing aggregate metrics
- For Evaluation: Fine-grained error analysis reveals what leaderboards hide
- For the Field: Current approaches have systematic limitations that new paradigms may need to address
Seeing Past the Scores
This work advocates for a different way of evaluating generation systems. Not "how good is this model?" but "what does this model do well, and what does it do poorly?" The answers are specific, actionable, and often surprising. High-scoring models can have serious blind spots; lower-scoring models might excel at specific subtasks.
Progress in data-to-text generation requires understanding failure, not just celebrating success. By mapping the error landscape, we chart a course toward models that don't just score well, but actually work.