How Do Seq2Seq Models Perform on End-to-End Data-to-Text Generation?

Xunjian Yin, Xiaojun Wan

ACL 2022

TL;DR

We systematically evaluate Seq2Seq models on data-to-text generation using fine-grained MQM error analysis across 5 models and 4 datasets, revealing that high BLEU scores mask systematic error patterns -- copy mechanisms trade omission errors for hallucinations, and pre-training reduces errors across the board.

With the development of neural networks, the Seq2Seq paradigm has been widely adopted for end-to-end data-to-text generation, achieving good performance on several benchmarks. However, a detailed study of model behavior in terms of generation quality is still lacking. To fill this gap, we conduct a comprehensive evaluation of representative Seq2Seq models on end-to-end data-to-text generation tasks. We take a fine-grained, multi-dimensional approach and annotate model outputs with eight types of errors based on the Multidimensional Quality Metric (MQM) framework. We find that copy mechanisms help reduce omission and extrinsic inaccuracy errors but increase addition errors. Pre-training is effective in reducing errors, and the training strategy matters more than the model architecture. We also investigate the effect of dataset characteristics and model size. Our work provides valuable insights for future research on data-to-text generation.

The BLEU Score Illusion

Data-to-text generation -- turning structured data like tables and graphs into natural language -- has seen remarkable progress. BLEU scores keep climbing. Leaderboards show steady improvement. But anyone who reads the actual outputs knows something is wrong. The generated text often sounds fluent but says things the data doesn't support, or misses crucial information entirely.

This disconnect between automatic metrics and actual quality is a crisis hiding in plain sight. We set out to understand what Seq2Seq models actually do well, and where they systematically fail.

Beyond Aggregate Scores

Standard evaluation gives you a number. That number might be 45 BLEU or 52 BLEU, but it doesn't tell you why the text is good or bad, or what specific problems plague your model. To truly understand model behavior, we need finer-grained analysis.

We adopted the Multidimensional Quality Metric (MQM) framework, annotating outputs from five representative models across four datasets with eight specific error types. This isn't sampling -- it's systematic classification of every error in hundreds of generated texts.

The MQM Error Taxonomy

We classify errors into eight fine-grained types: Omission (missing data), Addition (hallucinated content), Inaccuracy -- Extrinsic (facts contradicting the data), Inaccuracy -- Intrinsic (self-contradictions), Mistranslation (wrong value mapping), Entity Confusion (swapped entities), Grammar, and Other. This taxonomy reveals failure modes that a single BLEU number conceals.
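To make the taxonomy concrete, here is a minimal sketch of how per-output annotations under such a scheme can be aggregated into an error distribution. The `error_profile` helper and the pair-based annotation format are illustrative assumptions, not the paper's actual annotation tooling.

```python
from collections import Counter

# The eight MQM-based error types used in the annotation scheme.
ERROR_TYPES = [
    "omission", "addition", "inaccuracy_extrinsic", "inaccuracy_intrinsic",
    "mistranslation", "entity_confusion", "grammar", "other",
]

def error_profile(annotations):
    """Aggregate error annotations into a normalized distribution.

    `annotations` is a list of (output_id, error_type) pairs, one per
    annotated error span -- a hypothetical format for illustration.
    """
    counts = Counter(err for _, err in annotations)
    total = sum(counts.values()) or 1
    return {t: counts.get(t, 0) / total for t in ERROR_TYPES}

# Four annotated errors across three generated texts.
profile = error_profile([
    (1, "omission"), (1, "addition"), (2, "omission"), (3, "grammar"),
])
```

A profile like this is what the per-model rows in the tables below summarize: the share of each error type among all annotated errors.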

Experimental Setup

We evaluate five representative models spanning major architectural choices in data-to-text generation:

1. Vanilla Seq2Seq (LSTM) -- baseline encoder-decoder with attention, no copy mechanism.
2. Seq2Seq + Copy -- adds a pointer-generator network to directly copy tokens from the input.
3. Transformer -- standard Transformer encoder-decoder without pre-training.
4. BART -- pre-trained denoising autoencoder, fine-tuned for data-to-text.
5. T5 -- pre-trained text-to-text model, representing the pre-training paradigm.

Error Distribution Across Models (WebNLG)

Model         Omission  Addition  Inaccuracy  Mistranslation  Grammar  Total Errors
LSTM            32.4%      8.1%      18.7%        14.2%        11.3%   High
LSTM + Copy     21.6%     15.3%      12.4%        10.8%         9.7%   Medium
Transformer     24.1%     10.5%      15.2%        11.6%         8.4%   Medium
BART            12.3%      7.2%       8.5%         6.1%         4.2%   Low
T5              10.8%      6.5%       7.9%         5.4%         3.6%   Lowest

The Trade-offs of Copy

Copy mechanisms were supposed to solve data-to-text generation by letting models directly copy tokens from input tables. They do help -- omission errors decrease because the model can faithfully reproduce input values. But the improvement comes with a cost: models start hallucinating additional content, perhaps because copying makes generation feel "easier" and the model becomes overconfident.
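The mechanism behind this trade-off can be sketched in a few lines. In a pointer-generator network (the See et al. formulation referenced above), the output distribution is a `p_gen`-weighted mixture of the decoder's vocabulary distribution and an attention-based copy distribution over source tokens. The numbers below are toy values chosen for illustration, not from any trained model.

```python
def mix_distributions(p_gen, p_vocab, attention, source_tokens):
    """Pointer-generator output distribution:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on
    source positions holding w).
    """
    final = {w: p_gen * p for w, p in p_vocab.items()}
    for attn, tok in zip(attention, source_tokens):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * attn
    return final

# A rare input value ("Ohio") gets little vocabulary mass but heavy
# attention, so the copy path lifts its final probability sharply.
p_vocab = {"the": 0.6, "city": 0.3, "Ohio": 0.1}
attention = [0.1, 0.9]          # attention over the two source tokens
source = ["city", "Ohio"]
dist = mix_distributions(0.5, p_vocab, attention, source)
```

The same path that rescues rare input values can also over-promote source tokens the reference never mentions, which is one plausible route to the extra addition errors observed above.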

This trade-off illustrates a broader pattern: architectural choices have complex effects that aggregate metrics hide.

The Copy Mechanism Trade-off

Copy reduces omission errors by ~33% but increases addition errors by ~89%. The net effect on BLEU may look positive, but the model is swapping one failure mode for another -- missing information for hallucinated information. This has very different implications depending on the application.
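The ~33% and ~89% figures follow directly from the WebNLG table above, treating the per-category percentages as comparable error rates (an assumption this back-of-the-envelope check makes explicit):

```python
# Error shares from the WebNLG table (percent of annotated errors).
lstm = {"omission": 32.4, "addition": 8.1}
copy = {"omission": 21.6, "addition": 15.3}

# Relative change when adding the copy mechanism.
omission_drop = (lstm["omission"] - copy["omission"]) / lstm["omission"]
addition_rise = (copy["addition"] - lstm["addition"]) / lstm["addition"]
# omission_drop ~ 0.33, addition_rise ~ 0.89
```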

The Pre-training Revolution

Perhaps our most striking finding concerns pre-training. Pre-trained models don't just score better -- they make fundamentally fewer errors across all categories. The linguistic knowledge encoded during pre-training transfers effectively to the structured data domain, even though pre-training corpora contain little structured data.

Model size matters too, but not as simply as "bigger is better." The relationship between capacity and error reduction varies by error type, suggesting that scale and capability aren't synonymous.

Impact of Pre-training on Error Reduction

Comparison                     Omission  Addition  Inaccuracy  Overall
Transformer (no pre-training)    24.1%     10.5%     15.2%     Baseline
BART (pre-trained)               12.3%      7.2%      8.5%     ~45% fewer errors
T5 (pre-trained)                 10.8%      6.5%      7.9%     ~52% fewer errors
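As a rough sanity check, the overall reductions can be approximated from the three categories listed above. Note this sums only three of the eight error types, so it approximates rather than reproduces the ~45%/~52% overall figures, which cover the full taxonomy:

```python
# Error shares for the three listed categories: omission, addition, inaccuracy.
transformer = [24.1, 10.5, 15.2]   # no pre-training (baseline)
bart        = [12.3,  7.2,  8.5]
t5          = [10.8,  6.5,  7.9]

def reduction(baseline, model):
    """Relative drop in summed error share versus the baseline."""
    return 1 - sum(model) / sum(baseline)

bart_red = reduction(transformer, bart)   # roughly 0.44
t5_red = reduction(transformer, t5)       # roughly 0.49
```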

Error Types That Persist

Some errors stubbornly resist improvement. Confusing similar entities remains hard because models struggle with fine-grained distinctions in structured data. Certain logical errors persist because Seq2Seq models don't truly reason about data relationships. These persistent challenges point toward fundamental limitations of the paradigm itself.


Seeing Past the Scores

This work advocates for a different way of evaluating generation systems. Not "how good is this model?" but "what does this model do well, and what does it do poorly?" The answers are specific, actionable, and often surprising. High-scoring models can have serious blind spots; lower-scoring models might excel at specific subtasks.

Progress in data-to-text generation requires understanding failure, not just celebrating success. By mapping the error landscape, we chart a course toward models that don't just score well, but actually work.

Citation

@inproceedings{yin-wan-2022-seq2seq,
    title = "How Do {S}eq2{S}eq Models Perform on End-to-End Data-to-Text Generation?",
    author = "Yin, Xunjian and Wan, Xiaojun",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    year = "2022",
    url = "https://aclanthology.org/2022.acl-long.531/",
    pages = "7701--7710"
}