Improving Contextual Modeling for Omission and Coreference Resolution in Chinese ASR
COLING 2024
We tackle two pervasive error types in Chinese ASR -- omission and coreference errors -- by introducing discourse-aware contextual modeling that looks beyond individual utterances to recover missing content and resolve ambiguous references, significantly improving transcript coherence for downstream NLP tasks.
Automatic speech recognition (ASR) systems have achieved remarkable word-level accuracy, yet the resulting transcripts often suffer from discourse-level errors that hinder downstream understanding. In Chinese, two error types are especially pervasive: omission errors, where contextually implied subjects and objects are dropped due to pro-drop constructions and ellipsis, and coreference errors, where pronouns and references become ambiguous without prosodic cues. We propose contextual modeling approaches that maintain discourse state across utterances, tracking entities as they are introduced, following coreference chains, and identifying semantic gaps caused by ellipsis. Our methods significantly reduce both error types and produce more coherent transcripts that benefit summarization, question answering, and information extraction.
When Speech Becomes Text
Automatic speech recognition has become remarkably accurate at transcribing words. But spoken language is not just words -- it is a continuous stream of meaning where speakers drop implied subjects, use pronouns freely, and rely on context that listeners naturally fill in. When ASR produces a transcript, much of this contextual richness is lost.
Consider a conversation in Chinese: "小明去了商店。买了苹果。" ("Xiaoming went to the store. [He] bought apples.") The second sentence has no explicit subject -- Chinese allows this naturally, with listeners inferring "he" from context. But the ASR transcript loses this connection, producing text that reads as fragmentary to downstream systems.
Two Pervasive Error Types
We focus on two specific error types that plague Chinese ASR. Omission errors occur when contextually implied content is not explicitly represented: subjects dropped in pro-drop constructions, and ellipsis that spoken language handles naturally but written text does not. Coreference errors arise when pronouns and references become ambiguous without the prosodic cues of speech. Both require understanding context that extends beyond the immediate utterance.
Our Approach
The solution requires looking beyond individual utterances. We developed contextual modeling approaches that maintain discourse state across multiple turns of conversation.
Entity Tracking
Track entities as they are introduced in the discourse, building a running inventory of who and what has been mentioned.
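As a minimal sketch of this idea (not the paper's implementation -- the class and matching heuristic here are illustrative), a discourse-state tracker can keep a running inventory of entities, merging new mentions into existing entries when the surface form matches:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """One tracked discourse entity with all of its surface mentions."""
    canonical: str  # first mention, e.g. "小明"
    mentions: list = field(default_factory=list)  # (utterance_index, surface form)

class DiscourseState:
    """Running inventory of who and what has been mentioned so far."""
    def __init__(self):
        self.entities: list[Entity] = []

    def observe(self, utt_idx: int, mention: str) -> Entity:
        # Reuse an existing entity when the surface form is already known;
        # otherwise introduce a new one. (Real systems use richer matching.)
        for ent in self.entities:
            if mention == ent.canonical or mention in (m for _, m in ent.mentions):
                ent.mentions.append((utt_idx, mention))
                return ent
        ent = Entity(canonical=mention, mentions=[(utt_idx, mention)])
        self.entities.append(ent)
        return ent

state = DiscourseState()
state.observe(0, "小明")  # utterance 0 introduces 小明
state.observe(1, "小明")  # utterance 1 mentions the same entity again
print([e.canonical for e in state.entities])  # ['小明']
```

Exact string matching is of course a stand-in; the point is that the inventory persists across utterances rather than being rebuilt per sentence.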
Coreference Chain Resolution
Follow coreference chains across utterances, linking pronouns and references back to their antecedents using discourse structure.
Omission Detection & Recovery
Identify semantic gaps caused by ellipsis and pro-drop, then infer from discourse context what the speaker intended and generate appropriate completions.
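The detect-then-recover loop can be sketched with a toy heuristic (the verb list and salience rule below are illustrative assumptions, not the paper's method): flag an utterance that begins directly with a verb, then fill the gap with the most salient entity from the discourse state.

```python
# Toy stand-in for a parser's "no subject before the main verb" signal.
VERBS = ("买", "去", "吃", "看")

def missing_subject(utterance: str) -> bool:
    """Flag utterances that start directly with a verb (toy pro-drop detector)."""
    return utterance.startswith(VERBS)

def recover(utterance: str, salient_entity: str) -> str:
    """Restore a dropped subject using the most salient discourse entity."""
    if missing_subject(utterance):
        return salient_entity + utterance
    return utterance

utts = ["小明去了商店。", "买了苹果。"]
salient = "小明"  # last subject observed in the discourse
restored = [recover(u, salient) for u in utts]
print(restored)  # ['小明去了商店。', '小明买了苹果。']
```

In practice the detector would come from syntactic or semantic analysis and the filler from the tracked entity inventory, but the control flow -- detect a gap, consult discourse state, generate a completion -- is the same.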
Filling in the Gaps
Detecting omissions is only half the challenge -- we also need to recover what is missing. This requires inferring from discourse context what the speaker intended, then generating appropriate completions that restore the full meaning.
For coreference, the task is disambiguation: when a transcript says "他" (he), which "he" does it mean? The surrounding context usually makes this clear to human readers, but extracting this clarity algorithmically requires modeling the discourse structure explicitly.
Why Chinese Is Especially Challenging
Chinese is a pro-drop language with topic-comment sentence structure, making subject omission natural and extremely frequent. Unlike English, where subjects must almost always be overtly expressed, Chinese omissions are invisible at the surface level -- only discourse context reveals what is missing. This makes contextual modeling not just helpful but essential.
Results
Our experiments demonstrate that contextual modeling significantly reduces both error types. More importantly, the resulting transcripts are more useful for downstream applications.
Impact of Contextual Modeling
| Aspect | Without Context | With Context (Ours) | Improvement |
|---|---|---|---|
| Omission Error Rate | High | Significantly Reduced | Large Decrease |
| Coreference Error Rate | High | Significantly Reduced | Large Decrease |
| Downstream Task Quality | Baseline | Enhanced | Consistent Gains |
Key Contributions
- Error Taxonomy: Systematic analysis of omission and coreference patterns specific to Chinese ASR output
- Contextual Methods: Discourse-aware modeling that maintains state across utterances for error detection and recovery
- Omission Recovery: Techniques to identify and restore contextually implied content dropped by pro-drop and ellipsis
- Coreference Resolution: Disambiguation of pronominal and nominal references in transcribed speech
Key Results
- Error Reduction: Significant decrease in both omission and coreference errors across evaluation sets
- Discourse Effectiveness: Cross-utterance context proves crucial for both error types
- Downstream Benefit: Improved transcript quality enhances summarization, QA, and information extraction
- Language Sensitivity: Approach tailored to Chinese linguistic characteristics generalizes across domains
From Speech to Understanding
Perfect word-level transcription does not equal perfect understanding. The gap between acoustic accuracy and semantic completeness requires bridging -- understanding not just what was said, but what was meant in context. This work takes a step toward ASR systems that produce not just transcripts, but coherent representations of spoken discourse.
The approach is particularly valuable for Chinese, where pro-drop and topic-comment structure make omission natural and frequent, but the principles extend to any language where spoken conventions differ from written expectations.