Improving Contextual Modeling for Omission and Coreference Resolution in Chinese ASR

Xiaoyu Liu, Xunjian Yin, Xiaojun Wan

LREC-COLING 2024

TL;DR

We tackle two pervasive error types in Chinese ASR -- omission errors and coreference errors -- with discourse-aware contextual modeling that looks beyond individual utterances to recover missing content and resolve ambiguous references, producing significantly more coherent transcripts for downstream NLP tasks.

Automatic speech recognition (ASR) systems have achieved remarkable word-level accuracy, yet the resulting transcripts often suffer from discourse-level errors that hinder downstream understanding. In Chinese, two error types are especially pervasive: omission errors, where contextually implied subjects and objects are dropped due to pro-drop constructions and ellipsis, and coreference errors, where pronouns and references become ambiguous without prosodic cues. We propose contextual modeling approaches that maintain discourse state across utterances, tracking entities as they are introduced, following coreference chains, and identifying semantic gaps caused by ellipsis. Our methods significantly reduce both error types and produce more coherent transcripts that benefit summarization, question answering, and information extraction.

When Speech Becomes Text

Automatic speech recognition has become remarkably accurate at transcribing words. But spoken language is not just words -- it is a continuous stream of meaning where speakers drop implied subjects, use pronouns freely, and rely on context that listeners naturally fill in. When ASR produces a transcript, much of this contextual richness is lost.

Consider a conversation in Chinese: "小明去了商店。买了苹果。" ("Xiaoming went to the store. [He] bought apples."). The second sentence has no explicit subject -- Chinese allows this naturally, with listeners inferring "he" from context. But the ASR transcript loses this connection, producing text that reads as fragmentary to downstream systems.

Two Pervasive Error Types

We focus on two specific error types that plague Chinese ASR. Omission errors occur when contextually implied content is missing from the transcript -- subjects dropped in pro-drop constructions, objects elided in ways spoken language handles naturally. Coreference errors arise when pronouns and references become ambiguous without the prosodic cues of speech. Both require understanding context that extends beyond the immediate utterance.

Our Approach

The solution requires looking beyond individual utterances. We developed contextual modeling approaches that maintain discourse state across multiple turns of conversation.

1. Entity Tracking -- Track entities as they are introduced in the discourse, building a running inventory of who and what has been mentioned.

2. Coreference Chain Resolution -- Follow coreference chains across utterances, linking pronouns and references back to their antecedents using discourse structure.

3. Omission Detection & Recovery -- Identify semantic gaps caused by ellipsis and pro-drop, then infer from discourse context what the speaker intended and generate appropriate completions.
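The discourse state behind these steps can be pictured as a running record of mentions plus the most recent explicit subject. The sketch below is our own minimal illustration, not the paper's model: the `DiscourseState` class, its `observe`/`recover_subject` methods, and the convention of passing `subject=None` for a pro-drop utterance are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DiscourseState:
    """Toy discourse state: a mention inventory plus the last explicit subject."""
    entities: list = field(default_factory=list)   # mentions, oldest first
    last_subject: Optional[str] = None             # candidate antecedent for dropped subjects

    def observe(self, subject=None, others=()):
        """Record one utterance's mentions; subject=None marks a pro-drop."""
        for mention in (subject, *others):
            if mention is not None and mention not in self.entities:
                self.entities.append(mention)
        if subject is not None:
            self.last_subject = subject

    def recover_subject(self):
        """Fill an omitted subject with the most recent explicit one."""
        return self.last_subject

state = DiscourseState()
state.observe(subject="小明", others=("商店",))   # 小明去了商店。
state.observe(subject=None, others=("苹果",))     # 买了苹果。(subject dropped)
print(state.recover_subject())  # 小明
```

In the running example, the tracker recovers "小明" as the implied subject of the second sentence; a real system would of course use learned models rather than this last-subject heuristic.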

Filling in the Gaps

Detecting omissions is only half the challenge -- we also need to recover what is missing. This requires inferring from discourse context what the speaker intended, then generating appropriate completions that restore the full meaning.
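As a toy illustration of the detect-then-complete idea, one can flag an utterance as subject-less when it opens with no previously tracked entity and prepend the recovered subject. The `complete` function, the entity set, and the `startswith` gap check below are all simplifying assumptions for this sketch -- the paper's system infers completions from discourse context, not from string prefixes.

```python
def complete(utterance, known_entities, recovered_subject):
    """Prepend a recovered subject when the utterance names no explicit one.

    Toy gap detection: treat the subject as omitted if the utterance does
    not begin with any previously tracked entity (a stand-in for parsing).
    """
    if any(utterance.startswith(e) for e in known_entities):
        return utterance
    return recovered_subject + utterance

print(complete("买了苹果。", {"小明", "商店"}, "小明"))   # 小明买了苹果。
print(complete("小明去了商店。", {"小明", "商店"}, "小明"))  # unchanged
```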

For coreference, the task is disambiguation: when a transcript says "他" (he), which "he" does it mean? The surrounding context usually makes this clear to human readers, but extracting this clarity algorithmically requires modeling the discourse structure explicitly.
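A crude version of this disambiguation can be sketched as recency-plus-compatibility search over prior mentions. The heuristic below is our illustration, not the paper's method: it exploits the fact that written Chinese third-person pronouns encode gender (他 masc., 她 fem., 它 neuter) and prefers the most recent compatible antecedent; the `(name, gender)` mention format is an assumption of this example.

```python
def resolve_pronoun(pronoun, mentions):
    """Pick an antecedent for a written-Chinese pronoun from prior mentions.

    mentions: list of (name, gender) pairs, oldest first.
    Heuristic: filter by the gender the pronoun encodes, then take the
    most recent compatible mention.
    """
    gender = {"他": "m", "她": "f", "它": "n"}.get(pronoun)
    for name, g in reversed(mentions):
        if gender is None or g == gender:
            return name
    return None

mentions = [("小明", "m"), ("小红", "f")]
print(resolve_pronoun("他", mentions))  # 小明
print(resolve_pronoun("她", mentions))  # 小红
```

Note that in speech 他/她/它 are homophonous (tā), so the ASR system must resolve the reference from discourse context before it can even choose the written form -- exactly the kind of ambiguity the paper targets.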

Why Chinese Is Especially Challenging

Chinese is a pro-drop language with topic-comment sentence structure, making subject omission natural and extremely frequent. Unlike English, where a missing subject usually renders a sentence ungrammatical and is therefore easy to detect, Chinese omissions are invisible at the surface level -- only discourse context reveals what is missing. This makes contextual modeling not just helpful but essential.

Results

Our experiments demonstrate that contextual modeling significantly reduces both error types. More importantly, the resulting transcripts are more useful for downstream applications.

Impact of Contextual Modeling

Aspect                  | Without Context | With Context (Ours)   | Improvement
------------------------|-----------------|-----------------------|-----------------
Omission Error Rate     | High            | Significantly Reduced | Large Decrease
Coreference Error Rate  | High            | Significantly Reduced | Large Decrease
Downstream Task Quality | Baseline        | Enhanced              | Consistent Gains

From Speech to Understanding

Perfect word-level transcription does not equal perfect understanding. The gap between acoustic accuracy and semantic completeness requires bridging -- understanding not just what was said, but what was meant in context. This work takes a step toward ASR systems that produce not just transcripts, but coherent representations of spoken discourse.

The approach is particularly valuable for Chinese, where pro-drop and topic-comment structure make omission natural and frequent, but the principles extend to any language where spoken conventions differ from written expectations.

Citation

@inproceedings{liu-etal-2024-improving,
    title = "Improving Contextual Modeling for Omission and Coreference Resolution in Chinese ASR",
    author = "Liu, Xiaoyu and Yin, Xunjian and Wan, Xiaojun",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation",
    year = "2024",
    url = "https://aclanthology.org/2024.lrec-main.1301/",
    pages = "14950--14959"
}