Error-Robust Retrieval for Chinese Spelling Check

Xunjian Yin, Xinyu Hu, Xiaojun Wan

LREC-COLING 2024

TL;DR

Standard retrieval fails when queries contain spelling errors because misspelled text has different semantics. We train error-robust embeddings via contrastive learning so that correct and misspelled text map to similar representations, enabling accurate retrieval-augmented Chinese spelling correction.

Retrieval-augmented methods have shown promising results in Chinese Spelling Check (CSC) by finding relevant examples to guide correction. However, existing retrieval approaches suffer from a fundamental limitation: when the query itself contains spelling errors, standard semantic embeddings capture the erroneous meaning rather than the intended one, leading to irrelevant retrievals. We propose error-robust retrieval, a contrastive learning approach that trains text encoders to produce similar representations for both correct and misspelled text. By learning embeddings invariant to common Chinese spelling error patterns -- phonetic confusions, visual similarities, and common typos -- our method retrieves relevant correction examples even from broken queries. Experiments show consistent improvements across multiple CSC approaches including LLM-based, neural, and traditional methods.

The Paradox of Broken Queries

Retrieval-augmented methods have transformed how we approach many NLP tasks. Need to correct a spelling error? Find similar examples from a database and use them to guide the correction. The approach is elegant and effective -- except for one fundamental problem.

When your query itself contains spelling errors, standard retrieval breaks down. The misspelled text has different semantics from what the user intended. "I live in Pairs" retrieves documents about mathematical pairs, not the city the user meant. The very errors we're trying to fix sabotage our ability to find relevant examples.

The Semantic Drift Problem

Chinese spelling errors are particularly challenging because a single wrong character can completely change meaning. The characters for "horse" (马) and "mother" (妈) differ only by a single radical and are near-homophones, but their semantics are worlds apart. Standard embedding models faithfully capture these semantic differences -- which is exactly what we don't want when the "difference" is actually an error.

Learning to See Through Errors

We developed error-robust retrieval: embeddings that map both correct and misspelled text to similar representations. The key insight is training the encoder to recognize that "I live in Pairs" and "I live in Paris" should have nearly identical representations, despite their surface difference.

This isn't about ignoring spelling -- the model still distinguishes genuine semantic differences. Rather, it learns to be invariant to the specific patterns of errors that occur in Chinese spelling: phonetic confusions, visual similarities, and common typos.

1. Error Pattern Collection

Collect common Chinese spelling error patterns -- phonetic confusions, visually similar characters, and frequent typos -- to build training pairs.

2. Contrastive Training

Train text encoders with contrastive learning to produce similar embeddings for correct-misspelled text pairs while maintaining discriminative power for genuinely different semantics.

3. Error-Robust Retrieval

Use the trained encoder to retrieve relevant correction examples from a database, even when the query contains spelling errors that would mislead standard retrievers.

4. Augmented Correction

Feed the retrieved examples into downstream CSC models -- LLMs, neural models, or traditional methods -- as in-context demonstrations or candidate corrections.
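The first step above, building correct-misspelled training pairs, can be sketched with a toy corruption routine. The confusion sets below are illustrative stand-ins for the real phonetic and visual error patterns collected in the paper:

```python
import random

# Toy confusion sets (illustrative only; the actual method builds these
# from real phonetic-confusion and visual-similarity error patterns).
PHONETIC = {"在": ["再"], "的": ["得", "地"]}
VISUAL = {"人": ["入"], "未": ["末"]}

def corrupt(sentence, rate=0.3, rng=random):
    """Build a misspelled variant of a correct sentence by swapping
    characters with confusable counterparts at the given rate."""
    chars = list(sentence)
    for i, ch in enumerate(chars):
        candidates = PHONETIC.get(ch, []) + VISUAL.get(ch, [])
        if candidates and rng.random() < rate:
            chars[i] = rng.choice(candidates)
    return "".join(chars)

# Each (correct, corrupted) pair becomes a positive pair for training.
correct = "我在家的门口"
pair = (correct, corrupt(correct))
```

Characters without confusable counterparts pass through unchanged, so the corrupted sentence keeps the same length, mirroring the substitution-only nature of typical Chinese spelling errors.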

Error-Invariant Representations

Our approach trains embeddings using contrastive learning on pairs of correct and corrupted text. The model learns that certain differences are noise (spelling errors) while others are signal (genuine semantic differences). This creates a representation space where retrieval works correctly even when queries are broken.

Better Examples, Better Corrections

With error-robust retrieval, we can find relevant examples even when searching with misspelled queries. These examples then serve multiple purposes: they provide in-context demonstrations for LLMs, candidate corrections for edit-based models, and statistical patterns for frequency-based methods.

The improvement is consistent across different CSC approaches. Whatever correction method you use, it works better when the retrieved examples are actually relevant to the intended meaning rather than the erroneous surface form.
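To see why invariance helps at retrieval time, here is a deliberately tiny stand-in for the trained encoder: it canonicalizes confusable characters before a bag-of-characters embedding. A real error-robust encoder is a trained neural model; this toy version only illustrates the retrieval behavior:

```python
import numpy as np

# Toy "error-robust" encoder: maps confusable characters to a canonical
# form, so correct and misspelled text land on the same vector. This is
# a hypothetical stand-in, not the paper's trained encoder.
CANONICAL = {"再": "在", "得": "的", "地": "的", "入": "人"}
VOCAB = list("我在家的门口人吃饭了")

def encode(text):
    text = "".join(CANONICAL.get(ch, ch) for ch in text)
    vec = np.array([text.count(ch) for ch in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query, database, k=1):
    """Return the k database sentences most similar to the query."""
    q = encode(query)
    sims = [float(q @ encode(doc)) for doc in database]
    order = np.argsort(sims)[::-1]
    return [database[i] for i in order[:k]]

db = ["我在家", "人吃饭了"]
# The misspelled query 我再家 still retrieves the intended 我在家,
# because the encoder is invariant to the 在/再 confusion.
retrieve("我再家", db)
```

A standard encoder would embed 再 ("again") far from 在 ("at"), letting the error drag retrieval toward irrelevant examples; the invariant encoder keeps the intended neighbors on top.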

Improvement Across CSC Methods

Metric                         | Standard Retrieval | Error-Robust Retrieval
------------------------------ | ------------------ | ----------------------
Retrieval Precision            | Baseline           | Substantially higher
LLM-based CSC                  | Baseline           | Improved
Neural CSC                     | Baseline           | Improved
Overall Detection & Correction | Baseline           | Consistent gains


Retrieval That Understands Intent

The broader lesson extends beyond spelling correction. Many retrieval scenarios involve queries that don't perfectly express user intent -- whether through errors, imprecision, or incomplete information. Building retrieval systems that understand intent despite surface imperfections is crucial for real-world robustness.

Error-robust retrieval demonstrates that we can train embeddings to distinguish signal from noise, capturing intended meaning even when the literal text points elsewhere.

Citation

@inproceedings{yin-etal-2024-error,
    title = "Error-Robust Retrieval for Chinese Spelling Check",
    author = "Yin, Xunjian and Hu, Xinyu and Wan, Xiaojun",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation",
    year = "2024",
    url = "https://aclanthology.org/2024.lrec-main.1057/",
    pages = "12086--12096"
}