Error-Robust Retrieval for Chinese Spelling Check
COLING 2024
Standard retrieval fails when queries contain spelling errors because misspelled text has different semantics. We train error-robust embeddings via contrastive learning so that correct and misspelled text map to similar representations, enabling accurate retrieval-augmented Chinese spelling correction.
Retrieval-augmented methods have shown promising results in Chinese Spelling Check (CSC) by finding relevant examples to guide correction. However, existing retrieval approaches suffer from a fundamental limitation: when the query itself contains spelling errors, standard semantic embeddings capture the erroneous meaning rather than the intended one, leading to irrelevant retrievals. We propose error-robust retrieval, a contrastive learning approach that trains text encoders to produce similar representations for both correct and misspelled text. By learning embeddings invariant to common Chinese spelling error patterns -- phonetic confusions, visual similarities, and common typos -- our method retrieves relevant correction examples even when the query itself is broken. Experiments show consistent improvements across multiple CSC approaches, including LLM-based, neural, and traditional methods.
The Paradox of Broken Queries
Retrieval-augmented methods have transformed how we approach many NLP tasks. Need to correct a spelling error? Find similar examples from a database and use them to guide the correction. The approach is elegant and effective -- except for one fundamental problem.
When your query itself contains spelling errors, standard retrieval breaks down. The misspelled text has different semantics from what the user intended. "I live in Pairs" retrieves documents about mathematical pairs, not the city the user meant. The very errors we're trying to fix sabotage our ability to find relevant examples.
The Semantic Drift Problem
Chinese spelling errors are particularly challenging because a single wrong character can completely change meaning. The characters for "horse" (马, mǎ) and "mother" (妈, mā) sound nearly identical and differ only by a radical, yet their semantics are worlds apart. Standard embedding models faithfully capture these semantic differences -- which is exactly what we don't want when the "difference" is actually an error.
Learning to See Through Errors
We developed error-robust retrieval: embeddings that map both correct and misspelled text to similar representations. The key insight is training the encoder to recognize that "I live in Pairs" and "I live in Paris" should have nearly identical representations, despite their surface difference.
This isn't about ignoring spelling -- the model still distinguishes genuine semantic differences. Rather, it learns to be invariant to the specific patterns of errors that occur in Chinese spelling: phonetic confusions, visual similarities, and common typos.
Error Pattern Collection
Collect common Chinese spelling error patterns including phonetic confusions, visually similar characters, and frequent typos to build training pairs.
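As a minimal sketch, pair construction might look like the following. The confusion sets here are toy placeholders standing in for real pinyin tables, glyph-similarity resources, and typo logs, and `corrupt` is a hypothetical helper rather than the paper's implementation:

```python
import random

# Toy confusion sets for illustration only; a real system would derive these
# from pinyin tables (phonetic), glyph-similarity resources (visual), and
# observed typo logs.
PHONETIC_CONFUSIONS = {"巴": ["把", "吧"], "在": ["再"]}
VISUAL_CONFUSIONS = {"未": ["末"], "己": ["已"]}

def corrupt(sentence: str, rate: float = 0.15) -> str:
    """Swap characters for confusable ones to synthesize a misspelled twin."""
    chars = list(sentence)
    for i, ch in enumerate(chars):
        candidates = PHONETIC_CONFUSIONS.get(ch, []) + VISUAL_CONFUSIONS.get(ch, [])
        if candidates and random.random() < rate:
            chars[i] = random.choice(candidates)
    return "".join(chars)

# Each (correct, corrupted) pair becomes a positive pair for contrastive training.
sentence = "我住在巴黎"  # "I live in Paris"
pairs = [(sentence, corrupt(sentence)) for _ in range(4)]
```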
Contrastive Training
Train text encoders with contrastive learning to produce similar embeddings for correct-misspelled text pairs while maintaining discriminative power for genuinely different semantics.
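One standard way to realize this is an in-batch InfoNCE objective. The sketch below assumes PyTorch and a single shared encoder that embeds both the clean sentences and their corrupted twins; the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(clean_emb: torch.Tensor, noisy_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of the similarity matrix should peak
    at column i, i.e. each clean sentence matches its own corrupted twin and
    repels every other sentence in the batch."""
    clean = F.normalize(clean_emb, dim=-1)
    noisy = F.normalize(noisy_emb, dim=-1)
    logits = clean @ noisy.t() / temperature              # (B, B) cosine sims
    targets = torch.arange(clean.size(0), device=clean.device)
    return F.cross_entropy(logits, targets)

# clean_emb and noisy_emb would come from the same encoder applied to correct
# sentences and their synthesized misspellings, respectively.
```

Because the other sentences in the batch act as negatives, the encoder keeps its discriminative power for genuinely different meanings while collapsing the clean/corrupted gap.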
Error-Robust Retrieval
Use the trained encoder to retrieve relevant correction examples from a database, even when the query contains spelling errors that would mislead standard retrievers.
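Retrieval itself can then be plain nearest-neighbor search in the learned space. The brute-force sketch below assumes pre-normalized NumPy embeddings; a production system would likely swap in an approximate index such as FAISS:

```python
import numpy as np

def retrieve(query_emb: np.ndarray, index_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest correction examples.

    index_embs: (N, d) error-robust embeddings of the example database,
    L2-normalized ahead of time so the dot product equals cosine similarity.
    """
    q = query_emb / np.linalg.norm(query_emb)
    scores = index_embs @ q
    return np.argsort(-scores)[:k]
```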
Augmented Correction
Feed the retrieved examples into downstream CSC models -- LLMs, neural models, or traditional methods -- as in-context demonstrations or candidate corrections.
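For the LLM case, one plausible prompt format (not necessarily the paper's template) simply lays out the retrieved pairs as few-shot demonstrations:

```python
def build_csc_prompt(query: str, examples: list[tuple[str, str]]) -> str:
    """Lay out retrieved (misspelled, corrected) pairs as few-shot demos."""
    lines = ["Correct the spelling errors in the Chinese sentence."]
    for wrong, right in examples:
        lines.append(f"Input: {wrong}\nOutput: {right}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)
```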
Error-Invariant Representations
Our approach trains embeddings using contrastive learning on pairs of correct and corrupted text. The model learns that certain differences are noise (spelling errors) while others are signal (genuine semantic differences). This creates a representation space where retrieval works correctly even when queries are broken.
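A quick sanity check of this property is to compare cosine similarities. In the sketch below, `encode` is a placeholder for the trained error-robust encoder, and the comments show the qualitative ordering we expect:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# `encode` stands in for the trained error-robust encoder. After training we
# expect, qualitatively:
#   cosine(encode("我住在巴黎"), encode("我住在把黎"))  -> high (error = noise)
#   cosine(encode("我住在巴黎"), encode("我喜欢数学"))  -> low  (meaning = signal)
```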
Better Examples, Better Corrections
With error-robust retrieval, we can find relevant examples even when searching with misspelled queries. These examples then serve multiple purposes: they provide in-context demonstrations for LLMs, candidate corrections for edit-based models, and statistical patterns for frequency-based methods.
The improvement is consistent across different CSC approaches. Whatever correction method you use, it works better when the retrieved examples are actually relevant to the intended meaning rather than the erroneous surface form.
Improvement Across CSC Methods
| Metric | Standard Retrieval | Error-Robust Retrieval |
|---|---|---|
| Retrieval Precision | Baseline | Substantially Higher |
| LLM-based CSC | Baseline | Improved |
| Neural CSC | Baseline | Improved |
| Overall Detection & Correction | Baseline | Consistent Gains |
Key Results
- Retrieval Precision: Substantially more relevant correction examples found, even for misspelled queries
- Error Invariance: Embeddings robust to common Chinese spelling error patterns
- CSC Improvement: Consistent gains in detection and correction accuracy
- General Applicability: Works with LLMs, neural models, and traditional methods
Retrieval That Understands Intent
The broader lesson extends beyond spelling correction. Many retrieval scenarios involve queries that don't perfectly express user intent -- whether through errors, imprecision, or incomplete information. Building retrieval systems that understand intent despite surface imperfections is crucial for real-world robustness.
Error-robust retrieval demonstrates that we can train embeddings to distinguish signal from noise, capturing intended meaning even when the literal text points elsewhere.