DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models
AAAI 2025
LLM-based GEC systems produce diverse corrections that break traditional metrics. DSGram dynamically weights three sub-metrics -- Semantic Coherence, Edit Level, and Fluency -- using the Analytic Hierarchy Process to evaluate corrections the way humans actually judge quality.
Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.
The New GEC Landscape
Grammatical error correction has been transformed by large language models. Where traditional systems produced predictable corrections that could be compared against fixed references, LLMs generate diverse, creative corrections that may be perfectly valid but differ from any reference. This diversity breaks traditional evaluation metrics.
Consider correcting "I goes to store yesterday." A traditional system might output "I went to the store yesterday." But an LLM might produce "Yesterday, I went to the store" -- equally correct, but different. Reference-based metrics penalize such variations unfairly.
Beyond Fixed Metrics
We realized that different corrections need different evaluation criteria. A correction that changes word order should be evaluated differently than one that fixes subject-verb agreement. Forcing every correction through a single fixed-weight score loses important information.
DSGram addresses this by dynamically weighting multiple sub-metrics -- Semantic Coherence, Edit Level, and Fluency -- based on the specific characteristics of each correction. The weights adapt to what matters most for each case.
Sub-Metric Decomposition
Score each correction along three complementary dimensions: Semantic Coherence (meaning preservation), Edit Level (degree of editing), and Fluency (naturalness of the output).
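To make the decomposition concrete, here is a minimal Python sketch of a container for per-correction sub-metric scores. The class name, field names, and the 1-10 scale are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class SubMetricScores:
    """Per-correction scores on the three DSGram dimensions (assumed 1-10 scale)."""
    semantic_coherence: float  # how well the correction preserves the source meaning
    edit_level: float          # whether the amount of editing is appropriate
    fluency: float             # naturalness and grammaticality of the output
```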
LLM Weight Generation
Use an LLM to assess the relative importance of each sub-metric for the specific sentence pair being evaluated, producing pairwise comparison judgments.
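A sketch of how those pairwise judgments might be elicited. The prompt wording, the `ask_llm` callable, and the JSON keys are hypothetical stand-ins for whatever LLM interface is used; Saaty's 1-9 scale is the standard AHP comparison scale.

```python
import json

PAIRWISE_PROMPT = """Given the source sentence and its correction, judge the relative
importance of each pair of evaluation criteria for this specific correction, using
Saaty's 1-9 scale (1 = equally important, 9 = the first is extremely more important).

Source: {source}
Correction: {correction}

Return JSON: {{"coherence_vs_edit": x, "coherence_vs_fluency": y, "edit_vs_fluency": z}}
"""

def pairwise_judgments(source: str, correction: str, ask_llm) -> dict:
    """Query the LLM for pairwise importance judgments between the three sub-metrics."""
    reply = ask_llm(PAIRWISE_PROMPT.format(source=source, correction=correction))
    return json.loads(reply)
```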
AHP Consistency Check
Apply the Analytic Hierarchy Process to construct a judgment matrix and verify consistency, ensuring the dynamic weights are logically coherent.
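The AHP step itself is textbook material. Assuming the three pairwise judgments above fill a reciprocal 3x3 matrix, the weights are its normalized principal eigenvector, and Saaty's consistency ratio (CR = CI / RI, with CI = (λ_max - n)/(n - 1)) flags incoherent judgments; CR below roughly 0.1 is conventionally accepted. A sketch:

```python
import numpy as np

# Saaty's random consistency indices by matrix size.
SAATY_RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}

def ahp_weights(j: dict, cr_threshold: float = 0.1) -> dict:
    """Build the reciprocal judgment matrix, compute principal-eigenvector
    weights, and verify Saaty's consistency ratio."""
    a, b, c = j["coherence_vs_edit"], j["coherence_vs_fluency"], j["edit_vs_fluency"]
    # Rows/columns ordered (semantic coherence, edit level, fluency).
    M = np.array([[1.0,     a,     b],
                  [1.0 / a, 1.0,   c],
                  [1.0 / b, 1.0 / c, 1.0]])
    eigvals, eigvecs = np.linalg.eig(M)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()                          # normalized priority vector
    n = M.shape[0]
    ci = (eigvals[k].real - n) / (n - 1)  # consistency index
    cr = ci / SAATY_RI[n]                 # consistency ratio
    if cr > cr_threshold:
        raise ValueError(f"Inconsistent judgments (CR = {cr:.3f} > {cr_threshold})")
    return dict(zip(["semantic_coherence", "edit_level", "fluency"], w))
```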
Dynamic Score Combination
Combine sub-metric scores using the context-dependent weights to produce a final evaluation score that reflects what matters most for each correction.
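The combination step is then a weighted sum; this sketch assumes the `SubMetricScores` container and the weight dict from the sketches above. Because the AHP weights are normalized to sum to 1, the final score stays on the same scale as the sub-metric scores.

```python
def dsgram_score(scores: SubMetricScores, weights: dict) -> float:
    """Combine sub-metric scores using the context-dependent AHP weights."""
    return (weights["semantic_coherence"] * scores.semantic_coherence
            + weights["edit_level"] * scores.edit_level
            + weights["fluency"] * scores.fluency)
```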
Why Dynamic Weighting?
Rather than applying fixed weights to the evaluation dimensions, DSGram adjusts the weights to fit each correction. A meaning-preserving rephrasing shifts weight toward Semantic Coherence; a targeted grammar fix shifts it toward Fluency and Edit Level. This adaptive weighting produces scores that better match human intuitions about quality.
Multi-Dimensional Quality
GEC isn't just about fixing errors -- it's about producing text that is grammatical, fluent, and faithful to the original meaning. These dimensions can trade off against each other. A correction that maximizes grammaticality might sacrifice naturalness; one that preserves meaning perfectly might miss obvious fixes.
DSGram captures these trade-offs explicitly, providing not just a score but insight into which dimensions a correction excels at or falls short on.
System-Level Meta-Evaluation (SEEDA)
| Metric | SEEDA-S (Pearson r) | SEEDA-S (Spearman ρ) | SEEDA-E (Pearson r) | SEEDA-E (Spearman ρ) |
|---|---|---|---|---|
| GLEU | 0.847 | 0.886 | -- | -- |
| SOME | 0.892 | 0.867 | 0.901 | 0.951 |
| IMPARA | 0.911 | 0.874 | 0.889 | 0.944 |
| DSGram (GPT-4) | 0.880 | 0.909 | 0.927 | 0.944 |
Sentence-Level Meta-Evaluation (SEEDA-S)
| Metric | Accuracy | Kendall τ |
|---|---|---|
| IMPARA | 0.761 | 0.540 |
| SOME | 0.768 | 0.555 |
| DSGram (GPT-4) | 0.776 | 0.551 |
Key Results
- Higher Correlation: Stronger alignment with human quality judgments than prior metrics on most system-level and sentence-level comparisons
- Robustness: Stable performance across different GEC systems and error types (Cronbach's Alpha 0.76--0.82 across domains)
- Interpretability: Sub-metric weights reveal what makes corrections good or bad in each specific context
- LLM Compatibility: Handles diverse outputs from modern GEC systems that diverge from gold references
Evaluation for the LLM Era
As LLMs become the dominant approach to GEC, evaluation must evolve to match. DSGram provides a framework that respects the diversity of modern systems while maintaining meaningful quality assessment. It's evaluation designed for the way GEC is actually done today.
The dynamic weighting mechanism ensures that evaluation criteria adapt to each correction, rather than forcing all corrections through the same rigid rubric. This makes DSGram not just a metric, but a lens for understanding what different GEC systems do well and where they fall short.