DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models
AAAI 2025
LLM-based GEC systems produce diverse corrections that break traditional metrics. DSGram dynamically weights three sub-metrics -- Semantic Coherence, Edit Level, and Fluency -- using the Analytic Hierarchy Process to evaluate corrections the way humans actually judge quality.
Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.
The New GEC Landscape
Grammatical error correction has been transformed by large language models. Where traditional systems produced predictable corrections that could be compared against fixed references, LLMs generate diverse, creative corrections that may be perfectly valid but differ from any reference. This diversity breaks traditional evaluation metrics.
Consider correcting "I goes to store yesterday." A traditional system might output "I went to the store yesterday." But an LLM might produce "Yesterday, I went to the store" -- equally correct, but different. Reference-based metrics penalize such variations unfairly.
Beyond Fixed Metrics
We realized that different corrections need different evaluation criteria. A correction that changes word order should be evaluated differently than one that fixes subject-verb agreement. Forcing every correction through a single fixed-weight score loses important information.
DSGram addresses this by dynamically weighting multiple sub-metrics -- Semantic Coherence, Edit Level, and Fluency -- based on the specific characteristics of each correction. The weights adapt to what matters most for each case.
Sub-Metric Decomposition
Score each correction along three complementary dimensions: Semantic Coherence (meaning preservation), Edit Level (degree of editing), and Fluency (naturalness of the output).
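To make the decomposition concrete, here is a minimal Python sketch of a container for per-correction sub-metric scores. The class name, field names, and the 1-10 scale are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class SubMetricScores:
    """Per-correction scores on the three DSGram dimensions (assumed 1-10 scale)."""
    semantic_coherence: float  # how well the correction preserves the source meaning
    edit_level: float          # whether the amount of editing is appropriate
    fluency: float             # naturalness and grammaticality of the output
```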
LLM Weight Generation
Use an LLM to assess the relative importance of each sub-metric for the specific sentence pair being evaluated, producing pairwise comparison judgments.
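A sketch of how those pairwise judgments might be elicited. The prompt wording, the `ask_llm` callable, and the JSON keys are hypothetical stand-ins for whatever LLM interface is used; Saaty's 1-9 scale is the standard AHP comparison scale.

```python
import json

PAIRWISE_PROMPT = """Given the source sentence and its correction, judge the relative
importance of each pair of evaluation criteria for this specific correction, using
Saaty's 1-9 scale (1 = equally important, 9 = the first is extremely more important).

Source: {source}
Correction: {correction}

Return JSON: {{"coherence_vs_edit": x, "coherence_vs_fluency": y, "edit_vs_fluency": z}}
"""

def pairwise_judgments(source: str, correction: str, ask_llm) -> dict:
    """Query the LLM for pairwise importance judgments between the three sub-metrics."""
    reply = ask_llm(PAIRWISE_PROMPT.format(source=source, correction=correction))
    return json.loads(reply)
```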
AHP Consistency Check
Apply the Analytic Hierarchy Process to construct a judgment matrix and verify consistency, ensuring the dynamic weights are logically coherent.
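The AHP step itself is textbook material. Assuming the three pairwise judgments above fill a reciprocal 3x3 matrix, the weights are its normalized principal eigenvector, and Saaty's consistency ratio (CR = CI / RI, with CI = (λ_max - n)/(n - 1)) flags incoherent judgments; CR below roughly 0.1 is conventionally accepted. A sketch:

```python
import numpy as np

# Saaty's random consistency indices by matrix size.
SAATY_RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}

def ahp_weights(j: dict, cr_threshold: float = 0.1) -> dict:
    """Build the reciprocal judgment matrix, compute principal-eigenvector
    weights, and verify Saaty's consistency ratio."""
    a, b, c = j["coherence_vs_edit"], j["coherence_vs_fluency"], j["edit_vs_fluency"]
    # Rows/columns ordered (semantic coherence, edit level, fluency).
    M = np.array([[1.0,     a,     b],
                  [1.0 / a, 1.0,   c],
                  [1.0 / b, 1.0 / c, 1.0]])
    eigvals, eigvecs = np.linalg.eig(M)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()                          # normalized priority vector
    n = M.shape[0]
    ci = (eigvals[k].real - n) / (n - 1)  # consistency index
    cr = ci / SAATY_RI[n]                 # consistency ratio
    if cr > cr_threshold:
        raise ValueError(f"Inconsistent judgments (CR = {cr:.3f} > {cr_threshold})")
    return dict(zip(["semantic_coherence", "edit_level", "fluency"], w))
```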
Dynamic Score Combination
Combine sub-metric scores using the context-dependent weights to produce a final evaluation score that reflects what matters most for each correction.
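The combination step is then a weighted sum; this sketch assumes the `SubMetricScores` container and the weight dict from the sketches above. Because the AHP weights are normalized to sum to 1, the final score stays on the same scale as the sub-metric scores.

```python
def dsgram_score(scores: SubMetricScores, weights: dict) -> float:
    """Combine sub-metric scores using the context-dependent AHP weights."""
    return (weights["semantic_coherence"] * scores.semantic_coherence
            + weights["edit_level"] * scores.edit_level
            + weights["fluency"] * scores.fluency)
```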
Why Dynamic Weighting?
Rather than applying fixed weights to the evaluation dimensions, DSGram adjusts the weights to fit each correction. A meaning-preserving rephrasing shifts weight toward Semantic Coherence; a targeted grammar fix shifts it toward Fluency and Edit Level. This adaptive weighting produces scores that better match human intuitions about quality.
Multi-Dimensional Quality
GEC isn't just about fixing errors -- it's about producing text that is grammatical, fluent, and faithful to the original meaning. These dimensions can trade off against each other. A correction that maximizes grammaticality might sacrifice naturalness; one that preserves meaning perfectly might miss obvious fixes.
DSGram captures these trade-offs explicitly, providing not just a score but insight into which dimensions a correction excels at or falls short on.
System-Level Meta-Evaluation (SEEDA)
| Metric | SEEDA-S (Pearson r) | SEEDA-S (Spearman ρ) | SEEDA-E (Pearson r) | SEEDA-E (Spearman ρ) |
|---|---|---|---|---|
| GLEU | 0.847 | 0.886 | -- | -- |
| SOME | 0.892 | 0.867 | 0.901 | 0.951 |
| IMPARA | 0.911 | 0.874 | 0.889 | 0.944 |
| DSGram (GPT-4) | 0.880 | 0.909 | 0.927 | 0.944 |
Sentence-Level Meta-Evaluation (SEEDA-S)
| Metric | Accuracy | Kendall τ |
|---|---|---|
| IMPARA | 0.761 | 0.540 |
| SOME | 0.768 | 0.555 |
| DSGram (GPT-4) | 0.776 | 0.551 |
Key Results
- Higher Correlation: Stronger alignment with human quality judgments than prior metrics on most system-level and sentence-level comparisons
- Robustness: Stable performance across different GEC systems and error types (Cronbach's Alpha 0.76--0.82 across domains)
- Interpretability: Sub-metric weights reveal what makes corrections good or bad in each specific context
- LLM Compatibility: Handles diverse outputs from modern GEC systems that diverge from gold references
Evaluation for the LLM Era
As LLMs become the dominant approach to GEC, evaluation must evolve to match. DSGram provides a framework that respects the diversity of modern systems while maintaining meaningful quality assessment. It's evaluation designed for the way GEC is actually done today.
The dynamic weighting mechanism ensures that evaluation criteria adapt to each correction, rather than forcing all corrections through the same rigid rubric. This makes DSGram not just a metric, but a lens for understanding what different GEC systems do well and where they fall short.