MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency
ACL 2025 Findings
We introduce MC-MKE, a fine-grained benchmark that separates multimodal knowledge errors into misreading and misrecognition types, revealing that current editing methods struggle to maintain consistency between visual and textual knowledge after corrections.
Multimodal large language models (MLLMs) are prone to generating incorrect information that conflicts with existing knowledge. Knowledge editing has been proposed as an efficient method to update model knowledge without retraining. However, existing multimodal knowledge editing benchmarks lack fine-grained classification of errors made by MLLMs, and do not adequately evaluate whether the edited model maintains consistency of knowledge across modalities. We introduce MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency. MC-MKE categorizes MLLM errors into misreading and misrecognition types and designs evaluation metrics to assess modality consistency after knowledge editing. We evaluate six representative knowledge editing methods on MC-MKE, revealing that current methods have significant room for improvement in maintaining cross-modal knowledge consistency.
When Vision and Language Disagree
Multimodal language models can see images and understand text -- but what happens when their knowledge about the two becomes inconsistent? A model might know that the Eiffel Tower is in Paris from text, but misidentify it in an image. Or it might correctly recognize a person's face while getting their name wrong.
These inconsistencies matter. As we deploy multimodal models in real applications, we need to be able to correct their knowledge -- to edit what they believe. But multimodal knowledge editing is harder than it sounds.
Two Types of Errors
We realized that multimodal errors are not monolithic. They fall into two distinct categories:
Misreading Errors
The model processes visual information correctly but retrieves wrong textual knowledge. Fixing these requires editing textual knowledge associations.
Misrecognition Errors
The visual processing itself fails -- the model cannot correctly identify what it sees. Fixing these requires correcting visual representations.
This distinction is crucial for knowledge editing. Previous benchmarks conflated these error types, making it impossible to diagnose what is actually going wrong with an editing method.
The Modality Consistency Challenge
When you edit multimodal knowledge, you want the change to be consistent across modalities. If you correct a model's knowledge about the Eiffel Tower, it should get the answer right whether you ask in text or show an image. But we found that existing editing methods often fail at this -- they might fix the text pathway while leaving the visual pathway broken, or vice versa.
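This consistency requirement can be made concrete with a small sketch. Everything below is hypothetical: a dict stands in for an edited MLLM's query interface, and the query strings and `is_consistent` helper are illustrative, not MC-MKE's actual evaluation code.

```python
def answer(model, query, image=None):
    """Return the mock model's answer to a textual or visual query."""
    return model.get((query, image), "unknown")

def is_consistent(model, text_query, visual_query, image, target):
    """An edit is modality-consistent only if BOTH pathways give the target."""
    text_ok = answer(model, text_query) == target
    visual_ok = answer(model, visual_query, image) == target
    return text_ok and visual_ok

# Mock edited model: the text pathway was fixed, the visual pathway was not.
edited = {
    ("Where is the Eiffel Tower?", None): "Paris",
    ("Where is this landmark?", "eiffel.jpg"): "Rome",
}

print(is_consistent(edited, "Where is the Eiffel Tower?",
                    "Where is this landmark?", "eiffel.jpg", "Paris"))
```

Here the edit would pass a text-only reliability check yet fail the consistency check, which is exactly the failure mode single-modality benchmarks cannot see.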
Building MC-MKE
We constructed MC-MKE as a fine-grained benchmark that explicitly separates misreading and misrecognition scenarios. The benchmark design follows a principled methodology:
Error Classification
Categorize MLLM errors into misreading (correct visual processing, wrong textual knowledge) and misrecognition (failed visual processing) types.
Test Case Construction
Design each test case to probe a specific type of error with paired visual and textual queries that target the same knowledge.
Consistency Evaluation
Verify whether editing achieves consistency across both modalities, measuring if the correction transfers between visual and textual pathways.
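The error-classification step above can be sketched as a simple decision rule. This is a minimal illustration of the misreading/misrecognition split, assuming we can separately probe what entity the model recognizes in the image and what fact it then retrieves; the function name and inputs are hypothetical, not the benchmark's actual pipeline.

```python
def classify_error(recognized_entity, true_entity, fact_answer, true_answer):
    """Classify an MLLM response following the misreading/misrecognition split.

    - misrecognition: visual processing failed (wrong entity identified)
    - misreading: entity recognized correctly, but wrong textual knowledge
    """
    if fact_answer == true_answer:
        return "correct"
    if recognized_entity != true_entity:
        return "misrecognition"
    return "misreading"

# Recognized the tower, but retrieved the wrong location -> misreading.
print(classify_error("Eiffel Tower", "Eiffel Tower", "Rome", "Paris"))
```

The point of the rule is diagnostic: a misreading case calls for editing textual knowledge associations, while a misrecognition case calls for correcting visual representations.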
Evaluation Results
We evaluated six representative knowledge editing methods on MC-MKE across three standard dimensions -- reliability (does the edit succeed?), generality (does the edit generalize?), and locality (does the edit avoid breaking unrelated knowledge?) -- plus a fourth, modality consistency (does the corrected knowledge hold whether queried through text or through an image?).
Modality Consistency Performance of Editing Methods
| Method | Reliability | Generality | Locality | Consistency |
|---|---|---|---|---|
| FT-LLM | High | Low | Low | Poor |
| FT-Full | High | Moderate | Low | Poor |
| KE | Moderate | Moderate | Moderate | Poor |
| MEND | Moderate | Low | High | Poor |
| SERAC | Moderate | Moderate | High | Moderate |
| IKE | Moderate | Moderate | High | Moderate |
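Scores like those in the table can be aggregated from per-edit outcomes. The sketch below is an illustrative aggregation only, assuming each test case is reduced to four booleans; the field names (`reliable`, `general`, `local`, `consistent`) are made up for this example and MC-MKE's actual metric definitions may differ.

```python
METRICS = ("reliable", "general", "local", "consistent")

def score(records):
    """Average per-edit booleans into benchmark-style fractions in [0, 1]."""
    n = len(records)
    return {m: sum(r[m] for r in records) / n for m in METRICS}

# Two mock edit outcomes: both succeed, but consistency holds only once.
records = [
    {"reliable": True, "general": True, "local": False, "consistent": False},
    {"reliable": True, "general": False, "local": True, "consistent": True},
]
print(score(records))
```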
Key Findings
- Modality Gap: Current methods struggle to maintain consistency between visual and textual knowledge after editing -- none achieve strong consistency
- Error-Specific Needs: Different error types (misreading vs. misrecognition) require different editing strategies for optimal correction
- Locality-Consistency Trade-off: Methods that preserve locality better (SERAC, IKE) tend to show somewhat better consistency, but still fall short
- Benchmark Value: MC-MKE reveals significant room for improvement that previous benchmarks could not detect
The Path Forward
As multimodal models become more prevalent, the ability to correct and update their knowledge becomes increasingly important. MC-MKE provides the foundation for developing editing techniques that respect the multimodal nature of these systems -- ensuring that when we fix what a model knows, the fix is complete and consistent across all modalities.
Our work motivates research into modality-aware editing techniques that can jointly update both visual and textual knowledge pathways, moving beyond the single-modality approaches that dominate current methods.