MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency
ACL 2025 Findings
We introduce MC-MKE, a fine-grained benchmark that separates multimodal knowledge errors into misreading and misrecognition types, revealing that current editing methods struggle to maintain consistency between visual and textual knowledge after corrections.
Multimodal large language models (MLLMs) are prone to generating incorrect information that conflicts with existing knowledge. Knowledge editing has been proposed as an efficient method to update model knowledge without retraining. However, existing multimodal knowledge editing benchmarks lack fine-grained classification of errors made by MLLMs, and do not adequately evaluate whether the edited model maintains consistency of knowledge across modalities. We introduce MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency. MC-MKE categorizes MLLM errors into misreading and misrecognition types and designs evaluation metrics to assess modality consistency after knowledge editing. We evaluate six representative knowledge editing methods on MC-MKE, revealing that current methods have significant room for improvement in maintaining cross-modal knowledge consistency.
When Vision and Language Disagree
Multimodal language models can see images and understand text -- but what happens when their knowledge about the two becomes inconsistent? A model might know that the Eiffel Tower is in Paris from text, but misidentify it in an image. Or it might correctly recognize a person's face while getting their name wrong.
These inconsistencies matter. As we deploy multimodal models in real applications, we need to be able to correct their knowledge -- to edit what they believe. But multimodal knowledge editing is harder than it sounds.
Two Types of Errors
We realized that multimodal errors are not monolithic. They fall into two distinct categories:
Misreading Errors
The model processes visual information correctly but retrieves wrong textual knowledge. Fixing these requires editing textual knowledge associations.
Misrecognition Errors
The visual processing itself fails -- the model cannot correctly identify what it sees. Fixing these requires correcting visual representations.
This distinction is crucial for knowledge editing. Previous benchmarks conflated these error types, making it impossible to diagnose what is actually going wrong with an editing method.
The Modality Consistency Challenge
When you edit multimodal knowledge, you want the change to be consistent across modalities. If you correct a model's knowledge about the Eiffel Tower, it should get the answer right whether you ask in text or show an image. But we found that existing editing methods often fail at this -- they might fix the text pathway while leaving the visual pathway broken, or vice versa.
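This consistency requirement can be made concrete with a small sketch. Everything below is hypothetical: a dict stands in for an edited MLLM's query interface, and the query strings and `is_consistent` helper are illustrative, not MC-MKE's actual evaluation code.

```python
def answer(model, query, image=None):
    """Return the mock model's answer to a textual or visual query."""
    return model.get((query, image), "unknown")

def is_consistent(model, text_query, visual_query, image, target):
    """An edit is modality-consistent only if BOTH pathways give the target."""
    text_ok = answer(model, text_query) == target
    visual_ok = answer(model, visual_query, image) == target
    return text_ok and visual_ok

# Mock edited model: the text pathway was fixed, the visual pathway was not.
edited = {
    ("Where is the Eiffel Tower?", None): "Paris",
    ("Where is this landmark?", "eiffel.jpg"): "Rome",
}

print(is_consistent(edited, "Where is the Eiffel Tower?",
                    "Where is this landmark?", "eiffel.jpg", "Paris"))
```

Here the edit would pass a text-only reliability check yet fail the consistency check, which is exactly the failure mode single-modality benchmarks cannot see.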
Building MC-MKE
We constructed MC-MKE as a fine-grained benchmark that explicitly separates misreading and misrecognition scenarios. The benchmark design follows a principled methodology:
Error Classification
Categorize MLLM errors into misreading (correct visual processing, wrong textual knowledge) and misrecognition (failed visual processing) types.
Test Case Construction
Design each test case to probe a specific type of error with paired visual and textual queries that target the same knowledge.
Consistency Evaluation
Verify whether editing achieves consistency across both modalities, measuring if the correction transfers between visual and textual pathways.
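The error-classification step above can be sketched as a simple decision rule. This is a minimal illustration of the misreading/misrecognition split, assuming we can separately probe what entity the model recognizes in the image and what fact it then retrieves; the function name and inputs are hypothetical, not the benchmark's actual pipeline.

```python
def classify_error(recognized_entity, true_entity, fact_answer, true_answer):
    """Classify an MLLM response following the misreading/misrecognition split.

    - misrecognition: visual processing failed (wrong entity identified)
    - misreading: entity recognized correctly, but wrong textual knowledge
    """
    if fact_answer == true_answer:
        return "correct"
    if recognized_entity != true_entity:
        return "misrecognition"
    return "misreading"

# Recognized the tower, but retrieved the wrong location -> misreading.
print(classify_error("Eiffel Tower", "Eiffel Tower", "Rome", "Paris"))
```

The point of the rule is diagnostic: a misreading case calls for editing textual knowledge associations, while a misrecognition case calls for correcting visual representations.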
Evaluation Results
We evaluated six representative knowledge editing methods on MC-MKE across three standard dimensions -- reliability (does the edit succeed?), generality (does the edit generalize?), and locality (does the edit avoid breaking unrelated knowledge?) -- plus a fourth, modality consistency (does the corrected knowledge hold whether queried through text or through an image?).
Modality Consistency Performance of Editing Methods
| Method | Reliability | Generality | Locality | Consistency |
|---|---|---|---|---|
| FT-LLM | High | Low | Low | Poor |
| FT-Full | High | Moderate | Low | Poor |
| KE | Moderate | Moderate | Moderate | Poor |
| MEND | Moderate | Low | High | Poor |
| SERAC | Moderate | Moderate | High | Moderate |
| IKE | Moderate | Moderate | High | Moderate |
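Scores like those in the table can be aggregated from per-edit outcomes. The sketch below is an illustrative aggregation only, assuming each test case is reduced to four booleans; the field names (`reliable`, `general`, `local`, `consistent`) are made up for this example and MC-MKE's actual metric definitions may differ.

```python
METRICS = ("reliable", "general", "local", "consistent")

def score(records):
    """Average per-edit booleans into benchmark-style fractions in [0, 1]."""
    n = len(records)
    return {m: sum(r[m] for r in records) / n for m in METRICS}

# Two mock edit outcomes: both succeed, but consistency holds only once.
records = [
    {"reliable": True, "general": True, "local": False, "consistent": False},
    {"reliable": True, "general": False, "local": True, "consistent": True},
]
print(score(records))
```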
Key Findings
- Modality Gap: Current methods struggle to maintain consistency between visual and textual knowledge after editing -- none achieve strong consistency
- Error-Specific Needs: Different error types (misreading vs. misrecognition) require different editing strategies for optimal correction
- Locality-Consistency Trade-off: Methods that preserve locality better (SERAC, IKE) tend to show somewhat better consistency, but still fall short
- Benchmark Value: MC-MKE reveals significant room for improvement that previous benchmarks could not detect
The Path Forward
As multimodal models become more prevalent, the ability to correct and update their knowledge becomes increasingly important. MC-MKE provides the foundation for developing editing techniques that respect the multimodal nature of these systems -- ensuring that when we fix what a model knows, the fix is complete and consistent across all modalities.
Our work motivates research into modality-aware editing techniques that can jointly update both visual and textual knowledge pathways, moving beyond the single-modality approaches that dominate current methods.