EAMA: Entity-Aware Multimodal Alignment Based Approach for News Image Captioning

Junzhe Zhang, Huixuan Zhang, Xunjian Yin, Xiaojun Wan

TOMM 2025

TL;DR

EAMA aligns multimodal language models with entity-aware auxiliary tasks so they can accurately name people, places, and organizations in news image captions -- without needing any external entity recognition modules.

News image captioning requires a model to generate an informative caption rich in entities, given the news image and the associated news article. Multimodal large language models (MLLMs) have shown remarkable ability across various vision-language tasks. However, their performance in news image captioning remains suboptimal due to challenges in handling entity-rich content across modalities. We propose EAMA, an Entity-Aware Multimodal Alignment approach that aligns MLLMs through entity-aware auxiliary tasks before caption generation. Our method introduces two alignment tasks -- Entity-Aware Sentence Selection and Entity Selection -- that train the model to reason about entities across text and image. During inference, the aligned model self-extracts entity-related information to supplement its input, keeping the pipeline simple and end-to-end. EAMA achieves state-of-the-art results on both GoodNews and NYTimes800k benchmarks, demonstrating superior entity handling without external modules.

The Entity Problem in News Captioning

News image captioning differs fundamentally from generic image captioning. A news photo of a press conference isn't just "a person speaking at a podium" -- it's a specific politician, at a specific event, discussing a specific topic. Entities -- people, organizations, locations -- are the backbone of news captions.

Multimodal large language models have made impressive strides across vision-language tasks, yet they struggle with this entity-rich setting. In zero-shot mode, they lack the contextual knowledge to name specific people or events. Even with fine-tuning, they often fail to ground entity information correctly across text and image modalities.

Figure 1. Overview of the EAMA framework. Left: alignment training with three tasks -- entity-aware sentence selection, entity selection, and captioning. Right: self-supplemented generation at inference time, where the aligned model extracts its own entity context.

Aligning Models to Entity Awareness

Rather than treating entity handling as an afterthought, EAMA builds entity awareness directly into the alignment process. We design two auxiliary tasks alongside the main captioning objective (a code sketch of how all three could be framed as training data follows the list):

1. Entity-Aware Sentence Selection -- trains the model to identify which sentences in an article are most relevant to a given image, forcing it to reason about visual-textual entity correspondence.

2. Entity Selection -- trains the model to pick the correct entities associated with an image from a candidate set, building explicit entity-grounding capability across modalities.

3. News Image Captioning -- the primary task, in which the aligned model generates entity-rich captions, now equipped with stronger cross-modal entity understanding from the auxiliary tasks.
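To make the setup concrete, here is a minimal sketch of how the three tasks could be cast as instruction-tuning examples over the same (image, article) pairs and mixed into one training set. The dataclass, prompt wording, and supervision heuristics are illustrative assumptions, not the paper's exact templates.

from dataclasses import dataclass
from typing import List

@dataclass
class AlignmentExample:
    image_path: str  # news image
    prompt: str      # task instruction plus textual context
    target: str      # expected model output

def sentence_selection_example(image_path: str, sentences: List[str],
                               relevant_idx: List[int]) -> AlignmentExample:
    # Entity-Aware Sentence Selection: pick the article sentences most
    # relevant to the image. Supervision could come, e.g., from entity
    # overlap with the gold caption (an assumed heuristic).
    numbered = "\n".join(f"({i}) {s}" for i, s in enumerate(sentences, 1))
    return AlignmentExample(
        image_path,
        "Select the sentences most relevant to the image:\n" + numbered,
        ", ".join(str(i) for i in relevant_idx),  # 1-based indices
    )

def entity_selection_example(image_path: str, candidates: List[str],
                             gold: List[str]) -> AlignmentExample:
    # Entity Selection: choose the entities that belong with the image
    # from a candidate set (gold entities plus distractors).
    return AlignmentExample(
        image_path,
        "Which of these entities relate to the image? " + "; ".join(candidates),
        "; ".join(gold),
    )

def captioning_example(image_path: str, article: str,
                       caption: str) -> AlignmentExample:
    # Primary task: generate the entity-rich news caption.
    return AlignmentExample(
        image_path,
        "Write a news caption for the image given the article:\n" + article,
        caption,
    )

A single training set would then interleave examples from all three constructors, so every gradient step can draw on any of the objectives.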

Figure 2. Qualitative comparison. Zero-shot MLLMs produce generic descriptions; supervised fine-tuning captures some entities but hallucinates others. EAMA generates accurate, entity-rich captions grounded in both image and article.

Self-Extracted Entity Information

A key insight of EAMA is that once aligned, the model itself can extract entity-related information to supplement its input during caption generation. There is no need for external entity recognition modules or knowledge bases -- the aligned model serves as its own entity extractor, keeping the pipeline simple and end-to-end.
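A minimal sketch of this two-pass inference, assuming a generic MLLM wrapper with a generate(image, prompt) method (a hypothetical interface; the paper's exact prompts and decoding settings may differ):

def caption_with_self_supplement(model, image, article: str) -> str:
    # Pass 1: the aligned model extracts entity-related context itself,
    # reusing the abilities trained by the two auxiliary tasks.
    entities = model.generate(
        image,
        "List the entities from the article that relate to this image:\n" + article,
    )
    sentences = model.generate(
        image,
        "Select the article sentences most relevant to this image:\n" + article,
    )

    # Pass 2: caption generation, with the self-extracted context appended
    # to the input -- no external NER module or knowledge base involved.
    prompt = (
        "Write a news caption for the image.\n"
        f"Article: {article}\n"
        f"Relevant sentences: {sentences}\n"
        f"Key entities: {entities}"
    )
    return model.generate(image, prompt)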

Balancing Sufficiency and Conciseness

News captioning faces a tension: the model needs enough context to generate entity-rich captions, but too much input introduces noise and confusion. EAMA addresses this by letting the aligned model decide what information is relevant, naturally filtering the textual input to balance sufficiency and conciseness during generation.
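One illustrative way to realize this filtering (an assumption for exposition, not the paper's stated mechanism) is to keep only the self-selected sentences as generation context, capped by a rough token budget, rather than passing the full article:

def build_context(selected_sentences: list, max_tokens: int = 256) -> str:
    # Keep self-selected sentences until an assumed token budget is hit:
    # enough context to name entities, not so much that noise creeps in.
    kept, used = [], 0
    for sent in selected_sentences:
        n = len(sent.split())  # crude whitespace token count, for illustration
        if used + n > max_tokens:
            break
        kept.append(sent)
        used += n
    return " ".join(kept)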

Main Results

EAMA achieves state-of-the-art performance on both GoodNews and NYTimes800k benchmarks, outperforming previous methods across caption quality metrics and entity handling scores.
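In the tables below, Entity P and Entity R denote precision and recall over named entities in generated captions. A hedged sketch of the protocol common in this literature -- entities are extracted from generated and reference captions with an off-the-shelf tagger and compared as sets; the paper's exact matching rules may differ:

import spacy

nlp = spacy.load("en_core_web_sm")  # off-the-shelf English NER model

def entity_precision_recall(generated: str, reference: str):
    # Compare the sets of named-entity strings found in each caption.
    pred = {ent.text for ent in nlp(generated).ents}
    gold = {ent.text for ent in nlp(reference).ents}
    if not pred or not gold:
        return 0.0, 0.0  # degenerate case: no entities on one side
    hits = len(pred & gold)
    return hits / len(pred), hits / len(gold)  # (precision, recall)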

Performance on GoodNews

Method              | BLEU-4 | METEOR | ROUGE | CIDEr | Entity P | Entity R
Xu et al. (2024a)   |   8.49 |  12.88 | 26.22 | 83.52 |    30.19 |    26.57
InstructBLIP (OSFT) |   9.53 |  13.54 | 25.61 | 78.03 |    25.89 |    27.33
EAMA (Ours)         |  10.04 |  13.95 | 27.06 | 87.70 |    27.58 |    28.92

Performance on NYTimes800k

Method              | BLEU-4 | METEOR | ROUGE | CIDEr | Entity P | Entity R
Base (no alignment) |  10.88 |  14.09 | 26.60 | 82.70 |    29.22 |    31.32
Align (SENT+CAP)    |  10.92 |  14.08 | 26.68 | 83.60 |    29.42 |    31.45
Align (ENT+CAP)     |  10.97 |  14.20 | 26.87 | 84.28 |    29.26 |    31.58
EAMA (Full)         |  11.03 |  14.22 | 27.15 | 87.00 |    29.79 |    32.24
Figure 3. Performance comparison of three MLLMs (LLaVA-v1.5, MiniGPT-v2, InstructBLIP) under zero-shot and official supervised fine-tuning (OSFT) settings, showing the entity gap that EAMA addresses.

Toward Entity-Grounded Multimodal Understanding

EAMA demonstrates that targeted alignment tasks can address specific weaknesses in MLLMs. By designing training objectives that explicitly require entity reasoning across modalities, we bridge the gap between generic multimodal understanding and the entity-centric demands of news captioning. The approach points toward a broader principle: alignment tasks tailored to downstream challenges can unlock capabilities that general-purpose training leaves on the table.

Citation

@misc{zhang2024eamaentityawaremultimodal,
      title={EAMA : Entity-Aware Multimodal Alignment Based Approach for News Image Captioning},
      author={Junzhe Zhang and Huixuan Zhang and Xunjian Yin and Xiaojun Wan},
      year={2024},
      eprint={2402.19404},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2402.19404},
}