EAMA: Entity-Aware Multimodal Alignment Based Approach for News Image Captioning
TOMM 2025
EAMA aligns multimodal language models with entity-aware auxiliary tasks so they can accurately name people, places, and organizations in news image captions -- without needing any external entity recognition modules.
News image captioning requires a model to generate an informative caption rich in entities, given the news image and the associated news article. Multimodal large language models (MLLMs) have shown remarkable ability across various vision-language tasks. However, their performance in news image captioning remains suboptimal due to challenges in handling entity-rich content across modalities. We propose EAMA, an Entity-Aware Multimodal Alignment approach that aligns MLLMs through entity-aware auxiliary tasks before caption generation. Our method introduces two alignment tasks -- Entity-Aware Sentence Selection and Entity Selection -- that train the model to reason about entities across text and image. During inference, the aligned model self-extracts entity-related information to supplement its input, keeping the pipeline simple and end-to-end. EAMA achieves state-of-the-art results on both GoodNews and NYTimes800k benchmarks, demonstrating superior entity handling without external modules.
The Entity Problem in News Captioning
News image captioning differs fundamentally from generic image captioning. A news photo of a press conference isn't just "a person speaking at a podium" -- it's a specific politician, at a specific event, discussing a specific topic. Entities -- people, organizations, locations -- are the backbone of news captions.
Multimodal large language models have made impressive strides across vision-language tasks, yet they struggle with this entity-rich setting. In zero-shot mode, they lack the contextual knowledge to name specific people or events. Even with fine-tuning, they often fail to ground entity information correctly across text and image modalities.
Aligning Models to Entity Awareness
Rather than treating entity handling as an afterthought, EAMA builds entity awareness directly into the alignment process. We design two auxiliary tasks alongside the main captioning objective:
Entity-Aware Sentence Selection
Trains the model to identify which sentences in an article are most relevant to a given image, forcing it to reason about the visual-textual entity correspondence.
Entity Selection
Trains the model to pick the correct entities associated with an image from candidates, building explicit entity-grounding capability across modalities.
News Image Captioning
The primary task where the aligned model generates entity-rich captions, now equipped with stronger cross-modal entity understanding from the auxiliary tasks.
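The three tasks above can be framed as instruction-style training examples for the same MLLM. The sketch below shows one plausible way to construct the two auxiliary examples; the prompt wording, candidate-sampling strategy, and function names are our own illustration, not the paper's exact recipe.

```python
import random

def build_sentence_selection_example(article_sentences, relevant_idx, k=4):
    """Entity-Aware Sentence Selection (sketch): given the image (paired
    elsewhere) and k candidate sentences, ask the model which sentence
    best matches the image. Distractor sampling here is illustrative."""
    distractors = [s for i, s in enumerate(article_sentences) if i != relevant_idx]
    candidates = random.sample(distractors, min(k - 1, len(distractors)))
    candidates.append(article_sentences[relevant_idx])
    random.shuffle(candidates)
    prompt = ("Which sentence best matches the image?\n"
              + "\n".join(f"({chr(65 + i)}) {s}" for i, s in enumerate(candidates)))
    answer = chr(65 + candidates.index(article_sentences[relevant_idx]))
    return prompt, answer

def build_entity_selection_example(gold_entities, distractor_entities):
    """Entity Selection (sketch): pick the entities actually associated
    with the image from a mixed candidate list."""
    candidates = sorted(set(gold_entities) | set(distractor_entities))
    prompt = ("Which of these entities are associated with the image? "
              + ", ".join(candidates))
    answer = ", ".join(sorted(gold_entities))
    return prompt, answer
```

Both examples share the main captioning objective's input format (image plus article text), so the auxiliary tasks and caption generation can be mixed into one multi-task fine-tuning stream.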
Self-Extracted Entity Information
A key insight of EAMA is that once aligned, the model itself can extract entity-related information to supplement its input during caption generation. There is no need for external entity recognition modules or knowledge bases -- the aligned model serves as its own entity extractor, keeping the pipeline simple and end-to-end.
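This self-extraction step can be pictured as a two-pass inference loop over the same aligned model. In the sketch below, `generate` stands in for the aligned MLLM's generation call, and the prompt wording is hypothetical:

```python
def caption_with_self_extraction(generate, image, article):
    """Two-pass inference sketch: the aligned model first extracts
    entity-related context from the article, then captions the image
    with that context appended. `generate(image, prompt) -> str` is a
    stand-in for the aligned MLLM; prompts are illustrative only."""
    # Pass 1: the model acts as its own entity extractor.
    entity_info = generate(
        image,
        f"List the entities in this article most relevant to the image:\n{article}",
    )
    # Pass 2: caption generation with the self-extracted context appended.
    augmented = f"{article}\n\nRelevant entities: {entity_info}"
    return generate(
        image,
        f"Write a news caption for the image using this context:\n{augmented}",
    )
```

Because both passes use the same model, no external NER module or knowledge base enters the pipeline.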
Balancing Sufficiency and Conciseness
News captioning faces a tension: the model needs enough context to generate entity-rich captions, but too much input introduces noise and confusion. EAMA addresses this by letting the aligned model decide what information is relevant, naturally filtering the textual input to balance sufficiency and conciseness during generation.
Main Results
EAMA achieves state-of-the-art performance on both GoodNews and NYTimes800k benchmarks, outperforming previous methods on caption quality metrics (BLEU-4, METEOR, ROUGE, CIDEr) and on named-entity recall.
Performance on GoodNews
| Method | BLEU-4 | METEOR | ROUGE | CIDEr | Entity P | Entity R |
|---|---|---|---|---|---|---|
| Xu et al. (2024a) | 8.49 | 12.88 | 26.22 | 83.52 | 30.19 | 26.57 |
| InstructBLIP (OSFT) | 9.53 | 13.54 | 25.61 | 78.03 | 25.89 | 27.33 |
| EAMA (Ours) | 10.04 | 13.95 | 27.06 | 87.70 | 27.58 | 28.92 |
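The Entity P and Entity R columns report named-entity precision and recall. A minimal set-based version of that computation for a single caption is sketched below; the benchmark's exact matching protocol (NER tooling, handling of partial matches) may differ.

```python
def entity_precision_recall(predicted_entities, gold_entities):
    """Set-based named-entity precision/recall for one caption.
    Entities are assumed to be pre-extracted strings; duplicates
    collapse into a set before matching."""
    pred, gold = set(predicted_entities), set(gold_entities)
    if not pred and not gold:
        return 1.0, 1.0  # nothing to name, nothing missed
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

For example, a caption naming {Obama, Paris} against gold entities {Obama, UN} scores 0.5 precision and 0.5 recall.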
Ablation on NYTimes800k
| Method | BLEU-4 | METEOR | ROUGE | CIDEr | Entity P | Entity R |
|---|---|---|---|---|---|---|
| Base (no alignment) | 10.88 | 14.09 | 26.60 | 82.70 | 29.22 | 31.32 |
| Align (SENT+CAP) | 10.92 | 14.08 | 26.68 | 83.60 | 29.42 | 31.45 |
| Align (ENT+CAP) | 10.97 | 14.20 | 26.87 | 84.28 | 29.26 | 31.58 |
| EAMA (Full) | 11.03 | 14.22 | 27.15 | 87.00 | 29.79 | 32.24 |
Key Results
- State-of-the-Art Performance: Superior results on both GoodNews and NYTimes800k benchmarks across caption quality metrics and entity recall
- Better Entity Handling: Improved named-entity recall over previous methods, with competitive precision
- End-to-End Simplicity: No external entity modules needed -- the aligned model self-extracts entity information
- Effective Alignment: Two auxiliary tasks significantly improve entity-aware multimodal understanding, as shown in the ablation study
Toward Entity-Grounded Multimodal Understanding
EAMA demonstrates that targeted alignment tasks can address specific weaknesses in MLLMs. By designing training objectives that explicitly require entity reasoning across modalities, we bridge the gap between generic multimodal understanding and the entity-centric demands of news captioning. The approach points toward a broader principle: alignment tasks tailored to downstream challenges can unlock capabilities that general-purpose training leaves on the table.