Understanding the Interplay between Parametric and Contextual Knowledge for Large Language Models
KnowLM Workshop @ ACL 2025
LLMs suppress their internal knowledge whenever external context is present -- even when that context is irrelevant or complementary. We introduce EchoQA, a benchmark with four knowledge-relationship types, revealing that models default to "defer to context" and struggle to integrate parametric and contextual knowledge.
Large language models (LLMs) encode vast amounts of knowledge during pre-training (parametric knowledge, or PK) and can further be enhanced by incorporating contextual knowledge (CK). Can LLMs effectively integrate their internal PK with external CK to solve complex problems? In this paper, we investigate the dynamic interaction between PK and CK, categorizing their relationships into four types: Supportive, Complementary, Conflicting, and Irrelevant. To support this investigation, we introduce EchoQA, a benchmark spanning scientific, factual, and commonsense knowledge. Our results show that LLMs tend to suppress their PK when contextual information is available, even when it is complementary or irrelevant. While tailored instructions can encourage LLMs to rely more on their PK, they still struggle to fully leverage it. These findings reveal a key vulnerability in LLMs, raising concerns about their reliability in knowledge-intensive tasks.
Two Sources of Truth
Large language models carry two kinds of knowledge. There's parametric knowledge -- facts and patterns baked into their weights during pre-training. And there's contextual knowledge -- information provided in the prompt, such as retrieved documents or user-supplied facts. Ideally, models should seamlessly integrate both to answer complex questions.
But how do these two knowledge sources actually interact? When they agree, when they conflict, when one fills gaps in the other -- how does the model decide what to believe?
A Taxonomy of Relationships
We developed a systematic taxonomy to study these interactions. Contextual knowledge can be supportive (confirming what the model already knows), complementary (adding necessary pieces the model lacks), conflicting (contradicting the model's beliefs), or irrelevant (providing no useful information). Each relationship presents different challenges.
To study these systematically, we created EchoQA -- a benchmark spanning scientific, factual, and commonsense knowledge domains, with carefully controlled examples of each relationship type.
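To make this concrete, here's a sketch of what a benchmark item with a labeled relationship type could look like. The schema and the toy example are ours for illustration; EchoQA's actual format may differ.

```python
from dataclasses import dataclass
from enum import Enum

class Relationship(Enum):
    """How the contextual knowledge relates to the model's parametric knowledge."""
    SUPPORTIVE = "supportive"        # context confirms what the model already knows
    COMPLEMENTARY = "complementary"  # context adds pieces the model lacks
    CONFLICTING = "conflicting"      # context contradicts the model's beliefs
    IRRELEVANT = "irrelevant"        # context provides no useful information

@dataclass
class Item:
    """Hypothetical item schema; EchoQA's real format may differ."""
    question: str
    context: str
    relationship: Relationship
    answer: str

# A toy complementary item: the context supplies one fact (Zarnak's height),
# while the model's parametric knowledge must supply the other (Everest's height).
item = Item(
    question="Is the fictional peak Zarnak taller than Mount Everest?",
    context="Zarnak rises 9,120 meters above sea level.",
    relationship=Relationship.COMPLEMENTARY,
    answer="Yes",
)
```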
The Troubling Discovery
Our experiments revealed a concerning pattern: LLMs tend to suppress their parametric knowledge whenever contextual information is present -- even when that context is irrelevant or when the model's own knowledge would be more helpful. The models seem to have learned "defer to context" as a default behavior, sometimes to their detriment.
The Failure of Complementary Reasoning
The most striking failure came in complementary reasoning tasks -- questions that require combining internal and external knowledge. The model knows some facts; the context provides others; the answer requires integrating both. (ALCUNA, used in the table below, builds artificial new entities whose described properties must be combined with the model's existing knowledge of related real entities.) This should be the ideal use case for retrieval-augmented generation.
Instead, models struggled. They would often ignore their parametric knowledge entirely, trying to answer using only the context -- and abstaining with "unknown" when the context alone was insufficient.
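The tables below report the unknown ratio: the percentage of questions on which the model abstains rather than committing to an answer. Here's a minimal sketch of how such an abstention metric could be computed, assuming simple phrase-matching heuristics (the paper's exact answer-parsing protocol may differ).

```python
import re

# Phrases treated as abstentions. These heuristics are illustrative;
# the paper's exact answer-parsing protocol may differ.
ABSTAIN_PATTERNS = [
    r"\bunknown\b",
    r"\bi (do not|don't) know\b",
    r"\bcannot (be )?determine(d)?\b",
    r"\bnot enough information\b",
]

def is_abstention(answer: str) -> bool:
    """Return True if the model's answer declines to commit to a response."""
    text = answer.lower()
    return any(re.search(p, text) for p in ABSTAIN_PATTERNS)

def unknown_ratio(answers: list[str]) -> float:
    """Fraction of answers flagged as abstentions, as a percentage."""
    if not answers:
        return 0.0
    return 100.0 * sum(is_abstention(a) for a in answers) / len(answers)

# Example: two abstentions out of four answers -> 50.0
print(unknown_ratio([
    "Unknown.",
    "The answer is 42.",
    "I don't know based on the given context.",
    "Paris",
]))
```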
Unknown Ratio (%) for Complementary Reasoning on ALCUNA. Columns are prompt settings: w/o K (question only, no contextual knowledge), Neutral (context with a neutral instruction), Trust Yourself (instructed to rely on its own knowledge), and Golden K (gold contextual knowledge provided).
| Model | w/o K | Neutral | Trust Yourself | Golden K |
|---|---|---|---|---|
| OpenAI o1 | 65.78 | 46.12 | 35.43 | 20.63 |
| GPT-4o | 36.45 | 59.90 | 26.94 | 8.08 |
| Llama 3.1-70B | 23.89 | 62.72 | 23.88 | 0.08 |
| Qwen 2-7B | 40.28 | 81.26 | 73.62 | 28.60 |
Unknown Ratio (%) for Irrelevant Reasoning on ConflictQA. Here the context provides no useful information, so a model should fall back on its parametric knowledge; the Trust Yourself and Speak Out Loud instructions push it increasingly explicitly to do so.
| Model | Neutral | Trust Yourself | Speak Out Loud |
|---|---|---|---|
| OpenAI o1 | 6.12 | 6.12 | 0.98 |
| GPT-4o | 50.38 | 13.46 | 0.53 |
| Llama 3.1-70B | 55.14 | 27.36 | 2.33 |
| Qwen 2-7B | 80.59 | 41.82 | 4.57 |
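To give a concrete sense of these instruction settings, here's the flavor of prompt each column name suggests. The wording below is our own illustrative paraphrase, not the benchmark's verbatim prompts.

```python
# Illustrative instruction templates for the settings above.
# The wording is our paraphrase, NOT the benchmark's verbatim prompts.
INSTRUCTIONS = {
    "neutral": "Answer the question. You may use the provided context.",
    "trust_yourself": (
        "Answer the question. If the provided context disagrees with or "
        "does not cover what you already know, trust your own knowledge."
    ),
    "speak_out_loud": (
        "First state what you already know that is relevant to the "
        "question, then answer, using the context only where it helps."
    ),
}

def build_prompt(setting: str, question: str, context: str | None = None) -> str:
    """Assemble a prompt for one instruction setting (hypothetical format)."""
    parts = [INSTRUCTIONS[setting]]
    if context is not None:  # omitted entirely in the "w/o K" setting
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```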
Key Findings
- Context Dominance: LLMs prioritize context even when parametric knowledge would be more reliable
- Complementary Failure: Models struggle to combine internal and external knowledge effectively
- Instruction Sensitivity: Prompting helps but doesn't fully solve the problem
- RAG Implications: These behaviors raise concerns for retrieval-augmented systems in practice
Implications for RAG
These findings have serious implications for retrieval-augmented generation. RAG systems assume that providing relevant context will help models answer better. But if models can't properly integrate context with their own knowledge -- or worse, if they suppress useful internal knowledge when context is present -- then RAG might not work as intended.
Our work calls for more careful design of RAG systems, including mechanisms to help models know when to trust context, when to trust themselves, and when to synthesize both.
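One shape such a mechanism could take, sketched under our own assumptions: elicit a closed-book answer and a context-grounded answer separately, and only let the context win after an explicit relevance check. This is an illustration of the idea, not a method from the paper.

```python
from typing import Callable

def answer_with_routing(llm: Callable[[str], str], question: str, context: str) -> str:
    """Route between parametric and contextual knowledge.
    The heuristic is a hypothetical sketch, not a method from the paper."""
    # 1. Closed-book answer: parametric knowledge only, no context shown.
    pk = llm(f"Answer from your own knowledge only.\nQuestion: {question}")

    # 2. Context-grounded answer.
    ck = llm(f"Answer using the context.\nContext: {context}\nQuestion: {question}")

    # 3. If both sources agree (supportive), there's nothing to arbitrate.
    if pk.strip().lower() == ck.strip().lower():
        return pk

    # 4. Otherwise, make the model adjudicate explicitly rather than letting
    #    the mere presence of context silently win.
    return llm(
        "Two candidate answers follow. Decide whether the context is relevant "
        "and reliable for this question; if not, prefer the closed-book "
        "answer. Output only the final answer.\n"
        f"Question: {question}\nContext: {context}\n"
        f"Closed-book answer: {pk}\nContext-based answer: {ck}"
    )
```

Even a simple wrapper like this changes the default: instead of context silently overriding parametric knowledge, disagreement becomes an explicit decision point.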