Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models
NAACL 2025 Findings
We systematically evaluate when LLMs can write their own retrieval context instead of fetching it externally, finding that self-generated documents help reasoning tasks but risk hallucinations for factual recall -- and that combining both strategies works best.
Retrieval-augmented generation (RAG) has become a standard approach for grounding large language models in external knowledge. Recently, an alternative paradigm has emerged: using LLM-generated documents as context instead of retrieved ones. In this work, we present the first comprehensive evaluation of self-generated documents for RAG. We investigate multiple factors including task type, model capability, document quality, and the interplay between self-generated and retrieved content. Our findings reveal that self-generated documents are more effective for reasoning-intensive tasks but riskier for factual recall, that larger models produce more reliable self-generated content, and that combining self-generated with retrieved documents often yields the best results. We provide practical guidelines for when and how to use self-generated documents in RAG pipelines.
When Models Write Their Own Context
Retrieval-augmented generation (RAG) has become the standard approach for grounding language models in external knowledge. But what if the model could generate its own context? Recent work has explored using self-generated documents as an alternative to retrieval -- letting the model write background information before answering.
The idea is appealing: no external database to maintain, no retrieval latency, and context tailored precisely to the question. But how reliable is self-generated content? When does it help, and when does it hurt?
A Systematic Investigation
We conducted the first comprehensive evaluation of self-generated documents in RAG systems. Rather than assuming self-generation is universally helpful or harmful, we sought to understand the factors that determine its effectiveness.
Generate Self-Documents
Prompt the LLM to write background documents relevant to the input question, creating tailored context without external retrieval.
Compare RAG Paradigms
Evaluate three paradigms side-by-side: retrieval-only, self-generation-only, and a hybrid combining both sources of context.
Analyze Across Dimensions
Systematically vary task type, model size, and document quality to identify when self-generation helps or hurts performance.
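The three paradigms above can be sketched as follows. This is a minimal illustration, not the authors' actual code: `llm` is a hypothetical callable mapping a prompt string to a completion, and `retriever` is a hypothetical function returning a list of passages for a question.

```python
# Sketch of the three RAG paradigms compared in the evaluation.
# `llm` and `retriever` are assumed interfaces, not a real API.

def generate_self_document(llm, question: str) -> str:
    """Elicit a background passage from the model's own knowledge."""
    return llm(
        "Write a short background document relevant to answering "
        f"the question:\n{question}"
    )

def answer(llm, question: str, contexts: list[str]) -> str:
    """Answer the question given whatever context documents were assembled."""
    joined = "\n\n".join(contexts)
    return llm(f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:")

def run_paradigm(llm, retriever, question: str, mode: str) -> str:
    if mode == "retrieval":
        # Retrieval-only: external passages as context.
        contexts = retriever(question)
    elif mode == "self":
        # Self-generation-only: the model writes its own context.
        contexts = [generate_self_document(llm, question)]
    elif mode == "hybrid":
        # Hybrid: combine both sources of context.
        contexts = retriever(question) + [generate_self_document(llm, question)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return answer(llm, question, contexts)
```

In the hybrid mode, retrieved and self-generated passages are simply concatenated into one context window; more sophisticated merging (reranking, deduplication) is possible but not assumed here.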
The Double-Edged Sword
Self-generated documents can improve performance on knowledge-intensive tasks -- the model's pre-trained knowledge, when properly elicited, provides useful context. But self-generation also risks introducing hallucinations and factual errors that compound, rather than correct, the model's limitations.
When Self-Generation Works
We found that task type matters enormously. For reasoning tasks, where the challenge is organizing and applying knowledge the model already has, self-generated context helps structure thinking. For factual recall tasks, where accuracy depends on specific memorized facts, self-generation is riskier -- the model may confidently generate plausible but incorrect information.
Model capability also plays a crucial role. Larger, more capable models produce more reliable self-generated content. Smaller models are more likely to hallucinate, making self-generation counterproductive.
Performance Across RAG Paradigms
| Context Source | Reasoning Tasks | Factual Recall | Overall |
|---|---|---|---|
| No Context (Closed-book) | Baseline | Baseline | Baseline |
| Retrieved Documents | Good | Strong | Good |
| Self-Generated Documents | Strong | Variable | Good |
| Hybrid (Retrieved + Self-Generated) | Best | Best | Best |
Key Findings
- Task Dependency: Self-generation works better for reasoning than factual recall
- Model Capability: Larger models produce more reliable self-generated content
- Complementary Use: Combining self-generated and retrieved documents often yields best results
- Verification Need: Self-generated content requires additional verification to avoid hallucinations
Practical Guidelines
Based on our analysis, we propose guidelines for practitioners. Self-generation should be used selectively, considering task type and model capability. When factual accuracy is critical, retrieval from verified sources remains essential. But for reasoning and synthesis tasks, self-generated context can be a powerful complement.
The most robust approach combines both: retrieve external documents for factual grounding, and use self-generation to bridge gaps and structure reasoning.
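The selective-use guideline can be expressed as a simple routing rule. The sketch below is illustrative only: the task labels and the model-size threshold are assumptions for the example, not values taken from the paper.

```python
# Hedged sketch of the selective-use guideline: pick a context
# strategy from task type and model scale. Threshold is illustrative.

def choose_strategy(task_type: str, model_params_b: float) -> str:
    """Return 'retrieval', 'self', or 'hybrid' for a query."""
    if task_type == "factual_recall":
        # Accuracy-critical: ground answers in verified external sources.
        return "retrieval"
    if task_type == "reasoning":
        # Larger models self-generate reliably enough to add their own
        # context; smaller models should lean on retrieval for grounding.
        return "hybrid" if model_params_b >= 7 else "retrieval"
    # Default for synthesis and mixed tasks: combine both sources.
    return "hybrid"
```

In practice such a router would sit in front of the RAG pipeline, with the task type coming from a classifier or the application itself.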
The Takeaway
Self-generated documents are not a replacement for retrieval but a complementary tool. The key is knowing when to trust a model's own knowledge versus when to ground it in external sources -- and combining both strategies yields the most robust RAG systems.