Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models
NAACL 2025 Findings
We systematically evaluate when LLMs can write their own retrieval context instead of fetching it externally, finding that self-generated documents help reasoning tasks but risk hallucinations for factual recall -- and that combining both strategies works best.
Retrieval-augmented generation (RAG) has become a standard approach for grounding large language models in external knowledge. Recently, an alternative paradigm has emerged: using LLM-generated documents as context instead of retrieved ones. In this work, we present the first comprehensive evaluation of self-generated documents for RAG. We investigate multiple factors including task type, model capability, document quality, and the interplay between self-generated and retrieved content. Our findings reveal that self-generated documents are more effective for reasoning-intensive tasks but riskier for factual recall, that larger models produce more reliable self-generated content, and that combining self-generated with retrieved documents often yields the best results. We provide practical guidelines for when and how to use self-generated documents in RAG pipelines.
When Models Write Their Own Context
Retrieval-augmented generation (RAG) has become the standard approach for grounding language models in external knowledge. But what if the model could generate its own context? Recent work has explored using self-generated documents as an alternative to retrieval -- letting the model write background information before answering.
The idea is appealing: no external database to maintain, no retrieval latency, and context tailored precisely to the question. But how reliable is self-generated content? When does it help, and when does it hurt?
A Systematic Investigation
We conducted the first comprehensive evaluation of self-generated documents in RAG systems. Rather than assuming self-generation is universally helpful or harmful, we sought to understand the factors that determine its effectiveness.
Generate Self-Documents
Prompt the LLM to write background documents relevant to the input question, creating tailored context without external retrieval.
Compare RAG Paradigms
Evaluate three paradigms side-by-side: retrieval-only, self-generation-only, and a hybrid combining both sources of context.
Analyze Across Dimensions
Systematically vary task type, model size, and document quality to identify when self-generation helps or hurts performance.
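The three paradigms above can be sketched as follows. This is a minimal illustration, not the authors' actual code: `llm` is a hypothetical callable mapping a prompt string to a completion, and `retriever` is a hypothetical function returning a list of passages for a question.

```python
# Sketch of the three RAG paradigms compared in the evaluation.
# `llm` and `retriever` are assumed interfaces, not a real API.

def generate_self_document(llm, question: str) -> str:
    """Elicit a background passage from the model's own knowledge."""
    return llm(
        "Write a short background document relevant to answering "
        f"the question:\n{question}"
    )

def answer(llm, question: str, contexts: list[str]) -> str:
    """Answer the question given whatever context documents were assembled."""
    joined = "\n\n".join(contexts)
    return llm(f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:")

def run_paradigm(llm, retriever, question: str, mode: str) -> str:
    if mode == "retrieval":
        # Retrieval-only: external passages as context.
        contexts = retriever(question)
    elif mode == "self":
        # Self-generation-only: the model writes its own context.
        contexts = [generate_self_document(llm, question)]
    elif mode == "hybrid":
        # Hybrid: combine both sources of context.
        contexts = retriever(question) + [generate_self_document(llm, question)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return answer(llm, question, contexts)
```

In the hybrid mode, retrieved and self-generated passages are simply concatenated into one context window; more sophisticated merging (reranking, deduplication) is possible but not assumed here.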
The Double-Edged Sword
Self-generated documents can improve performance on knowledge-intensive tasks -- the model's pre-trained knowledge, when properly elicited, provides useful context. But self-generation also risks introducing hallucinations and factual errors that compound, rather than correct, the model's limitations.
When Self-Generation Works
We found that task type matters enormously. For reasoning tasks, where the challenge is organizing and applying knowledge the model already has, self-generated context helps structure thinking. For factual recall tasks, where accuracy depends on specific memorized facts, self-generation is riskier -- the model may confidently generate plausible but incorrect information.
Model capability also plays a crucial role. Larger, more capable models produce more reliable self-generated content. Smaller models are more likely to hallucinate, making self-generation counterproductive.
Performance Across RAG Paradigms
| Context Source | Reasoning Tasks | Factual Recall | Overall |
|---|---|---|---|
| No Context (Closed-book) | Baseline | Baseline | Baseline |
| Retrieved Documents | Good | Strong | Good |
| Self-Generated Documents | Strong | Variable | Good |
| Hybrid (Retrieved + Self-Generated) | Best | Best | Best |
Key Findings
- Task Dependency: Self-generation works better for reasoning than factual recall
- Model Capability: Larger models produce more reliable self-generated content
- Complementary Use: Combining self-generated and retrieved documents often yields best results
- Verification Need: Self-generated content requires additional verification to avoid hallucinations
Practical Guidelines
Based on our analysis, we propose guidelines for practitioners. Self-generation should be used selectively, considering task type and model capability. When factual accuracy is critical, retrieval from verified sources remains essential. But for reasoning and synthesis tasks, self-generated context can be a powerful complement.
The most robust approach combines both: retrieve external documents for factual grounding, and use self-generation to bridge gaps and structure reasoning.
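The selective-use guideline can be expressed as a simple routing rule. The sketch below is illustrative only: the task labels and the model-size threshold are assumptions for the example, not values taken from the paper.

```python
# Hedged sketch of the selective-use guideline: pick a context
# strategy from task type and model scale. Threshold is illustrative.

def choose_strategy(task_type: str, model_params_b: float) -> str:
    """Return 'retrieval', 'self', or 'hybrid' for a query."""
    if task_type == "factual_recall":
        # Accuracy-critical: ground answers in verified external sources.
        return "retrieval"
    if task_type == "reasoning":
        # Larger models self-generate reliably enough to add their own
        # context; smaller models should lean on retrieval for grounding.
        return "hybrid" if model_params_b >= 7 else "retrieval"
    # Default for synthesis and mixed tasks: combine both sources.
    return "hybrid"
```

In practice such a router would sit in front of the RAG pipeline, with the task type coming from a classifier or the application itself.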
The Takeaway
Self-generated documents are not a replacement for retrieval but a complementary tool. The key is knowing when to trust a model's own knowledge versus when to ground it in external sources -- and combining both strategies yields the most robust RAG systems.