From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, Victor Zhong

arXiv preprint, 2025

TL;DR

Reinforcement learning acts as a reasoning synthesizer, not just a probability amplifier—it enables models to compose novel complex reasoning from mastered atomic skills, while supervised fine-tuning alone memorizes shortcuts and fails out-of-distribution.

A key question about reinforcement learning (RL) is whether it synthesizes new reasoning skills or merely amplifies existing ones. We study this through Complementary Reasoning, a task that requires integrating parametric knowledge (what the model knows) with contextual information (what is provided). Using a controlled synthetic dataset, we decompose this task into two atomic skills: Parametric Reasoning (over internal knowledge) and Contextual Reasoning (over external information). Our experiments reveal the SFT Generalization Paradox: supervised fine-tuning (SFT) is sufficient for in-distribution performance but struggles with out-of-distribution (OOD) generalization, especially in novel compositional settings. Models trained directly on the composite task achieve strong in-distribution accuracy yet fail on OOD scenarios because they learn memorization shortcuts rather than compositional structure. Crucially, we demonstrate that RL acts as a reasoning synthesizer rather than a probability amplifier, but only when models have first mastered the independent atomic skills through SFT. Our findings suggest that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.

The SFT Generalization Paradox overview
Figure 1. The SFT Generalization Paradox: Models trained on atomic skills generalize better via RL than models trained directly on the composite task. (a) Example of Complementary Reasoning requiring both Parametric and Contextual Reasoning. (b) Evaluation protocol through three levels of difficulty. (c) RL composes new skills only when the base model has sufficient atomic abilities.

The Question That Drives Us

There's a heated debate in AI: does reinforcement learning actually teach models new reasoning skills, or does it merely amplify behaviors they already possess? This question has profound implications for how we should train the next generation of AI systems.

We designed a rigorous experiment to answer this question, and what we found surprised us.

The Setup: Complementary Reasoning

We focused on "complementary reasoning"—tasks that require integrating what a model knows internally (parametric knowledge) with information provided in context (contextual knowledge). Think of it like a detective who must combine their expertise with clues at a crime scene.

Using a carefully constructed dataset of human biographies, we decomposed this complex skill into two atomic components: parametric reasoning (using internal knowledge) and contextual reasoning (using external information). This separation allowed us to precisely measure what models learn and when they fail.
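To make the decomposition concrete, here is an illustrative sketch of how a single synthetic biography can yield one question per atomic skill plus a composite question. The record fields, names, and question templates below are hypothetical stand-ins, not the paper's actual dataset schema.

```python
# Illustrative only: the fields and question templates below are hypothetical,
# not the paper's actual biography schema.
biography = {"name": "Alice Marlow", "birth_city": "Lisbon"}
context = "Alice Marlow's employer, Northgate Labs, is headquartered in Lisbon."

# Parametric Reasoning: answerable from internal (memorized) knowledge alone.
parametric_q = f"In which city was {biography['name']} born?"
parametric_answer = biography["birth_city"]

# Contextual Reasoning: answerable from the provided passage alone.
contextual_q = "According to the passage, where is Alice Marlow's employer headquartered?"
contextual_answer = "Lisbon"

# Complementary Reasoning: requires combining both sources.
composite_q = "Is Alice Marlow's employer headquartered in the city where she was born?"
composite_answer = "yes" if parametric_answer == contextual_answer else "no"

print(parametric_q, contextual_q, composite_q, composite_answer, sep="\n")
```

Neither source answers the composite question on its own, which is what makes it a probe of skill composition rather than retrieval.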

1. Decompose into Atomic Skills: Split complementary reasoning into parametric reasoning (internal knowledge) and contextual reasoning (external information) using a controlled biography dataset.

2. Train Atomic Skills via SFT: Fine-tune models separately on each atomic skill to build a strong foundation before attempting composite tasks.

3. Apply RL for Composition: Use reinforcement learning on composite tasks to synthesize complex reasoning from the mastered atomic primitives (a reward sketch follows this list).

4. Evaluate OOD Generalization: Test on three difficulty levels, including novel combinations never seen during training, to measure true generalization.
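For step 3, a natural setup is RL with a verifiable outcome reward on the composite questions. The exact-match reward below is a minimal sketch under that assumption (including the "Answer:" output format), not a detail confirmed by the post:

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer from a sampled completion; the 'Answer:' format is assumed."""
    match = re.search(r"Answer:\s*(.+)", completion, flags=re.IGNORECASE)
    return match.group(1).strip().lower() if match else ""

def composite_reward(completion: str, gold_answer: str) -> float:
    """Binary outcome reward for RL on composite questions: 1.0 on exact match, else 0.0."""
    return 1.0 if extract_answer(completion) == gold_answer.strip().lower() else 0.0

# Example rollout scored against the gold label.
print(composite_reward("Both point to Lisbon. Answer: yes", "yes"))  # 1.0
```

Any policy-gradient method that consumes a scalar per-rollout reward can use this signal; the specific RL algorithm is not assumed here.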

The SFT Generalization Paradox

Models trained with supervised fine-tuning (SFT) on composite reasoning tasks achieve near-perfect accuracy on in-distribution tests. But when evaluated on novel combinations—situations they haven't seen but could logically solve—their performance collapses. They weren't learning to reason; they were memorizing shortcuts. Perfect training scores masked a fundamental failure to generalize.

RL vs. SFT: Generalization Comparison

Training Strategy            | In-Distribution | Level 1 (Easy OOD) | Level 2 (Hard OOD) | Level 3 (Full OOD)
-----------------------------|-----------------|--------------------|--------------------|-------------------
SFT on Composite             | ~100%           | Low                | Very Low           | Collapse
SFT on Atomics only          | Moderate        | Moderate           | Low                | Low
SFT Atomics + RL Composite   | High            | High               | High               | Strong
SFT Composite + RL Composite | High            | Moderate           | Low                | Low

RL vs SFT performance comparison
Figure 2. Comparison of reinforcement learning on different base models across complementary data proportions. The top row shows RL gains; the bottom row shows absolute performance. LLMs generalize to complementary reasoning only from SFT models trained on both parametric and contextual reasoning.

Enter Reinforcement Learning

When we switched from supervised learning to reinforcement learning, something remarkable happened. RL-trained models showed genuine generalization—they could solve novel combinations of atomic skills even without explicit training on those combinations.

But here's the crucial caveat: RL only worked when the base model had first mastered the individual atomic skills through SFT. Without this foundation, RL couldn't synthesize what wasn't there to begin with.

Necessity of atomic skills for RL generalization
Figure 3. Necessity of atomic skills for RL generalization. RL with the same composite data is applied to different SFT-trained base models: only the model SFT-trained on both atomic skills (Mem+Ctx, i.e., parametric and contextual reasoning) generalizes well across all evaluation levels.

Key Finding: RL as a Reasoning Synthesizer

RL does not merely amplify existing probability distributions. When given a foundation of mastered atomic skills, RL actively synthesizes novel composite reasoning strategies the model has never been explicitly trained on. This is evidence of genuine skill composition, not memorization.

The Recipe for Generalization

Performance of different training strategies
Figure 4. Performance of training with different strategies over 12.8k composite samples, showing the breakdown across difficulty levels.

Deeper Analysis

Our Pass@k analysis reveals a fundamental difference between the two approaches. For models SFT-trained directly on the composite task, increasing the number of sampled answers barely improves performance on OOD tests: the correct reasoning path simply does not exist in the model's output distribution. For atomically trained models after RL, more samples consistently yield better results, confirming that the model has genuinely learned the compositional skill.
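For readers who want to reproduce this kind of analysis, the standard unbiased Pass@k estimator (introduced with Codex by Chen et al., 2021) can be computed from n sampled answers of which c are correct; we assume this estimator here, since the post does not spell out the exact computation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate given n sampled answers, c of which are correct.

    Uses 1 - C(n - c, k) / C(n, k); this standard estimator is an assumption,
    as the post does not specify how Pass@k was computed.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 samples, 3 correct -> Pass@8 estimate.
print(round(pass_at_k(n=64, c=3, k=8), 3))
```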

Pass@k comparison
Figure 5. Pass@k comparison for SFT on atomics vs. SFT on composite. RL synthesizes new compositional skills only when built on models with sufficient atomic abilities.
Model scaling analysis
Figure 6. Model scaling analysis across Qwen 0.5B, 1.5B, and 3B parameters, showing the pattern holds across model sizes.
PCA analysis of representation shifts
Figure 7. PCA analysis of representation shifts during RL training. Scatter points represent layer-wise shifts; large markers show the global centroid shift for each reasoning type, illustrating how RL reorganizes internal representations.
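A minimal sketch of the kind of analysis behind Figure 7, assuming access to layer-wise hidden states collected on the same prompts before and after RL (the random arrays below are placeholders for real activations):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_layers, n_examples, hidden = 24, 128, 512

# Placeholder activations; in practice these would be mean-pooled hidden states
# from the model before and after RL, gathered on identical prompts.
before = rng.normal(size=(n_layers, n_examples, hidden))
after = before + 0.1 * rng.normal(size=(n_layers, n_examples, hidden))

# Fit a shared 2-D PCA so pre- and post-RL states live in the same plane.
pca = PCA(n_components=2).fit(np.concatenate([before, after]).reshape(-1, hidden))

# Per-layer shift: distance between pre- and post-RL centroids in PCA space.
shifts = []
for layer in range(n_layers):
    b = pca.transform(before[layer]).mean(axis=0)
    a = pca.transform(after[layer]).mean(axis=0)
    shifts.append(np.linalg.norm(a - b))

print("global centroid shift:", float(np.mean(shifts)))
```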

What This Means

Our findings challenge the view that RL merely amplifies existing behaviors. When given the right foundation, RL can actively synthesize complex reasoning strategies from simpler learned primitives—without ever being explicitly shown those complex strategies.

This suggests a scalable path forward: instead of trying to supervise models on every complex task, we can train atomic skills and use RL to enable their combination. It's a more modular, more generalizable approach to building reasoning systems.

Looking Ahead

The SFT Generalization Paradox serves as a cautionary tale. High training accuracy can be deceiving—true understanding requires out-of-distribution generalization. By decomposing complex skills into atomic components and using RL to enable their synthesis, we may be able to build AI systems that truly reason rather than merely recall.

Citation

@misc{cheng2025atomiccompositereinforcementlearning,
  title={From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning},
  author={Sitao Cheng and Xunjian Yin and Ruiwen Zhou and Yuxuan Li and Xinyi Wang and Liangming Pan and William Yang Wang and Victor Zhong},
  year={2025},
  eprint={2512.01970},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.01970}
}