Differentiable Evolutionary Reinforcement Learning

Sitao Cheng*, Tianle Li*, Xuhan Huang*, Xunjian Yin, Difan Zou

ArXiv Preprint 2025

TL;DR

DERL makes reward function design differentiable by evolving compositions of structured reward primitives with approximate meta-gradients, enabling autonomous discovery of reward signals that reach state-of-the-art results on agent benchmarks (ALFWorld, ScienceWorld) and consistent gains on mathematical reasoning (GSM8K, MATH).

The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bi-level framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike prior evolutionary methods, DERL makes its meta-optimization differentiable: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent control (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.

Figure 1. Overview of DERL. Left: the Meta-Optimizer generates a parameterized Meta-Reward; validation performance feeds back to update it via policy gradients. Right: performance comparisons of outcome reward, average reward, and Meta-Reward across tasks.

The Reward Design Problem

Reinforcement learning has a dirty secret: the hardest part isn't the algorithm -- it's the reward function. Getting the reward right often requires extensive trial and error, domain expertise, and sometimes luck. A poorly designed reward leads to agents that technically optimize what you asked for but not what you actually wanted.

Existing approaches to automatic reward design treat the reward function as a black box, using evolutionary heuristics to search for better rewards. But black-box evolution is slow and sample-inefficient: the search never sees the causal relationship between reward structure and task performance.

How DERL Works

DERL bridges this gap with a key insight: what if we could make the evolutionary process differentiable? Instead of blind mutation and selection, what if we could compute gradients that tell us exactly how to improve the reward function?

1. Compose Structured Rewards. The Meta-Optimizer generates reward functions by composing structured atomic primitives into computational graphs, producing a parameterized Meta-Reward (see the sketch after this list).

2. Train Inner-Loop Policy. An inner-loop policy is trained with GRPO (Group Relative Policy Optimization) using the generated Meta-Reward, learning to solve tasks guided by the evolved reward signal.

3. Evaluate on Validation. The trained policy is evaluated on a held-out validation set. This validation performance serves as the learning signal for the Meta-Optimizer.

4. Update via Meta-Gradient. The Meta-Optimizer is updated through reinforcement learning, approximating the meta-gradient of task success with respect to reward structure -- making the evolution differentiable.
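To make step 1 concrete, here is a minimal sketch of composing atomic primitives into a parameterized Meta-Reward. The primitive names, weights, and trajectory fields are illustrative assumptions for this sketch, not the paper's actual primitive library or graph encoding.

```python
# Minimal sketch of step 1: a Meta-Reward as a weighted composition of atomic
# primitives. Primitive names and trajectory fields are assumed for illustration.
from dataclasses import dataclass
from typing import Callable, Dict, List

# Atomic primitives: each maps trajectory features to a scalar term.
PRIMITIVES: Dict[str, Callable[[dict], float]] = {
    "outcome":      lambda traj: 1.0 if traj["success"] else 0.0,
    "step_penalty": lambda traj: -0.01 * traj["num_steps"],
    "format_bonus": lambda traj: 0.1 if traj["well_formatted"] else 0.0,
}

@dataclass
class MetaReward:
    """A candidate Meta-Reward: which primitives to use and how to weight them."""
    terms: List[str]       # primitives selected by the Meta-Optimizer
    weights: List[float]   # learnable coefficients attached to each term

    def __call__(self, traj: dict) -> float:
        return sum(w * PRIMITIVES[name](traj)
                   for name, w in zip(self.terms, self.weights))

# One candidate proposed by the Meta-Optimizer.
meta_reward = MetaReward(terms=["outcome", "step_penalty"], weights=[1.0, 0.5])
print(meta_reward({"success": True, "num_steps": 12, "well_formatted": True}))  # 0.94
```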

Figure 2. Bi-level optimization structure. The blue block depicts Meta-Optimizer evolution with n generated Meta-Rewards. The green block illustrates inner-loop policy training using GRPO, with validation performance v serving as reward signal for Meta-Optimizer evolution.
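Inside the inner loop, each rollout is scored by the current Meta-Reward and the policy is updated with GRPO. The sketch below shows only the group-relative advantage computation at the core of GRPO; the group size and reward values are invented for illustration.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: normalize each rollout's scalar reward
    against the group of rollouts sampled for the same task/prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group of 4 rollouts scored by the evolved Meta-Reward (made-up values).
group_rewards = np.array([0.94, 0.30, 1.00, 0.10])
print(grpo_advantages(group_rewards))
# Each advantage then weights the policy-gradient term for its rollout's tokens.
```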

The Technical Innovation

DERL approximates the meta-gradient of task success with respect to reward structure: the Meta-Optimizer is itself trained with reinforcement learning, using inner-loop validation performance as its reward. This allows it to progressively learn to generate denser, more actionable feedback. Unlike black-box evolution, which only ranks and selects candidates, DERL obtains credit assignment for why certain reward structures work better and can generalize that signal to propose better rewards.
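One way to realize such an approximate meta-gradient is a REINFORCE-style (score-function) update, in which the validation score of the trained inner-loop policy acts as the reward for the Meta-Optimizer's proposal. The toy categorical Meta-Optimizer and the stubbed inner loop below are illustrative stand-ins, not the paper's actual architecture or update rule.

```python
# Hedged sketch of the outer loop: score-function ("REINFORCE") estimate of the
# meta-gradient, with validation performance v as the Meta-Optimizer's reward.
import torch

torch.manual_seed(0)
n_structures = 12                                       # candidate reward structures (assumed)
logits = torch.zeros(n_structures, requires_grad=True)  # toy Meta-Optimizer parameters
opt = torch.optim.Adam([logits], lr=0.1)

def inner_loop_validation(choice: int) -> float:
    """Stub for 'train a policy with GRPO under this reward, evaluate on validation'."""
    table = [0.2, 0.9, 0.4, 0.3, 0.5, 0.1, 0.6, 0.2, 0.3, 0.4, 0.7, 0.5]
    return table[choice]

baseline = 0.0
for step in range(50):
    dist = torch.distributions.Categorical(logits=logits)
    choice = dist.sample()                        # propose a reward structure
    v = inner_loop_validation(choice.item())      # inner-loop validation performance
    baseline = 0.9 * baseline + 0.1 * v           # moving baseline for variance reduction
    loss = -(v - baseline) * dist.log_prob(choice)  # REINFORCE surrogate loss
    opt.zero_grad(); loss.backward(); opt.step()

print(logits.argmax().item())  # the proposal distribution drifts toward high-v structures
```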

From Robots to Math

We validated DERL across three distinct domains that test different aspects of reward design. In ALFWorld, robotic agents must navigate and manipulate objects in household environments. In ScienceWorld, agents conduct scientific experiments in simulation. In GSM8K and MATH, agents solve mathematical reasoning problems.

Across all domains, DERL discovers reward functions that outperform hand-designed baselines -- without any domain-specific engineering of the rewards themselves.

ALFWorld & ScienceWorld Performance

Method              | ALFWorld L0 | ALFWorld L1 | ALFWorld L2 | SciWorld L0 | SciWorld L1 | SciWorld L2
GRPO w/ Outcome     |    76.6     |    71.1     |    29.7     |    21.1     |    13.7     |    10.9
GRPO w/ Avg Reward  |    88.1     |    85.4     |    30.5     |    37.9     |    31.3     |    18.0
GiGPO               |    86.7     |    83.2     |    48.0     |    25.8     |    15.2     |     4.7
RLVMR               |    89.1     |    87.9     |    56.3     |    46.9     |    34.4     |    26.5
DERL                |    91.0     |    89.1     |    65.0     |    47.7     |    43.0     |    30.1
DERL-pop.           |    91.8     |    88.3     |    76.4     |    98.2     |    95.3     |    31.3

Mathematical Reasoning Performance (GSM8K & MATH)

Reward Function  | Training Data | GSM8K | MATH
Outcome          | MATH+GSM8K    | 82.6  | 58.8
Outcome + Format | MATH+GSM8K    | 86.4  | 55.9
Avg Reward       | MATH+GSM8K    | 86.5  | 55.8
Outcome          | MATH          | 82.9  | 59.1
Outcome + Format | MATH          | 83.9  | 56.8
Avg Reward       | MATH          | 83.6  | 54.9
DERL             | MATH+GSM8K    | 87.0  | 60.2
DERL-pop.        | MATH+GSM8K    | 87.6  | 60.2
DERL             | MATH          | 83.2  | 60.5
DERL-pop.        | MATH          | 84.1  | 60.9

Key Results

Figure 3. Training trajectories on ALFWorld, GSM8K, and MATH benchmarks showing Meta-Optimizer convergence. Blue and orange lines represent average validation and testing performance, demonstrating stable optimization without overfitting.

Evolution of Reward Structures

Analysis of the evolutionary trajectory reveals that DERL learns to progressively favor mathematically robust reward structures. Over the course of training, the proportion of stable reward primitives increases while unstable ones are naturally selected against -- the system discovers not just good rewards, but structurally sound ones.
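As a hypothetical illustration of what "stable" versus "unstable" can mean for a reward primitive, compare a ratio that divides by a quantity that can reach zero with a bounded alternative; the paper's actual primitive taxonomy may differ.

```python
# Assumed example of an unstable vs. a stable reward primitive (not from the paper).
def unstable_progress_ratio(steps_done: int, steps_left: int) -> float:
    return steps_done / steps_left                      # diverges (or fails) as steps_left -> 0

def stable_progress_ratio(steps_done: int, steps_left: int) -> float:
    return steps_done / (steps_done + steps_left + 1)   # always bounded in [0, 1)

print(stable_progress_ratio(10, 0))  # 0.909..., well-behaved at the boundary
```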

Figure 4. Evolution dynamics: proportion of stable versus unstable reward structures across outer-loop training on ALFWorld, showing the system's increasing preference for mathematically robust structures.

Toward Autonomous Reward Design

DERL represents a step toward truly autonomous reinforcement learning -- systems that can design their own training signals without human intervention. By making the reward design process differentiable, we open the door to more efficient and effective automatic curriculum design, where agents learn not just to solve tasks but to teach themselves how to learn.

Figure 5. Differentiable evolutionary reward training loop using graph neural networks to represent computational graphs. The Meta-Reward is parameterized by twelve learnable weights across computational nodes.

Citation

@misc{cheng2025differentiableevolutionaryreinforcementlearning,
  title={Differentiable Evolutionary Reinforcement Learning},
  author={Sitao Cheng and Tianle Li and Xuhan Huang and Xunjian Yin and Difan Zou},
  year={2025},
  eprint={2512.13399},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.13399}
}