Value Flows

1Stanford University  2Princeton University
(*Equal contribution)

Overview

Teaser image

The goal of RL is to get high rewards, and most RL algorithms have to predict how much reward you'll get in the future. However, often there is some uncertainty in these future rewards — in some states, you might sometimes get high future rewards and sometimes get low future rewards.

In this paper, we use modern, flexible generative AI models (flow matching) to predict the full distribution over future rewards. Unlike prior methods, our method does not require rounding the distribution over future rewards into discrete bins or predicting a finite number of quantiles.

Intuitively, we can think of the distribution over values as flowing through time (see the figure). We prove that this flow has a consistency equation, analogous to the standard Bellman equation. This means that we can update the flow directly using a certain loss, without converting back to value predictions.
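For reference, the standard distributional Bellman equation that this flow-level consistency mirrors can be written as follows (notation is illustrative; \(\stackrel{d}{=}\) denotes equality in distribution):

```latex
% Distributional Bellman equation: the return random variable Z^\pi
% is distributionally consistent across one environment step.
Z^\pi(s, a) \;\stackrel{d}{=}\; R(s, a) + \gamma\, Z^\pi(s', a'),
\qquad s' \sim p(\cdot \mid s, a), \;\; a' \sim \pi(\cdot \mid s').
```

The paper's consistency equation plays the same role at the level of the flow's vector field, so the flow can be trained directly without first decoding it into value predictions.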

Key ideas

🌊Return vector fields: We formulate a distributional flow-matching objective to learn return vector fields that automatically satisfy the distributional Bellman equation. In practice, we additionally include a regularization term for stability.
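To make the objective concrete, here is a minimal sketch of a conditional flow-matching loss on scalar returns with an added stability regularizer. The function names, the linear interpolant, and the form of the regularizer are assumptions for illustration, not the paper's exact objective:

```python
import numpy as np

def flow_matching_loss(v_theta, returns, rng, reg_coef=0.1):
    """Monte-Carlo conditional flow-matching loss for a scalar return
    distribution. `v_theta(z_t, t)` predicts a velocity; `returns` are
    sampled returns. Illustrative sketch, not the paper's exact loss."""
    z1 = np.asarray(returns, dtype=float)     # target return samples
    z0 = rng.standard_normal(z1.shape)        # base (noise) samples
    t = rng.uniform(size=z1.shape)            # interpolation times in [0, 1]
    z_t = (1.0 - t) * z0 + t * z1             # linear interpolant
    target_v = z1 - z0                        # conditional velocity target
    pred_v = v_theta(z_t, t)
    fm = np.mean((pred_v - target_v) ** 2)    # flow-matching regression term
    reg = reg_coef * np.mean(pred_v ** 2)     # stability regularizer (assumed form)
    return fm + reg
```

A trained `v_theta` would then be integrated from noise to draw samples from the learned return distribution.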

⚖️Confidence weights: Our confidence weights prioritize learning a more accurate return distribution at transitions with higher return variance, as estimated by the return vector field.
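One way such variance-based weights could look in code (the exponential form and temperature are assumptions; the paper's exact weighting may differ):

```python
import numpy as np

def confidence_weights(return_samples, temperature=1.0):
    """Illustrative confidence weights: up-weight transitions whose
    estimated return distribution has higher variance.

    `return_samples` has shape (batch, num_samples): returns sampled
    from the learned return flow for each transition."""
    var = return_samples.var(axis=1)      # per-transition return variance
    w = np.exp(var / temperature)         # emphasize high-variance transitions
    return w / w.sum()                    # normalize over the batch
```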

🎯 Policy extraction:

  • For offline RL, we use rejection sampling to maximize Q estimates while implicitly imposing a KL constraint toward a fixed flow behavioral cloning (BC) policy.
  • For online fine-tuning in offline-to-online RL, we learn a stochastic one-step policy to maximize the Q estimates while distilling it toward the fixed BC flow policy.
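The offline rejection-sampling step above amounts to best-of-N selection over BC proposals; a minimal sketch (with hypothetical `bc_sample` and `q_fn` interfaces):

```python
import numpy as np

def rejection_sample_action(bc_sample, q_fn, state, num_candidates=32):
    """Best-of-N policy extraction sketch: draw candidate actions from a
    flow BC policy and keep the one with the highest Q estimate. Taking
    the argmax over samples from the BC policy keeps the extracted policy
    implicitly close (in KL) to the BC distribution."""
    candidates = np.stack([bc_sample(state) for _ in range(num_candidates)])
    q_values = np.array([q_fn(state, a) for a in candidates])
    return candidates[np.argmax(q_values)]
```

Increasing `num_candidates` trades a looser implicit KL constraint for more aggressive Q maximization.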

Visualizing return distributions

The policy completes the task of closing the window and closing the drawer using the buttons to lock and unlock them.
C51 discretizes the return distribution and predicts a noisy multi-modal distribution.
CODAC uses a finite number of quantiles to represent the return distribution and collapses to a single return mode.
Value Flows infers a smooth return histogram resembling the ground-truth distribution.
Quantitatively, Value Flows achieves \(3\times\) lower \(1\)-Wasserstein distance, averaged across time, than alternative methods.
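The \(1\)-Wasserstein metric used in this comparison has a simple empirical estimator for 1-D distributions: sort both sample sets and average the absolute differences. A minimal sketch for equal-size sample sets:

```python
import numpy as np

def wasserstein_1(samples_a, samples_b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D sample
    sets: the mean absolute difference of sorted samples, which is the
    closed-form optimal transport cost in one dimension."""
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    assert a.shape == b.shape, "sketch assumes equal sample counts"
    return np.abs(a - b).mean()
```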

Offline and offline-to-online evaluations

Offline RL

Click to see the full table (62 tasks)
Offline RL results table aggregated across domains
Offline RL results averaged over 8 seeds on 62 continuous control tasks from OGBench and D4RL.

  • Value Flows matches or outperforms all baselines on \(9\) out of \(11\) domains.
  • On the more challenging state-based tasks, Value Flows achieves \(1.6\times\) higher success rates than the best-performing baseline.
  • Value Flows outperforms the best baseline by \(1.24\times\) when learning directly from RGB images.

Offline-to-online RL

Click to see the full table (6 tasks)
Online RL results table

  • Value Flows continues to outperform prior state-of-the-art RL and distributional RL methods when fine-tuned with online interactions.
  • Value Flows can be used without any modifications to the vector field objective.

The key components of Value Flows

Regularizing the flow-matching loss is important.
Reweighting the flow-matching loss boosts success rates.

BibTeX

@misc{dong2025value,
  title={Value Flows}, 
  author={Perry Dong and Chongyi Zheng and Chelsea Finn and Dorsa Sadigh and Benjamin Eysenbach},
  year={2025},
  eprint={2510.07650},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.07650},
}