The goal of RL is to get high rewards, and most RL algorithms have to predict how much reward you'll get in the future. However, often there is some uncertainty in these future rewards — in some states, you might sometimes get high future rewards and sometimes get low future rewards.
In this paper, we use modern, flexible generative models (flow matching) to predict the full distribution over future rewards. Unlike prior methods, our approach does not require discretizing the return distribution into bins or predicting a finite number of quantiles.
Intuitively, we can think of the distribution over values as flowing through time (see the figure). We prove that this flow satisfies a consistency equation, analogous to the standard Bellman equation. This means that we can train the flow directly with a flow-matching loss, without converting back to scalar value predictions.
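The paper's actual model and objective aren't reproduced here. As a toy illustration of the flow-matching idea, the sketch below (the linear feature map and Gaussian target are my own simplifying assumptions, not the paper's method) fits a velocity field that transports Gaussian noise to samples of a synthetic return distribution, then draws new samples by Euler integration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "return" samples for one state: Gaussian with mean 2.0, std 1.0.
# (A real return distribution can be multimodal; a neural velocity field,
# unlike this toy linear fit, can capture that.)
returns = rng.normal(2.0, 1.0, size=10_000)

# Conditional flow matching: pair each return x1 with noise x0 ~ N(0, 1),
# interpolate x_t = (1 - t) x0 + t x1, and regress the velocity target
# x1 - x0 onto simple features of (x_t, t) via least squares.
x0 = rng.standard_normal(returns.shape)
t = rng.uniform(size=returns.shape)
xt = (1 - t) * x0 + t * returns
target = returns - x0
features = np.stack([np.ones_like(xt), xt, t, xt * t], axis=1)
coef, *_ = np.linalg.lstsq(features, target, rcond=None)

def velocity(x, t_scalar):
    """Learned (approximate) velocity field v(x, t)."""
    f = np.stack([np.ones_like(x), x, np.full_like(x, t_scalar),
                  x * t_scalar], axis=1)
    return f @ coef

# Sample returns by integrating dx/dt = v(x, t) from noise with Euler steps.
n_steps = 100
x = rng.standard_normal(5_000)
for k in range(n_steps):
    x = x + velocity(x, k / n_steps) / n_steps

print(float(x.mean()))  # should land near the target mean of 2.0
```

Replacing the least-squares fit with a neural network and conditioning on the transition (state, action, next state) recovers the usual flow-matching recipe for return distributions.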
🌊 Return vector fields: We formulate a distributional flow-matching objective to learn return vector fields that automatically satisfy the distributional Bellman equation. Additionally, we include a regularization term for stability in practice.
⚖️Confidence weights: Our confidence weights prioritize learning a more accurate return distribution at transitions with higher return variance, as estimated by the return vector field.
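The paper's exact weighting formula isn't reproduced here. As a hedged sketch, suppose we have per-transition return samples drawn from the learned return flow; one simple instantiation (the softmax form and temperature are illustrative assumptions, not the paper's definition) turns their variances into weights on the per-transition losses:

```python
import numpy as np

def confidence_weights(return_samples, temperature=1.0):
    """Per-transition weights from estimated return variance.

    return_samples: array of shape (batch, n_samples), returns drawn from
    the learned return distribution at each transition. Transitions with
    a more spread-out return distribution receive larger weights.
    (The softmax form here is an illustrative assumption.)
    """
    variances = return_samples.var(axis=1)
    z = variances / temperature
    z = z - z.max()                      # subtract max for numerical stability
    w = np.exp(z) / np.exp(z).sum()
    return w * len(variances)            # rescale so the mean weight is 1

rng = np.random.default_rng(0)
batch = np.stack([
    rng.normal(1.0, 0.1, size=200),   # low return-variance transition
    rng.normal(1.0, 2.0, size=200),   # high return-variance transition
])
w = confidence_weights(batch)
per_transition_loss = np.array([0.5, 0.5])
weighted_loss = float((w * per_transition_loss).mean())
```

Multiplying each transition's loss by its weight focuses training on the transitions where the return distribution is hardest to pin down.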
🎯 Policy extraction:
@misc{dong2025value,
      title={Value Flows},
      author={Perry Dong and Chongyi Zheng and Chelsea Finn and Dorsa Sadigh and Benjamin Eysenbach},
      year={2025},
      eprint={2510.07650},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.07650},
}