FASTER: Value-Guided Sampling for Fast RL

Stanford University
* Equal contribution
Instead of denoising all N candidates and selecting the best action post-hoc (best-of-N), FASTER learns a denoise critic Qdn that scores action samples during denoising, often directly on the initial noise itself.

FASTER recovers the benefits of sampling-based test-time scaling without suffering its computational cost. We model the denoising of multiple candidate samples and the selection of the best one as a Markov Decision Process (MDP) whose goal is to progressively filter candidates before denoising is complete. Within this MDP, we learn a denoise Q-function and policy with standard temporal-difference learning that decide which candidates to keep and which to discard while maximizing returns.


Abstract

Some of the most performant reinforcement learning algorithms today can be prohibitively expensive because they rely on test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method that recovers the benefits of sampling-based test-time scaling for diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and the selection of the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predict the downstream value of action candidates during denoising and filter them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements.


Key Idea

Modeling Action Denoising as an MDP

Action Filtering MDP diagram
Action Filtering MDP. We model the process of denoising action candidates and selecting the best one as an MDP where the goal is to filter action samples during denoising while maximizing returns.

Diffusion policies generate actions by iteratively denoising random noise — but not all noise candidates are equally promising. Rather than waiting until denoising is complete to pick the best one, we want to identify and discard poor candidates early, saving computation and improving quality.

To do this, we frame the denoising process as a sequential decision problem: at each denoising step, a learned policy looks at all surviving candidates and decides which ones to keep. Unpromising candidates are dropped early; the best survivor is ultimately executed in the environment.

States. At each denoising step, the policy observes the current environment state, how far along denoising has progressed, and the current (noisy) form of each surviving action candidate.

Actions. The policy decides which candidates to keep and which to discard — at least one must always survive.

Transitions. Surviving candidates are denoised one step further. The process ends when only one candidate remains or denoising completes, at which point the best survivor is executed.

Reward. Non-terminal steps receive zero reward. At termination, the reward is the environment Q-value of the surviving action.
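The components above can be sketched as a single MDP transition. This is a minimal illustration, not the paper's implementation: `denoise_fn`, `env_q_fn`, and the function name are placeholder choices, and the filtering decision is given as a boolean keep-mask.

```python
import numpy as np

def filtering_mdp_step(candidates, keep_mask, denoise_fn, env_q_fn, t, T):
    """One transition of the action-filtering MDP (illustrative sketch).

    candidates: (N, action_dim) noisy action samples at denoising step t
    keep_mask:  (N,) boolean filtering decision; at least one must be True
    denoise_fn: applies one denoising step to the surviving candidates
    env_q_fn:   environment Q-function, used only for the terminal reward
    """
    assert keep_mask.any(), "at least one candidate must survive"
    survivors = denoise_fn(candidates[keep_mask], t)   # denoise survivors one step
    # Terminate when one candidate remains or denoising completes.
    done = (len(survivors) == 1) or (t + 1 == T)
    # Zero reward until termination; then the environment Q-value
    # of the best surviving action.
    reward = float(env_q_fn(survivors).max()) if done else 0.0
    return survivors, reward, done
```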

Learning the Filtering Policy

We train a critic that estimates the quality of any filtering decision at any denoising step. The critic is trained with standard temporal-difference learning — it bootstraps from its own future predictions, gradually learning to predict which early filtering choices lead to high-quality final actions.

At test time, we start with a pool of random noise candidates and progressively filter them down using the learned critic, until a single candidate remains. That candidate is fully denoised and executed as the final action.
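The test-time procedure can be sketched as a filtering loop. All names here are assumptions for illustration: `critic(x, t)` scores partially denoised candidates, `denoise_step(x, t)` runs one denoising step, and the keep-fraction schedule is a placeholder for the learned filtering policy.

```python
import numpy as np

def filtered_denoise(noise_pool, critic, denoise_step, T, keep_frac=0.5):
    """Progressively filter a pool of noise candidates during denoising.

    At each step, score the surviving candidates and keep only the
    top-scoring fraction, then denoise the survivors; the last remaining
    candidate is returned as the final action.
    """
    x = noise_pool
    for t in range(T):
        if len(x) > 1:
            scores = critic(x, t)
            k = max(1, int(len(x) * keep_frac))  # drop the weakest candidates
            x = x[np.argsort(scores)[::-1][:k]]  # keep the top-k survivors
        x = denoise_step(x, t)                   # denoise only the survivors
    # If more than one candidate survives the schedule, pick the top scorer.
    return x[0] if len(x) == 1 else x[int(np.argmax(critic(x, T - 1)))]
```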

Simplifying to a Single Filtering Step: In practice, we find that instead of filtering candidates throughout the entire denoising process, filtering candidates at the noise level—which is the cheapest instantiation computationally—achieves performance equivalent to fully denoising all action samples and selecting the highest-value action.
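The single-step variant above reduces to scoring the raw noise samples once and denoising only the winner. A minimal sketch, with `q_dn` (the denoise critic applied at the noise level) and `denoise_full` as illustrative placeholders:

```python
import numpy as np

def noise_level_select(obs, noise_pool, q_dn, denoise_full):
    """Single filtering step at the noise level: score the initial noise
    samples with the denoise critic, commit to the argmax, and fully
    denoise only that one sample."""
    scores = q_dn(obs, noise_pool)            # score noise before any denoising
    best = noise_pool[int(np.argmax(scores))] # commit to one candidate early
    return denoise_full(obs, best)            # denoise only the winner
```

The compute saving is immediate: with N candidates, denoising cost drops from N full denoising passes to one, at the price of a single critic evaluation over the noise pool.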


Results

We implement FASTER on top of EXPO and IDQL and evaluate on a set of 9 challenging manipulation tasks from Robomimic and LIBERO.

Online


Top: Success rates of FASTER and baselines in the online setting. FASTER-EXPO outperforms strong baselines in sample efficiency. Bottom: Compute comparisons of FASTER-EXPO and EXPO. FASTER eliminates extra denoising during training and inference, yielding large FLOP reductions relative to EXPO with comparable task performance.


Success rate and compute comparisons of FASTER-IDQL and IDQL in the online setting. FASTER can be applied to IDQL to eliminate extra denoising rollouts at inference while matching its success rates.



Batch Online


Top: Success rates of FASTER and baselines in the batch-online setting. FASTER-EXPO matches the performance of EXPO across iterations. Bottom: Compute comparisons of FASTER-EXPO and EXPO. As in the online setting, FASTER-EXPO yields a large FLOP reduction over EXPO by not needing to denoise all action samples.



Applied to Vision-Language-Action Models


FASTER-EXPO compared to EXPO on π0.5. Top: FASTER is competitive with EXPO in success rate. Bottom: Compute comparisons of FASTER-EXPO and EXPO. FASTER-EXPO performs significantly better than EXPO under the same compute, since it selects the best action sample without denoising all sampled actions during inference and training.


Training and inference timing of FASTER-EXPO compared to EXPO. FASTER-EXPO achieves a 1.7x improvement in inference time and a 4.5x improvement in update-step time.


BibTeX

@misc{dong2026fastervalueguidedsamplingfast,
  title={FASTER: Value-Guided Sampling for Fast RL},
  author={Perry Dong and Alexander Swerdlow and Dorsa Sadigh and Chelsea Finn},
  year={2026},
  eprint={2604.19730},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2604.19730},
}