EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision‑Language‑Action Models

Perry Dong* Kuo-Han Hung* Tian Gao Dorsa Sadigh Chelsea Finn
Stanford University
*Equal contribution

Policy Rollouts

Final policies executing each task in the real world.

Flower Insert
String Light Routing
Egg Flip
Candy Scoop
Pool Shot
Cube Pick

Robustness

Policies trained with EXPO-FT recovering from perturbations, distractors, and visual variation.

Background distractions
Object variation
Object pose perturbation
Physical disturbance

Q Visualization

The learned Q-function over selected actions on successful and failed rollouts.

Q Visualization - Success
Q Visualization - Failure
Q Visualization - Failure

Sampling Visualization

Candidate action chunks proposed by the VLA and the edit policy.

Edit Action Visualization

How the edit actor refines the proposed VLA actions before execution.

Training

Time-lapse of policies improving over the course of online RL.

String Light Routing - Route I
String Light Routing - Route II
String Light Routing - Insert
Pool Shot
Egg Flip
Flower Insert
Candy Scoop
Cube Pick

Evaluation

Quantitative results across all tasks.

Light - Route I (autonomous, 2x speed)

Success rate

Successful trials out of 30 per task.

Method

Learning procedure and policy architecture.

EXPO-FT finetunes a pretrained Vision-Language-Action (VLA) policy with online reinforcement learning, reaching highly reliable performance with only a small amount of real-world interaction.

  • Edit policy. An edit actor predicts residual corrections to each sampled action chunk, refining the pretrained behavior while preserving the priors learned in large-scale pretraining.
  • Q-guided sampling. Multiple candidate action chunks are sampled from the VLA policy and refined by the edit actor; a learned Q-function then selects the candidate with the highest Q-value.
  • Human-in-the-loop. Operators can override actions in failure-prone states; these corrections enter the replay buffer to accelerate exploration.
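
The three components above can be sketched as a single action-selection step. This is a minimal illustration under stated assumptions, not the authors' implementation: the VLA, the edit actor, and the Q-function are replaced by toy NumPy stand-ins, and every name and size here (`sample_vla_chunks`, `edit_actor`, `q_function`, `select_action`, `step`, `NUM_CANDIDATES`, the chunk dimensions, the residual scale) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only; the real chunk length, action dimension,
# and number of candidates are not given on this page.
CHUNK_LEN, ACTION_DIM, NUM_CANDIDATES = 4, 2, 8

def sample_vla_chunks(obs, n):
    """Toy stand-in for the pretrained VLA: n candidate action chunks."""
    return rng.normal(size=(n, CHUNK_LEN, ACTION_DIM))

def edit_actor(obs, chunks, scale=0.1):
    """Toy stand-in for the edit actor: a small, bounded residual per chunk."""
    return scale * np.tanh(obs.mean() - chunks)

def q_function(obs, chunks):
    """Toy stand-in for the learned critic: one scalar Q-value per chunk."""
    return -np.sum((chunks - 0.5) ** 2, axis=(1, 2))

def select_action(obs, n=NUM_CANDIDATES):
    """Sample candidates, apply residual edits, return the highest-Q chunk."""
    base = sample_vla_chunks(obs, n)        # proposals from the VLA
    edited = base + edit_actor(obs, base)   # residual correction of each proposal
    q = q_function(obs, edited)             # score every refined candidate
    return edited[int(np.argmax(q))], q

def step(obs, replay_buffer, human_override=None):
    """One control step; an operator override replaces the policy's chunk.

    Either way the executed chunk is stored, so corrections reach the
    replay buffer. Rewards and next observations are omitted for brevity.
    """
    chunk, _ = select_action(obs)
    is_human = human_override is not None
    executed = human_override if is_human else chunk
    replay_buffer.append({"obs": obs, "action": executed, "is_human": is_human})
    return executed
```

In this sketch the edit actor only adds a bounded residual, so every candidate stays close to what the VLA proposed, and the Q-function does nothing more than choose among those refined candidates.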

Initial State Randomization

Visualization of the initial state randomization across tasks.

The orange regions indicate the randomized initialization areas used during training. The tasks in our evaluations feature large initial state spaces.

BibTeX

@misc{dong2026expoft,
      title={EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models}, 
      author={Perry Dong and Kuo-Han Hung and Tian Gao and Dorsa Sadigh and Chelsea Finn},
      year={2026},
      eprint={2605.25477},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.25477}, 
}