EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

Policy Rollouts

Final policies executing each task in the real world.

Flower Insert

String Light Routing

Egg Flip

Candy Scoop

Pool Shot

Cube Pick

Robustness

Policies trained with EXPO-FT recovering from perturbations, distractors, and visual variation.

Background distractions

Object variation

Object pose perturbation

Physical disturbance

Q Visualization

The learned Q-function evaluated on the selected actions during successful and failed rollouts.

Q Visualization - Success

Q Visualization - Failure

Sampling Visualization

Candidate action chunks proposed by the VLA and the edit policy.

Edit Action Visualization

How the edit actor refines the proposed VLA actions before execution.

Training

Time-lapse of policies improving over the course of online RL.

String Light Routing - Route I

String Light Routing - Route II

String Light Routing - Insert

Pool Shot

Egg Flip

Flower Insert

Candy Scoop

Cube Pick

Evaluation

Quantitative results across all tasks.

Light - Route I (autonomous, 2x speed)

Success rate

Successful trials out of 30 per task.

Method

Learning procedure and policy architecture.

EXPO-FT finetunes a pretrained Vision-Language-Action (VLA) policy with online reinforcement learning, reaching highly reliable performance with only a small amount of real-world interaction.

Edit policy. An edit actor predicts residual corrections to each sampled action chunk, refining pretrained behavior while preserving the priors from large-scale pretraining.
Q-guided sampling. Multiple candidate action chunks are sampled from the VLA policy and refined by the edit actor; a learned Q-function then selects the candidate with the highest Q-value.
Human-in-the-loop. Operators can override actions in failure-prone states; these corrections enter the replay buffer to accelerate exploration.
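
The three components above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`select_action`, `record_transition`), the callables standing in for the VLA, edit actor, and Q-function, the `edit_scale` factor, the default number of candidates, and the replay-buffer layout are all assumptions made for illustration. The sketch scores only the edited candidates; whether unedited VLA samples are also scored is not stated here.

```python
from typing import Callable, Dict, List, Optional

Action = List[float]  # one flattened action chunk (illustrative representation)


def select_action(
    obs: object,
    sample_vla: Callable[[object], Action],          # stand-in for the pretrained VLA sampler
    edit_actor: Callable[[object, Action], Action],  # stand-in for the residual edit actor
    q_function: Callable[[object, Action], float],   # stand-in for the learned critic
    num_candidates: int = 8,                         # assumed value, not from the source
    edit_scale: float = 0.1,                         # assumed bound on the residual size
) -> Action:
    """Sample candidates from the VLA, refine each with a residual edit,
    and return the refined candidate with the highest Q-value."""
    candidates: List[Action] = []
    for _ in range(num_candidates):
        base = sample_vla(obs)
        residual = edit_actor(obs, base)
        # A small residual keeps the result close to the pretrained proposal.
        edited = [a + edit_scale * r for a, r in zip(base, residual)]
        candidates.append(edited)
    return max(candidates, key=lambda action: q_function(obs, action))


def record_transition(
    replay_buffer: List[Dict[str, object]],
    obs: object,
    policy_action: Action,
    human_action: Optional[Action],
    reward: float,
    next_obs: object,
) -> Action:
    """Store the executed action in the replay buffer and return it.
    If the operator supplied an override, the override is what gets stored."""
    intervened = human_action is not None
    executed = human_action if intervened else policy_action
    replay_buffer.append(
        {
            "obs": obs,
            "action": executed,
            "reward": reward,
            "next_obs": next_obs,
            "intervened": intervened,
        }
    )
    return executed
```

In this sketch the residual is added to the VLA proposal rather than replacing it, which is one simple way to keep edits near pretrained behavior. Human overrides are stored like any other transition, with a flag so they can be identified later.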

Initial State Randomization

Visualization of the initial state randomization across tasks.

The orange regions indicate the randomized initialization areas used during training. The tasks in our evaluations feature large initial state spaces.

BibTeX

@misc{dong2026expoft,
      title={EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models}, 
      author={Perry Dong and Kuo-Han Hung and Tian Gao and Dorsa Sadigh and Chelsea Finn},
      year={2026},
      eprint={2605.25477},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.25477}, 
}