TQL: Scaling Q-Functions with Transformers
by Preventing Attention Collapse

Stanford University
TQL scaling results showing performance improvement as network size increases, while prior methods degrade.

TQL unlocks scaling of value functions in RL. Scaling results of TQL compared with prior approaches across critic model sizes from 0.4M to 26M parameters. While prior methods suffer from up to 10% average performance degradation when scaling up, TQL achieves a 43% improvement, demonstrating consistent and effective scaling.


Abstract

Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions — including with a transformer architecture, which is known to be highly scalable — often results in learning instability and worse performance. In this work, we ask: what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes, while prior methods suffer from performance degradation.


Key Ideas

🔍 Attention Collapse: We identify the critical failure mode when scaling transformers for value function learning — attention entropy collapses as model capacity increases, causing the model to attend to only a handful of tokens and producing non-smooth value surfaces.

🎯 Entropy-Guided Training: TQL introduces per-layer learnable temperature parameters to control the entropy of the attention scores toward a target value, preventing collapse and enabling stable training at scale (see the sketch after this list).

📈 Effective Scaling: While prior methods suffer from up to 10.6% average performance degradation when scaling up, TQL achieves a 43% improvement from the smallest (0.4M) to the largest (26M) model, demonstrating consistent and effective scaling.
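
The exact training objective is given in the paper; as a rough illustration, the following PyTorch sketch (our own, with hypothetical names such as EntropyGuidedAttention and target_entropy) shows one way a learnable per-layer temperature can scale the attention logits while an auxiliary loss pulls the attention entropy toward a target value.

```python
# Minimal sketch of entropy-guided attention (illustrative; not the authors' code).
# A learnable per-layer temperature scales the attention logits, and an auxiliary
# loss pushes the mean attention entropy toward a target value.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EntropyGuidedAttention(nn.Module):
    def __init__(self, dim: int, target_entropy: float):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-layer temperature, parameterized in log space so it stays positive.
        self.log_temp = nn.Parameter(torch.zeros(1))
        self.target_entropy = target_entropy
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor):
        # x: (batch, tokens, dim); single-head attention for brevity.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        # Higher temperature -> flatter (higher-entropy) attention.
        attn = F.softmax(logits / self.log_temp.exp(), dim=-1)
        # Shannon entropy of each token's attention distribution, averaged.
        entropy = -(attn * (attn + 1e-8).log()).sum(dim=-1).mean()
        # Auxiliary loss that pulls the entropy toward the target, preventing collapse.
        entropy_loss = (entropy - self.target_entropy) ** 2
        return self.proj(attn @ v), entropy_loss
```

In training, the per-layer entropy losses would be summed into the critic objective alongside the TD loss, so each temperature adjusts until its layer's attention is neither collapsed onto a few tokens nor washed out to uniform.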


Why Do Transformers Fail to Scale for Value Functions?

Contrary to typical scaling trends in language and vision, we observe a strong negative scaling pattern for transformer-based value functions: performance degrades with increased model size. To diagnose this, we analyze Q-value landscapes and attention distributions across network scales. Larger networks produce non-smooth value surfaces with high-frequency oscillations, and attention entropy decreases substantially, indicating increasingly peaked and brittle attention patterns.
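
Concretely, attention entropy here refers to the Shannon entropy of each query token's attention weights. The snippet below (our own illustration, not the paper's analysis code) shows how it can be measured from a layer's attention map: a near-uniform map has entropy close to log(T) for T tokens, while a collapsed map that attends to a single token has entropy near zero.

```python
# Measuring attention entropy to detect collapse (illustrative diagnostic).
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, heads, query_tokens, key_tokens), rows sum to 1 over the last dim."""
    ent = -(attn * (attn + 1e-8).log()).sum(dim=-1)  # entropy per query token
    return ent.mean()                                # average over batch, heads, queries

T = 16
uniform = torch.full((1, 1, T, T), 1.0 / T)   # entropy ~ log(T) ~ 2.77
collapsed = torch.eye(T).view(1, 1, T, T)     # entropy ~ 0
print(attention_entropy(uniform), attention_entropy(collapsed))
```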

Analysis showing entropy collapse and degraded Q-value landscapes in larger transformers.

Scaling transformers for value functions results in entropy collapse and worse performance. Left: Success rate and attention entropy across model sizes. Right: Q-value landscapes and attention maps for the smallest (0.4M) and largest (26M) models. The larger transformer learns highly non-smooth value surfaces with high-frequency oscillations absent in smaller models.


Scaling Results

We compare TQL against representative methods across critic sizes from 0.4M to 26M parameters. Across all generative model backbones (MLP, flow-matching, transformer), prior methods scale poorly with additional capacity. In contrast, TQL mitigates the attention collapse failure mode and achieves stable, consistent scaling with up to 43% performance improvement from smallest to largest model.

Scaling results across different model sizes.

Scaling results. Average success rate difference compared to the smallest model (0.4M) for each method. While baselines suffer from performance degradation at larger scales, TQL consistently scales well across all environments.


Benchmark Results

We evaluate TQL on the OGBench benchmark across 25 challenging continuous control tasks spanning five domains. TQL achieves the highest average performance on 4 out of 5 domains and the best average across all 25 tasks, demonstrating consistent improvements over a comprehensive set of offline RL baselines.

OGBench benchmark results.

OGBench evaluation results. TQL achieves the highest average performance on 4 out of 5 domains, as well as the best average performance across all 25 tasks. Bold values indicate performance within 95% of the best result per task.


Ablation Study

We analyze the key components of TQL: (1) entropy guidance prevents attention collapse, (2) automatic tuning toward a target outperforms fixed entropy penalties, and (3) layer-wise and token-wise temperatures allow each layer and the [VALUE] token to independently maintain appropriate entropy levels for stable training.
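
As one hedged reading of the layer-wise and token-wise design (the sketch below uses hypothetical names and assumes the [VALUE] readout token sits at a fixed index), each layer could hold a separate temperature per query position, so the [VALUE] token's attention entropy is controlled independently of the state and action tokens.

```python
# Sketch of a layer-wise, token-wise temperature (illustrative; indices and names assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenwiseTemperature(nn.Module):
    def __init__(self, num_tokens: int):
        super().__init__()
        # One log-temperature per query position in this layer; position 0 is
        # assumed to be the [VALUE] readout token in this sketch.
        self.log_temp = nn.Parameter(torch.zeros(num_tokens))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # logits: (batch, heads, query_tokens, key_tokens)
        temp = self.log_temp.exp().view(1, 1, -1, 1)  # broadcast over batch, heads, keys
        return F.softmax(logits / temp, dim=-1)
```

Per-token entropies computed from this softmax could then each be regularized toward the target, which is one way to read "automatic tuning toward a target" as opposed to adding a fixed entropy penalty with a hand-tuned coefficient.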

Ablation study results.

Component ablation. Each component contributes to TQL's overall performance.

Attention maps of ablations.

Attention maps. TQL achieves the most balanced attention across all tokens.


BibTeX

@misc{dong2026tqlscalingqfunctionstransformers,
      title={TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse},
      author={Perry Dong and Kuo-Han Hung and Alexander Swerdlow and Dorsa Sadigh and Chelsea Finn},
      year={2026},
      eprint={2602.01439},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.01439},
}