🔍 Attention Collapse: We identify the critical failure mode when scaling transformers for value function learning: attention entropy collapses as model capacity increases, causing the model to attend to only a handful of tokens and to produce non-smooth value surfaces (a diagnostic sketch for measuring this entropy follows the list below).
🎯 Entropy-Guided Training: TQL introduces per-layer learnable temperature parameters that steer the entropy of the attention distributions toward a target value, preventing collapse and enabling stable training at scale (see the temperature sketch after this list).
📈 Effective Scaling: While prior methods suffer up to a 10.6% average performance degradation when scaled up, TQL achieves a 43% improvement from the smallest (0.4M) to the largest (26M) model, demonstrating consistent and effective scaling.
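
To make the collapse diagnosis concrete, here is a minimal sketch (not from the TQL codebase; the tensor layout and the helper name `attention_entropy` are our own) that measures per-head attention entropy. Collapse shows up as this entropy falling toward zero, i.e. nearly one-hot attention rows:

```python
import torch

def attention_entropy(attn_weights: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn_weights: (batch, heads, queries, keys), rows summing to 1 post-softmax."""
    row_entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return row_entropy.mean(dim=(0, 2))  # mean entropy per head

# Uniform attention gives the maximum, log(num_keys); collapsed (near one-hot)
# attention gives entropy near 0.
attn = torch.softmax(torch.randn(2, 4, 8, 8), dim=-1)
print(attention_entropy(attn))  # 4 values, each in [0, log(8) ~= 2.08]
```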
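
And a minimal sketch of the entropy-guided idea, assuming a standard single-head attention layer; the parameter name `log_tau`, the squared-error entropy penalty, and the choice of half the maximum entropy as the target are illustrative assumptions, not the paper's exact formulation:

```python
import math
import torch
import torch.nn as nn

class TemperatureAttention(nn.Module):
    """Single-head attention with a per-layer learnable temperature and an
    auxiliary loss pulling attention entropy toward a target (illustrative)."""

    def __init__(self, dim: int, target_entropy: float):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.log_tau = nn.Parameter(torch.zeros(()))  # tau = exp(log_tau) > 0
        self.target_entropy = target_entropy
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Divide the logits by the learned temperature: larger tau gives
        # flatter (higher-entropy) attention, smaller tau gives sharper.
        logits = (q @ k.transpose(-2, -1)) * self.scale / self.log_tau.exp()
        attn = logits.softmax(dim=-1)
        # Mean entropy of the attention rows, penalized toward the target.
        entropy = -(attn * (attn + 1e-9).log()).sum(dim=-1).mean()
        entropy_loss = (entropy - self.target_entropy) ** 2
        return attn @ v, entropy_loss

# Usage: add each layer's entropy_loss (with a small weight) to the TD loss.
layer = TemperatureAttention(dim=32, target_entropy=0.5 * math.log(16))
out, ent_loss = layer(torch.randn(2, 16, 32))
print(out.shape, ent_loss.item())
```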