Moonshot AI Introduces Attention Residuals to Optimize Transformer Scaling
These articles are AI-generated summaries. Please check the original sources for full details.
Moonshot AI Releases Attention Residuals to Replace Fixed Residual Mixing with Depth-Wise Attention for Better Scaling in Transformers
Moonshot AI has developed Attention Residuals (AttnRes) to replace the standard fixed residual accumulation found in modern Transformers. The new architecture achieves validation losses comparable to standard models trained with 25% more compute.
Why This Matters
Standard Transformer architectures suffer from PreNorm dilution: because residual connections add every layer's output with a fixed unit weight, hidden-state magnitudes grow with depth and each individual layer's contribution weakens. In practice, layers do not contribute equally. Information compressed into the single residual stream cannot be recovered later, and layers have no way to selectively access earlier representations. The result is a bottleneck that limits scaling efficiency and forces deeper layers to produce larger outputs to remain influential.
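The dilution effect can be illustrated with a toy simulation. The sketch below is not taken from the Moonshot AI work: it assumes each layer emits an independent random unit-norm vector (a simplification) and adds it to the residual stream with a fixed weight of 1. The function name `simulate_residual_norms` and all parameters are hypothetical.

```python
import math
import random


def simulate_residual_norms(num_layers, dim, seed=0):
    """Track the residual-stream norm under fixed unit-weight accumulation.

    Assumption: each layer output is an independent random direction with
    unit norm. Real layer outputs are correlated and learned.
    """
    rng = random.Random(seed)
    hidden = [0.0] * dim
    norms = []
    for _ in range(num_layers):
        # Hypothetical layer output: random vector rescaled to unit norm.
        out = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = math.sqrt(sum(x * x for x in out))
        out = [x / scale for x in out]
        # Standard residual update with a fixed weight of 1.
        hidden = [h + o for h, o in zip(hidden, out)]
        norms.append(math.sqrt(sum(h * h for h in hidden)))
    return norms


norms = simulate_residual_norms(num_layers=64, dim=256)
# Share of the stream that one unit-norm layer output represents at each depth.
relative_share = [1.0 / n for n in norms]
print(f"norm after layer 1:  {norms[0]:.2f}")
print(f"norm after layer 64: {norms[-1]:.2f}")
print(f"relative share at layer 64: {relative_share[-1]:.3f}")
```

Under these assumptions the stream norm grows roughly with the square root of depth, so a later layer must emit a larger output to have the same relative influence as an early one.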
Key Insights
- Moonshot AI's scaling-law experiments (2026) show Block AttnRes achieving lower validation loss than PreNorm baselines across all tested compute ranges.
- Selective access lets each layer aggregate specific earlier representations through softmax attention instead of reading a single compressed residual stream.
- Block AttnRes, used in Moonshot's Kimi Linear model (48B parameters), reduces depth-wise memory overhead from O(Ld) to O(Nd) by partitioning the L layers into N blocks, where d is the hidden dimension.
- Performance on the MMLU benchmark improved from 73.5 to 74.6 when AttnRes was integrated into an MoE architecture with 3B activated parameters.
- Initializing pseudo-query vectors to zero allows AttnRes to behave like equal-weight averaging at the start of training, preventing early instability.
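The selective-access and zero-initialization points above can be sketched in a few lines. This is a minimal illustration, not Moonshot AI's implementation: it assumes a layer's pseudo-query is dotted with each earlier layer output to produce scores, and that the softmax of those scores weights the same outputs. The names `depth_attention`, `pseudo_query`, and `earlier_outputs` are hypothetical, and normalization of the layer outputs is omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_attention(pseudo_query, earlier_outputs):
    """Mix earlier layer outputs with softmax weights over depth.

    Assumption: scores are plain dot products between the pseudo-query
    and each earlier output.
    """
    scores = [sum(q * x for q, x in zip(pseudo_query, out))
              for out in earlier_outputs]
    weights = softmax(scores)
    dim = len(pseudo_query)
    mixed = [sum(w * out[j] for w, out in zip(weights, earlier_outputs))
             for j in range(dim)]
    return weights, mixed


earlier_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Zero-initialized pseudo-query: every score is 0, so the softmax is uniform.
weights, mixed = depth_attention([0.0, 0.0], earlier_outputs)
print(weights)  # three equal weights of 1/3
print(mixed)    # the plain average of the earlier outputs

# A non-zero (hypothetical, "trained") pseudo-query shifts weight selectively.
trained_weights, _ = depth_attention([2.0, 0.0], earlier_outputs)
print(trained_weights)
```

With a zero pseudo-query the mixing reduces to equal-weight averaging, which matches the start-of-training behavior described in the last bullet. Any learned deviation from zero then lets a layer favor particular earlier representations.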
Practical Applications
- Large-scale MoE training (Kimi Linear + 1.4T tokens): Using Block AttnRes maintains training stability by keeping output magnitudes bounded, but failing to use block-level representations can lead to significant O(Ld) memory overhead in pipeline parallelism.
- High-reasoning tasks (Math/HumanEval evaluation): AttnRes improved Math scores from 53.5 to 57.1, though neglecting RMSNorm on layer outputs before attention can allow large-magnitude layers to dominate depth-wise weights.
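The RMSNorm caveat in the second item can be seen in a small example. The sketch below assumes depth-wise scores are dot products between a pseudo-query and each layer output, and compares raw outputs against RMS-normalized ones. The helper names and the example vectors are hypothetical and chosen only to make the effect visible.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Rescale a vector to unit root-mean-square (no learned gain)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_weights(pseudo_query, keys):
    """Softmax over dot-product scores between the query and each key."""
    scores = [sum(q * k for q, k in zip(pseudo_query, key)) for key in keys]
    return softmax(scores)


pseudo_query = [0.5, 0.5, 0.5, 0.5]
aligned_output = [1.0, 1.0, 1.0, 1.0]    # same direction as the query, small norm
large_output = [100.0, 0.0, 0.0, 0.0]    # poorly aligned, very large norm

# Without normalization the large-magnitude output wins on scale alone.
raw_weights = depth_weights(pseudo_query, [aligned_output, large_output])

# With RMSNorm applied first, scores depend on direction rather than scale.
normed_weights = depth_weights(
    pseudo_query, [rms_norm(aligned_output), rms_norm(large_output)]
)

print("raw:   ", [round(w, 3) for w in raw_weights])
print("normed:", [round(w, 3) for w in normed_weights])
```

In this toy case the unnormalized weights collapse almost entirely onto the large output, while the normalized weights favor the output that is actually aligned with the query.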