Nous Research Debuts Lighthouse Attention for 1.7x Faster Long-Context Pretraining
These articles are AI-generated summaries. Please check the original sources for full details.
Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context
Researchers at Nous Research have developed Lighthouse Attention, a training-only method that addresses the quadratic compute scaling of standard transformer attention. The method achieves a 1.40× to 1.69× end-to-end wall-clock speedup over cuDNN-backed SDPA baselines while matching or lowering final training loss.
Why This Matters
The compute cost of standard Scaled Dot-Product Attention (SDPA) grows quadratically with sequence length N, which makes long-context training prohibitively expensive even with IO-aware tiling such as FlashAttention. Inference-time sparse methods often degrade quality or require custom kernels. Lighthouse Attention instead uses a symmetric pyramid pooling design that lets training run on optimized dense-attention kernels over a shorter gathered sub-sequence, sidestepping the Θ(N²) bottleneck for sequences of up to 1 million tokens.
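To make the scaling argument concrete, the back-of-the-envelope sketch below compares the dominant attention term for a dense pass over N tokens with a dense pass over a shorter sequence of S tokens. The function name, the head dimension, and the value of S are illustrative assumptions, not figures from the Nous Research report, and the model ignores any pooling, scoring, and selection overhead.

```python
def attention_term(seq_len: int, head_dim: int) -> int:
    """Dominant multiply-accumulate count for one head: QK^T plus the weighted sum over V."""
    return 2 * seq_len * seq_len * head_dim


N = 512 * 1024  # full context length (512K tokens)
S = 64 * 1024   # hypothetical shorter sequence seen by dense attention
D = 128         # hypothetical head dimension

dense = attention_term(N, D)
gathered = attention_term(S, D)

# The head dimension cancels, so the ratio is (N / S) ** 2.
ratio = dense // gathered
print(ratio)
```

The ratio depends only on (N/S)², which is why shortening the sequence that dense attention sees pays off so sharply at long context. Measured speedups can differ from this idealized figure because of the overheads the sketch leaves out.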
Key Insights
- Lighthouse Attention's symmetric pooling delivers 21× faster forward passes on NVIDIA B200 GPUs at 512K context (Nous Research, 2026).
- Pooling Q, K, and V symmetrically reduces the attention cost to O(S² d), where S is the length of the gathered dense sub-sequence.
- The chunked-bitonic top-K kernel acts as a stratified selector to prevent selection collapse onto narrow spans during the selection stage.
- Two-stage training recovery: Models resume dense SDPA after Lighthouse training, crossing below the dense baseline loss by step 16,000 (~50.3B tokens).
- Context Parallelism integration: Lighthouse scales to 1M tokens across 32 Blackwell GPUs using standard ring attention because the gathered sub-sequence is dense.
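The stratified-selection idea in the list above can be illustrated with a small sketch. It assumes stratification means giving each equal-sized chunk its own quota of winners; the function names and the toy scores are hypothetical, and plain sorting stands in for the bitonic kernel, so this is not the Nous Research implementation.

```python
def global_top_k(scores, k):
    """Indices of the k highest scores over the whole sequence, in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def stratified_top_k(scores, k, num_chunks):
    """Pick k // num_chunks winners inside each equal-sized chunk, in position order."""
    if len(scores) % num_chunks or k % num_chunks:
        raise ValueError("sequence length and k must divide evenly into chunks")
    chunk_len = len(scores) // num_chunks
    per_chunk = k // num_chunks
    selected = []
    for c in range(num_chunks):
        start = c * chunk_len
        local = sorted(range(start, start + chunk_len),
                       key=lambda i: scores[i], reverse=True)
        selected.extend(local[:per_chunk])
    return sorted(selected)


# 16 positions; the highest scores all sit in positions 4..7 (a narrow span).
scores = [0.1, 0.2, 0.1, 0.3,  9.0, 8.0, 7.0, 6.0,
          0.2, 0.5, 0.1, 0.4,  0.3, 0.1, 0.6, 0.2]

collapsed = global_top_k(scores, k=4)
spread = stratified_top_k(scores, k=4, num_chunks=4)
print(collapsed)  # [4, 5, 6, 7]
print(spread)     # [3, 4, 9, 14]
```

A global top-K spends its whole budget on the one high-scoring span, while the per-chunk quota keeps at least one position from every region of the sequence.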
Practical Applications
- High-throughput Pretraining: Use Lighthouse for up to 1.69× end-to-end wall-clock speedup on long sequences. Pitfall: overly deep pyramids (L=5) can underperform shallower L=3 configurations.
- Large-scale Context Processing: Scale to 1M tokens via context parallelism (CP degree 8). Pitfall: ignoring the 10% throughput overhead from ring rotation during rotation-heavy phases.
- Needle-in-a-Haystack Retrieval: Use larger k values (e.g., 2048) for high-accuracy information retrieval. Pitfall: a projection-norm scorer can reduce mean retrieval rates on retrieval-heavy tasks.
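To see where ring-rotation overhead comes from, the sketch below lays out a generic ring attention schedule at the CP degree cited above. The function name is hypothetical, and the rotation direction is one common convention rather than a detail confirmed by the source; it illustrates the communication pattern only and computes no attention.

```python
def ring_schedule(cp_degree: int):
    """For each step, list which rank's K/V block every rank is currently holding."""
    return [[(rank - step) % cp_degree for rank in range(cp_degree)]
            for step in range(cp_degree)]


CP_DEGREE = 8  # context-parallel degree cited in the list above
schedule = ring_schedule(CP_DEGREE)

# Every rank sees each K/V block exactly once across the full rotation.
for rank in range(CP_DEGREE):
    seen = sorted(step_blocks[rank] for step_blocks in schedule)
    assert seen == list(range(CP_DEGREE))

# Step 0 uses the local block, so a full pass needs cp_degree - 1 send/receive hops.
rotations = CP_DEGREE - 1
print(schedule[1])  # [7, 0, 1, 2, 3, 4, 5, 6]
```

Each hop moves a K/V block between neighboring ranks, so any hop that cannot be hidden behind computation shows up as lost throughput.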