Nous Research Debuts Lighthouse Attention for 1.7× Faster Long-Context Pretraining

Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context

Researchers at Nous Research have developed Lighthouse Attention, a training-only method that addresses the quadratic compute scaling of traditional transformer attention. The system achieves a 1.40× to 1.69× end-to-end wall-clock speedup compared to cuDNN-backed SDPA baselines while maintaining or improving final training loss.

Why This Matters

Standard Scaled Dot-Product Attention (SDPA) scales quadratically with sequence length, which makes long-context training prohibitively expensive even with IO-aware tiling such as FlashAttention. Inference-time sparse methods often degrade performance or require custom kernels. Lighthouse Attention instead uses a symmetric pyramid pooling design that lets researchers keep using optimized dense-attention kernels during training, bypassing the Θ(N²) bottleneck for sequences of up to 1 million tokens.
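
To make the quadratic term concrete, the following is a rough back-of-the-envelope sketch, not a figure from the Nous Research work. It counts only the two large matrix multiplications in a non-causal attention forward pass (a causal mask roughly halves the count), and the head dimension of 128 is an assumed value chosen for illustration.

```python
def dense_attention_flops(n_tokens: int, head_dim: int) -> int:
    """Approximate forward FLOPs for one head of dense, non-causal attention.

    Counts Q @ K^T and P @ V, each about 2 * N^2 * d FLOPs.
    Softmax and masking are ignored, so this is an order-of-magnitude estimate.
    """
    return 4 * n_tokens * n_tokens * head_dim


HEAD_DIM = 128  # hypothetical head dimension, not taken from the article

base = dense_attention_flops(8_192, HEAD_DIM)
for n in (8_192, 65_536, 524_288, 1_048_576):
    ratio = dense_attention_flops(n, HEAD_DIM) // base
    print(f"N={n:>9,}  attention cost relative to 8K = {ratio:,}x")
```

Under this estimate, growing the context 128-fold (8K to 1M tokens) multiplies the attention cost by 128² = 16,384, which is the scaling that any sub-quadratic training scheme is trying to avoid.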

Key Insights

Lighthouse Attention runs forward passes 21× faster on NVIDIA B200 GPUs at 512K context by pooling queries, keys, and values symmetrically (Nous Research, 2026).
Symmetric pooling of Q, K, and V reduces the attention cost from Θ(N² d) to O(S² d), where N is the full sequence length, S is the length of the gathered dense sub-sequence, and d is the head dimension.
The chunked-bitonic top-K kernel acts as a stratified selector, which prevents the selection stage from collapsing onto narrow spans.
Two-stage training recovery: Models resume dense SDPA after Lighthouse training, crossing below the dense baseline loss by step 16,000 (~50.3B tokens).
Context Parallelism integration: Lighthouse scales to 1M tokens across 32 Blackwell GPUs using standard ring attention because the gathered sub-sequence is dense.
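
The stratified-selection idea in the list above can be illustrated with a small toy sketch. This is not the chunked-bitonic GPU kernel itself: it stands in a plain Python sort for the bitonic network, and the function names, chunk size, and scores are all hypothetical. It only shows why picking the top entries per chunk avoids the collapse that a single global top-K can suffer, and how the gathered length S feeds the O(S² d) cost.

```python
import random


def global_top_k(scores, k):
    """Indices of the k highest scores over the whole sequence, in position order."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)


def chunked_top_k(scores, chunk_size, k_per_chunk):
    """Indices of the top k_per_chunk scores inside each fixed-size chunk.

    Selecting per chunk guarantees every region of the sequence contributes,
    which is the stratification property described above.
    """
    selected = []
    for start in range(0, len(scores), chunk_size):
        chunk = range(start, min(start + chunk_size, len(scores)))
        best = sorted(chunk, key=lambda i: scores[i], reverse=True)[:k_per_chunk]
        selected.extend(sorted(best))  # keep original token order
    return selected


random.seed(0)
N = 64
scores = [random.random() for _ in range(N)]
for i in range(16, 24):  # inflate one narrow span to provoke collapse
    scores[i] += 10.0

collapsed = global_top_k(scores, k=8)
stratified = chunked_top_k(scores, chunk_size=16, k_per_chunk=2)

print("global top-K :", collapsed)    # every index falls inside 16..23
print("chunked top-K:", stratified)   # two indices from each of the four chunks

S = len(stratified)
print(f"dense cost ratio (N/S)^2 = {(N / S) ** 2:.0f}")
```

Because the selected indices form one short, ordered sub-sequence of length S, ordinary dense attention can run on it unchanged, which is consistent with the article's point that standard kernels and ring attention remain usable.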

Practical Applications

High-throughput Pretraining: Use Lighthouse for up to 1.69× end-to-end wall-clock speedup on long sequences. Pitfall: overly deep pyramids (L=5) can underperform shallower L=3 configurations.
Large-scale Context Processing: Scale to 1M tokens with context parallelism (CP degree 8). Pitfall: ignoring the 10% throughput overhead that ring rotation incurs during rotation-heavy phases.
Needle-in-a-Haystack Retrieval: Use larger k values (e.g., 2048) for high-accuracy information retrieval. Pitfall: a projection-norm scorer can reduce mean retrieval rates on retrieval-heavy tasks.
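
For planning purposes, the context-parallel numbers above can be turned into a quick budget check. The sketch below rests on several assumptions that the article does not state: that "1M tokens" means 2^20, that the sequence is sharded evenly across ranks, that the GPUs outside one CP group act as data-parallel replicas, and that the 10% overhead applies as a flat reduction in throughput.

```python
SEQ_LEN = 1_048_576    # "1M tokens", assumed here to mean 2**20
NUM_GPUS = 32          # Blackwell GPUs quoted in the article
CP_DEGREE = 8          # context-parallel degree quoted in the article
RING_OVERHEAD = 0.10   # ring-rotation throughput overhead quoted in the article

# Assumption: the sequence is split evenly across the ranks of one CP group.
tokens_per_rank = SEQ_LEN // CP_DEGREE

# Assumption: GPUs not needed for one CP group form data-parallel replicas.
replicas = NUM_GPUS // CP_DEGREE


def effective_throughput(ideal_tokens_per_s: float,
                         overhead: float = RING_OVERHEAD) -> float:
    """Throughput after subtracting a flat fractional communication overhead."""
    return ideal_tokens_per_s * (1.0 - overhead)


print(f"tokens per CP rank : {tokens_per_rank:,}")
print(f"CP groups (assumed): {replicas}")
print(f"throughput at a nominal 1,000 tok/s: {effective_throughput(1000.0):.0f} tok/s")
```

A check like this helps avoid the pitfall named above: a speedup projected from single-node measurements should be discounted by the ring-rotation overhead before it is used to size a 1M-token run.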

References:

https://www.marktechpost.com/2026/05/16/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context/
