Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First Class Targets for on Device LLMs
These articles are AI-generated summaries. Please check the original sources for full details.
Google and MediaTek’s new LiteRT NeuroPilot Accelerator streamlines on-device AI, letting generative models run directly on phones, laptops, and IoT devices instead of depending on constant round trips to a data center. It integrates the LiteRT runtime with MediaTek’s NeuroPilot NPU stack and offers a unified API for deploying LLMs and embedding models.
Why This Matters
Historically, on-device ML relied heavily on CPUs and GPUs, while NPUs required fragmented, vendor-specific tools and complex debugging. This fragmentation resulted in a combinatorial explosion of binaries and significant development overhead, increasing costs and time-to-market for deploying models on diverse hardware.
Key Insights
- AOT Compilation Recommendation: On-device compilation of models like Gemma-3-270M can take over a minute, making Ahead-of-Time (AOT) compilation the practical choice for production LLM deployments.
- Unified API: LiteRT NeuroPilot provides a single Accelerator.NPU abstraction, simplifying code and reducing conditional logic for targeting different hardware backends (CPU, GPU, NPU).
- Zero-Copy Buffers: LiteRT integrates with Android’s AHardwareBuffer and GPU buffers, enabling zero-copy tensor transfers for performance-critical tasks like real-time video processing.
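To make the "less conditional logic" point concrete, here is a minimal sketch of backend selection with fallback. The enum and both helper functions are hypothetical names invented for illustration; they are not the LiteRT or NeuroPilot API, which handles this inside the runtime.

```cpp
#include <cstdint>
#include <string>

// Hypothetical accelerator flags for illustration only (NOT the LiteRT API).
enum Accelerator : std::uint32_t {
  kCpu = 1u << 0,
  kGpu = 1u << 1,
  kNpu = 1u << 2,
};

// Returns the most specialized backend that is both requested and available,
// preferring NPU, then GPU, then CPU. Returns 0 when nothing matches.
inline std::uint32_t PickBackend(std::uint32_t requested,
                                 std::uint32_t available) {
  const std::uint32_t usable = requested & available;
  if (usable & kNpu) return kNpu;
  if (usable & kGpu) return kGpu;
  if (usable & kCpu) return kCpu;
  return 0;
}

// Human-readable name for a single backend flag.
inline std::string BackendName(std::uint32_t backend) {
  switch (backend) {
    case kNpu: return "NPU";
    case kGpu: return "GPU";
    case kCpu: return "CPU";
    default:   return "none";
  }
}
```

Application code then states its preferences once, for example PickBackend(kNpu | kGpu | kCpu, available_flags), rather than branching on each vendor or chipset.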
Working Example
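// Sketch against the LiteRT C++ API (namespace litert). Assumes `env` holds an
// already-created Environment and that input_span / output_span are float spans
// sized to the model's tensors. Error checks on the returned results are omitted.
// Load a model compiled for the NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
// Create the compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);
// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
(*input_buffers)[0].Write<float>(input_span);
compiled->Run(*input_buffers, *output_buffers);
(*output_buffers)[0].Read<float>(output_span);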
Practical Applications
- Vivo X300 Pro: Achieves over 1600 tokens per second in prefill and 28 tokens per second in decode with Gemma-3n E2B on the Dimensity 9500 NPU.
- Pitfall: Compiling larger LLMs on the device adds significant startup latency (over a minute even for Gemma-3-270M, as noted above), which hurts the user experience. Production deployments should ship AOT-compiled models instead.
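To put the throughput figures above in perspective, the following sketch turns them into a rough response-time estimate. It assumes a deliberately simple model: prompt tokens are processed at the prefill rate and generated tokens at the decode rate, with no compilation, model loading, or scheduling overhead. The struct and function names are illustrative, and the 1600 tokens-per-second constant is a floor, since the source says "over 1600".

```cpp
#include <cassert>

// Throughput quoted for Gemma-3n E2B on the Dimensity 9500 NPU.
constexpr double kPrefillTokensPerSec = 1600.0;  // "over 1600" in the source
constexpr double kDecodeTokensPerSec = 28.0;

struct LatencyEstimate {
  double prefill_seconds;  // time to ingest the prompt
  double decode_seconds;   // time to generate the output
  double total_seconds;    // sum of the two phases
};

// Estimates latency as prompt_tokens / prefill_rate + output_tokens / decode_rate.
// This ignores model load and compilation time (an assumption of this sketch).
inline LatencyEstimate EstimateLatency(
    int prompt_tokens, int output_tokens,
    double prefill_rate = kPrefillTokensPerSec,
    double decode_rate = kDecodeTokensPerSec) {
  assert(prompt_tokens >= 0 && output_tokens >= 0);
  assert(prefill_rate > 0.0 && decode_rate > 0.0);
  const double prefill = prompt_tokens / prefill_rate;
  const double decode = output_tokens / decode_rate;
  return {prefill, decode, prefill + decode};
}
```

Under these assumptions, an 800-token prompt followed by 140 generated tokens takes about 0.5 s of prefill plus 5 s of decode, so decode speed dominates the time the user waits for a full answer.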