New llama.cpp Server Feature: Dynamic Model Management
These articles are AI-generated summaries. Please check the original sources for full details.
New in llama.cpp: Model Management
The llama.cpp server now supports “router mode,” allowing dynamic loading, unloading, and switching between multiple models without requiring server restarts. This addresses a frequent user request for Ollama-style model management within the llama.cpp ecosystem.
This feature improves resource utilization and uptime, as traditional model switching necessitates a full server restart, disrupting active requests and potentially causing downtime. In production environments, even brief interruptions can translate to significant financial losses or degraded user experience.
Key Insights
- Ollama-style management: Inspired by the popular Ollama framework, offering a familiar workflow.
- Multi-process architecture: Each model runs in its own process, isolating failures and improving stability.
- LRU eviction: Least Recently Used models are automatically unloaded when the maximum number of loaded models (
--models-max) is reached, freeing up VRAM.
Working Example
# Start the server in router mode (no model specified)
llama-server
# List available models
curl http://localhost:8080/models
# Manually load a model
curl -X POST http://localhost:8080/models/load \
-H "Content-Type: application/json" \
-d '{"model": "my-model.gguf"}'
Practical Applications
- A/B Testing: Run different model versions concurrently to compare performance on real-world data.
- Multi-tenant deployments: Serve multiple users or applications with different model requirements on a single server.
References:
Continue reading
Next article
OpenAI Introduces GPT-5.2: A Long Context Workhorse For Agents, Coding And Knowledge Work
Related Content
Nemotron 3 Nano - A new Standard for Efficient, Open, and Intelligent Agentic Models
NVIDIA’s Nemotron 3 Nano 30B A3B model achieves up to 3.3x higher throughput than leading models while maintaining best-in-class reasoning accuracy.
State.js: Implementing CSS-Driven Reactivity Without JavaScript Logic
State.js introduces a new mental model that transforms HTML attributes into live CSS variables to enable reactive UIs without a build step.
Generalist AI Introduces GEN-θ: A New Era of Embodied Foundation Models for Robotics
Generalist AI's GEN-θ is a groundbreaking embodied foundation model trained on real-world physical interaction data, enabling scalable robotics through Harmonic Reasoning and large-scale multimodal pre-training.