How Salesforce Migrated from Cluster Autoscaler to Karpenter Across Their Fleet of 1,000 EKS Clusters
These articles are AI-generated summaries. Please check the original sources for full details.
Salesforce operates one of the world’s most complex Kubernetes platforms, managing over 1,000 Amazon EKS clusters. Facing scalability and efficiency challenges with Cluster Autoscaler, its previous autoscaling approach, Salesforce migrated to Karpenter, an open-source Kubernetes node autoscaler built by AWS. This migration reduced scaling latency from minutes to seconds and improved node utilization.
Why This Matters
Traditional Kubernetes cluster scaling often relies on manually configured node groups and autoscaling rules, which becomes unsustainable at scale. Inefficient bin-packing and slow response to demand spikes can lead to wasted resources and degraded performance. Salesforce’s previous system suffered from these inefficiencies, which created operational bottlenecks, hindered innovation, and risked significant cost overruns.
Key Insights
- 1,000+ EKS clusters: Salesforce manages over 1,000 Amazon EKS clusters.
- Karpenter transition tool: Salesforce developed an in-house tool for safe and consistent migration to Karpenter.
- 5% cost savings: Salesforce achieved 5% cost savings in FY2026 through improved bin-packing and reduced idle capacity.
Working Example
metadata:
  name: m5.8xlarge-min-300-max-2500
data:
  k8s_instance_type: m6i.8xlarge
  k8s_root_volume_size: '100'
  k8s_root_volume_iops: '3000'
  k8s_root_volume_type: 'gp3'
  k8s_root_volume_throughput: '125'
  k8s_min_node_number: '300'
  k8s_max_node_number: '2500'
  multi_az_provisioned_workers: 'false'
  asg_launch_type: 'launch_template'
  gpu_enabled: 'false'
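The configuration above appears to describe a fixed node group: one instance type with minimum and maximum node counts. As a rough sketch of what mapping such a config onto Karpenter could look like, the Python below builds a NodePool-shaped manifest as a plain dictionary. It assumes the Karpenter `karpenter.sh/v1` NodePool schema. The function name `legacy_to_nodepool`, the EC2NodeClass name `default`, and the choice to express the node ceiling as a CPU limit are illustrative assumptions and do not represent Salesforce's in-house tool.

```python
def legacy_to_nodepool(name, data, vcpus_per_node):
    """Build a Karpenter NodePool-shaped dict from a legacy node-group config.

    Hypothetical helper for illustration only. `vcpus_per_node` is supplied
    by the caller (an m6i.8xlarge has 32 vCPUs).
    """
    max_nodes = int(data["k8s_max_node_number"])
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumed name of an EC2NodeClass that already exists.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": "default",
                    },
                    # Pin the pool to the instance type from the legacy config.
                    "requirements": [
                        {
                            "key": "node.kubernetes.io/instance-type",
                            "operator": "In",
                            "values": [data["k8s_instance_type"]],
                        }
                    ],
                }
            },
            # Approximate the old max node count as a total CPU ceiling.
            "limits": {"cpu": str(max_nodes * vcpus_per_node)},
        },
    }


legacy = {
    "k8s_instance_type": "m6i.8xlarge",
    "k8s_min_node_number": "300",
    "k8s_max_node_number": "2500",
}
pool = legacy_to_nodepool("m5.8xlarge-min-300-max-2500", legacy, vcpus_per_node=32)
print(pool["spec"]["limits"])
```

The sketch deliberately ignores `k8s_min_node_number`, since a NodePool sets upper limits rather than a minimum node count, and it leaves the root volume settings out because those would belong to the node class rather than the pool.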
Practical Applications
- Use Case: Salesforce enabled developers to self-define node pool requirements, accelerating infrastructure provisioning.
- Pitfall: Overly restrictive Pod Disruption Budgets (PDBs) can block node replacements during migration; proper PDB configuration is essential.
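To make the PDB pitfall concrete, here is a minimal sketch that estimates how many voluntary disruptions a PDB would allow, so budgets that permit zero evictions can be flagged before a migration. It is a simplified model of the budget arithmetic, not the Kubernetes controller itself, and the function names `resolve_count` and `disruptions_allowed` are hypothetical. Percentages are rounded up, which matches the documented Kubernetes behavior for both `minAvailable` and `maxUnavailable`.

```python
def resolve_count(value, total):
    """Turn an int or a percentage string such as "50%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        pct = int(value[:-1])
        # Integer ceiling of pct% of total, avoiding float rounding.
        return (pct * total + 99) // 100
    return int(value)


def disruptions_allowed(pdb_spec, expected_pods, healthy_pods):
    """Estimate how many pods could be evicted without violating the PDB."""
    if "minAvailable" in pdb_spec:
        desired_healthy = resolve_count(pdb_spec["minAvailable"], expected_pods)
    elif "maxUnavailable" in pdb_spec:
        unavailable = resolve_count(pdb_spec["maxUnavailable"], expected_pods)
        desired_healthy = max(0, expected_pods - unavailable)
    else:
        raise ValueError("PDB spec needs minAvailable or maxUnavailable")
    return max(0, healthy_pods - desired_healthy)


# A budget that requires every replica to stay up blocks all node drains.
blocking = disruptions_allowed({"minAvailable": "100%"}, expected_pods=3, healthy_pods=3)
# Allowing one pod to be unavailable lets nodes be replaced one pod at a time.
relaxed = disruptions_allowed({"maxUnavailable": 1}, expected_pods=3, healthy_pods=3)
print(blocking, relaxed)
```

A result of zero for a workload means any node hosting its pods cannot be drained voluntarily, which is the situation the pitfall describes.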