Working on a cost-saving strategy involves looking at the problem from several different dimensions.
Overall Kubernetes costs can be split into compute, networking, storage, licensing, and SaaS costs.
In this part I will cover: Right-size infrastructure & use autoscaling
Right-size infrastructure & use autoscaling
Kubernetes Node Autoscaler
In a cloud environment the Kubernetes node autoscaler plays an important role in delivering just enough resources for your cluster. In a nutshell:
- it adds new nodes according to demand (scale up)
- it consolidates underutilized nodes (scale down)
It depends directly on Pod resource requests, so choosing the right requests is a key component of a cost-effective strategy.
For self-hosted solutions you might want to look into projects like Karpenter.
An alternative scaling strategy is to add or remove nodes on a schedule. With that approach you simply specify the time of day when you scale up and the time when you scale down, as in the sketch below.
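A minimal sketch of schedule-based node scaling, assuming the node group is backed by an AWS Auto Scaling group (the group name, sizes, and times are placeholders; other clouds offer equivalent scheduled-scaling features):

# Scale the node group up every weekday morning at 07:00 UTC
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name staging-nodes \
  --scheduled-action-name scale-up-workday \
  --recurrence "0 7 * * 1-5" \
  --min-size 2 --max-size 10 --desired-capacity 4

# Scale it back down to zero every weekday evening at 19:00 UTC
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name staging-nodes \
  --scheduled-action-name scale-down-evening \
  --recurrence "0 19 * * 1-5" \
  --min-size 0 --max-size 10 --desired-capacity 0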
Key Configuration Flags
--cloud-provider=aws: Specifies the cloud provider (e.g., aws, gce, azure, vsphere, etc.).
--nodes=1:5:<node-group-name>: Defines the minimum and maximum number of nodes for a node group.
--balance-similar-node-groups: Keeps the number of nodes balanced across similar node groups.
--skip-nodes-with-system-pods=false: Allows nodes running system pods to be considered for scale-down.
--scale-down-enabled=true: Enables the removal of underutilized nodes.
--scale-down-delay-after-add=10m: Prevents scale-down from happening immediately after a node is added; it waits for 10 minutes.
--scale-down-unneeded-time=10m: Defines how long a node must be underutilized before being removed.
--scale-down-utilization-threshold=0.5: If a node's utilization drops below 50%, it becomes a candidate for removal.
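Put together, these flags end up in the Cluster Autoscaler's container command. The snippet below is only a sketch of that part of the Deployment (the image tag and node-group name are placeholders, and a real installation also needs RBAC and a service account):

containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # pick the tag matching your Kubernetes version
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=1:5:my-node-group          # min:max:<node-group-name>
  - --balance-similar-node-groups
  - --skip-nodes-with-system-pods=false
  - --scale-down-enabled=true
  - --scale-down-delay-after-add=10m
  - --scale-down-unneeded-time=10m
  - --scale-down-utilization-threshold=0.5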
Horizontal Pod Autoscaling (HPA)
HPA works in conjunction with the node autoscaler: HPA scales pods horizontally by increasing the number of replicas of a Deployment, while the node autoscaler makes sure there are enough nodes (resources) available to schedule the new Pods.
Instead of setting a static replica count on a Deployment, it's recommended to use HPA for variable workloads such as web services whose load depends on traffic patterns.
To start using HPA you have to make sure Metrics Server is installed.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Verify the metrics server:
kubectl get apiservices | grep metrics
kubectl top nodes
kubectl top pods
Define an HPA that scales pods based on CPU for an nginx deployment (assuming you have one installed):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
- The deployment starts with 1 pod.
- If CPU utilization exceeds 50%, HPA scales up pods.
- If CPU utilization drops below 50%, HPA scales down pods.
- The number of pods is between 1 (min) and 10 (max).
Check the HPA status:
kubectl get hpa
Force CPU load test (optional):
kubectl run --rm -it load-generator --image=busybox -- /bin/sh -c "while true; do wget -q -O- http://nginx-app; done"
Check if pods scale up:
kubectl get pods -w
HPA with Memory-Based Autoscaling
Modify the metrics section in the HPA YAML:
metrics:
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 75   # Scale if memory usage exceeds 75%
HPA with Custom Metrics (Prometheus Adapter)
HPA can scale based on custom application metrics using Prometheus Adapter.
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: 100m
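For http_requests_per_second to appear in the custom metrics API, the Prometheus Adapter needs a rule that derives it from a counter the application actually exposes. A minimal sketch of such a rule (placed in the adapter's config.yaml), assuming the app exports a counter named http_requests_total with namespace and pod labels:

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  # Convert the cumulative counter into a per-second rate
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'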
Vertical Pod Autoscaling (VPA)
Some workloads consume more CPU or memory during peak hours and sit idle during low hours. In general, CPU/memory patterns can change over time, and that's where VPA can help adjust workloads to optimal resources based on historical usage data.
There are several modes in which VPA can work:
- Auto – currently equivalent to Recreate; starting with Kubernetes v1.27 (alpha) in-place updates may allow applying new requests without restarting the Pod.
- Recreate – evicts and recreates pods with the new resource requests.
- Initial – assigns resource requests only when pods are created and never changes them afterwards, so running pods are not restarted.
- Off – only provides resource recommendations without applying them.
VPA doesn't ship with Kubernetes; it lives in the kubernetes/autoscaler repository on GitHub.
Deploy VPA using Helm
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm repo update
helm install vpa fairwinds-stable/vpa --namespace kube-system
Alternatively, install it from the kubernetes/autoscaler repository:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Verify that VPA components are running:
kubectl get pods -n kube-system | grep vpa
VPA definition for the nginx deployment (assuming it is installed):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-app
  updatePolicy:
    updateMode: "Auto"   # Can be "Auto", "Off", or "Initial"
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2"
        memory: "4Gi"
      controlledResources: ["cpu", "memory"]
- VPA monitors real CPU and memory usage and adjusts requests dynamically.
- updateMode: Auto allows VPA to automatically restart pods with new resource requests.
- Min & max bounds (minAllowed & maxAllowed) prevent over-provisioning.
- VPA does not modify limits, only requests.
- Pods are restarted when VPA applies changes.
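Before letting VPA act in Auto mode, you can run it in Off mode for a while and inspect its recommendations:

kubectl describe vpa nginx-vpa

The status includes a recommendation per container with Lower Bound, Target, and Upper Bound values, which are also a good starting point for setting requests manually.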
Overcommitment strategy
The percentage of idle CPU and unused memory in a Kubernetes cluster is one of the biggest cost issues.
Assuming you are not using VPA in Auto or Recreate mode, this strategy is very helpful for utilizing cluster resources.
The problem: Over-Provisioning of Resource Requests
Applications typically go through different resource consumption stages:
- Bootstrap Stage – High CPU and memory usage for initialization (e.g., loading libraries, caching).
- Normal Run Stage – Uses only 2% of the resources needed during bootstrap.
- Peak Hours Stage – Spikes to 10-50% of bootstrap resource usage.
The workload might be burstable and follow the described pattern, so there are no ideal static resource parameters.
For example, if an application requests 1 CPU for bootstrap, but later runs on 20-100 millicores, 90-98% of the allocated CPU remains unused most of the time.
This isn't just an over-provisioning issue; it's natural application behavior. The bootstrap phase needs more resources for tasks like loading libraries and warming caches, but steady-state operation is far less compute-intensive.
The solution: Overcommitment of Resource Requests
The key is overcommitting resource requests while keeping limits high.
Since Kubernetes schedules based on requested resources, setting lower requests but higher limits allows 400-800% overcommitment without impacting performance. This enables better resource sharing across the cluster and reduces costs.
A common approach is to set CPU limits 4x to 8x higher than requests, ensuring efficient utilization while preventing unnecessary over-provisioning.
Example of overcommitment:
apiVersion: v1
kind: Pod
metadata:
  name: resource-optimized-pod
  namespace: default
spec:
  containers:
  - name: app-container
    image: app-image
    resources:
      requests:
        cpu: "250m"       # Requests 250 millicores (0.25 CPU)
        memory: "256Mi"   # Requests 256MB of memory
      limits:
        cpu: "2000m"      # Limits CPU to 2 cores (8x the request)
        memory: "2Gi"     # Limits memory to 2GB (8x the request)
When to use this strategy?
- Suitable for batch jobs, web servers, or API services that have variable traffic.
- Not ideal for stateful applications like databases, as they require steady resources.
- Works best when combined with Horizontal Pod Autoscaler (HPA) to scale based on demand.
Efficient namespace & quotas
- Enforce ResourceQuotas and LimitRanges to prevent over-provisioning of CPU, memory, and other resources.
- Implement PriorityClasses to ensure critical workloads get resources first.
Example of ResourceQuotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: dev-team
spec:
  hard:
    pods: "10"                     # Max 10 pods allowed in the namespace
    requests.cpu: "2"              # Total CPU requests across all pods must be ≤ 2 cores
    requests.memory: "4Gi"         # Total memory requests across all pods must be ≤ 4GB
    limits.cpu: "8"                # Max CPU limit across all pods is 8 cores
    limits.memory: "16Gi"          # Max memory limit across all pods is 16GB
    persistentvolumeclaims: "5"    # Max 5 PVCs can be created
- The dev-team namespace cannot exceed 10 pods.
- The total requested CPU cannot exceed 2 cores, and memory cannot exceed 4GB.
- Pods can burst up to 8 cores and 16GB memory in total.
- At most 5 PVCs (persistent volume claims) can be created.
Example of LimitRange
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: dev-team
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"        # Default limit: 0.5 CPU
      memory: "512Mi"    # Default limit: 512MB memory
    defaultRequest:
      cpu: "250m"        # If no request is specified, it defaults to 250m (0.25 CPU)
      memory: "256Mi"    # If no request is specified, it defaults to 256MB memory
    max:
      cpu: "2"           # Max CPU per container: 2 cores
      memory: "4Gi"      # Max memory per container: 4GB
    min:
      cpu: "100m"        # Min CPU request: 100m (0.1 CPU)
      memory: "128Mi"    # Min memory request: 128MB
- Any container that does not specify a request will be given 250m CPU & 256Mi memory by default.
- Any container that does not specify a limit will be given 500m CPU & 512Mi memory by default.
- Containers cannot exceed 2 CPU cores or 4GB memory.
- Containers must request at least 100m CPU and 128Mi memory.
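To see how much of the quota is already consumed and which defaults are applied, the two objects from the examples above can be inspected directly (names and namespace match the examples):

kubectl describe resourcequota namespace-quota -n dev-team
kubectl describe limitrange container-limits -n dev-team

The ResourceQuota output shows Used versus Hard for each tracked resource, which is handy for spotting teams that are close to their budget.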
Example of PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "This priority is for critical workloads like API services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
description: "This priority is for non-essential background tasks."
- The value field determines priority (higher is better).
- globalDefault: false ensures this priority is not applied to all pods by default.
- The high-priority class (100000) is for critical services, while the low-priority class (1000) is for batch jobs.
Now, assign priority classes to pods.
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
  namespace: production
spec:
  priorityClassName: high-priority   # Assign high priority
  containers:
  - name: app
    image: nginx
- If the cluster runs out of resources, Kubernetes evicts lower-priority pods first.
- If a high-priority pod is scheduled but there’s no room, Kubernetes preempts lower-priority pods to make space.
- Useful for ensuring critical apps always run, even in resource-constrained environments.
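To confirm which priority each pod actually received, the pod spec carries both the resolved priority value and the class name; a quick check could look like this (the namespace is taken from the example above):

kubectl get pods -n production -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority,CLASS:.spec.priorityClassName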
Choose the right instance type
Selecting the ideal instance type depends on your workload:
- Compute, Memory, or GPU Intensive? Optimize based on resource needs.
- Spot or Preemptible Instances? Ideal for non-critical workloads at lower cost.
- Reserved Instances? Cost-effective for long-term, stable workloads.
A balanced cluster may include multiple instance types:
- 30% Reserved Nodes – Prepaid nodes with guaranteed CPU and memory for critical workloads, ensuring stability for the next year.
- 70% Spot/Preemptible Nodes – 60-90% cheaper than on-demand, perfect for workloads using HPA with at least 5 replicas distributed across multiple nodes.
Your optimal configuration depends on workload demands, balancing cost savings with reliability.
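As an illustration of the Spot part of such a mix, the sketch below pins a stateless Deployment to Spot capacity and spreads its replicas across nodes so a single interruption doesn't take out most of them. The eks.amazonaws.com/capacityType label is EKS-specific; other providers and Karpenter use different labels:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT   # schedule onto Spot-backed nodes
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname    # spread replicas across nodes
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web-api
      containers:
      - name: web-api
        image: nginx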
Cluster consolidation for Cost Efficiency
Running separate clusters for Testing, Staging, and Production can lead to unnecessary overhead costs. Consolidating them into a single cluster can reduce infrastructure expenses, improve resource utilization, and simplify management.
Key Strategies for Consolidation:
- Multi-Tenancy with Namespaces – Instead of separate clusters, use namespaces to isolate environments while sharing resources.
- Resource Quotas & LimitRanges – Prevent resource overuse and ensure fair allocation within a shared cluster.
- Node Pools for Environment Isolation – Use different node groups for Testing, Staging, and Production to maintain stability while optimizing costs.
Scaling Down Non-Critical Environments:
- Scale down Testing & Staging environments to zero during weekends or non-working hours to cut costs (see the sketch after this list).
- Use Cluster Autoscaler and Scheduled Scaling to dynamically adjust resources when needed.
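A minimal sketch of that scheduled downscaling, assuming non-production workloads live in a staging namespace (these commands can be run from cron, a CI schedule, or a Kubernetes CronJob with suitable RBAC):

# End of the working week: scale every Deployment in staging to zero
kubectl scale deployment --all --replicas=0 -n staging

# Monday morning: restore replica counts from your manifests or GitOps repo,
# e.g. kubectl apply -f <your-staging-manifests>/

With no pods requesting resources, the node autoscaler (or a scheduled node-group action) can then drain the corresponding nodes down to zero.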
By consolidating clusters and optimizing scaling, you can significantly reduce infrastructure costs while maintaining performance and reliability.