Cost saving strategy for Kubernetes platform

Working on a cost saving strategy involves looking at the problem from several different dimensions.
Overall Kubernetes costs can be split into compute costs, networking costs, storage costs, licensing costs and SaaS costs.
In this part I will cover: Right-size infrastructure & use autoscaling.

Right-size infrastructure & use autoscaling

Kubernetes Nodes autoscaler

In a cloud environment the Kubernetes Nodes autoscaler plays an important role in delivering just enough resources for your cluster. In a nutshell:

  • it adds new nodes according to demand – scale up
  • it consolidates underutilized nodes – scale down

It directly depends on Pod resource requests. Choosing the right requests is the key component of a cost-effective strategy.
For self-hosted solutions you might want to look into projects like Karpenter.
An alternative scaling strategy is to add or remove nodes on a schedule. With that approach you simply specify the time of day when you need to scale up and the time when you scale down.
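As an illustration, on AWS this kind of schedule-based node scaling can be implemented with scheduled actions on the node group's Auto Scaling group; the group name, times, and sizes below are placeholders, not values from a real cluster.

# Scale the node group up on weekday mornings (group name is a placeholder)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-eks-nodegroup-asg \
  --scheduled-action-name scale-up-workday \
  --recurrence "0 8 * * 1-5" \
  --min-size 3 --max-size 10 --desired-capacity 5

# Scale it back down in the evening
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-eks-nodegroup-asg \
  --scheduled-action-name scale-down-evening \
  --recurrence "0 20 * * 1-5" \
  --min-size 1 --max-size 10 --desired-capacity 1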

Key Configuration Flags

--cloud-provider=aws: Specifies the cloud provider (e.g., aws, gce, azure, vsphere, etc.).
--nodes=1:5:<node-group-name>: Defines the minimum and maximum number of nodes for a given node group.
--balance-similar-node-groups: Ensures that workloads are evenly distributed across similar node groups.
--skip-nodes-with-system-pods=false: Allows nodes running system (kube-system) pods to be removed when scaling down.
--scale-down-enabled=true: Enables the removal of underutilized nodes.
--scale-down-delay-after-add=10m: Prevents scale-down from happening immediately after a node is added; instead, it waits for 10 minutes.
--scale-down-unneeded-time=10m: Defines how long a node must be underutilized before being removed.
--scale-down-utilization-threshold=0.5: If a node’s utilization drops below 50%, it becomes a candidate for removal.
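Putting these flags together, a minimal sketch of the Cluster Autoscaler container arguments might look like the following; the image tag should match your cluster version, and the node group name is a placeholder (the full Deployment manifest is available in the cluster-autoscaler documentation).

containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # pick the tag matching your Kubernetes version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=1:5:my-eks-nodegroup          # placeholder node group name
      - --balance-similar-node-groups
      - --skip-nodes-with-system-pods=false
      - --scale-down-enabled=true
      - --scale-down-delay-after-add=10m
      - --scale-down-unneeded-time=10m
      - --scale-down-utilization-threshold=0.5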

Horizontal Pods Autoscaling (HPA)

HPA works in conjunction with the Nodes autoscaler: HPA scales workloads horizontally by increasing the number of replicas of a Deployment, while the Nodes autoscaler makes sure there are enough Nodes (resources) available to schedule the new Pods.

Instead of setting a static replica count on a Deployment, it's recommended to use HPA for services with variable load, such as web services whose resource needs depend on traffic patterns.

To start using HPA, make sure the Metrics Server is installed:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Verify the Metrics Server:

kubectl get apiservices | grep metrics
kubectl top nodes
kubectl top pods

Define an HPA that scales pods based on CPU for an nginx deployment (assuming you have one installed):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  • The deployment starts with 1 pod.
  • If CPU utilization exceeds 50%, HPA scales up pods.
  • If CPU utilization drops below 50%, HPA scales down pods.
  • The number of pods is between 1 (min) and 10 (max).

Check the HPA status:

kubectl get hpa

Force CPU load test (optional):

kubectl run --rm -it load-generator --image=busybox -- /bin/sh -c "while true; do wget -q -O- http://nginx-app; done"

Check if pods scale up:

kubectl get pods -w

HPA with Memory-Based Autoscaling

Modify the metrics section in HPA YAML:

metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75  # Scale if memory exceeds 75%

HPA with Custom Metrics (Prometheus Adapter)

HPA can scale based on custom application metrics using Prometheus Adapter.

metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100m
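Note that the Prometheus Adapter must be installed and configured to expose the metric through the custom metrics API. Assuming the application exports a counter named http_requests_total, a rule along these lines (following the adapter's configuration format) could turn it into http_requests_per_second; treat it as a sketch, not a drop-in config.

rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

Also keep in mind that the averageValue of 100m above means 0.1 requests per second per pod; adjust it to your real traffic.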

Vertical Pods Autoscaling (VPA)

Some workloads might consume more CPU or memory during peak hours and be idle during off-peak hours. In general, CPU/memory patterns may change over time, and that is where VPA can help adjust workloads to optimal resource requests based on historical usage data.

There are several modes in which VPA can work:

  • Auto – currently behaves like Recreate; once in-place Pod resource updates (alpha since Kubernetes v1.27) are available, VPA will be able to apply changes without restarting Pods.
  • Recreate – recreates pods with the new resource requests.
  • Initial – assigns resource requests only when pods are created; existing pods are never restarted.
  • Off – only provides resource recommendations.

The project doesn't ship with Kubernetes itself, but can be found in the kubernetes/autoscaler repository on GitHub.

Deploy VPA using Helm

helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm repo update
helm install vpa fairwinds-stable/vpa \
  --namespace kube-system

Alternatively, install it from the kubernetes/autoscaler repository:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

Verify that VPA components are running:

kubectl get pods -n kube-system | grep vpa

VPA definition for the nginx deployment (assuming it is installed):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-app
  updatePolicy:
    updateMode: "Auto"  # Can be "Auto", "Recreate", "Initial", or "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: nginx
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
        controlledResources: ["cpu", "memory"]
  • VPA monitors real CPU and memory usage and adjusts requests dynamically.
  • updateMode: Auto allows VPA to automatically restart pods with new resource requests.
  • Min & Max limits (minAllowed & maxAllowed) prevent over-provisioning.
  • By default VPA adjusts resource requests; limits are only changed proportionally to preserve the original request-to-limit ratio.
  • Pods are restarted when VPA applies changes.
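To see what VPA is actually recommending (especially useful in Off mode before enabling automatic updates), describe the object:

kubectl describe vpa nginx-vpa

The status shows lower bound, target, and upper bound estimates for each container.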

Overcommitment strategy

Idle CPU and unused memory are among the biggest sources of waste in a Kubernetes cluster.

Assuming you are not using VPA in Auto or Recreate mode, this strategy will be very helpful for utilizing cluster resources.

The problem: Over-Provisioning of Resource Requests

Applications typically go through different resource consumption stages:

  1. Bootstrap Stage – High CPU and memory usage for initialization (e.g., loading libraries, warming caches).
  2. Normal Run Stage – Uses only around 2% of the resources needed during bootstrap.
  3. Peak Hours Stage – Spikes to 10-50% of the bootstrap resource usage.

The workload might be burstable and follow the pattern described above, so there is no single ideal set of resource parameters.

For example, if an application requests 1 CPU for bootstrap but later runs on 20-100 millicores, 90-98% of the allocated CPU remains unused most of the time.

This isn't just an over-provisioning issue; it is natural application behavior. The bootstrap phase needs extra resources to load libraries, warm caches, and so on, while steady-state operation is far less compute intensive.

The solution: Overcommitment of Resource Requests

The key is overcommitting resource requests while keeping limits high.

Since the Kubernetes scheduler places Pods based on their resource requests, setting lower requests with higher limits allows 400-800% overcommitment without impacting performance. This enables better resource sharing across the cluster and reduces costs.

A common approach is to set CPU limits 4x to 8x higher than requests, ensuring efficient utilization while preventing unnecessary over-provisioning.

Example of overcommitment:

apiVersion: v1
kind: Pod
metadata:
  name: resource-optimized-pod
  namespace: default
spec:
  containers:
    - name: app-container
      image: app-image
      resources:
        requests:
          cpu: "250m"    # Requests 250 millicores (0.25 CPU)
          memory: "256Mi" # Requests 256MB of memory
        limits:
          cpu: "2000m"   # Limits CPU to 2 cores (8x of 250m)
          memory: "2Gi"   # Limits Memory to 2GB (8x of 256Mi)

When to use this strategy?

  • Suitable for batch jobs, web servers, or API services that have variable traffic.
  • Not ideal for stateful applications like databases, as they require steady resources.
  • Works best when combined with Horizontal Pod Autoscaler (HPA) to scale based on demand.

Efficient namespace & quotas

  • Enforce ResourceQuotas and LimitRanges to prevent over-provisioning of CPU, memory and other resources.
  • Implement PriorityClasses to ensure critical workloads get resources first.

Example of ResourceQuotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: dev-team
spec:
  hard:
    pods: "10"                # Max 10 pods allowed in the namespace
    requests.cpu: "2"         # Total CPU request across all pods must be ≤ 2 cores
    requests.memory: "4Gi"    # Total memory request across all pods must be ≤ 4GB
    limits.cpu: "8"           # Max CPU usage across all pods is 8 cores
    limits.memory: "16Gi"     # Max memory usage across all pods is 16GB
    persistentvolumeclaims: "5" # Max 5 PVCs can be created
  • The namespace dev-team cannot exceed 10 pods.
  • The total requested CPU cannot exceed 2 cores, and memory cannot exceed 4GB.
  • Pods can burst up to 8 cores and 16GB memory in total.
  • Limits how many PVCs (persistent volume claims) can be created.
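Once the quota is applied, current usage against the hard limits can be checked with:

kubectl describe resourcequota namespace-quota -n dev-team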

Example of LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: dev-team
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"           # Default request: 0.5 CPU
        memory: "512Mi"       # Default request: 512MB memory
      defaultRequest:
        cpu: "250m"           # If no request is specified, it defaults to 250m (0.25 CPU)
        memory: "256Mi"       # If no request is specified, it defaults to 256MB memory
      max:
        cpu: "2"              # Max CPU per container: 2 cores
        memory: "4Gi"         # Max memory per container: 4GB
      min:
        cpu: "100m"           # Min CPU request: 100m (0.1 CPU)
        memory: "128Mi"       # Min memory request: 128MB
  • Any container that does not specify a request will be given 250m CPU & 256Mi memory by default.
  • Any container that does not specify a limit will be given 500m CPU & 512Mi memory by default.
  • Containers cannot exceed 2 CPU cores or 4GB memory.
  • Containers must request at least 100m CPU and 128Mi memory.

Example of PriorityClass

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "This priority is for critical workloads like API services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
description: "This priority is for non-essential background tasks."
  • value field determines priority (higher is better).
  • globalDefault: false ensures this priority is not applied to all pods by default.
  • The high-priority class (100000) is for critical services, while the low-priority class (1000) is for batch jobs.

Now, assign priority classes to pods.

apiVersion: v1
kind: Pod
metadata:
  name: critical-app
  namespace: production
spec:
  priorityClassName: high-priority  # Assign high priority
  containers:
    - name: app
      image: nginx
  • If the cluster runs out of resources, Kubernetes evicts lower-priority pods first.
  • If a high-priority pod is scheduled but there’s no room, Kubernetes preempts lower-priority pods to make space.
  • Useful for ensuring critical apps always run, even in resource-constrained environments.

Choose the right instance type

Selecting the ideal instance type depends on your workload:

  • Compute, Memory, or GPU Intensive? Optimize based on resource needs.
  • Spot or Preemptible Instances? Ideal for non-critical workloads at lower cost.
  • Reserved Instances? Cost-effective for long-term, stable workloads.

A balanced cluster may include multiple instance types:

  1. 30% Reserved Nodes – Prepaid nodes with guaranteed CPU and memory for critical workloads, ensuring stability for the next year.
  2. 70% Spot/Preemptible Nodes – 60-90% cheaper than on-demand, perfect for workloads using HPA with at least 5 replicas distributed across multiple nodes.

Your optimal configuration depends on workload demands, balancing cost savings with reliability.
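As an illustration only, assuming AWS EKS managed with eksctl, a mixed layout of on-demand (reserved-covered) and Spot node groups might be sketched like this; the cluster name, instance types, and sizes are placeholders.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cost-optimized-cluster       # placeholder cluster name
  region: us-east-1
managedNodeGroups:
  - name: on-demand-critical         # capacity covered by Reserved Instances / Savings Plans
    instanceTypes: ["m5.xlarge"]
    minSize: 2
    maxSize: 4
    desiredCapacity: 3
  - name: spot-general               # Spot capacity for HPA-scaled, replicated workloads
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
    spot: true
    minSize: 0
    maxSize: 20
    desiredCapacity: 5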

Clusters consolidation for Cost Efficiency

Running separate clusters for Testing, Staging, and Production can lead to unnecessary overhead costs. Consolidating them into a single cluster can reduce infrastructure expenses, improve resource utilization, and simplify management.

Key Strategies for Consolidation:

  1. Multi-Tenancy with Namespaces – Instead of separate clusters, use namespaces to isolate environments while sharing resources.
  2. Resource Quotas & LimitRanges – Prevent resource overuse and ensure fair allocation within a shared cluster.
  3. Node Pools for Environment Isolation – Use different node groups for Testing, Staging, and Production to maintain stability while optimizing costs (see the sketch after this list).
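A minimal sketch of node-pool isolation, assuming the staging node group carries the label env=staging and the taint env=staging:NoSchedule (both names are assumptions for illustration): staging workloads select those nodes and tolerate the taint, so they never land on Production capacity.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-staging                  # placeholder workload
  namespace: staging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-staging
  template:
    metadata:
      labels:
        app: web-staging
    spec:
      nodeSelector:
        env: staging                 # schedule only onto the staging node pool
      tolerations:
        - key: "env"
          operator: "Equal"
          value: "staging"
          effect: "NoSchedule"       # tolerate the taint that keeps other workloads off these nodes
      containers:
        - name: web
          image: nginx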

Scaling Down Non-Critical Environments:

  • Scale down Testing & Staging environments to zero during weekends or non-working hours to cut costs (see the sketch after this list).
  • Use Cluster Autoscaler and Scheduled Scaling to dynamically adjust resources when needed.
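A minimal sketch of in-cluster scheduled scale-down: a CronJob that scales all Deployments in the staging namespace to zero on Friday evening. The service account, its RBAC, and the matching scale-up job are assumed and omitted here.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-staging
  namespace: staging
spec:
  schedule: "0 20 * * 5"                 # Friday 20:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: env-scaler # placeholder; needs RBAC to scale deployments
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - kubectl scale deployment --all --replicas=0 -n staging

Once the Pods are gone, the Cluster Autoscaler can remove the now-empty nodes.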

By consolidating clusters and optimizing scaling, you can significantly reduce infrastructure costs while maintaining performance and reliability.
