Backup and restore of an etcd cluster

A Kubernetes disaster recovery plan usually consists of backing up the etcd cluster and keeping infrastructure as code to provision a new set of servers in the cloud. Let’s look at the first part: backing up etcd in two basic and easy ways.

Etcd backup

The only stateful component of a Kubernetes cluster is the etcd server. The etcd server is where Kubernetes stores all API objects and configuration.
Backing up this storage is sufficient for complete recovery of the Kubernetes cluster state.

Backup with etcdctl

etcdctl is a command-line tool for managing an etcd server and its data.
The command to make a backup is:

Making a backup

ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db

The command to restore a snapshot is:

ETCDCTL_API=3 etcdctl snapshot restore snapshot.db

Note: For HTTPS endpoints you might need to specify the paths to the CA certificate, client certificate and client key so that etcdctl can access the etcd server API.
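As a sketch of the note above, the snapshot command with TLS flags could be wrapped like this. The certificate paths are the kubeadm defaults and are an assumption (override ETCD_PKI_DIR for your cluster), and etcd_snapshot_save is a hypothetical helper name used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: save an etcd snapshot from a TLS-secured endpoint.
# ASSUMPTION: certificates live in the kubeadm default directory;
# set ETCD_PKI_DIR if your cluster stores them elsewhere.
ETCD_PKI_DIR="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}"

# etcd_snapshot_save <endpoint> <target file>   (hypothetical helper)
etcd_snapshot_save() {
  local endpoint="$1" target="$2"
  ETCDCTL_API=3 etcdctl \
    --endpoints "$endpoint" \
    --cacert "$ETCD_PKI_DIR/ca.crt" \
    --cert "$ETCD_PKI_DIR/server.crt" \
    --key "$ETCD_PKI_DIR/server.key" \
    snapshot save "$target"
}

# Usage (not executed here):
#   etcd_snapshot_save https://127.0.0.1:2379 snapshot.db
```

The three flags --cacert, --cert and --key are global etcdctl options, so they can be passed to other subcommands in the same way.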

Store backup at remote storage

It’s important to back up data to remote storage such as S3. This guarantees that a copy of the etcd data will be available even if the control plane volume is inaccessible or corrupted.

  • Make an S3 bucket.
  • Copy snapshot.db to S3 under a new filename.
  • Set up S3 object expiration to clean up old backup files.
# new s3 bucket for etcd backups
aws s3 mb s3://etcd-backup
# define a backup filename based on current date and time
filename=$(date +%F-%H-%M).db
aws s3 cp ./snapshot.db "s3://etcd-backup/etcd-data/$filename"
# set backup lifecycle configuration for backup file rotation
aws s3api put-bucket-lifecycle-configuration --bucket etcd-backup --lifecycle-configuration file://lifecycle.json
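The list above asks for object expiration to clean up old backups. A minimal sketch of generating such a lifecycle.json follows; the write_expiration_policy helper, the rule ID and the 30-day retention in the usage line are illustrative assumptions, not values taken from this setup.

```shell
#!/usr/bin/env bash
# Sketch: write an S3 lifecycle policy that deletes backups under
# etcd-data/ once they are older than the given number of days.
# write_expiration_policy <days> <output file>   (hypothetical helper)
write_expiration_policy() {
  local days="$1" out="$2"
  cat > "$out" <<EOF
{
  "Rules": [
    {
      "ID": "Expire old etcd backups",
      "Filter": { "Prefix": "etcd-data/" },
      "Status": "Enabled",
      "Expiration": { "Days": ${days} }
    }
  ]
}
EOF
}

# Usage (not executed here; 30 days is an assumed retention period):
#   write_expiration_policy 30 lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration --bucket etcd-backup \
#       --lifecycle-configuration file://lifecycle.json
```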

Example of a lifecycle.json which transitions backups to S3 Glacier:

{
    "Rules": [
        {
            "ID": "Move rotated backups to Glacier",
            "Prefix": "etcd-data/",
            "Status": "Enabled",
            "Transitions": [
                {
                    "Date": "2015-11-10T00:00:00.000Z",
                    "StorageClass": "GLACIER"
                }
            ]
        },
        {
            "Status": "Enabled",
            "Prefix": "",
            "NoncurrentVersionTransitions": [
                {
                    "NoncurrentDays": 2,
                    "StorageClass": "GLACIER"
                }
            ],
            "ID": "Move old versions to Glacier"
        }
    ]
}

Simplify etcd backup with Velero

Velero is a powerful Kubernetes backup tool. It simplifies many operational tasks.
With Velero it’s easier to:

  • Choose what to back up (objects, volumes or everything)
  • Choose what NOT to back up (e.g. secrets)
  • Schedule cluster backups
  • Store backups on remote storage
  • Recover quickly after a disaster

Install and configure Velero

1)Download the latest version from the Velero GitHub page

2)Create an AWS credentials file (the install command in step 4 expects it at ./aws-iam-creds):

[default]
aws_access_key_id=<your AWS access key ID>
aws_secret_access_key=<your AWS secret access key>

3)Create an S3 bucket for the backups

aws s3 mb s3://kubernetes-velero-backup-bucket

4)Install Velero into the Kubernetes cluster:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.0.0 \
    --bucket kubernetes-velero-backup-bucket \
    --secret-file ./aws-iam-creds \
    --backup-location-config region=us-east-1 \
    --snapshot-location-config region=us-east-1

Note: we use the AWS plugin to access S3 remote storage. Velero supports many different storage providers, so pick the one that works best for you.

Schedule automated backups

1)Schedule daily backups (the cron expression below runs at 07:00 every day):

velero schedule create <SCHEDULE NAME> --schedule "0 7 * * *"

2)Create a backup manually:

velero backup create <BACKUP NAME>
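The earlier list mentions choosing what not to back up. As a hedged sketch, a schedule that skips Secrets and keeps each backup for 30 days might look like this; create_daily_schedule is a hypothetical wrapper, and the retention period is an assumption rather than a recommendation.

```shell
#!/usr/bin/env bash
# Sketch: daily Velero schedule at 07:00 that excludes Secret objects
# and expires each backup after 720 hours (30 days, an assumed value).
# create_daily_schedule <schedule name>   (hypothetical helper)
create_daily_schedule() {
  local name="$1"
  velero schedule create "$name" \
    --schedule "0 7 * * *" \
    --exclude-resources secrets \
    --ttl 720h0m0s
}

# Usage (not executed here):
#   create_daily_schedule etcd-daily
```

Keep in mind that a cluster restored from such a backup will be missing its Secrets, so they have to be recreated from another source.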

Disaster Recovery with Velero

Note: You might need to re-install Velero in case of full etcd data loss.

Once Velero is up, the disaster recovery process is simple and straightforward:

1)Update your backup storage location to read-only mode, so that no backup objects are created or deleted in it during the restore:

kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
    --namespace velero \
    --type merge \
    --patch '{"spec":{"accessMode":"ReadOnly"}}'

By default, <STORAGE LOCATION NAME> is expected to be named default; the name can be changed by specifying --default-backup-storage-location on the Velero server.

2)Create a restore with your most recent Velero Backup:

velero restore create --from-backup <SCHEDULE NAME>-<TIMESTAMP>

3)When ready, revert your backup storage location to read-write mode:

kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
   --namespace velero \
   --type merge \
   --patch '{"spec":{"accessMode":"ReadWrite"}}'


  • A Kubernetes cluster with infrequent changes to the API server is a great candidate for a single control plane setup.
  • Frequent backups of the etcd cluster minimize the time window of potential data loss.

Having fun with Kubernetes deployment

Install deployment

Install nginx 1.12.2 with 2 pods

If you need to have 2 pods from the start, it can be done in three easy steps:

  • Create deployment template with nginx version 1.12.2
  • Edit nginx.yaml to update replicas count.
  • Apply deployment template to Kubernetes cluster
# step 1
kubectl create deployment nginx --save-config=true --image=nginx:1.12.2 --dry-run=client -o yaml > nginx.yaml
# step 2: open nginx.yaml in your editor and set spec.replicas to 2
"${EDITOR:-vi}" nginx.yaml
# step 3
kubectl apply --record=true -f nginx.yaml

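Note the use of --record=true, which saves the command that caused the deployment change in the resource’s kubernetes.io/change-cause annotation. The --record flag is deprecated in newer kubectl releases.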

Auto-scaling deployment

Deployments can be scaled manually or automatically. Let’s see how it can be done with a few simple commands.

Scaling manually up to 4 pods

kubectl scale deployment nginx --replicas=4 --record=true

Scaling manually down to 2 pods

kubectl scale deployment nginx --replicas=2 --record=true

Automatically scale up and down

Automatically scale up to 4 pods and down to 2 pods based on cpu usage

kubectl autoscale deployment nginx --min=2 --max=4

You can adjust when to scale up or down with the --cpu-percent flag (e.g. --cpu-percent=80).
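A CPU-percent target is measured against each container’s CPU request, so the autoscaler needs a request to be set (and a metrics source such as Metrics server to be running). The following is a minimal sketch under those assumptions; enable_cpu_autoscaling is a hypothetical helper and the 100m request is an illustrative value.

```shell
#!/usr/bin/env bash
# Sketch: give the deployment a CPU request, then autoscale on CPU usage.
# enable_cpu_autoscaling <deployment name>   (hypothetical helper)
enable_cpu_autoscaling() {
  local deployment="$1"
  # ASSUMPTION: 100m is a placeholder request; size it for your workload.
  kubectl set resources deployment "$deployment" --requests=cpu=100m &&
    kubectl autoscale deployment "$deployment" --min=2 --max=4 --cpu-percent=80
}

# Usage (not executed here):
#   enable_cpu_autoscaling nginx
#   kubectl get hpa nginx
```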


How to check Kubernetes cluster health?

There are some essential tools for taking a quick look at Kubernetes cluster health. Let’s review them here. With them you will be able to quickly tell whether the cluster has any obvious issues.

Install node problem detector

node-problem-detector aims to make various node problems visible to the upstream layers in the cluster management stack. It is a daemon which runs on each node, detects node problems and reports them to the apiserver.

kubectl apply -f

Use node-problem-detector in conjunction with a drainer daemon to quickly replace unhealthy nodes. Learn more about it at Monitor Node Health.

Kubernetes cluster info

To see whether kubectl can connect to the master, whether the master is running and on which port, use kubectl cluster-info. To debug cluster state use kubectl cluster-info dump. It prints the full cluster state, including pod logs, to stdout; use --output-directory to write the dump to a directory instead.

kubectl cluster-info

Kubernetes master is running at
KubeDNS is running at
Metrics-server is running at

Nodes information

Get extended output of node information. Pay attention to the STATUS, ROLES, AGE and IP columns. Check that the IP addresses are the ones that work in your network and that the nodes are able to communicate with each other. Node age is also a kind of uptime for a node: it can tell you whether nodes are stable enough, which is very useful if you use spot instances.

kubectl get nodes -o wide

master    Ready    master   11h   v1.18.2     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8
worker1   Ready    <none>   11h   v1.18.2     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8
worker2   Ready    <none>   11h   v1.18.2     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8

API Component statuses

The status of the most important components of a Kubernetes cluster, apart from the apiserver, can be retrieved with the get componentstatuses command. Note that componentstatuses is deprecated since Kubernetes v1.19.

kubectl get componentstatuses
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok                  
controller-manager   Healthy   ok                  
etcd-0               Healthy   {"health":"true"} 

Pods statuses

Checking for pods that are not running, with extended output, can help you understand whether the failed pods have anything in common, for example that they are all on the same node or all belong to the same availability zone.

kubectl get pods -o wide --all-namespaces |grep -v " Running "
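To make a shared failing node easier to spot, the non-running pods can be counted per node. This is a rough sketch: count_not_running_by_node is a hypothetical helper, and it works on the pod phase, which is coarser than the STATUS column (a crash-looping pod still reports phase Running), so treat it as a complement to the grep above rather than a replacement.

```shell
#!/usr/bin/env bash
# Sketch: count pods per node whose phase is neither Running nor Succeeded.
# Input on stdin: one "<node> <phase>" pair per line.
# count_not_running_by_node   (hypothetical helper)
count_not_running_by_node() {
  awk '$2 != "Running" && $2 != "Succeeded" { n[$1]++ }
       END { for (k in n) print n[k], k }' | sort -rn
}

# Usage (not executed here):
#   kubectl get pods --all-namespaces --no-headers \
#       -o custom-columns=NODE:.spec.nodeName,PHASE:.status.phase \
#     | count_not_running_by_node
```

Nodes with the highest count are printed first; pods that were never scheduled show up under the node name <none>.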

Retrieve cluster events

Check events from all namespaces sorted by timestamp. You will see how the state of the cluster has changed recently. Events are kept only for a limited time (one hour by default, configurable with the kube-apiserver --event-ttl flag) to keep etcd from filling up.

kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

Api server health

You can check API server health using the healthz endpoint, which returns HTTP status 200 and the message ‘ok’ when it’s healthy. This lets you keep an eye on the pulse of the cluster with simple tools like Pingdom or Nagios.

curl -k https://api-server-ip:6443/healthz
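For a cron job or a monitoring agent, the same request can be turned into an exit status. The sketch below assumes a hypothetical helper name (check_healthz) and an arbitrary 5 second timeout; like the command above it skips TLS verification with -k, which is only acceptable for a quick check.

```shell
#!/usr/bin/env bash
# Sketch: return 0 only when <base url>/healthz answers with HTTP 200.
# check_healthz <base url>   (hypothetical helper)
check_healthz() {
  local base_url="$1" code
  # -s: no progress output, -o /dev/null: drop the body,
  # -w '%{http_code}': print only the status code ("000" if unreachable).
  code="$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 5 \
    "$base_url/healthz" || true)"
  [ "$code" = "200" ]
}

# Usage (not executed here):
#   check_healthz https://api-server-ip:6443 || echo "apiserver unhealthy"
```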

Multi node Kubernetes cluster on Vagrant

This is a fast and easy way to install Kubernetes on Vagrant with the Metrics server addon.

git clone
cd vagrant/kubernetes
vagrant up

At this point you should have one master node and two worker nodes ready.

Let’s check cluster health

vagrant ssh master

kubectl get nodes
master    Ready    master   2m46s   v1.18.2
worker1   Ready    <none>   35s     v1.18.2
worker2   Ready    <none>   32s     v1.18.2

All nodes are ready.

Let’s install the Metrics server addon

kubectl apply -f

Update the Metrics server startup flags to solve the node name resolution issue

kubectl -n kube-system edit deployment metrics-server

#Add the following flags to the metrics-server start command
- --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
- --kubelet-insecure-tls

At this point Metrics server is installed.

After a few minutes of collecting data you should see:

kubectl top node
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
master    264m         13%    1091Mi          57%       
worker1   109m         5%     746Mi           39%       
worker2   109m         5%     762Mi           40% 

kubectl -n kube-system top pod
NAME                                       CPU(cores)   MEMORY(bytes)   
calico-kube-controllers-75d56dfc47-bdsxr   1m           5Mi             
calico-node-rvqwp                          20m          23Mi            
calico-node-thtd4                          31m          25Mi            
calico-node-vkhgs                          23m          22Mi            
coredns-66bff467f8-x68zs                   4m           5Mi             
coredns-66bff467f8-z7kzh                   4m           10Mi            
etcd-master                                22m          39Mi            
kube-apiserver-master                      52m          352Mi           
kube-controller-manager-master             18m          55Mi            
kube-proxy-tdwpf                           1m           18Mi            
kube-proxy-wvsb9                           1m           8Mi             
kube-proxy-zfd2c                           1m           9Mi             
kube-scheduler-master                      5m           23Mi            
metrics-server-7c557b6b9f-h4hz2            1m           11Mi

Build Kubernetes control plane image with Packer

The steps to prepare a single control plane image are quite simple:

  • Prepare Docker and Kubernetes packages and settings
  • Execute the kubeadm bootstrap script when the EC2 instance starts up for the first time

One unanswered question is: how do you add additional control plane nodes and worker nodes, which require tokens and certificates to be preset when joining the cluster?


Practical guide to Kubernetes Certified Administration exam

I have published a practical guide to the Kubernetes Certified Administration exam

Covered topics so far are:

Share your efforts

If you are also preparing for the Kubernetes Certified Administration exam, let’s combine our efforts by sharing the practical side of the exam.

Disaster recovery of single node Kubernetes control plane


There are many possible root causes why the control plane might become unavailable. Let’s review the most common scenarios and mitigation steps.

The mitigation steps in this article are built around AWS public cloud features, but all popular public cloud offerings have similar functionality.

Apiserver VM shutdown or apiserver crashing


  • unable to stop, update, or start new pods, services, replication controller
  • existing pods and services should continue to work normally, unless they depend on the Kubernetes API

Thoughts on a highly available Kubernetes cluster with a single control plane node

Why single node control plane?

Benefits are:

  • Monitoring and alerting are simple and on point. This reduces the number of false positive alerts.
  • Setup and maintenance are quick and straightforward. A less complex install process leads to a more robust setup.
  • Disaster recovery and recovery documentation are clearer and shorter.
  • Applications will continue to work even if the Kubernetes control plane is down.
  • Multiple worker nodes and multiple deployment replicas will provide the necessary high availability for your applications.

Disadvantages are:

  • Downtime of the control plane node makes it impossible to change any Kubernetes object, for example to schedule new deployments, update application configuration or add/remove worker nodes.
  • If a worker node goes down during control plane downtime, then it will not be able to re-join the cluster until the control plane has recovered.


  • If you have a heavy load on the Kubernetes API, like frequent deployments from many teams, then you might consider using a multi control plane setup.
  • If changes to Kubernetes objects are infrequent and your team can tolerate a bit of downtime, then a single control plane Kubernetes cluster can be a great choice.

Go http middleware chain with context package

Middleware is a function which wraps an http.Handler to do pre- or post-processing of the request.

A chain of middleware is a popular pattern for handling HTTP requests in the Go language. Using a chain we can:

  • Log application requests
  • Rate limit requests
  • Set HTTP security headers
  • and more

The Go context package helps to set up communication between middleware handlers.


How to enable minikube kvm2 driver on Ubuntu 18.04

Verify kvm2 support

Confirm virtualization support by CPU

grep -E -c '(svm|vmx)' /proc/cpuinfo

An output of 1 or more indicates that the CPU can use virtualization technology.

sudo kvm-ok

The output “KVM acceleration can be used” indicates that the system has virtualization enabled and KVM can be used.
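The CPU flag check can also be scripted, for example as a guard at the top of a provisioning script. This is a small sketch; cpu_supports_virtualization is a hypothetical helper, and the optional file argument exists only so the check can be tried against a saved copy of cpuinfo.

```shell
#!/usr/bin/env bash
# Sketch: succeed when at least one cpuinfo line lists svm (AMD) or vmx (Intel).
# cpu_supports_virtualization [cpuinfo file]   (hypothetical helper)
cpu_supports_virtualization() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  # grep -c prints the number of matching lines, 0 when there are none.
  [ "$(grep -E -c '(svm|vmx)' "$cpuinfo")" -ge 1 ]
}

# Usage (not executed here):
#   cpu_supports_virtualization || echo "no hardware virtualization support"
```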
