Disaster recovery of a single-node Kubernetes control plane

Overview

A control plane can become unavailable for many reasons. Let's review the most common scenarios and their mitigation steps.

The mitigation steps in this article are built around AWS features, but other popular public clouds offer similar functionality.

Apiserver VM shutdown or apiserver crashing

Results

unable to stop, update, or start new pods, services, or replication controllers
existing pods and services should continue to work normally, unless they depend on the Kubernetes API

Mitigations

In case of apiserver crash

The apiserver runs as a pod, so it is the kubelet's responsibility to restart it.
The kubelet itself is monitored by systemd, which restarts it if it fails.
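
To confirm that the apiserver has actually come back after such a restart, its liveness endpoint can be polled. The following is a minimal sketch and not part of the original procedure. It assumes the apiserver exposes /livez at the given base URL and that the endpoint can be reached without credentials. The function name, the opener parameter (present only to make the function testable) and the address in the usage note are hypothetical.

```python
import ssl
import urllib.error
import urllib.request


def apiserver_is_live(base_url, ca_file=None, timeout=5.0, opener=None):
    """Return True if GET <base_url>/livez answers with HTTP 200.

    ca_file is the CA bundle used to verify the apiserver certificate;
    when it is None the system trust store is used. A wrong ca_file path
    raises an error instead of being reported as "not live".
    """
    context = ssl.create_default_context(cafile=ca_file)
    url = base_url.rstrip("/") + "/livez"
    open_url = opener or urllib.request.urlopen
    try:
        with open_url(url, timeout=timeout, context=context) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, TLS failure or a non-2xx answer.
        return False
```

A call such as apiserver_is_live("https://10.0.0.10:6443", ca_file="/etc/kubernetes/pki/ca.crt") could be repeated until it returns True. Both the address and the CA path are assumptions about the cluster layout.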

In case of VM shutdown

AWS CloudWatch approach based on the instance status check:

Create a CloudWatch alarm
Choose the EC2 per-instance metric “StatusCheckFailed_Instance”
Select the threshold StatusCheckFailed_Instance >= 1 for 2 datapoints within 2 minutes
Set the EC2 action “Reboot this instance” for when the alarm is in the “Alarm” state
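
The same alarm can also be created programmatically. The sketch below assumes the third-party boto3 library and AWS credentials are available. The alarm name, the function names and the choice of the Maximum statistic over 60-second periods are assumptions made to match "2 datapoints within 2 minutes", not settings taken from the console steps.

```python
def build_reboot_alarm(instance_id, region):
    """Return keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        # Hypothetical naming scheme; any unique alarm name works.
        "AlarmName": f"reboot-on-status-check-failed-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_Instance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        # Two consecutive one-minute datapoints at or above 1.
        "Period": 60,
        "EvaluationPeriods": 2,
        "DatapointsToAlarm": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 reboot action for CloudWatch alarms.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:reboot"],
    }


def create_reboot_alarm(instance_id, region):
    """Create the alarm in AWS; needs boto3 and valid credentials."""
    import boto3  # third-party, imported lazily so the builder has no dependencies

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_reboot_alarm(instance_id, region))
```

Keeping the parameters in a separate function makes it possible to review or log them before anything is sent to AWS.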

Apiserver backing storage lost

Results

apiserver should fail to come up
kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
manual recovery or recreation of apiserver state is necessary before the apiserver is restarted

Mitigations

Use EBS volumes for the etcd data
Set up etcd backups. See the previous post on Backup of etcd
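
As a rough illustration of what a backup job could run, here is a sketch that takes an etcd snapshot with etcdctl. It is not necessarily the method described in the earlier post. The endpoint and certificate paths are assumptions based on a kubeadm-style layout, and the function names are hypothetical.

```python
import os
import subprocess


def build_snapshot_command(
    snapshot_path,
    endpoint="https://127.0.0.1:2379",   # assumed local etcd endpoint
    pki_dir="/etc/kubernetes/pki/etcd",  # assumed kubeadm certificate directory
):
    """Return the etcdctl argument list for saving a snapshot."""
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={pki_dir}/ca.crt",
        f"--cert={pki_dir}/server.crt",
        f"--key={pki_dir}/server.key",
        "snapshot",
        "save",
        snapshot_path,
    ]


def save_snapshot(snapshot_path):
    """Run etcdctl with the v3 API; raises CalledProcessError on failure."""
    env = dict(os.environ, ETCDCTL_API="3")
    subprocess.run(build_snapshot_command(snapshot_path), env=env, check=True)
```

The resulting snapshot file should be copied off the node, for example to object storage, so that it survives the loss of the volume.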

Network partition

Results

partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
existing pods and services should continue to work normally, unless they depend on the Kubernetes API

Mitigations

Option 1. Re-provision the control plane node in a reachable availability zone (AZ). To restore the etcd data, see the previous post on Backup of etcd.
Option 2. Set up the control plane node in the same AZ as the worker nodes.
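
For Option 2, it can help to check which worker nodes sit outside the control plane's AZ. The sketch below works on node labels that have already been fetched, for example from the output of kubectl get nodes -o json. It assumes nodes carry the well-known topology.kubernetes.io/zone label and that the control plane node has the node-role.kubernetes.io/control-plane label. The function name and the sample node names are hypothetical.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"
CONTROL_PLANE_LABEL = "node-role.kubernetes.io/control-plane"


def workers_outside_control_plane_zone(node_labels):
    """Return the sorted names of worker nodes in a different AZ than the control plane.

    node_labels maps a node name to its dictionary of labels.
    """
    control_plane_zones = {
        labels.get(ZONE_LABEL)
        for labels in node_labels.values()
        if CONTROL_PLANE_LABEL in labels
    }
    return sorted(
        name
        for name, labels in node_labels.items()
        if CONTROL_PLANE_LABEL not in labels
        and labels.get(ZONE_LABEL) not in control_plane_zones
    )


# Hypothetical cluster: one control plane node and two workers.
nodes = {
    "cp-1": {CONTROL_PLANE_LABEL: "", ZONE_LABEL: "us-east-1a"},
    "worker-1": {ZONE_LABEL: "us-east-1a"},
    "worker-2": {ZONE_LABEL: "us-east-1b"},
}
print(workers_outside_control_plane_zone(nodes))
```

Workers reported by this check would lose contact with the apiserver if their AZ were partitioned from the control plane's AZ.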

References

https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/#a-general-overview-of-cluster-failure-modes