Disaster recovery of a single-node Kubernetes control plane

Overview

A control plane can become unavailable for many reasons. Let's review the most common scenarios and their mitigation steps.

The mitigation steps in this article are built around AWS features, but other popular public clouds offer similar functionality.

Apiserver VM shutdown or apiserver crashing

Results

  • unable to stop, update, or start new pods, services, or replication controllers
  • existing pods and services should continue to work normally, unless they depend on the Kubernetes API

Mitigations

In case of apiserver crash

  • The apiserver runs as a pod, so the kubelet is responsible for restarting it.
  • The kubelet itself is supervised by systemd, which restarts it if it fails.
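
To confirm that the restart chain above actually brought the apiserver back, a small probe can help. The following is a minimal sketch, not part of the original setup: it assumes the apiserver exposes its `/livez` endpoint to unauthenticated clients (the default RBAC rules usually allow this) and that the cluster CA certificate is available locally. The function names, the URL, and the CA path mentioned afterwards are illustrative assumptions.

```python
import ssl
import urllib.error
import urllib.request


def is_healthy(status: int, body: str) -> bool:
    """Treat an HTTP 200 whose body starts with 'ok' as a healthy apiserver."""
    return status == 200 and body.strip().startswith("ok")


def probe_apiserver(url: str, ca_file: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint reports healthy, False on any failure.

    Connection errors, TLS errors, a missing CA file, and non-2xx responses
    are all reported as "not healthy" rather than raised.
    """
    try:
        # Verify the apiserver certificate against the cluster CA (assumed path).
        context = ssl.create_default_context(cafile=ca_file)
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            body = resp.read().decode("utf-8", "replace")
            return is_healthy(resp.status, body)
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

On a kubeadm-style node this might be called as `probe_apiserver("https://127.0.0.1:6443/livez", "/etc/kubernetes/pki/ca.crt")`; both values are assumptions and depend on how the cluster was installed.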

In case of VM shutdown

AWS CloudWatch approach based on the instance status check:

  • Create a CloudWatch alarm
  • Choose the EC2 per-instance metric “StatusCheckFailed_Instance”
  • Set the threshold to StatusCheckFailed_Instance >= 1 for 2 datapoints within 2 minutes
  • Set the EC2 action “Reboot this instance” to run when the alarm enters the “Alarm” state
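
The same alarm can also be created programmatically. The sketch below is one possible translation of the console steps into a boto3 `put_metric_alarm` call, not a definitive implementation. It assumes boto3 is installed and AWS credentials are configured. The alarm name, the helper function names, and the choice of the `Maximum` statistic are assumptions made for illustration. "2 datapoints within 2 minutes" is expressed as two 60-second evaluation periods.

```python
def reboot_alarm_params(instance_id: str, region: str) -> dict:
    """Build put_metric_alarm arguments mirroring the console steps above."""
    return {
        # Hypothetical naming scheme; any unique alarm name works.
        "AlarmName": f"reboot-on-status-check-failed-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_Instance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",  # assumption: any failed check in the period counts
        "Period": 60,            # seconds per datapoint
        "EvaluationPeriods": 2,  # 2 datapoints ...
        "DatapointsToAlarm": 2,  # ... both of which must breach (2 minutes total)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 reboot action, equivalent to "Reboot this instance".
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:reboot"],
    }


def create_reboot_alarm(instance_id: str, region: str) -> None:
    """Create (or overwrite) the alarm in CloudWatch."""
    import boto3  # third-party; imported lazily so the builder works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**reboot_alarm_params(instance_id, region))
```

Keeping the parameter builder separate from the API call makes it possible to review or unit-test the alarm definition without touching an AWS account.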

Apiserver backing storage lost

Results

  • apiserver should fail to come up
  • kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
  • manual recovery or recreation of apiserver state necessary before apiserver is restarted

Mitigations

Network partition

Results

  • partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
  • existing pods and services should continue to work normally, unless they depend on the Kubernetes API

Mitigations

  • Option 1. Re-provision the control plane node in a reachable availability zone (AZ). To restore the etcd data, see the previous post on Backup of etcd.
  • Option 2. Set up the control plane node in the same AZ as the worker nodes.
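
To see whether a cluster already satisfies Option 2, node zones can be compared. This is a rough sketch under stated assumptions: the input is the `items` list from `kubectl get nodes -o json`, nodes carry the well-known `topology.kubernetes.io/zone` label, and control plane nodes carry the `node-role.kubernetes.io/control-plane` label (as kubeadm sets it). The function name is hypothetical.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"
CONTROL_PLANE_LABEL = "node-role.kubernetes.io/control-plane"


def workers_in_other_zones(nodes: list) -> list:
    """Return names of worker nodes that share no zone with a control plane node."""
    control_plane_zones = set()
    workers = []
    for node in nodes:
        meta = node.get("metadata", {})
        labels = meta.get("labels", {})
        zone = labels.get(ZONE_LABEL)
        if CONTROL_PLANE_LABEL in labels:
            control_plane_zones.add(zone)
        else:
            workers.append((meta.get("name"), zone))
    # A worker is flagged when its zone differs from every control plane zone.
    return [name for name, zone in workers if zone not in control_plane_zones]
```

An empty result means every worker sits in the same AZ as the control plane node, so an AZ-level partition would not separate them.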

References

  • https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/#a-general-overview-of-cluster-failure-modes