Overview
There are many possible root causes for the control plane becoming unavailable. Let's review the most common scenarios and the mitigation steps for each.
The mitigation steps in this article are built around AWS features, but all popular public cloud offerings have similar functionality.
Apiserver VM shutdown or apiserver crashing
Results
- unable to stop, update, or start new pods, services, or replication controllers
- existing pods and services should continue to work normally, unless they depend on the Kubernetes API
Mitigations
In case of an apiserver crash
- The apiserver runs as a pod, so it is the kubelet's responsibility to restart it.
- The kubelet itself is monitored by systemd, which will restart it in case of failure; a quick verification sketch follows this list.
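If the kubelet has restarted a crashed apiserver, the pod's restart count will reflect it. Below is a minimal sketch using the Kubernetes Python client, assuming a kubeadm-style cluster where the apiserver runs as a static pod labeled component=kube-apiserver in kube-system; it only works once the apiserver is reachable again.

```python
# Sketch: check that the apiserver pod is back up and see how often it restarted.
# Assumes a kubeadm-style cluster (label "component=kube-apiserver" on the static pod).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="kube-system",
    label_selector="component=kube-apiserver",
)
for pod in pods.items:
    restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
    print(pod.metadata.name, pod.status.phase, "restarts:", restarts)
```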
In case of a VM shutdown
An AWS CloudWatch approach based on the instance status check (a scripted sketch follows the steps below):
- Create a CloudWatch alarm
- Choose the EC2 per-instance metric "StatusCheckFailed_Instance"
- Select the threshold StatusCheckFailed_Instance >= 1 for 2 datapoints within 2 minutes
- Set the EC2 action "Reboot this instance" when the check is in the "Alarm" state
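The same alarm can also be created programmatically. Below is a minimal boto3 sketch; the region, instance ID, and alarm name are placeholders, and the reboot is triggered through the built-in EC2 reboot action ARN.

```python
# Sketch: CloudWatch alarm that reboots the control plane instance on a failed
# instance status check. Region, instance ID, and alarm name are placeholders.
import boto3

REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical control plane instance

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

cloudwatch.put_metric_alarm(
    AlarmName="control-plane-status-check-failed",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_Instance",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Maximum",
    Period=60,                 # 2 datapoints of 1 minute each ...
    EvaluationPeriods=2,       # ... within 2 minutes
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[f"arn:aws:automate:{REGION}:ec2:reboot"],  # "Reboot this instance"
)
```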
Apiserver backing storage lost
Results
- apiserver should fail to come up
- kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
- manual recovery or recreation of the apiserver state is necessary before the apiserver can be restarted
Mitigations
- Use EBS volumes for the apiserver backing storage, so the data survives the loss of the VM.
- Set up etcd backups; see the previous post on Backup of etcd. An EBS snapshot sketch follows this list.
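On top of etcd-level backups, the EBS volume backing the etcd data directory can be snapshotted periodically for an extra recovery point. A minimal boto3 sketch, assuming a hypothetical volume ID:

```python
# Sketch: snapshot the EBS volume that backs the etcd data directory.
# The volume ID is a hypothetical placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Snapshot of the etcd data volume",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "Name", "Value": "etcd-backup"}],
    }],
)
print("Started snapshot:", snapshot["SnapshotId"])
```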
Network partition
Results
- partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
- existing pods and services should continue to work normally, unless they depend on the Kubernetes API
Mitigations
- Option 1. Re-provision the control plane node in a reachable availability zone (AZ). To restore the etcd server data, see the previous post on Backup of etcd.
- Option 2. Set up the control plane node in the same AZ as the worker nodes; a sketch for checking AZ placement follows this list.
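To check the placement for Option 2, one can list the availability zone of each instance. A boto3 sketch, assuming instances carry a hypothetical kubernetes-role tag:

```python
# Sketch: print the availability zone of control plane and worker instances.
# The "kubernetes-role" tag is a hypothetical naming convention.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for role in ("control-plane", "worker"):
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:kubernetes-role", "Values": [role]}]
    )
    for reservation in resp["Reservations"]:
        for instance in reservation["Instances"]:
            print(role, instance["InstanceId"], instance["Placement"]["AvailabilityZone"])
```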
References
- https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/#a-general-overview-of-cluster-failure-modes