How to check Kubernetes cluster health? - Site Reliability Engineer Blog

There are a few essential tools for taking a quick look at Kubernetes cluster health. Let’s review them here. As a result, you will be able to tell quickly whether the cluster has any obvious issues.

Install node problem detector

node-problem-detector aims to make various node problems visible to the upstream layers in the cluster management stack. It is a daemon that runs on each node, detects node problems, and reports them to the apiserver.

kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml

Use node-problem-detector in conjunction with a drainer daemon to quickly replace unhealthy nodes. Learn more about it at Monitor Node Health.
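Since node-problem-detector reports what it finds to the apiserver as node conditions, one way to spot trouble is to list only the conditions that look abnormal. The sketch below is a rough filter, not part of node-problem-detector itself: the function name abnormal_conditions and the input format (one line per node, the node name followed by Type=Status pairs) are assumptions made for this example.

```shell
# Reads lines of the form "<node> <Type>=<Status> <Type>=<Status> ..." from stdin
# and prints "<node> <Type>=<Status>" for every condition that looks abnormal.
# Assumption: "Ready" is healthy only when True; every other condition
# (MemoryPressure, DiskPressure, KernelDeadlock, ...) is healthy when not True.
abnormal_conditions() {
  awk '{
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if ((kv[1] == "Ready" && kv[2] != "True") ||
          (kv[1] != "Ready" && kv[2] == "True"))
        print $1, $i
    }
  }'
}

# Possible usage against a live cluster (prints nothing when all nodes look fine):
#   kubectl get nodes -o jsonpath='{range .items[*]}{@.metadata.name}{"\t"}{range @.status.conditions[*]}{@.type}={@.status}{" "}{end}{"\n"}{end}' | abnormal_conditions
```

Which condition types appear depends on how node-problem-detector is configured, so treat the output as a starting point for kubectl describe node rather than a verdict.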

Kubernetes cluster info

To check whether kubectl can connect to the master, whether the master is running, and on which port, use kubectl cluster-info. To debug cluster state, use kubectl cluster-info dump. It prints the cluster state, including pod logs, to stdout, but you can send the output to a directory instead with the --output-directory flag.

kubectl cluster-info

Kubernetes master is running at https://10.0.0.10:6443
KubeDNS is running at https://10.0.0.10:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://10.0.0.10:6443/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy
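The dump is usually easier to work with as files than as one long stream on stdout. Here is a minimal sketch of a wrapper that writes it to a directory. The function name dump_cluster_state and the default path under /tmp are hypothetical; the --all-namespaces and --output-directory flags belong to kubectl cluster-info dump.

```shell
# Dumps cluster state into a directory and prints the directory name.
# $1: target directory (optional; the timestamped /tmp default is an assumption).
dump_cluster_state() {
  dump_dir="${1:-/tmp/cluster-state-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$dump_dir" || return 1
  # Without --all-namespaces only some namespaces are included in the dump.
  kubectl cluster-info dump --all-namespaces \
    --output-directory="$dump_dir" >/dev/null || return 1
  echo "$dump_dir"
}

# Possible usage:
#   dir=$(dump_cluster_state) && grep -ril "error" "$dir" | head
```

On a large cluster the dump can take a while and use noticeable disk space, because it includes pod logs.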

Nodes information

Get the extended output of node information and pay attention to the STATUS, ROLES, AGE, and INTERNAL-IP/EXTERNAL-IP columns. Check that the IP addresses belong to your network and that the nodes can communicate with each other. Node age is also a kind of uptime for a node: it can tell you whether nodes are stable enough, which is very useful if you use spot instances.

kubectl get nodes -o wide

NAME      STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master    Ready    master   11h   v1.18.2   10.0.0.10     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8
worker1   Ready    <none>   11h   v1.18.2   10.0.0.11     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8
worker2   Ready    <none>   11h   v1.18.2   10.0.0.12     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8
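On a cluster with more than a handful of nodes it can help to print only the nodes whose STATUS is not plain Ready. A small sketch follows; the function name not_ready_nodes is made up for this example, and it assumes the default kubectl get nodes table layout shown above, with NAME in the first column and STATUS in the second.

```shell
# Reads "kubectl get nodes" table output from stdin and prints the name and
# status of every node whose STATUS is not exactly "Ready".
# This also catches cordoned nodes such as "Ready,SchedulingDisabled".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Possible usage (works with or without -o wide):
#   kubectl get nodes -o wide | not_ready_nodes
```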

API Component statuses

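The status of the most important Kubernetes cluster components, apart from the apiserver, can be retrieved using the kubectl get componentstatuses command. Note that the componentstatuses API is deprecated as of Kubernetes v1.19, so newer clusters may print a warning.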

kubectl get componentstatuses

NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok                  
controller-manager   Healthy   ok                  
etcd-0               Healthy   {"health":"true"}

Pods statuses

Listing the pods that are not running, with extended output, can help you see whether the failed pods have anything in common, for example that they are all on the same node or all belong to the same availability zone.

kubectl get pods -o wide --all-namespaces | grep -v " Running "
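To make a shared node stand out, the non-running pods can be counted per node. The sketch below assumes a four-column input (namespace, pod name, phase, node) produced with custom columns; the function name non_running_by_node and the column labels are made up for this example. Note that it uses the pod phase, which is not the same thing as the STATUS column: a pod in CrashLoopBackOff still has the phase Running, so the grep above remains the better tool for catching those.

```shell
# Reads lines "<namespace> <pod> <phase> <node>" from stdin and prints
# "<count> <node>" for pods that are neither Running nor Succeeded,
# with the busiest node first.
non_running_by_node() {
  awk '$3 != "Running" && $3 != "Succeeded" { count[$4]++ }
       END { for (node in count) print count[node], node }' | sort -rn
}

# Possible usage:
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,NODE:.spec.nodeName \
#     | non_running_by_node
```

Pods that have not been scheduled yet show <none> as their node, so a large count there points at scheduling rather than at a specific machine.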

Retrieve cluster events

Check events from all namespaces sorted by timestamp to see how the state of the cluster has changed recently. Events are retained only for a limited time, one hour by default (controlled by the kube-apiserver --event-ttl flag), so that they do not fill up the cluster's storage.

kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
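On a busy cluster most events are of type Normal, so it may be worth narrowing the list to warnings only. A minimal sketch, assuming a field selector on the event type; the wrapper name warning_events is made up for this example.

```shell
# Lists only Warning events from all namespaces, oldest first.
warning_events() {
  kubectl get events --all-namespaces \
    --field-selector type=Warning \
    --sort-by=.metadata.creationTimestamp
}

# Possible usage: show the 20 most recent warnings.
#   warning_events | tail -n 20
```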

API server health

You can check API server health using the healthz endpoint, which returns HTTP status 200 and the message ‘ok’ when it is healthy. So you can keep an eye on the pulse of the cluster using simple tools like Pingdom or Nagios.

curl -k https://api-server-ip:6443/healthz

ok
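For a cron job or a custom monitoring check, the same request can be wrapped in a function that turns the response into an exit code. This is only a sketch: the function name check_apiserver, the five-second timeout, and the exit code convention (0 healthy, 1 unhealthy response, 2 request failed) are choices made for this example. Like the command above, it uses -k and therefore skips TLS certificate verification.

```shell
# Checks an API server health endpoint.
# $1: base URL, for example https://api-server-ip:6443
# $2: endpoint name (optional, defaults to healthz)
# Returns 0 if the body is exactly "ok", 1 for any other body,
# and 2 if curl itself fails (timeout, connection refused, ...).
check_apiserver() {
  base_url="$1"
  endpoint="${2:-healthz}"
  body=$(curl -k -s --max-time 5 "$base_url/$endpoint") || return 2
  [ "$body" = "ok" ]
}

# Possible usage:
#   check_apiserver "https://api-server-ip:6443" || echo "API server is not healthy"
```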