A few essential tools give you a quick look at Kubernetes cluster health. Let’s review them here. With them you will be able to tell quickly whether the cluster has any obvious issues.
Install node problem detector
node-problem-detector aims to make various node problems visible to the upstream layers of the cluster management stack. It is a daemon that runs on each node, detects node problems, and reports them to the apiserver.
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
Use node-problem-detector together with a drainer daemon to quickly replace unhealthy nodes. Learn more about it at Monitor Node Health.
Kubernetes cluster info
To see whether kubectl can connect to the master, whether the master is running, and on which port, use kubectl cluster-info. To debug cluster state, use kubectl cluster-info dump. It prints the full cluster state, including pod logs, to stdout; pass --output-directory to write it to a directory instead.
kubectl cluster-info
Kubernetes master is running at https://10.0.0.10:6443
KubeDNS is running at https://10.0.0.10:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://10.0.0.10:6443/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy
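As a minimal sketch of the directory output mentioned above, the function below writes a timestamped dump to disk. The function name dump_cluster_state and the default path /tmp/cluster-dumps are hypothetical choices for illustration, not part of kubectl.

```shell
# Write a full cluster dump into a timestamped directory instead of stdout.
# The default root /tmp/cluster-dumps is an assumption; pass any writable path.
dump_cluster_state() {
  dump_root="${1:-/tmp/cluster-dumps}"
  target="${dump_root}/$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$target" || return 1
  # --all-namespaces includes every namespace, not only the current one
  kubectl cluster-info dump --all-namespaces --output-directory="$target" >/dev/null || return 1
  # Print where the dump was written so callers can archive it
  echo "$target"
}
```

Calling dump_cluster_state with no arguments uses the default path; the printed directory can then be compressed and attached to an incident report.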
Nodes information
Get extended node information. Pay attention to the STATUS, ROLES, AGE, and IP columns. Check that the IP addresses are the ones that work in your network and that the nodes can communicate with each other. Node age also acts as a kind of uptime: it tells you whether nodes are stable enough, which is very useful if you use spot instances.
kubectl get nodes -o wide
NAME      STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master    Ready    master   11h   v1.18.2   10.0.0.10     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8
worker1   Ready    <none>   11h   v1.18.2   10.0.0.11     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8
worker2   Ready    <none>   11h   v1.18.2   10.0.0.12     <none>        Ubuntu 18.04.4 LTS   4.15.0-99-generic   docker://19.3.8
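On a larger cluster it can be easier to list only the nodes that need attention. The sketch below assumes the default column order of kubectl get nodes (NAME first, STATUS second); the function name not_ready_nodes is made up for this example.

```shell
# Print the name and status of every node whose STATUS is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are reported too, on purpose.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}
```

Empty output means every node reports plain Ready.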
API Component statuses
The status of the most important Kubernetes cluster components other than the apiserver can be retrieved with the get componentstatuses command. Note that this API is deprecated as of Kubernetes v1.19.
kubectl get componentstatuses
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}
Pod statuses
Listing pods that are not running, with extended output, helps you spot commonalities between failed pods, for example that they are all on the same node or all belong to the same availability zone.
kubectl get pods -o wide --all-namespaces |grep -v " Running "
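To make the "same node" pattern easier to spot, the following sketch counts non-running pods per node. It is an illustration under assumptions: it uses the pod phase (.status.phase) rather than the STATUS column, so it is coarser than the grep above (a pod in CrashLoopBackOff still has phase Running), and the function name non_running_pods_by_node is hypothetical.

```shell
# Count pods per node whose phase is neither Running nor Succeeded,
# busiest node first. Unscheduled pods show up under the node "<none>".
non_running_pods_by_node() {
  kubectl get pods --all-namespaces --no-headers \
    -o custom-columns=NODE:.spec.nodeName,PHASE:.status.phase |
    awk '$2 != "Running" && $2 != "Succeeded" { n[$1]++ }
         END { for (node in n) print n[node], node }' |
    sort -k1,1nr -k2,2
}
```

A single node with a high count points at a node-level problem; an even spread suggests the cause lies in the workload or the cluster itself.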
Retrieve cluster events
Check events from all namespaces, sorted by timestamp. This shows how the state of the cluster has changed recently. Events are kept only for a limited time so that they do not fill up the cluster's storage: the default is one hour, set by the kube-apiserver --event-ttl flag, so your cluster may be configured differently.
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
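The full event list can be noisy. As a rough sketch, the function below narrows it to Warning events and counts them per reason; the name warning_reasons is an assumption made for this example.

```shell
# Count Warning events by reason across all namespaces, most frequent first.
warning_reasons() {
  kubectl get events --all-namespaces --field-selector type=Warning \
    -o custom-columns=REASON:.reason --no-headers |
    sort | uniq -c | sort -rn
}
```

A reason that dominates the list, such as repeated scheduling or back-off failures, is usually the first thing worth investigating.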
API server health
You can check API server health using the healthz endpoint, which returns HTTP status 200 and the message ‘ok’ when it is healthy. This lets you keep an eye on the pulse of the cluster with simple tools like Pingdom or Nagios.
curl -k https://api-server-ip:6443/healthz
ok
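For a monitoring tool, the curl call can be wrapped in a small check. This is only a sketch: the function names healthz_status and check_apiserver are hypothetical, the exit codes follow the Nagios plugin convention (0 for OK, 2 for CRITICAL), and -k skips TLS verification just like the command above.

```shell
# Map the HTTP status code and body from /healthz to a Nagios-style result.
healthz_status() {
  code="$1"
  body="$2"
  if [ "$code" = "200" ] && [ "$body" = "ok" ]; then
    echo "OK - apiserver healthy"
    return 0
  fi
  echo "CRITICAL - apiserver returned HTTP ${code:-none}: ${body:-no response}"
  return 2
}

# Probe https://<host:port>/healthz and report the result.
check_apiserver() {
  url="https://${1}/healthz"
  # -w appends the status code on its own line after the response body;
  # "|| true" keeps going so a failed connection is reported as CRITICAL
  out=$(curl -k -s --max-time 5 -w '\n%{http_code}' "$url" || true)
  code=$(printf '%s\n' "$out" | tail -n 1)
  body=$(printf '%s\n' "$out" | sed '$d')
  healthz_status "$code" "$body"
}
```

For example, check_apiserver 10.0.0.10:6443 would probe the master address from the cluster-info output shown earlier.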