What Does Node Failure Troubleshooting Mean?

Also known as: node failure troubleshooting, kubernetes troubleshooting, CKA node failure, worker node not ready, kubelet troubleshooting

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

Quick Definition

In Kubernetes, a node is a virtual or physical machine that runs your applications. When a node fails, the applications on it stop working. Node failure troubleshooting is the methodical approach to find out why the node went down and how to fix it or safely remove it from the cluster so your applications can be rescheduled elsewhere.

Must Know for Exams

The CNCF Certified Kubernetes Administrator (CKA) exam places significant emphasis on troubleshooting, including node failure. The CKA curriculum explicitly includes objectives such as 'Troubleshoot cluster component failures' and 'Troubleshoot worker node failures.' In the exam, you will be presented with a cluster where a node is in a NotReady state, and you must diagnose and resolve the issue within a limited time.

The exam is performance-based, meaning you interact with live clusters via a terminal rather than answering multiple-choice questions. For example, you may be given a scenario where the kubelet on one node has stopped working, and you need to identify that the kubelet service is not running, start it, and verify that the node returns to the Ready state. Alternatively, you might encounter a node with expired TLS certificates that prevent it from communicating with the control plane. You would need to regenerate the certificates and restart the kubelet.

The exam also tests your ability to use kubectl commands to cordon, drain, and delete nodes when recovery is not possible. Node failure troubleshooting is a high-weight area because it tests your practical understanding of cluster operations, system administration, and Kubernetes internals.

Troubleshooting is the most heavily weighted domain of the CKA curriculum (30%), so it is very difficult to pass without demonstrating competence in this area. Similarly, the CKAD (Certified Kubernetes Application Developer) exam focuses less on infrastructure but still expects developers to understand node status and its impact on pod scheduling. For the CKS (Certified Kubernetes Security Specialist), node troubleshooting includes security aspects like verifying node integrity, checking for unauthorized pods, and ensuring node-level security policies are enforced.

Therefore, investing time in learning node failure troubleshooting directly translates to exam success.

Simple Meaning

Imagine you work in a large office building with many desks where different teams sit and do their work. Each desk represents a node in a Kubernetes cluster. The office manager, like the Kubernetes control plane, assigns each team to a desk.

Now suppose one desk suddenly breaks: the leg snaps and the desk tilts. The person sitting there cannot work anymore. The office manager needs to figure out why the desk broke. Was it too heavy? Was it old and worn out? Did someone bump into it? Once the problem is understood, the manager can either fix the desk or move the person to a different desk so work continues. Node failure troubleshooting works the same way.

A Kubernetes cluster has many nodes (computers) that run containers (your applications). When a node fails, the control plane first notices that the node has stopped reporting its health. The troubleshooting process asks questions like: Is the node powered on? Is the network cable unplugged? Is the node running out of memory or disk space? Has the kubelet service (the node's agent) crashed? Each of these questions is like checking the desk's legs, screws, or stability.

You start with the most obvious checks, like confirming the node is still connected to the network, then move to deeper checks like examining system logs or resource usage. The goal is always to restore the node to a healthy state or safely drain it so the control plane can move workloads to other nodes. This keeps the entire office running smoothly even when one desk breaks down.

Full Technical Definition

Node failure troubleshooting in Kubernetes involves diagnosing worker node unavailability using a combination of control plane signals, node-level diagnostics, and system administration commands. The core mechanism for detecting node failure is the Node Controller, a component of the kube-controller-manager. The Node Controller runs a synchronization loop that checks the health of each node based on the heartbeats (node status updates) that the kubelet on each node periodically sends to the API server. If the kubelet stops reporting for longer than the node monitor grace period (40 seconds by default in many releases, configurable with --node-monitor-grace-period), the Node Controller sets the node's Ready condition to Unknown, and kubectl shows the node as NotReady. If the kubelet is still reporting but considers the node unhealthy, the Ready condition is False instead. The node is also tainted, and after a further timeout (5 minutes by default) pods that do not tolerate the taint are evicted from the failed node. Their controllers (for example, the ReplicaSet behind a Deployment) create replacement pods, which the scheduler places onto healthy nodes.
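The signals the Node Controller acts on can be read directly from the API. The following is a minimal sketch, assuming kubectl access and a cluster that publishes kubelet heartbeats as Lease objects in the kube-node-lease namespace (a detail not covered above). The function names and the node name worker-2 are illustrative.

```shell
# Print the status of the node's Ready condition: True, False, or Unknown.
node_ready_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Print the time the kubelet last renewed its heartbeat Lease.
node_last_lease_renewal() {
  kubectl get lease "$1" -n kube-node-lease \
    -o jsonpath='{.spec.renewTime}'
}

# Usage (worker-2 is a placeholder node name):
#   node_ready_status worker-2
#   node_last_lease_renewal worker-2
```

A renewal time that is several minutes old, combined with a Ready status of Unknown, suggests the kubelet has stopped reporting rather than reporting an unhealthy node.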

The troubleshooting process typically begins with kubectl get nodes to see the node's current status, which usually shows NotReady for a failed node. The next step is kubectl describe node <node-name> to inspect detailed node conditions, including pressure on disk, memory, or PID resources.

If the node is reachable, SSH or console access allows checking system-level services. Essential checks include confirming that the kubelet service is running (systemctl status kubelet), that the container runtime (e.g., containerd, CRI-O) is operational, and that there is sufficient disk space (df -h) and memory (free -m). Network connectivity between the node and the control plane is critical, so testing reachability to the API server (e.g., using curl or telnet on port 6443) is standard. Logs from the kubelet (journalctl -u kubelet) often reveal errors such as certificate expiry, failed authentication with the API server, or resource starvation.

For nodes that have crashed or become unreachable, cloud providers offer serial console output and instance status checks, and bare-metal servers offer out-of-band management tools (e.g., IPMI, iDRAC). In managed Kubernetes services like Amazon EKS, Azure AKS, or Google GKE, node failure troubleshooting includes checking the cloud provider's health dashboard, autoscaling groups, and instance state.

For unresponsive nodes, administrators may attempt a graceful node drain (kubectl drain <node-name>), but because the kubelet cannot confirm that pods have stopped, they often need to force delete pods and eventually remove the node object from the cluster (kubectl delete node <node-name>). Additional diagnostics include checking CNI (Container Network Interface) plugin health, DNS resolution, and control plane components (especially etcd if the node is also a control plane node). In a CKA exam context, candidates must be able to perform these checks using command-line tools, interpret node conditions, and take corrective actions like cordoning, draining, or reinstalling the kubelet.
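The API server reachability test mentioned above can be sketched as a small helper run from the failing node. This assumes curl is installed and that the API server listens on port 6443. The function name and the address 10.0.0.10 are placeholders.

```shell
# Return success if the API server sends back any HTTP response.
# $1 = API server host or IP, $2 = port (defaults to 6443).
# A 401 or 403 still counts as reachable: this tests the network path,
# not authentication.
apiserver_reachable() {
  # -k skips certificate verification, -s silences progress output.
  curl -k -s -o /dev/null --max-time 5 "https://$1:${2:-6443}/healthz"
}

# Usage (10.0.0.10 is a placeholder control plane address):
#   apiserver_reachable 10.0.0.10 && echo reachable || echo unreachable
```

If this fails while the kubelet is running, the problem is more likely in routing, firewalls, or DNS than in the kubelet itself.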

Real-Life Example

Think of a busy hotel with many rooms. Each hotel room is like a Kubernetes node, and the guests staying in the rooms are like the pods (applications) running on that node. The front desk manager acts as the control plane, keeping track of which rooms are occupied and which are empty.

One morning, the manager notices that Room 204 has not responded to the automatic 'do you need housekeeping?' signal for several hours. This is similar to a node not sending its heartbeat.

The manager (control plane) marks Room 204 as potentially problematic, just as Kubernetes marks a node as NotReady. The manager then sends a bellhop to physically check on the room. The bellhop knocks on the door but no one answers.

This is like an administrator SSH'ing into the node or checking the cloud provider's console. The bellhop reports back that the door is locked and no sound comes from inside. This indicates a more serious problem, similar to a node that is powered off or has a kernel panic.

The manager then checks the hotel's maintenance logs to see if there were any complaints about that room's air conditioner or electrical socket, much like an administrator checks the kubelet logs to see why the node stopped reporting. The manager also looks at the room's recent entries: maybe the guest checked out but the system was not updated, or maybe the battery in the room's keycard reader died (the actual cause in this story). In troubleshooting, the manager starts with the simplest explanation (a dead keycard reader) and works up to more complex ones (an electrical fire).

Once the cause is identified, the manager either fixes the issue (replacing the keycard reader) or evacuates the room and moves the guest's belongings to another room (draining the node and rescheduling pods). If the room is beyond repair, the manager permanently removes it from the hotel's room inventory (deleting the node object). This entire process ensures that hotel operations continue smoothly, and no guest is left waiting indefinitely.

Why This Term Matters

Node failures are inevitable in any production Kubernetes environment. Hardware can malfunction, operating systems can crash, network links can drop, and cloud instances can be terminated. Without a systematic troubleshooting approach, a single node failure could cause prolonged application downtime, degraded user experience, and potential data loss.

Understanding node failure troubleshooting directly impacts the stability and reliability of containerized workloads. In real IT operations, a team might have dozens or hundreds of nodes, each running multiple applications. When a node goes down, the control plane will eventually reschedule pods, but this process can take minutes, and misconfigured nodes can cause cascading failures if the root cause is not addressed.

For example, if a node fails due to disk pressure, and you simply delete and recreate it without finding out what filled the disk, the new node may also fail quickly. Troubleshooting helps identify whether the failure is isolated to one node or is a pattern across the cluster (e.g., a bad deployment configuration, a resource leak, or a cloud provider issue). From a cost perspective, cloud-based nodes cost money even when they are unhealthy. A node that is stuck in a NotReady or unresponsive state still incurs charges.

Rapid troubleshooting and remediation save money. Additionally, proper troubleshooting practices align with incident management policies and help maintain Service Level Agreements (SLAs). For organizations running stateful applications like databases or message queues, a node failure can mean data loss if persistent volumes are not properly attached or if replication is not configured.

Troubleshooting reveals whether the data is recoverable or if backups need to be restored. Security also plays a role: a compromised node may appear as a failure, and troubleshooting must include checks for unauthorized access or tampering. Ultimately, node failure troubleshooting is a core skill for any Kubernetes administrator, ensuring that clusters remain resilient, cost-effective, and secure.

How It Appears in Exam Questions

In certification exams like the CKA, node failure troubleshooting appears primarily in performance-based tasks. You are given a terminal session on a control plane node (or a jump box) and instructed to troubleshoot a specific node. A typical question might be: 'One of the worker nodes, node01, is in a NotReady state. Identify the issue and fix it so that the node returns to Ready.' The exam environment disables certain nodes or services to simulate failures. The question does not provide details; you must use kubectl get nodes, describe the node, SSH into the node, check services, and fix the issue.

Another common pattern is a scenario where a node has a resource pressure condition (e.g., DiskPressure or MemoryPressure). The question might say: 'Node02 is showing DiskPressure. Free up disk space so that the condition clears and pods can be scheduled again.' You would need to log into the node, find large log files or unused container images, and clean them up.

Some questions require you to drain a node that is going for maintenance. For example: 'The node node03 needs to be taken down for hardware upgrades. Safely drain all pods from the node before the maintenance window.' Here kubectl drain cordons the node for you and evicts its pods so that their controllers can recreate them on other nodes.

There are also hybrid questions where a node is failing due to an incorrect kubelet configuration, such as a misconfigured node IP or hostname. You might need to edit the kubelet configuration file, restart the service, and verify the node status. In the CKS exam, troubleshooting may include security-related node failures, such as a node that has been compromised or has a failing audit log agent.

The question might ask you to investigate the node and isolate it from the cluster. In all cases, the exam expects you to use command-line tools like kubectl, systemctl, journalctl, ssh, and grep. The key skill is methodical troubleshooting: check node status, check node conditions, examine logs, identify the root cause, apply the fix, and verify the node becomes healthy again.

Example Scenario

Imagine you work as a system administrator for an online bookstore that uses Kubernetes. One day, you receive an alert that the website is running slowly. You log into the cluster and run kubectl get nodes.

You see that one of the three worker nodes, called webshop-node-2, is showing status 'NotReady'. The other two nodes are fine. Your first step is to run kubectl describe node webshop-node-2 to see why.

The output shows that the node condition 'Ready' is set to 'Unknown', and the last heartbeat was received 10 minutes ago. This means the control plane has not heard from the node for a while. You then SSH into webshop-node-2 using its IP address.

Once inside, you run systemctl status kubelet, and it shows that the kubelet service is inactive (dead). You try to start it with systemctl start kubelet, but it fails again after a few seconds. You check the kubelet logs with journalctl -u kubelet and see a misconfiguration error: the kubelet is configured to use the cgroupfs cgroup driver, but the container runtime uses systemd.

This is the root cause: the kubelet's cgroup driver does not match the container runtime. You edit the kubelet configuration file (usually located at /var/lib/kubelet/config.yaml), change the cgroupDriver field to match the runtime, restart the kubelet, and then verify that it stays active.

After a minute, you run kubectl get nodes from your management machine and see that webshop-node-2 is now 'Ready'. The website performance returns to normal, and you have successfully resolved the node failure.
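The cgroup driver comparison in this scenario can be sketched as two small helpers. This assumes containerd is the runtime and that its driver is set through the SystemdCgroup option in /etc/containerd/config.toml. Both the function names and that path are assumptions to verify on your own node.

```shell
# Print the cgroupDriver value from a kubelet config file.
# Prints nothing if the field is not set.
kubelet_cgroup_driver() {
  awk '$1 == "cgroupDriver:" {gsub(/"/, "", $2); print $2}' "$1"
}

# Print the driver containerd is set to use, based on its SystemdCgroup option.
containerd_cgroup_driver() {
  if grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"; then
    echo systemd
  else
    echo cgroupfs
  fi
}

# Usage on the node:
#   kubelet_cgroup_driver /var/lib/kubelet/config.yaml
#   containerd_cgroup_driver /etc/containerd/config.toml
```

If the two commands print different values, the kubelet and the runtime disagree, which matches the failure described above.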

Common Mistakes

Assuming a node is completely healthy just because it is reachable via SSH or ping.

A node can be network-reachable but still have a failed kubelet, exhausted disk space, or a misconfigured container runtime. Kubernetes relies on the kubelet to report health, and without it, the node is effectively dead for scheduling purposes.

Always check node status with kubectl get nodes and kubectl describe node. Even if the node responds to ping, verify that the kubelet is active and that all node conditions (Ready, DiskPressure, MemoryPressure, PIDPressure) are normal.
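One way to review every node condition at once is sketched below. It assumes that a healthy node reports Ready as True and every other condition as False, which holds for the standard conditions listed above. The function names are illustrative.

```shell
# Print every condition of a node as TYPE=STATUS, one per line.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Read TYPE=STATUS lines on stdin and print only the unhealthy ones:
# Ready must be True; every other condition is expected to be False.
abnormal_conditions() {
  awk -F= '$0 != "" && (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False"))'
}

# Usage (webshop-node-2 is the node from the example scenario):
#   node_conditions webshop-node-2 | abnormal_conditions
```

An empty result means no condition is abnormal; any line printed is a starting point for the next check.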

Trying to fix a node failure by immediately deleting the node object from the cluster instead of first attempting to drain it or diagnose the issue.

Deleting a node without draining can leave pods in a terminating state, and the scheduler may not properly reassign stateful workloads. It also destroys the chance to recover or investigate the root cause.

Always start with investigating the node. If you must remove it, first try kubectl cordon to mark it unschedulable, then kubectl drain to evict pods gracefully. Only delete the node object after pods are safely moved or forced to complete.
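The cordon, drain, delete order can be sketched as a single helper that stops at the first failing step. The function name, the node name, and the 120-second timeout are illustrative choices. Note that --delete-emptydir-data discards data held in emptyDir volumes.

```shell
# Remove a node from service in the order described above.
# The && chain stops at the first step that fails, so the node object
# is never deleted unless the drain completed.
retire_node() {
  kubectl cordon "$1" &&
    kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=120s &&
    kubectl delete node "$1"
}

# Usage (node01 is a placeholder node name):
#   retire_node node01
```

Chaining the steps this way keeps the node object available for investigation whenever the drain does not finish cleanly.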

Focusing only on the kubelet when the node failure is caused by an underlying resource issue like disk pressure or memory pressure.

The kubelet may be running fine, but if the node’s disk is full or memory is exhausted, containers will not start and new pods cannot be scheduled. Restarting the kubelet without freeing resources does not fix the root cause.

Check node conditions using kubectl describe node. Look for DiskPressure or MemoryPressure. Then log into the node and use commands like df -h, free -m, or journalctl to find what is consuming resources. Clear logs, remove unused container images, or increase resources.
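A quick disk check on the node might look like the sketch below. The 85% default is an illustrative value, not the kubelet's eviction threshold, and the function names and paths are assumptions.

```shell
# Print the used percentage (digits only) of the filesystem holding a path.
disk_used_percent() {
  # -P forces the portable one-line-per-filesystem output format.
  df -P "$1" | awk 'NR == 2 {sub(/%/, "", $5); print $5}'
}

# Print a warning when usage is at or above a threshold (default 85).
warn_if_disk_full() {
  used=$(disk_used_percent "$1")
  [ -n "$used" ] || return 1
  if [ "$used" -ge "${2:-85}" ]; then
    echo "$1 is ${used}% full"
  fi
}

# Usage on the node:
#   warn_if_disk_full /var/lib/kubelet
#   warn_if_disk_full /var/lib/containerd 90
```

Running it against the kubelet and runtime directories helps show which filesystem is actually under pressure before anything is deleted.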

Neglecting to check the container runtime when troubleshooting node failures.

The kubelet relies on container runtimes like containerd or CRI-O to run containers. If the runtime has stopped or is misconfigured, the kubelet will report errors or the node will appear unhealthy.

On the problematic node, run systemctl status containerd (or crio) to confirm the runtime is active. Check runtime logs with journalctl -u containerd. Ensure the runtime is compatible with the kubelet version and that they share the same cgroup driver.
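The runtime check can be sketched as a loop over the two runtimes named above. The function name is illustrative, and the service names containerd and crio are the usual systemd unit names, which may differ on your distribution.

```shell
# Print the state of each known container runtime service on this node.
# A runtime that is not installed is typically reported as inactive.
runtime_status() {
  for svc in containerd crio; do
    # is-active prints one word, such as active, inactive, or failed.
    echo "$svc: $(systemctl is-active "$svc" 2>/dev/null)"
  done
}

# Usage on the node:
#   runtime_status
#   journalctl -u containerd --no-pager -n 50   # recent runtime logs
```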

Assuming that rebooting the node always fixes the issue without further analysis.

Rebooting can temporarily clear symptoms but mask underlying problems like disk corruption, misconfiguration, or failing hardware. The node might fail again shortly after reboot, causing repeated downtime.

After a reboot, immediately check node status and kubelet logs. Look for recurring error patterns. If the node fails again, investigate at a deeper level: disk health (smartctl), memory tests, or configuration validation.

Exam Trap — Don't Get Fooled

In an exam scenario, you are asked to troubleshoot a node that shows NotReady. You notice that the kubelet is running, but the node still does not become Ready. You might be tempted to restart the kubelet repeatedly or reboot the node.

Do not restart services blindly. Use kubectl describe node to check node conditions and the 'LastHeartbeatTime' field. Look at the kubelet logs carefully for specific errors like 'failed to get node' or 'certificate signed by unknown authority.'

The most common hidden causes are TLS certificate issues, incorrect node IP in kubelet configuration, or mismatched cgroup drivers. Always read the logs and configuration files (often found in /var/lib/kubelet/config.yaml or /etc/kubernetes/kubelet.conf) before taking action.
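Reading the logs for these hidden causes can be narrowed with a simple filter. This is a sketch: the pattern list is an assumption that covers the error types named in this section and is not exhaustive.

```shell
# Read kubelet log lines on stdin and keep those matching common
# failure signatures: certificate problems, cgroup driver mismatches,
# node registration errors, and API server connection failures.
filter_kubelet_errors() {
  grep -iE 'x509|certificate|cgroup driver|failed to get node|connection refused'
}

# Usage on the node:
#   journalctl -u kubelet --no-pager -n 200 | filter_kubelet_errors
```

If the filter prints nothing, read the unfiltered log, since the cause may be one this pattern list does not cover.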

Commonly Confused With

Node Failure Troubleshooting vs Pod Failure Troubleshooting

Node failure troubleshooting focuses on the health of the entire worker machine (operating system, kubelet, disk, network). Pod failure troubleshooting focuses on problems within a specific pod, like container crashes, image pull errors, or misconfigured volumes. A node can be healthy while a pod on it fails, and vice versa.

If your application pod is stuck in CrashLoopBackOff, that is a pod failure. If kubectl get nodes shows a node as NotReady, that is a node failure. You would troubleshoot the pod by checking its logs; you would troubleshoot the node by checking its kubelet and resource usage.

Node Failure Troubleshooting vs Control Plane Failure

Node failure troubleshooting deals with worker nodes (where user applications run). Control plane failure deals with the master components: API server, scheduler, controller manager, and etcd. If the control plane fails, you may not even be able to run kubectl commands. Worker node failures are usually isolated, while control plane failures affect the entire cluster.

If you run kubectl get nodes and get an error like 'The connection to the server was refused', that likely indicates a control plane issue. If you get the list of nodes but one shows NotReady, that is a worker node failure.

Node Failure Troubleshooting vs Node Maintenance (Cordon/Drain)

Node maintenance is a planned, administrative action to prepare a node for safe removal or reboot. Troubleshooting node failure is a reactive process to diagnose and fix an unplanned problem. In maintenance, you cordon and drain a node before work; in troubleshooting, you often discover a node has already failed and must decide whether to recover or remove it.

Before updating the kernel on a node, you run kubectl drain node01 – this is maintenance. If node01 suddenly goes NotReady on its own, you must troubleshoot to find out why.

Step-by-Step Breakdown

1. Identify the Failing Node

Run kubectl get nodes to see the status of all nodes. The failing node will typically show NotReady in the STATUS column. Resource conditions such as DiskPressure do not appear there and are visible in kubectl describe node. Write down the node's name and its internal or external IP address (kubectl get nodes -o wide shows both). This step gives you a starting point for deeper investigation.

2. Inspect Node Details

Use kubectl describe node <node-name> to view detailed conditions, capacity, and allocation. Look for the 'Conditions' section. Check if 'Ready' is True, False, or Unknown. Also look for DiskPressure, MemoryPressure, or PIDPressure. This reveals whether the issue is resource-related or connectivity-related.

3. Check Node Reachability

Try to ping the node's IP address from a control plane node or your admin workstation. If ping fails, the node may be powered off, have a network misconfiguration, or be isolated. If ping succeeds, the node is network-reachable but the kubelet may not be reporting. This step narrows down the problem domain.

4. Access the Node and Check Services

SSH into the failing node (or use console access via cloud provider). Run systemctl status kubelet to see if the kubelet service is active. Also check the container runtime (systemctl status containerd or crio). If any service is inactive or failed, this is likely the root cause. Check system resources with df -h and free -m.

5. Examine Logs for Error Clues

Use journalctl -u kubelet to read the kubelet logs. Look for lines marked 'Error', 'Failed', or 'Cannot'. Common errors include TLS handshake failures, certificate expiry, cgroup driver mismatch, or inability to connect to the API server. The logs often tell you exactly what is wrong.

6. Apply the Corrective Action

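Based on the logs, take action. For a failed kubelet service, restart it with systemctl restart kubelet. For disk pressure, remove unused images (for example, crictl rmi --prune on containerd or CRI-O nodes, or docker system prune where Docker is the runtime) and trim the systemd journal with journalctl --vacuum-size or journalctl --vacuum-time, since journalctl --rotate alone does not free space. For certificate issues, re-issue the affected certificates (for example, with kubeadm) or update the kubelet configuration. For configuration errors, edit the appropriate config file and restart the service.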

7. Verify Node Recovery

After applying the fix, wait a few seconds and run kubectl get nodes from a control plane machine. The node should transition to Ready within a minute. Use kubectl describe node again to confirm condition 'Ready' is True and that all pressure conditions are False. If the node remains unhealthy, repeat the troubleshooting steps, starting with a deeper log inspection.
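The verification step can be sketched with kubectl wait, which blocks until the condition is met instead of requiring repeated polling. The function name, node name, and default timeout are illustrative.

```shell
# Block until the node reports Ready, or fail after a timeout (default 120s).
wait_for_node_ready() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-120s}"
}

# Usage (node01 is a placeholder node name):
#   wait_for_node_ready node01 && kubectl get nodes
```

A non-zero exit status means the node did not become Ready in time, which is the cue to return to the logs.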

Practical Mini-Lesson

Node failure troubleshooting is one of the most practical, hands-on skills you need as a Kubernetes administrator. It is not enough to understand theory; you must be comfortable using the command line in a live cluster. Let me walk you through a realistic scenario from start to finish, as if you were on call for a production system.

Imagine you get an automated alert: 'Node worker-2 has been in NotReady state for 5 minutes.' You open a terminal and run kubectl get nodes, confirming that worker-2 is NotReady. Your first action is to describe the node. The output shows that the 'Ready' condition is 'Unknown' and the 'LastHeartbeatTime' is 6 minutes ago. This tells you the node has stopped sending heartbeats to the control plane. The node could be offline, or its kubelet could be stuck. Since you cannot assume anything, you try to ping the node. If ping fails, you check your cloud provider's console: maybe the instance was stopped by auto-scaling or has a failed health check. In AWS, you might check the EC2 instance status and system logs. If ping succeeds, you SSH into the node. Once inside, you run systemctl status kubelet and see it is 'inactive (dead)'. You start it, but it fails again after 30 seconds. You check the logs with journalctl -u kubelet -n 50. The logs reveal: 'failed to connect to the API server: x509: certificate has expired or is not yet valid'.

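Now you know the issue: the kubelet's client certificate has expired. The kubelet normally rotates this certificate on its own, and kubeadm certs renew (kubeadm alpha certs renew in older releases) covers the control plane certificates rather than the kubelet's client certificate, so an expired kubelet certificate usually has to be re-issued. In a cluster set up with kubeadm, one approach described in the kubeadm documentation is to generate a fresh kubelet.conf for the node on a control plane machine (kubeadm kubeconfig user --org system:nodes --client-name system:node:<node-name>), copy it to /etc/kubernetes/kubelet.conf on the failed node, and remove the stale kubelet-client files under /var/lib/kubelet/pki so the kubelet can request a new certificate. Another option is to reset the node and rejoin it with kubeadm join and a fresh token. Then you run systemctl restart kubelet and verify its status is active. You exit the SSH session and run kubectl get nodes again. After a few seconds, worker-2 shows 'Ready'. You confirm by describing the node and seeing that the heartbeat time is fresh.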
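Before re-issuing anything, it helps to confirm that the certificate really has expired. The sketch below assumes openssl is installed on the node and that the kubelet client certificate sits at the usual kubeadm path, which you should verify on your own system. The function names are illustrative.

```shell
# Return success if the certificate has already expired.
# openssl's -checkend N exits non-zero when the certificate expires
# within N seconds, so N=0 tests for "expired right now".
# Note: this also returns success if the file cannot be read.
cert_expired() {
  ! openssl x509 -noout -checkend 0 -in "$1" >/dev/null 2>&1
}

# Print the certificate's expiry date (notAfter=...).
cert_end_date() {
  openssl x509 -noout -enddate -in "$1"
}

# Usage on the node:
#   cert_end_date /var/lib/kubelet/pki/kubelet-client-current.pem
#   cert_expired /var/lib/kubelet/pki/kubelet-client-current.pem && echo expired
```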

What if the issue was different? For disk pressure, you would find large files using du -sh /var/log/* and clear logs or remove unused container images. For memory pressure, you might find a rogue process consuming all RAM and kill it or adjust resource limits. For a broken container runtime, you might reinstall containerd or check its socket (usually /run/containerd/containerd.sock). The key is to follow a logical flow: check reachability, check services, check logs, apply fix, verify. This methodical approach works for almost any node failure, whether physical or cloud-based. Remember that in a cluster with multiple nodes, you may also need to cordon the node before performing repairs to prevent new pods from being scheduled to a potentially failing node. After fixing, you uncordon it. For nodes that cannot be recovered, you drain them and delete the node object. In production, always document your findings and the action taken for incident reviews.

Memory Tip

In a node failure, check the three S's first: Services (kubelet and container runtime), Space (disk and memory), and Signals (heartbeat and logs). A node without these three is a node you cannot trust.

Frequently Asked Questions

What is the first command I should run when I suspect a node failure?

Run kubectl get nodes to see the status of all nodes. This quickly tells you which nodes are Ready and which are NotReady or Unknown. It is the starting point for any troubleshooting.

Can a node be healthy even if kubectl get nodes shows NotReady?

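No. If a node is marked NotReady by the control plane, either the node's kubelet has stopped reporting its status or the kubelet itself reports that the node is not ready. The node may be accessible via SSH, but it is not considered healthy for scheduling pods. Resource pressure conditions such as DiskPressure are reported separately and can be present even while the node still shows Ready.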

What does it mean when a node has condition DiskPressure?

DiskPressure means the node's available disk space or inodes have fallen below a threshold. The kubelet will start evicting pods to free up space. You should log into the node and check disk usage with df -h, then clear logs, remove unused images, or expand the disk.

How do I safely remove a failed node from the cluster?

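First, try to drain the node with kubectl drain <node-name> --ignore-daemonsets. If pods use emptyDir volumes or are not managed by a controller, you may also need --delete-emptydir-data (which replaced the older --delete-local-data flag) and --force. If the node is unreachable, evicted pods can remain in Terminating because the kubelet cannot confirm that they have stopped. After draining, run kubectl delete node <node-name> to remove the node object. Then clean up any persistent resources like volumes if needed.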

Do I need to check the container runtime when a node fails?

Yes, because the kubelet relies on a container runtime (like containerd or CRI-O) to run pods. If the runtime is down or misconfigured, the kubelet will not function properly. Always check runtime status with systemctl status containerd.

What is the difference between cordon and drain?

Cordon marks a node as unschedulable, meaning no new pods will be assigned to it, but existing pods continue running. Drain evicts all existing pods from the node (except those managed by DaemonSets) and also cordons it. Drain is more disruptive and is used before node maintenance or removal.

Can I fix a node failure without SSH or console access?

In some managed Kubernetes services, you can use kubectl and cloud provider tools to diagnose. For example, in EKS you can check the node group health. But for deep troubleshooting (checking logs, disk, system services), console or SSH access is usually required. Without it, the best option is to replace the node.

What exam objectives cover node failure troubleshooting?

In the CKA exam, objectives include 'Troubleshoot cluster component failures' and 'Troubleshoot worker node failures.' The exam expects you to diagnose and fix node issues in a live cluster using command-line tools. This is a core, high-weight area.

Summary

Node failure troubleshooting is an indispensable skill for any Kubernetes administrator. It involves a systematic process: identifying the failing node via kubectl commands, inspecting node conditions and logs, checking underlying system services and resources, and applying the correct fix whether it is restarting kubelet, freeing disk space, fixing certificates, or draining the node for removal. This skill is heavily tested in the CKA exam, where you must diagnose and resolve node issues in a live cluster under time constraints.

In the real world, mastering node failure troubleshooting ensures high availability of applications, prevents cascading failures, optimizes cloud costs, and strengthens the overall resilience of your infrastructure. Remember the core workflow: check node status, check services and resources, examine logs, fix the root cause, and verify recovery. Avoid common pitfalls like blindly restarting services or deleting nodes without draining.

With practice, this process becomes second nature, allowing you to maintain stable and reliable Kubernetes clusters.