This chapter covers Amazon EKS Multi-AZ Node Groups, a critical feature for building resilient, highly available Kubernetes workloads on AWS. For the SAA-C03 exam, understanding how to distribute worker nodes across Availability Zones (AZs) is a key aspect of designing resilient architectures. Approximately 5-10% of exam questions touch on EKS or related container orchestration concepts, with Multi-AZ node groups being a specific pattern tested in the context of high availability and fault tolerance. By the end of this chapter, you will understand the architecture, configuration, and best practices for deploying EKS node groups across multiple AZs, including how it differs from single-AZ deployments and how to avoid common pitfalls.
Imagine you run a nationwide delivery company with three major distribution centers (Availability Zones) in different parts of the city. Each center has a fleet of delivery trucks (EC2 instances) that are identical in capacity and configuration. To ensure that a single road closure or power outage doesn't stop deliveries, you don't put all your trucks in one center. Instead, you split them across the three centers. This is your 'node group' – a logical grouping of trucks that all run the same delivery software. Now, your dispatcher (the Kubernetes control plane) needs to assign a package (a pod) to a truck. It looks at which centers have available trucks and picks one. If a center goes offline, the dispatcher automatically reassigns packages to trucks in the other centers. But here's the key: you must tell the dispatcher upfront that the trucks are spread across centers, and you might also set rules like 'always keep at least one truck free in each center' (minimum nodes per AZ) or 'never use more than 10 trucks in a center' (maximum nodes per AZ). Without this explicit configuration, the dispatcher might accidentally put all trucks in one center, defeating the purpose. This is exactly how Amazon EKS Multi-AZ Node Groups work – they let you define a set of worker nodes that span multiple Availability Zones, with the Kubernetes scheduler aware of the AZ distribution so it can place pods for high availability.
What Are EKS Multi-AZ Node Groups?
Amazon EKS (Elastic Kubernetes Service) is a managed Kubernetes service that simplifies running Kubernetes on AWS. A node group is a collection of EC2 instances (worker nodes) that are registered with an EKS cluster and run your containerized applications. In a Multi-AZ node group, these worker nodes are distributed across two or more Availability Zones within the same AWS region. This distribution ensures that if one AZ fails, the application pods can be rescheduled onto nodes in the remaining AZs, providing high availability.
Why Multi-AZ Node Groups Matter
Kubernetes itself does not inherently understand AWS Availability Zones; it sees only the zone labels attached to each node. Without explicit configuration, the scheduler gives no guarantee of zone spread, so most or all replicas of a workload can land on nodes in a single AZ, creating a single point of failure. Multi-AZ node groups rely on AWS features (EC2 Auto Scaling groups with multiple subnets) to spread nodes across AZs; you then add Kubernetes topology spread constraints or pod anti-affinity so that pods are spread as well. For the SAA-C03, you must understand that this pattern is essential for meeting SLAs that require 99.99% uptime or for applications that must survive an AZ outage.
How Multi-AZ Node Groups Work Internally
An EKS node group is backed by an EC2 Auto Scaling group (ASG). When you create a node group with multiple subnets (each in a different AZ), the ASG distributes EC2 instances across those subnets evenly by default. Each node is then automatically labeled with topology.kubernetes.io/zone (older clusters also carry the deprecated failure-domain.beta.kubernetes.io/zone) set to the AZ name (e.g., us-east-1a). The Kubernetes scheduler uses these labels to enforce pod placement policies.
Key components:
- Subnets: You must specify at least two subnets in different AZs when creating the node group. Each subnet must have a route to the internet (if public) or a NAT gateway (if private).
- Auto Scaling Group: The ASG manages the lifecycle of instances. It uses the specified launch template or launch configuration to create instances.
- Node Group Scaling: You can set minimum, maximum, and desired size. The ASG attempts to keep the desired number of instances, distributing across AZs.
- Kubernetes Labels: EKS automatically adds labels like topology.kubernetes.io/zone and topology.kubernetes.io/region. These are used by the scheduler for pod topology spread constraints.
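The balancing goal described above can be modeled in a few lines. This is a simplified sketch of the target layout only, not the ASG's actual algorithm; the function name `spread_evenly` is hypothetical, and which zone receives a leftover instance is an arbitrary choice made here for illustration.

```python
def spread_evenly(desired: int, zones: list[str]) -> dict[str, int]:
    """Split `desired` instances across `zones` as evenly as possible."""
    if not zones:
        raise ValueError("at least one zone is required")
    base, extra = divmod(desired, len(zones))
    # The first `extra` zones get one additional instance (arbitrary tie-break).
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
for desired in (3, 4, 10):
    print(desired, spread_evenly(desired, zones))
```

With a desired size that is not a multiple of the zone count, at least one zone always holds one more node than another, which is why the minimum and maximum per AZ cannot be pinned from the node group settings alone.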
Configuration and Verification
To create a Multi-AZ node group via the AWS CLI, you can use the following command:
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name multi-az-ng \
  --scaling-config minSize=2,maxSize=10,desiredSize=3 \
  --subnets subnet-abc123 subnet-def456 subnet-ghi789 \
  --instance-types t3.medium \
  --node-role arn:aws:iam::123456789012:role/eks-node-role
Verification: After creation, check the node group status:
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name multi-az-ng
Look for status: ACTIVE. To see the AZ distribution of nodes:
kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.'topology\.kubernetes\.io/zone'
Pod Placement Strategies
To ensure pods are spread across AZs, you must use Kubernetes pod topology spread constraints or pod anti-affinity. For example:
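# Excerpt: metadata, selector, and containers are omitted for brevity.
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 6
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:        # selects the pods that are counted per zone
            matchLabels:
              app: web          # hypothetical label; must match the pod template's labels
This ensures that pods are evenly distributed across zones, with a skew of at most 1 (i.e., no zone has more than one extra pod compared to the zone with the fewest). The labelSelector tells the scheduler which pods to count; without it the constraint matches no pods and has no effect. Without such constraints, the scheduler might place all pods on nodes in the same AZ, even if nodes are spread.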
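To make the skew rule concrete, the following sketch evaluates whether one more pod may be placed in a zone. It follows the rule as documented for topology spread constraints (pods in the candidate zone after placement, minus the current minimum across zones, must not exceed maxSkew); the function name `can_place` and the sample counts are assumptions for illustration, and the real scheduler also weighs resources, taints, and other filters.

```python
def can_place(pods_per_zone: dict[str, int], zone: str, max_skew: int = 1) -> bool:
    """Return True if adding one matching pod to `zone` keeps skew within max_skew."""
    global_min = min(pods_per_zone.values())
    # Skew for the candidate zone is its count after placement minus the global minimum.
    return pods_per_zone[zone] + 1 - global_min <= max_skew


current = {"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 1}
for zone in current:
    print(zone, can_place(current, zone))
```

With `whenUnsatisfiable: DoNotSchedule`, a pod that fails this check in every zone stays Pending; with `ScheduleAnyway`, the scheduler treats the rule as a preference instead.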
Interaction with Related AWS Services
EC2 Auto Scaling: The ASG handles adding/removing instances based on scaling policies. EKS node groups support both manual scaling and auto scaling via the Cluster Autoscaler.
Elastic Load Balancing (ELB): An ALB or NLB can be configured to route traffic to pods in multiple AZs. Cross-zone load balancing is always enabled for an ALB at the load balancer level and is disabled by default for an NLB, where you must turn it on explicitly.
AWS VPC: Subnets must have appropriate routing. Private subnets require a NAT gateway for outbound internet access (e.g., to pull container images).
IAM: The node group role must have permissions to describe EC2 instances, attach volumes, etc. The EKS cluster role must have permission to manage the node group.
Defaults and Timers
Scaling Cooldown: The ASG has a default cooldown of 300 seconds. This prevents rapid scaling.
Health Check Grace Period: For managed node groups, the default health check grace period is 0 seconds, meaning EKS immediately checks node health after launch.
Update Strategy: Managed node groups support rolling updates with a configurable max unavailable (default 1).
Common Pitfalls
Insufficient Subnets: If you specify only one subnet, the node group is single-AZ. The exam tests that you need at least two subnets in different AZs for Multi-AZ.
Subnet CIDR Conflicts: Subnets must not overlap and must have sufficient IP addresses for the desired number of nodes.
Missing Pod Topology Constraints: Even with Multi-AZ node groups, pods may not be spread unless you explicitly configure constraints.
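Incorrect IAM Permissions: The node group role must carry AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly; without them, nodes fail to join the cluster or pull images. If you run the Cluster Autoscaler, it additionally needs Auto Scaling permissions such as autoscaling:DescribeAutoScalingGroups and autoscaling:SetDesiredCapacity.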
Exam Relevance
For SAA-C03, you should know:
Multi-AZ node groups require multiple subnets in different AZs.
EKS automatically labels nodes with AZ information.
Pod topology spread constraints are needed to distribute pods across AZs.
The Cluster Autoscaler can scale node groups based on pod resource requests.
Managed node groups simplify updates and scaling, but you can also use self-managed node groups.
Advanced: Cluster Autoscaler with Multi-AZ
The Cluster Autoscaler (CA) can scale node groups up and down. When multiple node groups exist (e.g., one per AZ), the CA considers each node group independently. With a single Multi-AZ node group, the CA only raises the ASG's desired capacity; the ASG then decides which AZ each new instance lands in, normally the AZ with the fewest instances. The CA therefore cannot target a specific AZ, which matters when a pending pod can run in only one zone (for example, because it is bound to an EBS volume there), and the group can drift out of balance when an AZ is short on capacity. To mitigate, you can use one node group per AZ (optionally with the CA's --balance-similar-node-groups flag) or set the ASG's Availability Zone distribution to balanced best effort.
Summary of Key Values
Minimum nodes per AZ: Not directly configurable via EKS; you must use separate node groups per AZ or use pod anti-affinity.
Maximum nodes per AZ: Similarly, not directly configurable.
Default ASG distribution: The ASG attempts to keep the same number of instances in each AZ, but it is not strict.
Node replacement: If a node fails, the ASG launches a replacement, normally in whichever AZ keeps the group balanced (often the same AZ, if capacity is available there).
Plan Subnet and AZ Layout
First, identify at least two Availability Zones in your target region. For each AZ, either use an existing subnet or create a new one. Ensure each subnet has a unique CIDR block that does not overlap with others and has enough free IP addresses to accommodate your maximum desired nodes (e.g., /24 subnet supports up to 251 usable IPs). Also, decide whether subnets will be public (with internet gateway) or private (with NAT gateway for outbound traffic). For production, private subnets are recommended. Record the subnet IDs (e.g., subnet-abc123, subnet-def456).
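The subnet checks in this step can be done before creating anything. The sketch below uses Python's standard `ipaddress` module; the CIDR blocks are made-up examples, and the constant reflects the five addresses AWS reserves in every subnet, which is where the 251 figure for a /24 comes from.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, VPC router, DNS, future use, broadcast


def usable_ips(cidr: str) -> int:
    """Number of addresses in the subnet that can be assigned to nodes and pods."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def overlapping(cidrs: list[str]) -> list[tuple[str, str]]:
    """Return every pair of CIDR blocks that overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for i, a in enumerate(nets)
        for b in nets[i + 1:]
        if a.overlaps(b)
    ]


planned = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]  # hypothetical layout, one per AZ
for cidr in planned:
    print(cidr, usable_ips(cidr))
print("overlaps:", overlapping(planned))
```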
Create IAM Roles for Node Group
Create an IAM role for the node group with the AWS managed policies `AmazonEKSWorkerNodePolicy`, `AmazonEKS_CNI_Policy`, and `AmazonEC2ContainerRegistryReadOnly`. Also attach a custom policy for Auto Scaling permissions if using Cluster Autoscaler. The role must have a trust relationship with `ec2.amazonaws.com`. Example trust policy: `{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "ec2.amazonaws.com"}, "Action": "sts:AssumeRole"}]}`. Note the role ARN for later use.
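For readers who prefer to generate the role inputs rather than hand-edit JSON, here is a small sketch that builds the same trust policy and the three managed policy ARNs. The helper name `ec2_trust_policy` is hypothetical; the output is plain JSON that could be saved to a file and passed to whatever tooling creates the role.

```python
import json


def ec2_trust_policy() -> dict:
    """Trust policy that lets EC2 instances (the worker nodes) assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# AWS managed policies named in this step, expressed as full ARNs.
NODE_POLICY_ARNS = [
    f"arn:aws:iam::aws:policy/{name}"
    for name in (
        "AmazonEKSWorkerNodePolicy",
        "AmazonEKS_CNI_Policy",
        "AmazonEC2ContainerRegistryReadOnly",
    )
]

print(json.dumps(ec2_trust_policy(), indent=2))
print("\n".join(NODE_POLICY_ARNS))
```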
Create the Node Group via AWS CLI or Console
Use the AWS CLI command `aws eks create-nodegroup` with the `--subnets` flag listing at least two subnet IDs. Set `--scaling-config` with min, max, and desired size. For example, `--scaling-config minSize=2,maxSize=10,desiredSize=3`. Specify instance types (e.g., `--instance-types t3.medium`) and the node IAM role ARN. Optionally, set `--ami-type AL2_x86_64` (Amazon Linux 2) or `BOTTLEROCKET_x86_64`. The command returns a node group object; wait for the status to become `ACTIVE` using `aws eks describe-nodegroup`.
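A thin wrapper can catch the two most common input mistakes (a single subnet, or a desired size outside the min/max range) before the command is run. This is a sketch under stated assumptions: `create_nodegroup_command` is a hypothetical helper that only assembles the command string from the flags shown above, and it cannot verify that the subnets really sit in different AZs.

```python
import shlex


def create_nodegroup_command(cluster, name, subnets, role_arn,
                             min_size, max_size, desired,
                             instance_type="t3.medium"):
    """Assemble an `aws eks create-nodegroup` command after basic sanity checks."""
    if len(subnets) < 2:
        raise ValueError("a Multi-AZ node group needs at least two subnets")
    if not (min_size <= desired <= max_size):
        raise ValueError("desired size must lie between min and max size")
    args = [
        "aws", "eks", "create-nodegroup",
        "--cluster-name", cluster,
        "--nodegroup-name", name,
        "--scaling-config",
        f"minSize={min_size},maxSize={max_size},desiredSize={desired}",
        "--subnets", *subnets,
        "--instance-types", instance_type,
        "--node-role", role_arn,
    ]
    return shlex.join(args)  # quotes any argument that needs it


print(create_nodegroup_command(
    "my-cluster", "multi-az-ng",
    ["subnet-abc123", "subnet-def456"],
    "arn:aws:iam::123456789012:role/eks-node-role",
    min_size=2, max_size=10, desired=3,
))
```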
Verify Node Distribution Across AZs
After the node group is active, use kubectl to list nodes and their AZ labels. Run `kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.'topology\.kubernetes\.io/zone'`. Ensure nodes appear in at least two different zones. If all nodes are in one zone, check that you specified subnets in different AZs; the ASG may also need a few minutes to finish launching and balancing instances. Also verify that the subnets have proper route tables and that the nodes can communicate with the EKS control plane.
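When there are many nodes, counting them per zone is easier than reading the table. The sketch below tallies nodes by zone label from the structure that `kubectl get nodes -o json` returns; the sample document and node names are invented so the example runs without a cluster.

```python
from collections import Counter

ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_per_zone(node_list: dict) -> Counter:
    """Count nodes per zone from a parsed `kubectl get nodes -o json` document."""
    zones = Counter()
    for item in node_list.get("items", []):
        labels = item.get("metadata", {}).get("labels", {})
        zones[labels.get(ZONE_LABEL, "<unlabeled>")] += 1
    return zones


# Invented sample standing in for real kubectl output.
sample = {
    "items": [
        {"metadata": {"name": "node-1", "labels": {ZONE_LABEL: "us-east-1a"}}},
        {"metadata": {"name": "node-2", "labels": {ZONE_LABEL: "us-east-1b"}}},
        {"metadata": {"name": "node-3", "labels": {ZONE_LABEL: "us-east-1a"}}},
    ]
}

counts = nodes_per_zone(sample)
print(dict(counts))
print("multi-AZ:", len(counts) >= 2)
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the kubectl output piped in.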
Deploy Pods with Topology Spread Constraints
To ensure pods are distributed across AZs, add a `topologySpreadConstraints` section to your Deployment or StatefulSet spec. For example, set `maxSkew: 1`, `topologyKey: topology.kubernetes.io/zone`, and `whenUnsatisfiable: DoNotSchedule`. Also set `replicas` to a multiple of the number of AZs (e.g., 3 replicas for 3 AZs). Apply the manifest. Verify pod distribution with `kubectl get pods -o wide --all-namespaces`; the NODE column shows the node name, not the AZ, so map each node to its zone with `kubectl get nodes -L topology.kubernetes.io/zone`. If pods are not spread, check for node resource constraints or taints.
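Because the pod listing reports node names rather than zones, checking the spread means joining two lookups. A minimal sketch, assuming you have already extracted a pod-to-node mapping and a node-to-zone mapping (both dictionaries below are invented sample data):

```python
from collections import Counter


def pods_per_zone(pod_to_node: dict[str, str], node_to_zone: dict[str, str]) -> Counter:
    """Count pods per zone by resolving each pod's node to its zone label."""
    return Counter(node_to_zone.get(node, "<unknown>") for node in pod_to_node.values())


node_to_zone = {"node-1": "us-east-1a", "node-2": "us-east-1b", "node-3": "us-east-1c"}
pod_to_node = {
    "web-0": "node-1", "web-1": "node-2", "web-2": "node-3",
    "web-3": "node-1", "web-4": "node-2", "web-5": "node-3",
}

counts = pods_per_zone(pod_to_node, node_to_zone)
observed_skew = max(counts.values()) - min(counts.values())
print(dict(counts), "skew:", observed_skew)
```

An observed skew larger than the configured `maxSkew` usually points at pods that were scheduled before the constraint was added, since the constraint is only evaluated at scheduling time.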
Test AZ Failure Resilience
Simulate an AZ failure by either draining every node in one zone (for example, `kubectl drain -l topology.kubernetes.io/zone=us-east-1a --ignore-daemonsets`) or by blocking traffic to nodes in that AZ using network ACLs. An ASG has no per-AZ desired capacity, so a single AZ cannot be scaled to zero within one node group. When nodes become unreachable, they turn `NotReady` and their pods are eventually shown as `Terminating` or `Unknown`. The workload controllers (Deployment, StatefulSet) create replacement pods, and the scheduler places them onto nodes in other AZs (provided there is capacity). Verify that the application remains available if you have multiple replicas. After the test, restore the AZ and rebalance nodes.
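The "provided there is capacity" condition is worth checking on paper before running the test. The sketch below estimates the worst-case share of nodes lost to a single AZ failure and whether the surviving nodes can still hold every replica. The function names and the per-node pod capacity are assumptions for illustration; real capacity depends on pod resource requests.

```python
def worst_case_loss(nodes_per_zone: dict[str, int]) -> float:
    """Fraction of nodes lost if the most heavily populated AZ fails."""
    total = sum(nodes_per_zone.values())
    if total == 0:
        raise ValueError("no nodes to evaluate")
    return max(nodes_per_zone.values()) / total


def replicas_fit_after_failure(nodes_per_zone: dict[str, int], failed_zone: str,
                               pods_per_node: int, replicas: int) -> bool:
    """True if the nodes outside `failed_zone` can host all replicas."""
    remaining = sum(n for zone, n in nodes_per_zone.items() if zone != failed_zone)
    return remaining * pods_per_node >= replicas


layout = {"us-east-1a": 5, "us-east-1b": 5, "us-east-1c": 5}
print(f"worst-case loss: {worst_case_loss(layout):.0%}")
print("fits:", replicas_fit_after_failure(layout, "us-east-1a", pods_per_node=4, replicas=36))
```

If the second check fails, the rescheduled pods stay Pending until the ASG or Cluster Autoscaler adds nodes in the healthy AZs.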
Enterprise Scenario 1: E-Commerce Platform
A large e-commerce company runs its customer-facing web application on EKS. The application requires 99.99% uptime and must survive an entire AZ failure. They deploy a Multi-AZ node group across three AZs (us-east-1a, us-east-1b, us-east-1c) with a minimum size of 15 nodes, which the ASG balances to roughly 5 nodes per AZ. They use a managed node group with t3.large instances. The deployment uses pod topology spread constraints with maxSkew: 1 and whenUnsatisfiable: DoNotSchedule. They also use a Cluster Autoscaler to scale up during peak shopping seasons. In production, they observed that during a real AZ outage, the ASG automatically launched replacement nodes in the healthy AZs, and pods were rescheduled within 2 minutes. The key configuration was setting the ASG's Availability Zone distribution strategy to "balanced best effort" so that launches could proceed in healthy AZs when one AZ was unavailable. A common mistake they encountered was forgetting to set the topologySpreadConstraints on their StatefulSets, which caused all database pods to land in one AZ, leading to data unavailability during an outage.
Enterprise Scenario 2: Financial Services
A fintech company runs a microservices architecture for payment processing. They have strict regulatory requirements for data residency within a single region but across multiple AZs. They use separate node groups per AZ (e.g., ng-az1, ng-az2, ng-az3) instead of a single Multi-AZ node group because they need fine-grained control over node sizing per AZ. Each node group has its own ASG and scaling policies. They use pod node affinity to pin pods to specific AZs. The downside is increased management overhead. They also use the Kubernetes topology.kubernetes.io/zone label to enforce that pods from a critical service are spread across all three AZs. During an audit, they demonstrated that no single AZ failure could cause more than 33% capacity loss, meeting the SLA.
Common Misconfigurations
Using only one subnet: This results in a single-AZ node group, defeating high availability.
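Misjudging cross-zone load balancing: An ALB always has cross-zone load balancing enabled at the load balancer level, but an NLB has it disabled by default. In both cases, the load balancer must be enabled in every AZ where pods run, or targets in the missing AZ receive no traffic.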
Insufficient IP addresses: The subnet CIDR must be large enough to accommodate the max nodes plus a buffer for pod IPs (if using AWS VPC CNI).
Forgetting to update the VPC CNI configuration: The VPC CNI plugin must have WARM_ENI_TARGET and WARM_IP_TARGET set appropriately to avoid IP exhaustion during scaling.
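The IP pressure described in the last two items can be estimated with the commonly cited max-pods formula for the VPC CNI in its default secondary-IP mode. This is a rough sketch: the ENI and per-ENI address limits for t3.medium (3 and 6) are the published instance limits, the helper names are hypothetical, and prefix delegation or custom networking changes the numbers.

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Max pods per node: each ENI keeps one primary IP; +2 covers host-network pods."""
    return enis * (ips_per_eni - 1) + 2


def nodes_supported(usable_subnet_ips: int, enis: int, ips_per_eni: int) -> int:
    """Nodes a subnet can hold if every node attaches and fills all of its ENIs."""
    return usable_subnet_ips // (enis * ips_per_eni)


T3_MEDIUM = {"enis": 3, "ips_per_eni": 6}

print("t3.medium max pods:", max_pods(**T3_MEDIUM))                 # 17
print("t3.medium nodes per /24:", nodes_supported(251, **T3_MEDIUM))  # 13
```

The second figure is a worst case; `WARM_ENI_TARGET` and `WARM_IP_TARGET` control how many addresses each node reserves ahead of demand, so lower warm targets let a subnet hold more lightly loaded nodes.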
SAA-C03 Exam Focus on EKS Multi-AZ Node Groups
Objective Code: Domain 2, Design Resilient Architectures; Task Statement 2.2: Design highly available and/or fault-tolerant architectures. Specifically, the exam tests your ability to design high-availability solutions using EKS with Multi-AZ node groups.
What the Exam Tests:
You must know that Multi-AZ node groups require multiple subnets in different Availability Zones.
You must understand that EKS automatically labels nodes with the AZ, but you must use Kubernetes topology spread constraints to actually spread pods.
You need to know that the Cluster Autoscaler works with node groups to add/remove nodes, but it does not guarantee AZ balance.
You should be able to differentiate between managed and self-managed node groups. Managed node groups simplify updates and scaling.
Common Wrong Answers:
1. "Creating a node group with multiple subnets automatically spreads pods across AZs." – This is false. Nodes are spread, but pods are not unless you use constraints.
2. "You need to configure the ASG to distribute instances evenly across AZs." – The ASG does this by default, but the exam might ask about pod placement, not node placement.
3. "Multi-AZ node groups are only possible with self-managed node groups." – Managed node groups also support multiple subnets.
4. "You must create a separate node group per AZ to achieve Multi-AZ." – Not required; a single node group with multiple subnets works.
Specific Numbers and Terms:
Subnet requirement: at least 2 subnets in different AZs.
Node group scaling: min, max, desired size.
AMI types: Amazon Linux (AL2_x86_64, or AL2023_x86_64_STANDARD on newer Kubernetes versions) or Bottlerocket (BOTTLEROCKET_x86_64).
Topology spread constraint keys: topology.kubernetes.io/zone and topology.kubernetes.io/region.
maxSkew: a required field with no default in a pod's constraint; 1 is the strictest and most commonly used value.
Edge Cases:
If you specify only one subnet, the node group is single-AZ. The exam may present a scenario where you need high availability, and the answer is to add subnets in other AZs.
If you use spot instances, the ASG may not be able to maintain capacity in all AZs during a spot shortage. The exam might test that you should use on-demand instances for critical workloads.
The Cluster Autoscaler may not scale down nodes that have pods with PodDisruptionBudget that prevents eviction.
How to Eliminate Wrong Answers:
Focus on the mechanism: node distribution is done by ASG; pod distribution is done by Kubernetes scheduler with constraints.
If the question mentions "high availability" and "EKS", look for answers that include multiple subnets and topology spread constraints.
If the question asks about "resilience to AZ failure", the correct answer will involve multiple AZs and pod anti-affinity or topology spread constraints.
If the answer mentions "placing all nodes in one AZ", it's wrong for high availability.
Multi-AZ node groups require at least two subnets in different Availability Zones.
EKS automatically labels nodes with `topology.kubernetes.io/zone`.
Pod topology spread constraints (e.g., `maxSkew: 1`) are necessary to spread pods across AZs.
The ASG distributes nodes across subnets, but does not guarantee strict balance.
Managed node groups simplify updates and scaling compared to self-managed.
Cluster Autoscaler works with node groups to add/remove nodes based on pod resource requests.
For high availability, always combine Multi-AZ node groups with pod anti-affinity or topology spread constraints.
These come up on the exam all the time. Here's how to tell them apart.
Single Multi-AZ Node Group
Single ASG manages all nodes across AZs.
Simpler management – one node group to update.
Node distribution is automatic but can become uneven.
Cluster Autoscaler scales the entire group uniformly.
Best for uniform workloads with simple scaling.
Separate Node Groups per AZ
Multiple ASGs, one per AZ.
More granular control over node sizing and scaling per AZ.
Node distribution is strictly per AZ; you control capacity per AZ.
Cluster Autoscaler can scale each node group independently.
Better for workloads that need AZ-specific capacity or spot instance diversification.
Mistake
Creating a Multi-AZ node group automatically distributes pods across AZs.
Correct
Nodes are distributed, but pods are scheduled by Kubernetes. Without topology spread constraints or pod anti-affinity, pods can all land on nodes in one AZ. You must explicitly configure pod placement to spread across AZs.
Mistake
You need separate node groups per AZ to achieve Multi-AZ resilience.
Correct
A single node group with multiple subnets (each in a different AZ) will distribute nodes across those AZs. Separate node groups per AZ are optional and provide more control but are not required.
Mistake
The ASG guarantees even distribution of nodes across AZs at all times.
Correct
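The ASG attempts to keep the number of instances balanced, but if it cannot launch in one AZ (for example, because of insufficient capacity) it launches in another, and scale-in or Spot interruptions can skew the distribution further. The ASG rebalances when it can, and its Availability Zone distribution strategy can be set to 'balanced best effort' or 'balanced only'. Capacity Rebalancing is a separate setting that proactively replaces Spot Instances at elevated risk of interruption.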
Mistake
Managed node groups do not support Multi-AZ.
Correct
Managed node groups fully support multiple subnets. You specify the subnets when creating the node group, and EKS manages the ASG accordingly.
Mistake
You must use the AWS VPC CNI plugin for Multi-AZ to work.
Correct
The VPC CNI plugin is the default and recommended, but you can use other CNI plugins (e.g., Calico, Cilium). Multi-AZ node groups work at the infrastructure level, independent of the CNI plugin.
You create a node group using the AWS CLI, Console, or CloudFormation, specifying at least two subnets in different Availability Zones. For example, use `aws eks create-nodegroup --subnets subnet-abc subnet-def`. The node group will be backed by an Auto Scaling group that distributes instances across those subnets.
No. EKS distributes worker nodes across AZs, but the Kubernetes scheduler does not automatically spread pods. You must use pod topology spread constraints or pod anti-affinity rules in your deployment manifests to ensure pods are spread across AZs.
Yes, you can specify spot instances in the launch template. However, spot capacity can be limited in certain AZs. To improve resilience, you can use multiple instance types or a mix of on-demand and spot instances. The ASG will attempt to maintain the desired capacity across AZs, but spot interruptions may cause imbalance.
Both support Multi-AZ. Managed node groups automatically handle updates, scaling, and health checks. Self-managed node groups give you full control over the ASG and launch configuration, but you must manage updates and patching yourself. For the exam, managed node groups are preferred for simplicity.
Use `kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.'topology\.kubernetes\.io/zone'`. This will show the AZ label for each node. Alternatively, check the EC2 console for the instances' placement.
If an AZ fails, the nodes in that AZ become unreachable and are marked `NotReady`. After the eviction timeout (300 seconds by default), their pods are marked `Terminating`, the workload controllers create replacements, and the scheduler places them onto healthy nodes in other AZs, provided there is sufficient capacity. The ASG will also launch replacement instances in the remaining AZs to maintain the node group's desired size.
Yes, Cluster Autoscaler works with node groups. It will scale up the ASG when there are pending pods that cannot be scheduled. However, it does not consider AZ balance; it simply adds nodes to the ASG, which then distributes them across subnets. For better AZ balance, consider using separate node groups per AZ.