CNCF · Kubernetes · Cloud Native · Intermediate · 22 min read

What Is Horizontal Pod Autoscaling in Cloud Computing?

Also known as: Horizontal Pod Autoscaling, Kubernetes HPA, CNCF KCNA, autoscaling, pod scaling

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

Quick Definition

Horizontal Pod Autoscaling is a feature in Kubernetes that automatically increases or decreases the number of running copies of an application based on demand. Think of it like a smart system that hires more workers when the line gets long and sends them home when things slow down. This helps keep applications responsive without wasting resources.

Must Know for Exams

The CNCF KCNA (Kubernetes and Cloud Native Associate) exam tests foundational knowledge of Kubernetes concepts, including autoscaling. The exam objectives list scaling as a key capability of Kubernetes clusters. Candidates are expected to understand the purpose of Horizontal Pod Autoscaling, the metrics it uses, and how it differs from vertical scaling and cluster scaling.

Questions may ask you to identify the correct configuration for an HPA resource, or choose when HPA is appropriate versus manual scaling. The CKA (Certified Kubernetes Administrator) exam goes deeper. You may be asked to create an HPA for an existing deployment, modify HPA parameters, or troubleshoot why an HPA is not scaling.

The CKAD (Certified Kubernetes Application Developer) exam focuses on designing applications that work well with HPA, such as ensuring applications are stateless and can handle rapid scaling. For all these exams, understanding the relationship between resource requests, limits, and HPA targets is critical. A common exam scenario asks you to set a CPU target utilization of 80 percent.

If candidates forget that this percentage is relative to the pod's CPU request, not the node's CPU, they will calculate the wrong replica count. The exams also test knowledge of the metrics server, which must be installed and running for HPA to work. Troubleshooting questions might involve checking if the metrics server is reachable, if the HPA is reading the correct metrics, or why scaling is not happening.

The objective is to ensure candidates can implement and maintain autoscaling in real clusters. The CNCF KCNA exam may also ask conceptual questions about the benefits of horizontal versus vertical scaling. For example, horizontal scaling adds more instances, which improves resilience and redundancy, while vertical scaling adds more power to a single instance, which has a hard limit. Knowing when to use each is part of the exam.

Simple Meaning

Imagine you run a busy food truck. Some days, a huge crowd lines up for lunch. Other days, only a few people come by. If you have only one cook, on busy days customers wait forever and leave unhappy.

If you keep five cooks every day, you pay wages for people who stand around doing nothing on slow days. Horizontal Pod Autoscaling, or HPA, is like having a manager who watches the line length and calls in extra cooks from a nearby pool only when the line gets long. When the rush ends, the manager sends the extra cooks back.

In Kubernetes, each cook is a pod, the smallest deployable unit, which wraps one or more containers running your application. The manager is the HPA controller. It checks metrics like CPU usage or memory consumption every 15 seconds by default.

If the average CPU usage across all pods goes above a target you set, say 70 percent, the HPA creates more pod copies. If usage drops below that target for a while, it removes some copies. This keeps your app fast during spikes and saves money during quiet times.

You do not have to guess how many copies you need or wake up at 3 AM to add more. The HPA does it automatically. It works like a thermostat for your app capacity instead of temperature. You set the desired level, and the system adjusts to maintain it.

Full Technical Definition

Horizontal Pod Autoscaling is a Kubernetes feature defined by the autoscaling API group. It is implemented as a control loop that runs in the kube-controller-manager component. The HPA controller periodically queries the metrics server or custom metrics API to collect resource utilization data for the pods managed by a specific ReplicaSet, Deployment, or similar workload resource.

By default, the controller syncs every 15 seconds, but this interval can be configured. The algorithm used by HPA is based on the ratio of current metric value to the desired metric value. The number of desired replicas is calculated as ceil(currentReplicas * (currentMetricValue / desiredMetricValue)).

For example, if you set a target CPU utilization of 80 percent and current utilization is 100 percent, the ratio is 100 / 80 = 1.25, so the HPA aims to increase replicas by approximately 25 percent. The resulting count is then clamped to the minReplicas and maxReplicas bounds and filtered through stabilization windows. The stabilization window prevents thrashing, where rapid fluctuations in load cause continuous scaling up and down.
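The replica formula above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only, not the controller's implementation; the function name `desired_replicas` and the sample numbers are assumptions made for the example.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Apply ceil(currentReplicas * (currentMetricValue / desiredMetricValue))."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return math.ceil(current_replicas * (current_value / target_value))


# 4 replicas averaging 100% utilization against an 80% target: ratio 1.25
print(desired_replicas(4, 100, 80))   # 5

# 10 replicas averaging 40% utilization against an 80% target: ratio 0.5
print(desired_replicas(10, 40, 80))   # 5
```

Both metric values must be expressed in the same unit (both percentages, or both absolute amounts) for the ratio to be meaningful.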

The default stabilization window for scale-down is 5 minutes: the controller acts on the highest replica recommendation it computed during the past five minutes, so replicas are removed only after load has stayed low for that long. For scale-up, the default window is zero so the HPA responds quickly to load increases. The HPA supports three types of metrics: resource metrics like CPU and memory utilization, custom metrics from applications exposed through the Kubernetes custom metrics API, and external metrics from systems outside Kubernetes like queue length in a message broker.

Each metric type has its own configuration structure. When multiple metrics are specified, the HPA computes the desired replica count for each metric and then uses the largest count to ensure all targets are met. The HPA works with the cluster autoscaler in larger deployments.

While HPA scales pods within the existing node pool, the cluster autoscaler adds or removes nodes when pods cannot be scheduled due to resource constraints. This layered approach provides both application-level and infrastructure-level scaling. In configuration, the scaleTargetRef field specifies the workload resource to scale, such as a Deployment named frontend.

The minReplicas and maxReplicas fields define the lower and upper bounds. The metrics array contains the metric definitions. For resource metrics, the type is Resource, and the name is either cpu or memory.

The target type can be Utilization or AverageValue. Utilization is expressed as a percentage of the pod's resource request. This is important because the target percentage is relative to the requested amount, not the node capacity.

For custom or external metrics, the target type is Value or AverageValue, and the value is an absolute amount, for example 100 queries per second.
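To make the configuration fields concrete, here is a sketch that builds an HPA manifest as a Python dictionary and prints it as JSON. It assumes the autoscaling/v2 API, where the workload reference field is named `scaleTargetRef`. The Deployment name frontend comes from the text above; the replica bounds and the 80 percent target are illustrative values, not recommendations.

```python
import json

# Illustrative HPA manifest for a Deployment named "frontend".
# minReplicas, maxReplicas and the 80% target are example values only.
hpa_manifest = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "frontend"},
    "spec": {
        # The workload whose replica count the HPA manages.
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "frontend",
        },
        "minReplicas": 2,
        "maxReplicas": 10,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    # Utilization is a percentage of each pod's CPU request.
                    "target": {"type": "Utilization", "averageUtilization": 80},
                },
            }
        ],
    },
}

print(json.dumps(hpa_manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output could be saved to a file and applied with kubectl apply -f.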

Real-Life Example

Think about a busy airport security checkpoint. During normal hours, five lanes are open. Each lane has one security officer checking boarding passes and another operating the X-ray machine.

When a big international flight lands, hundreds of passengers arrive at once. The line grows fast. An airport supervisor watches a screen that shows how many people are waiting and how long the wait time is.

When the wait time exceeds ten minutes, the supervisor calls for more officers to open additional lanes. The airport can open up to fifteen lanes maximum. Once the crowd passes and wait times drop below five minutes, the supervisor gradually closes extra lanes, sending officers to breaks or other duties.

In this scenario, the supervisor is the HPA controller. The security lanes are pods. The number of lanes is the replica count. The wait time is the metric, similar to CPU utilization.

The target is ten minutes maximum wait time. The five lanes are the minimum replicas, and fifteen is the maximum. The supervisor cannot open lanes instantly: each takes about two minutes to set up, much as a new pod needs time to start before it can serve traffic.

Also, the supervisor does not close a lane immediately after one short line. They wait a few minutes to be sure the rush is over, which matches the HPA scale-down stabilization window. This prevents constant opening and closing, which would confuse staff and waste effort. The airport also has a higher-level manager who can add more terminal gates if all fifteen lanes are full and passengers are still waiting.

That is the cluster autoscaler adding nodes. The entire system ensures passengers get through quickly during rushes without the airport paying for idle officers during slow hours.

Why This Term Matters

Horizontal Pod Autoscaling matters because modern applications experience variable traffic patterns. A retail website might have ten times the traffic during a flash sale compared to a Tuesday afternoon. Without HPA, an operator would need to manually adjust the number of pods, which is slow, error prone, and often too late.

If traffic spikes suddenly, the application becomes slow or crashes, losing revenue and damaging reputation. If traffic drops, running too many pods wastes money on cloud compute resources. HPA solves both problems automatically.

In real IT work, teams set up HPA for stateless applications like web servers, APIs, and worker queues. This reduces operational burden because engineers do not need to monitor load continuously or be on call for scaling events. It also improves reliability because the system reacts faster than a human can.

Cloud costs become more predictable and efficient because resources match demand closely. HPA also works alongside other Kubernetes features. For example, minReplicas sets the floor that scale-down never goes below, while a pod disruption budget separately limits voluntary evictions such as node drains. An HPA scale-down is not an eviction, so it is bounded by minReplicas rather than by the budget.

In production environments, HPA is often paired with the cluster autoscaler so that if the node pool runs out of capacity, new nodes are added automatically. This creates a fully elastic infrastructure. Engineers who understand HPA can design applications that handle load gracefully without manual intervention.

They also need to know the limitations, such as that HPA does not work well with stateful applications where adding replicas requires data synchronization, or with applications that have long startup times. In those cases, predictive scaling or custom metrics may be needed. Overall, HPA is a core tool for building resilient, cost effective, and scalable cloud native systems.

How It Appears in Exam Questions

Exam questions about Horizontal Pod Autoscaling appear in several formats. Scenario questions describe a situation where a web application experiences traffic spikes during business hours and low traffic at night. The question asks which Kubernetes feature can automatically adjust the number of pods to match the load. The answer is HPA.

Configuration questions give a YAML snippet for an HPA and ask which metric type is being used, or what the target replica count will be given a current metric value. For example, if a deployment has 4 pods, each with a CPU request of 500 millicores, and the current CPU usage is 1600 millicores total, with a target of 80 percent utilization, the candidate must calculate the desired number of replicas. The correct answer is 4, because 1600 millicores out of a total request of 2000 millicores is 80 percent, so no scaling is needed.

Troubleshooting questions describe an HPA that is not scaling up despite high CPU usage. The candidate must check whether the metrics server is installed, whether the deployment has resource requests set, or whether the HPA scaleTargetRef points to the correct resource.

Architecture questions ask which component runs the HPA control loop, which is the kube-controller-manager. Other questions compare HPA with Vertical Pod Autoscaling and cluster autoscaling, asking which one is best for a given workload. Some questions ask about stabilization windows, such as what the default cooldown period is for scaling down. The answer is 5 minutes.

Multi-part questions might combine HPA with pod disruption budgets or resource quotas to test understanding of how these constraints interact. For the KCNA exam, questions are more conceptual, such as identifying the correct statement about HPA from a list of options.

For CKA and CKAD, the question format may include a terminal simulation where the candidate runs commands like kubectl autoscale deployment frontend --cpu-percent=50 --min=1 --max=10, or edits an HPA YAML file to add a custom metric.

Example Scenario

A company runs an online ticketing platform for concerts. Their application is deployed on Kubernetes with three replicas. On Monday morning, a popular band announces a surprise tour.

Thousands of fans rush to the site to buy tickets. The application's CPU usage jumps from 30 percent to 95 percent across all pods. The response time slows from 200 milliseconds to 5 seconds.

The company has an HPA configured with a target CPU utilization of 70 percent, minimum replicas of 3, and maximum replicas of 20. The HPA controller sees that the current usage is above the target. It calculates the desired replicas as currentReplicas times currentMetric divided by desiredMetric, which is 3 multiplied by 95 divided by 70, equaling about 4.07.

Since the result is fractional, the HPA rounds up to 5. Within two minutes, the deployment scales to 5 pods. CPU usage per pod drops to about 57 percent. The site becomes responsive again.

Over the next hour, as more fans join, usage rises again. The HPA continues scaling up until it reaches 18 pods. After the ticket sale ends, traffic drops sharply. The HPA waits five minutes (the scale-down stabilization window), then reduces replicas until it is back at 3.

The company did not need an engineer to watch the system or add capacity manually. The HPA handled the entire surge automatically, keeping the site fast and avoiding unnecessary costs after the rush.

Common Mistakes

Believing the HPA target CPU percentage is based on the node's total CPU capacity.

The HPA target for CPU utilization is calculated as a percentage of the pod's resource request, not the node's total CPU. If a pod requests 500 millicores, a target of 80 percent means the HPA will aim for 400 millicores used per pod. Confusing these leads to incorrect scaling behavior.

Always set CPU requests for your pods and remember that the HPA target percentage applies to that requested amount. Check the pod's resource request, not the node capacity, when reading HPA metrics.
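The difference between request-based and node-based percentages can be checked with a short calculation. The sketch below reuses the figures from the exam example earlier on this page (4 pods, 500 millicore requests, 1600 millicores of total usage, 80 percent target). The 4000 millicore node capacity is a hypothetical number added only to show the mistaken calculation.

```python
import math


def cpu_utilization_percent(total_usage_m: int, replicas: int, request_m: int) -> float:
    """CPU utilization as a percentage of the pods' combined CPU requests (millicores)."""
    total_request_m = replicas * request_m
    return 100 * total_usage_m / total_request_m


replicas, request_m, target_percent = 4, 500, 80

utilization = cpu_utilization_percent(1600, replicas, request_m)
desired = math.ceil(replicas * utilization / target_percent)
print(utilization, desired)   # 80.0 4

# The common mistake: dividing by node capacity (hypothetical 4000m node) instead.
wrong_utilization = 100 * 1600 / 4000
print(wrong_utilization)      # 40.0
```

Measured against requests, the pods sit exactly on target and no scaling is needed. Measured against the node, the same load looks like 40 percent, which is not the number the HPA works with.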

Thinking HPA works without a metrics server installed.

HPA relies on the metrics server to collect resource usage data from the kubelet on each node. Without the metrics server, the HPA controller cannot get CPU or memory metrics, so it will never trigger scaling. Many beginners assume HPA works out of the box.

Install the Kubernetes metrics server in your cluster before configuring HPA. Verify it is running with kubectl top pods to see resource usage.

Setting minReplicas too high or maxReplicas too low, which defeats the purpose of autoscaling.

If minReplicas is set too high, the application always runs many pods even when idle, wasting resources. If maxReplicas is too low, the HPA cannot scale enough to handle traffic spikes, leading to performance issues. Both extremes reduce the benefit of autoscaling.

Analyze your traffic patterns and set minReplicas to the number needed for baseline load. Set maxReplicas to a value that can handle peak load plus a safety margin. Use load testing to determine these numbers.

Expecting HPA to work instantly for scaling down.

HPA has a scale-down stabilization window, defaulting to 5 minutes, to prevent thrashing. Beginners often expect pods to be removed immediately when load drops. This delay is intentional and necessary for stable operations.

Understand that scaling down is intentionally slower than scaling up. If you need faster scale-down, you can adjust the stabilization window using the behavior field in the HPA spec, but be cautious about causing instability.
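The effect of the scale-down window can be modelled with a small simulation. This is a simplified sketch, not the controller's code: it assumes the HPA remembers its recent replica recommendations and acts on the highest one still inside the window, which is why a single quiet reading does not remove pods. The class name `ScaleDownStabilizer` and the timestamps are made up for the example.

```python
from collections import deque


class ScaleDownStabilizer:
    """Return the highest replica recommendation seen inside the window."""

    def __init__(self, window_seconds: int = 300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now: int, recommendation: int) -> int:
        self.history.append((now, recommendation))
        # Forget recommendations that have aged out of the window.
        while self.history and self.history[0][0] <= now - self.window_seconds:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.stabilize(0, 18))    # 18: recommendation at peak load
print(stabilizer.stabilize(150, 3))   # 18: load dropped, but the peak is still in the window
print(stabilizer.stabilize(300, 3))   # 3: the peak has aged out, so scale-down proceeds
```

A higher recommendation always wins immediately in this model, which mirrors the default zero-second window for scaling up.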

Assuming HPA works well for stateful applications without additional considerations.

Stateful applications, such as databases with persistent volumes, cannot simply add and remove replicas without careful data synchronization. HPA is designed primarily for stateless workloads. Using it on stateful sets without proper configuration can cause data loss or corruption.

For stateful applications, consider alternatives like manual scaling, using a StatefulSet with careful index management, or using a database-specific scaling solution. Reserve HPA for stateless services like web servers and APIs.

Exam Trap — Don't Get Fooled

An exam question asks: You have a deployment with 2 pods, each with a CPU request of 1 core. The current CPU usage is 1.8 cores total. The HPA target is 80 percent CPU utilization.

How many pods will the HPA scale to? Many learners answer 2 because they judge 1.8 cores of total usage to be close enough to the 2 cores requested, or answer 5 because they plug the total usage into the formula next to a per-pod target. Use the HPA formula with per-pod values on both sides: desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)).

Here, currentReplicas is 2. CurrentMetricValue is the average usage per pod, 1.8 / 2 = 0.9 cores, which is 90 percent of the request. DesiredMetricValue per pod is 0.8 cores (80 percent of 1 core requested). So 2 * (0.9 / 0.8) = 2 * 1.125 = 2.25, and the ceiling is 3.

The HPA will scale to 3 pods. Always compare the per-pod average against the per-pod target before applying the formula.

Commonly Confused With

Horizontal Pod Autoscaling vs Vertical Pod Autoscaling (VPA)

Horizontal Pod Autoscaling adds or removes pod copies to handle load. Vertical Pod Autoscaling adjusts the CPU and memory requests (and, proportionally, the limits) of existing pods. HPA increases the number of workers, while VPA gives each worker more power. They solve different problems and can be used together, though generally not on the same CPU or memory metric, and VPA is more complex and often requires pod restarts.

For a web server that gets more traffic, HPA adds more server instances. For a database that is running out of memory, VPA increases the memory request and limit of the existing database pod.

Horizontal Pod Autoscaling vs Cluster Autoscaler

Horizontal Pod Autoscaling scales the number of pods inside a cluster. Cluster Autoscaler scales the number of nodes in the cluster. HPA works at the application level, while Cluster Autoscaler works at the infrastructure level. They are often used together, but they are separate components.

If HPA adds more pods but the current nodes are full, Cluster Autoscaler adds a new node to the cluster so the new pods can run. HPA handles app demand; Cluster Autoscaler handles capacity.

Horizontal Pod Autoscaling vs Manual Scaling via kubectl scale

Manual scaling requires a human to run a command like kubectl scale deployment frontend --replicas=5. HPA automates this decision based on metrics. Manual scaling does not react to load changes automatically and is not practical for dynamic environments.

If traffic spikes at 3 AM, manual scaling requires someone to be awake and monitoring. HPA scales up automatically within minutes, without human intervention.

Step-by-Step Breakdown

Define the workload resource

Before HPA can scale, you need a workload resource like a Deployment or StatefulSet that has resource requests set for CPU or memory. The HPA needs these requests to calculate utilization percentages. Without requests, HPA cannot determine the target value.

Install the metrics server

The Kubernetes metrics server collects resource metrics from each node's kubelet and exposes them through the metrics API. HPA reads this API to get CPU and memory usage. Without the metrics server, the HPA controller has no data and never scales.

Create the HPA resource

Define an HPA YAML file with specifications including the target workload, minimum and maximum replicas, and the metrics to monitor. The scaleTargetRef points to the Deployment or ReplicaSet. The metrics array defines what to measure, such as CPU at 80 percent utilization.

HPA controller reads metrics

The kube-controller-manager runs the HPA control loop, which periodically queries the metrics API. By default, this happens every 15 seconds. It collects the current average utilization for the pods managed by the target workload.

Calculate desired replicas

The HPA controller applies the formula: desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). It computes this for each metric defined. If multiple metrics are used, it takes the largest resulting replica count to ensure all targets are satisfied.

Apply scaling decision with stabilization

Before changing the replica count, the HPA checks stabilization windows. For scaling up, the window is typically 0 seconds so it responds fast. For scaling down, the default window is 5 minutes. This prevents rapid fluctuations in replica count. If the desired count is outside the min and max bounds, it is clamped to those values.

Update the workload resource

If the calculated replica count differs from the current count, the HPA updates the replicas field of the target Deployment or ReplicaSet. The Kubernetes controller manager then works to create or terminate pods to match the new desired count.
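The calculation steps above (one proposal per metric, take the largest, clamp to the bounds) can be sketched as a single function. This is a simplified model that leaves out stabilization windows and other details of the real controller; the function name `reconcile` and the sample metric values are assumptions for illustration.

```python
import math


def reconcile(current_replicas, metrics, min_replicas, max_replicas):
    """Compute one scaling decision.

    metrics is a list of (current_value, target_value) pairs, one per
    configured metric, with both values in the same unit.
    """
    proposals = [
        math.ceil(current_replicas * (current_value / target_value))
        for current_value, target_value in metrics
    ]
    # The largest proposal wins so that every target is satisfied.
    desired = max(proposals)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# CPU at 90% vs an 80% target proposes 5; memory at 50% vs 100% proposes 2.
print(reconcile(4, [(90, 80), (50, 100)], min_replicas=2, max_replicas=10))  # 5

# A proposal of 20 is clamped to maxReplicas.
print(reconcile(8, [(200, 80)], min_replicas=2, max_replicas=10))            # 10
```

Taking the maximum rather than an average is what guarantees that the most stressed metric drives the decision.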

Practical Mini-Lesson

To use Horizontal Pod Autoscaling effectively in a production environment, you must understand how it interacts with other Kubernetes components and the application itself. First, ensure every pod you intend to scale has resource requests defined. Without requests, the HPA cannot compute utilization percentages.
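A quick way to catch missing requests before creating an HPA is to scan the workload manifest. The sketch below assumes the Deployment has already been loaded into a Python dictionary (for example from JSON output of kubectl get deployment -o json). The helper name and the sample containers are hypothetical.

```python
def containers_missing_cpu_request(deployment: dict) -> list:
    """Return the names of containers that declare no CPU request."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing = []
    for container in containers:
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        if "cpu" not in requests:
            missing.append(container["name"])
    return missing


# Hypothetical Deployment with one container that has no requests.
example_deployment = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "web", "resources": {"requests": {"cpu": "500m"}}},
                    {"name": "sidecar"},
                ]
            }
        }
    }
}

print(containers_missing_cpu_request(example_deployment))  # ['sidecar']
```

Every container in the pod needs a request for the metric being targeted, so a forgotten sidecar is enough to leave a utilization target undefined.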

The requests should be based on realistic baseline usage, not arbitrary numbers. Use tools like kubectl top pods during normal and peak traffic to gather data. Set the HPA target slightly below the point where performance degrades.

For CPU, a target of 60 to 80 percent is common. For custom metrics like requests per second, set a target that keeps response times acceptable. The metrics server is mandatory for resource metrics, but for custom or external metrics you need an additional adapter, such as the Prometheus adapter, that serves the Kubernetes custom metrics or external metrics API.

Installation and configuration of these adapters is a common task for Kubernetes administrators. When configuring HPA, always set sensible minReplicas and maxReplicas. The minReplicas should handle baseline traffic with some redundancy.

The maxReplicas should be high enough to absorb peak traffic but not exceed your cluster's node capacity. Monitor the HPA status using kubectl describe hpa to see the current metrics, target values, and recent scaling events. Look for conditions like AbleToScale, ScalingActive, and ScalingLimited.

These tell you if the HPA is working, if metrics are being received, and if scaling is limited by maxReplicas. A common issue is that HPA shows unknown for metrics. This usually means the metrics server is not installed or the pods do not have resource requests.

Another issue is that scaling is too slow for sudden traffic bursts. In that case, consider a slightly lower target so that scale-up starts earlier, or use the behavior field to tune how fast replicas are added. The same field overrides the default stabilization windows: for example, you can set a 1 minute stabilization window for scale-down instead of 5 minutes if your traffic drops rapidly.

You can also set policies that cap how many pods, or what percentage of the current replica count, may be added or removed in a given period. HPA can be used with custom metrics from your application, such as queue depth in a message broker. This is often more responsive than CPU because CPU may lag behind the actual load.

For example, if you have a worker pod that processes jobs from a queue, you can configure HPA to scale based on the number of messages in the queue, using an external metrics provider. This gives more direct scaling signals. As a professional, you should also understand that HPA relies on the resource usage data being up to date.

If the metrics server has a delay, scaling decisions may be based on stale data. Use monitoring tools to verify that metrics are flowing correctly. Finally, test your HPA configuration with load testing tools before putting it in production.

Simulate traffic spikes and verify that HPA scales up as expected, then scales down without causing instability. Document your HPA configurations and review them regularly as traffic patterns change.
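The behavior tuning and queue-based scaling discussed in this lesson could look like the following spec fragment, again written as a Python dictionary and printed as JSON. It assumes the autoscaling/v2 field layout. The metric name queue_messages_ready, the target of 30 messages per pod, and the policy numbers are hypothetical, and an external metrics adapter would have to expose that metric for it to work.

```python
import json

# Fragment of an HPA "spec"; all values are illustrative.
hpa_spec_fragment = {
    "behavior": {
        "scaleDown": {
            # 1 minute instead of the default 5 minutes.
            "stabilizationWindowSeconds": 60,
        },
        "scaleUp": {
            "stabilizationWindowSeconds": 0,
            "policies": [
                # Add at most 4 pods per 15-second period.
                {"type": "Pods", "value": 4, "periodSeconds": 15},
            ],
        },
    },
    "metrics": [
        {
            "type": "External",
            "external": {
                # Hypothetical metric served by an external metrics adapter.
                "metric": {"name": "queue_messages_ready"},
                # Aim for roughly 30 queued messages per pod.
                "target": {"type": "AverageValue", "averageValue": "30"},
            },
        }
    ],
}

print(json.dumps(hpa_spec_fragment, indent=2))
```

Shortening the scale-down window makes the HPA follow falling traffic more closely, at the cost of more frequent replica changes, so values like these are worth validating under load tests first.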

Memory Tip

Remember HPA with the letters R-M-S: Requests must be set, Metrics must be available, and Stabilization prevents thrashing.

Covered in These Exams

cncf-kcna

Frequently Asked Questions

Can HPA work without a metrics server?

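Not for CPU or memory targets. HPA needs resource usage data to make scaling decisions, and the metrics server provides it by collecting CPU and memory usage from the kubelet on each node. Without it, an HPA based on resource metrics will not scale. Custom and external metrics come from separate adapters instead.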

Does HPA work with StatefulSets?

HPA can work with StatefulSets in theory, but it is not recommended for stateful applications that require persistent storage or ordered pod identity. Scaling stateful pods up or down can cause data issues if not handled carefully.

What is the difference between HPA and Vertical Pod Autoscaling?

HPA changes the number of pod replicas. Vertical Pod Autoscaling changes the CPU and memory requests (and, proportionally, the limits) of existing pods. HPA handles load by adding or removing copies, while VPA handles load by giving each pod more CPU or memory.

How often does HPA check metrics?

By default, the HPA control loop runs every 15 seconds. This interval can be adjusted by configuring the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period.

What happens if HPA cannot reach the metrics server?

The HPA controller will report unknown metrics and will not perform scaling. You may see warnings in the HPA status. The existing pod count will stay the same until metrics become available again.

Can I use multiple metrics in one HPA?

Yes. You can define multiple metrics in the HPA spec, such as CPU and memory, or CPU and a custom metric. The HPA will compute the desired replica count for each metric and apply the largest count to ensure all targets are met.

Does HPA respect pod disruption budgets when scaling down?

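Not directly. A PodDisruptionBudget limits voluntary evictions made through the eviction API, such as a node drain. When HPA reduces the number of replicas, the workload controller deletes the surplus pods itself, so the floor during scale-down is minReplicas rather than the budget. Set minReplicas high enough to keep the service available.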

Summary

Horizontal Pod Autoscaling is a fundamental Kubernetes feature that automatically adjusts the number of running application replicas based on real-time metrics like CPU, memory, or custom signals. It eliminates the need for manual scaling, reduces operational overhead, and ensures applications remain responsive under variable load while keeping infrastructure costs aligned with actual demand. For IT certification exams like the CNCF KCNA, CKA, and CKAD, understanding HPA is critical.

You must know how it calculates desired replicas, the importance of resource requests, the role of the metrics server, and the stabilization mechanisms that prevent thrashing. Common mistakes include misinterpreting the target metric as node-level rather than per-pod, forgetting to install the metrics server, and setting inappropriate min and max replicas. By mastering HPA, you gain a practical skill for building elastic, cloud native systems and a clear advantage in your certification journey.
