ACE · Chapter 50 of 101 · Objective 3.1

GKE Horizontal Pod Autoscaler

Without automatic scaling, a sudden load spike can crash your GKE application—that's why the Horizontal Pod Autoscaler (HPA) is a critical component for autoscaling workloads in GKE. For the ACE exam, HPA appears in roughly 5-8% of questions, often integrated with Cluster Autoscaler and node pool configuration. You will learn the exact mechanism, configuration syntax, default values, and common pitfalls tested on the exam. This chapter ensures you understand not just how to create an HPA, but how it decides when to scale, what metrics it uses, and how to troubleshoot it.

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi· Senior Network & Security Engineer · MSc IT Security

The Automated Coffee Shop Manager

Picture a busy coffee shop with several baristas at the counter. The shop manager follows a rule: if the queue of customers waiting for coffee exceeds 10 people for more than 30 seconds, the manager calls in an extra barista from the back. If the queue drops below 3 people for more than 1 minute, the manager sends one barista home. The manager checks the queue length every 15 seconds.

The Horizontal Pod Autoscaler (HPA) works in much the same way. The 'queue length' is the metric being watched (like CPU utilization). The 'baristas' are Pod replicas. The manager is the HPA controller. The thresholds (10 customers, 3 customers) stand in for the target metric value, although HPA uses a single target per metric rather than separate high and low thresholds. The check interval (15 seconds) is the default sync period. The waiting periods (30 seconds to add, 1 minute to remove) are the stabilization windows.

If the manager only checked the queue once every 5 minutes, the shop would be overwhelmed before help arrived. Similarly, if HPA syncs too slowly, it fails to react to load spikes. If the manager sends a barista home the moment the queue drops to 2 and a rush then arrives, customers wait too long. HPA's stabilization window prevents this kind of thrashing.

The manager also cannot add unlimited baristas: the owner sets a maximum. HPA has the same bounds in its 'maxReplicas' and 'minReplicas' configuration. The manager scales in proportion to the overload: one extra barista for every 5 customers above 10, up to a maximum of 10 baristas. HPA is proportional too, computing desired replicas with the formula desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]. If the current CPU is 200% of target, HPA doubles the replicas. If it is 50%, it halves them.

Just as the manager only acts when conditions persist, HPA uses stabilization windows to avoid reacting to transient changes. The manager might also watch signals other than the queue, such as the time of day; likewise, HPA can use custom metrics like requests per second. The key is that both systems are reactive, proportional, and constrained by limits to ensure stability.

How It Actually Works

What is the Horizontal Pod Autoscaler?

The Horizontal Pod Autoscaler (HPA) is a Kubernetes resource that automatically scales the number of Pod replicas in a Deployment, StatefulSet, or ReplicaSet based on observed CPU utilization, memory utilization, or custom metrics. In GKE, HPA is the standard Kubernetes HorizontalPodAutoscaler resource rather than a separate Google product; GKE adds integration with Google Cloud Monitoring for custom metrics.

Why does HPA exist?

Applications experience variable load. Without autoscaling, you either over-provision (wasting money) or under-provision (causing performance degradation). HPA dynamically adjusts replicas to maintain target metrics, balancing cost and performance.

How HPA works internally

The HPA controller is a control loop that runs in the kube-controller-manager. By default, it checks metrics every 15 seconds (--horizontal-pod-autoscaler-sync-period flag). The flow is:

1. Metrics collection: HPA queries the metrics APIs (metrics.k8s.io for resource metrics, custom.metrics.k8s.io for custom metrics). In GKE, these are backed by the Metrics Server (for resource metrics) and the Google Cloud Monitoring adapter (for custom metrics).

2. Desired replica calculation: For each metric specified, HPA computes a desired replica count using the formula desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]. For example, with 4 current replicas, a current CPU usage of 200m (millicores), and a CPU target of 100m, desiredReplicas = ceil[4 * (200/100)] = ceil[8] = 8.

3. Multiple metrics: If multiple metrics are defined, HPA takes the largest desired replica count among them.

4. Stabilization window: Before scaling, HPA applies stabilization windows. By default the scale-up window is 0 seconds and the scale-down window is 300 seconds (5 minutes), which prevents flapping. You can override either one per HPA via the behavior field.

5. Rate limiting: Current Kubernetes versions have no separate cooldown timer after a scaling action. The cluster-wide 5-minute scale-down default comes from the --horizontal-pod-autoscaler-downscale-stabilization flag, and scale-up is not delayed by default. How far a single step can move the replica count is governed by the scaling policies in the behavior field.

6. Act on target: The HPA updates the target workload's replicas field. The workload's controller (for example, the Deployment controller) then creates or deletes Pods.
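The calculation in steps 2 and 6 can be sketched in a few lines of Python. This is a simplified illustration, not the controller's actual code: the function name and its default bounds are hypothetical, and it ignores stabilization windows, scaling policies, and the controller's tolerance band.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10):
    """Apply the HPA formula, then clamp to [min_replicas, max_replicas].

    The function name and default bounds are illustrative assumptions.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * (current_value / target_value))
    return max(min_replicas, min(raw, max_replicas))

# The worked example from step 2: 4 replicas, 200m used, 100m target.
print(desired_replicas(4, 200, 100))                  # 8
# Same load, but maxReplicas=6 caps the result.
print(desired_replicas(4, 200, 100, max_replicas=6))  # 6
# A metric value of 0 yields 0, which is raised to minReplicas.
print(desired_replicas(4, 0, 100))                    # 1
```

The clamp at the end is what shows up as the "ScalingLimited" condition when the raw result falls outside the configured bounds.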

Key components and defaults

Metrics Server: Must be installed in the cluster for resource metrics. GKE clusters with default settings have Metrics Server enabled.

Custom Metrics Adapter: For custom metrics, you need an adapter like the Google Cloud Monitoring adapter (stackdriver-adapter) or Prometheus adapter.

Default sync period: 15 seconds.

Default scale-up stabilization: 0 seconds. The --horizontal-pod-autoscaler-downscale-stabilization flag does not apply to scale-up; scale-up has no built-in stabilization by default, but you can add one via behavior.scaleUp.stabilizationWindowSeconds.

Default scale-down stabilization: 5 minutes (configurable via behavior.scaleDown.stabilizationWindowSeconds).

Target metric types: Utilization (percentage of request) or AverageValue (absolute value).

Min and max replicas: maxReplicas is a required field in the HPA spec; minReplicas is optional and defaults to 1.

Configuration and verification commands

Create an HPA using kubectl:

kubectl autoscale deployment my-deployment --cpu-percent=80 --min=1 --max=10

Or define a YAML manifest:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80

Verify HPA status:

kubectl get hpa my-hpa -o yaml
kubectl describe hpa my-hpa

The describe output shows current metrics, desired replicas, and conditions like "AbleToScale", "ScalingActive", "ScalingLimited".

Interaction with related technologies

Cluster Autoscaler: HPA scales Pods within a node pool. If the node pool runs out of capacity, Pods may remain pending. Cluster Autoscaler then adds nodes to the node pool. This is a common exam scenario: HPA triggers Pod scale-up, which triggers Cluster Autoscaler to add nodes.

Vertical Pod Autoscaler (VPA): VPA adjusts resource requests/limits of Pods. HPA and VPA should not be used together on the same metric (e.g., CPU) because they conflict. The exam tests this incompatibility.

Node auto-provisioning: In GKE, with node auto-provisioning enabled, the cluster autoscaler can create new node pools with different machine types to accommodate Pods that cannot schedule.

Custom metrics via Google Cloud Monitoring: You can use custom metrics like requests per second from Cloud Monitoring. This requires the custom-metrics-stackdriver-adapter to be deployed.

Advanced: Behavior and policies

The autoscaling/v2 API allows fine-grained control via the behavior field. You can set stabilization windows, scale-up and scale-down policies (of type "Pods" or "Percent"), and a selectPolicy that chooses among them. For example:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    selectPolicy: Max

This limits scale-down to 10% of current replicas per minute, and scale-up to 4 Pods per minute.
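To see what those two policies allow in a single 60-second period, here is a rough Python sketch. The function names are hypothetical, each direction has one policy, and rounding the Percent change up is an assumption of this sketch; the real controller also accounts for scaling events that already happened within the period.

```python
def max_change(current, policy):
    """Largest replica change one policy allows in a single period."""
    if policy["type"] == "Pods":
        return policy["value"]
    # Percent: share of the current replica count, rounded up (assumed rounding).
    return -(-current * policy["value"] // 100)

def rate_limited(current, desired, scale_up, scale_down):
    """Move toward `desired`, but no further than the policy permits."""
    if desired > current:
        return min(desired, current + max_change(current, scale_up))
    if desired < current:
        return max(desired, current - max_change(current, scale_down))
    return current

# Policies mirroring the behavior block above.
scale_down = {"type": "Percent", "value": 10, "periodSeconds": 60}
scale_up = {"type": "Pods", "value": 4, "periodSeconds": 60}

print(rate_limited(50, 20, scale_up, scale_down))  # 45: at most 10% of 50 removed
print(rate_limited(50, 60, scale_up, scale_down))  # 54: at most 4 Pods added
print(rate_limited(50, 52, scale_up, scale_down))  # 52: within the limit
```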

Exam-relevant details

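Specify the target CPU utilization explicitly. Per the Kubernetes API reference, an autoscaling/v2 HPA created with no metrics block falls back to a default of 80% average CPU utilization, but that value is not tuned to your workload, so do not rely on it.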

HPA utilization targets work only if the Pod's containers have resource requests defined. Without requests, the HPA controller cannot express usage as a percentage of the request. This is a common exam trap.

The HPA uses the average utilization across all Pods. If a Pod is not yet ready, its metrics are excluded.

For custom metrics, the target can be AverageValue (the metric divided by the number of Pods, compared per Pod) or Value (the raw metric compared directly to the target). The exam tests the difference.

The HPA controller scales based on the metric value at the time of the sync. It does not predict future load.

If the desired replica count exceeds maxReplicas, HPA caps at maxReplicas. The condition "ScalingLimited" becomes true.

If the HPA cannot fetch metrics, it does not scale and reports an event. The condition "ScalingActive" becomes false.

Walk-Through

Define HPA with target metrics

You create an HPA resource pointing to a target workload (Deployment, StatefulSet, or ReplicaSet). You specify minReplicas, maxReplicas, and one or more metrics. Each metric defines a target value. For CPU and memory, the target is typically a utilization percentage (e.g., 80%). For custom metrics, it can be an average value (e.g., 100 requests per second per Pod). The HPA controller uses this configuration to know what to monitor and what thresholds trigger scaling.

Metrics Server collects pod metrics

The Metrics Server, a cluster-level component, scrapes resource usage from Kubelets on each node. It collects CPU and memory usage for every Pod every 15 seconds. The data is aggregated and exposed via the metrics.k8s.io API. If the Metrics Server is not running (e.g., in a custom cluster), HPA cannot obtain resource metrics and will not scale. In GKE, Metrics Server is installed by default on clusters version 1.8 and later. For custom metrics, an adapter like the Stackdriver adapter polls Google Cloud Monitoring.

HPA queries metrics API

Every 15 seconds (default sync period), the HPA controller reads the current metric values from the metrics API. For resource metrics, it retrieves the current CPU or memory usage for each Pod. For custom metrics, it queries the custom.metrics.k8s.io API. The controller calculates the average utilization across all ready Pods. If a Pod is not ready (e.g., still starting), its metrics are excluded from the average. This prevents scaling based on incomplete data.
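As an illustration of that averaging step, the following Python sketch computes utilization as total usage over total requests for ready Pods only. The dictionary keys and the function name are hypothetical, and the real controller handles unready Pods and missing metrics with more nuance than this.

```python
def average_utilization(pods):
    """Return average CPU utilization (%) across ready Pods, or None.

    Each pod is a dict with hypothetical keys:
      usage_m   - current CPU usage in millicores
      request_m - CPU request in millicores (None if no request is set)
      ready     - whether the Pod is Ready
    """
    ready = [p for p in pods if p["ready"]]
    if not ready or any(p["request_m"] is None for p in ready):
        return None  # utilization is undefined without requests
    total_usage = sum(p["usage_m"] for p in ready)
    total_request = sum(p["request_m"] for p in ready)
    return 100 * total_usage / total_request

pods = [
    {"usage_m": 120, "request_m": 100, "ready": True},
    {"usage_m": 90, "request_m": 100, "ready": True},
    {"usage_m": 150, "request_m": 100, "ready": True},
    {"usage_m": 10, "request_m": 100, "ready": False},  # still starting: excluded
]
print(average_utilization(pods))  # 120.0

# A Pod without a CPU request makes utilization impossible to compute.
print(average_utilization([{"usage_m": 120, "request_m": None, "ready": True}]))  # None
```

The second call mirrors the missing-requests trap: with no request to divide by, there is no percentage for HPA to compare against its target.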

Calculate desired replica count

For each metric, HPA computes desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]. For utilization targets, the currentMetricValue is the average utilization percentage across Pods. For example, if currentReplicas=5, average CPU utilization=120%, target=80%, then desiredReplicas = ceil[5 * (120/80)] = ceil[7.5] = 8. If multiple metrics are defined, HPA takes the largest desiredReplicas among them. This ensures that if any metric indicates more replicas are needed, the HPA scales up to meet that demand.

Apply stabilization and policies

Before acting, HPA checks stabilization windows. The default scale-down stabilization is 5 minutes: HPA keeps the recommendations from the last 5 minutes and acts on the highest one, so replicas drop only after the desired count has stayed lower for the whole window. Scale-up stabilization defaults to 0. Additionally, scaling policies (e.g., max 10% of Pods per minute) can limit the rate of change. When several policies apply, selectPolicy decides between them: Max picks the policy that allows the most change, Min picks the one that allows the least, and Disabled turns off scaling in that direction. If the desired count is within the allowed change, HPA proceeds. Otherwise, it moves only as far as the policy allows.
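The scale-down window can be pictured as a rolling history of recommendations from which the highest value wins. The class below is a minimal sketch under that assumption; its name and method are hypothetical, and it leaves out scaling policies entirely.

```python
from collections import deque

class ScaleDownStabilizer:
    """Keep recent recommendations and act on the highest one in the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now, current_replicas, desired):
        self.history.append((now, desired))
        # Drop recommendations that have aged out of the window.
        while self.history and self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        if desired >= current_replicas:
            return desired  # scale-up window is 0 by default: act immediately
        # Scale-down: use the highest recent recommendation, never above current.
        return min(current_replicas, max(r for _, r in self.history))

stabilizer = ScaleDownStabilizer(window_seconds=300)
stabilizer.recommend(0, 10, 10)  # load still needs 10 replicas at t=0

# Load drops; every 15-second sync now recommends 4 replicas.
for t in range(15, 316, 15):
    result = stabilizer.recommend(t, 10, 4)
    if t in (15, 300, 315):
        print(t, result)
# 15 10
# 300 10
# 315 4
```

In this run the replica count stays at 10 until the last recommendation of 10 has aged out of the 300-second window, which is the flapping protection the text describes.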

Update target workload replicas

HPA updates the `replicas` field of the target workload (e.g., Deployment). The Deployment controller then creates or deletes Pods to match the desired count. When scaling up, new Pods are scheduled on nodes. If nodes lack capacity, Pods remain Pending, triggering Cluster Autoscaler (if enabled). When scaling down, Pods are gracefully terminated (SIGTERM). HPA records events and updates its status, including current replicas, desired replicas, and conditions like "AbleToScale" and "ScalingLimited".

What This Looks Like on the Job

Enterprise Scenario 1: E-commerce platform during flash sales

A large e-commerce company runs its product catalog service on GKE. During flash sales, traffic spikes 10x within seconds. They configure HPA with CPU target at 70% and also a custom metric: requests per second (RPS) per Pod from Google Cloud Monitoring, with a target of 500 RPS. The HPA uses both metrics, and the largest desired replica count wins. During the spike, CPU rises to 90% and RPS to 2000, so HPA scales up quickly. The stabilization window for scale-down is set to 10 minutes to avoid thrashing after the spike ends. They also set maxReplicas to 50 to control cost. Without the custom metric, the HPA would react only to CPU, which might lag behind the RPS spike. The key lesson: use relevant metrics for the workload.
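Plugging the scenario's numbers into the formula shows why the RPS metric drives the scale-up. The sketch below assumes 10 replicas at the moment of the spike and treats the 2000 RPS figure as a per-Pod average; neither assumption is stated in the scenario.

```python
import math

def replicas_for(current_replicas, current_value, target_value):
    """desiredReplicas for a single metric (no clamping)."""
    return math.ceil(current_replicas * current_value / target_value)

current = 10        # assumed replica count when the spike hits
max_replicas = 50   # from the scenario

cpu = replicas_for(current, 90, 70)      # CPU at 90% against a 70% target
rps = replicas_for(current, 2000, 500)   # 2000 RPS per Pod against a 500 target

desired = min(max(cpu, rps), max_replicas)
print(cpu, rps, desired)  # 13 40 40
```

CPU alone would ask for 13 replicas, while the RPS metric asks for 40, so the larger value wins and still fits under maxReplicas.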

Enterprise Scenario 2: Batch processing with memory-intensive jobs

A data analytics company runs batch jobs on GKE that are memory-bound. They set HPA based on memory utilization with a target of 80%. However, they notice that HPA scales down too aggressively when memory drops after a job completes, causing new jobs to wait for Pods to spin up. They add a scale-down stabilization window of 5 minutes and a policy that limits scale-down to 1 Pod per minute. This smooths out the scaling. They also use Cluster Autoscaler to add nodes when Pods are pending due to insufficient memory. The common mistake: not setting stabilization windows, leading to flapping.

Scenario 3: Microservices with varying traffic patterns

A SaaS provider runs 20 microservices on GKE. Each service has an HPA with different metrics. The payment service uses CPU, while the notification service uses a custom metric (queue depth from Pub/Sub). They find that the notification service HPA sometimes scales late because the custom metric adapter (Stackdriver) has a 60-second delay in metric availability. The HPA sync period is a kube-controller-manager setting on the GKE-managed control plane, so they cannot tune it per HPA; instead they lower the target queue depth so that scaling starts earlier and leaves a buffer. The performance consideration: custom metrics from external sources have inherent latency, so HPA may react more slowly than it does with resource metrics. The fix: use predictive scaling or buffered targets.

What goes wrong when misconfigured

Missing resource requests: HPA cannot calculate utilization, so it never scales. The cluster may be overloaded. The exam tests this: you must set requests on containers.

Conflicting with VPA: Using both HPA and VPA on the same metric causes oscillation. VPA changes requests, which changes utilization, causing HPA to scale.

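Incorrect target type: Using "Value" instead of "AverageValue" for a metric that should be judged per Pod leads to incorrect scaling. With Value, HPA compares the raw metric to the target; with AverageValue, it first divides the metric by the number of Pods. For example, if the total across all Pods is 1000 requests and each Pod can handle 500, a target of 500 with AverageValue settles at 2 Pods, while the same target with Value keeps scaling up because the total never falls to 500 as Pods are added. If the target is meant per Pod, use AverageValue.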

Not enabling Cluster Autoscaler: HPA scales Pods, but if the node pool is full, Pods remain Pending. The application stays degraded. The exam scenario: HPA shows desired replicas > current replicas, but Pods are Pending. Solution: enable Cluster Autoscaler.
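The "Incorrect target type" item is easier to see with numbers. This Python sketch contrasts the two target types for a metric whose total does not depend on the Pod count; the function names and the starting count of 4 replicas are assumptions for illustration.

```python
import math

def desired_for_value(current_replicas, metric_total, target_value):
    """Value target: the raw metric is compared directly to the target."""
    return math.ceil(current_replicas * metric_total / target_value)

def desired_for_average_value(metric_total, target_average):
    """AverageValue target: the metric is divided across Pods first."""
    return math.ceil(metric_total / target_average)

total_requests = 1000
current = 4  # assumed current replica count

print(desired_for_average_value(total_requests, 500))   # 2
print(desired_for_value(current, total_requests, 500))  # 8
```

With AverageValue the result settles at 2 Pods regardless of the current count. With Value the result depends on the current count and doubles on every sync, because adding Pods never lowers the total.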

How ACE Actually Tests This

What the ACE exam tests on HPA (Objective 3.1)

The exam focuses on:

Understanding the relationship between HPA and Cluster Autoscaler.

Configuring HPA with CPU utilization target.

Troubleshooting why HPA is not scaling (missing requests, Metrics Server not running, etc.).

Differentiating between resource metrics and custom metrics.

Knowing the default sync period and stabilization windows.

Recognizing when HPA and VPA conflict.

Common wrong answers and why candidates choose them

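1. Wrong answer: "You can omit the CPU target because HPA picks a suitable value for the workload." Why wrong: HPA does not tune anything for you. The Kubernetes API reference documents a fixed fallback of 80% average CPU utilization when no metrics are specified, which may not suit the workload, so set the target explicitly. Candidates assume a sensible default exists because many other Kubernetes resources have defaults.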

2. Wrong answer: "HPA can scale down immediately after a scale-up." Why wrong: By default, there is a 5-minute stabilization window for scale-down. Candidates forget this and think scaling is instantaneous.

3. Wrong answer: "HPA uses the average CPU across all Pods, including unready ones." Why wrong: Unready Pods are excluded from the average. This is a subtle but tested detail.

4. Wrong answer: "HPA works with any Deployment even without resource requests." Why wrong: Without requests, the HPA controller cannot compute a utilization percentage. The HPA will report an "unknown" metric and not scale.

Specific numbers and terms that appear on the exam

Default sync period: 15 seconds (--horizontal-pod-autoscaler-sync-period)

Default scale-down stabilization: 5 minutes (behavior.scaleDown.stabilizationWindowSeconds)

Default scale-up stabilization: 0 seconds (but you can set it)

Target type: Utilization (percentage) vs AverageValue (absolute)

Formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

Conditions: AbleToScale, ScalingActive, ScalingLimited

Edge cases and exceptions

If the desired replica count is less than minReplicas, HPA sets it to minReplicas.

If the current metric value is 0, the formula yields 0 replicas; HPA raises the result to minReplicas.

For custom metrics, if the metric is not found, HPA does not scale and reports an event.

If the HPA itself is misconfigured (e.g., wrong API version), it will not be created.

How to eliminate wrong answers

If a question mentions "HPA not scaling", check if Pods have resource requests. If not, that's the problem.

If the question says "Pods are pending", think Cluster Autoscaler.

If the question involves both HPA and VPA, look for conflict.

If the question asks about default behavior, recall the 5-minute scale-down stabilization.

Key Takeaways

HPA scales Pod replicas based on observed metrics; default sync period is 15 seconds.

Formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)].

Resource requests must be defined on containers for HPA to compute utilization.

Default scale-down stabilization window is 5 minutes; scale-up has no default stabilization.

HPA can use resource metrics (CPU, memory) and custom metrics (e.g., requests per second).

HPA and VPA should not be used together on the same metric to avoid conflict.

HPA works with Deployments, StatefulSets, and ReplicaSets; not with DaemonSets.

If HPA cannot fetch metrics, it stops scaling and reports 'ScalingActive' condition as false.

Cluster Autoscaler complements HPA by adding nodes when Pods are pending due to resource constraints.

The autoscaling/v2 API supports advanced behavior with stabilization windows and scaling policies.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Horizontal Pod Autoscaler (HPA)

Scales the number of Pod replicas horizontally (in/out).

Works with CPU, memory, and custom metrics.

Requires resource requests on containers to compute utilization.

Can be used with Cluster Autoscaler to add nodes.

Cannot adjust resource requests/limits of existing Pods.

Vertical Pod Autoscaler (VPA)

Scales the resource requests/limits of Pods vertically (up/down).

Works with CPU and memory only.

Does not require pre-set requests (it can recommend them).

Cannot directly trigger node scaling; it may cause Pod rescheduling.

Cannot be used with HPA on the same metric (CPU or memory).

Watch Out for These

Mistake

HPA uses the total CPU usage across all Pods to determine scaling.

Correct

HPA uses the average CPU utilization per Pod (percentage of request). For example, if you have 5 Pods each requesting 100m CPU, and total CPU usage is 500m, average utilization is 100% (500m/500m). HPA compares this average to the target utilization.

Mistake

HPA can scale based on memory utilization without any additional configuration.

Correct

HPA supports memory as a resource metric, but you must specify the target in the metrics block. Memory utilization is calculated similarly to CPU. However, memory is often less reliable for scaling because garbage collection can cause sudden drops. The exam may test that memory scaling works but requires explicit configuration.

Mistake

HPA scales immediately when the metric exceeds the target.

Correct

HPA has a sync period (15 seconds) and stabilization windows. For scale-down, there is a default 5-minute window. For scale-up, there is no stabilization by default, but the sync period introduces a delay of up to 15 seconds. Additionally, scaling policies can limit the rate of change.

Mistake

You can use both HPA and VPA on the same Deployment without issues.

Correct

Using both on the same metric (e.g., CPU) causes a conflict. VPA adjusts resource requests, which changes utilization, causing HPA to scale the number of Pods. This can lead to oscillation. The exam tests that you should not use both on the same metric.

Mistake

HPA works with any metric exposed by the application.

Correct

HPA can only use metrics exposed through the Kubernetes metrics API (resource metrics) or custom metrics API. For application-specific metrics, you need a custom metrics adapter (e.g., Prometheus adapter or Stackdriver adapter). The metric must be available in the cluster's metrics pipeline.

Frequently Asked Questions

Why is my HPA not scaling even though CPU usage is high?

First, check if your Pods have resource requests defined. HPA uses utilization percentage, which requires requests. If requests are missing, the HPA controller cannot compute utilization, and HPA will show an 'unknown' metric. Use `kubectl describe hpa` to see the current metric status. Also, ensure the Metrics Server is running: `kubectl get pods -n kube-system | grep metrics-server`. If it's not, install it. Another common cause: the target workload might not be a Deployment, StatefulSet, or ReplicaSet. HPA does not work with DaemonSets or standalone Pods.

Can HPA scale to zero replicas?

Not by default: HPA requires minReplicas to be at least 1 unless the alpha HPAScaleToZero feature gate is enabled on the cluster. However, you can use KEDA (Kubernetes Event-driven Autoscaling), which supports scaling to zero. For the ACE exam, remember that HPA minReplicas is >= 1.

What is the difference between 'Utilization' and 'AverageValue' target types?

Utilization is a percentage of the resource request. For example, a target with type: Utilization and averageUtilization: 80 means Pods should average 80% of their CPU request. AverageValue is an absolute value, e.g., type: AverageValue with averageValue: 100m means each Pod should average 100 millicores. (The older autoscaling/v2beta1 API spelled these fields targetAverageUtilization and targetAverageValue.) Utilization is more common for CPU/memory; AverageValue is used for custom metrics like requests per second.

How does HPA interact with Cluster Autoscaler?

HPA scales the number of Pods. If the node pool does not have enough capacity, new Pods remain Pending. Cluster Autoscaler detects pending Pods and adds nodes to the node pool. After nodes are added, Pods are scheduled. On scale-down, HPA reduces replicas, and if nodes become underutilized, Cluster Autoscaler removes them. They work together but are independent. The exam may ask: 'What happens if HPA scales up but there are not enough nodes?' Answer: Pods stay Pending until Cluster Autoscaler adds nodes.

What are the default stabilization windows for HPA?

For scale-down, the default stabilization window is 5 minutes. For scale-up, there is no default stabilization (0 seconds). You can configure these using the behavior field in the HPA spec. The stabilization window prevents flapping: on scale-down, HPA acts on the highest replica recommendation from the window, so the count drops only after the lower recommendation has held for the entire window.

Can I use custom metrics from Cloud Monitoring with HPA?

Yes, you can use custom metrics from Google Cloud Monitoring (formerly Stackdriver) by deploying the custom-metrics-stackdriver-adapter. This adapter exposes Cloud Monitoring metrics via the custom.metrics.k8s.io API. You then reference the metric in the HPA spec. For example, you can use 'requests_per_second' from Cloud Monitoring. Note that there is a latency of up to 60 seconds for metric availability.

What happens if the desired replica count exceeds maxReplicas?

HPA caps the desired replicas at maxReplicas and sets the condition 'ScalingLimited' to true. The HPA will continue to try to scale up, but it cannot exceed the max. You should increase maxReplicas or optimize the application to handle load with fewer Pods.

Terms Worth Knowing

Azure Kubernetes Service

GKE: Google Kubernetes Engine
