AZ-104 · Chapter 130 of 168 · Objective 4.2

Application Gateway v2 Autoscaling

Application Gateway v2 autoscaling lets a web frontend in Azure add and remove gateway capacity as traffic changes. On the AZ-104 exam it falls under the Networking domain (Objective 4.2: Configure and manage Azure Application Gateway) and typically accounts for 5-10% of networking questions. Understanding autoscaling is essential for designing resilient, cost-effective web frontends. This chapter covers what it is, how it works internally, configuration details, interaction with other services, and exam traps.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

Highway Tollbooth with Variable Lanes

A highway toll plaza is a traffic bottleneck that must handle both quiet nights and holiday rush hours. Instead of staffing a fixed number of tollbooths at all times, the plaza uses a smart system that monitors the queue length at each booth and the average wait time. When wait times exceed 30 seconds and queue length surpasses 10 cars, the system automatically opens additional booths, up to a maximum of 100. Each booth can process up to 100 cars per minute (CPM), and at least 2 booths stay open even at 3 AM. The tollbooths are virtual: they share a common payment processing backend, so adding a booth doesn't require a new payment terminal. The system scales out by adding booths and scales in by closing them, but never below the minimum. If traffic spikes suddenly, opening new booths takes a few minutes, during which queues may grow. The plaza also supports multiple lane types (e.g., E-ZPass vs cash) and routes each car to the appropriate booth.

Application Gateway v2 autoscaling works the same way: it monitors metrics like throughput and connection count, scales out by adding instances (each with a fixed capacity) up to a maximum, scales in when load drops, and never goes below the minimum instance count. The instances share a backend pool and configuration. Scaling is not instantaneous; new capacity takes a few minutes to arrive. The gateway can handle multiple listeners and routing rules, like tollbooths for different payment types.

How It Actually Works

What is Application Gateway v2 Autoscaling?

Application Gateway v2 is a regional Layer 7 load balancer. The v2 SKU introduces autoscaling, which automatically adjusts the number of instances based on traffic. This eliminates the need to manually size the gateway for peak load, reducing cost during low traffic and ensuring capacity during spikes. Autoscaling is available only in the v2 SKU (Standard_v2 and WAF_v2). The v1 SKU requires manual instance count selection.

Why Autoscaling Matters for the Exam

The AZ-104 exam tests your ability to configure and manage an Application Gateway. You must understand:

The difference between v1 and v2 SKUs regarding scaling.

How to enable autoscaling (it's on by default for v2 but you set min and max instances).

The metrics that trigger scaling (throughput, connection count, compute units).

The relationship between autoscaling and instance counts.

The impact on backend health and performance.

How Autoscaling Works Internally

Application Gateway v2 autoscaling is measured in Capacity Units (CU). Each gateway instance provides approximately 10 Capacity Units. One Capacity Unit covers up to 2,500 persistent connections, 2.22 Mbps of throughput, or 1 compute unit (processing for TLS termination, routing, WAF rule evaluation, etc.). Consumption is driven by whichever of the three dimensions is highest, not by their sum.
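The Capacity Unit arithmetic can be sketched in a few lines. This is a minimal illustration, not an Azure API: the function name `capacity_units` and its parameters are made up for this example, the per-unit limits are the values Microsoft publishes for the v2 SKU (2.22 Mbps, 2,500 persistent connections, 1 compute unit), and rounding up to whole units is an assumption made for sizing purposes.

```python
import math

# Per-Capacity-Unit limits published for the v2 SKU.
MBPS_PER_CU = 2.22
CONNECTIONS_PER_CU = 2500
COMPUTE_UNITS_PER_CU = 1


def capacity_units(throughput_mbps, persistent_connections, compute_units):
    """Estimate consumed CUs: the highest of the three dimensions, rounded up.

    Hypothetical helper for illustration only.
    """
    by_throughput = throughput_mbps / MBPS_PER_CU
    by_connections = persistent_connections / CONNECTIONS_PER_CU
    by_compute = compute_units / COMPUTE_UNITS_PER_CU
    return math.ceil(max(by_throughput, by_connections, by_compute))


# 100 Mbps is about 45.05 CU, 50,000 connections is 20 CU, 10 compute units is 10 CU.
# Throughput is the highest dimension, so it sets the result.
print(capacity_units(100, 50_000, 10))  # 46
```

Note that the three dimensions are compared, not added: a gateway with many idle connections and little throughput is sized by its connection count alone.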

Autoscaling monitors the aggregate load across all instances. Microsoft does not publish the exact trigger thresholds; this chapter models scale-out when average CU utilization exceeds 80% of current capacity and scale-in when it drops below 20%. Scaling is not instantaneous; new instances take several minutes to become ready (Microsoft's guidance cites roughly three to five). The model used here also assumes a 5-minute cooldown between scale events to avoid flapping.

Key Components and Defaults

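Minimum Instance Count: You set this (default is 0; 2 or more is recommended for production). A minimum of 0 reserves no capacity, so every burst has to wait for scale-out. Microsoft states that the gateway stays available at min 0, but requests can be slow or fail while capacity catches up. A minimum of 1 is allowed.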

Maximum Instance Count: You set this (maximum 125 for v2). The gateway will not exceed this.

Capacity Units per Instance: Approximately 10 CU per instance. This is not configurable.

Scaling Metrics: The gateway uses throughput (Mbps), connection count, and compute utilization. The exact algorithm is not documented; the 20-80% utilization band used in this chapter is a working model, not a published specification.

Scaling Cooldown: In this chapter's model, the gateway waits 5 minutes after a scale event before initiating another scale action, to prevent oscillation.

Configuration and Verification

You enable autoscaling when creating an Application Gateway v2. In the Azure portal, under "Instance count", you select "Autoscale" and set min and max instances. Alternatively, via PowerShell:

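# Load the existing gateway (it must already use the Standard_v2 or WAF_v2 SKU;
# a v1 gateway cannot be converted in place by changing the SKU name)
$gw = Get-AzApplicationGateway -Name "myAppGateway" -ResourceGroupName "myRG"
# Set the autoscale bounds on the in-memory object
$gw = Set-AzApplicationGatewayAutoscaleConfiguration -ApplicationGateway $gw -MinCapacity 2 -MaxCapacity 10
# Clear any fixed instance count, which cannot be combined with autoscale bounds
$gw.Sku.Capacity = $null
# Push the change to Azure
Set-AzApplicationGateway -ApplicationGateway $gw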

Via Azure CLI:

az network application-gateway update \
  --name myAppGateway \
  --resource-group myRG \
  --min-capacity 2 \
  --max-capacity 10

To verify the gateway's state and autoscale bounds (live capacity is reported through Azure Monitor metrics such as Current Capacity Units, not by this command):

az network application-gateway show \
  --name myAppGateway \
  --resource-group myRG \
  --query "{state:operationalState, min:autoscaleConfiguration.minCapacity, max:autoscaleConfiguration.maxCapacity}"

Interaction with Related Technologies

Backend Pools: Autoscaling does not affect backend pool configuration. Backend servers must be able to handle the load when the gateway scales out. Health probes continue to work independently.

WAF_v2: Autoscaling works identically for WAF_v2, but WAF rules consume compute capacity. The same CU limits apply.

Azure Monitor: You can view autoscaling metrics like "Capacity Units", "Current Capacity Units", "Throughput", "Unhealthy Host Count". These are essential for troubleshooting.

Azure Front Door: Front Door is a global Layer 7 load balancer with autoscaling built-in, but it's different from Application Gateway (global vs regional). The exam may compare them.

Load Balancer: Azure Load Balancer (Layer 4) is a fully managed service with no instance count to configure, so it has nothing equivalent to min and max instances. Application Gateway autoscaling applies to inbound Layer 7 traffic.

Common Misconfigurations

Setting max instances too low causing throttling during spikes.

Setting min instances to 0 causing cold start delays.

Not enabling autoscaling at all (using fixed instance count) leading to overprovisioning or underprovisioning.

Confusing autoscaling with the gateway's ability to scale out/in based on backend health (it does not; backend health only affects routing).

Exam-Relevant Details

The v2 SKU is required for autoscaling. v1 does not support autoscaling.

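A v2 gateway runs in either autoscale mode or manual mode. In manual mode you set a fixed instance count (1 to 125) and the gateway does not scale; in autoscale mode, setting min and max to the same value has a similar effect.

Autoscaling carries no separate charge; v2 billing is a fixed hourly price plus the capacity units consumed.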

During scale-out, new instances are added to the existing gateway; there is no downtime.

The gateway scales in when load decreases, but it will not scale below the min instance count.

If you set min capacity to 0, no capacity is reserved. Microsoft states the gateway remains available, but a sudden burst must wait for scale-out, so requests can be slow or fail until capacity catches up.

The maximum number of instances is 125 for v2 (in most regions).

The autoscaling decision is based on aggregate metrics across all instances, not per-instance.

Step-by-Step Autoscaling Mechanism

Traffic Arrives: Client requests hit the gateway's frontend IP.

Load Measurement: The gateway measures throughput, active connections, and compute usage every few seconds.

CU Calculation: These metrics are converted to Capacity Units (1 CU = 2.22 Mbps of throughput, 2,500 persistent connections, or 1 compute unit, whichever is highest).

Comparison: The current total CU load is compared to the capacity of current instances (each instance is about 10 CU).

Scale Decision: If load stays above the scale-out threshold (in this chapter's model, 80% of current capacity for about a minute), a scale-out is triggered. If load stays below the scale-in threshold (20% in this model), scale-in occurs. Microsoft does not publish the exact values.

Provisioning: New instances are provisioned, which takes several minutes (Microsoft's guidance cites roughly three to five). During this time, the gateway continues to operate with existing instances and may be overloaded.

Traffic Redistribution: Once new instances are ready, traffic is load-balanced across all instances.

Cooldown: After a scale event, a cooldown (5 minutes in this chapter's model) prevents further scaling actions.
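The steps above can be sketched as a small decision function. This is only an illustration of the chapter's model: Microsoft does not publish the real algorithm, and the 80% and 20% thresholds, the 5-minute cooldown, and every name below (`AutoscalerState`, `decide`) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

CU_PER_INSTANCE = 10  # approximate capacity of one v2 instance


@dataclass
class AutoscalerState:
    instances: int
    last_scale_time: Optional[float] = None  # seconds; None means "never scaled"


def decide(state, load_cu, now, min_instances, max_instances,
           scale_out_at=0.80, scale_in_at=0.20, cooldown_s=300):
    """Return 'scale_out', 'scale_in', or 'hold' for the current load.

    Hypothetical model: thresholds and cooldown are illustrative defaults.
    """
    # Cooldown: ignore load changes shortly after the previous scale event.
    if state.last_scale_time is not None and now - state.last_scale_time < cooldown_s:
        return "hold"

    capacity = state.instances * CU_PER_INSTANCE
    if capacity == 0:
        utilization = float("inf") if load_cu > 0 else 0.0
    else:
        utilization = load_cu / capacity

    # Aggregate utilization is compared against the thresholds,
    # and the min/max bounds always win.
    if utilization > scale_out_at and state.instances < max_instances:
        return "scale_out"
    if utilization < scale_in_at and state.instances > min_instances:
        return "scale_in"
    return "hold"


print(decide(AutoscalerState(instances=10), load_cu=90, now=0,
             min_instances=2, max_instances=20))  # scale_out
```

The decision never looks at backend health, which mirrors the point made elsewhere in this chapter: only gateway-side load drives scaling.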

Walk-Through

Monitor Capacity Unit Metrics

Application Gateway continuously monitors the load on each instance. It measures throughput in Mbps, persistent connections, and compute utilization (e.g., TLS decryption), and converts them into Capacity Units (CU). For example, if the gateway is processing 266.4 Mbps of throughput and 50,000 connections, throughput works out to 120 CU (266.4 Mbps / 2.22 Mbps per CU) and connections to 20 CU (50,000 / 2,500 per CU). The dimensions are not added together; the highest one sets the load, so the gateway is consuming 120 CU. With 10 instances (each about 10 CU, total 100 CU), the gateway is over capacity and will trigger scale-out.

Scale-Out Decision Based on Thresholds

When the average CU utilization across all instances stays above the scale-out threshold (80% in this chapter's model, typically for about a minute), the autoscaler adds instances. For example, if current load is 120 CU and each instance provides 10 CU, the gateway needs at least 12 instances just to carry the load (120 / 10 = 12) and 15 to bring utilization down to 80% (120 / 8 = 15). Starting from 10 instances, that means adding at least 2. Scaling is handled by the platform; you do not issue the request yourself.
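The instance arithmetic in this step can be checked with a short sketch. The helper name `required_instances` and the `target_utilization` parameter are hypothetical; 10 CU per instance is the approximate figure used throughout this chapter, and the result is clamped to the configured min and max as the gateway would be.

```python
import math

CU_PER_INSTANCE = 10  # approximate capacity of one v2 instance


def required_instances(load_cu, min_instances, max_instances, target_utilization=1.0):
    """Instances needed to carry load_cu at or below target_utilization.

    Hypothetical helper; the result is clamped to the configured bounds.
    """
    needed = math.ceil(load_cu / (CU_PER_INSTANCE * target_utilization))
    return max(min_instances, min(needed, max_instances))


print(required_instances(120, 2, 50))       # 12: just enough to carry 120 CU
print(required_instances(120, 2, 50, 0.8))  # 15: keeps utilization at 80%
print(required_instances(120, 2, 10))       # 10: capped by the max instance count
```

The third call shows why a low maximum is risky: the gateway stops at 10 instances even though the load calls for 12.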

Provision New Instances

Azure provisions new gateway instances. This involves allocating compute resources, configuring the gateway software, and attaching to the subnet. This process takes several minutes (Microsoft's guidance cites roughly three to five). During this time, the existing instances continue to handle traffic, which may degrade performance if overloaded. The new instances are not immediately available for traffic.

Add New Instances to Load Balancer

Once the new instances are ready, they are added to the internal load balancer of the Application Gateway. The gateway's frontend IP and routing rules are automatically applied. Traffic distribution begins to include the new instances. This is seamless; no client reconfiguration is needed.

Scale-In Decision and Cooldown

When load decreases, the autoscaler monitors if average CU utilization drops below 20% for a sustained period (e.g., 5 minutes). If so, it removes instances one at a time to avoid sudden capacity drops. After any scale event, a cooldown period of 5 minutes prevents further scaling actions. The gateway will not scale below the configured minimum instance count.

What This Looks Like on the Job

Enterprise Scenario 1: E-Commerce Platform with Variable Traffic

A large e-commerce company runs an online store that experiences massive traffic spikes during Black Friday and Cyber Monday, but has low traffic at other times. They deploy an Application Gateway v2 with autoscaling to handle this. They set min instances to 2 (to handle baseline traffic and ensure low latency) and max instances to 50 (to handle peak loads). During normal days, the gateway runs with 2 instances. On Black Friday, traffic surges, and the gateway automatically scales out to 50 instances within minutes. The backend pool consists of a virtual machine scale set (VMSS) that also autoscales. The combination allows the site to handle millions of requests without manual intervention. Misconfiguration example: if they set max instances too low (e.g., 5), the gateway would throttle traffic, causing 503 errors and lost sales.

Enterprise Scenario 2: Multi-Region Web Application with WAF

A financial services company hosts a web application across multiple Azure regions for disaster recovery. Each region has an Application Gateway (WAF_v2) with autoscaling to protect against DDoS and SQL injection attacks. They set min instances to 2 for high availability and max to 20. Azure Traffic Manager sits in front of the gateways and routes users to the nearest region. During a DDoS attack, the gateway's WAF consumes additional compute, causing CU utilization to spike. Autoscaling adds instances to handle the extra processing. This keeps the application online. Common failure: if the subnet is too small (e.g., a /28 with 11 usable IPs), the gateway cannot scale beyond the number of free addresses, because each instance needs its own private IP. They learned to use a /24 subnet, whose 251 usable addresses comfortably cover the 125-instance maximum.

Scenario 3: API Gateway for Microservices

A SaaS provider uses Application Gateway v2 as an API gateway for microservices. They have multiple listeners and path-based routing rules. Autoscaling ensures that as API calls increase, the gateway scales out. They first set min to 0, but requests stalled while capacity was provisioned after idle periods, so they raised it to 2. The backend is a Kubernetes cluster (AKS). The gateway's autoscaling works independently of the cluster's autoscaler. They monitor CU metrics and set alerts when utilization exceeds 70% to proactively check backend health. A common concern: when the gateway scales in, connections on instances that are being removed may be dropped, but since connections are load-balanced, the transition is usually graceful. They also enable the gateway's connection draining feature so in-flight requests can complete when backend servers are removed.

How AZ-104 Actually Tests This

What the AZ-104 Exam Tests on Autoscaling (Objective 4.2)

The exam expects you to:

Differentiate between v1 and v2 SKUs: autoscaling is only for v2.

Configure autoscaling settings (min/max instances) via portal, CLI, or PowerShell.

Understand that autoscaling is based on Capacity Units (CU) and that each instance provides 10 CU.

Know that a v2 gateway autoscales by default; you can fix the instance count by setting min and max to the same value or by using manual mode.

Recognize that the gateway scales based on aggregate metrics, not per-instance.

Understand that scaling is not immediate (new instances take several minutes to become ready).

Common Wrong Answers and Why

"A v2 gateway can only run in autoscale mode." This is false. Autoscaling is the default, and you control min/max instances, but v2 also offers a manual mode with a fixed instance count.

"Set min instances to 0 for best cost savings." While true for cost, it causes cold start delays; exam expects you to know that min 2 is recommended for production.

"Autoscaling scales based on backend pool health." False; backend health only affects routing, not scaling.

"You can set max instances to unlimited." False; maximum is 125 for v2.

Specific Numbers and Terms

Min instances: 0-100 (default 0)

Max instances: 2-125

Capacity Units per instance: 10

1 CU = 2.22 Mbps throughput, or 2,500 persistent connections, or 1 compute unit

Scaling lag: several minutes (roughly three to five per Microsoft's guidance)

Cooldown: 5 minutes (this chapter's model; not a published value)

v2 SKU names: Standard_v2, WAF_v2

Edge Cases and Exceptions

If you set min and max to the same value (e.g., min=5, max=5), the gateway behaves like a fixed-size v2 gateway with 5 instances. Autoscaling is still enabled but has no room to scale.

With min=0, no capacity is reserved while the gateway is idle. Microsoft states the gateway remains available, but traffic arriving after an idle period must wait for scale-out, so requests can be slow or fail in the meantime. This is not recommended for production.

Autoscaling does not consider backend capacity. If backend is overwhelmed, the gateway will still scale out, worsening the problem. You must scale backend separately.

The gateway cannot scale beyond the subnet size. Ensure the subnet has enough free IP addresses for max instances.
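The subnet constraint is easy to check before deployment. The sketch below assumes Azure's standard behaviour of reserving 5 addresses in every subnet and one private IP per gateway instance, plus one more if a private frontend IP is configured; the function names and the example address ranges are made up for illustration.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # addresses Azure reserves in every subnet


def usable_addresses(cidr):
    """Addresses left for resources after Azure's reserved ones."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - AZURE_RESERVED_PER_SUBNET, 0)


def subnet_fits(cidr, max_instances, private_frontend_ips=0):
    """True if the subnet can hold max_instances plus any private frontend IPs."""
    return usable_addresses(cidr) >= max_instances + private_frontend_ips


print(usable_addresses("10.0.0.0/28"))     # 11
print(usable_addresses("10.0.0.0/24"))     # 251
print(subnet_fits("10.0.0.0/28", 20))      # False
print(subnet_fits("10.0.0.0/24", 125, 1))  # True
```

A /25 (123 usable addresses) falls just short of the 125-instance maximum, which is one reason a dedicated /24 is the usual choice.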

How to Eliminate Wrong Answers

If a question mentions "autoscaling" and the answer options include "v1 SKU", eliminate that option immediately. If a question asks about scaling based on backend health, that's a distractor. Look for answers that reference Capacity Units, min/max instances, or the v2 SKU. When calculating required instances, use the formula: instances = ceil(CU load / 10).

Key Takeaways

Application Gateway v2 autoscaling is enabled by default; you configure min and max instance counts, or switch to manual mode for a fixed instance count.

Each v2 instance provides approximately 10 Capacity Units (CU). 1 CU = 2.22 Mbps of throughput, 2,500 persistent connections, or 1 compute unit, whichever is highest.

Autoscaling decisions are based on aggregate CU utilization. Microsoft does not publish the thresholds; this chapter models scale-out above 80% and scale-in below 20%.

Scaling is not instantaneous; expect several minutes (Microsoft's guidance cites roughly three to five) for new instances to become ready.

A cooldown after each scale event (5 minutes in this chapter's model) prevents flapping.

Minimum instance count can be 0, but then no capacity is reserved and bursts must wait for scale-out; a minimum of 2 is recommended for production.

Maximum instance count is 125 for v2.

Autoscaling does not consider backend pool health; backend scaling must be managed separately.

The subnet must have enough free IP addresses to accommodate max instances.

You can fix the instance count by setting min and max to the same value (autoscaling stays enabled) or by using manual mode.

Autoscaling carries no separate charge; you pay a fixed hourly price plus the Capacity Units consumed.

For the exam, remember that v1 does not support autoscaling — only v2.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Application Gateway v2 Autoscaling

Supports autoscaling (min/max instances).

Uses Capacity Units (CU) for scaling decisions.

Pay per CU consumed; no need to overprovision.

No manual scaling needed; handles traffic spikes automatically.

Requires v2 SKU (Standard_v2 or WAF_v2).

Application Gateway v1 Fixed Scaling

Fixed instance count; manual scaling required.

No CU concept; you pay per instance hour.

Must provision for peak load, leading to higher cost during low traffic.

Requires manual intervention to scale up/down.

Uses the v1 SKU (Standard or WAF).

Watch Out for These

Mistake

A v2 gateway always autoscales and cannot run at a fixed size.

Correct

Autoscaling is the default for v2 SKUs, but it is not the only option. In autoscale mode you control the minimum and maximum instance counts, and setting them to the same value effectively fixes the instance count. In manual mode you set a fixed instance count and the gateway does not scale at all.

Mistake

Autoscaling is based on backend pool health.

Correct

Autoscaling uses gateway metrics like throughput, connection count, and compute utilization (Capacity Units). Backend health affects routing (e.g., unhealthy backends are removed from rotation) but does not trigger scaling.

Mistake

Setting min instances to 0 is always fine for cost savings.

Correct

While min=0 avoids paying for reserved capacity, it means every burst of traffic has to wait for scale-out, which takes several minutes. During that time, requests can be slow or fail. For production, set min to at least 2.

Mistake

Autoscaling is instantaneous.

Correct

Scaling takes several minutes to provision new instances (Microsoft's guidance cites roughly three to five). This chapter's model also assumes a 5-minute cooldown after each scale event to prevent flapping.

Mistake

Each instance can handle unlimited Capacity Units.

Correct

Each v2 instance has a fixed capacity of 10 Capacity Units. If load exceeds this, the gateway scales out (if within max instances) or throttles.

Frequently Asked Questions

How do I enable autoscaling on an existing Application Gateway v1?

You cannot enable autoscaling on a v1 gateway. You must create a new v2 gateway and migrate your configuration. Autoscaling is a feature of the v2 SKU only. The exam tests this distinction: v1 always requires a manual instance count, while v2 autoscales by default.

What happens if I set min and max instances to the same value?

Setting min and max to the same value (e.g., min=5, max=5) effectively fixes the instance count at that number. Autoscaling is still enabled but will not add or remove instances because there is no range. This is useful if you want a fixed-size v2 gateway; switching the gateway to manual mode achieves the same result with autoscaling actually turned off.

Can Application Gateway v2 autoscale based on backend latency?

No. Autoscaling uses gateway-side metrics: throughput, active connections, and compute utilization (Capacity Units). Backend latency does not directly trigger scaling. However, if backend latency causes connections to back up, that may increase active connections and indirectly trigger scale-out.

How long does it take for a new instance to start handling traffic?

It typically takes several minutes (Microsoft's guidance cites roughly three to five) for a new instance to be provisioned and start serving traffic. During that time, existing instances handle all traffic. If the gateway is already overloaded, performance may degrade. The exam may ask about this lag.

What is the maximum number of instances for Application Gateway v2?

The maximum is 125 instances per gateway in most Azure regions. This is a hard limit. Ensure your subnet has at least 125 free IP addresses if you plan to use the max. The exam may test this number.

How is autoscaling billed?

v2 billing has two parts: a fixed hourly price for the gateway and a charge per Capacity Unit consumed per hour. There is no additional charge for autoscaling itself. Each instance provides about 10 CU, but you pay for the CU actually used rather than the full instance capacity, with one floor: the capacity reserved by your minimum instance count is billed even when idle. The exam may not ask billing details, but it's good to know.
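The billing floor described above can be illustrated with a rough sketch. The function names are hypothetical and the rates in the usage lines are placeholder numbers chosen to keep the arithmetic simple, not real Azure prices; check the pricing page for your region before estimating costs.

```python
CU_PER_INSTANCE = 10  # approximate capacity units reserved per instance


def billed_capacity_units(consumed_cu, min_instances):
    """CUs charged for the hour: actual usage, but never less than reserved capacity."""
    reserved_cu = min_instances * CU_PER_INSTANCE
    return max(consumed_cu, reserved_cu)


def hourly_cost(consumed_cu, min_instances, fixed_rate, cu_rate):
    """Fixed hourly price plus the per-CU charge (rates supplied by the caller)."""
    return fixed_rate + billed_capacity_units(consumed_cu, min_instances) * cu_rate


# Placeholder rates for illustration only.
print(billed_capacity_units(5, 2))   # 20: a quiet hour still pays for 2 reserved instances
print(billed_capacity_units(35, 2))  # 35: usage above the reserved floor is billed as used
print(hourly_cost(35, 2, fixed_rate=1.0, cu_rate=0.5))  # 18.5
```

This is why raising the minimum instance count trades cost for readiness: the reserved capacity is paid for whether or not traffic arrives.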

Does autoscaling work with WAF_v2?

Yes, WAF_v2 supports autoscaling identically to Standard_v2. The WAF rules consume compute capacity, so the CU usage may be higher for the same traffic. The same min/max instance configuration applies.
