AZ-204 | Chapter 39 of 102 | Objective 1.3

App Service Auto-Scaling

This chapter covers Azure App Service auto-scaling, a critical feature for dynamically adjusting compute resources based on demand. For the AZ-204 exam, auto-scaling questions appear in approximately 10-15% of Compute domain questions, focusing on configuration, rules, and best practices. You will learn the internal mechanics, default values, and common pitfalls to ensure you can design cost-effective, responsive applications. Mastery of auto-scaling is essential for the exam and for real-world Azure development.

25 min read
Intermediate
Updated May 31, 2026

Auto-Scaling as a Hotel's Dynamic Staffing

Consider a hotel that adjusts its front-desk staff based on guest demand. The hotel has a rule: if the average check-in queue exceeds 5 guests over a 10-minute period, the manager calls in additional staff from a nearby pool. Conversely, if the queue stays below 2 guests for 15 minutes, some staff are sent home. This mirrors Azure App Service auto-scaling: the 'queue' is a metric like CPU percentage or request count, the 'manager' is the autoscale engine, and the 'staff pool' is the set of instances. The hotel cannot instantly add staff; it takes 5 minutes for a new staff member to arrive (scaling out), and it avoids sending staff home too quickly to prevent rapid rehiring (cool-down period). Metrics are collected every minute, and the decision to scale is based on aggregated data over a window (e.g., 10 minutes) to avoid reacting to transient spikes. The hotel also has a maximum staff cap to control costs, just as Azure has a maximum instance count. If the hotel scales out too aggressively, it wastes money on idle staff; if it scales in too slowly, it overspends. The analogy breaks down if the hotel could instantly teleport staff, but Azure scaling takes 1-5 minutes for new instances to be ready. The key mechanistic parallel is the use of thresholds, time windows, and cool-downs to balance responsiveness and stability.

How It Actually Works

What is Auto-Scaling and Why Does It Exist?

Auto-scaling in Azure App Service automatically adjusts the number of instances running your web app, API, or mobile backend based on real-time metrics or a predefined schedule. It exists to solve the fundamental tension between cost and performance: you want enough resources to handle peak load without paying for idle capacity during low traffic. Without auto-scaling, you either over-provision (wasting money) or under-provision (causing poor performance or outages).

Azure App Service auto-scaling operates at the App Service Plan level, not the individual app level. When you scale out (increase instance count) or scale in (decrease), all apps within that plan share the new instance count. The plan's pricing tier determines the maximum number of instances allowed: Basic supports up to 3 instances (manual scaling only), Standard up to 10, and Premium, PremiumV2, and PremiumV3 up to 30 (which can be increased via support request). Free and Shared tiers cannot scale out at all, and rule-based auto-scaling requires Standard or higher.

How Auto-Scaling Works Internally

The Azure autoscale engine is a regional service that monitors metrics for a given resource (e.g., an App Service Plan). It evaluates scaling rules periodically—typically every 1 to 5 minutes, depending on the metric source. The engine uses a 'scale-out' and 'scale-in' rule, each with a metric trigger (e.g., CPU > 70% for 10 minutes) and an action (e.g., increase count by 1).

Metric Aggregation and Time Window:
- The engine collects metric samples (e.g., CPU percentage) from Azure Monitor every minute.
- It aggregates these samples over a 'time window' (e.g., last 10 minutes) using a statistic (e.g., Average, Minimum, Maximum).
- If the aggregated value crosses the threshold, the rule fires.
- The time window prevents 'flapping', that is, reacting to transient spikes. For example, if CPU spikes to 90% for one minute but averages 50% over 10 minutes, no scale-out occurs.
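The windowed evaluation described above can be sketched in a few lines. This is an illustrative model, not the engine's actual code: the function name `rule_fires` and the one-sample-per-minute list are assumptions made for the example.

```python
from statistics import mean

def rule_fires(samples, window_minutes, threshold, statistic=mean):
    """Return True if the aggregated metric over the window exceeds the threshold.

    samples: one metric value per minute, oldest first (assumed 1-minute grain).
    statistic: aggregation applied to the window, e.g. mean, min, or max.
    """
    window = samples[-window_minutes:]  # keep only the most recent samples
    return statistic(window) > threshold

# A single one-minute spike to 90% inside an otherwise quiet 10-minute window.
cpu = [45, 46, 44, 47, 90, 45, 46, 44, 47, 46]

print(rule_fires(cpu, window_minutes=10, threshold=70))                 # Average = 50
print(rule_fires(cpu, window_minutes=10, threshold=70, statistic=max))  # Maximum = 90
```

With Average the spike is absorbed and the rule stays quiet; with Maximum the same data would fire the rule, which is why the choice of statistic matters as much as the threshold.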

Scale Actions and Instance Count Boundaries:
- Each rule specifies an 'instance count' change: increase by 1, decrease by 1, increase by percentage, decrease by percentage, or set to a specific count.
- You define a 'minimum', 'maximum', and 'default' instance count. The autoscale engine never scales below the minimum or above the maximum.
- The default instance count is the fallback the engine uses when metric data is unavailable for evaluation, so choose a value that is safe for your workload.
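To make the boundary behavior concrete, here is a minimal sketch of applying one scale action and clamping the result. The action names and the round-up rule for percentages are assumptions chosen for illustration; the chapter does not state how the engine rounds.

```python
import math

def next_instance_count(current, action, value, minimum, maximum):
    """Apply one scale action, then clamp the result to [minimum, maximum].

    action: 'increase', 'decrease', 'increase_percent', 'decrease_percent', or 'set'
    (hypothetical labels for the five action types listed above).
    """
    if action == "increase":
        target = current + value
    elif action == "decrease":
        target = current - value
    elif action == "increase_percent":
        target = current + math.ceil(current * value / 100)  # rounding up is an assumption
    elif action == "decrease_percent":
        target = current - math.ceil(current * value / 100)  # rounding up is an assumption
    elif action == "set":
        target = value
    else:
        raise ValueError(f"unknown action: {action}")
    return max(minimum, min(maximum, target))

print(next_instance_count(9, "increase", 2, minimum=1, maximum=10))  # capped at the maximum
print(next_instance_count(1, "decrease", 1, minimum=1, maximum=10))  # held at the minimum
```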

Cool-Down Period:
- After a scale action, the engine enforces a 'cool-down' period before it scales again. The portal and the Azure CLI default to 5 minutes for every rule, scale-out and scale-in alike; a longer scale-in cool-down (10 minutes is a common choice) has to be set explicitly. The cool-down prevents rapid successive scaling that would be disruptive and costly.
- The cool-down is configured per rule, so a scale-out rule and a scale-in rule can use different values.
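The cool-down gate can be pictured as a simple time comparison. The sketch below is a simplified model with hypothetical names (`can_scale`, `last_action`); it only shows the idea of suppressing an action until enough time has passed.

```python
from datetime import datetime, timedelta

def can_scale(now, last_action, cooldown_minutes):
    """Return True if the cool-down has elapsed since the last scale action."""
    if last_action is None:  # nothing has scaled yet, so nothing to wait for
        return True
    return now - last_action >= timedelta(minutes=cooldown_minutes)

last = datetime(2026, 5, 31, 12, 0)
print(can_scale(datetime(2026, 5, 31, 12, 3), last, cooldown_minutes=5))  # still cooling down
print(can_scale(datetime(2026, 5, 31, 12, 5), last, cooldown_minutes=5))  # cool-down elapsed
```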

Scale-In Draining:
- When a scale-in action occurs, App Service stops routing new requests to the instances being removed and gives in-flight requests a chance to finish before the instances are shut down. This protects typical short requests, but it is not a guarantee for long-running work: a request that outlasts the shutdown grace period can still be cut off.

Key Components, Values, Defaults, and Timers

Default cool-down (portal and CLI): 5 minutes, for scale-out and scale-in rules alike

Commonly recommended scale-in cool-down: 10 minutes or longer (must be set explicitly)

Minimum time window for metric evaluation: 5 minutes

Maximum instances per tier: Basic=3 (manual scaling only), Standard=10, Premium=30 (default)

Scale-out default increment: 1 instance

Scale-in default decrement: 1 instance

Metric collection interval: 1 minute (for most Azure Monitor metrics)

Autoscale evaluation frequency: Every 1-5 minutes (not configurable)

Cooldown range: 1 minute to 1 week (configurable in ARM template)
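In an ARM template, time windows and cool-downs are written as ISO 8601 durations such as PT10M. The helper below is a sketch for producing and range-checking those strings. The 5-minute minimum window and the 1-minute to 1-week cool-down range come from the list above; the 12-hour upper bound on the time window is an assumption based on the autoscale settings schema and should be checked against current documentation.

```python
# Allowed ranges in minutes. The timeWindow upper bound (12 hours) is an assumption.
LIMITS = {"timeWindow": (5, 12 * 60), "cooldown": (1, 7 * 24 * 60)}

def iso_duration(minutes):
    """Format a positive whole number of minutes as an ISO 8601 duration."""
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    date_part = f"{days}D" if days else ""
    time_part = (f"{hours}H" if hours else "") + (f"{mins}M" if mins else "")
    return "P" + date_part + ("T" + time_part if time_part else "")

def duration_for(field, minutes):
    """Validate minutes against the range for 'timeWindow' or 'cooldown', then format."""
    low, high = LIMITS[field]
    if not low <= minutes <= high:
        raise ValueError(f"{field} must be between {low} and {high} minutes")
    return iso_duration(minutes)

print(duration_for("timeWindow", 10))   # PT10M
print(duration_for("cooldown", 10080))  # P7D
```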

Common Metrics Used:
- CPU Percentage
- Memory Percentage
- HTTP Queue Length (number of requests waiting in the IIS queue)
- Data In/Out (bytes)
- Custom metrics via Application Insights

Configuration and Verification Commands

You can configure auto-scaling via the Azure portal, Azure CLI, PowerShell, or ARM templates. Below are key CLI commands:

Create autoscale settings (CLI):

az monitor autoscale create \
  --resource-group myResourceGroup \
  --resource myAppServicePlan \
  --resource-type Microsoft.Web/serverfarms \
  --name myAutoscaleSettings \
  --min-count 1 \
  --max-count 10 \
  --count 1

Add a scale-out rule:

az monitor autoscale rule create \
  --resource-group myResourceGroup \
  --autoscale-name myAutoscaleSettings \
  --scale out 1 \
  --condition "CpuPercentage > 70 avg 10m"

Add a scale-in rule:

az monitor autoscale rule create \
  --resource-group myResourceGroup \
  --autoscale-name myAutoscaleSettings \
  --scale in 1 \
  --condition "CpuPercentage < 30 avg 10m"

View autoscale history (scale actions are recorded in the activity log):

az monitor activity-log list \
  --resource-group myResourceGroup \
  --offset 24h

Interaction with Related Technologies

Azure Load Balancer: App Service uses an internal load balancer to distribute requests across instances. Auto-scaling adds/removes instances from the load balancer's backend pool.

Application Insights: You can use custom metrics from Application Insights (e.g., server response time, request rate) as triggers for auto-scaling.

Azure Monitor: All autoscale metrics are stored in Azure Monitor. You can create alerts based on autoscale events.

Traffic Manager / Front Door: These global load balancers can route traffic to App Service deployments in multiple regions, but auto-scaling operates per region, on each regional App Service Plan.

Virtual Network: If your App Service is integrated with a VNet, scaling out adds new instances that are also integrated.

Scheduled vs. Metric-Based Scaling

Scheduled (time-based) scaling: You can define profiles that set instance counts for specific times or dates. For example, scale to 10 instances from 9 AM to 5 PM on weekdays, scale to 1 instance at night. This is useful for predictable traffic patterns.

Metric-based scaling: Reacts to real-time metrics. You can combine both: use a scheduled profile to set baseline and metric rules to adjust within that baseline's min/max.
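As an illustration of how a schedule and metric rules fit together, the sketch below builds two recurring profiles shaped like the profiles section of an autoscale setting in an ARM template. A recurring profile only declares a start time; it stays active until the next profile starts. The profile names and the time zone value are hypothetical, and the empty rules lists are where metric rules would go.

```python
import json

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

def recurring_profile(name, minimum, maximum, default, start_hour, time_zone):
    """Build one recurring autoscale profile as a plain dict (ARM-style layout)."""
    return {
        "name": name,
        # ARM templates express capacity values as strings.
        "capacity": {"minimum": str(minimum), "maximum": str(maximum), "default": str(default)},
        "rules": [],  # metric rules would adjust the count within minimum..maximum
        "recurrence": {
            "frequency": "Week",
            "schedule": {
                "timeZone": time_zone,
                "days": WEEKDAYS,
                "hours": [start_hour],
                "minutes": [0],
            },
        },
    }

# Hypothetical schedule: 10 instances from 9 AM, back to 1 instance from 5 PM.
profiles = [
    recurring_profile("business-hours", 10, 10, 10, start_hour=9, time_zone="Pacific Standard Time"),
    recurring_profile("off-hours", 1, 1, 1, start_hour=17, time_zone="Pacific Standard Time"),
]
print(json.dumps(profiles, indent=2))
```

Setting minimum below maximum in the business-hours profile and adding rules to it would give the combined behavior: a scheduled baseline with metric-driven adjustment on top.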

Best Practices

Always set a minimum instance count of at least 2 for production to handle failover during updates.

Use separate rules for scale-out and scale-in with different thresholds and windows to avoid thrashing.

Scale-out aggressively (e.g., CPU > 70% for 5 minutes) and scale-in conservatively (e.g., CPU < 30% for 15 minutes).

Test your scaling rules with load testing tools like Apache JMeter or Azure Load Testing.

Monitor autoscale history to identify flapping or insufficient scaling.
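One way to sanity-check a pair of thresholds before deploying them is to project what the average metric would become after a scale-in. The sketch below assumes load spreads evenly across instances, which is a simplification; the function name is hypothetical.

```python
def scale_in_would_flap(avg_metric, instances, scale_in_by, scale_out_threshold):
    """Estimate whether removing instances pushes the average past the scale-out threshold.

    Assumes total load stays constant and is spread evenly across the remaining instances.
    """
    remaining = instances - scale_in_by
    if remaining < 1:
        return True  # cannot go below one instance
    projected = avg_metric * instances / remaining
    return projected >= scale_out_threshold

# Thresholds too close: 2 instances at 40% CPU, scale-out at 60%.
print(scale_in_would_flap(40, instances=2, scale_in_by=1, scale_out_threshold=60))  # projected 80%
# Wider gap: 3 instances at 25% CPU, scale-out at 70%.
print(scale_in_would_flap(25, instances=3, scale_in_by=1, scale_out_threshold=70))  # projected 37.5%
```

The first case shows why a narrow gap between thresholds causes thrashing: the scale-in itself creates the condition for the next scale-out.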

Limitations

Scale-in rules based on memory percentage are unreliable: memory is not always released promptly after load drops, so the metric can stay high and block scale-in.

The maximum instance count is capped by your plan tier. To exceed the default max, you must open a support request.

Scaling is not instantaneous; new instances take 1-5 minutes to be ready. For very sudden spikes, consider using Azure Functions (serverless) or pre-warming instances.

Autoscale cannot scale to zero instances because the App Service Plan must always have at least one running instance. (The Free and Shared tiers do not support autoscale at all.)

Walk-Through

1. Define Autoscale Profile

First, you create an autoscale profile that sets the context for scaling. The profile includes a name, the resource to scale (App Service Plan), and the minimum, maximum, and default instance counts. For example, you might set min=1, max=10, default=1. This profile can be scheduled to activate only during certain hours or days. The autoscale engine uses this profile to determine the boundaries within which scaling rules operate. Every autoscale setting needs at least one profile, because the profile is what supplies these boundaries.

2. Configure Scale-Out Rule

Add a rule that triggers when a metric exceeds a threshold. For example, 'When CPU Percentage > 70, increase instance count by 1'. The rule includes a metric source, statistic (e.g., Average), time window (e.g., 10 minutes), and threshold. The engine evaluates this rule every 1-5 minutes. If the condition is met, the engine schedules a scale-out action. The cool-down period (default 5 minutes) prevents another scale-out from the same rule immediately.
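For readers who work with ARM templates, the rule described above maps onto a metric trigger plus a scale action. The sketch below builds that structure as a Python dict and prints it as JSON. The resource ID is a placeholder with a made-up subscription segment, and the helper name `cpu_rule` is hypothetical.

```python
import json

# Placeholder resource ID; substitute your own subscription, group, and plan names.
PLAN_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup"
    "/providers/Microsoft.Web/serverfarms/myAppServicePlan"
)

def cpu_rule(direction, operator, threshold, window, cooldown, change=1):
    """Build one autoscale rule on the plan's CPU metric (ARM-style layout)."""
    return {
        "metricTrigger": {
            "metricName": "CpuPercentage",
            "metricResourceUri": PLAN_ID,
            "timeGrain": "PT1M",          # one sample per minute
            "statistic": "Average",       # how samples are combined across instances
            "timeWindow": window,         # look-back period, ISO 8601
            "timeAggregation": "Average", # how the window is aggregated
            "operator": operator,
            "threshold": threshold,
        },
        "scaleAction": {
            "direction": direction,
            "type": "ChangeCount",
            "value": str(change),
            "cooldown": cooldown,
        },
    }

scale_out = cpu_rule("Increase", "GreaterThan", 70, window="PT10M", cooldown="PT5M")
print(json.dumps(scale_out, indent=2))
```

A matching scale-in rule would use `cpu_rule("Decrease", "LessThan", 30, window="PT15M", cooldown="PT10M")`, following the longer window and cool-down suggested for scale-in.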

3. Configure Scale-In Rule

Add a rule that triggers when a metric drops below a threshold. For example, 'When CPU Percentage < 30, decrease instance count by 1'. The time window for scale-in is typically longer (e.g., 15 minutes) to avoid premature scale-in. A scale-in cool-down of 10 minutes or more is typical; set it explicitly, because the default is 5 minutes. The engine uses the same evaluation cycle. If multiple scale-in rules exist, all of them must be satisfied before the engine scales in, so the most conservative rule effectively decides.

4. Engine Evaluates Metrics

The autoscale engine periodically queries Azure Monitor for the specified metrics. It aggregates the samples over the defined time window using the chosen statistic (Average, Min, Max, etc.). If the aggregated value crosses the threshold, the rule is considered 'fired'. The engine then checks if the cool-down period for that rule has elapsed. If yes, it executes the scale action. If multiple rules fire simultaneously, scale-out needs only one rule to be met and the engine applies the largest resulting increase; scale-in happens only when every scale-in rule is met, and the engine then applies the smallest resulting decrease.
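How simultaneously firing rules might be combined can be sketched as follows. This models the Azure Monitor behavior as commonly documented: scale-out if any scale-out rule is met, scale-in only if all scale-in rules are met, with scale-out checked first. The function name and the tuple layout are assumptions for the example.

```python
def decide(scale_out_rules, scale_in_rules):
    """Return the signed change in instance count for one evaluation cycle.

    Each rule is a (fired, change) tuple: fired is a bool, change a positive int.
    """
    out_changes = [change for fired, change in scale_out_rules if fired]
    if out_changes:
        return max(out_changes)  # any scale-out rule is enough; take the largest increase
    if scale_in_rules and all(fired for fired, _ in scale_in_rules):
        return -min(change for _, change in scale_in_rules)  # smallest decrease
    return 0

print(decide([(True, 1), (True, 3)], [(True, 1)]))   # scale-out wins, by 3
print(decide([(False, 1)], [(True, 1), (False, 2)])) # not all scale-in rules met
print(decide([(False, 1)], [(True, 1), (True, 2)]))  # scale in by the smaller step
```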

5. Scale Action Executes

The engine sends a request to the App Service resource provider to add or remove instances. For scale-out, new instances are provisioned and added to the load balancer backend pool. This takes 1-5 minutes. For scale-in, the engine marks instances for draining: it stops sending new requests to those instances and waits for existing requests to complete. Once all requests finish, the instances are removed. During this period, the metric may change, but the cool-down prevents immediate reversal.

6. Monitor and Adjust

After scaling, you should monitor the autoscale history and metrics to ensure the rules are effective. Use Azure Monitor alerts to notify you of autoscale events. If you observe flapping (rapid scale-out/in cycles), adjust thresholds or time windows. Predictive autoscale, which uses machine learning to forecast load, is offered for Virtual Machine Scale Sets rather than App Service Plans, so for App Service the proactive option is a scheduled profile that pre-scales ahead of known peaks. Regular review ensures cost optimization and performance.

What This Looks Like on the Job

Enterprise Scenario 1: E-Commerce Platform with Seasonal Spikes

A large e-commerce company runs its product catalog and checkout services on Azure App Service. During Black Friday, traffic can increase 10x within minutes. They use auto-scaling with a metric-based rule: CPU > 70% for 5 minutes triggers scale-out by 2 instances (to catch up faster). Scale-in is conservative: CPU < 20% for 30 minutes reduces by 1 instance. They also use a scheduled profile to pre-scale to 20 instances at 6 AM on Black Friday, reducing the cold-start delay. The maximum is set to 50 (after a support request). They monitor autoscale history and set up alerts for scale-out events. A common misconfiguration is setting the scale-in threshold too close to the scale-out threshold (e.g., 60% out, 50% in), causing flapping. They avoid this with a 50-point gap (70% out, 20% in) and a longer scale-in window.

Enterprise Scenario 2: SaaS Application with Predictable Work Hours

A SaaS provider serves business users who are active 8 AM to 6 PM local time. They use scheduled scaling: a weekday profile sets min=5, max=20, with metric rules within that range. A weekend profile sets min=1, max=5. They also use custom metrics from Application Insights—specifically 'request rate per instance'—to scale out when requests exceed 1000 per minute per instance. This is more accurate than CPU for their I/O-bound app. They set scale-out cool-down to 10 minutes to avoid overreacting to short bursts. They learned the hard way that scaling in too quickly (cool-down of 5 minutes) caused instances to be removed while still processing long-running reports, leading to request failures. They now use a 15-minute cool-down for scale-in.
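A back-of-envelope check for a per-instance request target like the one in this scenario might look like the sketch below. It is a capacity estimate, not how autoscale rules step the count; the function name and the traffic figures are made up for illustration.

```python
import math

def instances_needed(total_requests_per_minute, per_instance_target, minimum, maximum):
    """Estimate the instance count that keeps each instance at or below its target rate."""
    needed = math.ceil(total_requests_per_minute / per_instance_target)
    return max(minimum, min(maximum, needed))  # stay inside the profile's bounds

# Weekday profile from the scenario: min=5, max=20, target 1000 requests/min per instance.
print(instances_needed(7500, 1000, minimum=5, maximum=20))   # 8
print(instances_needed(1200, 1000, minimum=5, maximum=20))   # floor of 5 applies
print(instances_needed(50000, 1000, minimum=5, maximum=20))  # cap of 20 applies
```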

Scenario 3: Media Streaming Backend with Unpredictable Traffic

A media company streams live events. Traffic spikes are sudden and short-lived. They use a combination of scheduled scaling (pre-scale before events) and metric scaling for unexpected surges. They set scale-out to be aggressive: CPU > 60% for 5 minutes (the minimum window) increases by 3 instances. Scale-in is very conservative: CPU < 10% for 20 minutes reduces by 1. They also use Azure Front Door to route traffic to multiple regional App Service deployments, each with auto-scaling. They discovered that the default cool-down of 5 minutes for scale-out was too long for their flash crowds; they reduced it to 2 minutes via ARM template. They also ensure that the App Service Plan is on Premium tier to allow up to 30 instances and faster scaling.

How AZ-204 Actually Tests This

The AZ-204 exam tests auto-scaling under the objective 'Implement Azure App Service web apps', which belongs to the 'Develop Azure compute solutions' domain; auto-scaling is part of App Service Plan management, not IaaS. Specific sub-objectives include scaling and performance optimization. Expect 1-2 questions on auto-scaling, often scenario-based.

Common Wrong Answers and Why Candidates Choose Them

1. 'Autoscale at the app level': Many candidates think you can enable auto-scaling per web app. In reality, it's always at the App Service Plan level. All apps in the plan share the instance count.

2. 'Scale to zero instances': Candidates assume you can save costs by scaling to zero during low traffic. However, an App Service Plan must always have at least one running instance, so the minimum count is 1. (Free and Shared tiers do not support autoscale at all.)

3. 'Instant scaling': Some believe scaling is instantaneous. New instances take 1-5 minutes to provision and warm up. The exam may ask about handling sudden spikes; the answer is to use pre-warming or scheduled scaling, not to rely on instant autoscale.

4. 'Scale-in removes instances immediately': Candidates think scale-in instantly deletes instances, but Azure first stops routing new requests to an instance and lets in-flight requests finish before removing it. The exam may test this draining behavior.

Specific Numbers and Terms

Default cool-down: 5 minutes for any rule in the portal and CLI; 10 minutes or more is the usual recommendation for scale-in.

Maximum instances by tier: Basic=3 (manual scaling only), Standard=10, Premium=30 (default).

Minimum time window: 5 minutes.

Metric aggregation: Average, Min, Max (commonly Average).

Autoscale evaluation frequency: every 1-5 minutes (not configurable).

Edge Cases

Multiple rules: Scale-out happens if any scale-out rule is met; scale-in happens only if all scale-in rules are met. If scale-out and scale-in conditions are met at the same time, scale-out takes precedence.

Scheduled profiles: If a scheduled profile ends, the engine reverts to the default profile. Ensure default profile has appropriate min/max.

Custom metrics: Must be emitted to Azure Monitor; otherwise, autoscale cannot use them.

How to Eliminate Wrong Answers

If an option mentions 'scale individual app instances', it's wrong—scale is at plan level.

If an option says 'scale to 0', it's wrong.

If an option claims 'instant scaling', it's wrong.

Look for keywords: 'cool-down', 'draining', 'min/max instances', 'scheduled profile'.

Remember that scale-in should be configured more conservatively than scale-out.

For scenario questions, identify whether the problem is about cost optimization (use scale-in) or performance (use scale-out). If the scenario mentions unpredictable traffic, metric-based scaling is appropriate; if predictable, scheduled scaling.

Key Takeaways

Auto-scaling is configured at the App Service Plan level, not per app.

Default cool-down is 5 minutes per rule; set a longer one (e.g., 10 minutes) for scale-in.

Minimum instance count is 1 (cannot scale to zero).

Maximum instances depend on tier: Basic=3 (manual scaling only), Standard=10, Premium=30 (default).

Scale-in uses draining mode to complete existing requests before removing instances.

Metric evaluation uses a time window (minimum 5 minutes) to avoid reacting to transient spikes.

Scheduled scaling is ideal for predictable traffic; metric-based for unpredictable.

When both scale-out and scale-in conditions are met, scale-out takes precedence.

New instances take 1-5 minutes to provision; not instantaneous.

Autoscale evaluation frequency is every 1-5 minutes (not configurable).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Metric-Based Auto-Scaling

Reacts to real-time metrics like CPU, memory, or queue length.

Best for unpredictable traffic patterns.

Requires careful tuning of thresholds and windows to avoid flapping.

Scale-out and scale-in rules are separate and can use different cool-downs.

Can use custom metrics from Application Insights for app-specific signals.

Scheduled Auto-Scaling

Scales based on a predefined schedule (time of day, day of week).

Best for predictable traffic patterns (e.g., business hours).

No risk of flapping because it does not react to metrics.

Can set specific instance counts for each scheduled profile.

Often combined with metric-based scaling for baseline capacity.

Watch Out for These

Mistake

Auto-scaling can be configured per individual web app within an App Service Plan.

Correct

Auto-scaling is always configured at the App Service Plan level. All apps in the plan share the same instance count. To scale a single app independently, you must place it in its own plan.

Mistake

You can scale an App Service Plan to zero instances to save costs during idle periods.

Correct

The minimum instance count for an App Service Plan is 1. The plan must always have at least one running instance, so you cannot scale to zero. (Free and Shared tiers do not support autoscale at all.)

Mistake

When a scale-in action is triggered, instances are immediately removed.

Correct

Azure uses draining mode: it stops sending new requests to the instances to be removed and waits for existing requests to complete, so typical requests are not dropped. Only then are the instances removed. Very long-running requests can still be cut off if they outlast the shutdown grace period.

Mistake

Scaling out adds new instances instantly.

Correct

Provisioning new instances takes 1-5 minutes. During this time, the new instances are not yet available to serve traffic. The exam may test that you should pre-warm instances or use scheduled scaling for predictable spikes.

Mistake

Scale-out and scale-in rules must share the same cool-down period.

Correct

Each rule carries its own cool-down. The portal and CLI default is 5 minutes in both directions, but you can (and usually should) set a longer value for scale-in, such as 10 minutes. This asymmetry prevents rapid scale-in after a scale-out, reducing flapping.


Frequently Asked Questions

Can I auto-scale an App Service Plan to zero instances?

No, the minimum instance count is 1. An App Service Plan must always have at least one running instance to serve requests. To save costs during low traffic, you can scale in to one instance, but not zero. The Free and Shared tiers do not support auto-scaling at all.

What is the difference between scaling up and scaling out?

Scaling up (vertical scaling) changes the pricing tier of your App Service Plan to one with more resources (CPU, memory, etc.). Scaling out (horizontal scaling) adds more instances of the current tier. Auto-scaling only handles scaling out/in, not scaling up/down. Scaling up/down must be done manually or via automation (e.g., Azure Automation).

How long does it take for a new instance to be ready after a scale-out?

Typically 1-5 minutes. The time depends on the tier and whether the instance needs to be provisioned from scratch. For Premium tiers, instances may be pre-warmed to reduce latency. The exam expects you to know that scaling is not instant.

What happens if I have multiple scale-out rules that fire at the same time?

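Scale-out needs only one rule to be met, and when several fire the autoscale engine applies the most aggressive action (the one that increases the instance count the most). Scale-in requires every scale-in rule to be met, and the engine then applies the most conservative action (the one that decreases the least). This biases the engine toward availability and prevents premature scale-in.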

Can I use custom metrics for auto-scaling?

Yes, you can use custom metrics emitted to Azure Monitor, such as from Application Insights. For example, you can scale based on request rate per instance or application-specific performance counters. The custom metric must be available in Azure Monitor and follow the same aggregation rules.

Does auto-scaling work with deployment slots?

Auto-scaling applies to the entire App Service Plan, not individual slots. All slots (production, staging, etc.) within the same plan share the same instances. If you need to scale a specific slot independently, you must use a separate App Service Plan for that slot.

What is the default cool-down period and can I change it?

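The portal and Azure CLI default to a 5-minute cool-down for both scale-out and scale-in rules; a longer scale-in value such as 10 minutes is a common recommendation. You can change it in the ARM template, in the portal, or via the Azure CLI (the --cooldown option) when creating the rule. The minimum is 1 minute and the maximum is 1 week. Setting it too short can cause flapping.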

