This chapter covers Azure App Service auto-scaling, a critical feature for dynamically adjusting compute resources based on demand. For the AZ-204 exam, auto-scaling questions appear in approximately 10-15% of Compute domain questions, focusing on configuration, rules, and best practices. You will learn the internal mechanics, default values, and common pitfalls to ensure you can design cost-effective, responsive applications. Mastery of auto-scaling is essential for the exam and for real-world Azure development.
Consider a hotel that adjusts its front-desk staff based on guest demand. The hotel has a rule: if the average check-in queue exceeds 5 guests over a 10-minute period, the manager calls in additional staff from a nearby pool. Conversely, if the queue stays below 2 guests for 15 minutes, some staff are sent home. This mirrors Azure App Service auto-scaling: the 'queue' is a metric like CPU percentage or request count, the 'manager' is the autoscale engine, and the 'staff pool' is the set of instances. The hotel cannot instantly add staff; it takes 5 minutes for a new staff member to arrive (scaling out), and it avoids sending staff home too quickly to prevent rapid rehiring (cool-down period). Metrics are collected every minute, and the decision to scale is based on aggregated data over a window (e.g., 10 minutes) to avoid reacting to transient spikes. The hotel also has a maximum staff cap to control costs, just as Azure has a maximum instance count. If the hotel scales out too aggressively, it wastes money on idle staff; if it scales in too slowly, it overspends. The analogy breaks down if the hotel could instantly teleport staff, but Azure scaling takes 1-5 minutes for new instances to be ready. The key mechanistic parallel is the use of thresholds, time windows, and cool-downs to balance responsiveness and stability.
What is Auto-Scaling and Why Does It Exist?
Auto-scaling in Azure App Service automatically adjusts the number of instances running your web app, API, or mobile backend based on real-time metrics or a predefined schedule. It exists to solve the fundamental tension between cost and performance: you want enough resources to handle peak load without paying for idle capacity during low traffic. Without auto-scaling, you either over-provision (wasting money) or under-provision (causing poor performance or outages).
Azure App Service auto-scaling operates at the App Service Plan level, not the individual app level. When you scale out (increase instance count) or scale in (decrease), all apps within that plan share the new instance count. The plan's pricing tier determines the maximum number of instances allowed: Basic supports up to 3 instances (manual scaling only), Standard up to 10, Premium up to 30, and PremiumV2/V3 up to 30 (can be increased via support request). Free, Shared, and Basic tiers do not support auto-scaling; it requires Standard or higher.
How Auto-Scaling Works Internally
The Azure autoscale engine is a regional service that monitors metrics for a given resource (e.g., an App Service Plan). It evaluates scaling rules periodically, typically every 1 to 5 minutes, depending on the metric source. Rules come in two kinds, scale-out and scale-in; each pairs a metric trigger (e.g., CPU > 70% for 10 minutes) with an action (e.g., increase count by 1).
Metric Aggregation and Time Window:
- The engine collects metric samples (e.g., CPU percentage) from Azure Monitor every minute.
- It aggregates these samples over a 'time window' (e.g., last 10 minutes) using a statistic (e.g., Average, Minimum, Maximum).
- If the aggregated value crosses the threshold, the rule fires.
- The time window prevents 'flapping', that is, reacting to transient spikes. For example, if CPU spikes to 90% for one minute but averages 50% over 10 minutes, no scale-out occurs.
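The windowed check described above can be sketched in a few lines of Python. This is a simplified model for intuition, not Azure's actual implementation: the function name `rule_fires` and the sample data are made up for illustration, and the sketch assumes one sample per minute.

```python
from statistics import mean

def rule_fires(samples, window, threshold, statistic=mean):
    """Return True when the statistic over the last `window` samples exceeds `threshold`."""
    if len(samples) < window:
        return False  # assumption: with too little data, this sketch takes no action
    recent = samples[-window:]
    return statistic(recent) > threshold

# One-minute CPU samples: a single 90% spike in an otherwise quiet window (average 50%).
spiky = [45, 46, 44, 47, 90, 45, 46, 44, 45, 48]
# Sustained load across the whole window (average 76.5%).
busy = [72, 75, 78, 74, 80, 77, 73, 76, 79, 81]

print(rule_fires(spiky, window=10, threshold=70))                 # False
print(rule_fires(busy, window=10, threshold=70))                  # True
print(rule_fires(spiky, window=10, threshold=70, statistic=max))  # True
```

The last call shows why the choice of statistic matters: with Maximum instead of Average, the same one-minute spike would trigger a scale-out.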
Scale Actions and Instance Count Boundaries:
- Each rule specifies an 'instance count' change: increase by 1, decrease by 1, increase by percentage, decrease by percentage, or set to a specific count.
- You define a 'minimum', 'maximum', and 'default' instance count. The autoscale engine never scales below the minimum or above the maximum.
- The default instance count is used when the resource is first enabled for autoscale or after a restart.
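As a rough illustration of how an action and the instance boundaries interact, the sketch below applies one scale action and then clamps the result. The action names (`increase_by`, `set_to`, and so on) are hypothetical labels, and rounding percentage changes up to a whole instance is an assumption of this sketch, not documented Azure behavior.

```python
import math

def apply_action(current, kind, value, minimum, maximum):
    """Return the new instance count after one scale action, kept within [minimum, maximum]."""
    if kind == "increase_by":
        target = current + value
    elif kind == "decrease_by":
        target = current - value
    elif kind == "increase_percent":
        # Assumption: percentage changes are rounded up to a whole instance.
        target = current + math.ceil(current * value / 100)
    elif kind == "decrease_percent":
        target = current - math.ceil(current * value / 100)
    elif kind == "set_to":
        target = value
    else:
        raise ValueError(f"unknown action kind: {kind}")
    # The boundaries always win over the requested change.
    return max(minimum, min(maximum, target))

print(apply_action(9, "increase_by", 2, minimum=1, maximum=10))       # 10
print(apply_action(1, "decrease_by", 1, minimum=1, maximum=10))       # 1
print(apply_action(4, "increase_percent", 50, minimum=1, maximum=10)) # 6
```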
Cool-Down Period:
- After a scale action, the engine enforces a 'cool-down' period (default 5 minutes for scale-out, 10 minutes for scale-in). During this period, the same rule cannot fire again. This prevents rapid successive scaling that would be disruptive and costly.
- The cool-down applies per rule. If you have multiple rules, each has its own cool-down.
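A per-rule cool-down, as described above, amounts to remembering when a rule last fired and refusing to fire again too soon. The class below is a minimal sketch of that idea; `RuleCooldown` is a hypothetical name, and the 5-minute value is simply the scale-out figure quoted in the text.

```python
from datetime import datetime, timedelta

class RuleCooldown:
    """Tracks the last time one rule fired and blocks it until the cool-down elapses."""

    def __init__(self, cooldown):
        self.cooldown = cooldown
        self.last_fired = None

    def can_fire(self, now):
        # A rule that has never fired is always allowed.
        return self.last_fired is None or now - self.last_fired >= self.cooldown

    def record(self, now):
        self.last_fired = now

t0 = datetime(2024, 1, 1, 9, 0)
scale_out = RuleCooldown(timedelta(minutes=5))

print(scale_out.can_fire(t0))                         # True
scale_out.record(t0)                                  # the rule fires at 09:00
print(scale_out.can_fire(t0 + timedelta(minutes=3)))  # False
print(scale_out.can_fire(t0 + timedelta(minutes=5)))  # True
```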
Scale-In Protection:
- Azure uses 'scale-in protection' to avoid removing instances that are still processing requests. When a scale-in action occurs, the engine marks instances as 'draining' and stops sending new requests to them. Once existing requests complete, the instance is removed. This ensures no requests are dropped.
Key Components, Values, Defaults, and Timers
Default cool-down for scale-out: 5 minutes
Default cool-down for scale-in: 10 minutes
Minimum time window for metric evaluation: 5 minutes
Maximum instances per tier: Basic=3, Standard=10, Premium=30 (default)
Scale-out default increment: 1 instance
Scale-in default decrement: 1 instance
Metric collection interval: 1 minute (for most Azure Monitor metrics)
Autoscale evaluation frequency: Every 1-5 minutes (not configurable)
Cooldown range: 1 minute to 1 week (configurable in ARM template)
Common Metrics Used:
- CPU Percentage
- Memory Percentage
- HTTP Queue Length (number of requests waiting in the IIS queue)
- Data In/Out (bytes)
- Custom metrics via Application Insights
Configuration and Verification Commands
You can configure auto-scaling via the Azure portal, Azure CLI, PowerShell, or ARM templates. Below are key CLI commands:
Create autoscale settings (CLI):
az monitor autoscale create \
--resource-group myResourceGroup \
--resource myAppServicePlan \
--resource-type Microsoft.Web/serverfarms \
--name myAutoscaleSettings \
--min-count 1 \
--max-count 10 \
--count 1
Add a scale-out rule:
az monitor autoscale rule create \
--resource-group myResourceGroup \
--autoscale-name myAutoscaleSettings \
--scale out 1 \
--condition "CpuPercentage > 70 avg 10m"
Add a scale-in rule:
az monitor autoscale rule create \
--resource-group myResourceGroup \
--autoscale-name myAutoscaleSettings \
--scale in 1 \
--condition "CpuPercentage < 30 avg 10m"
View autoscale history:
az monitor autoscale history \
--resource-group myResourceGroup \
--autoscale-name myAutoscaleSettings
Interaction with Related Technologies
Azure Load Balancer: App Service uses an internal load balancer to distribute requests across instances. Auto-scaling adds/removes instances from the load balancer's backend pool.
Application Insights: You can use custom metrics from Application Insights (e.g., server response time, request rate) as triggers for auto-scaling.
Azure Monitor: All autoscale metrics are stored in Azure Monitor. You can create alerts based on autoscale events.
Traffic Manager / Front Door: These global load balancers can route traffic to multiple App Service instances across regions, but auto-scaling operates per-region.
Virtual Network: If your App Service is integrated with a VNet, scaling out adds new instances that are also integrated.
Scheduled vs. Metric-Based Scaling
Scheduled (time-based) scaling: You can define profiles that set instance counts for specific times or dates. For example, scale to 10 instances from 9 AM to 5 PM on weekdays, scale to 1 instance at night. This is useful for predictable traffic patterns.
Metric-based scaling: Reacts to real-time metrics. You can combine both: use a scheduled profile to set baseline and metric rules to adjust within that baseline's min/max.
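To make the idea of scheduled profiles concrete, the sketch below picks the active profile for a given moment using the 9 AM to 5 PM weekday example from the text. The `Profile` class and profile names are hypothetical, and time zones are ignored for simplicity (a real schedule is tied to a time zone).

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass(frozen=True)
class Profile:
    name: str
    minimum: int
    maximum: int
    default: int

# Fixed-count profiles: minimum == maximum pins the instance count.
BUSINESS_HOURS = Profile("weekday-business-hours", minimum=10, maximum=10, default=10)
DEFAULT = Profile("default", minimum=1, maximum=1, default=1)

def active_profile(now):
    """Return the business-hours profile on weekdays from 09:00 to 17:00, else the default."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = time(9, 0) <= now.time() < time(17, 0)
    return BUSINESS_HOURS if is_weekday and in_hours else DEFAULT

print(active_profile(datetime(2024, 1, 3, 10, 0)).name)  # weekday-business-hours
print(active_profile(datetime(2024, 1, 3, 22, 0)).name)  # default
print(active_profile(datetime(2024, 1, 6, 10, 0)).name)  # default
```

Widening a profile's range (for example minimum=5, maximum=10) is what leaves room for metric rules to adjust the count inside the scheduled baseline.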
Best Practices
Always set a minimum instance count of at least 2 for production to handle failover during updates.
Use separate rules for scale-out and scale-in with different thresholds and windows to avoid thrashing.
Scale-out aggressively (e.g., CPU > 70% for 5 minutes) and scale-in conservatively (e.g., CPU < 30% for 15 minutes).
Test your scaling rules with load testing tools like Apache JMeter or Azure Load Testing.
Monitor autoscale history to identify flapping or insufficient scaling.
Limitations
Scale-in rules based on memory percentage are unreliable because memory is not always freed immediately.
The maximum instance count is capped by your plan tier. To exceed the default max, you must open a support request.
Scaling is not instantaneous; new instances take 1-5 minutes to be ready. For very sudden spikes, consider using Azure Functions (serverless) or pre-warming instances.
Autoscale cannot scale to zero instances because an App Service Plan must always have at least one running instance.
Define Autoscale Profile
First, you create an autoscale profile that sets the context for scaling. The profile includes a name, the resource to scale (App Service Plan), and the minimum, maximum, and default instance counts. For example, you might set min=1, max=10, default=1. This profile can be scheduled to activate only during certain hours or days. The autoscale engine uses this profile to determine the boundaries within which scaling rules operate. An autoscale setting always needs at least one profile, because the profile is what supplies these constraints.
Configure Scale-Out Rule
Add a rule that triggers when a metric exceeds a threshold. For example, 'When CPU Percentage > 70, increase instance count by 1'. The rule includes a metric source, statistic (e.g., Average), time window (e.g., 10 minutes), and threshold. The engine evaluates this rule every 1-5 minutes. If the condition is met, the engine schedules a scale-out action. The cool-down period (default 5 minutes) prevents another scale-out from the same rule immediately.
Configure Scale-In Rule
Add a rule that triggers when a metric drops below a threshold. For example, 'When CPU Percentage < 30, decrease instance count by 1'. The time window for scale-in is typically longer (e.g., 15 minutes) to avoid premature scale-in. The cool-down for scale-in is default 10 minutes. The engine uses the same evaluation cycle. If multiple scale-in rules exist, all of them must be satisfied before the engine scales in, and it then applies the action that removes the fewest instances (the most conservative one).
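Put together, the profile and its two rules map onto the shape of an ARM autoscale setting. The sketch below builds that shape as a Python dictionary. The field names follow my understanding of the `Microsoft.Insights/autoscaleSettings` schema and should be checked against the official reference; `PLAN_ID` is a placeholder resource ID, and the thresholds, windows, and cool-downs are the example values from the text.

```python
import json

# Placeholder resource ID (hypothetical subscription, names taken from the CLI examples).
PLAN_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup"
    "/providers/Microsoft.Web/serverfarms/myAppServicePlan"
)

def metric_rule(operator, threshold, window, direction, change, cooldown):
    """Build one CPU-based rule; durations are ISO 8601 strings such as PT10M."""
    return {
        "metricTrigger": {
            "metricName": "CpuPercentage",
            "metricResourceUri": PLAN_ID,
            "timeGrain": "PT1M",
            "statistic": "Average",
            "timeWindow": window,
            "timeAggregation": "Average",
            "operator": operator,
            "threshold": threshold,
        },
        "scaleAction": {
            "direction": direction,
            "type": "ChangeCount",
            "value": str(change),
            "cooldown": cooldown,
        },
    }

profile = {
    "name": "default",
    "capacity": {"minimum": "1", "maximum": "10", "default": "1"},
    "rules": [
        metric_rule("GreaterThan", 70, "PT10M", "Increase", 1, "PT5M"),  # scale-out
        metric_rule("LessThan", 30, "PT15M", "Decrease", 1, "PT10M"),    # scale-in
    ],
}

print(json.dumps(profile, indent=2))
```

Keeping scale-out and scale-in as two entries in the same `rules` list makes the asymmetry (shorter window and cool-down for out, longer for in) easy to review at a glance.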
Engine Evaluates Metrics
The autoscale engine periodically queries Azure Monitor for the specified metrics. It aggregates the samples over the defined time window using the chosen statistic (Average, Min, Max, etc.). If the aggregated value crosses the threshold, the rule is considered 'fired'. The engine then checks if the cool-down period for that rule has elapsed. If yes, it executes the scale action. If multiple rules fire simultaneously, the engine applies the most aggressive scale-out action but the most conservative scale-in action.
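The conflict-resolution behavior described above can be modeled as picking one target from the rules that fired in a cycle. This is a simplified sketch under the assumptions stated in the text (scale-out beats scale-in, largest scale-out wins, smallest reduction wins); the function name `resolve` and its inputs are hypothetical.

```python
def resolve(current, scale_out_targets, scale_in_targets, minimum, maximum):
    """Pick one new instance count from the rules that fired in this evaluation cycle.

    Each list holds the instance count that a fired rule would produce.
    """
    if scale_out_targets:
        target = max(scale_out_targets)  # most aggressive scale-out
    elif scale_in_targets:
        target = max(scale_in_targets)   # most conservative scale-in (removes the fewest)
    else:
        target = current                 # nothing fired
    return max(minimum, min(maximum, target))

print(resolve(4, [5, 7], [3], minimum=1, maximum=10))  # 7
print(resolve(4, [], [3, 2], minimum=1, maximum=10))   # 3
print(resolve(9, [12], [], minimum=1, maximum=10))     # 10
```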
Scale Action Executes
The engine sends a request to the App Service resource provider to add or remove instances. For scale-out, new instances are provisioned and added to the load balancer backend pool. This takes 1-5 minutes. For scale-in, the engine marks instances for draining: it stops sending new requests to those instances and waits for existing requests to complete. Once all requests finish, the instances are removed. During this period, the metric may change, but the cool-down prevents immediate reversal.
Monitor and Adjust
After scaling, you should monitor the autoscale history and metrics to ensure the rules are effective. Use Azure Monitor alerts to notify you of autoscale events. If you observe flapping (rapid scale-out/in cycles), adjust thresholds or time windows. Predictive autoscale, which uses machine learning to forecast load and scale proactively, is documented by Microsoft for Virtual Machine Scale Sets, so do not assume it is available for App Service Plans. Regular review ensures cost optimization and performance.
Enterprise Scenario 1: E-Commerce Platform with Seasonal Spikes
A large e-commerce company runs its product catalog and checkout services on Azure App Service. During Black Friday, traffic can increase 10x within minutes. They use auto-scaling with a metric-based rule: CPU > 70% for 5 minutes triggers scale-out by 2 instances (to catch up faster). Scale-in is conservative: CPU < 20% for 30 minutes reduces by 1 instance. They also use a scheduled profile to pre-scale to 20 instances at 6 AM on Black Friday, reducing the cold-start delay. The maximum is set to 50 (after a support request). They monitor autoscale history and set up alerts for scale-out events. A common misconfiguration is setting the scale-in threshold too close to the scale-out threshold (e.g., 60% out, 50% in), causing flapping. They avoid this with a 50-point gap between the two thresholds (70% out, 20% in) and a longer scale-in window.
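A quick way to see why narrow threshold gaps flap is to estimate the per-instance load after a scale-in. The sketch below assumes total load stays constant and spreads evenly across instances, which is a simplification; the function names are hypothetical.

```python
def projected_after_scale_in(avg_metric, instances, removed=1):
    """Estimate the per-instance average after removing instances, assuming constant total load."""
    remaining = instances - removed
    if remaining < 1:
        raise ValueError("at least one instance must remain")
    return avg_metric * instances / remaining

def would_flap(avg_metric, instances, scale_out_threshold, removed=1):
    """True when the projected load after scale-in would immediately exceed the scale-out threshold."""
    return projected_after_scale_in(avg_metric, instances, removed) > scale_out_threshold

# Misconfigured pair (60% out, 50% in): 3 instances at 49% average CPU.
print(projected_after_scale_in(49, 3))            # 73.5
print(would_flap(49, 3, scale_out_threshold=60))  # True

# Scenario pair (70% out, 20% in): 3 instances at 19% average CPU.
print(projected_after_scale_in(19, 3))            # 28.5
print(would_flap(19, 3, scale_out_threshold=70))  # False
```

In the misconfigured case, removing one of three instances pushes the remaining two straight past the scale-out threshold, so the plan would scale out again as soon as the cool-down allows.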
Enterprise Scenario 2: SaaS Application with Predictable Work Hours
A SaaS provider serves business users who are active 8 AM to 6 PM local time. They use scheduled scaling: a weekday profile sets min=5, max=20, with metric rules within that range. A weekend profile sets min=1, max=5. They also use custom metrics from Application Insights—specifically 'request rate per instance'—to scale out when requests exceed 1000 per minute per instance. This is more accurate than CPU for their I/O-bound app. They set scale-out cool-down to 10 minutes to avoid overreacting to short bursts. They learned the hard way that scaling in too quickly (cool-down of 5 minutes) caused instances to be removed while still processing long-running reports, leading to request failures. They now use a 15-minute cool-down for scale-in.
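The per-instance request-rate trigger in this scenario comes down to simple arithmetic. The sketch below is illustrative capacity math only, using the scenario's limit of 1000 requests per minute per instance and the weekday bounds (min=5, max=20); the traffic figures and function names are made up for the example, and the real engine changes the count step by step rather than jumping to a computed target.

```python
import math

def per_instance_rate(total_requests_per_min, instances):
    """Average requests per minute handled by each instance."""
    return total_requests_per_min / instances

def instances_needed(total_requests_per_min, limit_per_instance, minimum, maximum):
    """Smallest count that keeps the per-instance rate at or below the limit, within the bounds."""
    needed = math.ceil(total_requests_per_min / limit_per_instance)
    return max(minimum, min(maximum, needed))

print(per_instance_rate(7200, 6))            # 1200.0, above the 1000 limit
print(instances_needed(7200, 1000, 5, 20))   # 8
print(instances_needed(900, 1000, 5, 20))    # 5, the profile minimum applies
print(instances_needed(30000, 1000, 5, 20))  # 20, the profile maximum applies
```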
Scenario 3: Media Streaming Backend with Unpredictable Traffic
A media company streams live events. Traffic spikes are sudden and short-lived. They use a combination of scheduled scaling (pre-scale before events) and metric scaling for unexpected surges. They set scale-out to be aggressive: CPU > 60% for 5 minutes (the minimum window) increases by 3 instances. Scale-in is very conservative: CPU < 10% for 20 minutes reduces by 1. They also use Azure Front Door to route traffic to multiple regional App Service deployments, each with auto-scaling. They discovered that the default cool-down of 5 minutes for scale-out was too long for their flash crowds; they reduced it to 2 minutes via ARM template. They also ensure that the App Service Plan is on Premium tier to allow up to 30 instances and faster scaling.
The AZ-204 exam tests auto-scaling under the objective 'Implement Azure App Service web apps', which belongs to the 'Develop Azure compute solutions' skill area; auto-scaling is part of App Service Plan management. Specific sub-objectives include scaling and performance optimization. Expect 1-2 questions on auto-scaling, often scenario-based.
Common Wrong Answers and Why Candidates Choose Them
'Autoscale at the app level': Many candidates think you can enable auto-scaling per web app. In reality, it's always at the App Service Plan level. All apps in the plan share the instance count.
'Scale to zero instances': Candidates assume you can save costs by scaling to zero during low traffic. However, the App Service Plan must always have at least one running instance (except Free tier, which doesn't support autoscale). The minimum count is 1.
'Instant scaling': Some believe scaling is instantaneous. New instances take 1-5 minutes to provision and warm up. The exam may ask about handling sudden spikes—answer is to use pre-warming or scheduled scaling, not rely on instant autoscale.
'Scale-in removes instances immediately': Candidates think scale-in instantly deletes instances, but Azure uses draining mode to complete existing requests first. The exam may test that no requests are dropped during scale-in.
Specific Numbers and Terms
Default cool-down: 5 min out, 10 min in (memorize these).
Maximum instances by tier: Basic=3, Standard=10, Premium=30 (default).
Minimum time window: 5 minutes.
Metric aggregation: Average, Min, Max (commonly Average).
Autoscale evaluation frequency: every 1-5 minutes (not configurable).
Edge Cases
Multiple rules: If both scale-out and scale-in conditions are met simultaneously, scale-out takes precedence (most aggressive wins for out, most conservative for in).
Scheduled profiles: If a scheduled profile ends, the engine reverts to the default profile. Ensure default profile has appropriate min/max.
Custom metrics: Must be emitted to Azure Monitor; otherwise, autoscale cannot use them.
How to Eliminate Wrong Answers
If an option mentions 'scale individual app instances', it's wrong—scale is at plan level.
If an option says 'scale to 0', it's wrong.
If an option claims 'instant scaling', it's wrong.
Look for keywords: 'cool-down', 'draining', 'min/max instances', 'scheduled profile'.
Remember that scale-in should be configured more conservatively than scale-out.
For scenario questions, identify whether the problem is about cost optimization (use scale-in) or performance (use scale-out). If the scenario mentions unpredictable traffic, metric-based scaling is appropriate; if predictable, scheduled scaling.
Auto-scaling is configured at the App Service Plan level, not per app.
Default cool-down: 5 minutes for scale-out, 10 minutes for scale-in.
Minimum instance count is 1 (cannot scale to zero).
Maximum instances depend on tier: Basic=3, Standard=10, Premium=30 (default).
Scale-in uses draining mode to complete existing requests before removing instances.
Metric evaluation uses a time window (minimum 5 minutes) to avoid reacting to transient spikes.
Scheduled scaling is ideal for predictable traffic; metric-based for unpredictable.
When both scale-out and scale-in conditions are met, scale-out takes precedence.
New instances take 1-5 minutes to provision; not instantaneous.
Autoscale evaluation frequency is every 1-5 minutes (not configurable).
Metric-based and scheduled auto-scaling both come up on the exam frequently. Here is how to tell them apart.
Metric-Based Auto-Scaling
Reacts to real-time metrics like CPU, memory, or queue length.
Best for unpredictable traffic patterns.
Requires careful tuning of thresholds and windows to avoid flapping.
Scale-out and scale-in rules are separate with different cool-downs.
Can use custom metrics from Application Insights for app-specific signals.
Scheduled Auto-Scaling
Scales based on a predefined schedule (time of day, day of week).
Best for predictable traffic patterns (e.g., business hours).
No risk of flapping because it does not react to metrics.
Can set specific instance counts for each scheduled profile.
Often combined with metric-based scaling for baseline capacity.
Mistake
Auto-scaling can be configured per individual web app within an App Service Plan.
Correct
Auto-scaling is always configured at the App Service Plan level. All apps in the plan share the same instance count. To scale a single app independently, you must place it in its own plan.
Mistake
You can scale an App Service Plan to zero instances to save costs during idle periods.
Correct
The minimum instance count for an App Service Plan is 1 (except Free tier, which doesn't support autoscale). You cannot scale to zero; the plan must always have at least one running instance.
Mistake
When a scale-in action is triggered, instances are immediately removed.
Correct
Azure uses draining mode: it stops sending new requests to the instances to be removed and waits for existing requests to complete. This ensures no requests are dropped. Only after all requests finish are the instances removed.
Mistake
Scaling out adds new instances instantly.
Correct
Provisioning new instances takes 1-5 minutes. During this time, the new instances are not yet available to serve traffic. The exam may test that you should pre-warm instances or use scheduled scaling for predictable spikes.
Mistake
The default cool-down period is the same for scale-out and scale-in.
Correct
The default cool-down is 5 minutes for scale-out and 10 minutes for scale-in. This asymmetry prevents rapid scale-in after a scale-out, reducing flapping.
An App Service Plan cannot scale to zero: the minimum instance count is 1, because the plan must always have at least one running instance to serve requests. The Free tier does not support auto-scaling. To save costs during low traffic, you can scale in to one instance, but not zero.
Scaling up (vertical scaling) changes the pricing tier of your App Service Plan to one with more resources (CPU, memory, etc.). Scaling out (horizontal scaling) adds more instances of the current tier. Auto-scaling only handles scaling out/in, not scaling up/down. Scaling up/down must be done manually or via automation (e.g., Azure Automation).
New instances typically take 1-5 minutes to become ready. The time depends on the tier and whether the instance needs to be provisioned from scratch. For Premium tiers, instances may be pre-warmed to reduce latency. The exam expects you to know that scaling is not instant.
When multiple rules fire at the same time, the autoscale engine applies the most aggressive scale-out action (the one that increases the instance count the most). For scale-in, it applies the most conservative action (the one that decreases the least). This prevents over-scaling.
Auto-scaling can use custom metrics emitted to Azure Monitor, such as those from Application Insights. For example, you can scale based on request rate per instance or application-specific performance counters. The custom metric must be available in Azure Monitor and follow the same aggregation rules.
Auto-scaling applies to the entire App Service Plan, not individual slots. All slots (production, staging, etc.) within the same plan share the same instances. If you need to scale a specific slot independently, you must use a separate App Service Plan for that slot.
Default cool-down is 5 minutes for scale-out and 10 minutes for scale-in. You can change it in the ARM template or via Azure CLI when creating the rule. The minimum is 1 minute, maximum is 1 week. However, setting it too short can cause flapping.