AZ-305, Chapter 42 of 103, Objective 4.4

Bulkhead and Circuit Breaker Patterns

This chapter covers the Bulkhead and Circuit Breaker patterns, two critical resilience patterns for designing fault-tolerant Azure solutions. For the AZ-305 exam, these patterns appear in questions about designing for high availability and disaster recovery, typically comprising 5-10% of exam questions on objective 4.4. You will need to understand when to apply each pattern, how they differ, and how to implement them using Azure services like Azure Functions, Service Bus, and Polly library integration.

25 min read
Intermediate
Updated May 31, 2026

Ship Compartments and Electrical Breakers

Imagine a large ship divided into watertight compartments (bulkheads). If a hull breach occurs, only the flooded compartment is sealed off, preventing the entire ship from sinking. This is the bulkhead pattern: isolating failures to prevent cascading collapse. Now consider a home circuit breaker: if too many appliances draw current, the breaker trips, cutting power to protect wiring from overheating. After a brief delay, you reset it once the overload is resolved. This is the circuit breaker pattern: failing fast to protect system resources and allowing recovery. In Azure, bulkheads isolate services into separate App Service plans, databases, subscriptions, or regions, so a failure in one doesn't affect others. Circuit breakers wrap calls to external dependencies. When failures exceed a threshold (e.g., 5 consecutive timeouts), the breaker opens and subsequent calls fail immediately without waiting. After a timeout (e.g., 30 seconds), it allows a probe call to test recovery. Together, these patterns make a system resilient: bulkheads contain the blast radius, and circuit breakers prevent resource exhaustion and allow graceful degradation.

How It Actually Works

What Are Bulkhead and Circuit Breaker Patterns?

Resilience patterns help applications handle failures gracefully. The Bulkhead pattern isolates components into separate pools so that a failure in one does not cascade. The Circuit Breaker pattern detects failures and prevents repeated attempts that are likely to fail, allowing the system to recover.

Bulkhead Pattern

The term comes from shipbuilding: a ship's hull is divided into watertight compartments (bulkheads) so that if one compartment floods, the others keep the ship afloat. In software, the Bulkhead pattern partitions resources (thread pools, connections, memory) into isolated groups. Each group serves a specific workload, and if one group fails, the others remain unaffected.

Key Components:
- Partitions: Logical or physical separation of resources. In Azure, partitions can be:
  - Separate Azure App Service plans for different microservices
  - Separate Azure SQL databases for different tenants
  - Separate Service Bus namespaces for different message streams
- Resource Pools: Each partition has its own thread pool, connection pool, or memory quota.
- Failure Isolation: A failure in one partition does not consume resources from another.

How It Works:
1. The system identifies workloads that should be isolated (e.g., user-facing vs. batch processing).
2. Each workload is assigned its own resource pool. For example, a Function App on a dedicated App Service plan serves the critical APIs, and a separate plan runs the background jobs.
3. If the batch processing pool experiences a memory leak, it affects only that App Service plan. The critical APIs continue running on their own plan.
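The steps above can be illustrated at the code level with a minimal in-process sketch in Python. It assumes each workload gets a fixed number of concurrent slots, standing in for the capacity of a dedicated App Service plan. The class `Bulkhead`, the exception `BulkheadFullError`, and the partition names are hypothetical and exist only for illustration. When a partition is saturated, it rejects new work instead of borrowing capacity from its neighbor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when a partition has no free slot; the call is rejected, not queued."""


class Bulkhead:
    """An isolated partition that runs at most `max_concurrent` calls at once."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: a saturated partition fails fast instead of
        # waiting or taking capacity from another partition.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical partitions, one per workload (like separate App Service plans).
critical_api = Bulkhead("critical-api", max_concurrent=4)
batch_jobs = Bulkhead("batch-jobs", max_concurrent=2)


def demo():
    """Saturate the batch partition and show the critical partition still works."""
    release = threading.Event()
    started = threading.Semaphore(0)

    def stuck_batch_job():
        started.release()          # this job now holds a batch slot
        release.wait(timeout=5)    # simulate a hung or leaking job
        return "batch done"

    with ThreadPoolExecutor(max_workers=2) as pool:
        jobs = [pool.submit(batch_jobs.run, stuck_batch_job) for _ in range(2)]
        started.acquire()
        started.acquire()          # both batch slots are now occupied
        try:
            batch_jobs.run(lambda: "one more batch job")
            batch_rejected = False
        except BulkheadFullError:
            batch_rejected = True
        critical = critical_api.run(lambda: "order placed")
        release.set()              # let the stuck jobs finish
        results = [job.result() for job in jobs]
    return batch_rejected, critical, results


if __name__ == "__main__":
    print(demo())
```

In Azure the partition boundary is usually coarser, such as a plan, a database, or a namespace, but the principle is the same: each partition has a hard ceiling, and running out of capacity in one partition does not reduce the capacity of another.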

Configuration in Azure:
- Azure App Service: Create separate App Service plans for different tiers. Each plan has its own set of VMs and resource limits.
- Azure Service Bus: Use separate namespaces or topics for high-priority vs. low-priority messages, so a flood of low-priority messages cannot starve high-priority consumers. Topics in the same namespace still share that namespace's capacity; separate namespaces isolate capacity as well.
- Azure SQL Database: Use separate databases or separate elastic pools for different tenants (multi-tenant isolation).

Default Values and Timers: There are no specific timers, because bulkhead is a design pattern. However, you set resource limits per partition:
- App Service plan: max instances, max burst instances
- Service Bus: messaging units per namespace (Premium tier), message size limits
- SQL Database: DTU/vCore limits per database or elastic pool

Verification:
- Monitor partition-specific metrics in Azure Monitor: CPU, memory, and requests per partition.
- Use Application Insights to trace failures per component.

Circuit Breaker Pattern

A circuit breaker wraps calls to an external service or resource. It monitors for failures and, when the failure rate exceeds a threshold, it 'opens' the circuit, causing subsequent calls to fail immediately without attempting the actual call. After a timeout, it allows a limited number of test calls (half-open state) to see if the service has recovered. If successful, the circuit closes again.

States:
- Closed: Normal operation. Calls pass through, and the failure count is tracked.
- Open: Calls fail immediately. After a timeout, the circuit transitions to half-open.
- Half-Open: A limited number of calls are allowed to test recovery. If they succeed, the circuit closes; if one fails, it returns to open.

Key Components:
- Failure Threshold: Number of consecutive failures, or a failure rate, that opens the circuit (e.g., 5 failures within 10 seconds).
- Timeout (Reset Timeout): Duration the circuit stays open before transitioning to half-open (e.g., 30 seconds).
- Half-Open Max Calls: Number of test calls allowed in the half-open state (e.g., 1).
- Success Threshold: Number of successful test calls required to close the circuit (e.g., 2 consecutive successes).

How It Works:
1. In the closed state, each call that times out or throws an exception increments a failure counter. When consecutive failures are counted, a success resets the counter.
2. When the failure counter reaches the threshold (e.g., 5), the circuit opens.
3. While the circuit is open, all calls immediately throw an exception instead of calling the dependency (in Polly, a BrokenCircuitException).
4. After the reset timeout (e.g., 30 seconds), the circuit transitions to half-open.
5. In half-open, a limited number of calls are permitted. If they succeed, the circuit closes and the counters reset. If any fails, the circuit reopens and the timer restarts.
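The five steps above map onto a small state machine. The sketch below is a single-threaded Python illustration and not Polly's implementation. The names `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical; `clock` exists only so the reset timeout can be tested without real waiting. A production breaker would also protect its state with a lock.

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = self.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == self.OPEN:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open: failing fast")  # step 3
            self.state = self.HALF_OPEN                                # step 4
        elif self.state == self.HALF_OPEN:
            # A trial call is already in flight; admit no others.
            # (Only relevant with concurrent callers; this sketch is not thread-safe.)
            raise CircuitOpenError("circuit half-open: trial in progress")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self._failures += 1                                            # step 1
        if self.state == self.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = self.OPEN                                     # steps 2 and 5
            self._opened_at = self._clock()                            # (re)start the timer

    def _record_success(self):
        # Normal success or a successful trial (step 5): close and reset the count,
        # so only consecutive failures trip the breaker.
        self.state = self.CLOSED
        self._failures = 0
```

A test can pass a fake clock, for example `clock=lambda: now[0]`. It then trips the breaker with five failing calls and advances `now[0]` past 30 seconds to exercise the half-open trial, without any real waiting.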

Implementation in Azure: - Polly Library: A .NET resilience library that provides circuit breaker policies. Example:

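using System;
using System.Net.Http;
using Polly;

// Polly v7 syntax: break after 5 consecutive HttpRequestExceptions, stay open
// for 30 seconds, then let a trial call through (half-open).
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (ex, duration) => Console.WriteLine($"Circuit open for {duration.TotalSeconds}s: {ex.Message}"),
        onReset: () => Console.WriteLine("Circuit closed"),
        onHalfOpen: () => Console.WriteLine("Circuit half-open: next call is a trial"));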

Azure Functions: Use Polly within function code to wrap calls to downstream services.

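Azure API Management: Backend resources support configurable circuit breaker rules. A rule trips after a set number of matching failures within an interval, and API Management then stops forwarding requests to that backend for a trip duration. Retry and rate-limiting policies are also available.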

Azure Service Fabric: Has built-in circuit breaker in Reliable Services.

Default Values and Timers:
- Common starting values: 5 failures before opening, a 30-second break, and 1 call in half-open.
- These are configurable; no Azure service enforces specific values.

Verification:
- Log circuit state transitions using the onBreak, onReset, and onHalfOpen callbacks.
- Monitor circuit breaker metrics: open count, half-open count, and failure rate.
- Use Application Insights to track exceptions and circuit state changes.

Interaction with Related Technologies

Retry Pattern: Often combined with circuit breaker. Retry transient failures (e.g., 3 retries with exponential backoff), then circuit breaker opens after repeated failures.

Bulkhead and Circuit Breaker Together: Bulkhead isolates resource pools; circuit breaker prevents repeated calls to failing services within each pool.

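Availability zones and regions: Bulkhead at the infrastructure level. Azure Load Balancer distributes traffic across these zones or instances, but the isolation comes from the separate zones and regions, not from the load balancer itself.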

Azure Service Bus: Bulkhead using separate topics; circuit breaker for downstream processing.

Example: E-commerce Application

Bulkhead: Separate App Service plans for product catalog (high traffic) and order processing (critical).

Circuit Breaker: Calls to payment gateway wrapped in circuit breaker. If gateway times out 5 times in a row, circuit opens for 30 seconds. Subsequent orders show "payment unavailable" gracefully instead of hanging.

Trap Patterns (Common Wrong Answers)

Confusing Bulkhead with Load Balancing: Bulkhead isolates; load balancing distributes. Candidates choose load balancing when isolation is needed.

Using Circuit Breaker for All Failures: Circuit breaker is intended for dependencies whose failures can persist, such as an external service outage. It is not meant for internal logic errors. Candidates often apply it to database calls whose failures are brief and transient, where retry with exponential backoff is the better fit.

Setting Timeout Too Short: A 5-second reset timeout may cause thrashing between the open and half-open states. The exam is more likely to test the concept than exact values, but 30 seconds is a common starting point.

Walk-Through

1. Identify Workloads for Isolation

Analyze the system to determine which components need isolation. For example, separate critical user-facing APIs from batch processing jobs. In Azure, this means creating separate App Service plans, separate Service Bus namespaces, or separate databases. The goal is to ensure that a failure in one workload does not consume resources from another. This step involves capacity planning: each partition should have enough resources for its peak load, but not so much that it wastes cost. Common mistake: over-isolating, leading to high management overhead and cost.

2. Configure Resource Partitions

Implement the bulkhead by allocating dedicated resource pools. For Azure App Service, create a separate App Service plan for each tier. For Azure SQL, use separate databases or elastic pools. For Azure Functions, use separate function apps with dedicated plans. Set resource limits: max instances, max burst, DTU limits. Ensure that no partition can exceed its allocated resources. For example, set autoscaling rules per plan to scale independently. Monitor each partition's metrics to verify isolation. If one partition experiences high CPU, it should not affect others.

3. Implement Circuit Breaker Logic

Wrap calls to external dependencies (e.g., payment gateway, third-party API) with a circuit breaker. Use a library like Polly in .NET, or implement custom logic. Configure failure threshold (e.g., 5 consecutive failures), reset timeout (e.g., 30 seconds), and half-open max calls (e.g., 1). In Azure Functions, add Polly NuGet package and define a circuit breaker policy. The policy is applied to HttpClient calls. When the circuit opens, the function can return a cached response or a friendly error message. Log state transitions to Application Insights for monitoring.
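The step above ends with returning a cached response or a friendly error while the circuit is open. Here is a minimal fallback sketch in Python. It assumes the breaker raises a dedicated exception when open; Polly's equivalent is `BrokenCircuitException`, and a hypothetical `CircuitOpenError` stands in for it here. It also assumes a stale cached value is acceptable for this dependency. `get_with_fallback` is a hypothetical helper, not a library API.

```python
from typing import Any, Callable, Dict


class CircuitOpenError(Exception):
    """Hypothetical stand-in for the breaker's 'circuit is open' exception."""


def get_with_fallback(fetch: Callable[[str], Any], cache: Dict[str, Any],
                      key: str) -> Dict[str, Any]:
    """Call `fetch` (already wrapped by a breaker); degrade gracefully if it is open."""
    try:
        value = fetch(key)
    except CircuitOpenError:
        if key in cache:
            # Serve the last known good value (e.g., a cached risk score).
            return {"source": "cache", "value": cache[key]}
        # Nothing cached: return a friendly error instead of hanging.
        return {"source": "unavailable", "value": None,
                "message": "Service temporarily unavailable. Please try again shortly."}
    cache[key] = value  # refresh the cache on every successful live call
    return {"source": "live", "value": value}
```

Only the open-circuit exception triggers the fallback. Any other failure propagates, so the breaker still counts it and callers still see real errors.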

4. Test and Monitor Circuit Breaker

Simulate failures to verify circuit breaker behavior. For example, intentionally cause timeouts from the downstream service. Observe that after 5 failures, subsequent calls fail immediately without hitting the service. After 30 seconds, one call is allowed; if it succeeds, the circuit closes. Monitor using Application Insights: create custom metrics for circuit state and set up alerts for circuit-open events. A common issue is a circuit that never closes because every half-open call fails; confirm that the downstream service has actually recovered. Also decide the scope of each breaker (typically one per downstream dependency or endpoint). Each instance of your app keeps its own breaker state unless you share it through a distributed store.
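For the monitoring side of this step, the breaker's transition hooks (Polly's `onBreak`, `onHalfOpen`, and `onReset`) can feed a simple open-duration check. The sketch below assumes transitions are recorded as (timestamp, state) pairs. `CircuitMonitor` and its 5-minute default threshold are hypothetical. In Azure, the recording would more likely become Application Insights custom events with an alert rule than in-memory state. The check measures from the first open, so repeated failed half-open probes do not reset the clock.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CircuitMonitor:
    alert_after_seconds: float = 300.0  # hypothetical: alert if not closed for > 5 minutes
    transitions: List[Tuple[float, str]] = field(default_factory=list)

    def record(self, timestamp: float, state: str) -> None:
        """Call from the breaker's break / half-open / reset hooks."""
        self.transitions.append((timestamp, state))

    def unhealthy_since(self) -> Optional[float]:
        """Start of the current stretch without a 'closed' transition, or None."""
        start = None
        for timestamp, state in self.transitions:
            if state == "closed":
                start = None        # recovered: the stretch ends
            elif start is None:
                start = timestamp   # first open after being closed
        return start

    def should_alert(self, now: float) -> bool:
        start = self.unhealthy_since()
        return start is not None and now - start > self.alert_after_seconds
```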

5. Combine with Retry Pattern

Retry transient failures before opening the circuit. For example, use a retry policy with exponential backoff (e.g., 3 retries with 1, 2, 4 second delays). If all retries fail, then the circuit breaker counts it as a failure. This prevents opening the circuit for a single transient blip. Configure the retry policy to handle specific exceptions (e.g., HttpRequestException) and the circuit breaker to handle those plus timeouts. Ensure that the retry count is low enough to avoid long delays. In Polly, you can wrap a retry policy inside a circuit breaker policy. This is a common exam scenario: which pattern to use first? Retry, then circuit breaker.
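The ordering described above can be sketched in Python: retry first, then have the breaker count an exhausted retry sequence as one failure. This is an illustration, not Polly's API. `retry_with_backoff`, `Breaker`, and `call_dependency` are hypothetical names, and the breaker is a stripped-down single-threaded version. The `sleep` and `clock` parameters exist only so the delays can be tested without waiting. In Polly, the equivalent composition is built with a policy wrap.

```python
import time
from typing import Callable, Sequence


class CircuitOpenError(Exception):
    """Raised while the breaker is open."""


def retry_with_backoff(fn: Callable[[], object], delays: Sequence[float] = (1, 2, 4),
                       retry_on=(ConnectionError, TimeoutError),
                       sleep: Callable[[float], None] = time.sleep):
    """One initial attempt plus one retry per delay (1s, 2s, 4s = 3 retries)."""
    for delay in delays:
        try:
            return fn()
        except retry_on:
            sleep(delay)            # exponential backoff between attempts
    return fn()                     # last attempt; its exception propagates


class Breaker:
    """Minimal single-threaded consecutive-failure breaker."""

    def __init__(self, threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold, self.reset_timeout, self.clock = threshold, reset_timeout, clock
        self.failures = 0
        self.opened_at = None       # None means closed

    def call(self, fn: Callable[[], object]):
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_timeout:
            raise CircuitOpenError("circuit open: failing fast")
        # Closed, or the reset timeout has elapsed (this call is the half-open trial).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.threshold:
                self.opened_at = self.clock()   # open, or reopen after a failed trial
            raise
        self.failures, self.opened_at = 0, None
        return result


def call_dependency(breaker: Breaker, fn: Callable[[], object], sleep=time.sleep):
    # Retry sits inside the breaker: an exhausted retry sequence counts as ONE failure.
    return breaker.call(lambda: retry_with_backoff(fn, sleep=sleep))
```

When the dependency is down, each outer call makes 4 attempts (1 initial plus 3 retries) and counts as one breaker failure. The circuit therefore opens after 5 outer calls, or 20 attempts, and the sixth call fails fast without touching the dependency.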

What This Looks Like on the Job

Enterprise Scenario 1: Multi-Tenant SaaS Application

A SaaS provider hosts a multi-tenant application on Azure. Each tenant has its own database (Bulkhead pattern). The application uses Azure App Service with separate App Service plans for each tenant tier (free, standard, premium). This ensures that a noisy-neighbor tenant on the free tier cannot consume CPU or memory from premium tenants. Additionally, the application calls a third-party email service. To protect against email service outages, the team implements a circuit breaker using Polly, configured with 5 failures before opening, a 30-second break, and 1 half-open call. In production, the email service experienced intermittent failures. While the circuit was open, requests failed fast instead of each waiting for the email call to time out, and users saw an 'email temporarily unavailable' message. The team monitors circuit state using Application Insights and sets an alert if the circuit remains open for more than 5 minutes, triggering a manual switch to a backup email provider.

Enterprise Scenario 2: Financial Trading Platform

A financial trading platform uses Azure Service Bus to decouple order ingestion from processing. To ensure high-priority orders (e.g., large trades) are not delayed by low-priority market data, they use separate Service Bus topics (Bulkhead). Each topic lives in its own Premium namespace with dedicated messaging units, because topics in a shared namespace would also share that namespace's capacity. The processing services (Azure Functions) have dedicated App Service plans per priority. For the order processing function, calls to a risk analysis service are wrapped in a circuit breaker. The risk service is critical but occasionally slow. The circuit breaker is configured with 3 failures (aggressive) and a 10-second break (short) because financial transactions need quick fallback. If the circuit opens, the function uses a cached risk score. This setup ensures that a risk service outage does not block order processing entirely.

Common Misconfigurations

Bulkhead too fine-grained: Creating too many partitions increases cost and management complexity. For example, per-user App Service plans are unnecessary. Best practice: group by criticality or tenant tier.

Circuit breaker timeout too short: A 5-second break may cause constant state transitions if the service is down for minutes. This wastes resources on half-open probes. Typical break is 30 seconds to 5 minutes.

Not combining with retry: Applying circuit breaker without retry causes the circuit to open on transient failures. Always layer retry inside circuit breaker.

Ignoring half-open state: Some implementations skip half-open and go directly from open to closed after timeout, which can cause immediate failure if service is still down. Always use half-open to test recovery.
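The "timeout too short" point above comes down to simple arithmetic. Assume each half-open probe fails immediately and reopens the circuit, and ignore probe latency. Under those assumptions, one instance sends roughly one probe per break interval for as long as the outage lasts. `failed_probes` is a hypothetical helper for this estimate.

```python
def failed_probes(outage_seconds: float, break_seconds: float) -> int:
    """Approximate failed half-open probes per instance during an outage.

    Assumes every probe fails instantly and reopens the circuit, so one probe
    is sent per break interval (probe latency is ignored).
    """
    return int(outage_seconds // break_seconds)


for break_seconds in (5, 30, 300):
    probes = failed_probes(10 * 60, break_seconds)
    print(f"{break_seconds:>3}s break: ~{probes} failed probes in a 10-minute outage")
```

Against a 10-minute outage, a 5-second break produces about 120 failed probes per instance, a 30-second break about 20, and a 5-minute break about 2. The trade-off is detection lag. After the dependency recovers, traffic can keep failing fast for up to one full break interval, so a 5-minute break may delay recovery by as much as 5 minutes.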

How AZ-305 Actually Tests This

AZ-305 Objective 4.4: Design for High Availability and Disaster Recovery

This objective includes designing for resilience. The exam tests your ability to choose appropriate patterns for given scenarios. Key points:

1. Bulkhead vs. Load Balancing: The exam presents scenarios where a failure in one component should not affect others. Wrong answer: 'Use a load balancer to distribute traffic.' Load balancing does not isolate; it spreads load. The correct answer is the Bulkhead pattern (separate resource pools).

2. Circuit Breaker vs. Retry: A common question: 'An application calls an external API that can become unavailable for several minutes at a time. What pattern should you implement?' Many candidates choose 'Retry with exponential backoff'. However, if the API is down for minutes, retries keep failing and waste resources. The correct answer is 'Circuit Breaker', which fails fast and gives the API time to recover. If the scenario instead describes brief, transient errors, retry is the better answer. The exam may also ask: 'What is the primary purpose of a circuit breaker?' Answer: 'To prevent repeated calls to a failing service and allow it to recover.'

3. Specific Values: The exam does not test exact numbers like '5 failures', but it may ask about the concepts of threshold and timeout. Still, be familiar with typical values: 5 failures, a 30-second break, and 1 half-open call. Know that these are configurable.

4. Edge Cases:
- Multiple instances: Each instance of a service may have its own circuit breaker state. The exam may ask about consistency: 'If you have 10 instances of a function app, how many circuit breakers are there?' Answer: 10, each independent (unless state is shared through a distributed store).
- Half-open state: The exam may ask: 'What happens when a circuit is half-open?' Answer: 'A limited number of calls are allowed to test recovery.'
- Bulkhead in Azure: The exam may ask: 'How can you implement bulkhead in Azure App Service?' Answer: 'Use separate App Service plans.'

5. Eliminating Wrong Answers:
- If the question mentions 'isolating resources', eliminate options that mention scaling or distribution.
- If the question mentions 'preventing cascading failures', Bulkhead is likely.
- If the question mentions 'failing fast' or 'protecting resources from repeated failures', Circuit Breaker is likely.
- If the question mentions 'transient faults', Retry is more appropriate than Circuit Breaker.

6. Combined Patterns: The exam may present a scenario requiring both, e.g., 'Design a resilient microservices architecture.' You should propose Bulkhead for service isolation and Circuit Breaker for inter-service calls.

7. Azure Services: Know that Azure API Management provides policies such as 'rate limit' and 'retry', and that its backend resources support a configurable circuit breaker rule. Azure Functions can implement circuit breaker in code via Polly. Azure Service Bus can be used for bulkhead isolation.

Common Wrong Answers

Wrong: 'Use a load balancer to isolate failures.' (Load balancer distributes, not isolates)

Wrong: 'Use circuit breaker for database calls.' (Database calls typically have retry logic, not circuit breaker, because database failures are often transient and short)

Wrong: 'Set circuit breaker timeout to 5 seconds.' (Usually too short for a failing dependency to recover, which causes thrashing; 30 seconds is a common starting point)

Wrong: 'Bulkhead means scaling out.' (Scaling out adds instances, not isolation)

Key Takeaways

Bulkhead pattern isolates resources into separate pools to prevent a failure in one component from cascading to others.

Circuit breaker pattern has three states: closed (normal), open (fail fast), half-open (test recovery).

Common circuit breaker defaults: 5 failures before opening, 30-second break, 1 half-open call.

Bulkhead in Azure App Service is achieved by using separate App Service plans for different workloads.

Circuit breaker is typically combined with retry pattern: retry transient faults, then open circuit on persistent failures.

Each instance of a service maintains its own circuit breaker state unless a distributed store is used.

The exam tests distinguishing between bulkhead (isolation) and load balancing (distribution).

Circuit breaker is for external dependencies; for internal database calls, prefer retry with exponential backoff.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Bulkhead Pattern

Isolates resources into separate pools to prevent cascading failures

Focuses on resource partitioning (thread pools, connections, services)

No state transitions; static partitioning

Implemented by separate Azure App Service plans, databases, or namespaces

Best for workloads with different criticality or resource profiles

Circuit Breaker Pattern

Detects failures and prevents repeated calls to a failing service

Focuses on call behavior (fail fast, allow recovery)

Has three states: closed, open, half-open

Implemented via libraries like Polly or custom code

Best for external dependencies with potential long outages

Watch Out for These

Mistake

Bulkhead and load balancing are the same thing.

Correct

Bulkhead isolates resources into separate pools so that a failure in one pool does not affect others. Load balancing distributes traffic across multiple instances to improve performance and availability, but all instances share the same resource pool. They are complementary but serve different purposes.

Mistake

Circuit breaker should be used for all types of failures.

Correct

Circuit breaker is designed for failures that are likely to persist for some time (e.g., service outage). For transient failures (e.g., network blips), retry with exponential backoff is more appropriate. Using circuit breaker for transient faults may cause unnecessary circuit openings and degrade performance.

Mistake

A circuit breaker in half-open state allows all requests through.

Correct

In half-open state, only a limited number of requests (typically 1) are allowed to test recovery. If that request succeeds, the circuit closes and normal traffic resumes. If it fails, the circuit reopens. This prevents overwhelming the recovering service.

Mistake

Bulkhead is only about thread pools.

Correct

Bulkhead can be applied to any resource: thread pools, connection pools, memory, CPU, databases, message queues, etc. In Azure, it often translates to separate App Service plans, separate databases, or separate Service Bus namespaces.

Mistake

Circuit breaker state is shared across all instances of a service.

Correct

Typically, each instance maintains its own circuit breaker state. If you have 10 instances of a function app, each has its own failure counter and state machine. To share state, you would need a distributed circuit breaker (e.g., using Redis), which is more complex and not the default.
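To see why per-instance state matters, here is a small Python model. It assumes requests are spread round-robin across instances and that the dependency is completely down. `calls_before_all_open` is a hypothetical helper. The shared case models a distributed store such as Redis as a single counter and does not use a real Redis client.

```python
def calls_before_all_open(instances: int, threshold: int, shared: bool) -> int:
    """Count calls that reach a dead dependency before every caller fails fast."""
    counters = [0] * (1 if shared else instances)
    calls = 0
    request = 0
    while any(count < threshold for count in counters):
        slot = 0 if shared else request % instances   # round-robin routing
        if counters[slot] < threshold:   # this caller's circuit is still closed
            counters[slot] += 1          # the call reaches the dependency and fails
            calls += 1
        # otherwise the circuit is open and the request fails fast without a call
        request += 1
    return calls
```

With 10 instances and a threshold of 5, the model shows 50 failed calls reaching the dependency before every instance fails fast, versus 5 with shared state. Whether that extra load matters depends on the dependency. Shared state reduces it, at the cost of an extra network call per check and a new dependency on the store.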


Frequently Asked Questions

What is the difference between bulkhead and circuit breaker patterns?

Bulkhead isolates resources into separate pools to contain failures, while circuit breaker detects failures and prevents repeated calls to a failing service. Bulkhead is about partitioning resources; circuit breaker is about call behavior. They are complementary: bulkhead ensures a failure in one component doesn't affect others, and circuit breaker ensures that calls to a failing component fail fast and allow recovery.

How do I implement bulkhead in Azure App Service?

Create separate App Service plans for different workloads. Each plan has its own set of VMs, resource limits, and scaling settings. For example, put critical APIs in a Standard plan and background jobs in a Basic plan. This ensures that a memory leak in the background jobs does not affect the critical APIs. You can also use separate function apps with dedicated plans.

When should I use circuit breaker instead of retry?

Use retry for transient failures (e.g., network timeouts, database deadlocks) that are likely to succeed after a short delay. Use circuit breaker for persistent failures (e.g., service outage, rate limiting) where repeated attempts would waste resources and delay recovery. Typically, you combine both: retry a few times, then open the circuit if failures continue.

Can I implement circuit breaker in Azure without code?

Partly. Azure API Management provides 'retry' and 'rate limit' policies, and its backend resources support a configurable circuit breaker rule that stops forwarding requests to a failing backend for a set trip duration. For circuit breakers on calls made from your own code, with control over thresholds, half-open behavior, and fallbacks, use a library like Polly. Azure Functions, App Service, and microservices can all integrate Polly. Alternatively, you can use Azure Service Fabric's built-in circuit breaker.

What are the key parameters of a circuit breaker?

Key parameters: failure threshold (number of consecutive failures or failure rate to open the circuit), reset timeout (duration before transitioning to half-open), half-open max calls (number of test calls allowed), and success threshold (number of successful test calls to close the circuit). Typical defaults: 5 failures, 30 seconds, 1 call, 2 successes.

Does bulkhead pattern affect scaling?

Yes, each partition can scale independently. For example, you can set autoscaling rules per App Service plan. This allows you to allocate more resources to critical partitions while keeping others lean. However, bulkhead also increases management complexity and cost because you need separate resource pools.

How does circuit breaker handle half-open state?

In half-open state, the circuit allows a limited number of requests (typically 1) to pass through to the downstream service. If the request succeeds, the circuit closes and normal traffic resumes. If it fails, the circuit reopens and the reset timeout restarts. This prevents overwhelming a recovering service.
