AZ-305 · Chapter 47 of 103 · Objective 4.1

Azure Service Fabric

This chapter covers Azure Service Fabric, a distributed systems platform for building and managing microservices and containers at scale. On the AZ-305 exam, expect only a question or two on Service Fabric, typically in the context of choosing between compute options for stateful workloads, understanding when to use Service Fabric versus Azure Kubernetes Service (AKS), and evaluating its role in lift-and-shift scenarios. A working knowledge of Service Fabric's programming models, cluster architecture, and upgrade mechanisms supports the 'Design compute solutions' objective (4.1).

25 min read

Intermediate

Updated May 31, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security


Hotel Kitchen Brigade for Microservices

Azure Service Fabric operates like a high-end hotel kitchen brigade. The kitchen (cluster) has multiple stations (nodes), each with specialized chefs (services). A head chef (cluster manager) assigns tasks (service instances) to stations, tracks which chef is cooking what, and can reassign tasks if a station catches fire (node failure). The brigade uses a shared recipe book (state management) that all chefs consult; if a chef is writing in the book (write operation), others must wait (replication quorum). When a new dish (service version) is introduced, the head chef orchestrates a rolling upgrade: one station at a time, chefs learn the new recipe while others keep serving old dishes, ensuring guests (users) never wait long. If a station becomes overloaded, the head chef can split its tasks across multiple stations (partitioning). The entire brigade operates under a reservation system (reliable services) where each chef logs every action (reliable collections) so that if a chef collapses, their replacement can resume exactly where they left off.

How It Actually Works

What is Azure Service Fabric and Why It Exists

Azure Service Fabric is a distributed systems platform that simplifies building, packaging, deploying, and managing scalable and reliable microservices and containers. Originally developed by Microsoft to run core Azure infrastructure (e.g., SQL Database, Azure Cosmos DB, Azure Event Hubs), it was released as a product for customers to build their own mission-critical applications. Service Fabric addresses the challenges of distributed computing: state management, failure detection, load balancing, and upgrades without downtime.

How It Works Internally

Service Fabric runs on a cluster of virtual machines (nodes). Each node runs the Service Fabric runtime, and the cluster hosts the following system services:

Cluster Manager Service: Manages the application lifecycle (provisioning, creation, upgrade, and deletion), working with the Image Store.

Failover Manager Service: Tracks which replicas are primary, secondary, and idle; handles failover and reconfiguration, and redistributes replicas when nodes join or leave.

Naming Service: Resolves service names to endpoints.

Image Store Service: Stores application packages for deployment.

Repair Manager Service: Automates repair actions for unhealthy nodes.

Upgrade Orchestration Service: Manages rolling upgrades of the cluster itself.

When an application is deployed, Service Fabric partitions the service (if configured) and places replicas across nodes according to placement constraints and load metrics. For stateful services, write operations must be acknowledged by a quorum of replicas (majority) before being committed. The primary replica handles all reads and writes; changes are replicated to secondary replicas. If the primary fails, the Failover Manager promotes a secondary to primary.

Key Components, Values, Defaults, and Timers

Reliable Services: A programming model that allows services to use Reliable Collections (dictionaries, queues) for state. State is replicated and persisted.

Reliable Actors: A virtual actor pattern where each actor (a single-threaded object) has a unique ID and state. Actors are automatically garbage-collected after inactivity (default 60 minutes).

Guest Executables: Any executable can be hosted as a stateless service.

Containers: Docker containers can be deployed and orchestrated.

Partitioning: Three schemes: singleton (no partition), uniform Int64 (range-based), and named (string-based). Default is singleton unless specified.

Replica set size: Default is 3 replicas for stateful services.

Placement constraints: Boolean expressions evaluated against node properties (e.g., NodeType == "FrontEnd").

Upgrade domains (UDs) and fault domains (FDs): By default, a cluster has 5 UDs and 3 FDs. UDs ensure services are upgraded one domain at a time; FDs ensure replicas are placed in separate physical failure zones.

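Health check retry timeout (`HealthCheckRetryTimeoutSec`): Default 600 seconds for application upgrades. After a domain is upgraded, Service Fabric keeps re-evaluating health for up to this long before declaring the upgrade failed.

Platform upgrade timeout: 2 hours per domain.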
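To make the uniform Int64 scheme from the list above concrete, here is a minimal Python sketch of how a key range could be divided into equal partitions and how a key maps to its owner. The function names are hypothetical and this is not a Service Fabric API; the platform's exact boundary arithmetic may differ, and in a real service the client only supplies the Int64 key while Service Fabric resolves the partition.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def partition_ranges(count, low=INT64_MIN, high=INT64_MAX):
    """Split the inclusive key range [low, high] into `count` contiguous ranges."""
    total = high - low + 1
    if count < 1 or count > total:
        raise ValueError("count must be between 1 and the number of keys")
    base, extra = divmod(total, count)
    ranges, start = [], low
    for index in range(count):
        # The first `extra` ranges absorb one leftover key each.
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def partition_for_key(key, ranges):
    """Return the index of the range that owns `key`."""
    for index, (range_low, range_high) in enumerate(ranges):
        if range_low <= key <= range_high:
            return index
    raise KeyError(key)
```

With four partitions over the full Int64 space, key 0 falls into the third range (index 2) and key -1 into the second (index 1), which is why evenly spread keys such as hashed device IDs distribute load evenly.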

Configuration and Verification Commands

Service Fabric is managed via PowerShell, Azure CLI, or REST API. Common commands:

# Connect to cluster
Connect-ServiceFabricCluster -ConnectionEndpoint 'mycluster.westus.cloudapp.azure.com:19000'

# Get cluster health
Get-ServiceFabricClusterHealth

# Get node list
Get-ServiceFabricNode

# Get service details
Get-ServiceFabricService -ApplicationName 'fabric:/MyApp'

# Start a monitored rolling upgrade; roll back automatically if health checks fail
Start-ServiceFabricApplicationUpgrade -ApplicationName 'fabric:/MyApp' -ApplicationTypeVersion '2.0' -Monitored -FailureAction Rollback

Azure CLI:

az sf cluster list
az sf application list --resource-group myGroup --cluster-name myCluster

How It Interacts with Related Technologies

Azure Load Balancer: Distributes incoming traffic to Service Fabric nodes. The load balancer's health probes decide which nodes receive traffic; Service Fabric tracks node health separately through its own health model.

Azure Key Vault: Stores certificates used for cluster security.

Azure Monitor: Collects metrics and logs from Service Fabric clusters via Diagnostics extension.

Azure Pipelines: Can deploy applications to Service Fabric using built-in tasks.

AKS vs Service Fabric: AKS is a managed Kubernetes service; Service Fabric is more opinionated, with a focus on stateful services. The exam often asks when to choose one over the other. Choose Service Fabric when you need Reliable Services/Actors, fine-grained placement control, or are porting an existing Service Fabric app. Choose AKS for standard Kubernetes workloads, the open-source ecosystem, and portability.

Architecture Deep Dive: Replication and Consensus

Service Fabric uses a custom replication protocol inspired by Paxos but optimized for its use case. Each partition has a primary and one or more secondary replicas. A write operation proceeds as follows:

1. The client sends the write to the primary.

2. The primary writes to its local log and sends the operation to all secondary replicas.

3. Secondaries write the operation to their local logs and acknowledge receipt.

4. When the primary has acknowledgments from a majority (quorum) of the replica set, it commits the operation and responds to the client.

Read operations are served by the primary by default. Secondaries can serve reads only if the service opens a listener on them (for example, with `listenOnSecondary` on its replica listener), and such reads may lag behind the primary.
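The write path described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual replicator: the `Replica` class and `replicate_write` function are hypothetical, acknowledgments are synchronous, and uncommitted log entries are not cleaned up.

```python
def write_quorum(replica_count):
    """Majority of the replica set, counting the primary itself."""
    return replica_count // 2 + 1


class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def append(self, operation):
        """Write the operation to the local log; True is the acknowledgment."""
        if not self.healthy:
            return False
        self.log.append(operation)
        return True


def replicate_write(primary, secondaries, operation):
    """Return True if a quorum acknowledged the write, so it can commit."""
    if not primary.append(operation):
        return False
    acks = 1  # the primary's own local write counts toward the quorum
    for secondary in secondaries:
        if secondary.append(operation):
            acks += 1
    return acks >= write_quorum(1 + len(secondaries))
```

With three replicas the quorum is two, so a write still commits when one secondary is down but not when both are. With five replicas the quorum is three, which is why five-replica sets tolerate two simultaneous failures.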

Cluster Upgrades

Service Fabric supports two types of upgrades:

Application upgrades: Update application code/config without downtime. The upgrade mode can be Monitored (health-checked, with automatic rollback on failure), UnmonitoredAuto, or UnmonitoredManual.

Cluster upgrades: Update the Service Fabric runtime on all nodes. Microsoft manages these for managed clusters.

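During a monitored upgrade, Service Fabric:

1. Upgrades nodes in one upgrade domain at a time.

2. Evaluates health after each domain against the health policy. The wait, stable, and retry durations are configurable (`HealthCheckWaitDurationSec`, `HealthCheckStableDurationSec`, `HealthCheckRetryTimeoutSec`).

3. If health evaluation is still failing when the retry timeout expires (default 600 seconds for application upgrades), or the domain exceeds its upgrade domain timeout, runs the configured failure action, which is typically a rollback.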
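The control flow of a monitored upgrade can be illustrated with a small Python simulation. It is a sketch under stated assumptions: `run_monitored_upgrade` and its `is_healthy` callback are hypothetical names, the retry budget is counted in attempts rather than seconds, and the failure action is always a full rollback.

```python
def run_monitored_upgrade(upgrade_domains, is_healthy, max_health_checks=3):
    """Upgrade one domain at a time and roll everything back on persistent failure.

    upgrade_domains: ordered list of upgrade domain names.
    is_healthy: callable(domain, attempt) -> bool, a stand-in for health evaluation.
    Returns (outcome, domains_on_new_version).
    """
    on_new_version = []
    for domain in upgrade_domains:
        on_new_version.append(domain)  # the new version now runs in this domain
        # Re-evaluate health until it passes or the retry budget is used up.
        passed = any(
            is_healthy(domain, attempt)
            for attempt in range(1, max_health_checks + 1)
        )
        if not passed:
            on_new_version.clear()  # rollback: every domain returns to the old version
            return "RolledBack", on_new_version
    return "Completed", on_new_version
```

Because only one domain is taken down at a time, the remaining domains keep serving traffic on the old version, which is what makes the upgrade zero-downtime when replicas are spread across upgrade domains.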

Security

Node-to-node security: Uses X.509 certificates to encrypt communication between nodes.

Client-to-node security: Certificates or Microsoft Entra ID (formerly Azure Active Directory) for authentication.

Role-based access control (RBAC): Can assign roles to users for management operations.

Monitoring and Diagnostics

Azure Service Fabric Explorer: Web tool to view cluster topology, health, and metrics.

Azure Monitor: Collects performance counters, ETW events, and crash dumps.

Health model: Each entity (node, application, service, partition, replica) reports health. Health policies define acceptable health states.
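The way child health rolls up to a parent entity can be sketched as follows. This is a simplified model, assuming a single "maximum percent unhealthy" policy value; the names are hypothetical and options such as treating warnings as errors are left out.

```python
from enum import IntEnum


class HealthState(IntEnum):
    OK = 0
    WARNING = 1
    ERROR = 2


def aggregate_health(children, max_percent_unhealthy=0):
    """Roll child health states up to a parent under a tolerance policy."""
    if not children:
        return HealthState.OK
    errors = sum(1 for state in children if state == HealthState.ERROR)
    if errors:
        percent = 100 * errors / len(children)
        # Errors within the tolerated percentage are downgraded to a warning.
        if percent > max_percent_unhealthy:
            return HealthState.ERROR
        return HealthState.WARNING
    return max(children)
```

For example, one failed replica out of five is 20% unhealthy: with a 20% tolerance the partition reports a warning, while with a 10% tolerance it reports an error and would fail a monitored upgrade's health check.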

Performance Considerations

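Partition count: More partitions increase parallelism but also overhead. A rule of thumb sometimes quoted is about 10x the number of nodes for stateful services; stateless services usually stay singleton and scale by instance count. The partition count cannot be changed after the service is created, so plan for growth.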

Replica count: 3 replicas is standard for production. 5 replicas for critical workloads.

Load metrics: Custom metrics (e.g., memory, CPU, queue length) can be defined to balance load across nodes.

Common Configuration Parameters

MinReplicaSetSize: Minimum number of replicas that must be available for write quorum. Default: 2.

TargetReplicaSetSize: Desired number of replicas. Default: 3.

PlacementConstraints: Restrict where replicas can be placed.

DefaultMoveCost: Cost of moving a replica (Zero, Low, Medium, High). Used by the Cluster Resource Manager when it balances the cluster.
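To show how a placement constraint narrows the set of eligible nodes, here is a minimal Python evaluator. It is an illustration only: it handles just `==`, `!=`, and `&&`, whereas the real expression language supports more operators, and the node names and the `HasSSD` property are hypothetical.

```python
def matches(constraint, node_properties):
    """Evaluate 'A == x && B != y' style constraints against one node."""
    for clause in constraint.split("&&"):
        clause = clause.strip()
        if "!=" in clause:
            name, value = clause.split("!=", 1)
            want_equal = False
        elif "==" in clause:
            name, value = clause.split("==", 1)
            want_equal = True
        else:
            raise ValueError("unsupported clause: " + clause)
        actual = node_properties.get(name.strip())
        expected = value.strip().strip("\"'")  # tolerate quoted values
        if (actual == expected) != want_equal:
            return False
    return True


def eligible_nodes(constraint, nodes):
    """Return the names of nodes whose properties satisfy the constraint."""
    return [name for name, props in nodes.items() if matches(constraint, props)]
```

Given nodes tagged with `NodeType` and `HasSSD`, the constraint `NodeType == "BackEnd" && HasSSD == true` leaves only back-end nodes with SSDs as candidates for replica placement.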

Integration with Azure Services

Azure Storage: For storing application packages and diagnostics data.

Azure SQL Database: Can be used for external state, but Reliable Collections are preferred for low latency.

Azure Cache for Redis: For caching.

Azure API Management: Can expose Service Fabric services as APIs.

Exam-Relevant Details

Service Fabric is a Platform as a Service (PaaS) offering, but you manage the cluster (if not using the managed option).

The managed cluster option reduces operational overhead by offloading cluster upgrades and patching to Azure.

Service Fabric supports Windows and Linux containers, but the runtime is more mature on Windows.

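Reliable Actor durability depends on the `StatePersistence` setting on the actor class (`Persisted`, `Volatile`, or `None`). Only `Persisted` state is written to disk, so do not assume actor state is durable unless the actor is configured for it.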

Reverse proxy (port 19081) allows client-to-service communication without a separate load balancer.
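As an illustration of how a client addresses a service through the reverse proxy, the Python helper below builds a request URL in the documented shape `http://host:19081/<App>/<Service>/<path>?PartitionKey=...&PartitionKind=...`. The helper name and the sample application and service names are hypothetical; optional query parameters such as listener name and timeout are omitted.

```python
from urllib.parse import quote, urlencode


def reverse_proxy_url(app, service, path="", partition_key=None,
                      partition_kind=None, host="localhost", port=19081,
                      scheme="http"):
    """Build a reverse proxy URL for a service, optionally targeting a partition."""
    url = f"{scheme}://{host}:{port}/{quote(app)}/{quote(service)}"
    if path:
        url += "/" + quote(path.lstrip("/"))
    if partition_key is not None:
        # Partitioned services need both the key and the partition kind.
        if partition_kind not in ("Int64Range", "Named"):
            raise ValueError("partition_kind must be 'Int64Range' or 'Named'")
        url += "?" + urlencode(
            {"PartitionKey": partition_key, "PartitionKind": partition_kind}
        )
    return url
```

A service running on any node can call `http://localhost:19081/...` and let the proxy resolve the current replica location through the Naming Service, so callers never hard-code node addresses or ports.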

Code Example: Defining a Stateful Service

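using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

class MyStatefulService : StatefulService
{
    // StatefulService has no parameterless constructor; the runtime supplies the context.
    public MyStatefulService(StatefulServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        // Get (or create on first run) a replicated dictionary by name.
        var myDictionary = await this.StateManager.GetOrAddAsync<IReliableDictionary<string, long>>("myDictionary");
        while (true)
        {
            // Stop promptly when the replica is closed or demoted.
            cancellationToken.ThrowIfCancellationRequested();
            using (var tx = this.StateManager.CreateTransaction())
            {
                var result = await myDictionary.TryGetValueAsync(tx, "Counter");
                await myDictionary.SetAsync(tx, "Counter", result.HasValue ? result.Value + 1 : 0);
                // The change is replicated to a quorum before the commit completes.
                await tx.CommitAsync();
            }
            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
    }
}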

Walk-Through

Create a Service Fabric Cluster

In the Azure portal, navigate to 'Service Fabric Cluster' and click 'Add'. You must specify a cluster name (unique within the Azure region), resource group, location, and node type(s). Each node type defines a VM SKU, number of nodes (default 5), and placement properties. You also configure security: select a certificate for node-to-node and client-to-node authentication. The cluster deployment also creates a virtual network, load balancer, and public IP. The deployment typically takes 20-30 minutes. For production, give the primary node type at least five nodes, and add further node types (e.g., front-end, back-end, data) where workloads need different VM sizes or isolation. The default upgrade mode is 'Automatic' for the cluster runtime.

Define Application and Service Manifests

Service Fabric applications are defined by XML manifests. The ApplicationManifest.xml describes the application version, service packages, and default parameters. The ServiceManifest.xml defines the service type, code package (executable or DLL), config package (settings), and data package (static data). You also specify endpoints (e.g., HTTP port 8080), placement constraints, and load metrics. For stateful services, you set `MinReplicaSetSize` and `TargetReplicaSetSize` on the service definition, for example in the default services section of ApplicationManifest.xml. Example, with a hypothetical service type name: `<StatefulService ServiceTypeName="MyStatefulServiceType" MinReplicaSetSize="2" TargetReplicaSetSize="3">`.

Package and Deploy the Application

Use the `Copy-ServiceFabricApplicationPackage` PowerShell cmdlet to upload the application package to the Image Store. Then use `Register-ServiceFabricApplicationType` to register the application type. Finally, use `New-ServiceFabricApplication` to create the application instance. During deployment, Service Fabric places the service replicas according to placement constraints and load balancing. You can monitor progress via Service Fabric Explorer, or with `Get-ServiceFabricApplication` and `Get-ServiceFabricApplicationHealth`. The deployment is complete when all replicas are healthy.

Perform a Rolling Application Upgrade

To upgrade an application, modify the manifests and increment the version, then copy and register the new package version. Use `Start-ServiceFabricApplicationUpgrade` with the `-Monitored` flag and a `-FailureAction` of `Rollback` or `Manual`. Service Fabric will upgrade one upgrade domain at a time. After each domain, it evaluates health against the health policy. If health evaluation is still failing when `HealthCheckRetryTimeoutSec` expires (default 600 seconds), or the domain exceeds `UpgradeDomainTimeoutSec`, the failure action runs; with `Rollback`, the entire upgrade is rolled back. Use `Get-ServiceFabricApplicationUpgrade` to monitor progress. The upgrade is complete when all domains are updated and health checks pass.

Monitor and Scale the Cluster

Use Service Fabric Explorer to view cluster health, node metrics, and replica placement. To scale out, add nodes to a node type by scaling its underlying virtual machine scale set via the portal or PowerShell (for example, `Add-AzServiceFabricNode`). Scale up by changing the VM SKU (requires node replacement). For autoscaling, configure Azure Autoscale rules on the scale set based on CPU or memory metrics. Service Fabric's Cluster Resource Manager automatically redistributes replicas when nodes are added or removed. You can also define custom load metrics (e.g., request queue length) to influence placement.
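How a load metric can steer placement is easiest to see in a toy model. The Python sketch below uses a greedy rule, which is an assumption made for illustration and not the Cluster Resource Manager's actual algorithm: each replica goes to the least-loaded node that does not already host a replica of the same partition.

```python
def place_replicas(node_load, partitions, replicas_per_partition, load_per_replica):
    """Greedily place replicas on the least-loaded nodes.

    node_load: dict of node name -> current value of one load metric.
    Returns (placement, final_load), where placement maps partition -> node list.
    """
    load = dict(node_load)  # work on a copy so the input is not modified
    placement = {}
    for partition in partitions:
        chosen = []
        for _ in range(replicas_per_partition):
            # Replicas of one partition must land on distinct nodes.
            candidates = [node for node in load if node not in chosen]
            if not candidates:
                raise ValueError("not enough nodes for the replica set")
            target = min(candidates, key=lambda node: (load[node], node))
            load[target] += load_per_replica
            chosen.append(target)
        placement[partition] = chosen
    return placement, load
```

Running it with four empty nodes and two partitions of three replicas each spreads six replicas as 2, 2, 1, 1. Without any reported load, every node looks equally attractive, which is why defining meaningful metrics matters.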

What This Looks Like on the Job

Enterprise Scenario 1: Financial Trading Platform

A large bank uses Service Fabric to host a real-time trading engine. The application requires stateful services to maintain order books and transaction histories with low latency (sub-millisecond writes). They deploy a 50-node cluster across three availability zones, with each stateful service configured with 5 replicas for high availability. The cluster uses custom load metrics (e.g., number of open orders) to balance load. They use monitored rolling upgrades to deploy new trading algorithms without downtime. Misconfiguration example: setting MinReplicaSetSize too low (e.g., 1) caused data loss during a two-node failure; they now set it to 3. Performance consideration: partition count is set to 100 for the order book service to allow parallel processing.

Enterprise Scenario 2: IoT Device Management

A manufacturing company uses Service Fabric to manage millions of IoT devices. They use Reliable Actors to represent each device, with actor state stored in Reliable Collections. The cluster has 20 nodes, each running both stateless web services (for API endpoints) and stateful actor services. They use the reverse proxy (port 19081) to route requests from the Azure Load Balancer to the correct actor. Common issue: actor garbage collection timeout (default 60 minutes) caused actors to be deactivated prematurely; they increased the timeout to 4 hours. Scale consideration: they partition actors by device ID range (uniform Int64 partitioning) to distribute load evenly.
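The idle-timeout behavior behind that actor issue can be modeled in a few lines of Python. This is a sketch with hypothetical names (`ActorHost`, `call`, `collect`), time expressed in minutes, and no state handling; it only shows when an actor is activated and when it becomes eligible for garbage collection.

```python
class ActorHost:
    """Toy model of virtual-actor activation and idle garbage collection."""

    def __init__(self, idle_timeout_minutes=60):
        self.idle_timeout = idle_timeout_minutes
        self.last_used = {}  # actor id -> minute of the most recent call

    def call(self, actor_id, now):
        """Record a call; return True if the actor had to be activated first."""
        activated = actor_id not in self.last_used
        self.last_used[actor_id] = now
        return activated

    def collect(self, now):
        """Deactivate actors idle for at least the timeout; return their ids."""
        idle = sorted(
            actor_id
            for actor_id, used in self.last_used.items()
            if now - used >= self.idle_timeout
        )
        for actor_id in idle:
            del self.last_used[actor_id]
        return idle
```

A device that reports every 90 minutes is deactivated and reactivated on every message with a 60-minute timeout, but stays resident with a 240-minute timeout, at the cost of more memory held by idle actors.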

Enterprise Scenario 3: E-Commerce Checkout Service

An online retailer migrated their monolithic checkout application to Service Fabric as a set of microservices: cart service (stateful), payment service (stateless), and inventory service (stateful). They use placement constraints to ensure cart replicas run on nodes with SSDs. The cluster is managed (Azure Service Fabric managed cluster) to reduce patching overhead. They use Azure Monitor to track health and performance. A misconfiguration: initially they did not define load metrics, causing all replicas to land on a single node; they added CPU and memory metrics to balance load. They also set up autoscaling rules to add nodes during Black Friday traffic spikes.

How AZ-305 Actually Tests This

AZ-305 tests Service Fabric under objective '4.1 Design compute solutions'. Expect 1-2 questions directly on Service Fabric, plus potential integration questions with other compute services. The exam focuses on:

Stateful vs stateless: When to use each. Stateful services are for low-latency access to local state; stateless for read-heavy or external state.

Reliable Services vs Reliable Actors: Actors are for independent units of work with single-threaded access; services for more complex state management.

Partitioning schemes: Singleton (default), uniform Int64, named. Exam asks which scheme to use for a given scenario (e.g., user ID range).

Upgrade modes: Monitored (with health checks) vs Unmonitored manual. Monitored is preferred for production.

Placement constraints: How to restrict services to specific node types.

Cluster types: Standalone (manual) vs managed. Managed reduces operational overhead.

Reverse proxy: When to use instead of Azure Load Balancer.

Common wrong answers:

1. Choosing AKS over Service Fabric for stateful workloads with Reliable Collections. Many candidates assume AKS is always better, but Service Fabric is specifically designed for stateful services with built-in replication.

2. Assuming Reliable Actors are durable by default. They are not; state is in memory unless persisted to Reliable Collections.

3. Thinking that partition count must equal number of nodes. Actually, partitions are independent; you can have more partitions than nodes.

4. Forgetting that monitored upgrades require health policies. Without health policies, the upgrade may not roll back on failure.

Edge cases the exam loves:

What happens if a node fails during an upgrade? The upgrade pauses until the node is restored or the replica is moved.

Can you mix Linux and Windows nodes in the same cluster? No, clusters are either Windows or Linux.

How does Service Fabric handle network partitions? It uses a lease mechanism (default 30 seconds) to detect failures; if a partition occurs, the side that holds a majority of the replicas remains functional.
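The network-partition case can be checked with simple majority arithmetic. The Python sketch below is a hypothetical helper, not a Service Fabric API: given a replica set and the groups of replicas that can still reach each other, it returns the group that keeps a write quorum, if any.

```python
def side_with_quorum(replica_set, reachable_groups):
    """Return the sorted members of the group holding a majority, or None."""
    quorum = len(replica_set) // 2 + 1
    for group in reachable_groups:
        members = set(group) & set(replica_set)
        if len(members) >= quorum:
            return sorted(members)
    return None  # no side can accept writes until connectivity returns
```

A three-replica set split 1/2 keeps running on the two-replica side, while an even 2/2 split of four replicas leaves neither side with a majority. This is one reason odd replica counts such as 3 and 5 are the usual choices.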

Numbers to memorize:

Default replica set size: 3

Default upgrade domain count: 5

Default fault domain count: 3

Default health check retry timeout: 600 seconds (application upgrades)

Default actor garbage collection timeout: 60 minutes

Reverse proxy port: 19081

To eliminate wrong answers: understand the mechanism. If a question mentions 'low-latency stateful' and 'automatic replication', Service Fabric is correct. If it mentions 'Kubernetes ecosystem' or 'portability', AKS is correct. If it mentions 'virtual actor pattern', Reliable Actors is the answer.

Key Takeaways

Service Fabric is a PaaS for building and managing microservices and containers, originally developed for Azure's core services.

Stateful services use Reliable Collections for replicated, persistent state; default replica set size is 3.

Reliable Actors are virtual actors with single-threaded access; state is in memory by default, not durable.

Clusters are deployed across fault domains (default 3) and upgrade domains (default 5) to ensure availability.

Monitored rolling upgrades evaluate health after each upgrade domain (default retry timeout 600 seconds for application upgrades) and can auto-rollback on failure.

Partitioning schemes: singleton (default), uniform Int64, named. Choose based on data distribution needs.

Placement constraints restrict services to specific node types (e.g., 'NodeType == "GPU"').

Reverse proxy on port 19081 allows client-to-service communication without an additional load balancer.

Managed clusters offload runtime upgrades and patching to Azure, reducing operational overhead.

Service Fabric is preferred over AKS for stateful workloads requiring low-latency local state and fine-grained placement control.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Service Fabric

Built-in stateful service support with Reliable Collections and automatic replication.

Proprietary programming models (Reliable Services, Reliable Actors).

Fine-grained placement constraints and custom load metrics.

Mature on Windows; supports Linux but less common.

Reverse proxy for internal service routing.

Azure Kubernetes Service (AKS)

Stateful services require external storage or custom operators.

Open-source ecosystem with standard Kubernetes APIs.

Placement via node selectors and taints/tolerations.

Primarily Linux; Windows containers supported but limited.

Ingress controllers (e.g., NGINX) for routing.

Watch Out for These

Mistake

Service Fabric is only for Windows containers.

Correct

Service Fabric supports both Windows and Linux containers, though the runtime is more mature on Windows. The same programming models (Reliable Services, Reliable Actors) work on both.

Mistake

Reliable Actors state is automatically durable.

Correct

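Actor durability is a configuration choice made with the `StatePersistence` setting on the actor class (`Persisted`, `Volatile`, or `None`). Only `Persisted` state is written to disk and survives a full restart; volatile state is kept in memory, and with no persistence the state is lost on actor deactivation or node failure.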

Mistake

Service Fabric requires a separate load balancer for every service.

Correct

Service Fabric includes a built-in reverse proxy (port 19081) that routes client requests to the appropriate service endpoint. An external load balancer is only needed for ingress traffic from the internet.

Mistake

You must manually upgrade the cluster runtime.

Correct

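With Azure Service Fabric managed clusters, Microsoft automatically upgrades the runtime. For clusters you manage yourself, automatic runtime upgrades are a configuration choice: Azure-hosted clusters expose an upgrade mode (Automatic or Manual), and standalone clusters use the `fabricClusterAutoupgradeEnabled` setting in the cluster configuration.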

Mistake

Partition count must be a power of two.

Correct

No, partition count can be any positive integer. However, uniform Int64 partitioning often uses power-of-two ranges for even distribution.


Frequently Asked Questions

What is the difference between Reliable Services and Reliable Actors in Service Fabric?

Reliable Services are the core programming model where you define a service with multiple replicas and manage state via Reliable Collections (dictionaries, queues). They are suitable for complex state management with low latency. Reliable Actors are a higher-level abstraction built on Reliable Services, where each actor is a single-threaded object with a unique ID. Actors are automatically garbage-collected after 60 minutes of inactivity. Use actors when you have many independent, fine-grained state entities (e.g., IoT devices). For the exam, remember that actors are not durable by default unless you persist their state.

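How does Service Fabric handle node failures?

Nodes detect failures through a lease mechanism. Each stateful partition keeps multiple replicas spread across fault domains, so if the node hosting a primary replica fails, the Failover Manager promotes a secondary to primary and a replacement replica is built on another node to restore the target replica set size. Stateless service instances are restarted on healthy nodes. Writes continue as long as a quorum of replicas remains available.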

Can I run Service Fabric on-premises?

Yes, Service Fabric can run on-premises using the standalone installer for Windows Server. You can create a cluster on your own VMs, but you must manage the cluster lifecycle yourself. Azure also offers Service Fabric managed clusters that run in Azure, reducing operational burden. For the exam, know that standalone clusters are available but not the focus; managed clusters are the recommended option for new projects.

What is the reverse proxy in Service Fabric and when should I use it?

The reverse proxy is a built-in component that runs on every node (port 19081) and forwards incoming HTTP requests to the correct service endpoint based on the URL path. It resolves service names via the Naming Service. Use it when you want to avoid configuring a separate load balancer for internal service-to-service communication. However, for external traffic, you still need an Azure Load Balancer. The exam may ask you to identify the port (19081) or scenario for using the reverse proxy.

How do I choose between Service Fabric and Azure Kubernetes Service (AKS)?

Choose Service Fabric when you need built-in stateful service support (Reliable Collections), fine-grained placement control, or are porting an existing Service Fabric application. Choose AKS when you need standard Kubernetes APIs, open-source ecosystem, portability, or are already invested in Kubernetes tooling. For the exam, if the scenario mentions 'low-latency stateful' or 'Reliable Actors', Service Fabric is likely the answer. If it mentions 'container orchestration' or 'Kubernetes', AKS is correct.

What is the default number of replicas for a stateful service?

The default `TargetReplicaSetSize` is 3, and the default `MinReplicaSetSize` is 2. This means Service Fabric will try to maintain 3 replicas, and at least 2 must be healthy to accept writes. For critical workloads, you can increase to 5 replicas. The exam may ask you to identify the default values or the minimum required for quorum.

How do I perform a rolling upgrade in Service Fabric?

Use the `Start-ServiceFabricApplicationUpgrade` cmdlet with the `-Monitored` flag and a `-FailureAction`. Service Fabric upgrades one upgrade domain at a time and evaluates health after each domain. If health evaluation is still failing when `HealthCheckRetryTimeoutSec` expires (default 600 seconds), or the domain exceeds `UpgradeDomainTimeoutSec`, the failure action runs, which is a rollback when `-FailureAction Rollback` is set. You can also use `-UnmonitoredManual` for manual control. The exam tests the difference between monitored and unmonitored upgrades.

