This chapter covers Azure Site Recovery (ASR) monitoring and reporting, a critical component of disaster recovery readiness. For the AZ-104 exam, this topic falls under Domain 5 (Monitor and Maintain Azure Resources) and Objective 5.2 (Monitor disaster recovery). Approximately 5-10% of exam questions touch on ASR monitoring, focusing on replication health, recovery plan testing, and job status. Mastering these concepts ensures you can validate DR preparedness and troubleshoot issues before a real disaster.
Azure Site Recovery (ASR) monitoring and reporting is like a building's fire safety system with sensors, alarms, and drill logs. The fire alarm panel (Azure Monitor) continuously monitors smoke detectors (replication health) and sprinkler flow (recovery progress). When a detector triggers (replication failure), the panel logs the event with timestamp and location (Azure Activity Log). The building manager runs quarterly fire drills (test failovers) and documents who evacuated, how long it took, and any blocked exits (recovery plan testing). The drill report (ASR report) shows pass/fail status, duration, and recommendations. If a real fire occurs (disaster), the panel activates sprinklers (failover) and records water flow, pressure, and zone coverage (job status). The manager reviews drill history to ensure the system works—just as an Azure admin reviews ASR health, test results, and recovery plans to ensure business continuity. Without monitoring, a faulty detector might go unnoticed until a real fire, analogous to replication lag causing data loss during failover.
What is Azure Site Recovery Monitoring and Reporting?
Azure Site Recovery (ASR) is a disaster recovery service that orchestrates replication, failover, and recovery of Azure VMs, on-premises VMs, and physical servers. Monitoring and reporting are essential to verify that replication is healthy, recovery plans are valid, and failover jobs complete successfully. Without monitoring, you might discover replication failures only during a disaster, leading to data loss or extended downtime.
Why It Exists
ASR monitoring exists to provide continuous visibility into:
Replication health (e.g., replication lag, synchronization status)
Recovery plan readiness (e.g., test failover success/failure)
Job execution (e.g., enable replication, failover, commit)
Infrastructure health (e.g., configuration server, process server, mobility service)
Azure provides several tools for this: Azure Monitor, Azure Activity Log, Azure Resource Health, ASR built-in monitoring, and ASR reports (for on-premises scenarios).
How It Works Internally
ASR monitoring relies on a combination of Azure platform services and ASR-specific agents. For Azure-to-Azure replication, the Azure Site Recovery extension (mobility service) on each VM sends heartbeat and replication status to the ASR service every 5 minutes. The service aggregates this into a replication health state (Healthy, Warning, Critical). For on-premises, the configuration server and process server report health metrics every 15 minutes.
Replication health is determined by:
Replication lag: The difference between the latest replicated data and the source. For Azure VMs, the target RPO is 15 minutes by default; if lag exceeds 30 minutes, health becomes Warning; if it exceeds 60 minutes, Critical.
Synchronization status: Initial replication (seed) progress, delta sync progress, or resync required.
Agent connectivity: The mobility service must communicate with the ASR service every 5 minutes. If the heartbeat is missing for 15 minutes, health becomes Warning; after 30 minutes, Critical.
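The way these inputs combine into a single state can be sketched in a few lines of Python. This is only an illustration of the thresholds quoted in this chapter, not how the ASR service is implemented; the function and constant names are hypothetical, and treating "worst input wins" as the combining rule is an assumption.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


# Thresholds as quoted in this chapter, in minutes.
LAG_WARNING_MIN = 30
LAG_CRITICAL_MIN = 60
HEARTBEAT_WARNING_MIN = 15
HEARTBEAT_CRITICAL_MIN = 30


def lag_health(lag_min: float) -> Health:
    """Grade replication lag: Warning above 30 min, Critical above 60 min."""
    if lag_min > LAG_CRITICAL_MIN:
        return Health.CRITICAL
    if lag_min > LAG_WARNING_MIN:
        return Health.WARNING
    return Health.HEALTHY


def heartbeat_health(minutes_since_heartbeat: float) -> Health:
    """Grade agent connectivity: Warning at 15 min of silence, Critical at 30."""
    if minutes_since_heartbeat >= HEARTBEAT_CRITICAL_MIN:
        return Health.CRITICAL
    if minutes_since_heartbeat >= HEARTBEAT_WARNING_MIN:
        return Health.WARNING
    return Health.HEALTHY


def replication_health(lag_min: float,
                       minutes_since_heartbeat: float,
                       initial_replication_failed: bool = False) -> Health:
    """Assumed combining rule: the worst individual signal decides the state."""
    if initial_replication_failed:
        return Health.CRITICAL
    return max(lag_health(lag_min), heartbeat_health(minutes_since_heartbeat))


print(replication_health(lag_min=45, minutes_since_heartbeat=4).name)
```

A VM with 45 minutes of lag and a fresh heartbeat is graded on the lag alone, while a VM with low lag but 30 minutes of agent silence is graded on the heartbeat.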
Key Components, Values, Defaults, and Timers
Replication health states:
Healthy: Replication is progressing normally.
Warning: Replication lag > 30 minutes or agent heartbeat missing for 15 minutes.
Critical: Replication lag > 60 minutes or agent heartbeat missing for 30 minutes, or initial replication failed.
Recovery point objective (RPO): Default 15 minutes for Azure-to-Azure. It can be lowered to as little as 30 seconds with premium disks and high-churn settings, at additional cost.
Test failover: Should be performed every 90 days as per Microsoft best practices. Test failover creates a VM in an isolated VNet; it does not affect production.
Job history: Retained for 30 days in Azure portal. Use Azure Monitor to retain longer.
Metrics:
Replication lag (seconds) – available via Azure Monitor for Azure VMs.
Data transfer rate (MB/s) – from process server to Azure.
RPO (seconds) – actual vs. configured.
Configuration and Verification Commands
To monitor ASR via Azure CLI:
# List replication protected items
az site-recovery protected-item list --vault-name <vault> --resource-group <rg> --fabric-name <fabric> --protection-container <container>
# Get replication health for a VM
az site-recovery protected-item show --name <item> --vault-name <vault> --resource-group <rg> --fabric-name <fabric> --protection-container <container> --query "properties.replicationHealth"
# List recovery plans
az site-recovery recovery-plan list --vault-name <vault> --resource-group <rg>
# Start test failover
az site-recovery recovery-plan test-failover --name <plan> --vault-name <vault> --resource-group <rg> --failover-direction PrimaryToRecovery --network-type VmNetworkAsInput --network-id <vnet-id>
Azure PowerShell:
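# Get the replicated item for a VM (a protection container object is required)
Get-AzRecoveryServicesAsrReplicationProtectedItem -FriendlyName <VM> -ProtectionContainer <container>
# List jobs that succeeded in the last 7 days
Get-AzRecoveryServicesAsrJob -State Succeeded -StartTime (Get-Date).AddDays(-7)
# Start a test failover for a recovery plan into an isolated VNet
Start-AzRecoveryServicesAsrTestFailoverJob -RecoveryPlan <plan> -Direction PrimaryToRecovery -AzureVMNetworkId <vnet-id>
Azure Monitor metrics for ASR: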
Navigate to the Recovery Services vault, select Metrics.
Add metric: "Replication Lag" (in seconds) for a specific VM or all.
Add alert rule: If replication lag > 1800 seconds (30 min) for 5 minutes, trigger email/SMS.
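To see what "greater than 1800 seconds for 5 minutes" means in practice, the rule can be modeled as a check over recent samples. The sketch below is a simplification and not Azure Monitor's actual evaluation logic: it assumes the alert fires only when every sample in the trailing 5-minute window is above the threshold, and the function name and sample format are hypothetical.

```python
from typing import Sequence, Tuple

THRESHOLD_SECONDS = 1800  # 30 minutes of replication lag
WINDOW_SECONDS = 300      # 5-minute evaluation window


def should_alert(samples: Sequence[Tuple[int, float]], now: int) -> bool:
    """samples holds (unix_time, lag_seconds) pairs; now is a unix time.

    Returns True only if the window contains data and every sample in it
    exceeds the threshold (assumed rule, for illustration only).
    """
    window = [lag for ts, lag in samples if now - WINDOW_SECONDS <= ts <= now]
    return bool(window) and all(lag > THRESHOLD_SECONDS for lag in window)


# Made-up samples, one per minute.
sustained = [(t, 2000.0) for t in range(0, 301, 60)]
spike = [(0, 900.0), (60, 950.0), (120, 2500.0), (180, 1000.0), (240, 980.0), (300, 990.0)]

print(should_alert(sustained, now=300))
print(should_alert(spike, now=300))
```

Requiring the condition to hold across the whole window keeps a single one-minute spike from paging anyone, which is the usual reason to pair a threshold with a duration.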
How It Interacts with Related Technologies
Azure Monitor: Collects metrics and logs from ASR. You can create dashboards and alerts for replication health, RPO, and job failures.
Azure Activity Log: Audits all ASR operations (enable replication, failover, commit). Retained for 90 days by default.
Azure Resource Health: Shows if the Recovery Services vault itself is healthy.
Azure Backup: Often used alongside ASR; backup provides archival, ASR provides replication. Monitoring is separate.
Azure Policy: Can enforce that all VMs have ASR enabled and test failovers are performed regularly.
Step-by-Step: Monitoring Replication Health
Identify unhealthy items: In the portal, go to Recovery Services vault > Replicated items. Filter by health state.
Check replication lag: For a VM, click on it, see "Latest recovery point" and "RPO". If RPO > 15 min, investigate.
Verify agent heartbeat: In the VM, check if Azure Site Recovery extension is responding. If not, restart the VM or reinstall extension.
Review job history: Look for failed jobs like "Enable replication" or "Update mobility service".
Test failover: Run a test failover to validate the recovery plan. Check that the test VM boots, apps start, and network connectivity works.
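The first step, finding unhealthy items, is easy to script once the list of replicated items has been exported as JSON. The sketch below assumes each item carries `properties.replicationHealth` (the path queried earlier in this chapter) and a `properties.friendlyName` field, which is an assumption here. The sample data is made up, and the state strings follow this chapter's labels; real service output may use different strings, so check it before relying on the filter.

```python
import json

# Lower number = more urgent. State names follow this chapter's labels.
SEVERITY = {"Critical": 0, "Warning": 1}


def unhealthy_items(cli_json: str):
    """Return (vm_name, health) pairs for items that are not Healthy, worst first."""
    flagged = []
    for item in json.loads(cli_json):
        props = item.get("properties", {})
        health = props.get("replicationHealth")
        if health in SEVERITY:
            # "friendlyName" is an assumed field; fall back to the item name.
            name = props.get("friendlyName") or item.get("name", "<unknown>")
            flagged.append((name, health))
    return sorted(flagged, key=lambda pair: (SEVERITY[pair[1]], pair[0]))


# Made-up sample shaped like a list of replicated items.
SAMPLE = json.dumps([
    {"name": "item-1", "properties": {"friendlyName": "web-01", "replicationHealth": "Healthy"}},
    {"name": "item-2", "properties": {"friendlyName": "sql-01", "replicationHealth": "Warning"}},
    {"name": "item-3", "properties": {"friendlyName": "app-01", "replicationHealth": "Critical"}},
])

for name, health in unhealthy_items(SAMPLE):
    print(f"{health:8} {name}")
```

Sorting Critical ahead of Warning gives a triage order that matches the remaining steps: work the worst items first, then review job history and schedule a test failover.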
Common Timers and Thresholds
Agent heartbeat: 5 min interval; Warning at 15 min, Critical at 30 min.
Replication lag: RPO default 15 min; Warning at 30 min, Critical at 60 min.
Initial replication: Should complete within 24 hours for typical workloads; if not, check bandwidth.
Job timeout: Most ASR jobs (failover, enable replication) have a timeout of 60 minutes. If exceeded, job fails.
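The 24-hour guideline for initial replication comes down to simple arithmetic on data size and available bandwidth. The sketch below is a rough estimate only: it assumes decimal units (1 GB = 8000 megabits) and ignores compression, ongoing churn, and throttling, and the function names are hypothetical.

```python
def estimate_hours(data_gb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to seed data_gb over a link of bandwidth_mbps.

    efficiency is the usable fraction of the link (0 < efficiency <= 1).
    """
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    megabits = data_gb * 8 * 1000  # decimal gigabytes to megabits
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


def min_bandwidth_mbps(data_gb: float, window_hours: float = 24.0) -> float:
    """Bandwidth needed to move data_gb within window_hours."""
    return data_gb * 8 * 1000 / (window_hours * 3600)


print(round(estimate_hours(500, 100), 1))    # 500 GB over 100 Mbps
print(round(estimate_hours(2000, 100), 1))   # 2 TB over the same link
print(round(min_bandwidth_mbps(2000), 1))    # Mbps needed to seed 2 TB in 24 h
```

If the estimate lands well above 24 hours, bandwidth is the first thing to check, as the guideline above suggests.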
Enable Replication Monitoring
After you configure ASR for Azure VMs, the Azure Site Recovery extension automatically starts sending health data every 5 minutes. In the Recovery Services vault, under 'Replicated items', you can see each VM's replication health (Healthy, Warning, Critical). The health is computed from replication lag, sync status, and agent connectivity. The default RPO is 15 minutes; health becomes Warning when lag exceeds 30 minutes and Critical when it exceeds 60 minutes. Likewise, a missing agent heartbeat moves health to Warning after 15 minutes and to Critical after 30 minutes. You can also enable Azure Monitor metrics to track replication lag in seconds and create alerts.
Review Recovery Plan Status
Recovery plans group VMs and define failover order. In the vault, go to 'Recovery Plans (Site Recovery)' to see each plan's status. The plan has a 'Test failover' status (Success, Failed, Not performed). Microsoft recommends test failover every 90 days. The plan also shows the number of VMs and their replication health. If any VM in the plan has Critical health, the plan is considered at risk. You can also see the last test failover time and duration. Use this to identify plans that need validation.
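The checks described here (test status, the 90-day recommendation, and any VM in Critical health) can be expressed as a small review function. This is a sketch over hand-entered data, not an ASR API; the `PlanStatus` type and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

TEST_INTERVAL_DAYS = 90  # recommended maximum gap between test failovers


@dataclass
class PlanStatus:
    name: str
    last_test: Optional[date]          # None means "Not performed"
    last_test_succeeded: bool = False
    vm_health: List[str] = field(default_factory=list)


def plan_findings(plan: PlanStatus, today: date) -> List[str]:
    """Return human-readable reasons why a plan needs validation."""
    findings = []
    if plan.last_test is None:
        findings.append("test failover never performed")
    else:
        age_days = (today - plan.last_test).days
        if age_days > TEST_INTERVAL_DAYS:
            findings.append(f"last test failover was {age_days} days ago")
        if not plan.last_test_succeeded:
            findings.append("last test failover failed")
    if "Critical" in plan.vm_health:
        findings.append("at risk: a VM has Critical replication health")
    return findings


# Made-up example plan.
plan = PlanStatus("web-tier", date(2024, 3, 1), True, ["Healthy", "Critical"])
for finding in plan_findings(plan, today=date(2024, 6, 30)):
    print(f"{plan.name}: {finding}")
```

An empty list means the plan was tested recently, the test passed, and none of its VMs is in Critical health.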
Run a Test Failover
To validate a recovery plan, run a test failover. In the portal, select the recovery plan, click 'Test Failover'. Choose a recovery point (latest, latest processed, or custom). Select an isolated Azure VNet for the test VMs. The job creates VMs in the target region, runs startup scripts, and then waits for you to clean up. Monitor the job in 'Site Recovery jobs'. The test VM is created but not connected to production. After validation, click 'Cleanup test failover' to delete the test VMs. The job logs show pass/fail, duration, and any errors.
Monitor Job History
All ASR operations (enable replication, failover, commit, test failover) are recorded as jobs. In the vault, go to 'Site Recovery jobs'. Each job has a status (Succeeded, InProgress, Failed, Cancelled), and you can filter by time, type, and status. Failed jobs show error details. For example, a failed 'Enable replication' job might indicate insufficient permissions or network connectivity issues. Job history is retained for 30 days. For longer retention, export it to a Log Analytics workspace (Azure Monitor Logs). Use PowerShell to retrieve job details: Get-AzRecoveryServicesAsrJob.
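A periodic review of job history usually answers three questions: how many jobs ended in each state, which ones failed, and which are close to aging out of the 30-day window. The sketch below works on hand-built records rather than real job objects; the `Job` type, its fields, and the 5-day safety margin are all assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

RETENTION_DAYS = 30  # job history retention quoted in this chapter


@dataclass
class Job:
    name: str
    state: str        # Succeeded, InProgress, Failed, or Cancelled
    started: datetime


def summarize(jobs: List[Job]) -> Counter:
    """Count jobs per state."""
    return Counter(job.state for job in jobs)


def failed(jobs: List[Job]) -> List[str]:
    """Names of jobs that need their error details reviewed."""
    return [job.name for job in jobs if job.state == "Failed"]


def expiring(jobs: List[Job], now: datetime, margin_days: int = 5) -> List[str]:
    """Names of jobs within margin_days of the retention limit."""
    cutoff = now - timedelta(days=RETENTION_DAYS - margin_days)
    return [job.name for job in jobs if job.started <= cutoff]


# Made-up job records.
now = datetime(2024, 6, 30)
jobs = [
    Job("Enable replication", "Failed", datetime(2024, 6, 29)),
    Job("Test failover", "Succeeded", datetime(2024, 6, 3)),
    Job("Commit", "Succeeded", datetime(2024, 6, 20)),
]

print(dict(summarize(jobs)))
print("failed:", failed(jobs))
print("export soon:", expiring(jobs, now))
```

Flagging jobs a few days before the limit leaves time to export them before they disappear from the portal.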
Set Up Alerts and Dashboards
In Azure Monitor, create metric alerts for replication lag. For example, set a condition: 'Replication Lag > 1800 seconds' for 5 minutes, then trigger an action group (email, SMS, webhook). Also, create alerts for job failures using Azure Activity Log: when 'Site Recovery job failed' occurs. Build a dashboard with charts for replication lag, data transfer rate, and RPO. Use the 'Metrics' blade in the vault to add charts. For on-premises, use the ASR report (from configuration server) to get a CSV of replication health. This report includes VM name, health, RPO, and last sync time.
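Because the on-premises report is a CSV file, it can be screened with a short script instead of being read by hand. The sketch below is illustrative only: the column headers ("VM Name", "Health", "RPO (min)", "Last Sync") are assumptions based on the fields listed above, and the sample rows are made up, so adjust the names to match a real export.

```python
import csv
import io

RPO_WARNING_MIN = 30  # Warning threshold used in this chapter


def rows_needing_attention(report_csv: str):
    """Return (vm, health, rpo_minutes) for rows that are unhealthy or over the RPO threshold."""
    flagged = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        rpo_min = float(row["RPO (min)"])
        if row["Health"] != "Healthy" or rpo_min > RPO_WARNING_MIN:
            flagged.append((row["VM Name"], row["Health"], rpo_min))
    return flagged


# Made-up report content with assumed column headers.
SAMPLE_REPORT = """VM Name,Health,RPO (min),Last Sync
web-01,Healthy,12,2024-05-01T10:00:00Z
sql-01,Warning,42,2024-05-01T09:20:00Z
app-01,Healthy,35,2024-05-01T09:30:00Z
"""

for vm, health, rpo in rows_needing_attention(SAMPLE_REPORT):
    print(f"{vm}: health={health}, RPO={rpo:.0f} min")
```

Checking the RPO column as well as the health column catches rows where the reported state has not yet caught up with the lag.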
Enterprise Scenario 1: Financial Services Company with Hybrid DR
A financial company runs 200 VMs on-premises (VMware) and replicates them to Azure using ASR. They perform a test failover every quarter to meet compliance. The DR team monitors replication health daily via the Azure portal and Azure Monitor dashboards. They set up alerts: if any VM's replication lag exceeds 30 minutes, an email goes to the admin. One day, a VM shows Critical health because the mobility service crashed. The admin restarts the service and resyncs. Without monitoring, this would have been missed, risking data loss. The company also uses ASR reports to audit test failover results for regulators. The reports show pass/fail, RPO achieved, and recovery time objective (RTO) during tests. They discovered that one recovery plan had a misconfigured network mapping, causing test VMs to fail. They fixed it before the next disaster.
Enterprise Scenario 2: E-commerce Platform with Azure-to-Azure DR
An e-commerce company replicates its production VMs in West US to East US. They use Azure Monitor to track replication lag for each VM. During Black Friday, a VM with high churn (database server) exceeded RPO of 15 minutes. An alert triggered, and the team increased the disk size and changed to premium SSD to reduce lag. They also schedule test failovers monthly. In one test, the SQL Server VM failed to start because the DNS configuration was missing. The admin updated the recovery plan to include a DNS script. The monitoring reports showed that the test failover took 45 minutes, exceeding their RTO of 30 minutes. They optimized the startup order and reduced it to 25 minutes. Without monitoring, they would have faced extended downtime during a real failover.
Common Misconfigurations
Ignoring Warning state: Many admins ignore Warning health, but it often precedes Critical. Investigate Warning states as soon as they appear.
Not performing test failovers: Some skip testing, assuming replication is enough. But test failovers reveal network, DNS, and app dependencies.
Not setting alerts: Relying on manual checks leads to missed failures. Always configure alerts for replication lag and job failures.
Forgetting to clean up test failovers: Test VMs left running incur costs and may conflict with production. Always run cleanup.
What AZ-104 Tests
Objective 5.2: Monitor disaster recovery. The exam tests your ability to:
Interpret replication health status (Healthy, Warning, Critical) and know the thresholds (RPO > 30 min = Warning, >60 min = Critical).
Identify correct tools for monitoring: Azure Monitor for metrics and alerts, Activity Log for audits, ASR reports for on-premises.
Understand test failover requirements: Must be performed in isolated network; does not affect production; cleanup is required.
Know default RPO: 15 minutes for Azure-to-Azure, can be lowered to 30 seconds with premium disk.
Recognize that job history is retained for 30 days, but can be exported to Log Analytics.
Common Wrong Answers and Traps
"Replication health is determined by the last successful sync time" – Wrong. It's based on lag, sync status, and agent heartbeat. A VM with old sync but low lag may be Healthy.
"Test failover creates a VM that is connected to production network" – Wrong. It creates in an isolated VNet; you specify the test network.
"You can monitor replication lag only via the portal" – Wrong. You can use Azure Monitor metrics and CLI/PowerShell.
"ASR reports are available for Azure-to-Azure replication" – Wrong. Reports are for on-premises only; for Azure-to-Azure, use Azure Monitor.
"The default RPO is 30 seconds" – Wrong. Default is 15 minutes; 30 seconds is a configurable option with premium disks.
Specific Numbers and Terms
RPO thresholds: 15 min default, 30 min Warning, 60 min Critical.
Agent heartbeat: 5 min interval, 15 min Warning, 30 min Critical.
Job retention: 30 days.
Test failover recommendation: every 90 days.
Recovery plan: groups VMs, defines failover order, and can include scripts.
Edge Cases
If a VM is replicated but the target region is unavailable, replication may show Warning/Critical due to inability to write recovery points.
If you change the VM size or disk, replication may need to be re-enabled.
For on-premises, if the configuration server fails, all VMs show Critical.
Eliminating Wrong Answers
Understand the underlying mechanism: ASR monitors replication at the disk level via the mobility service. If a question asks about monitoring tool for on-premises, choose ASR report (CSV export) over Azure Monitor (which works for Azure VMs). If asked about RPO, remember default 15 min. If asked about test failover, remember it's non-disruptive and requires cleanup.
Replication health states: Healthy, Warning (lag >30 min or agent heartbeat missing 15 min), Critical (lag >60 min or heartbeat missing 30 min).
Default RPO for Azure-to-Azure replication is 15 minutes; can be reduced to 30 seconds with premium disks.
Test failover should be performed every 90 days; it creates VMs in an isolated network and requires cleanup.
Job history is retained for 30 days; export to Log Analytics for longer retention.
Use Azure Monitor metrics for replication lag (in seconds) and create alerts for thresholds.
ASR reports (CSV) are only for on-premises replication; for Azure-to-Azure, use Azure Monitor.
Recovery plans group VMs and define failover order; test failover validates the plan.
Agent heartbeat interval is 5 minutes; missing heartbeat for 15 minutes triggers Warning, 30 minutes Critical.
These come up on the exam all the time. Here's how to tell them apart.
Azure Monitor for ASR
Used for Azure-to-Azure replication
Provides real-time metrics like replication lag in seconds
Can create alerts and dashboards
Metrics retained for 93 days (default) or longer with Log Analytics
Accessible via portal, CLI, PowerShell, REST API
ASR Reports (On-premises)
Used for on-premises to Azure replication (VMware/Hyper-V)
Provides a CSV report of replication health, RPO, last sync
No real-time alerts; report generated on demand or scheduled
Data retained as per report schedule (e.g., 30 days)
Downloaded from the configuration server or via portal
Mistake
Replication health is based only on the last sync time.
Correct
Replication health considers replication lag, synchronization status, and agent connectivity. A VM with a recent sync but high lag can be Warning.
Mistake
Test failover creates a VM that is connected to the production network.
Correct
Test failover creates VMs in an isolated Azure VNet you specify. They do not affect production traffic.
Mistake
You can monitor replication lag only through the Azure portal.
Correct
Replication lag is available as a metric in Azure Monitor, and can be queried via CLI, PowerShell, or exported to Log Analytics.
Mistake
ASR reports are available for Azure-to-Azure replication.
Correct
ASR reports (CSV export) are only for on-premises to Azure replication. For Azure-to-Azure, use Azure Monitor and the portal.
Mistake
The default RPO for Azure Site Recovery is 30 seconds.
Correct
The default RPO is 15 minutes. You can configure a lower RPO (e.g., 30 seconds) but it requires premium managed disks and may increase costs.
In the Azure portal, go to your Recovery Services vault, select 'Replicated items'. Each VM shows its health (Healthy, Warning, Critical). You can also use Azure Monitor metrics: add metric 'Replication Lag' for the vault. For on-premises, use the ASR report (CSV) from the configuration server.
The default RPO (recovery point objective) for Azure-to-Azure replication is 15 minutes. This means that in the event of a failover, you may lose up to 15 minutes of data. You can configure a lower RPO (e.g., 30 seconds) by using premium managed disks and enabling high-churn settings, but this incurs additional costs.
Microsoft recommends performing a test failover every 90 days at minimum. This validates that your recovery plan works, VMs boot correctly, and applications are accessible. Test failovers should be done in an isolated network to avoid affecting production.
Warning health indicates that replication is not optimal. Possible causes: replication lag exceeds 30 minutes, or the mobility service heartbeat has been missing for 15 minutes. You should investigate to prevent it from becoming Critical. Check replication lag in the VM's settings and verify agent connectivity.
Azure Monitor provides limited metrics for on-premises replication (e.g., overall vault health). For detailed per-VM replication health and RPO, you must use the ASR report. This report is generated by the configuration server and can be downloaded from the portal or directly from the configuration server.
ASR job history is retained in the Azure portal for 30 days. After that, jobs are automatically deleted. To retain job data longer, you can export activity logs to a Log Analytics workspace or storage account.
First, check the job error details in 'Site Recovery jobs'. Common issues: wrong network mapping, VM startup scripts failing, or missing dependencies. Correct the recovery plan (e.g., update network mapping, fix scripts) and run the test again. Ensure the test VNet has proper DNS and connectivity.