
Azure Batch for HPC Workloads

This chapter covers Azure Batch, a managed service for running large-scale parallel and high-performance computing (HPC) workloads in the cloud. For the AZ-104 exam, this topic falls under Domain 3.1 (Compute) and typically appears in 5-10% of questions, focusing on understanding Batch concepts, pool configuration, job scheduling, and integration with other Azure services. Mastering Batch is essential for scenarios like rendering, financial risk modeling, and big data analysis where you need to run many tasks in parallel without managing the underlying infrastructure.

25 min read
Intermediate
Updated May 31, 2026

Batch Processing as a Factory Assembly Line

Think of Azure Batch as an automated factory assembly line for data processing. Instead of hiring one worker to build a product from start to finish, you break the work into small tasks (like assembling individual parts) and assign them to many temporary workers (compute nodes) who work in parallel. A factory manager (the Batch service) oversees the entire operation: it receives the job order (job), breaks it into task instructions, hires workers from a pool (pool of VMs), monitors their progress, handles any failures (e.g., a worker drops a part – the task is retried on another worker), and finally delivers the completed product. The key mechanic is that the factory manager doesn't own the workers permanently; it rents them from a labor agency (Azure compute resources) and can scale the workforce up or down based on the workload. This is fundamentally different from a dedicated team (always-on VMs) where you pay for idle time. In Azure Batch, you define the pool specifications (VM size, number of nodes, scaling formula) and the job/workflow, and the service orchestrates the parallel execution, handling scheduling, retries, and resource lifecycle automatically.

How It Actually Works

What is Azure Batch and Why Does It Exist?

Azure Batch is a platform service that schedules and orchestrates compute-intensive work across a managed pool of virtual machines. It is designed for HPC and batch computing workloads where you have a large number of similar, independent tasks that can be executed in parallel. Instead of manually provisioning VMs, installing software, and managing job queues, you define the compute environment and the work to be done, and Batch handles the rest.

Key use cases include:
- Rendering: rendering frames of a movie or animation in parallel.
- Financial risk modeling: running Monte Carlo simulations across thousands of scenarios.
- Image processing: applying filters or transformations to a large set of images.
- Gene sequencing: aligning DNA sequences in parallel.
- Any embarrassingly parallel workload: tasks that don't require inter-task communication.

How Azure Batch Works Internally

The Batch architecture consists of four primary components: Batch account, pool, job, and task. The service runs on top of Azure compute resources (VMs), and you normally do not need to manage those VMs directly.

1. Batch Account: The top-level resource that contains pools, jobs, and tasks. You create a Batch account in a specific Azure region, and all pools, jobs, and tasks belong to it. The account has an associated endpoint URL and supports authentication with account keys or Azure AD.

2. Pool: A group of compute nodes (VMs) that execute tasks. You define the pool configuration: node size (e.g., Standard_D2_v3), number of nodes, scaling policy, operating system (Windows or Linux), and application packages (software to install on each node). Nodes can be dedicated (reserved for your pool) or low-priority (surplus capacity that is cheaper but can be preempted). The pool can be scaled manually or automatically using a scaling formula based on metrics such as the number of active tasks.

3. Job: A collection of tasks that share a common execution context. A job runs on a specific pool. You can set a scheduling priority and job-level constraints such as maximum wall clock time and maximum task retry count. A job schedule can create jobs on a recurring, time-based schedule.

4. Task: The smallest unit of work. Each task runs on a single node in the pool (multi-instance tasks for MPI are the exception). A task is defined by a command line, which can invoke a script or a program. Tasks can have dependencies (task B runs only after task A completes) and can produce output files that are uploaded to Azure Storage.

Execution Flow:
1. User creates a Batch account.
2. User creates a pool with the desired VM configuration.
3. User creates a job targeting that pool.
4. User adds tasks to the job. Each task specifies a command line (e.g., myapp.exe --input file1.txt --output result1.txt).
5. The Batch service schedules tasks onto available nodes. The scheduler considers task dependencies, node availability, and the free task slots on each node.
6. Each node runs the Batch node agent, which receives tasks from the service, executes them, and reports status.
7. Upon completion, task output can be retrieved from the node or uploaded to Azure Storage.
8. After all tasks complete, the job is finished. The pool can be scaled down or deleted.
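Steps 5 and 6 of the flow can be pictured with a toy scheduler. The sketch below is not the Batch service's actual algorithm, which this chapter does not describe. It assumes a simple greedy policy in which each task goes to whichever slot frees up first, and the names `schedule`, `durations`, and `slots_per_node` are hypothetical.

```python
import heapq

def schedule(durations, nodes, slots_per_node=1):
    """Greedy toy scheduler: give each task to the slot that frees up first.

    durations: list of task run times (any consistent unit).
    Returns (makespan, assignment), where assignment maps task index -> node index.
    Illustrates parallel fan-out only; it is not how Batch schedules internally.
    """
    # Each heap entry is (time the slot becomes free, node index, slot index).
    free_at = [(0, n, s) for n in range(nodes) for s in range(slots_per_node)]
    heapq.heapify(free_at)
    assignment = {}
    makespan = 0
    for task_id, duration in enumerate(durations):
        start, node, slot = heapq.heappop(free_at)
        finish = start + duration
        assignment[task_id] = node
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, node, slot))
    return makespan, assignment

makespan, assignment = schedule([5] * 8, nodes=2, slots_per_node=2)
print(makespan)  # 8 tasks of 5 units over 4 slots -> 10
```

Running the same eight tasks on a single one-slot node takes 40 units, which is the serial cost that a pool is meant to avoid.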

Key Components, Values, Defaults, and Timers

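Node allocation timeout: Allocation is bounded by the pool's resize timeout, which defaults to 15 minutes. If the requested nodes are not allocated within this time, the resize stops and the pool records a resize error.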

Task retention period: After a task completes, its working directory on the node is retained for 7 days by default, unless the node is removed or the job is deleted first. You can configure this per task.

Max tasks per node: Default is 1, but you can increase this (e.g., 4 tasks per node) for better utilization.

Pool resize timeout: When scaling a pool, the resize operation has a timeout (default 15 minutes). If nodes fail to add/remove in time, the operation may fail.

Low-priority VMs: Can be preempted at any time. Batch automatically requeues tasks from preempted nodes.

Application packages: Versions of software that are automatically installed on pool nodes. You can specify version and install order.

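Scaling formula: An expression (using the Batch auto-scale formula language) that adjusts the pool size based on metrics like $PendingTasks and $RunningTasks. Service-defined metrics are read through sampling methods such as GetSample(), so a working formula looks like: $TargetDedicatedNodes = min(10, max(1, avg($PendingTasks.GetSample(TimeInterval_Minute * 5))));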
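The clamping logic in the example formula, min(10, max(1, ...)), can be checked outside Azure. This is a plain Python sketch of that arithmetic only, not of the Batch formula language; the function name and its `min_nodes`/`max_nodes` parameters are assumptions made for illustration.

```python
def target_dedicated_nodes(pending_tasks, min_nodes=1, max_nodes=10):
    """Python equivalent of min(max_nodes, max(min_nodes, pending_tasks)).

    Keeps at least min_nodes running and never requests more than max_nodes.
    """
    return min(max_nodes, max(min_nodes, pending_tasks))

for pending in (0, 4, 250):
    print(pending, "->", target_dedicated_nodes(pending))
# 0 -> 1
# 4 -> 4
# 250 -> 10
```

The floor of 1 keeps one node warm when the queue is empty, and the ceiling of 10 caps spend no matter how deep the queue gets.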

Configuration and Verification Commands

Using Azure CLI:

# Create a Batch account
az batch account create --name mybatchaccount --resource-group myrg --location eastus

# Create a pool (using JSON configuration)
az batch pool create --account-name mybatchaccount --account-endpoint https://mybatchaccount.eastus.batch.azure.com --json-file pool-config.json

Example pool-config.json:

{
  "id": "mypool",
  "vmSize": "Standard_D2_v3",
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "microsoftwindowsserver",
      "offer": "windowsserver",
      "sku": "2022-datacenter",
      "version": "latest"
    },
    "nodeAgentSkuId": "batch.node.windows amd64"
  },
  "targetDedicatedNodes": 3,
  "targetLowPriorityNodes": 0
}

# Create a job
az batch job create --account-name mybatchaccount --account-endpoint https://mybatchaccount.eastus.batch.azure.com --id myjob --pool-id mypool

# Add a task
az batch task create --account-name mybatchaccount --account-endpoint https://mybatchaccount.eastus.batch.azure.com --job-id myjob --task-id task1 --command-line "echo hello"

# List tasks
az batch task list --account-name mybatchaccount --account-endpoint https://mybatchaccount.eastus.batch.azure.com --job-id myjob

# Delete a pool
az batch pool delete --account-name mybatchaccount --account-endpoint https://mybatchaccount.eastus.batch.azure.com --pool-id mypool

To verify pool status:

az batch pool show --account-name mybatchaccount --account-endpoint https://mybatchaccount.eastus.batch.azure.com --pool-id mypool --query "allocationState"
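If you create several pools, it can help to generate pool-config.json rather than edit it by hand. The following is a minimal sketch that reproduces the example file's fields with Python's standard `json` module. The helper name `build_pool_config` is hypothetical, and the image and node agent values are copied from the example above rather than looked up from the service.

```python
import json

def build_pool_config(pool_id, vm_size, dedicated, low_priority=0):
    """Return a dict with the same layout as the example pool-config.json."""
    if dedicated < 0 or low_priority < 0:
        raise ValueError("node counts must be non-negative")
    return {
        "id": pool_id,
        "vmSize": vm_size,
        "virtualMachineConfiguration": {
            # Image and agent SKU copied from the example file above.
            "imageReference": {
                "publisher": "microsoftwindowsserver",
                "offer": "windowsserver",
                "sku": "2022-datacenter",
                "version": "latest",
            },
            "nodeAgentSkuId": "batch.node.windows amd64",
        },
        "targetDedicatedNodes": dedicated,
        "targetLowPriorityNodes": low_priority,
    }

config = build_pool_config("mypool", "Standard_D2_v3", dedicated=3)
print(json.dumps(config, indent=2))
```

Writing the result to disk with `json.dump` gives a file that can be passed to `az batch pool create --json-file`.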

How Azure Batch Interacts with Related Technologies

Azure Storage: Batch commonly uses Azure Blob or File storage for input data and output results. Tasks download inputs through resource files that reference storage URLs (https://), and pools can mount Azure Files shares on their nodes.

Azure Virtual Network: Pools can be deployed into a VNet to access on-premises resources securely. This is configured in the pool's network configuration.

Azure Active Directory (now Microsoft Entra ID): Batch supports Azure AD authentication as an alternative to account keys. You can also use managed identities so that Batch can access other Azure resources without stored credentials.

Azure Monitor: Batch emits metrics (e.g., running tasks, node count) to Azure Monitor. You can set alerts.

Azure Data Factory: ADF can trigger Batch jobs as part of a pipeline.

Azure Functions: Functions can create tasks in Batch in response to events (e.g., blob upload).

Important Exam Details

Batch is free; you only pay for the underlying compute and storage.

The maximum number of nodes per pool is bounded by the Batch account's core quota. The default is small (20 dedicated cores is the figure commonly quoted, though it varies by subscription type) and can be increased via a support request.

Low-priority VMs can reduce costs by up to 80% but may be reclaimed.

Task dependencies can be specified using task IDs (must be in same job).

Batch supports both Windows and Linux nodes.

Auto-scaling formulas are evaluated every 15 minutes by default; the interval can be set as low as 5 minutes and as high as 168 hours.

You can use job preparation and job release tasks to run setup/cleanup on each node.

Batch supports multi-instance tasks for MPI workloads; the pool must have inter-node communication enabled.
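The task-dependency rule in the list above (dependencies are expressed by task ID within one job) implies an execution order. This is a hedged sketch using Python's standard `graphlib` module (Python 3.9+) to show which tasks could run in parallel. The task IDs are hypothetical, and the code models ordering only; it does not call Batch.

```python
from graphlib import TopologicalSorter

# Hypothetical task IDs: each key depends on the task IDs in its set.
depends_on = {
    "merge": {"render-1", "render-2"},
    "render-1": {"prepare"},
    "render-2": {"prepare"},
    "prepare": set(),
}

def runnable_waves(graph):
    """Group tasks into waves; every task in a wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

print(runnable_waves(depends_on))
# [['prepare'], ['render-1', 'render-2'], ['merge']]
```

Only the middle wave benefits from extra nodes, which is why heavily chained jobs gain less from a large pool than embarrassingly parallel ones.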

Walk-Through

1. Create a Batch Account

First, you create a Batch account in your Azure subscription. This account is the container for all Batch resources. You must specify a region and optionally associate it with an Azure Storage account (for task input/output). The account has an endpoint URL (e.g., https://mybatchaccount.eastus.batch.azure.com) and authentication keys. You can also enable Azure AD authentication. The account's quota limits the number of cores you can allocate across all pools. Creation is done via Azure portal, CLI, or PowerShell.

2. Create a Pool of Compute Nodes

Define a pool that specifies the VM size, OS image, number of dedicated and low-priority nodes, and scaling policy. The pool's configuration includes the node agent SKU (e.g., batch.node.windows amd64). You can also attach a virtual network and specify application packages to install. The pool can be created with a fixed size or with an auto-scaling formula. Once created, the Batch service starts provisioning the nodes. The allocation state goes from 'resizing' to 'steady' when all nodes are ready.

3. Create a Job Targeting the Pool

A job is a logical grouping of tasks that run on a specific pool. You create a job with a unique ID and specify the pool ID. You can set job constraints such as maximum wall clock time (e.g., 7 days) and maximum task retry count. Jobs can also be created automatically by a job schedule at specified intervals. A new job is 'active'; it passes through 'terminating' to 'completed' when you terminate it or, if the job is configured to terminate when all tasks complete, when the last task finishes. The 'disabling' and 'disabled' states occur only when you explicitly disable a job.

4. Add Tasks to the Job

Each task is a unit of work. You specify a command line (e.g., `python myscript.py --input input.txt`), resource files (files to download from Azure Storage), environment variables, and constraints (e.g., max task retry count, retention time). Tasks can have dependencies on other tasks in the same job. When added, tasks enter the 'active' state. The Batch scheduler assigns them to nodes based on availability and resource requirements. Tasks run to completion or failure.

5. Monitor and Retrieve Results

As tasks execute, you can monitor their status via the Azure portal, CLI, or Batch API. Completed tasks have an exit code; failed tasks can be retried. Output files can be uploaded to Azure Storage automatically using output file specifications in the task. After all tasks complete, the job state becomes 'completed' if the job is set to terminate when all tasks complete; otherwise it stays 'active' until you terminate it. You can then delete the job and scale down or delete the pool to stop incurring costs.

What This Looks Like on the Job

Scenario 1: Rendering Farm for Animation Studio

A small animation studio needs to render a 10-minute short film. Each frame is an independent task. They use Azure Batch with a pool of Standard_NV6 VMs (GPU-enabled) running Windows. They configure the pool with 20 dedicated nodes and auto-scaling based on pending tasks. The job contains 14,400 tasks (10 min × 60 s × 24 fps). Each task runs Blender to render a single frame. Input files (scene files) are stored in Azure Blob Storage; output frames are uploaded back to Blob. The entire job completes in 2 hours. Without Batch, they would need to buy 20 powerful workstations. Cost: ~$5 per hour for the pool (using low-priority VMs for 80% savings). Misconfiguration: If they forget to configure output file upload, frames are lost when nodes are deallocated. They learned to always define output file specifications on each task; resource files only cover inputs.
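The figures in this scenario can be sanity-checked with a few lines of arithmetic. The sketch assumes one task at a time per node and an even spread of frames across the 20 nodes; neither assumption is stated in the scenario.

```python
minutes, fps, nodes, job_hours = 10, 24, 20, 2

frames = minutes * 60 * fps                  # one task per frame
frames_per_node = frames // nodes            # even split across the pool
seconds_per_frame = job_hours * 3600 / frames_per_node

print(frames, frames_per_node, seconds_per_frame)  # 14400 720 10.0
```

Under those assumptions the 2-hour completion time implies an average of about 10 seconds of render time per frame.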

Scenario 2: Financial Risk Simulation

A bank runs Monte Carlo simulations for value-at-risk calculations. They have 10,000 scenarios, each requiring 5 minutes of CPU time on a single core. They use a Batch pool of Standard_D2_v3 VMs (2 cores each) with Linux OS. They set max tasks per node to 2 (one per core). The pool auto-scales between 10 and 100 nodes based on pending tasks. At the 100-node ceiling that gives 200 concurrent tasks, so the job needs about 250 minutes (10,000 × 5 ÷ 200), a little over 4 hours. They use job preparation tasks to copy the simulation binary to each node. They also use low-priority VMs to cut costs, and they design tasks so that work requeued after a preemption can safely run again. Common pitfall: They initially used a fixed pool of 100 nodes and paid for idle time while the last tasks were finishing. Auto-scaling solved this.
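A back-of-envelope estimate of this scenario's runtime follows from its own numbers. The sketch below is an idealized model: it assumes every slot is busy, tasks run in full waves, and there is no startup, scheduling, or preemption overhead. The function name is hypothetical.

```python
import math

def wall_clock_minutes(tasks, minutes_per_task, nodes, slots_per_node):
    """Ideal runtime: tasks run in full waves across every slot in the pool."""
    slots = nodes * slots_per_node
    waves = math.ceil(tasks / slots)
    return waves * minutes_per_task

print(wall_clock_minutes(10_000, 5, nodes=100, slots_per_node=2))  # 250
print(wall_clock_minutes(10_000, 5, nodes=10, slots_per_node=2))   # 2500
```

With 10,000 five-minute tasks, 100 two-slot nodes give 50 waves, or roughly 250 minutes; at the 10-node floor the same work would take ten times as long.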

Scenario 3: Image Processing Pipeline

An e-commerce company processes millions of product images daily. When a new image is uploaded to Blob Storage, an Azure Function triggers a Batch task to resize, compress, and watermark the image. They use a pool of Standard_F2s_v2 VMs (compute optimized) with auto-scaling from 0 to 50 nodes. The scaling formula uses the $PendingTasks metric. Tasks are short (2-3 seconds each). They set the task retention period to 1 hour. They use Azure Files for a shared output directory. Challenge: The pool was slow to scale up from 0 because node allocation takes 5-10 minutes. They solved it by keeping a minimum of 2 idle nodes always running (warm pool).

How AZ-104 Actually Tests This

What AZ-104 Tests on Azure Batch

The exam focuses on understanding the core components (Batch account, pool, job, task) and their relationships. You need to know when to use Batch vs. other compute options (VM Scale Sets, Azure Functions, Azure Container Instances). Specific objective codes: 3.1 (Configure compute resources).

Common Wrong Answers and Why Candidates Choose Them

1. "Azure Batch is used for always-on web applications." Wrong. Batch is for batch/parallel jobs, not continuous serving. Candidates confuse Batch with App Service or VM Scale Sets.

2. "You must manually manage the VMs in a pool." Wrong. Batch automates VM lifecycle. Candidates think Batch is just a scheduler on top of manually created VMs.

3. "Batch tasks can run on any Azure resource (e.g., App Service)." Wrong. Tasks run only on Batch pool nodes. Candidates assume tasks run on the Batch service itself.

4. "Low-priority VMs are always available." Wrong. They can be preempted. Candidates think they are just cheaper but with the same reliability.

Specific Numbers and Terms That Appear on the Exam

Batch account: Must be created in a region.

Pool: Can have dedicated and low-priority nodes.

Job: Can have constraints like max wall clock time.

Task: Can have dependencies, resource files, environment settings.

Auto-scaling formula: Uses $TargetDedicatedNodes and $PendingTasks.

Node agent SKU: e.g., batch.node.windows amd64.

Application packages: Versioned software deployed to nodes.

Job schedule: Triggers jobs at intervals.

Edge Cases and Exceptions

If a pool is not deployed into a VNet, its tasks cannot reach on-premises resources over private connectivity.

If you use low-priority nodes, you must design for preemption (tasks should be idempotent).

The maximum number of tasks per node can be set when creating the pool; default is 1.

Task output files must be explicitly uploaded; they are not automatically saved.

Batch account quotas can be increased via support request.

How to Eliminate Wrong Answers

If the scenario requires interactive or always-on workloads, eliminate Batch.

If the scenario requires inter-task communication (except MPI), eliminate Batch (use VM Scale Sets or AKS).

If the scenario requires custom software on nodes, ensure application packages are used.

If cost is a concern, look for low-priority VMs in the answer.

If the question mentions "parallel processing of many independent tasks", Batch is likely correct.

Key Takeaways

Azure Batch is a managed service for running large-scale parallel and HPC workloads.

Core components: Batch account, pool, job, task.

Pools consist of dedicated and/or low-priority VMs; low-priority can be preempted.

Tasks are the units of work; they are typically independent but can declare dependencies and use resource files.

Auto-scaling uses formulas with metrics like $PendingTasks.

Batch integrates with Azure Storage, VNet, AAD, and Monitor.

You pay only for the underlying compute and storage; Batch service is free.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Batch

Managed service for batch processing with automatic job scheduling.

Supports both Windows and Linux nodes.

Built-in auto-scaling based on task queue length.

Handles task lifecycle, retries, and dependencies.

Lower-level control over individual tasks and job orchestration.

Azure VM Scale Sets

Infrastructure-as-a-Service for scaling VMs behind a load balancer.

Primarily for hosting web applications or services.

Auto-scaling based on CPU/memory metrics or custom rules.

No built-in job scheduling; you must manage application logic.

More suitable for always-on workloads requiring load balancing.

Watch Out for These

Mistake

Azure Batch is a type of VM that runs batch jobs.

Correct

Azure Batch is a managed service that orchestrates jobs on a pool of VMs. It is not a VM itself. The VMs are compute nodes within a pool.

Mistake

Batch tasks run on the Batch service infrastructure, not on user VMs.

Correct

Tasks run on VMs that you provision in a pool. You pay for those VMs. The Batch service only schedules and monitors the tasks.

Mistake

You cannot use low-priority VMs with Azure Batch.

Correct

You can use low-priority VMs in a pool. They are cheaper but can be preempted. Batch automatically requeues tasks from preempted nodes.

Mistake

Batch requires a dedicated storage account for every job.

Correct

You can use one storage account for multiple jobs. The storage account is linked to the Batch account, not per job.

Mistake

Tasks in a job can run on different pools.

Correct

A job is tied to a single pool. All tasks in that job run on nodes from that pool. To use different pools, create separate jobs.


Frequently Asked Questions

What is the difference between Azure Batch and Azure Functions?

Azure Batch is designed for long-running, parallel batch jobs (minutes to hours) that require many VMs. Azure Functions is for short-lived, event-driven functions (seconds) that scale on a serverless plan. Use Batch for HPC workloads like rendering or simulations; use Functions for microservices or simple data processing triggered by events.

Can I use Azure Batch with containers?

Yes, you can run tasks in containers on a pool. You configure the pool for containers, including the container images to use and any private container registry, and each task then specifies the image it runs in. This is useful for reproducible environments.

How do I handle task failures in Azure Batch?

You can set a maximum retry count per task (default 0, meaning no retries). If a task exits with a nonzero exit code, Batch retries it up to that many times. You can also set a maximum task retry count in the job constraints. For low-priority VM preemptions, tasks are automatically requeued.
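The retry count is easy to misread as a total attempt count. This is a small sketch of the semantics, assuming the total is one initial try plus the configured number of retries; `run_with_retries` and the `flaky` task are hypothetical stand-ins, not Batch APIs.

```python
def run_with_retries(attempt_fn, max_task_retry_count=0):
    """Run attempt_fn until it returns exit code 0 or retries are exhausted.

    Total attempts = 1 initial try + max_task_retry_count retries.
    Returns (exit_code, attempts_used).
    """
    attempts = 0
    exit_code = None
    while attempts <= max_task_retry_count:
        attempts += 1
        exit_code = attempt_fn(attempts)
        if exit_code == 0:
            break
    return exit_code, attempts

# Hypothetical task that fails twice, then succeeds on the third attempt.
def flaky(attempt):
    return 0 if attempt >= 3 else 1

print(run_with_retries(flaky, max_task_retry_count=0))  # (1, 1)
print(run_with_retries(flaky, max_task_retry_count=3))  # (0, 3)
```

With the default of 0 the task fails after its single attempt, so a task that is expected to hit transient errors needs a retry count of at least the number of failures you want to tolerate.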

What is the maximum number of tasks per node?

The default is 1, but you can set it to a higher value (e.g., 4) when creating the pool. This allows multiple tasks to run concurrently on a multi-core VM, improving utilization.

How do I access task output files?

You can specify output file uploads in the task configuration. Files can be uploaded to Azure Blob Storage or Azure File shares. You can also retrieve files directly from the node using the Batch API, but the node's retention period is limited (default 7 days).

Can I use Azure Batch for MPI workloads?

Yes, through multi-instance tasks. You specify the number of instances and the command to run. Batch will allocate the nodes and set up communication between them. This is suitable for tightly coupled parallel workloads.

Is there a cost for the Batch service itself?

No, there is no cost for the Azure Batch service. You only pay for the underlying compute resources (VMs, storage, networking) that your pools consume. This is a key advantage.
