
Azure ML Pipelines for Batch Inference

This chapter covers Azure ML Pipelines for batch inference, a key concept in the Machine Learning domain of the AI-900 exam (Objective 2.4). Batch inference is the process of applying a trained machine learning model to a large collection of data points at once, typically on a schedule or on demand. This topic appears in approximately 5-10% of AI-900 exam questions, often in the context of understanding when to use batch vs. real-time inference, and how pipelines automate the workflow. You will learn the architecture, components, and configuration of batch inference pipelines, including the specific Azure services involved and common exam scenarios.

25 min read
Intermediate
Updated May 31, 2026

Batch Inference as an Automated Factory Assembly Line

Imagine a factory that produces custom widgets. Each widget starts as a raw part (input data) that needs to be processed through several stations: first, a cleaning station (data preprocessing), then an assembly station (model scoring), then a quality control station (post-processing), and finally a packaging station (output). Instead of having workers manually carry each widget from station to station, the factory uses a conveyor belt (pipeline) that automatically moves widgets between stations. Each station has its own specialized machine (compute resource) and can process multiple widgets simultaneously (parallel execution). The conveyor belt is designed to handle a large batch of widgets at once—say, 10,000—rather than one at a time. The factory manager (Azure ML Pipeline) defines the order of stations and the exact specifications for each machine. When a batch of raw parts arrives, the conveyor belt starts moving, and each station automatically processes its step. If a machine breaks down (failure), the conveyor belt can reroute or retry that widget at a later time. The entire process runs without human intervention once started, and a final report (output) shows how many widgets passed quality control. This is exactly how Azure ML Pipelines for batch inference work: you define steps, assign compute, and let the pipeline process large volumes of data in a scalable, repeatable, and automated way.

How It Actually Works

What is Batch Inference and Why Use Pipelines?

Batch inference, also known as batch scoring, is the process of generating predictions for a large set of input data using a trained machine learning model. Unlike real-time inference, where a single input is scored immediately (e.g., a credit card transaction), batch inference processes many records together, often in an offline or scheduled manner. Examples include generating monthly credit risk scores for all customers, predicting equipment failure for thousands of sensors, or processing images from a daily satellite feed.

Azure Machine Learning Pipelines provide a way to automate the steps involved in batch inference. A pipeline is a workflow of distinct, reusable steps: data loading, preprocessing, model scoring, post-processing, and output writing. Each step runs on a specified compute target (e.g., Azure ML Compute Cluster, Azure Databricks, or serverless Spark) and can be parallelized. Pipelines ensure reproducibility, scalability, and orchestration, reducing manual effort and errors.

How Azure ML Pipelines for Batch Inference Work Internally

At its core, a batch inference pipeline consists of a directed acyclic graph (DAG) of steps. The pipeline is defined using the Azure ML SDK (Python) or CLI, or through the Azure ML Studio designer. The key components are:

Pipeline: The top-level object that contains all steps.

Step: A single unit of work, such as a Python script, a data transfer, or a model scoring step.

Compute Target: The infrastructure where each step runs (e.g., AmlCompute, ComputeInstance, Databricks).

Dataset: The input data, typically stored in Azure Blob Storage or Azure Data Lake, registered as a FileDataset or TabularDataset.

Output: The scored results, often written back to Blob Storage or a database.

Pipeline Parameter: Allows you to pass variables (e.g., input path, model version) to customize runs without changing code.
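To make the DAG idea concrete, here is a minimal sketch in plain Python (not Azure ML code) that derives a valid run order from step dependencies using the standard library's graphlib module (Python 3.9+). The step names are hypothetical and simply mirror the kinds of steps described in this chapter.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
step_dependencies = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "score": {"preprocess"},
    "postprocess": {"score"},
    "write_output": {"postprocess"},
}

def execution_order(dependencies):
    """Return one valid run order; graphlib raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dependencies).static_order())

order = execution_order(step_dependencies)
print(order)
```

Because a step only runs after everything it depends on, a cycle between steps would make the order undefined, which is why the graph must be acyclic.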

The flow is:

1. Trigger: The pipeline is triggered manually, on a schedule (e.g., daily at 2 AM), or by an event (e.g., new data arrival).

2. Data Loading Step: Reads input data from the specified dataset. Large datasets are easier to process in parallel when they are split into multiple files (e.g., partitioned Parquet files) that ParallelRunStep can distribute across nodes.

3. Preprocessing Step: Cleans, transforms, or engineers features. This step runs the same preprocessing code used during training.

4. Scoring Step: Loads the trained model (registered in the Azure ML Model Registry) and applies it to each batch of data. This is often done in parallel using ParallelRunStep, which splits the data into mini-batches and processes them on multiple nodes.

5. Post-processing Step: Aggregates results, applies thresholds, or formats output.

6. Output Step: Writes the scored results to a destination, such as a CSV file in Blob Storage or an Azure SQL Database.

Key Components, Values, Defaults, and Timers

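ParallelRunStep: Azure ML's built-in step for parallel batch inference. It takes a parallel_run_config (a ParallelRunConfig object) that defines the number of nodes (node_count), processes per node (process_count_per_node), and mini-batch size (mini_batch_size). The default mini_batch_size is 10 files for FileDataset input and 1 MB for TabularDataset input. The error_threshold parameter sets how many failed records or files are tolerated before the step is aborted: 0 aborts on the first failure, and -1 ignores all failures. The output_action can be append_row or summary_only. In the v1 SDK, error_threshold and output_action are set explicitly when you construct the ParallelRunConfig.

Pipeline Defaults: In the v1 SDK, the Pipeline constructor accepts a default_datastore (and a default_source_directory) shared by all steps, while the compute target is set on each step. For example:

pipeline = Pipeline(workspace=ws, steps=[step1, step2], default_datastore=ws.get_default_datastore())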

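Scheduling: Use ScheduleRecurrence to define the recurrence: frequency (Minute, Hour, Day, Week, or Month) and interval (an integer number of those units). Schedule.create() takes the ID of a published pipeline and an experiment name. For example, daily at 3:30 AM, where published_pipeline is the object returned by pipeline.publish() and the experiment name is a placeholder:

# Schedule and ScheduleRecurrence are imported from azureml.pipeline.core
recurrence = ScheduleRecurrence(frequency='Day', interval=1, start_time='2025-01-01T03:30:00')
schedule = Schedule.create(workspace=ws, name='daily_batch', pipeline_id=published_pipeline.id,
                           experiment_name='batch-inference', recurrence=recurrence)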

Retry Policy: ParallelRunConfig exposes run_max_try (default 3), the maximum number of attempts for a mini-batch that fails or times out, and run_invocation_timeout (default 60 seconds), the time allowed for each run() call.

Monitoring: Use Azure Monitor and Application Insights to track pipeline runs. The PipelineRun object provides status, duration, and logs.
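The interaction between mini_batch_size and error_threshold can be pictured with a small simulation in plain Python. This is only a simplified model of the behaviour described above, not how ParallelRunStep is implemented; the file names, the toy scoring function, and the helper names are all hypothetical.

```python
def split_into_mini_batches(items, mini_batch_size):
    """Group items into consecutive mini-batches of at most mini_batch_size."""
    return [items[i:i + mini_batch_size] for i in range(0, len(items), mini_batch_size)]

def run_batches(mini_batches, score_fn, error_threshold):
    """Score every item; abort once failures exceed error_threshold (-1 = never abort)."""
    results, failures = [], 0
    for batch in mini_batches:
        for item in batch:
            try:
                results.append(score_fn(item))
            except ValueError:
                failures += 1
                if error_threshold != -1 and failures > error_threshold:
                    raise RuntimeError(f"aborted after {failures} failures")
    return results, failures

def toy_score(name):
    # Pretend exactly one input file is corrupted.
    if name == "img_7.png":
        raise ValueError("corrupted file")
    return (name, 0.5)

files = [f"img_{i}.png" for i in range(25)]
batches = split_into_mini_batches(files, mini_batch_size=10)
results, failures = run_batches(batches, toy_score, error_threshold=5)
print(len(batches), len(results), failures)
```

With 25 files and a mini-batch size of 10, the last mini-batch is smaller than the others. Rerunning the same input with error_threshold=0 would abort on the single corrupted file.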

Configuration and Verification Commands

To create a batch inference pipeline using the SDK:

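from azureml.core import Workspace, Dataset, Environment, Experiment, Model
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

# Connect to the workspace described by a local config.json
ws = Workspace.from_config()

# Look up the registered input dataset, model, and environment
# ('batch_input_data', 'my_model', and 'batch-env' are placeholder names)
dataset = Dataset.get_by_name(ws, name='batch_input_data')
model = Model(ws, name='my_model')
batch_env = Environment.get(ws, name='batch-env')

# Parallelization settings for the scoring script
parallel_run_config = ParallelRunConfig(
    source_directory='./scripts',
    entry_script='score.py',
    mini_batch_size='100',       # files per run() call when the input is a FileDataset
    error_threshold=5,           # tolerate up to 5 failed items
    output_action='append_row',
    environment=batch_env,
    compute_target='cpu-cluster',
    node_count=4,
    process_count_per_node=2
)

parallel_step = ParallelRunStep(
    name='batch-score',
    parallel_run_config=parallel_run_config,
    inputs=[dataset.as_named_input('input')],
    output=OutputFileDatasetConfig(name='scored_output'),
    arguments=['--model_name', model.name],
    allow_reuse=True
)

# Build, validate, and submit the pipeline under an experiment
pipeline = Pipeline(workspace=ws, steps=[parallel_step])
pipeline.validate()
experiment = Experiment(ws, 'batch-inference')
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion()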

To verify pipeline runs and outputs:

In Azure ML Studio, navigate to Pipelines > Pipeline runs.

Use the CLI (v2), where pipeline runs are listed as jobs: az ml job list --workspace-name <ws> --resource-group <rg>

Check the output datastore for scored files.

How Batch Inference Pipelines Interact with Related Technologies

Azure Data Factory (ADF): Can trigger an Azure ML Pipeline as an activity. ADF handles data movement and orchestration across services.

Azure Synapse Analytics: Used for large-scale data preprocessing before batch scoring, or for storing results.

Azure Blob Storage: Primary storage for input datasets and output results. Data is read in parallel using Azure ML's datastore connections.

Azure Event Grid: Can trigger pipeline runs when new data arrives in Blob Storage (event-driven batch inference).

Azure Container Registry: Stores custom Docker images for steps that require specific dependencies.

Azure Key Vault: Securely stores secrets (e.g., database credentials) used by pipeline steps.

Best Practices and Common Pitfalls

Data Skew: Ensure input data is partitioned evenly to avoid straggler nodes. Use mini_batch_size to control granularity.

Model Loading: For parallel steps, load the model once per worker process (in the init() function of the entry script) rather than on every run() call, to avoid repeated loading.

Error Handling: Set error_threshold appropriately. Too high may mask real issues; too low may fail on transient errors.

Reproducibility: Pin versions of dependencies in the environment definition. Use allow_reuse=True to cache step outputs.

Cost Management: Use low-priority VMs for batch jobs to reduce cost, but be aware they can be preempted.

Walk-Through

1. Define the Pipeline Structure

First, you define the overall pipeline as a directed acyclic graph (DAG) of steps. Using the Azure ML SDK, you create a `Pipeline` object and pass it the list of steps. Each step can depend on previous steps; for example, the scoring step depends on the preprocessing step. The pipeline is defined in a Python script or YAML file. You can also set pipeline-level defaults, such as a default datastore, which apply to all steps unless overridden. Validating the pipeline at this stage confirms that the workflow is logically correct before any execution.

2. Prepare the Input Dataset

Register the input data as a dataset in Azure ML. This dataset can be a `TabularDataset` for structured data (CSV, Parquet) or a `FileDataset` for unstructured data (images, text). The dataset is stored in Blob Storage or Data Lake and is versioned. You can specify a path or a SQL query. For large datasets, consider partitioning the data into folders or files to enable parallel processing. The dataset is passed as an input to the pipeline step, which reads it during execution.

3. Configure Parallel Scoring Step

This is the core step. You create a `ParallelRunConfig` that defines the compute target, environment, entry script, and parallelization parameters. The entry script (e.g., `score.py`) must define an `init()` function, which runs once per worker process and is the place to load the model, and a `run()` function, which processes one mini-batch of data. The `mini_batch_size` parameter controls how much data is sent to each `run()` call: a number of files for `FileDataset` input, or an approximate data size (e.g., 1 MB) for `TabularDataset` input. The `node_count` and `process_count_per_node` determine the degree of parallelism. The step writes its results to an output location that downstream steps can consume.
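The shape of such an entry script can be sketched as follows. This is a minimal illustration under stated assumptions: ToyModel is a hypothetical stand-in so the script runs anywhere, whereas a real script would load a registered model inside init(). The file paths in the usage lines are placeholders.

```python
# score.py (sketch): init() and run() are the two functions ParallelRunStep expects.
import os

model = None

class ToyModel:
    """Hypothetical stand-in for a trained model: 'predicts' from the file name length."""
    def predict(self, file_path):
        return len(os.path.basename(file_path)) % 2

def init():
    """Called once per worker process, before any mini-batch is handled."""
    global model
    model = ToyModel()  # a real script would load the registered model here

def run(mini_batch):
    """Called once per mini-batch; for file input, mini_batch is a list of file paths."""
    results = []
    for file_path in mini_batch:
        prediction = model.predict(file_path)
        results.append(f"{os.path.basename(file_path)},{prediction}")
    return results

# Local smoke test of the two functions.
init()
print(run(["/data/a.png", "/data/bb.png"]))
```

Keeping the model in a module-level variable is what lets every run() call in the same process reuse it instead of loading it again.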

4. Define Post-Processing and Output

After scoring, you may need to aggregate results, apply business rules, or format the output. This is done in a subsequent `PythonScriptStep` that reads the scored output from the previous step. The final step writes the results to a persistent storage, such as Blob Storage or Azure SQL Database. You use `OutputFileDatasetConfig` or `PipelineData` to pass data between steps. The output is registered as a dataset or stored in the default datastore for later use.
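As an illustration of what a post-processing script might do, the sketch below parses scored rows, applies a threshold, and aggregates a summary using only the standard library. The 'id,probability' row layout, the 0.5 threshold, and the sample values are assumptions made for the example, not something the pipeline prescribes.

```python
import csv
import io

def postprocess(scored_text, threshold=0.5):
    """Parse 'id,probability' rows and flag each row whose probability meets the threshold."""
    rows = []
    for record_id, probability in csv.reader(io.StringIO(scored_text)):
        p = float(probability)
        rows.append({"id": record_id, "probability": p, "flagged": p >= threshold})
    return rows

def summarize(rows):
    """Aggregate: total number of rows and how many were flagged."""
    flagged = sum(1 for row in rows if row["flagged"])
    return {"total": len(rows), "flagged": flagged}

# Made-up scored output, as it might be read from the previous step.
scored = "c001,0.91\nc002,0.12\nc003,0.50\n"
rows = postprocess(scored)
print(summarize(rows))
```

In a pipeline, the input text would come from the scoring step's output location and the summary would be written to persistent storage.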

5. Publish and Schedule the Pipeline

Once the pipeline is defined and validated, you publish it to Azure ML. Publishing creates a REST endpoint that can be triggered programmatically or via the UI. You can then create a schedule using `Schedule.create()` with a recurrence pattern (e.g., daily at midnight). Alternatively, you can trigger the pipeline via Azure Data Factory or Event Grid. After scheduling, monitor runs using the Azure ML Studio or CLI. Failed runs can be investigated using logs and metrics.
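To see how a recurrence pattern unfolds into run times, here is a small plain-Python sketch. It assumes, for illustration only, that runs fire at the start time plus whole multiples of the interval. It ignores time zones and omits monthly recurrence, which needs calendar arithmetic. The function name next_runs is hypothetical and is not part of the Azure ML SDK.

```python
from datetime import datetime, timedelta

def next_runs(start_time, frequency, interval, count):
    """Return the first `count` fire times for Minute/Hour/Day/Week recurrences."""
    units = {
        "Minute": timedelta(minutes=1),
        "Hour": timedelta(hours=1),
        "Day": timedelta(days=1),
        "Week": timedelta(weeks=1),
    }
    if frequency not in units:
        raise ValueError(f"unsupported frequency in this sketch: {frequency}")
    step = units[frequency] * interval
    start = datetime.fromisoformat(start_time)
    return [start + n * step for n in range(count)]

runs = next_runs("2025-01-01T03:30:00", "Day", 1, 3)
print([run.isoformat() for run in runs])
```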

What This Looks Like on the Job

Enterprise Scenario 1: Retail Demand Forecasting

A large retailer needs to generate demand forecasts for 50,000 products across 200 stores every night. The input data includes historical sales, promotions, weather data, and store attributes. Using Azure ML Pipelines, they create a batch inference pipeline that runs daily at 2 AM. The pipeline has three steps: (1) data ingestion from Azure SQL Database into Blob Storage, (2) feature engineering using PySpark on Azure Databricks, and (3) model scoring using a pre-trained LightGBM model on a 10-node GPU cluster via ParallelRunStep. The output is written back to Azure SQL Database for consumption by the inventory system. The pipeline processes over 10 million rows in under 30 minutes. Common issues include data skew (some products have far more history) and model versioning (the pipeline must always use the latest certified model). The team uses allow_reuse=False for the scoring step to ensure fresh predictions every run.

Enterprise Scenario 2: Healthcare Image Analysis

A healthcare provider processes thousands of medical images (X-rays, MRIs) daily for abnormality detection. The batch pipeline runs after images are uploaded to Blob Storage, triggered via Event Grid whenever a new batch of images arrives. The pipeline uses a ParallelRunStep with a GPU compute cluster (NC6s_v3) to run a deep learning model (ResNet-50). Each worker process loads the model once in init() and processes mini-batches of 16 images per call. The output is a CSV file containing image IDs, predicted probabilities, and confidence scores. A key challenge is handling large image sizes (up to 50 MB each), which requires careful tuning of mini_batch_size and process_count_per_node to avoid out-of-memory errors. The team also sets error_threshold to 10 to allow for occasional corrupted images without failing the entire run.
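The memory pressure in this scenario can be estimated with back-of-the-envelope arithmetic. The sketch below gives only a rough lower-bound style estimate of raw input held per node. It ignores the model's own memory and the fact that decoded images are larger than their files. The figures of 2 processes per node and 5,000 images are assumptions chosen for the example.

```python
def peak_input_memory_mb(item_size_mb, mini_batch_size, process_count_per_node):
    """Raw input data held in memory per node if every process holds a full mini-batch."""
    return item_size_mb * mini_batch_size * process_count_per_node

def run_call_count(total_items, mini_batch_size):
    """Number of run() invocations needed to cover all items (ceiling division)."""
    return -(-total_items // mini_batch_size)

# 50 MB images, 16 per mini-batch, 2 worker processes per node (assumed).
print(peak_input_memory_mb(50, 16, 2))
# 5,000 images (assumed) split into mini-batches of 16.
print(run_call_count(5000, 16))
```

Halving mini_batch_size halves the per-node input footprint but doubles the number of run() calls, which is the trade-off the team has to tune.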

What Goes Wrong When Misconfigured

Incorrect `mini_batch_size`: Setting it too large causes memory exhaustion on nodes; too small increases overhead and slows processing.

Missing dependencies: If the environment lacks required Python packages (e.g., TensorFlow, scikit-learn), the step fails with import errors. Always use a curated or custom Docker environment.

Data access issues: If the compute cluster does not have permission to read the datastore, the pipeline fails. Use managed identity or key-based authentication.

Model not found: The pipeline references a model by name and version. If the model is deleted or the version is incorrect, the scoring step fails. Use pipeline parameters to pass the model ID dynamically.

How AI-900 Actually Tests This

What AI-900 Tests on This Topic

AI-900 exam objective 2.4: 'Describe how to use Azure Machine Learning for batch inference.' The exam focuses on understanding the difference between batch and real-time inference, the purpose of pipelines, and the key components. You are NOT expected to write code or remember SDK method names. Instead, know:

When to use batch inference (large volumes, offline, scheduled) vs. real-time (low latency, single request).

That Azure ML Pipelines automate the batch inference workflow.

The typical steps: data preparation, model scoring, output.

That ParallelRunStep enables parallel processing of large datasets.

That pipelines can be scheduled or triggered by events.

Common Wrong Answers and Why Candidates Choose Them

1. 'Batch inference is the same as real-time inference.' Candidates confuse the two because both use a model to make predictions. The key difference is latency and volume. Batch inference processes many records at once, often taking minutes to hours; real-time inference returns a single prediction in milliseconds.

2. 'You must use Azure Functions for batch inference.' Azure Functions is a general-purpose serverless compute service suited to short, event-driven tasks, not to large-scale batch scoring. The correct service is Azure ML Pipelines with ParallelRunStep.

3. 'Batch inference pipelines cannot be scheduled.' Some think pipelines are only manual. In reality, you can schedule them using Schedule.create() or trigger via Data Factory/Event Grid.

4. 'All steps in a pipeline run on the same compute.' Steps can have different compute targets (e.g., CPU for preprocessing, GPU for scoring). This is a key flexibility.

Specific Numbers and Terms

ParallelRunStep is the step type for parallel batch inference.

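mini_batch_size default is 10 files for FileDataset input (1 MB for TabularDataset input).

An error_threshold of 0 fails the step on the first error; -1 ignores all failures.

Pipeline schedules support frequencies: Minute, Hour, Day, Week, Month.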

Batch inference is also called 'batch scoring' or 'batch prediction'.

Edge Cases and Exceptions

If the model requires GPU, but the compute target is CPU-only, the pipeline fails. Ensure compute matches model requirements.

When using ParallelRunStep, the entry script must define init() and run(mini_batch) functions. With output_action set to append_row, the values returned by run() are appended to a single output file (parallel_run_step.txt by default).

Pipelines can be published as REST endpoints, enabling integration with external systems.
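As a sketch of what calling a published pipeline's REST endpoint involves, the example below builds (but does not send) the HTTP request using only the standard library. The body keys ExperimentName and ParameterAssignments follow the pattern shown in the v1 SDK documentation for published pipelines. The URL, token, experiment name, and the input_path parameter are placeholders, and build_trigger_request is a hypothetical helper.

```python
import json
import urllib.request

def build_trigger_request(endpoint_url, bearer_token, experiment_name, parameters):
    """Build (but do not send) a POST request for a published pipeline endpoint."""
    body = {"ExperimentName": experiment_name, "ParameterAssignments": parameters}
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_trigger_request(
    "https://example.invalid/pipelines/placeholder-id",  # placeholder URL
    "PLACEHOLDER_TOKEN",                                  # placeholder token
    "batch-inference",
    {"input_path": "daily/2025-01-01"},                   # hypothetical pipeline parameter
)
print(request.get_method(), json.loads(request.data))
```

An external system would send this request with a valid Azure AD token. Passing a different parameter value on each call is how one published pipeline can score different input data.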

How to Eliminate Wrong Answers

If a question mentions 'low latency' or 'single prediction,' it is real-time, not batch.

If a question mentions 'automated workflow' or 'scheduled,' think pipelines.

If a question mentions 'large dataset' or 'offline,' think batch inference.

If a question mentions 'serverless compute' for batch, be cautious: Azure ML Pipeline steps typically run on a compute target you provision, such as a compute cluster, although serverless Spark is an option for some steps.

Key Takeaways

Batch inference is for scoring large datasets offline, typically on a schedule or trigger.

Azure ML Pipelines automate the batch inference workflow with reusable, parallel steps.

ParallelRunStep is the key step for distributed batch scoring; it splits data into mini-batches.

Pipeline schedules support recurrence by minute, hour, day, week, or month; you can also trigger via Event Grid or Data Factory.

Each pipeline step can use a different compute target (CPU, GPU, Spark).

The entry script for ParallelRunStep must have init() and run(mini_batch) methods.

Mini-batch size default is 10 files for FileDataset input (1 MB for TabularDataset input); an error_threshold of 0 fails the step on any error.

Batch inference is also called batch scoring or batch prediction.

Use pipelines when you need reproducibility, orchestration, and monitoring for batch jobs.

Common exam question: 'Which service should you use for scheduled scoring of 1 million records?' Answer: Azure ML Pipeline with ParallelRunStep.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Batch Inference (Azure ML Pipeline)

Processes large volumes of data (thousands to millions of records) at once.

Latency is minutes to hours; not suitable for real-time applications.

Uses ParallelRunStep for parallel processing on a compute cluster.

Typically scheduled or event-driven (e.g., daily at 2 AM).

Cost-effective for large-scale offline predictions; can use low-priority VMs.

Real-Time Inference (Azure ML Endpoint)

Processes one or a few records per request (low latency).

Latency is milliseconds to seconds; suitable for interactive apps.

Uses a deployed endpoint (ACI, AKS, or managed online endpoint) with auto-scaling.

Triggered by HTTP requests from applications or users.

Higher cost per inference due to always-on compute; scales to zero with serverless options.

Watch Out for These

Mistake

Batch inference and real-time inference use the same Azure service.

Correct

Batch inference uses Azure ML Pipelines (often with ParallelRunStep), while real-time inference uses Azure ML online endpoints (for example, managed online endpoints, or deployments to ACI or AKS). They are different deployment patterns.

Mistake

A batch inference pipeline can only run once and cannot be reused.

Correct

Pipelines are reusable and can be published as REST endpoints. They can be triggered multiple times with different parameters (e.g., different input data paths).

Mistake

ParallelRunStep processes one row at a time.

Correct

ParallelRunStep processes mini-batches of data. Each call to the `run()` function receives a list of file paths (for `FileDataset` input) or a pandas DataFrame of rows (for `TabularDataset` input), sized according to `mini_batch_size`. This reduces overhead and improves throughput.

Mistake

All steps in a pipeline must run on the same compute target.

Correct

Each step can specify its own compute target. For example, a CPU cluster for preprocessing and a GPU cluster for scoring. This allows cost optimization and resource matching.

Mistake

Batch inference pipelines cannot handle unstructured data like images.

Correct

Batch inference pipelines can handle any data type. For images, you use FileDataset and process them in ParallelRunStep. The entry script loads and preprocesses images before scoring.


Frequently Asked Questions

What is the difference between batch inference and real-time inference in Azure ML?

Batch inference processes a large set of data at once, often on a schedule, and is ideal for offline scenarios like monthly risk scoring. Real-time inference returns a prediction for a single input immediately (milliseconds) and is used for interactive applications like fraud detection on a transaction. Azure ML Pipelines handle batch inference, while Azure ML Endpoints handle real-time inference. On the exam, if the question mentions 'large dataset' or 'scheduled,' choose batch; if it mentions 'low latency' or 'single request,' choose real-time.

What is ParallelRunStep in Azure ML Pipelines?

ParallelRunStep is a step type that enables parallel execution of a Python script (usually scoring) across multiple compute nodes. It splits the input data into mini-batches and distributes them across nodes. You configure the number of nodes, processes per node, and mini-batch size. The entry script must define an `init()` function, which loads the model once per worker process, and a `run()` function that processes a mini-batch. It is the primary mechanism for scalable batch inference.

Can I schedule an Azure ML Pipeline to run daily?

Yes. You can create a schedule using the `Schedule.create()` method with a `ScheduleRecurrence` object. For example, to run daily at 3 AM: `ScheduleRecurrence(frequency='Day', interval=1, start_time='2025-01-01T03:00:00')`. You can also use the Azure ML Studio UI to create schedules. Alternatively, you can trigger the pipeline from Azure Data Factory or via Event Grid for event-driven execution.

What compute options are available for batch inference pipelines?

You can use Azure ML Compute Cluster (CPU or GPU), Azure Databricks (for Spark-based processing), or serverless Spark (in preview). The compute target is specified per step. For parallel scoring, a compute cluster with multiple nodes is typical. You can also use low-priority VMs to reduce cost, but they may be preempted. The compute must have the necessary dependencies (e.g., Python packages, Docker image) defined in an environment.

How do I handle errors in a batch inference pipeline?

Azure ML Pipelines have built-in error handling. You can set `error_threshold` in `ParallelRunConfig` to specify how many failed records or files are tolerated before the step fails; 0 fails on the first error, and -1 ignores all failures. `ParallelRunConfig` also exposes `run_max_try` (default 3), the maximum number of attempts for a mini-batch that fails or times out. For debugging, check the logs in Azure ML Studio or use Application Insights. Failed runs can be resubmitted with updated parameters.
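The idea of retrying a failed mini-batch a limited number of times can be illustrated in plain Python. This sketch only models the concept and is not the service's implementation. The max_try argument is a hypothetical stand-in for a retry setting, and flaky_run simulates a transient failure.

```python
def process_with_retries(mini_batch, run_fn, max_try=3):
    """Call run_fn on a mini-batch up to max_try times; return (result, attempts used)."""
    last_error = None
    for attempt in range(1, max_try + 1):
        try:
            return run_fn(mini_batch), attempt
        except RuntimeError as error:
            last_error = error
    raise RuntimeError(f"mini-batch failed after {max_try} attempts") from last_error

calls = {"count": 0}

def flaky_run(mini_batch):
    # Fails on the first call to mimic a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient failure")
    return [item.upper() for item in mini_batch]

result, attempts = process_with_retries(["a", "b"], flaky_run)
print(result, attempts)
```

A mini-batch that keeps failing after the last attempt is what ultimately counts against the error threshold.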

What is the role of a dataset in a batch inference pipeline?

The dataset provides the input data for the pipeline. It is registered in Azure ML and can be a `TabularDataset` (for structured data like CSV) or a `FileDataset` (for files like images). The dataset is passed as an input to the pipeline step. Using datasets ensures versioning, reproducibility, and easy access to data stored in Azure Blob or Data Lake. You can also use `PipelineParameter` to pass the dataset path dynamically.

Can I use a custom Docker image in a batch inference pipeline?

Yes. You can define an `Environment` object that uses a custom Docker image from Azure Container Registry (ACR). This is useful when you have specific dependencies (e.g., a custom Python package or a specific CUDA version). The environment is then passed to the `ParallelRunConfig` or `PythonScriptStep`. Using a custom image ensures consistency across runs and avoids dependency conflicts.
