AI-900 | Chapter 49 of 100 | Objective 2.4

Azure ML Endpoints: Real-Time and Batch

This chapter covers Azure ML endpoints, specifically real-time and batch endpoints, which are critical for deploying machine learning models to production. The AI-900 exam tests your understanding of when to use each type, their key features, and how they differ. Approximately 10-15% of exam questions touch on model deployment and endpoints, making this a high-yield topic. By the end, you'll be able to choose the right endpoint for a given scenario and explain the underlying mechanics.

Updated May 31, 2026

Azure ML Endpoints: Your Model Deployment Switchboard

Think of Azure ML endpoints as a sophisticated switchboard operator at a large corporation. The operator manages incoming calls and routes them to the right department. For real-time endpoints, the operator handles each call immediately as it arrives, connecting the caller to the appropriate specialist who can answer right away. The operator keeps a list of available specialists and their current workload, ensuring no caller waits too long. For batch endpoints, the operator collects all calls received during a set period, then routes them as a batch to a processing center that handles them together, perhaps overnight. The operator also handles scaling: when call volume spikes, the operator automatically brings in more specialists from a pool; when volume drops, some specialists are released. The operator monitors the health of each specialist, rerouting calls if one is unavailable. Azure ML endpoints work similarly: they manage the infrastructure to serve model predictions, scaling instances, routing requests, and handling failures, all while you only interact with the endpoint URL. The key difference is that the switchboard operator is replaced by Azure's managed services, which handle load balancing, auto-scaling, and fault tolerance automatically.

How It Actually Works

What Are Azure ML Endpoints?

Azure Machine Learning endpoints are a managed service that allows you to deploy trained machine learning models as web services or batch processing jobs. They abstract away the underlying infrastructure, providing a secure, scalable, and reliable way to serve predictions. The exam focuses on two primary endpoint types: real-time endpoints and batch endpoints. Understanding the difference is crucial for selecting the right deployment strategy.

Real-Time Endpoints: Low-Latency Inference

Real-time endpoints are designed for scenarios where you need immediate predictions, typically with latency in the milliseconds to seconds range. They are ideal for interactive applications like chatbots, fraud detection, or recommendation systems that must respond to user actions in real time.

How It Works Internally:

- A real-time endpoint is backed by a compute target, such as Azure Kubernetes Service (AKS) or Azure Container Instances (ACI). AKS is recommended for production due to its auto-scaling and advanced networking capabilities.
- When a request is sent to the endpoint URL, Azure ML routes it through a load balancer to one of the running instances.
- Each instance runs a scoring script (usually a Python script) that loads the model and processes the input data. The script must be stateless; any state must be stored externally.
- The endpoint can be configured with multiple instances to handle concurrent requests. Azure ML automatically handles health checks and can restart unhealthy instances.

Key Configuration Values:

- Instance count: The number of replicas to run. Default is 1, but production often uses 2 or more for high availability.
- Instance type: The VM SKU for AKS nodes (e.g., Standard_DS3_v2).
- Auto-scaling: Can be configured based on metrics like CPU utilization or request latency. Minimum and maximum instance counts are set, e.g., min=2, max=10.
- Timeout: Default request timeout is 60 seconds. Can be increased but may affect concurrent request handling.
- Authentication: Supports key-based or token-based authentication. Keys are provided in the request header.

Deployment Process:

1. Register the model in the Azure ML workspace.
2. Create a scoring script (score.py) with init() and run() functions.
3. Define an environment (conda dependencies, Docker image).
4. Create an inference configuration that points to the scoring script and environment.
5. Create a deployment configuration specifying compute target and scaling settings.
6. Deploy the model to an endpoint using the Azure ML SDK or CLI.

Batch Endpoints: High-Throughput, Asynchronous Inference

Batch endpoints are designed for scenarios where you need to run inference on large datasets, often in a scheduled or event-driven manner. They are ideal for tasks like processing historical data, generating predictions for a customer base overnight, or running periodic model evaluations.

How It Works Internally:

- Batch endpoints use a compute cluster (e.g., Azure Machine Learning compute cluster or AKS) to process data in parallel.
- You submit a batch inference job that specifies the input data location (e.g., Azure Blob Storage or a datastore), the model, and the output location.
- The job is divided into mini-batches, and each node in the cluster processes one mini-batch at a time.
- The scoring script must handle mini-batches of data, typically using pandas DataFrames.
- Results are written to the specified output location, usually in Parquet or CSV format.
- Batch endpoints are asynchronous: you submit a job and can monitor its progress; you do not get a synchronous response.
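The mini-batch flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not the Azure ML runtime: the threshold "model", the helper `split_into_mini_batches`, and the sample data are all hypothetical, and a real job would spread the mini-batches across cluster nodes instead of looping locally.

```python
from typing import Iterator, List

def init() -> None:
    """Called once per worker; a real script would load the registered model here."""
    global model
    # Hypothetical stand-in model: predicts 1 for values above 0.5, else 0.
    model = lambda value: 1 if value > 0.5 else 0

def run(mini_batch: List[float]) -> List[int]:
    """Called once per mini-batch; returns one prediction per input item."""
    return [model(value) for value in mini_batch]

def split_into_mini_batches(items: List[float], size: int) -> Iterator[List[float]]:
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

init()
dataset = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.6]
predictions: List[int] = []
for batch in split_into_mini_batches(dataset, size=3):
    # In a real job each mini-batch would be handed to a cluster node.
    predictions.extend(run(batch))

print(predictions)
```

The point of the sketch is the call pattern: `init()` once, then `run()` once per mini-batch, with results collected and written out afterwards.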

Key Configuration Values:

- Compute target: Azure ML compute cluster with a specified VM size (e.g., Standard_DS12_v2).
- Mini-batch size: How much input each run() call receives. Default is 10 (for file-based inputs, this counts files rather than individual records); adjust it based on memory and performance.
- Parallelism: Number of nodes in the cluster. Can be set to auto-scale based on workload.
- Error handling: Options to abort on failure or continue on error.
- Output format: Supported formats include CSV, Parquet, and custom delimited.

Deployment Process:

1. Register the model.
2. Create a batch scoring script that accepts a pandas DataFrame and returns predictions.
3. Define an environment.
4. Create a batch inference pipeline or use the batch endpoint deployment feature.
5. Submit the batch job via SDK, CLI, or REST API.

Real-Time vs. Batch: Key Differences

| Feature | Real-Time Endpoint | Batch Endpoint |
|---------|-------------------|----------------|
| Latency | Low (milliseconds to seconds) | High (minutes to hours) |
| Input | Single request (JSON, string) | Large dataset (files) |
| Output | Immediate response | Written to storage |
| Use Case | Interactive apps, real-time scoring | Periodic processing, large volumes |
| Scaling | Auto-scales based on load | Scales based on cluster size |
| Cost | Higher per-request cost | Lower per-request cost (batch) |

Interacting with Endpoints

REST API: Both endpoint types expose a REST API. For real-time endpoints, you send a POST request with input data; for batch endpoints, you submit a job via a POST request.

Azure ML SDK: You can use the Python SDK to deploy, manage, and invoke endpoints.

Azure CLI: The az ml extension supports endpoint management.

Example CLI Commands for Real-Time Deployment (CLI v2; the deployment's model, scoring script, environment, instance type, and instance count are declared in a YAML file, here given the placeholder name blue-deployment.yml):

az ml online-endpoint create --name my-endpoint -g my-resource-group -w my-workspace
az ml online-deployment create --endpoint-name my-endpoint --name blue --file blue-deployment.yml -g my-resource-group -w my-workspace

Example CLI Commands for Batch Endpoint (CLI v2; the model, scoring script, environment, and compute cluster are declared in a YAML file, here given the placeholder name batch-deployment.yml):

az ml batch-endpoint create --name my-batch-endpoint -g my-resource-group -w my-workspace
az ml batch-deployment create --endpoint-name my-batch-endpoint --name default --file batch-deployment.yml -g my-resource-group -w my-workspace

Monitoring and Logging

Azure ML provides built-in monitoring for endpoints:

- Metrics: Request latency, request count, CPU/memory utilization, error rate.
- Logs: Application Insights can be integrated for detailed logging.
- Alerts: Can be set up for anomalies like high error rates or latency.

Security and Networking

Authentication: Real-time endpoints can use key or token authentication. Batch endpoints use Azure AD tokens.

Networking: Endpoints can be deployed to a virtual network (VNet) for private access.

Data encryption: Data in transit and at rest is encrypted.

Exam Trap: Common Misunderstandings

Trap 1: Thinking batch endpoints can return synchronous responses. They are asynchronous; you must poll for job completion.

Trap 2: Assuming real-time endpoints are always cheaper. For high volume, batch is often more cost-effective.

Trap 3: Confusing batch endpoints with pipeline endpoints. Pipeline endpoints are for orchestrating ML workflows, not just inference.

Summary of Exam-Relevant Facts

Real-time endpoints: low latency, synchronous, for interactive apps.

Batch endpoints: high throughput, asynchronous, for large datasets.

Both support auto-scaling but in different ways.

Deployment requires a scoring script, environment, and compute target.

Security features include authentication and VNet integration.

Walk-Through

1

Register the Model

Before deploying, you must register your trained model in the Azure ML workspace. This makes the model versioned and trackable. Use the SDK's `Model.register()` method or the CLI `az ml model create`. The model can be stored in the workspace's default datastore or a custom location. Versioning is automatic; you can specify a version or let Azure assign one. This step is crucial because the endpoint references the model by its registered name and version.

2

Create Scoring Script

Write a Python script (typically `score.py`) that contains two functions: `init()` and `run()`. The `init()` function is called once when the container starts, and it loads the model into memory. The `run()` function is called for each request (real-time) or each mini-batch (batch). For real-time, `run()` receives a JSON string and returns a JSON string. For batch, it receives a pandas DataFrame and returns predictions. The script must be stateless; any external state (e.g., database connections) should be handled carefully.
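A minimal sketch of the `init()`/`run()` contract for a real-time scoring script follows. The linear "model" held in a dictionary and the `{"data": [...]}` payload shape are assumptions made for illustration; a real script would load the registered model file in `init()` and use whatever input schema the client sends.

```python
import json

model = None  # populated by init()

def init() -> None:
    """Runs once when the container starts; load the model into memory here."""
    global model
    # Hypothetical stand-in: a real script would deserialize a registered model file.
    model = {"weights": [0.5, 0.25], "bias": 0.1}

def run(raw_data: str) -> str:
    """Runs once per request: parse the JSON body, score each row, return JSON."""
    try:
        rows = json.loads(raw_data)["data"]
        scores = [
            sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]
            for row in rows
        ]
        return json.dumps({"predictions": scores})
    except (KeyError, TypeError, ValueError) as exc:
        # Return the error as JSON instead of crashing the container.
        return json.dumps({"error": str(exc)})

init()
response = run(json.dumps({"data": [[2.0, 4.0], [0.0, 0.0]]}))
print(response)
```

Keeping the model in a module-level variable set by `init()` means it is loaded once per container, while `run()` holds no state between requests.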

3

Define Environment

Create an environment that includes all dependencies needed to run the scoring script. This can be a curated environment (e.g., `AzureML-sklearn-0.24-ubuntu18.04-py37-cpu`) or a custom Docker image. In the Python SDK (v1), use `Environment.from_conda_specification()` for a conda file or `Environment.from_dockerfile()` for a custom Dockerfile. The environment is versioned and can be reused. For production, use a custom environment to pin exact package versions for reproducibility.

4

Create Inference Configuration

Combine the scoring script and environment into an inference configuration. For real-time endpoints in the Python SDK (v1), use `InferenceConfig(entry_script='score.py', environment=my_env)`. For batch inference in the same SDK, the equivalent is `ParallelRunConfig(entry_script='batch_score.py', environment=my_env, ...)`, which also carries batch settings such as mini-batch size and node count. This configuration tells Azure ML how to run the model inference.

5

Deploy to Endpoint

Deploy the model to an endpoint. For real-time, create an online endpoint and then a deployment under it that specifies the model, scoring script, environment, and compute settings such as instance count (in SDK v2, `ManagedOnlineEndpoint` and `ManagedOnlineDeployment`). For batch, create a `BatchEndpoint` and a `BatchDeployment`. In SDK v1, the real-time equivalent is `Model.deploy()` with the inference config and deployment config; the CLI can be used instead of either SDK. After deployment, Azure ML provisions the compute, runs the scoring script, and exposes the endpoint. For real-time, you can then send requests to the endpoint URL. For batch, you submit jobs.

What This Looks Like on the Job

Enterprise Scenario 1: Real-Time Fraud Detection

A financial services company needs to detect fraudulent credit card transactions in real time. Each transaction triggers a request to a real-time endpoint. The model is a gradient-boosted tree trained on historical transaction data. The endpoint is deployed on AKS with auto-scaling configured to handle peak holiday traffic (up to 10,000 requests per second). The scoring script preprocesses the transaction features, runs inference, and returns a fraud probability. The endpoint is integrated with Azure API Management for throttling and authentication. Misconfiguration: If the instance count is set too low, latency spikes during peak hours, causing timeouts. Solution: Set auto-scaling with a minimum of 5 instances and a maximum of 20, with scaling based on CPU utilization above 70%.
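The scaling settings in this scenario (minimum 5, maximum 20, scale out above 70% CPU) can be read as a simple threshold rule. The sketch below is an illustration only, not Azure's actual scaling algorithm: the function `desired_instances`, the one-instance step, and the 30% scale-in threshold are assumptions.

```python
def desired_instances(
    current: int,
    cpu_percent: float,
    min_instances: int = 5,
    max_instances: int = 20,
    scale_out_above: float = 70.0,
    scale_in_below: float = 30.0,  # assumed value; the scenario does not give one
) -> int:
    """Return the instance count after one evaluation of a threshold rule."""
    if cpu_percent > scale_out_above:
        target = current + 1
    elif cpu_percent < scale_in_below:
        target = current - 1
    else:
        target = current
    # Clamp to the configured bounds so the rule never leaves [min, max].
    return max(min_instances, min(max_instances, target))

print(desired_instances(current=5, cpu_percent=85.0))
print(desired_instances(current=20, cpu_percent=95.0))
```

The clamp is what makes the minimum meaningful: even at low CPU the count never drops below 5, which is how the scenario avoids latency spikes when traffic returns.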

Enterprise Scenario 2: Batch Customer Churn Prediction

A telecommunications company wants to predict customer churn monthly. They have a dataset of 10 million customer records stored in Azure Blob Storage. They use a batch endpoint with an Azure ML compute cluster of 50 nodes (Standard_DS12_v2). The batch endpoint processes the data in mini-batches of 10,000 records each. The scoring script loads a logistic regression model and outputs churn probabilities. The results are written to a Parquet file for downstream reporting. Misconfiguration: If the mini-batch size is too large (e.g., 100,000), nodes run out of memory and fail. Solution: Set mini-batch size to 10,000 and use a memory-optimized VM. Also, enable error handling to continue on individual row failures.
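The numbers in this scenario can be checked with a little arithmetic. The sketch assumes mini-batches are spread evenly across nodes and that each node handles one mini-batch at a time; `plan_batch_job` is a hypothetical helper, not part of any Azure SDK.

```python
import math
from typing import Dict

def plan_batch_job(total_records: int, mini_batch_size: int, nodes: int) -> Dict[str, int]:
    """Estimate the mini-batch count and how many each node handles (even spread assumed)."""
    mini_batches = math.ceil(total_records / mini_batch_size)
    per_node = math.ceil(mini_batches / nodes)
    return {"mini_batches": mini_batches, "mini_batches_per_node": per_node}

# Scenario values: 10 million records, 50 nodes.
recommended = plan_batch_job(10_000_000, mini_batch_size=10_000, nodes=50)
oversized = plan_batch_job(10_000_000, mini_batch_size=100_000, nodes=50)

print(recommended)
print(oversized)
```

With 10,000-record mini-batches each node works through about 20 small units; with 100,000-record mini-batches each node holds ten times as much data in memory at once, which is the failure mode the scenario describes.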

Enterprise Scenario 3: Hybrid Deployment with A/B Testing

An e-commerce company deploys a new recommendation model alongside the existing one. They use a real-time endpoint with two deployments: 'blue' (current model) and 'green' (new model). They route 10% of traffic to 'green' using the endpoint's traffic distribution feature. After monitoring metrics for a week, they shift 100% traffic to 'green'. This is done via the Azure ML Studio or CLI by updating the endpoint's traffic allocation. Misconfiguration: Forgetting to set the traffic percentage correctly can cause all traffic to go to one deployment. Solution: Use the az ml online-endpoint update --traffic command to set exact percentages.
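To make the 90/10 split concrete, the following sketch simulates weighted routing between the two deployments. It is a local simulation, not how the Azure load balancer is implemented; `validate_traffic` and `pick_deployment` are hypothetical helpers, and the rule that percentages must total 100 is treated here as an assumption.

```python
import random
from collections import Counter
from typing import Dict

def validate_traffic(traffic: Dict[str, int]) -> None:
    """Reject negative shares or shares that do not total 100 (assumed rule)."""
    if any(share < 0 for share in traffic.values()):
        raise ValueError("traffic percentages must be non-negative")
    if sum(traffic.values()) != 100:
        raise ValueError("traffic percentages must total 100")

def pick_deployment(traffic: Dict[str, int], rng: random.Random) -> str:
    """Choose a deployment name with probability proportional to its share."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

traffic = {"blue": 90, "green": 10}
validate_traffic(traffic)

rng = random.Random(0)  # seeded so the simulation is repeatable
counts = Counter(pick_deployment(traffic, rng) for _ in range(10_000))
print(counts)
```

Validating the percentages before applying them is the guard against the misconfiguration mentioned above, where a wrong value silently sends all traffic to one deployment.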

How AI-900 Actually Tests This

What AI-900 Tests on This Topic

The AI-900 exam objective 2.4 focuses on 'Select the appropriate model deployment type' and 'Deploy a model as a service.' Specifically, you need to:

Distinguish between real-time and batch inference scenarios.

Understand when to use AKS vs ACI for real-time endpoints.

Know the components of a deployment: model, scoring script, environment, compute target.

Recognize that batch endpoints are asynchronous and process large datasets.

Identify that real-time endpoints are for low-latency, interactive applications.

Common Wrong Answers and Why Candidates Choose Them

1.

Choosing batch endpoint for a chatbot: Candidates see 'batch' and think it can handle many requests. But chatbots need immediate responses, so real-time is correct. The trap: 'batch' sounds like it processes many requests, but it's asynchronous.

2.

Selecting ACI for production real-time endpoint: ACI is simpler but lacks auto-scaling and advanced networking. AKS is recommended for production. The trap: ACI seems easier, but the exam expects AKS for production.

3.

Thinking batch endpoints return a response immediately: Candidates confuse batch with real-time. Batch jobs are asynchronous; you submit a job and monitor it. The exam tests this distinction.

4.

Confusing model deployment with pipeline deployment: Pipeline endpoints are for ML workflows, not just inference. The exam may ask which endpoint is used for scoring a model. The correct answer is real-time or batch, not pipeline.

Specific Numbers and Terms to Memorize

Default request timeout for real-time endpoints: 60 seconds.

Default mini-batch size for batch endpoints: 10 (for file-based inputs, counted in files per run() call rather than individual records).

Compute targets: AKS (production), ACI (dev/test), Azure ML Compute (batch).

Scoring script functions: init() and run().

Authentication: Key-based or token-based for real-time; Azure AD for batch.

Edge Cases and Exceptions

GPU models: For deep learning models, you must select a GPU VM size. The exam may test that AKS supports GPU instances.

Private endpoints: For secure deployments, endpoints can be deployed inside a VNet. This is an advanced topic but may appear.

Multi-model endpoints: A single endpoint can host multiple models by using different deployments. The exam may ask about traffic routing.

How to Eliminate Wrong Answers

If the scenario mentions 'immediate response' or 'real-time', eliminate batch endpoints.

If the scenario mentions 'large dataset' or 'scheduled processing', eliminate real-time endpoints.

If the scenario requires 'auto-scaling', AKS is the compute target.

If the scenario is for development/testing, ACI is acceptable.

If the scenario involves 'asynchronous processing', it's batch.

Key Takeaways

Real-time endpoints provide low-latency, synchronous predictions for interactive applications.

Batch endpoints are asynchronous, process large datasets, and output results to storage.

Production real-time endpoints should use AKS for auto-scaling and high availability.

Batch endpoints split the input into mini-batches (default size 10, counted in files for file-based inputs) for efficient processing.

Scoring scripts must include init() and run() functions; the environment must match dependencies.

Authentication: real-time uses keys/tokens; batch uses Azure AD.

Default request timeout for real-time endpoints is 60 seconds.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Real-Time Endpoint

Synchronous response; ideal for low-latency applications like chatbots or fraud detection.

Typically deployed on AKS for production; supports auto-scaling based on request load.

Pays for the provisioned instances for as long as the deployment is running, even when idle; cost grows as request volume forces more instances.

Input is a single JSON payload per request; output is immediate.

Uses a scoring script with init() and run() functions; run() receives a JSON string.

Batch Endpoint

Asynchronous job submission; ideal for processing large datasets or periodic scoring.

Deployed on Azure ML Compute cluster; scales by adding nodes to the cluster.

Pays for compute per job; cost is based on cluster size and runtime.

Input is a dataset (e.g., CSV files in Blob Storage); output is written to storage.

Uses a scoring script that processes mini-batches (pandas DataFrame) and returns predictions.

Watch Out for These

Mistake

Batch endpoints can return synchronous responses if you wait long enough.

Correct

Batch endpoints are fundamentally asynchronous. You submit a job and receive a job ID, then poll for completion. You cannot get a synchronous HTTP response. The design is for offline processing.
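The submit-then-poll pattern can be sketched without any Azure dependency. `FakeBatchJob`, its status strings, and `wait_for_completion` are hypothetical stand-ins for a real job handle; only the control flow is the point.

```python
import time
from typing import Iterable

class FakeBatchJob:
    """Stand-in for a submitted batch job; replays a scripted list of statuses."""

    def __init__(self, job_id: str, statuses: Iterable[str]) -> None:
        self.job_id = job_id
        self._statuses = iter(statuses)

    def status(self) -> str:
        return next(self._statuses)

TERMINAL_STATES = {"Completed", "Failed", "Canceled"}  # assumed status names

def wait_for_completion(job: FakeBatchJob, poll_seconds: float = 0.0, max_polls: int = 100) -> str:
    """Poll the job until it reaches a terminal state or the poll budget runs out."""
    for _ in range(max_polls):
        state = job.status()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job.job_id} did not finish within {max_polls} polls")

# Submitting returns immediately with a job handle, not with predictions.
job = FakeBatchJob("job-001", ["Queued", "Running", "Running", "Completed"])
final_state = wait_for_completion(job)
print(job.job_id, final_state)
```

Predictions never appear in the return value of the submission or the poll; once the job completes, the caller reads them from the configured output location.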

Mistake

Real-time endpoints are always more expensive than batch endpoints.

Correct

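Cost depends on the workload. A real-time endpoint bills for its instances for as long as the deployment is running, even when idle, whereas a batch compute cluster can scale down between jobs and can be configured to use low-priority VMs. For large volumes that do not need an immediate answer, batch is usually more cost-effective. The exam tests cost considerations based on volume.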

Mistake

You must use AKS for all real-time endpoints.

Correct

ACI is a valid option for dev/test or low-traffic scenarios. However, for production with auto-scaling and high availability, AKS is required. The exam expects you to know that AKS is the production choice.

Mistake

Batch endpoints require a scoring script that processes one record at a time.

Correct

Batch scoring scripts process one mini-batch per run() call, not one record. The default mini-batch size is 10, which for file-based inputs means 10 files; with tabular inputs the script typically works on a pandas DataFrame of many rows rather than a single row. This is important for performance.

Mistake

Real-time endpoints can only be deployed using the Azure ML Studio.

Correct

You can deploy using the SDK, CLI, REST API, or Studio. The exam focuses on the concepts, not the specific tool, but you should know that multiple methods exist.

Frequently Asked Questions

What is the difference between a real-time endpoint and a batch endpoint in Azure ML?

Real-time endpoints are for synchronous, low-latency predictions, returning results immediately. Batch endpoints are for asynchronous, high-throughput processing of large datasets, writing results to storage. Real-time is for interactive apps like chatbots; batch is for offline processing like monthly churn predictions.

When should I use AKS vs ACI for a real-time endpoint?

Use AKS for production deployments requiring auto-scaling, high availability, and advanced networking. Use ACI for dev/test or low-traffic scenarios where simplicity is preferred. The exam expects AKS for production.

Can I deploy multiple models to the same endpoint?

Yes, you can have multiple deployments under one endpoint and route traffic between them. For example, you can have a 'blue' and 'green' deployment for A/B testing. This is done by setting traffic percentages on the endpoint.

How do I authenticate requests to a real-time endpoint?

Real-time endpoints support key-based or token-based authentication. You include the key in the request header (e.g., 'Authorization: Bearer <key>'). Keys are generated during deployment and can be regenerated.
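As a sketch of what such a request looks like, the following builds (but does not send) a POST with the key in the `Authorization` header, using only the standard library. The scoring URI, the `<key>` placeholder, and the `{"data": [...]}` payload shape are assumptions; the real values come from your own endpoint and scoring script.

```python
import json
import urllib.request

def build_scoring_request(scoring_uri: str, key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request carrying a JSON body and a bearer key."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )

request = build_scoring_request(
    "https://example.invalid/score",  # hypothetical scoring URI
    "<key>",                          # placeholder; never hard-code a real key
    {"data": [[2.0, 4.0]]},
)
print(request.get_method(), request.full_url)
# To actually send it against a real endpoint:
# urllib.request.urlopen(request, timeout=60)
```

Building the request separately from sending it makes the header and body easy to inspect before any call reaches the endpoint.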

What is the default mini-batch size for batch endpoints?

The default mini-batch size is 10. For file-based inputs this counts files per run() call rather than individual records. You can adjust it based on your data size and memory constraints. Larger mini-batches may cause out-of-memory errors.

Can I use GPUs with batch endpoints?

Yes, you can use GPU-based VM sizes for batch endpoints if your model requires GPU acceleration. The compute cluster must be configured with GPU VMs (e.g., Standard_NC6).

How do I monitor the performance of my endpoint?

Azure ML provides metrics like request latency, request count, error rate, and CPU/memory utilization. You can also integrate Application Insights for detailed logging and set up alerts.
