A developer wants to deploy a custom generative AI model using Azure Machine Learning. Which compute target should they choose for low-latency real-time inference?
AKS is designed for real-time inference with low latency.
Why this answer
Azure Kubernetes Service (AKS) is the correct compute target for low-latency real-time inference because it supports horizontal pod autoscaling and GPU acceleration, and it can be configured with an ingress controller (e.g., NGINX or Azure Application Gateway) that routes inference requests directly to the model containers. AKS also integrates with Azure Machine Learning's real-time inference endpoints, which keep the model loaded behind a persistent HTTP scoring endpoint. Response times in the sub-100 ms range are achievable, but the actual figure depends on model size, hardware, and cluster configuration.
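One way to check whether a deployed endpoint meets a latency target is to time repeated scoring calls and look at a high percentile rather than the average. The sketch below is illustrative only: the helper names (`post_json`, `measure_latencies_ms`, `percentile`), the Bearer-key header, and the JSON payload shape are assumptions, not details given by the question.

```python
import json
import math
import time
import urllib.request
from typing import Callable, List


def post_json(url: str, payload: dict, api_key: str) -> bytes:
    """Send one scoring request over HTTP and return the raw response body.

    Assumption: the endpoint accepts a JSON body and key-based Bearer auth.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def measure_latencies_ms(send: Callable[[], object], attempts: int = 20) -> List[float]:
    """Call `send` repeatedly and record each round-trip time in milliseconds."""
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        send()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values: List[float], fraction: float) -> float:
    """Nearest-rank percentile, e.g. fraction=0.95 for p95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]
```

In use, `send` would wrap a call such as `lambda: post_json(scoring_url, sample_payload, key)`, where `scoring_url`, `sample_payload`, and `key` are placeholders for your own deployment. Comparing `percentile(latencies, 0.95)` against the target is more informative than the mean, because cold starts and queueing show up in the tail.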
Exam trap
The trap here is that candidates often mistake Azure Functions' serverless convenience for real-time capability, overlooking its cold-start penalty and limited GPU support. Of the options listed, only AKS provides the infrastructure needed for consistent low-latency inference.
How to eliminate wrong answers
Option A is wrong because local deployment (e.g., a local Docker container or Jupyter notebook) is intended for development and testing only, not for production-grade low-latency real-time inference, as it lacks scalability, load balancing, and network-level optimizations.
Option B is wrong because Azure Batch is designed for high-throughput, parallel batch processing jobs (e.g., offline scoring of large datasets) and is not optimized for low-latency real-time inference due to its job-queue scheduling overhead and lack of persistent endpoints.
Option C is wrong because Azure Functions, while serverless and capable of handling HTTP triggers, has a cold-start latency problem (often 1-10 seconds) and limited GPU support, making it unsuitable for sub-second real-time inference workloads.
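The elimination reasoning above can be sketched as a small lookup: list the blockers this explanation names for each option and keep whichever option has none. This is only a study aid that restates the explanation. The `OPTIONS` table and `surviving_options` function are hypothetical names, and the blocker lists are a simplification rather than an official Azure feature matrix.

```python
from typing import Dict, List

# Blockers for low-latency real-time inference, as named in the explanation.
OPTIONS: Dict[str, List[str]] = {
    "Local deployment": [
        "no scalability",
        "no load balancing",
        "no network-level optimizations",
    ],
    "Azure Batch": [
        "job-queue scheduling overhead",
        "no persistent endpoint",
    ],
    "Azure Functions": [
        "cold-start latency",
        "limited GPU support",
    ],
    "Azure Kubernetes Service (AKS)": [],
}


def surviving_options(options: Dict[str, List[str]]) -> List[str]:
    """Return the options that have no blocker for real-time inference."""
    return [name for name, blockers in options.items() if not blockers]


if __name__ == "__main__":
    for name, blockers in OPTIONS.items():
        verdict = "keep" if not blockers else "eliminate: " + ", ".join(blockers)
        print(f"{name}: {verdict}")
```

Writing out the blockers this way makes it easier to see that each wrong option fails for a different reason, which is the pattern to look for in similar questions.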