A team has a large deep learning model that needs to be deployed for real-time inference with GPU acceleration. They want to use the Triton Inference Server on SageMaker to maximize throughput. Which instance type and configuration should they choose?
The ml.g4dn instance has an NVIDIA GPU, and the Triton container is optimized for high-throughput inference.
Why this answer
Option A is correct because the Triton Inference Server is built for high-performance inference on deep learning models: it supports GPU acceleration and uses dynamic batching to maximize throughput. The ml.g4dn.xlarge instance provides a cost-effective NVIDIA T4 GPU with 16 GB of GPU memory, which is enough for many models, and SageMaker's pre-built Triton container makes deployment straightforward while exposing features such as concurrent model execution and request scheduling.
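The throughput features mentioned above are switched on in the model's Triton configuration file (config.pbtxt). The following is a minimal sketch, not a definitive setup: the helper name triton_config, the model name "my_model", the "onnxruntime" backend choice, and all numeric values are assumptions made for illustration. It only renders the configuration text and does not deploy anything.

```python
def triton_config(model_name, backend, max_batch_size=8,
                  preferred_batch_sizes=(4, 8), max_queue_delay_us=100,
                  gpu_instances=1):
    """Render a Triton config.pbtxt with dynamic batching on a GPU.

    All default values are illustrative assumptions; tune them per model.
    """
    sizes = ", ".join(str(s) for s in preferred_batch_sizes)
    return (
        f'name: "{model_name}"\n'
        f'backend: "{backend}"\n'
        f"max_batch_size: {max_batch_size}\n"
        # Dynamic batching: Triton groups queued requests into larger batches.
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {sizes} ]\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
        # Concurrent model execution: several model instances on the GPU.
        "instance_group [\n"
        "  {\n"
        f"    count: {gpu_instances}\n"
        "    kind: KIND_GPU\n"
        "  }\n"
        "]\n"
    )


if __name__ == "__main__":
    # Hypothetical model name and backend, used only for this example.
    print(triton_config("my_model", "onnxruntime", gpu_instances=2))
```

Raising max_queue_delay_microseconds lets Triton wait longer to form bigger batches, which trades a little latency for throughput; for a real-time endpoint that value would normally be kept small.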
Exam trap
AWS exams often test the misconception that any GPU instance (such as ml.p3) is automatically the right choice for deep learning inference. The key is to pair a GPU instance with a serving container that fits the workload (here, Triton), and to rule out CPU-only instances for GPU-accelerated tasks.
How to eliminate wrong answers
Option B is wrong because ml.c5.2xlarge is a CPU-only instance. Without a GPU it cannot provide the acceleration the scenario requires, so real-time inference on a large deep learning model would suffer from high latency and low throughput.
Option C is wrong because the SageMaker built-in XGBoost container is designed for gradient-boosted tree models, not deep learning models. Pairing it with ml.p3.2xlarge (V100 GPU) does not help: the container, not the instance, is the problem, and the GPU would be wasted.
Option D is wrong because ml.m5.large is a general-purpose CPU instance with no GPU, and the standard TensorFlow Serving container does not offer Triton's combination of features (e.g., dynamic batching, model ensembles, concurrent model execution) that the team needs to maximize throughput for large models.
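The elimination logic can be sketched as a simple filter over the four options. This is an illustrative sketch only: the Option class and its field names are hypothetical, the instance and container values are taken from the explanation above, and the container for Option B is not stated there, so it is marked as unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    label: str
    instance_type: str
    has_gpu: bool
    container: str


# Values as described in the explanation; B's container is not given.
OPTIONS = [
    Option("A", "ml.g4dn.xlarge", True, "triton"),
    Option("B", "ml.c5.2xlarge", False, "unspecified"),
    Option("C", "ml.p3.2xlarge", True, "xgboost"),
    Option("D", "ml.m5.large", False, "tensorflow-serving"),
]


def meets_requirements(option):
    """The scenario needs GPU acceleration and the Triton container."""
    return option.has_gpu and option.container == "triton"


surviving = [o.label for o in OPTIONS if meets_requirements(o)]
print(surviving)  # ['A']
```

Checking both conditions together mirrors the exam trap: Option C passes the GPU check but fails on the container, while Options B and D fail on the GPU check alone.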