A data scientist needs to deploy a single ML model that will serve real-time predictions with low latency (under 10 ms) for a high-traffic web application. The model fits in memory and requires GPU acceleration. Which SageMaker inference option is MOST suitable?
Trap 1: Real-time endpoint on ml.m5 instances
ml.m5 instances are CPU-only, so they cannot provide the GPU acceleration the model requires and are unlikely to meet the sub-10 ms latency target.
Trap 2: Batch Transform
Batch Transform is for offline, asynchronous predictions on large datasets, not real-time serving.
Trap 3: Serverless Inference
Serverless Inference does not support GPU acceleration, and its cold starts can add latency well above the 10 ms target.
- A
Real-time endpoint on ml.m5 instances
Why wrong: ml.m5 instances are CPU-only, so they cannot provide the GPU acceleration the model requires and are unlikely to meet the sub-10 ms latency target.
- B
Batch Transform
Why wrong: Batch Transform is for offline, asynchronous predictions on large datasets, not real-time serving.
- C
Real-time endpoint on ml.g4dn instances
Why correct: ml.g4dn instances provide NVIDIA T4 GPUs, and a real-time endpoint keeps the model loaded on always-on instances, which suits low-latency, high-traffic inference.
- D
Serverless Inference
Why wrong: Serverless Inference does not support GPU acceleration, and its cold starts can add latency well above the 10 ms target.
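For illustration, the GPU-backed real-time endpoint from option C could be set up roughly as follows with boto3. This is a minimal sketch, not a definitive deployment script: it assumes a SageMaker Model has already been created, and the names "my-model" and "my-endpoint", the helper functions, and the ml.g4dn.xlarge size are hypothetical choices made for the example.

```python
def build_endpoint_config(model_name, instance_type="ml.g4dn.xlarge", instance_count=1):
    """Build keyword arguments for the SageMaker CreateEndpointConfig API.

    The instance size and the "-config" naming suffix are assumptions for
    this sketch; pick a size based on load testing against the latency target.
    """
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,                 # existing SageMaker Model (assumed)
                "InstanceType": instance_type,           # GPU instance family from option C
                "InitialInstanceCount": instance_count,  # raise for high traffic
            }
        ],
    }


def deploy_realtime_endpoint(model_name, endpoint_name):
    """Create the endpoint config and the real-time endpoint (needs AWS credentials)."""
    import boto3  # imported lazily so the config builder works without AWS access

    sagemaker_client = boto3.client("sagemaker")
    config = build_endpoint_config(model_name)
    sagemaker_client.create_endpoint_config(**config)
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config["EndpointConfigName"],
    )


if __name__ == "__main__":
    # Only prints the request payload; call deploy_realtime_endpoint to deploy.
    print(build_endpoint_config("my-model"))
```

Swapping the default instance type for an ml.m5 size would produce the CPU-only setup from option A, which is why the instance family is the deciding detail in this question.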