A company wants to deploy a machine learning model that provides real-time inference with low latency. The model is a small ensemble of three tree-based models. Which Amazon SageMaker approach is most appropriate?
Real-time endpoints provide low-latency inference.
Why this answer
A SageMaker real-time endpoint with a single inference container is the most appropriate approach because the endpoint runs on persistent instances, keeps the model loaded in memory, and handles requests synchronously, which keeps latency low. A small ensemble of three tree-based models fits comfortably in one container: a custom inference script can load all three models at startup and combine their predictions on every request, so no cold start or extra network hop sits between the request and the response. A multi-model endpoint is not needed here, because it serves one target model per request rather than combining several.
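As a rough illustration of the single-container idea, the sketch below follows the handler convention used by SageMaker's framework containers (model_fn, input_fn, predict_fn, output_fn). The artifact file names, the JSON request shape, and the choice to average the three predictions are assumptions made for the example; the question does not say how the ensemble is stored or combined.

```python
import json
import os
import pickle
from statistics import fmean

# Hypothetical artifact names; the source does not say how the models are stored.
MODEL_FILES = ("tree_a.pkl", "tree_b.pkl", "tree_c.pkl")


def model_fn(model_dir):
    """Load all three ensemble members once, when the container starts."""
    models = []
    for name in MODEL_FILES:
        with open(os.path.join(model_dir, name), "rb") as f:
            models.append(pickle.load(f))
    return models


def input_fn(request_body, content_type="application/json"):
    """Parse a JSON body shaped like {"instances": [[f1, f2, ...], ...]} (assumed format)."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)["instances"]


def predict_fn(rows, models):
    """Average the per-row predictions of every ensemble member (assumed combiner)."""
    per_model = [list(m.predict(rows)) for m in models]
    return [fmean(scores) for scores in zip(*per_model)]


def output_fn(predictions, accept="application/json"):
    """Serialize the averaged predictions back to JSON."""
    return json.dumps({"predictions": predictions})
```

Because model_fn runs once at container startup, each request only pays for parsing, three in-memory predict calls, and serialization.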
Exam trap
The trap is treating 'real-time inference' as interchangeable with 'serverless' or 'batch processing.' Candidates often assume that Serverless Inference or Lambda is always cheaper or simpler, and overlook the cold-start latency and execution limits that can break a low-latency requirement.
How to eliminate wrong answers
Option B is wrong because SageMaker batch transform jobs are designed for asynchronous, offline inference on large datasets and do not return real-time, low-latency responses.
Option C is wrong because AWS Lambda is general-purpose compute, not a managed model-hosting service. Function instances are created on demand, so a request that lands on a new instance pays cold-start latency while the runtime and models load, and packaging models in Lambda layers adds further complexity. Lambda's limits (a 15-minute maximum execution timeout and up to 10 GB of memory) also constrain it, although for a small ensemble the cold starts are the more direct conflict with persistent, low-latency inference.
Option D is wrong because SageMaker Serverless Inference endpoints scale to zero when idle, so a request that arrives after an idle period incurs a cold start that can exceed acceptable thresholds for real-time inference. They are optimized for intermittent or bursty traffic that tolerates cold starts, not for sustained low-latency workloads.
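The elimination reasoning above can be condensed into a small decision function. This is only a study aid under simplifying assumptions: the function name and its two boolean parameters are hypothetical, and real workloads involve more factors (cost, payload size, traffic shape) than two flags can capture.

```python
def choose_inference_option(needs_sync_response: bool, latency_sensitive: bool) -> str:
    """Map two coarse workload traits to a SageMaker inference option (illustrative only)."""
    if not needs_sync_response:
        # Offline scoring of large datasets: no caller is waiting on the result.
        return "batch transform"
    if latency_sensitive:
        # Persistent instances keep the model in memory and avoid cold starts.
        return "real-time endpoint"
    # Intermittent traffic that can tolerate an occasional cold start.
    return "serverless inference"


# The scenario in the question: synchronous requests with a low-latency requirement.
print(choose_inference_option(needs_sync_response=True, latency_sensitive=True))
```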