A data science team wants to deploy a real-time inference endpoint on Amazon SageMaker for a model that requires low latency (under 100 ms). The model is a small ensemble of three tree-based models, each about 50 MB. The team expects around 1000 requests per minute, with occasional spikes to 5000 requests per minute. Which instance type and deployment strategy would be MOST cost-effective while meeting the latency requirement?
The ml.c5.large provides sufficient compute for the latency requirement, and Auto Scaling scales out during spikes. This is the most cost-effective approach.
Why this answer
Option A is correct because a single-model endpoint on an ml.c5.large instance, with Auto Scaling driven by invocations per instance per minute, covers the expected 1000 requests per minute (about 17 per second) and scales out to absorb spikes of up to 5000 requests per minute (about 83 per second). The ml.c5.large offers enough memory (4 GiB) and compute for three 50 MB tree-based models, roughly 150 MB in total. The target tracking policy adds instances before the existing ones saturate, which helps keep inference under 100 ms without paying for peak capacity around the clock.
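The target tracking setup described above can be sketched with the request shapes that the Application Auto Scaling API expects. This is a minimal sketch, not the exam's reference configuration: the endpoint name, variant name, per-instance target of 1200 invocations per minute, and cooldown values are all assumptions that would normally come from a load test. Only the 1000 and 5000 requests-per-minute figures come from the question.

```python
import math

BASELINE_RPM = 1000  # from the question
PEAK_RPM = 5000      # from the question

# ASSUMPTIONS (hypothetical, for illustration only):
ENDPOINT_NAME = "ensemble-endpoint"
VARIANT_NAME = "AllTraffic"
TARGET_RPM_PER_INSTANCE = 1200  # would come from load-testing one ml.c5.large


def instances_needed(rpm, target_per_instance):
    """Smallest instance count that keeps each instance at or below the target."""
    return max(1, math.ceil(rpm / target_per_instance))


def resource_id(endpoint, variant):
    """Resource identifier format used for SageMaker endpoint variants."""
    return f"endpoint/{endpoint}/variant/{variant}"


def scalable_target_request(endpoint, variant, min_capacity, max_capacity):
    """Keyword arguments for register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id(endpoint, variant),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def scaling_policy_request(endpoint, variant, target_per_instance):
    """Keyword arguments for put_scaling_policy (target tracking)."""
    return {
        "PolicyName": f"{endpoint}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id(endpoint, variant),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,   # assumed: react quickly to spikes
            "ScaleInCooldown": 300,   # assumed: scale in more cautiously
        },
    }


def apply_autoscaling(client):
    """Register the variant and attach the policy using the given client."""
    min_cap = instances_needed(BASELINE_RPM, TARGET_RPM_PER_INSTANCE)
    max_cap = instances_needed(PEAK_RPM, TARGET_RPM_PER_INSTANCE)
    client.register_scalable_target(
        **scalable_target_request(ENDPOINT_NAME, VARIANT_NAME, min_cap, max_cap)
    )
    client.put_scaling_policy(
        **scaling_policy_request(ENDPOINT_NAME, VARIANT_NAME, TARGET_RPM_PER_INSTANCE)
    )
```

With boto3 installed and credentials configured, this would be invoked as apply_autoscaling(boto3.client("application-autoscaling")). Under the assumed target, the baseline fits on one instance and the peak needs five, so the endpoint pays for a single ml.c5.large most of the time.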
Exam trap
The trap is that candidates may treat provisioned concurrency (a setting of AWS Lambda and of SageMaker Serverless Inference) as a scaling option for instance-backed SageMaker endpoints, or assume that Multi-Model endpoints suit ensemble models, and so choose B or D without weighing the real-time latency constraint.
How to eliminate wrong answers
Option B is wrong because Multi-Model endpoints are designed to host many independent models behind one endpoint, loading each on demand and invoking one target model per request. Here the ensemble is a single model made of three sub-models that must all run for every inference, so a Multi-Model endpoint would force three separate invocations plus client-side combination of the results, adding latency and complexity.
Option C is wrong because SageMaker batch transform is an asynchronous, offline processing method designed for large-scale batch jobs; it does not provide a persistent endpoint and cannot serve real-time requests in under 100 ms.
Option D is wrong because provisioned concurrency applies to AWS Lambda and to SageMaker Serverless Inference, not to instance-backed real-time endpoints such as one on ml.c5.xlarge; those scale through Auto Scaling or manual changes to the instance count. An ml.c5.xlarge would also be over-provisioned for the baseline load, increasing cost unnecessarily.
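To illustrate why the ensemble behaves as one model rather than three independent ones, here is a hedged sketch of an inference script that loads all three sub-models once and combines them on every request. It assumes a framework container that calls model_fn and predict_fn (as the SageMaker scikit-learn container does); the artifact file names and the use of simple averaging are hypothetical, since the question does not say how the ensemble combines its outputs.

```python
import os
import pickle

# ASSUMPTION: hypothetical artifact names packaged together in one model archive.
SUB_MODEL_FILES = ("model_a.pkl", "model_b.pkl", "model_c.pkl")


def model_fn(model_dir):
    """Load all three sub-models once, when the container starts."""
    models = []
    for name in SUB_MODEL_FILES:
        with open(os.path.join(model_dir, name), "rb") as f:
            models.append(pickle.load(f))
    return models


def predict_fn(input_data, models):
    """Run every sub-model on the batch and average predictions row by row.

    ASSUMPTION: averaging is used for illustration; a real ensemble might
    vote or apply learned weights instead.
    """
    per_model = [list(m.predict(input_data)) for m in models]
    return [sum(row) / len(row) for row in zip(*per_model)]
```

Because all three sub-models stay resident in memory, each request pays only for three in-process predictions, with no per-request model loading and no extra network hops.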