A team is deploying a SageMaker endpoint for a model that was trained with scikit-learn. The endpoint receives spikes in traffic during business hours. The team wants to minimize cost while ensuring availability during spikes. Which endpoint configuration is MOST appropriate?
Auto-scaling handles traffic spikes efficiently.
Why this answer
Option B is correct because an endpoint whose production variant auto-scales on CPU utilization adds instances as traffic rises during business hours and removes them during off-peak periods, so the team pays for peak capacity only while it is needed. scikit-learn inference typically runs on CPU, which makes CPU utilization a relevant and effective scaling metric.
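As a rough illustration of what such a configuration could look like, the sketch below builds the two requests that Application Auto Scaling expects for a SageMaker variant: one that registers the variant's instance count as a scalable target, and one that attaches a target-tracking policy on the CloudWatch `CPUUtilization` metric. The endpoint name, variant name, capacity limits, target value, and cooldowns are all assumptions chosen for illustration, not values from the question.

```python
def build_scaling_requests(endpoint_name, variant_name,
                           min_instances=1, max_instances=4,
                           cpu_target=60.0):
    """Build kwargs for register_scalable_target and put_scaling_policy.

    All default values are illustrative assumptions and should be tuned
    against real traffic.
    """
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {
        **common,
        "MinCapacity": min_instances,  # floor kept running off-peak
        "MaxCapacity": max_instances,  # ceiling reached during spikes
    }
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-cpu-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # CPUUtilization is summed across cores, so it can exceed 100
            # on multi-core instances; pick the target with that in mind.
            "TargetValue": cpu_target,
            "CustomizedMetricSpecification": {
                "MetricName": "CPUUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant_name},
                ],
                "Statistic": "Average",
                "Unit": "Percent",
            },
            "ScaleOutCooldown": 60,   # seconds, assumed
            "ScaleInCooldown": 300,   # seconds, assumed
        },
    }
    return target, policy


def apply_scaling(autoscaling_client, target, policy):
    """Send both requests using an Application Auto Scaling client."""
    autoscaling_client.register_scalable_target(**target)
    autoscaling_client.put_scaling_policy(**policy)
```

With boto3 installed and credentials configured, this could be applied with `apply_scaling(boto3.client("application-autoscaling"), *build_scaling_requests("my-endpoint", "AllTraffic"))`, where `my-endpoint` and `AllTraffic` are placeholder names. Keeping the request construction separate from the API calls makes the configuration easy to inspect before anything is changed.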
Exam trap
The trap is choosing serverless inference because it sounds like the cheapest option. For spikes that recur predictably during business hours, auto-scaling on a relevant metric such as CPU utilization provides both cost efficiency and availability.
How to eliminate wrong answers
Option A is wrong because SageMaker Serverless Inference is designed for intermittent or unpredictable traffic that can tolerate cold starts. For consistent daily spikes during business hours, cold start latency can hurt performance, and sustained load can make it more expensive than provisioned instances.
Option C is wrong because a multi-model endpoint solves a different problem: hosting many models behind one endpoint on shared instances. On a single instance with no scaling policy, it cannot absorb traffic spikes and would still need a scaling mechanism to ensure availability.
Option D is wrong because a single large instance sized for peak load is over-provisioned during off-peak hours. The instance runs and bills at full size regardless of actual traffic, which contradicts the goal of minimizing cost.
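The over-provisioning argument against Option D can be made concrete with a little arithmetic. The sketch below compares billable instance-hours for capacity fixed at peak against capacity that scales out only during business hours. Every number in it (4 instances at peak, 1 off-peak, 10 peak hours on 22 business days of a 30-day month) is a hypothetical assumption, and it ignores scale-out lag and cooldown periods.

```python
def instance_hours_fixed(peak_instances, days=30):
    """Instance-hours when peak capacity runs around the clock."""
    return peak_instances * 24 * days


def instance_hours_autoscaled(base_instances, peak_instances,
                              peak_hours_per_day, business_days, days=30):
    """Instance-hours when extra capacity runs only during peak hours."""
    baseline = base_instances * 24 * days
    extra = (peak_instances - base_instances) * peak_hours_per_day * business_days
    return baseline + extra


# Hypothetical workload: 4 instances needed at peak, 1 otherwise.
fixed = instance_hours_fixed(4)
scaled = instance_hours_autoscaled(1, 4, peak_hours_per_day=10, business_days=22)
print(f"fixed at peak: {fixed} instance-hours")
print(f"auto-scaled:   {scaled} instance-hours")
```

Under these assumed numbers the auto-scaled endpoint uses fewer than half the instance-hours of the fixed one; the real ratio depends on how long and how tall the spikes are.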