Your team has deployed a scikit-learn model using a custom container on Vertex AI Prediction. The model receives about 100 requests per second, and the endpoint is configured with a single n1-standard-4 machine. You notice that response times are around 200 ms on average, but occasionally spike to over 10 seconds during traffic bursts. You have set the min replicas to 1 and max replicas to 10. Despite this, spikes still occur. What is the most likely cause and the best course of action?
Reducing cooldown and increasing max replicas helps autoscaling respond faster to bursts.
Why this answer
Option A is correct because the occasional spikes suggest that autoscaling is too slow; enabling batching reduces the number of inference calls and smooths out bursts. Option B could help but the autoscaling may still be too slow. Option C is not necessarily needed if average latency is acceptable.
Option D is unlikely to cause intermittent spikes.