A data science team needs to ensure that a deployed Vertex AI model can handle varying traffic patterns with minimal latency and cost. What should they do?
Autoscaling adjusts replicas based on traffic, balancing latency and cost.
Why this answer
Deploying the model to a Vertex AI endpoint with autoscaling lets the service adjust the number of serving replicas to match incoming traffic, which keeps latency low during spikes and reduces cost during lulls. This is the recommended approach for variable traffic in production: Google Cloud's managed infrastructure adds and removes replicas automatically, within the minimum and maximum replica counts configured at deployment.
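To make the scaling behavior concrete, here is a minimal sketch of a utilization-based scaling rule. It is a simplified model for illustration, not Vertex AI's actual algorithm: the function name `desired_replicas` and all numbers are hypothetical. In the Vertex AI Python SDK, the bounds and target correspond to the `min_replica_count`, `max_replica_count`, and `autoscaling_target_cpu_utilization` arguments of `Model.deploy`.

```python
import math


def desired_replicas(current_replicas, observed_cpu_pct, target_cpu_pct,
                     min_replicas, max_replicas):
    """Simplified scaling rule (an assumption, not the service's real logic).

    Picks the replica count that would bring average CPU utilization back
    to the target, then clamps it to the configured bounds.
    """
    raw = math.ceil(current_replicas * observed_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical scenarios with a 60% CPU target and bounds of 1 to 10 replicas.
spike = desired_replicas(2, 90.0, 60.0, 1, 10)    # load above target: scale out
lull = desired_replicas(4, 10.0, 60.0, 1, 10)     # load far below target: scale in
capped = desired_replicas(8, 120.0, 60.0, 1, 10)  # demand exceeds the maximum

print(spike, lull, capped)
```

The clamp is the important design point: the minimum sets a floor on cost and availability, while the maximum caps spend even when demand exceeds capacity.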
Exam trap
Google Cloud often tests the misconception that batch prediction can substitute for online serving when traffic varies. The key distinction is that batch prediction runs as an asynchronous job: it offers no real-time latency guarantees and cannot respond to individual requests as they arrive.
How to eliminate wrong answers
Option B is wrong because batch prediction is designed for asynchronous, large-scale offline inference on static datasets, not for real-time traffic with varying patterns; it cannot handle low-latency online requests.
Option C is wrong because pre-warming all instances defeats the purpose of autoscaling, leading to constant high cost regardless of actual traffic, and is not a dynamic solution.
Option D is wrong because deploying to a single large machine type creates a single point of failure and cannot scale horizontally to handle traffic spikes, resulting in either over-provisioning cost or latency under load.
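The cost argument against fixed provisioning can be illustrated with a small back-of-the-envelope calculation. This is a sketch under stated assumptions: the traffic figures and the per-replica capacity are hypothetical, and it assumes replica count tracks demand perfectly each hour, which real autoscaling only approximates.

```python
import math

# Hypothetical capacity of one replica, in requests per second.
REQUESTS_PER_REPLICA = 100

# Hypothetical average traffic (requests per second) for six consecutive hours.
hourly_traffic = [50, 120, 400, 900, 300, 80]


def replicas_needed(rps, min_replicas=1):
    """Replicas required to serve `rps`, never below the minimum."""
    return max(min_replicas, math.ceil(rps / REQUESTS_PER_REPLICA))


# Fixed provisioning: run enough replicas for the peak hour, all the time.
peak_replicas = max(replicas_needed(rps) for rps in hourly_traffic)
fixed_replica_hours = peak_replicas * len(hourly_traffic)

# Autoscaling (idealized): run only what each hour requires.
autoscaled_replica_hours = sum(replicas_needed(rps) for rps in hourly_traffic)

print(fixed_replica_hours, autoscaled_replica_hours)
```

With these made-up numbers, provisioning for the peak pays for far more replica-hours than tracking demand does, which is the trade-off Options C and D ignore.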