A machine learning team deploys a PyTorch model for online prediction on Vertex AI using a custom container. They notice that the first few requests after scaling up experience high latency. What is the most likely cause and how should they mitigate it?
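Giving the container more time to load the model before it is marked ready keeps traffic away from replicas that are still initializing, which removes the latency spike on the first requests.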
Why this answer
Option C is correct because high latency on the first few requests after a scale-up is a classic symptom of slow container initialization. New replicas start receiving traffic before the model has finished loading, so the earliest requests wait on initialization, time out, or get retried. Setting `initialDelaySeconds` on the health probe delays the first health check, giving the container time to start and become ready before requests are routed to it. This is a common tuning parameter for custom containers on Vertex AI, where loading the model or initializing dependencies can take several seconds or more.
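The same readiness idea can also be sketched on the container side, as a complement to tuning the probe delay: the health route reports failure until the model is loaded and one warm-up inference has run. This is a minimal sketch using only the Python standard library, not Vertex AI's own implementation. `AIP_HEALTH_ROUTE` and `AIP_HTTP_PORT` are environment variables Vertex AI sets for custom containers; `ReadinessGate`, `load_and_warm_up`, `make_handler`, and `serve` are hypothetical names introduced for illustration, and the fallback values are assumptions for local runs.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set by Vertex AI for custom containers; the fallbacks are assumptions for local runs.
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
HTTP_PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))


class ReadinessGate:
    """Tracks whether the model is loaded and warmed up (hypothetical helper)."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        self._ready.set()

    def health_status(self):
        # 200 only after loading and warm-up have finished; 503 until then.
        return 200 if self._ready.is_set() else 503


def load_and_warm_up(gate, load_model, warmup_input):
    """Load the model, run one throwaway inference, then open the gate."""
    model = load_model()   # e.g. a function wrapping the PyTorch model load
    model(warmup_input)    # the first call often pays one-time setup costs
    gate.mark_ready()
    return model


def make_handler(gate):
    """Build a request handler whose health route reflects the gate."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status = gate.health_status() if self.path == HEALTH_ROUTE else 404
            self.send_response(status)
            self.end_headers()

    return Handler


def serve(gate, port=HTTP_PORT):
    """Serve the health route; blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), make_handler(gate)).serve_forever()
```

In a real container, `load_and_warm_up` would typically run in a background thread while `serve` is already listening, so health checks fail fast instead of hanging during the load.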
Exam trap
The trap is confusing slow initialization with an autoscaling misconfiguration. Candidates assume that scale-to-zero or a smaller machine type would fix the latency, when the root cause is the timing of the readiness probe.
How to eliminate wrong answers
Option A is wrong because the problem appears after scaling up, not on a cold start from zero replicas; setting min_replica=0 would make latency worse by forcing a full cold start whenever traffic resumes.
Option B is wrong because a corrupted model file would cause persistent prediction failures or errors, not high latency limited to the first few requests after scaling.
Option D is wrong because a smaller machine type (n1-standard-2) provides less CPU and memory for initialization, so it would increase startup time and latency rather than reduce them.