A team trains a TensorFlow model with custom training and deploys it to a Vertex AI endpoint. They set up Cloud Monitoring alerts for online prediction latency. The latency metric shows a spike every hour, yet the actual user experience is fine. What is the most likely cause?
Periodic log dumping can cause hourly spikes in measured latency.
Why this answer
Option A is correct because the latency metric for Vertex AI Endpoints includes both model inference time and the time taken to write prediction logs to Cloud Logging. The log writes happen asynchronously, so callers do not wait on them, but each log buffer flush still shows up as a spike in the reported metric while user-facing prediction latency stays unaffected. An hourly cadence matches a log rotation or buffer flush interval, not a real degradation in prediction performance.
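To make the mechanism concrete, here is a minimal Python sketch. It is a toy simulation under stated assumptions, not a reproduction of Vertex AI internals: it assumes a steady 40 ms inference time and a hypothetical 500 ms flush cost that is charged to the metric once every 60 minutes. The names `simulate`, `INFERENCE_MS`, `FLUSH_MS`, and `FLUSH_EVERY_MIN` are illustrative only.

```python
# Hypothetical simulation; the numbers and names are illustrative,
# not values taken from Vertex AI.
INFERENCE_MS = 40.0      # assumed steady per-request inference time
FLUSH_MS = 500.0         # assumed cost of one log-buffer flush
FLUSH_EVERY_MIN = 60     # assumed flush interval: once per hour


def simulate(minutes):
    """Return (user_latency, measured_latency), one sample per minute."""
    user, measured = [], []
    for minute in range(minutes):
        # The caller only ever waits for inference.
        user.append(INFERENCE_MS)
        # The flush is charged to the metric, not to the caller's response.
        flushing = minute > 0 and minute % FLUSH_EVERY_MIN == 0
        measured.append(INFERENCE_MS + (FLUSH_MS if flushing else 0.0))
    return user, measured


user, measured = simulate(180)
spike_minutes = [m for m, v in enumerate(measured) if v > INFERENCE_MS]
print(spike_minutes)  # [60, 120]
```

In this toy model the user-facing series stays flat at 40 ms while the measured series jumps at each flush, which is the pattern described in the question.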
Exam trap
Google Cloud exams often test the misconception that a latency metric reflects only model inference time. In reality it may also include ancillary operations such as logging, so candidates who assume otherwise overlook logging overhead as the source of periodic spikes.
How to eliminate wrong answers
Option B is wrong because an alert threshold that is too low would cause continuous or frequent alerts, not a predictable hourly spike in the latency metric itself.
Option C is wrong because hourly sampling would produce a single data point per hour, not a spike within the metric; the metric is reported continuously, and sampling frequency does not create spikes.
Option D is wrong because a monitoring agent on the VM would add consistent overhead, not a periodic hourly spike. In addition, Vertex AI Endpoints is a managed service, so customers do not directly manage the VMs that serve predictions.
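One way to confirm that the spikes follow a schedule, rather than a threshold or load effect, is to check whether they are evenly spaced. The sketch below assumes per-minute latency samples have already been exported into a plain list. The helpers `spike_indices` and `is_periodic`, the 3x-median cutoff, and the sample data are all hypothetical choices for illustration.

```python
from statistics import median


def spike_indices(samples, factor=3.0):
    """Indices whose value exceeds `factor` times the series median."""
    baseline = median(samples)
    return [i for i, v in enumerate(samples) if v > factor * baseline]


def is_periodic(indices, tolerance=1):
    """True when consecutive spikes are evenly spaced, within `tolerance` samples."""
    if len(indices) < 3:
        return False  # too few spikes to judge spacing
    gaps = [b - a for a, b in zip(indices, indices[1:])]
    return max(gaps) - min(gaps) <= tolerance


# Made-up per-minute latency (ms) over four hours, with a spike at each hour mark.
series = [40.0] * 240
for i in (59, 119, 179, 239):
    series[i] = 500.0

spikes = spike_indices(series)
print(spikes, is_periodic(spikes))
```

Evenly spaced spikes point toward a scheduled operation such as a buffer flush, whereas irregular gaps would suggest load or threshold effects instead.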