You are the DevOps engineer for a large gaming company. Your game backend runs on Compute Engine instances behind a global HTTP(S) Load Balancer. You have set up Cloud Monitoring with an uptime check for the load balancer's IP address, and you are using logging to capture 404 errors. Recently, a new game update caused a surge in traffic, and you started receiving many alerts from your uptime check indicating that the site is down. However, you verify that the backend instances are healthy and the load balancer is responding correctly, though some requests are timing out due to the increased load. Your alerting policy currently triggers when 2 consecutive checks fail. What is the most likely reason for the false positive alerts?
During a traffic surge, response times increase. If the uptime check's timeout is too short, the check fails even though the site is up.
Why this answer
Option D is correct because the uptime check's timeout is too short for the response times seen during the surge. The load balancer still answers most requests correctly, but an uptime check counts as failed whenever the response does not arrive within its configured timeout (10 seconds by default). Because the alerting policy triggers after only 2 consecutive failures, a brief run of slow responses is enough to raise an alert that the site is down, even though the backend instances and the load balancer are healthy.
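The interaction between the timeout and the "2 consecutive failures" rule can be sketched in a few lines of Python. This is a simplified model, not Cloud Monitoring's actual evaluation logic: it assumes a single checker location, and the function names and latency values are hypothetical, chosen only to illustrate the reasoning above.

```python
from typing import Iterable, List


def check_results(latencies_s: Iterable[float], timeout_s: float) -> List[bool]:
    """Return True for each probe that answered within the timeout."""
    return [latency <= timeout_s for latency in latencies_s]


def alert_fires(results: Iterable[bool], consecutive_failures: int = 2) -> bool:
    """Return True if at least `consecutive_failures` probes fail in a row."""
    streak = 0
    for ok in results:
        # A passing probe resets the streak; a failing one extends it.
        streak = 0 if ok else streak + 1
        if streak >= consecutive_failures:
            return True
    return False


# Hypothetical response times (in seconds) seen by one checker during the surge.
# Every request eventually succeeds; some are just slow.
surge_latencies = [3.0, 12.5, 14.0, 4.2, 11.0, 13.1]

# Assumed timeouts: the 10-second default versus a longer 15-second setting.
short = alert_fires(check_results(surge_latencies, timeout_s=10.0))
longer = alert_fires(check_results(surge_latencies, timeout_s=15.0))
print(short, longer)  # prints: True False
```

With the 10-second timeout, the second and third probes both exceed the limit, so the alert fires even though every request was eventually served. With the 15-second timeout, the same traffic produces no failed checks.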
Exam trap
Google Cloud exams often test the distinction between health checks, which the load balancer uses to verify backend instance health, and uptime checks, which Cloud Monitoring uses to verify end-to-end availability from outside the service. Candidates who confuse the two assume that healthy backends guarantee a passing uptime check.
How to eliminate wrong answers
Option A is wrong because the global load balancer's health check is separate from the uptime check. The health check monitors backend instance health, and the scenario states that the backend instances are healthy, so the health check is not failing.
Option B is wrong because Cloud Monitoring does not have a hard limit on concurrent uptime checks. The limit is on the number of uptime checks per project (100), not on concurrency, and a surge in traffic to the game would not cause that limit to be reached.
Option C is wrong because the scenario mentions capturing 404 errors, not 503 errors. A 503 status code would indicate that the backend is unavailable, but the scenario states that the backend is healthy and the load balancer is responding correctly, so the uptime check is not receiving a 503.