During a load test, a Vertex AI endpoint serving a large language model experiences high latency and increased error rates. The endpoint is configured with autoscaling. What is the most likely cause?
Autoscaling is keyed to the wrong metric. GPU-bound models require GPU-based metrics for effective autoscaling.
Why this answer
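If autoscaling is based on CPU utilization but the model is GPU-bound, the scaling metric does not reflect the actual load. The GPUs saturate while CPU utilization stays below the scale-out threshold, so the autoscaler adds no replicas. Requests then queue on the existing replicas, which shows up as high latency and, once timeouts are hit, as increased error rates.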
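The mismatch can be illustrated with a minimal sketch of a target-tracking scaling rule. This is a simplified model and not Vertex AI's actual autoscaler: the function `desired_replicas`, the 60% target, the replica limits, and the utilization numbers are all assumptions chosen for illustration.

```python
import math


def desired_replicas(current, utilization, target, min_replicas, max_replicas):
    """Simplified target-tracking rule (illustrative, not Vertex AI's algorithm):
    pick the replica count that would bring per-replica utilization to the target."""
    wanted = math.ceil(current * utilization / target)
    return max(min_replicas, min(wanted, max_replicas))


# Hypothetical snapshot taken during the load test on 2 replicas.
cpu_utilization = 0.25  # CPU is mostly idle; inference runs on the GPU
gpu_duty_cycle = 0.98   # GPU is saturated

# Scaling on CPU sees no pressure and keeps the replica count unchanged.
by_cpu = desired_replicas(2, cpu_utilization, target=0.60,
                          min_replicas=2, max_replicas=10)

# Scaling on the GPU metric sees the real load and requests more replicas.
by_gpu = desired_replicas(2, gpu_duty_cycle, target=0.60,
                          min_replicas=2, max_replicas=10)

print(f"replicas wanted when scaling on CPU: {by_cpu}")
print(f"replicas wanted when scaling on GPU: {by_gpu}")
```

Under these assumed numbers the CPU-based rule stays at the current replica count while the GPU-based rule asks for more, which is the gap the answer describes. In practice the fix is to configure the deployment to scale on a GPU metric such as accelerator duty cycle instead of CPU utilization.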