A company is using Vertex AI to deploy a text generation model for a chatbot. They want to reduce the response latency. Which configuration change is most effective?
Smaller models have faster inference, directly reducing latency.
Why this answer
Option B is correct because a smaller model variant has fewer parameters and therefore needs fewer computational operations per generated token, which lowers latency. In Vertex AI, choosing a smaller variant (for example, `text-bison@002` rather than a larger model in the same family) generally yields faster token generation without requiring hardware changes.
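To make the parameter-count argument concrete, here is a minimal back-of-the-envelope sketch. It assumes the common approximation of roughly 2 floating-point operations per parameter per generated token and a fixed effective hardware rate. The function name, the parameter counts, and the hardware figure are all hypothetical illustrations, not published numbers for any Vertex AI model.

```python
def generation_latency_s(num_params: float,
                         output_tokens: int,
                         effective_flops_per_s: float) -> float:
    """Rough estimate of generation time for one request.

    Assumption: a forward pass costs about 2 FLOPs per parameter per token,
    and the hardware sustains a constant effective FLOP rate.
    """
    flops_per_token = 2.0 * num_params
    return output_tokens * flops_per_token / effective_flops_per_s


# Hypothetical values chosen only for illustration.
EFFECTIVE_FLOPS = 1e13   # assumed sustained FLOP/s on the same hardware
OUTPUT_TOKENS = 100      # assumed response length

large_s = generation_latency_s(70e9, OUTPUT_TOKENS, EFFECTIVE_FLOPS)
small_s = generation_latency_s(7e9, OUTPUT_TOKENS, EFFECTIVE_FLOPS)

print(f"larger model:  {large_s:.2f} s")
print(f"smaller model: {small_s:.2f} s")
```

Under these assumptions the estimate scales linearly with parameter count, which is why switching to a smaller variant cuts per-request time on unchanged hardware. Real deployments also depend on memory bandwidth, prompt length, and network overhead, so treat this as intuition rather than a prediction.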
Exam trap
Google Cloud often tests the misconception that adding compute resources (GPUs) or increasing the batch size always reduces latency. These optimizations primarily target throughput, not per-request response time.
How to eliminate wrong answers
Option A is wrong because model quantization (e.g., reducing weights from FP32 to INT8) can reduce memory footprint and improve throughput, but it does not guarantee lower per-request latency and may introduce accuracy trade-offs. It is not the most effective single change for latency reduction.
Option C is wrong because increasing the number of GPUs can improve throughput for batch processing but does not by itself reduce per-request latency. It may even add communication overhead and cost without speeding up an individual inference.
Option D is wrong because a larger batch size increases throughput for concurrent requests but increases the latency of each individual request, since the model processes more sequences together before returning results.
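The throughput-versus-latency distinction behind Option D can be illustrated with a toy cost model. This sketch assumes each batch pays a fixed overhead plus a per-sequence cost, and that every request in the batch waits for the whole batch to finish. The function name and the millisecond figures are hypothetical, chosen only to show the direction of the trade-off.

```python
def batch_stats(batch_size: int,
                fixed_ms: float = 50.0,
                per_seq_ms: float = 10.0) -> tuple[float, float]:
    """Return (per-request latency in ms, throughput in requests/s).

    Assumption: batch time = fixed overhead + per-sequence cost * batch size,
    and each request's latency equals the full batch time.
    """
    batch_time_ms = fixed_ms + per_seq_ms * batch_size
    throughput_rps = batch_size / (batch_time_ms / 1000.0)
    return batch_time_ms, throughput_rps


for size in (1, 8, 32):
    latency_ms, rps = batch_stats(size)
    print(f"batch={size:>2}  latency={latency_ms:.0f} ms  throughput={rps:.1f} req/s")
```

In this model both numbers rise with batch size: the server completes more requests per second, but each user waits longer for a response, which is the opposite of what a latency-sensitive chatbot needs.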