A company uses Vertex AI for online predictions with a large ensemble model that requires GPU acceleration. They want to reduce inference latency by batching multiple requests into a single GPU inference call. What should they configure?
Correct: NVIDIA Triton Inference Server supports dynamic batching, which improves GPU utilization and reduces per-request latency under concurrent load.
Why this answer
NVIDIA Triton Inference Server supports dynamic batching, which automatically groups inference requests that arrive close together into a single GPU call. Under concurrent load, this amortizes per-call overhead across requests and improves GPU utilization, which directly addresses the need to lower latency for online predictions with a large ensemble model on Vertex AI.
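The grouping idea can be illustrated with a small simulation. This is a minimal sketch of the general policy, not Triton's actual scheduler: the function `group_into_batches` and its parameters are hypothetical names chosen for illustration, loosely mirroring Triton's `max_batch_size` and `max_queue_delay_microseconds` settings. It also assumes the GPU is always free when a batch closes.

```python
from typing import List


def group_into_batches(
    arrival_us: List[int],
    max_batch_size: int,
    max_queue_delay_us: int,
) -> List[List[int]]:
    """Group request arrival times (microseconds) into batches.

    Simplified, hypothetical model of dynamic batching: a batch is closed
    when it is full, or when the next request arrives more than
    max_queue_delay_us after the first request in the batch.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrival_us):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_queue_delay_us
        if current and (batch_full or waited_too_long):
            batches.append(current)  # dispatch as one GPU inference call
            current = []
        current.append(t)
    if current:
        batches.append(current)  # flush the final partial batch
    return batches


if __name__ == "__main__":
    arrivals = [0, 20, 50, 90, 400, 410, 900]
    # Seven requests are served with three GPU calls instead of seven.
    print(group_into_batches(arrivals, max_batch_size=4, max_queue_delay_us=100))
```

In this toy model, a longer queue delay produces fuller batches but makes early requests wait longer, which is the trade-off to tune when enabling dynamic batching.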
Exam trap
The exam often tests the distinction between model-level optimizations (quantization, compilation) and runtime optimizations (batching), leading candidates to confuse techniques that improve single-request speed with those that improve throughput via request aggregation.
How to eliminate wrong answers
Option A is wrong because Vertex AI Model Optimization for automatic compilation focuses on model-level optimizations such as pruning or quantization, not on batching requests at runtime. Option C is wrong because increasing GPU replicas improves concurrency but does not batch requests into a single inference call; each replica still processes its requests individually, so cost rises without the batching benefit. Option D is wrong because model quantization using TensorRT reduces model size and speeds up computation per request, but it does not implement request batching at the inference server level.