You are deploying a PyTorch model for online predictions on Vertex AI. The model expects input tensors and performs GPU-accelerated inference. You want to minimize prediction latency and maximize throughput. Which approach should you use?
Triton is optimized for GPU inference and can reduce latency and increase throughput.
Why this answer
Option B is correct because NVIDIA Triton Inference Server provides dynamic batching, concurrent model execution, and GPU scheduling, which together raise throughput and lower latency for GPU-accelerated inference. Vertex AI's prebuilt PyTorch serving container with Triton is built for online prediction workloads and typically outperforms a plain custom container that has no inference server, particularly under concurrent requests.
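As a rough illustration of where dynamic batching and concurrent model execution are switched on, the sketch below renders a minimal Triton model configuration (`config.pbtxt`). The helper `render_triton_config`, the model name `my_torch_model`, and the specific values are assumptions chosen for illustration; a real configuration for the PyTorch backend would also declare input and output tensors, which are omitted here.

```python
def render_triton_config(
    model_name: str,
    max_batch_size: int = 8,
    max_queue_delay_us: int = 100,
    gpu_instances: int = 2,
) -> str:
    """Render a partial Triton config.pbtxt (hypothetical helper, illustrative values)."""
    return (
        f'name: "{model_name}"\n'
        'platform: "pytorch_libtorch"\n'
        # Upper bound on how many requests Triton may fuse into one batch.
        f"max_batch_size: {max_batch_size}\n"
        # Dynamic batching: wait briefly so concurrent requests can be grouped.
        "dynamic_batching {\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
        # Concurrent model execution: several model instances on the GPU.
        "instance_group [\n"
        "  {\n"
        f"    count: {gpu_instances}\n"
        "    kind: KIND_GPU\n"
        "  }\n"
        "]\n"
    )


config_text = render_triton_config("my_torch_model")  # hypothetical model name
print(config_text)
```

The queue delay is the usual trade-off knob: a longer delay tends to produce fuller batches and higher throughput, at the cost of a little added latency per request.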
Exam trap
Google Cloud exams often test the misconception that model optimization alone (e.g., quantization) is sufficient for low-latency serving. In practice, the inference server's request handling and batching capabilities are just as critical for minimizing latency and maximizing throughput in online predictions.
How to eliminate wrong answers
Option A is wrong because a custom container without any inference server lacks request batching, model queuing, and GPU utilization optimizations, leading to higher latency and lower throughput under concurrent requests.
Option C is wrong because Vertex AI Model Optimization for FP16 quantization reduces model size and can improve throughput, but it does not address the serving infrastructure needed for low-latency online predictions; the deployment still requires an inference server like Triton to handle request management and GPU scheduling.
Option D is wrong because batch prediction is designed for high-throughput, offline processing of large datasets and typically has higher latency per request due to job queuing and resource provisioning, making it unsuitable for minimizing prediction latency in online scenarios.
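The batching argument against Option A can be made concrete with a toy cost model. This is a sketch, not a benchmark: it assumes every GPU forward pass pays a fixed overhead plus a per-item cost, and that all requests arrive at once on a single GPU. The function `total_time_ms` and the numbers `overhead_ms=5.0` and `per_item_ms=1.0` are made up for illustration and are not measurements.

```python
def total_time_ms(num_requests, batch_size, overhead_ms=5.0, per_item_ms=1.0):
    """Time to drain requests that arrive together on one GPU (toy model)."""
    full_batches, remainder = divmod(num_requests, batch_size)
    # Each full batch pays the fixed overhead once, plus the per-item cost.
    elapsed = full_batches * (overhead_ms + per_item_ms * batch_size)
    if remainder:
        # A final, partially filled batch still pays the fixed overhead.
        elapsed += overhead_ms + per_item_ms * remainder
    return elapsed


# One request per forward pass, as in a server with no batching.
unbatched = total_time_ms(32, batch_size=1)
# Up to 8 requests fused per forward pass, as a dynamic batcher might do.
batched = total_time_ms(32, batch_size=8)

print(f"unbatched: {unbatched} ms, batched: {batched} ms")
# prints "unbatched: 192.0 ms, batched: 52.0 ms"
```

Under these assumed costs the same 32 requests finish in roughly a quarter of the time, because the fixed per-pass overhead is paid 4 times instead of 32. Real gains depend on the model, the GPU, and the traffic pattern.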