Your organization has deployed a generative AI model for a multilingual translation service on OCI Model Deployment. The model is a 13B-parameter transformer hosted on the VM.GPU.A100.1 shape with 2 replicas. Recently, the service has experienced intermittent timeouts whenever a burst of requests arrives. You have enabled autoscaling based on CPU utilization, but it scales too slowly. After investigation, you find that model inference time is highly variable because sequence lengths differ. You need to ensure the service can handle sudden spikes without timeouts. Which solution should you implement?
Queuing decouples traffic spikes from the model, preventing timeouts.
Why this answer
Option A is correct because a request queue (e.g., OCI Queue) decouples request ingestion from processing: the service accepts a burst immediately, buffers it, and processes it asynchronously. The queue does not make inference faster or less variable. Instead, it absorbs the spike so that the replicas can work through requests at their own pace, and a long-sequence request no longer causes the synchronous calls waiting behind it to time out. Autoscaling on CPU utilization reacts too slowly to sudden spikes, whereas a queue gives immediate relief because requests are held rather than dropped.
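The buffering behavior described above can be sketched with a small simulation. This is a minimal illustration, not OCI code: Python's in-memory queue.Queue stands in for a managed queue such as OCI Queue, time.sleep stands in for model inference, and the function name simulate_burst, the request count, and the durations are all hypothetical.

```python
import queue
import random
import threading
import time


def simulate_burst(num_requests=20, num_workers=2, seed=0):
    """Accept a burst into a queue, then drain it with a fixed worker pool."""
    rng = random.Random(seed)
    # Hypothetical, highly variable "inference" times in seconds (scaled down).
    durations = [rng.uniform(0.001, 0.01) for _ in range(num_requests)]

    requests = queue.Queue()  # in-memory stand-in for a managed request queue
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = requests.get()
            if item is None:  # sentinel: no more work for this worker
                requests.task_done()
                return
            req_id, duration = item
            time.sleep(duration)  # stand-in for model inference
            with lock:
                results[req_id] = duration
            requests.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The burst: every request is accepted at once, regardless of worker speed.
    for req_id, duration in enumerate(durations):
        requests.put((req_id, duration))
    for _ in threads:
        requests.put(None)  # one sentinel per worker

    requests.join()  # block until every queued item has been processed
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    done = simulate_burst()
    print(f"processed {len(done)} of 20 requests")
```

The two workers mirror the two replicas in the scenario: the burst is accepted immediately, and no request is dropped even though processing times vary.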
Exam trap
The trap here is that candidates often assume autoscaling (option B or D) is sufficient for burst handling, but they overlook that autoscaling has inherent latency (minutes to provision new replicas), whereas a request queue provides immediate buffering to absorb spikes without dropping requests.
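A rough back-of-the-envelope calculation shows why scaling latency matters. The sketch below assumes a fixed replica count while new capacity is still being provisioned; the function names and every number (burst size, average inference time, client timeout) are hypothetical and chosen only for illustration.

```python
import math


def finished_before_timeout(burst_size, replicas, avg_inference_s, client_timeout_s):
    """Requests that complete within the client timeout when callers wait synchronously."""
    # Each replica handles one request at a time in this simplified model.
    per_replica = math.floor(client_timeout_s / avg_inference_s)
    return min(burst_size, replicas * per_replica)


def drain_time_s(burst_size, replicas, avg_inference_s):
    """Approximate time to work through a queued burst with a fixed replica count."""
    return burst_size * avg_inference_s / replicas


if __name__ == "__main__":
    # Hypothetical burst: 100 requests, 2 replicas, 2 s average inference, 30 s timeout.
    print(finished_before_timeout(100, 2, 2, 30))  # 30 finish; the other 70 time out
    print(drain_time_s(100, 2, 2))                 # 100.0 seconds to drain the queue
```

Under these assumed numbers, synchronous callers lose 70 of 100 requests well before any new replica could come online, while a queued burst is fully processed in about 100 seconds by the existing replicas.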
How to eliminate wrong answers
Option B is wrong because increasing the maximum number of replicas and prewarming them only helps if the scaling mechanism reacts fast enough. It does not address the root cause of variable inference times, and it still relies on autoscaling, which is too slow for sudden bursts.
Option C is wrong because reducing the model to 7B parameters would degrade translation quality and does not solve the intermittent timeouts caused by variable sequence lengths. It might reduce average inference time, but it would not eliminate spikes.
Option D is wrong because autoscaling on the number of messages in the request queue is still reactive and subject to the latency of provisioning new replicas, so it does not prevent timeouts during the scaling delay. The queue itself is the primary solution because it buffers the requests.