PMLE • Practice Test 37
Free PMLE practice test — 15 questions with explanations. Set 37. No signup required.
You are a machine learning engineer at a financial technology company. You have deployed a complex ensemble model consisting of three sub-models (XGBoost, TensorFlow, and PyTorch) for real-time fraud detection. The model is served on Vertex AI online prediction with a custom container that orchestrates the three models sequentially. The endpoint currently uses n1-highmem-8 machines with no accelerators. You are experiencing high latency (500 ms on average) during peak trading hours (9:30 AM to 4:00 PM EST), which exceeds the 200 ms SLA. The container is CPU-bound, and memory usage is around 60%. The model weights total 500 MB. You have already tried increasing the batch size per request from 1 to 4, which reduced latency slightly but not enough to meet the SLA. The traffic pattern is very spiky, with sudden bursts of up to 1,000 requests per second. Your goal is to meet the latency SLA without significantly increasing cost. Which action should you take?