PMLE Serving and scaling models Practice Test 5 — 15 Questions

Question 1

A data engineer is troubleshooting a Vertex AI Endpoint that serves a large BERT model. After deployment, many prediction requests fail with 'Out of Memory' errors. The machine type is n1-standard-8 (30 GB memory) with no accelerator. Which action will most likely resolve the issue?

Accepted Answer

Change the machine type to n1-highmem-16 (104 GB memory). Option C is correct because n1-standard-8 has only 30 GB of memory, which may be insufficient for a large BERT model: the weights alone may take only around 1.5 GB, but intermediate tensors during inference can push usage past 30 GB. Upgrading to a high-memory machine type provides more memory. Option A is wrong because adding a GPU does not increase system memory. Option B is wrong because quantization shrinks the model but does not necessarily prevent memory spikes during inference. Option D is wrong because batch prediction does not serve real-time requests, and OOM errors could still occur.

Answer

Use batch prediction instead of online prediction.

Answer

Add a GPU accelerator (e.g., NVIDIA T4) to offload computation.

Answer

Quantize the model from FP32 to INT8.
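
A rough way to reason about Question 1 is to compare an estimated memory footprint with what each machine type offers. The sketch below is illustrative only: 340 million parameters is the commonly cited size of BERT-large, while the 35 GB peak usage and the 20% headroom are assumed values, not measurements. The function names are hypothetical.

```python
def weights_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_in_memory(required_gb: float, machine_gb: float,
                   headroom: float = 0.2) -> bool:
    """True if the requirement fits after reserving a fraction of RAM
    for the OS and the serving process (headroom is an assumption)."""
    return required_gb <= machine_gb * (1 - headroom)


# BERT-large in FP32: weights alone are small compared with 30 GB.
bert_large_weights = weights_gb(340_000_000)

# Assumed peak usage once intermediate tensors are included.
ASSUMED_PEAK_GB = 35.0

fits_standard_8 = fits_in_memory(ASSUMED_PEAK_GB, 30.0)    # n1-standard-8
fits_highmem_16 = fits_in_memory(ASSUMED_PEAK_GB, 104.0)   # n1-highmem-16
```

Under these assumed numbers the weights are a small part of the footprint, which is why a larger-memory machine, rather than a GPU, addresses the OOM errors.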

Question 2

After deploying a new version of a model to a Vertex AI Endpoint, the team notices that predictions are still returning results from the old version. The deployment command used a traffic split of 100% to the new version. What is the most likely cause?

Accepted Answer

The traffic split was not properly updated; the endpoint is still routing 100% to the old version. Option A is correct because the traffic split update may not have taken effect, for example if the command failed silently or the new version is not healthy, so the endpoint keeps routing traffic to the old version. Option B is wrong because caching is not a typical issue for a Vertex AI Endpoint. Option C is wrong because the deployment itself succeeded; it is the traffic split that may need an explicit update. Option D is wrong because a stale model artifact would affect only the new version.

Answer

The model artifact uploaded was identical to the old version.

Answer

The new model version failed health checks and was automatically rolled back.

Answer

The prediction client is caching the old model response.
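
For Question 2, a useful first check is to look at the endpoint's traffic split, which maps deployed model IDs to integer percentages that sum to 100. The helper below is a hypothetical, SDK-free sketch of that check; the IDs `old-id` and `new-id` are made up for illustration.

```python
from typing import Dict, List


def serving_models(traffic_split: Dict[str, int]) -> List[str]:
    """Return the deployed model IDs that actually receive traffic.

    Raises ValueError if the percentages do not sum to 100.
    """
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"traffic split sums to {total}, expected 100")
    return sorted(mid for mid, pct in traffic_split.items() if pct > 0)


# Hypothetical state after an update that did not take effect:
# the new model is deployed but still receives 0% of the traffic.
observed_split = {"old-id": 100, "new-id": 0}
still_serving = serving_models(observed_split)
```

If the list still contains only the old ID, the split was never updated, which matches the accepted answer.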

Question 3

Which TWO actions can help reduce prediction latency for a model deployed on Vertex AI Endpoint without changing the model architecture?

Accepted Answer

Attach a GPU accelerator to the endpoint's machine type. Options A and D are correct. Option A (GPU accelerator) can significantly speed up inference for deep learning models. Option D (model quantization) reduces model size and inference time. Option B (increasing batch size) increases latency per request. Option C (multi-region deployment) reduces network latency but not prediction latency. Option E (smaller machine type) may increase latency.

Answer

Increase the batch size of prediction requests.

Answer

Deploy the model in multiple regions and use global load balancing.

Answer

Use a smaller machine type to reduce complexity.

Question 4

A company deploys a model to Vertex AI Endpoint with autoscaling enabled. During a traffic spike, they observe high tail latency (99th percentile > 2s). Which TWO factors are most likely contributing to this latency?

Accepted Answer

The min_replica_count is set too low, causing cold starts. Options A and C are correct. Option A: if the minimum replica count is too low, new replicas must be created and loaded with the model during the spike, causing cold start latency. Option C: a large model file increases cold start time because each new replica has to load it. Option B (underpowered machine) would cause high average latency, not just high tail latency. Option D (too many traffic splits) is unrelated. Option E (target CPU utilization set too low) would trigger scaling earlier and reduce tail latency; a target set too high would delay scaling.

Answer

The machine type is underpowered for the model.

Answer

The autoscaling target_cpu_utilization is set too low (e.g., 0.3).

Answer

The endpoint has too many traffic splits configured.
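
The cold start reasoning in Question 4 can be made concrete with a little arithmetic. This is a minimal sketch under assumed numbers (500 requests/s at the peak, 50 requests/s per replica, a 4 GB model, 0.25 GB/s load throughput, 60 s to provision a replica); none of these figures come from Vertex AI, and the function names are hypothetical.

```python
import math


def replicas_needed(rps: float, rps_per_replica: float) -> int:
    """Replicas required to absorb the given request rate."""
    return math.ceil(rps / rps_per_replica)


def cold_start_seconds(model_gb: float, load_gb_per_s: float,
                       provision_s: float) -> float:
    """Time before a new replica can serve: provisioning plus model load."""
    return provision_s + model_gb / load_gb_per_s


warm_replicas = 1                                   # low min_replica_count
needed = replicas_needed(500.0, 50.0)               # replicas for the spike
to_start = max(0, needed - warm_replicas)           # replicas that start cold
delay = cold_start_seconds(4.0, 0.25, 60.0)         # seconds until they help
```

With these assumptions, nine of the ten required replicas start cold and none of them can serve for over a minute, so requests queue on the single warm replica and tail latency rises. A higher minimum replica count or a smaller model shortens that window.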

Question 5

A team wants to serve a large PyTorch model (3 GB) for online predictions with low latency. Which THREE actions should they take?

Accepted Answer

Use a custom container that preloads the model into memory. Options A, B, and E are correct. Option A: a GPU accelerator speeds up inference. Option B: model optimization (TorchScript, quantization) reduces inference time. Option E: a custom container that preloads the model reduces cold start latency. Option C (multi-region) reduces network latency, not prediction latency. Option D (batch prediction) is not meant for online serving.

Answer

Use batch prediction instead of online prediction.

Answer

Deploy in multiple regions with Cloud Load Balancing.

Question 6

You deploy a PyTorch model to Vertex AI Online Prediction. After deployment, you observe that inference latency is approximately 300ms per request, but the desired SLA is under 100ms. The model uses a custom container with CPU only. Which action is most likely to reduce latency to the target?

Accepted Answer

Deploy the model on a machine with a GPU accelerator. Enabling GPU acceleration can significantly speed up inference for deep learning models. Adding more CPU instances may help with throughput but not with per-request latency. Switching to batch prediction changes the use case, and a smaller machine type provides less compute, so it is unlikely to bring latency down to the target.

Answer

Switch from online prediction to batch prediction.

Answer

Increase the min_replica_count to ensure more instances are always available.

Answer

Use a smaller machine type with less CPU to reduce overhead.

Question 7

Your team is serving a large language model on Vertex AI using a custom container. The endpoint experiences intermittent 502 errors during traffic spikes. The autoscaling configuration uses a CPU utilization target of 60% and the model is deployed on n1-standard-4 instances. The model requires significant memory. Which combination of changes is most likely to resolve the issue?

Accepted Answer

Switch to a machine type with more memory, e.g., n1-highmem-8, and increase min_replica_count. The 502 errors likely indicate that the instances are overwhelmed or timing out. Moving to a high-memory machine type reduces memory pressure, and adding replicas, through a lower target scaling metric or a higher minimum replica count, provides capacity. Tuning the batch size helps but is secondary. A GPU may not help if the issue is memory.

Answer

Increase the target CPU utilization to 90% to allow more requests per instance.

Answer

Enable canary traffic splitting to reduce load on the main endpoint.

Answer

Reduce the model batch size from 32 to 1 to lower memory per request.

Question 8

You need to serve a TensorFlow model that has a cold start latency of 20 seconds. The model is used for a real-time application with unpredictable traffic, but occasional bursts require immediate responses. What is the best deployment strategy to minimize both cold start impact and cost?

Accepted Answer

Set min_replica_count to 1 to keep at least one instance always warm. Setting a minimum number of replicas (min_replica_count) ensures that some instances are always warm, which avoids a cold start on the first requests. This balances cost and latency. Prewarming requests or increasing target utilization would not help directly.

Answer

Use a larger machine type to reduce cold start time.

Answer

Set min_replica_count to 0 and rely on autoscaling to handle bursts.

Answer

Enable serving on Cloud Run for faster cold start.

Question 9

Your team deploys a multi-model endpoint on Vertex AI with two models: Model A (small, low latency) and Model B (large, high latency). You configure traffic splitting so that 90% goes to Model A and 10% to Model B. However, you notice that the latency for Model A increases when Model B receives traffic. What is the most likely cause?

Accepted Answer

The models are collocated on the same instances, leading to resource contention. When the models behind an endpoint are co-hosted on the same underlying instances, they share CPU and memory. While Model B handles requests it consumes those resources, and the resulting contention degrades Model A's latency.

Answer

Model A is being overloaded because autoscaling is based on aggregate traffic.

Answer

The traffic split is misconfigured, causing requests to be routed incorrectly.

Answer

Model B's logging is generating too much output, slowing down the predictor.

Question 10

You are deploying a scikit-learn model for online predictions. The model size is 200 MB. You want to minimize latency and cost. Which serving option should you choose?

Accepted Answer

Deploy to Vertex AI online prediction using a prebuilt container for scikit-learn. Vertex AI online prediction with a prebuilt scikit-learn container is well suited to this model: no custom container has to be built, and Vertex AI hosts and scales the serving container. Using AI Platform or Cloud Functions with a 200 MB model might hit limits.

Answer

Use Cloud Run with a custom container.

Answer

Create a Kubernetes cluster on GKE and deploy the model there.

Answer

Export the model as a Cloud Function.

Question 11

A company is serving a model for their e-commerce website. They expect traffic to be low at night and very high during flash sales. They want to minimize costs while ensuring availability during spikes. Which autoscaling configuration should they use?

Accepted Answer

min_replica_count=1, max_replica_count=20, target_cpu=60. Setting a high max_replica_count allows the endpoint to scale out during spikes, while a low min_replica_count saves cost during low traffic. A CPU utilization target of 60% is reasonable.

Answer

min_replica_count=5, max_replica_count=5, target_cpu=60

Answer

min_replica_count=10, max_replica_count=10, target_cpu=60

Answer

min_replica_count=0, max_replica_count=100, target_cpu=80
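
The cost trade-off in Question 11 can be sketched by counting replica-hours over a day. The traffic profile and per-replica capacity below are assumptions chosen for illustration (20 quiet hours at 10 requests/s, 4 flash-sale hours at 900 requests/s, 50 requests/s per replica at the 60% CPU target); the function names are hypothetical and this is not how the Vertex AI autoscaler is implemented.

```python
import math
from typing import List


def replicas_for_load(rps: float, rps_per_replica: float,
                      min_replicas: int, max_replicas: int) -> int:
    """Replica count for a given load, clamped to [min, max]."""
    wanted = math.ceil(rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def daily_replica_hours(hourly_rps: List[float], rps_per_replica: float,
                        min_replicas: int, max_replicas: int) -> int:
    """Sum of replicas running in each hour of the day."""
    return sum(
        replicas_for_load(rps, rps_per_replica, min_replicas, max_replicas)
        for rps in hourly_rps
    )


traffic = [10.0] * 20 + [900.0] * 4   # assumed hourly request rates
CAPACITY = 50.0                       # assumed requests/s per replica

peak_needed = math.ceil(900.0 / CAPACITY)
elastic = daily_replica_hours(traffic, CAPACITY, 1, 20)    # min=1, max=20
fixed_5 = daily_replica_hours(traffic, CAPACITY, 5, 5)     # min=max=5
fixed_10 = daily_replica_hours(traffic, CAPACITY, 10, 10)  # min=max=10
```

Under these assumptions the elastic configuration uses the fewest replica-hours and is the only one of the three that can reach the replica count the peak requires; the fixed configurations pay for idle capacity at night and still fall short during the flash sale.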

Question 12

Your model serving endpoint on Vertex AI is experiencing increased memory usage after a recent update. The model was converted from TensorFlow to TF Lite for faster inference. You notice that the endpoint's instances occasionally get killed due to out-of-memory (OOM) errors. What is the most likely cause?

Accepted Answer

The number of inference threads in the TF Lite runtime is set too high, causing excessive memory consumption. A TF Lite model can have a different memory footprint depending on the number of threads used for inference. If the custom container or the runtime allocates many threads, memory usage can spike. The model conversion itself may not reduce memory; thread count is a key factor.

Answer

The TF Lite model is larger in size than the original model.

Answer

The Vertex AI endpoint is not configured with enough CPU.

Answer

The traffic to the endpoint has increased significantly.

Question 13

You are using Vertex AI continuous evaluation (model monitoring) for your deployed model. You receive an alert that the prediction distribution is significantly different from the training distribution. What should you do first?

Accepted Answer

Analyze the input data to understand if there is a skew or drift. When a monitoring alert triggers, the first step is to investigate the root cause: check whether the input data has changed, whether retraining is needed, or whether there is a data pipeline issue. Simply rolling back or retraining without analysis might be premature.

Answer

Roll back the model to the previous version immediately.

Answer

Increase the alerting threshold to reduce false positives.

Answer

Retrain the model using the latest data and redeploy.

Question 14

You have a model that requires GPU for efficient inference. You deploy it on Vertex AI with a single NVIDIA T4 GPU accelerator and notice that the GPU utilization hovers around 30%. The endpoint has 10 replicas. What is the best way to improve cost efficiency while maintaining throughput?

Accepted Answer

Reduce the number of replicas to increase GPU utilization per instance. If GPU utilization is low, you can reduce the number of replicas or increase the batch size per request to use the GPU more fully. Reducing replicas directly saves cost. Increasing the batch size may also help but requires code changes.

Answer

Use a larger GPU like V100 to process requests faster.

Answer

Enable autoscaling to increase the number of replicas.

Answer

Switch to a CPU-only instance; the model can run on CPU.
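
The replica reduction in Question 14 can be estimated with simple proportional reasoning. The sketch below assumes that total GPU work stays constant and spreads evenly across replicas, which is a simplification; the 60% target utilization is an assumed value, and the function name is hypothetical.

```python
def replicas_for_target(current_replicas: int, current_util_pct: int,
                        target_util_pct: int) -> int:
    """Replicas needed to carry the same load at a higher utilization.

    Uses integer percentages and ceiling division to avoid
    floating-point rounding surprises.
    """
    if not 0 < target_util_pct <= 100:
        raise ValueError("target_util_pct must be in (0, 100]")
    total_work = current_replicas * current_util_pct
    # Ceiling division, never dropping below one replica.
    return max(1, -(-total_work // target_util_pct))


# 10 replicas at about 30% GPU utilization, aiming for an assumed 60%.
suggested = replicas_for_target(10, 30, 60)
```

Under these assumptions the same throughput fits on half the replicas, which is where the cost saving in the accepted answer comes from. In practice some headroom should remain for traffic spikes.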

Question 15

Which TWO actions can help reduce the latency of online prediction requests for a deep learning model served on Vertex AI?

Accepted Answer

Use a GPU accelerator for the deployed model. Using a GPU accelerator speeds up inference, and batching requests reduces overhead per request. Minimizing replicas does not help latency, and adding CPU does not always help when a GPU is the better fit.

Answer

Increase the number of vCPUs per machine.

Answer

Set min_replica_count to 0 to avoid idle instances.

Answer

Decrease the number of replicas to reduce resource contention.