PMLE Serving and scaling models Practice Test 4 — 15 Questions

Question 1

A team deploys a model using Vertex AI Endpoint with automatic scaling. They observe that during traffic spikes, new instances take a long time to become ready, causing high latency for some requests. What should they configure to reduce this startup time?

Accepted Answer

Use a custom container with a smaller footprint. Option D is correct because using a custom container with a smaller footprint (e.g., smaller base image, fewer dependencies) reduces the time to pull and initialize the container. Option A increases max replicas but does not speed up startup. Option B may help trigger scaling earlier but startup time remains. Option C is not a standard setting.

Answer

Increase the max replicas

Answer

Enable predictive autoscaling

Answer

Set a higher target CPU utilization

Question 2

A company uses Vertex AI Endpoints for model serving and wants to implement A/B testing between model versions. They need to gradually shift traffic from the old to the new version while monitoring performance. Which Vertex AI feature allows this with minimal operational overhead?

Answer

Using a custom load balancer with weighted backend services

Answer

Model Deployments with traffic splitting

Answer

Vertex AI Experiments for tracking

Answer

Cloud Run revisions with traffic migration

Question 3

Which TWO options are best practices for reducing model serving latency on Vertex AI Endpoints? (Choose two.)

Accepted Answer

Optimize the model using quantization or pruning. Options C and E are correct. Deploying in the same region as clients reduces network latency. Optimizing the model (quantization or pruning) reduces compute time, usually without major accuracy loss. Option A increases cost but does not necessarily reduce latency. Option B is not a feature. Option D increases latency because batch prediction processes requests asynchronously in bulk.

Answer

Use a larger machine type with more memory

Answer

Use batch prediction instead of online prediction

Answer

Enable model caching at the endpoint
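
To make the quantization idea in Question 3 concrete, here is a minimal pure-Python sketch of symmetric int8 quantization. It is an illustration only: the function names and sample weights are hypothetical, and real deployments would normally rely on a framework's own quantization tooling.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against all-zero weights, where any positive scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four or eight, which is where the reduction in memory traffic and compute time comes from; the small `max_error` is the accuracy cost.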

Question 4

Which THREE factors should be considered when choosing between using Vertex AI Endpoints and Cloud Run for model serving? (Choose three.)

Accepted Answer

Built-in model monitoring. Options A, B, and C are the key differentiators. Vertex AI Endpoints support GPUs natively, while Cloud Run has limited GPU support. Cloud Run scales to zero by default, while Vertex AI Endpoints generally keep at least one replica running. Vertex AI Endpoints have built-in model monitoring; Cloud Run does not. Options D and E are less differentiating because both services have similar cost structures and container requirements.

Answer

Complexity of model containerization

Answer

Cost per request

Question 5

Which TWO options can help detect model performance degradation in production? (Choose two.)

Accepted Answer

Vertex AI Model Monitoring (drift detection). Options A and E are correct. Vertex AI Model Monitoring detects drift in input features, which can indicate performance degradation. Storing predictions in BigQuery and comparing with ground truth labels directly measures performance. Option B monitors infrastructure, not model performance. Option C is training-time. Option D logs errors but not degradation.

Answer

Vertex AI Experiments on historical data

Answer

Cloud Logging for prediction errors

Answer

Cloud Monitoring custom metrics from serving logs
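
The ground-truth comparison described in Question 5 can be sketched as follows. This is a simplified, assumed workflow: the predictions and labels are hard-coded hypothetical lists rather than rows read from BigQuery, and the 0.05 tolerance is an arbitrary illustrative threshold.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labeled example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def has_degraded(baseline_acc, current_acc, tolerance=0.05):
    """Flag degradation when accuracy drops more than `tolerance` below baseline."""
    return (baseline_acc - current_acc) > tolerance


# Hypothetical ground-truth labels and two prediction windows.
labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
baseline = accuracy([1, 0, 1, 1, 0, 1, 0, 1, 1, 0], labels)
recent = accuracy([1, 1, 1, 0, 0, 1, 1, 1, 0, 0], labels)

degraded = has_degraded(baseline, recent)
```

Unlike drift detection, this check needs labels, so it only becomes available once ground truth arrives for the logged predictions.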

Question 6

Refer to the exhibit. A data scientist deploys a new model version (model_v2) to an existing endpoint with 20% traffic. After a few days, they notice that model_v2's error rate is higher than model_v1's. They want to route all traffic back to model_v1 immediately. Which command achieves this with minimal disruption?

Accepted Answer

gcloud ai endpoints update my-endpoint --region=us-central1 --traffic-split=model_v1=100,model_v2=0. Option A is correct. The gcloud ai endpoints update command with --traffic-split allows setting the traffic split to 100% for model_v1 and 0% for model_v2, routing all traffic to the stable model without redeploying. Option B removes the model, which may cause temporary unavailability. Option C uses a misspelled command. Option D changes the endpoint's update time but not traffic.

Answer

gcloud ai endpoints update my-endpoint --region=us-central1 --remove-deployed-model=model_v2

Answer

gcloud ai endpoints undepoly-model my-endpoint --region=us-central1 --model=model_v2

Answer

gcloud ai endpoints update-traffic my-endpoint --region=us-central1 --model=model_v1 --traffic-percentage=100
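
As a sketch of the rollback in Question 6, the helper below builds a traffic split that sends everything to the stable model while keeping the other model deployed at 0%. It assumes split values are percentages that sum to 100; the function names are hypothetical and nothing here calls gcloud or the Vertex AI API.

```python
def rollback_split(current_split, stable_id):
    """Return a new split that sends 100% of traffic to `stable_id`.

    Other deployed models stay listed at 0% so they remain deployed.
    """
    if stable_id not in current_split:
        raise KeyError(f"{stable_id!r} is not deployed on this endpoint")
    new_split = {model_id: 0 for model_id in current_split}
    new_split[stable_id] = 100
    return new_split


def to_flag_value(split):
    """Format a split as a comma-separated ID=PERCENT list."""
    return ",".join(f"{model_id}={pct}" for model_id, pct in split.items())


# The 80/20 split from the scenario, rolled back to model_v1.
split = rollback_split({"model_v1": 80, "model_v2": 20}, "model_v1")
flag_value = to_flag_value(split)
```

Keeping model_v2 in the split at 0% rather than undeploying it means traffic can be shifted back later without a new deployment.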

Question 7

Refer to the exhibit. A team deploys a model with the above configuration. They observe that during traffic spikes, the endpoint does not scale up quickly enough, causing increased latency. The average CPU utilization never exceeds 50%. What is the most likely reason for the slow scaling?

Accepted Answer

The autoscaling metric is not configured. Option C is correct. The configuration shows strategy: manual, meaning autoscaling is disabled. Without autoscaling, the endpoint does not add instances in response to load. Option A is wrong because a higher minReplicaCount still leaves scaling manual. Option B is wrong because changing the machine type also leaves scaling manual. Option D is irrelevant because CPU utilization is low.

Answer

The minReplicaCount is too low

Answer

The accelerator is causing a bottleneck

Answer

The machineType does not have enough CPU

Question 8

Refer to the exhibit. A team deploys a model using Cloud Run. They notice that after scaling up, the new instances take about 90 seconds to become ready and serve requests. They want to reduce this startup time. Which configuration change is most likely to help?

Accepted Answer

Change the container image to use a smaller base image. Option D is correct. Using a smaller container image (e.g., a minimal base image) reduces pull and initialization time, directly lowering startup latency. Option A increases concurrency but does not affect startup. Option B shortens the probe delay, but the instance does not become ready any sooner. Option C reduces memory and could cause out-of-memory errors if the model requires more.

Answer

Reduce the startupProbe initialDelaySeconds to 30

Answer

Reduce the memory limit to 4Gi

Answer

Increase the containerConcurrency to 100

Question 9

A company deploys a custom TensorFlow model to Vertex AI Endpoint for online predictions. After deployment, prediction latency is consistently high (over 500ms) even under low traffic. The model is CPU-only and the default machine type (n1-standard-2) is used. Which action will most likely reduce prediction latency?

Accepted Answer

Change the machine type to n1-highcpu-16 with a GPU accelerator. Option A is correct because using a machine type with more CPUs or adding a GPU accelerator can reduce inference time for compute-intensive models. Option B is wrong because increasing max replicas does not improve single-request latency. Option C is wrong because batch size affects throughput, not latency per request. Option D is wrong because increasing min replicas reduces cold start but not steady-state latency.

Answer

Increase the max_replica_count to 10 to allow more parallel requests.

Answer

Set min_replica_count to 3 to ensure always-on capacity.

Answer

Increase the batch size in the prediction request.

Question 10

A data scientist runs a batch prediction job on Vertex AI using a custom container. The job processes a large JSONL file (10 GB) and fails with an out-of-memory error. The machine type is n1-standard-4 (15 GB memory). Which action should be taken to resolve the error while minimizing cost?

Accepted Answer

Use a machine type with more memory, such as n1-highmem-8 (52 GB). Option C is correct because out-of-memory errors suggest the machine's memory is insufficient for the model or data size; increasing to a high-memory machine type adds more memory. Option A is wrong because splitting input data does not reduce per-instance memory pressure if the model itself is large. Option B is wrong because the batch size may need adjustment but the primary issue is memory. Option D is wrong because using a GPU does not increase memory.

Answer

Reduce the batch size in the prediction request.

Answer

Split the input data into smaller files and run multiple batch jobs.

Answer

Add a GPU accelerator to offload computation.

Question 11

A company needs to serve a model for real-time predictions with a strict latency SLA of 100ms at the 99th percentile. The model is lightweight and traffic patterns are highly variable with occasional spikes. Which deployment strategy best meets the SLA while controlling cost?

Accepted Answer

Deploy to Vertex AI Endpoint with min_replica_count=3 and autoscaling enabled. Option D is correct because setting a minimum number of replicas ensures baseline capacity to handle initial spikes without cold start delays, while autoscaling handles larger spikes. Option A is wrong because batch prediction is not real-time. Option B is wrong because no scaling may cause over-provisioning or under-provisioning. Option C is wrong because Cloud Run with no accelerator may not meet latency SLA for ML models.

Answer

Deploy the model as a Cloud Run service with autoscaling to zero.

Answer

Deploy to Vertex AI Endpoint with manual scaling and a fixed number of replicas.

Answer

Use Vertex AI Batch Prediction.
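
To illustrate what a 99th-percentile SLA like the one in Question 11 means in practice, here is a small sketch that computes a percentile with the nearest-rank method and checks it against a 100 ms limit. The latency samples are invented for illustration, and monitoring systems may use a different percentile definition.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_sla(latencies_ms, sla_ms=100, pct=99):
    """True when the pct-th percentile latency is within the SLA."""
    return percentile_nearest_rank(latencies_ms, pct) <= sla_ms


# Hypothetical window of 100 requests: one slow outlier is tolerated at p99.
latencies = [20] * 98 + [90, 450]
ok = meets_sla(latencies)

# Two slow requests out of 100 (e.g., cold starts during a spike) break the SLA.
with_cold_starts = [20] * 97 + [90, 450, 460]
broken = not meets_sla(with_cold_starts)
```

The second window shows why cold starts matter for a p99 target: it takes only a couple of slow requests per hundred to violate it.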

Question 12

A machine learning team wants to perform A/B testing between two model versions (v1 and v2) on Vertex AI Endpoint. They need to gradually route 10% of traffic to v2 while monitoring performance. What is the most efficient way to achieve this?

Accepted Answer

Deploy both versions to the same endpoint and set traffic_split to 90% for v1 and 10% for v2. Option B is correct because Vertex AI Endpoint natively supports traffic splitting between model versions. Option A is wrong because creating separate endpoints adds complexity and cost. Option C is wrong because Cloud Load Balancing operates at the network level, not model level. Option D is wrong because batch prediction is not for real-time A/B testing.

Answer

Use a Cloud Load Balancer to route traffic based on a header.

Answer

Create two separate endpoints and use a weighted DNS round-robin.

Answer

Run batch predictions for v2 and log results separately.
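
The effect of a 90/10 split as in Question 12 can be simulated with weighted random routing. This is only a local simulation under the assumption that each request is routed independently according to the split; it does not describe how Vertex AI implements routing, and the function name is hypothetical.

```python
import random
from collections import Counter


def route_requests(traffic_split, num_requests, seed=0):
    """Simulate routing requests to deployed models according to percentage weights."""
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic split percentages must sum to 100")
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    model_ids = list(traffic_split)
    weights = [traffic_split[m] for m in model_ids]
    return Counter(rng.choices(model_ids, weights=weights, k=num_requests))


counts = route_requests({"v1": 90, "v2": 10}, num_requests=10_000)
v2_share = counts["v2"] / 10_000  # close to 0.10, not exactly 0.10
```

Because routing is probabilistic, the observed share for v2 only approximates 10%, so performance comparisons between versions need enough requests to be meaningful.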

Question 13

A team deploys a TensorFlow model using a custom container to Vertex AI Endpoint. The container expects the saved model at the /model directory, but predictions fail with a 'model not found' error. The team used the default Vertex AI serving container in the past. What is the most likely cause?

Accepted Answer

The container reads from a fixed directory /model, but Vertex AI mounts the model at /tmp/model. Option D is correct because Vertex AI exposes the location of the model artifacts through the environment variable AIP_STORAGE_URI rather than at a fixed /model path. The custom container must read from this location or copy the model. Option A is wrong because the model format is not the issue. Option B is wrong because Vertex AI does not require the model to be in a Cloud Storage bucket mounted at /gcs in this context. Option C is wrong because the container can be GPU-enabled; the error is about a file not being found.

Answer

The container does not have a GPU accelerator configured.

Answer

The model artifact must be downloaded from Cloud Storage and placed in /gcs.

Answer

The model was saved in a different format (e.g., SavedModel vs. HDF5).
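
A minimal sketch of the fix implied by Question 13: the container looks up AIP_STORAGE_URI instead of assuming /model. The helper name, the fallback behavior, and the example bucket are hypothetical, and fetching the artifacts from the returned location is left out.

```python
import os


def resolve_model_location(environ=None, default="/model"):
    """Prefer the location Vertex AI provides over a hard-coded path."""
    env = os.environ if environ is None else environ
    uri = env.get("AIP_STORAGE_URI", "").strip()
    # Fall back to the local default only when the variable is unset or empty,
    # for example when running the container outside Vertex AI.
    return uri if uri else default


# Illustrative value only; the bucket name is made up.
location = resolve_model_location({"AIP_STORAGE_URI": "gs://example-bucket/model/"})
local_fallback = resolve_model_location({})
```

Passing the environment as a parameter keeps the lookup easy to test locally without changing the process environment.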

Question 14

A company deploys a model on Vertex AI Endpoint and expects high traffic spikes during promotional events. The current configuration uses manual scaling with 2 replicas. Which autoscaling configuration should they use to handle spikes while minimizing cost during normal traffic?

Accepted Answer

Enable basic scaling with target_cpu_utilization=0.6 and set min_replica_count=2, max_replica_count=10. Option B is correct because basic scaling with a target metric (e.g., CPU utilization) automatically adjusts replicas based on load, reducing cost during low traffic and scaling up during spikes. Option A is wrong because no scaling cannot adapt. Option C is wrong because manual scaling requires constant adjustments. Option D is wrong because custom metric scaling is possible but basic scaling is simpler and sufficient for CPU-bound models.

Answer

Keep manual scaling but increase replicas to 10.

Answer

Set min_replica_count=2 and max_replica_count=10 with no scaling metric.

Answer

Use custom metric scaling with a Cloud Monitoring metric for prediction latency.
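
To show how a CPU target with minimum and maximum replica bounds interacts, as in Question 14, here is a sketch of a generic target-tracking rule. It is an assumption for illustration, not Vertex AI's documented algorithm; utilization is expressed as whole percentages (60 for 0.6) to keep the arithmetic exact.

```python
def desired_replicas(current_replicas, observed_cpu_pct, target_cpu_pct,
                     min_replicas, max_replicas):
    """Scale replicas so average CPU moves toward the target, within bounds."""
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    # Integer ceiling of current * observed / target.
    raw = -((-current_replicas * observed_cpu_pct) // target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# Normal traffic: low utilization, so the floor of 2 replicas applies.
quiet = desired_replicas(5, 10, 60, min_replicas=2, max_replicas=10)

# Promotional spike: 90% CPU on 2 replicas against a 60% target.
spike = desired_replicas(2, 90, 60, min_replicas=2, max_replicas=10)

# Extreme spike: the ceiling of 10 replicas caps the cost.
capped = desired_replicas(4, 300, 60, min_replicas=2, max_replicas=10)
```

The minimum keeps baseline capacity during normal traffic and the maximum bounds spend during promotions, which is the trade-off the question asks about.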

Question 15

A startup wants to deploy a small machine learning model for real-time predictions but has a very limited budget. Traffic is minimal and predictable. They want to avoid paying for idle resources. Which serving option is most cost-effective?

Accepted Answer

Deploy the model as a Cloud Run service using a custom container. Option B is correct because Cloud Run with a custom container can scale to zero when idle, incurring no cost when not in use. Option A is wrong because Vertex AI Endpoint requires at least one replica (min_replica_count >= 1). Option C is wrong because batch prediction is not real-time. Option D is wrong because deploying on a Compute Engine VM requires 24/7 cost even when idle.

Answer

Deploy the model on a single Compute Engine VM with a GPU.

Answer

Use Vertex AI Batch Prediction for each prediction request.

Answer

Deploy the model to Vertex AI Endpoint with min_replica_count=0.