PMLE Practice Test 24 — 15 Questions

Question 1

A team is scaling their prototype inference model to handle high-throughput requests with low latency. They use a custom container on Vertex AI Prediction. They notice that latency spikes occur under heavy load. What is the most effective strategy?

Accepted Answer

Optimize model serving with batching and model warm-up. Option C is correct because batching and model warm-up reduce per-request overhead and keep latency more consistent under load. Option A is wrong because adding CPUs may not help if the bottleneck is model inference computation. Option B is wrong because auto-scaling doesn't reduce latency spikes; it adds replicas over time. Option D is wrong because a GPU may speed up inference but does not specifically address latency spikes caused by load variation.

Answer

Enable auto-scaling with a higher minimum number of replicas.

Answer

Use a larger machine type with more CPUs.

Answer

Use a GPU-based machine.
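
The amortization argument behind Question 1's accepted answer can be sketched with simple arithmetic. This is an illustrative cost model, not a measurement of Vertex AI: the function name and the 10 ms / 2 ms figures are hypothetical, and it assumes each model call pays a fixed overhead plus a per-input cost.

```python
import math


def total_serving_time_ms(num_requests, batch_size, overhead_ms, per_item_ms):
    """Time to serve num_requests when each model call handles up to batch_size inputs.

    Every call pays a fixed overhead (hypothetical), plus compute per input.
    """
    num_calls = math.ceil(num_requests / batch_size)
    return num_calls * overhead_ms + num_requests * per_item_ms


# Hypothetical costs: 10 ms fixed overhead per call, 2 ms of compute per input.
unbatched = total_serving_time_ms(64, batch_size=1, overhead_ms=10, per_item_ms=2)
batched = total_serving_time_ms(64, batch_size=8, overhead_ms=10, per_item_ms=2)
print(unbatched, batched)  # 768 208
```

Under these assumptions the fixed overhead is paid 8 times instead of 64, which is where the saving comes from. Real batching also adds a short queueing delay while a batch fills, so the batch size and wait time need tuning.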

Question 2

A machine learning engineer is training a large-scale text classification model using a distributed strategy on TPUs. The training loss decreases normally but the validation loss starts increasing after a few epochs while training loss continues to decrease. The engineer suspects overfitting. Which technique is most appropriate to address this while scaling training?

Accepted Answer

Add dropout regularization. Option B is correct because dropout is a common technique for preventing overfitting in neural networks, and it can be applied in distributed training without major modifications. Option A is wrong because reducing the learning rate does not directly address overfitting. Option C is wrong because increasing batch size can sometimes help generalization but is not a primary anti-overfitting method. Option D is wrong because early stopping prevents further overfitting but does not address the cause during training.

Answer

Use early stopping with patience.

Answer

Reduce the learning rate.

Answer

Increase the batch size.
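
For readers who want to see what the dropout in Question 2's accepted answer does mechanically, here is a minimal pure-Python sketch of inverted dropout. It is an illustration only: the function name and signature are hypothetical, and a real TPU training job would use the framework's own dropout layer instead.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations.

    During training, each value is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    At inference time the input passes through untouched.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train_out = dropout([1.0] * 1000, rate=0.5, training=True, rng=rng)
eval_out = dropout([1.0, 2.0], rate=0.5, training=False, rng=rng)
```

Because each worker applies its own random mask to its own activations, nothing about this changes when training is distributed, which matches the explanation above.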

Question 3

An ML team is converting a prototype model to a production pipeline using Vertex AI. They want to ensure model versioning and lineage. Which two practices should they adopt? (Select TWO)

Accepted Answer

Use Vertex AI Model Registry to manage model versions. Options A and B are correct. Storing model artifacts in Cloud Storage with versioned directories and using Vertex AI Model Registry provide organized versioning and lineage tracking. Option C is wrong because keeping only the latest version loses history. Option D is wrong because using a separate GCP project per version is unnecessary and complex. Option E is wrong because training without tracking leaves no version history or lineage.

Answer

Only keep the latest model version to save storage.

Answer

Train models directly in production without tracking.

Answer

Use a separate GCP project for each model version.

Question 4

A data scientist needs to scale a prototype deep learning model to train on a massive dataset using multiple GPUs. Which three strategies are essential for efficient distributed training? (Select THREE)

Accepted Answer

Implement data parallelism. Options A, C, and E are correct. Data parallelism (C) is the foundation for scaling across GPUs. Synchronous gradient updates (A) are commonly used to maintain convergence quality. An optimized input pipeline (E) prevents I/O bottlenecks. Option B is wrong because asynchronous updates can cause convergence issues and are not essential. Option D is wrong because using a single large batch size across all workers is not essential; per-worker batch size must be tuned.

Answer

Use a single large batch size across all workers.

Answer

Use asynchronous gradient updates to reduce communication overhead.
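
The combination of data parallelism and synchronous gradient updates from Question 4 can be sketched without any framework. This is a toy simulation under stated assumptions: two hypothetical workers, equal-sized shards, and a one-parameter model y = w * x. The list average stands in for the all-reduce that a real multi-GPU strategy would perform.

```python
def mse_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, lr):
    """One data-parallel step.

    Each worker computes a gradient on its own shard, the gradients are
    averaged (the all-reduce), and every worker applies the same update.
    """
    grads = [mse_gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated from y = 2x
shards = [data[:2], data[2:]]  # two hypothetical workers, equal-sized shards

w = 0.0
for _ in range(50):
    w = synchronous_step(w, shards, lr=0.05)
print(round(w, 6))  # 2.0
```

With equal-sized shards, the averaged gradient is identical to the gradient over the full dataset, so all workers stay in lockstep. Asynchronous updates give up that property, which is the convergence concern the explanation raises.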

Question 5

A team has successfully trained a deep learning model on Vertex AI using a custom container and distributed training with TensorFlow. They want to serve this model for online predictions with low latency. They deploy the model to Vertex AI Endpoint with a single n1-standard-4 machine. During load testing, they observe that the median latency is 200ms, but the 99th percentile latency spikes to 2 seconds. The model is a complex neural network that takes variable-length text as input. Which approach will best reduce tail latency while maintaining throughput?

Accepted Answer

Implement request batching to process multiple inputs per request. Option C is correct because batching multiple requests together amortizes overhead and reduces per-request latency variability, particularly for variable-length inputs. Option A is wrong because increasing memory does not address compute-bound latency spikes. Option B is wrong because a GPU might improve throughput but would not necessarily reduce tail latency caused by input variability. Option D is wrong because autoscaling adds replicas over time but does not reduce per-request latency spikes.

Answer

Use autoscaling with a target CPU utilization of 70%.

Answer

Use a GPU machine type like n1-standard-4 with an attached GPU.

Answer

Increase the machine type to n1-highmem-8 to allocate more memory.

Question 6

A model serving team notices that during a flash sale, a real-time recommendation model experiences sudden spikes in traffic, causing some requests to time out. The endpoint is configured with `min_replica_count=3`, `max_replica_count=10`, and autoscaling metric set to `target_utilization=0.6` on CPU. Despite this, autoscaling is too slow. What change will most improve the autoscaling responsiveness?

Accepted Answer

Change the autoscaling metric to 'average request count per replica' with an appropriate target. Option A is correct because request count per replica is a direct measure of load, so it triggers autoscaling faster than CPU utilization. Option B is wrong because increasing target utilization makes scaling slower. Option C is wrong because GPU metrics are only relevant for GPU models. Option D is wrong because reducing min replicas may cause underprovisioning.

Answer

Add a custom metric based on GPU utilization, assuming the model uses GPU.

Answer

Increase the target CPU utilization to 0.8 to reduce the number of replicas and save cost.

Answer

Reduce `min_replica_count` to 1 to allow more aggressive scaling.
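
To make the reasoning in Question 6 concrete, here is a simplified target-tracking rule. It is an assumption-laden sketch, not Vertex AI's actual autoscaling algorithm: the function name, the 900 requests/s spike, and the 100 requests/s per-replica target are all hypothetical. Only the replica bounds of 3 and 10 come from the question.

```python
import math


def desired_replicas(total_load, target_per_replica, min_replicas, max_replicas):
    """Replicas needed to bring per-replica load down to the target, within bounds."""
    needed = math.ceil(total_load / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical flash-sale spike measured directly in requests per second.
spike = desired_replicas(900, 100, min_replicas=3, max_replicas=10)
print(spike)  # 9
```

A request-count signal rises the moment traffic arrives, so a rule like this can react immediately. CPU utilization is an indirect signal that lags behind the traffic that causes it.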

Question 7

A team uses Vertex AI Prediction with a custom container. They want to perform canary deployments by sending 5% of traffic to a new model version. Which method should they use?

Accepted Answer

Deploy two separate endpoints and use a load balancer. Option C is correct because Vertex AI endpoints support traffic splitting between deployed models, allowing a controlled canary rollout. Option A is not possible because endpoints cannot have separate traffic splits on different deployments without manual configuration. Option B is incorrect because Model Registry itself does not handle traffic splitting. Option D uses Cloud Run, which is not integrated with Vertex AI Prediction.

Answer

Create a new endpoint with manual traffic splitting

Answer

Use Cloud Run for serving with gradual rollout

Answer

Use the Vertex AI Model Registry and configure traffic splitting on the endpoint
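
The traffic splitting that Question 7's explanation relies on is expressed on a Vertex AI endpoint as a map from deployed model ID to an integer percentage, with the percentages summing to 100. The sketch below builds such a map for a 5% canary and simulates the routing locally. The helper names and model IDs are hypothetical, and the router is a local simulation, not the service's implementation.

```python
import random


def canary_traffic_split(stable_id, canary_id, canary_percent):
    """Build a traffic split map that sends canary_percent of traffic to the canary."""
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be strictly between 0 and 100")
    return {stable_id: 100 - canary_percent, canary_id: canary_percent}


def route(split, rng):
    """Pick a deployed model ID with probability proportional to its percentage."""
    ids = list(split)
    weights = [split[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Hypothetical deployed model IDs.
split = canary_traffic_split("stable-model", "canary-model", canary_percent=5)

rng = random.Random(42)  # fixed seed so the simulation is repeatable
hits = sum(route(split, rng) == "canary-model" for _ in range(10_000))
canary_share = hits / 10_000
```

Rolling back is then a matter of setting the canary's percentage to 0 and the stable model's to 100 on the same endpoint.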

Question 8

Which TWO options can help detect model performance degradation in production? (Choose two.)

Accepted Answer

Vertex AI Model Monitoring (drift detection). Options A and E are correct. Vertex AI Model Monitoring detects drift in input features, which can indicate performance degradation. Storing predictions in BigQuery and comparing with ground truth labels directly measures performance. Option B monitors infrastructure, not model performance. Option C is training-time. Option D logs errors but not degradation.

Answer

Vertex AI Experiments on historical data

Answer

Cloud Logging for prediction errors

Answer

Cloud Monitoring custom metrics from serving logs
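
Both detection approaches named in Question 8's explanation can be illustrated in a few lines. This is a sketch under assumptions: the function names, sample data, and the 0.1 threshold are hypothetical. The drift score shown is an L-infinity distance between categorical distributions, one simple choice among several.

```python
from collections import Counter


def linf_drift(train_values, serving_values):
    """Largest absolute difference in category frequency between two samples."""
    train = Counter(train_values)
    serving = Counter(serving_values)
    categories = set(train) | set(serving)
    return max(
        abs(train[c] / len(train_values) - serving[c] / len(serving_values))
        for c in categories
    )


def accuracy(predictions, labels):
    """Fraction of logged predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def degraded(baseline_acc, current_acc, max_drop):
    """Flag degradation when accuracy falls by more than max_drop."""
    return baseline_acc - current_acc > max_drop


drift = linf_drift(["a", "a", "b", "b"], ["a", "b", "b", "b"])
current = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
alert = degraded(baseline_acc=0.9, current_acc=current, max_drop=0.1)
print(drift, current, alert)  # 0.25 0.75 True
```

Drift is available immediately because it needs only inputs. The accuracy comparison is the direct measure, but it can only be computed once ground-truth labels arrive.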

Question 9

A company needs to serve a model for low-frequency inference requests (a few hundred per month) from multiple regions. The priority is simplicity and minimal cost without maintaining infrastructure. Which serving option should they choose?

Accepted Answer

Use Vertex AI Batch Prediction triggered as needed. Option D is correct because Vertex AI Batch Prediction runs on demand and is cost-effective for infrequent large batches. Option A is wrong because a real-time endpoint incurs a per-hour cost even when idle. Option B is wrong because Cloud Run is better suited to online serving than to offline jobs. Option C is wrong because Dataflow is more complex and designed for streaming.

Answer

Deploy a real-time Vertex AI Endpoint with min replicas set to 1.

Answer

Set up a Dataflow streaming pipeline to process requests.

Answer

Use Cloud Run with serving container and scale to zero.

Question 10

An organization needs to serve a large model (10 GB) with low latency across multiple regions. Which Vertex AI feature best meets this requirement?

Accepted Answer

Global endpoints. Option A is correct because Vertex AI Global Endpoints automatically route traffic to the nearest region with capacity, reducing latency for geographically distributed users. Option B is for batch jobs, not real-time. Option C is for private access within VPC, which does not address multi-region latency. Option D is for monitoring, not serving.

Answer

Private endpoints

Answer

Batch prediction

Answer

Model Monitoring

Question 11

Refer to the exhibit. A team deploys a model with the above configuration. They observe that during traffic spikes, the endpoint does not scale up quickly enough, causing increased latency. The average CPU utilization never exceeds 50%. What is the most likely reason for the slow scaling?

Accepted Answer

The autoscaling metric is not configured. Option C is correct. The configuration shows strategy: manual, meaning autoscaling is disabled. Without autoscaling, the endpoint does not add instances in response to load. Option A increases min replicas but still manual. Option B changes machine type but scaling remains manual. Option D is irrelevant because CPU utilization is low.

Answer

The minReplicaCount is too low

Answer

The accelerator is causing a bottleneck

Answer

The machineType does not have enough CPU

Question 12

A team deploys a real-time model using a custom container on Vertex AI Prediction. The container is large (5 GB) and cold starts are causing latency spikes. The endpoint is configured with `min_replica_count=0` to reduce cost. The team wants to keep the cost low while reducing cold starts. What is the best approach?

Accepted Answer

Set `min_replica_count=1` to keep at least one replica always warm. Option B is correct because configuring a minimum number of always-on replicas (e.g., 1) eliminates cold starts for most traffic. Option A is wrong because it may not help if the container is large. Option C is wrong because prebuilding images doesn't reduce cold-start overhead. Option D is wrong because an SSD can speed up the image download but cannot eliminate cold-start latency.

Answer

Use a prebuilt container for the model framework to reduce image size.

Answer

Enable container memory optimization to reduce startup time.

Answer

Provision a Persistent Disk (SSD) for the container image to speed up download.

Question 13

A data scientist uses Vertex AI Workbench to train a model and then deploys it to an endpoint. They want to automate the retraining and redeployment pipeline when new data arrives. Which service should they use?

Accepted Answer

Vertex AI Pipelines. Option C is correct because Vertex AI Pipelines provides a serverless, managed pipeline orchestration service that can automate retraining and redeployment. Option A (Cloud Composer) is a workflow orchestration service but is more complex and not as integrated with Vertex AI. Option B (Cloud Functions) is event-driven but lacks pipeline capabilities. Option D (Cloud Scheduler) is for scheduled jobs, not event-driven retraining.

Answer

Cloud Composer

Answer

Cloud Scheduler

Answer

Cloud Functions

Question 14

Refer to the exhibit. A team deploys a model using Cloud Run. They notice that after scaling up, the new instances take about 90 seconds to become ready and serve requests. They want to reduce this startup time. Which configuration change is most likely to help?

Accepted Answer

Change the container image to use a smaller base image. Option D is correct. Using a smaller container image (e.g., a minimal base image) reduces pull and initialization time, directly lowering startup latency. Option A increases concurrency but doesn't affect startup. Option B reduces the probe delay but the instance may not be ready earlier. Option C reduces memory but could cause OOM if model requires more.

Answer

Reduce the startupProbe initialDelaySeconds to 30

Answer

Reduce the memory limit to 4Gi

Answer

Increase the containerConcurrency to 100

Question 15

Which TWO are best practices for deploying models to Vertex AI Prediction? (Choose 2.)

Accepted Answer

Monitor prediction latency and error rates with Cloud Monitoring alerts. Options B and D are correct. Option A is wrong because the exact same environment may not be available. Option C is wrong because version aliases should be used for easy rollback. Option E is wrong because logging all inputs may cause privacy issues.

Answer

Log all raw prediction inputs and outputs for every request for auditing.

Answer

Always deploy the model in the same environment as training to avoid incompatibility.

Answer

Use the default model version alias 'default' for all deployments to simplify updates.