PMLE Serving and Scaling Models Practice Test 5 — 15 Questions

Question 1

A data science team has trained a custom TensorFlow model for real-time fraud detection. They need to deploy it on Vertex AI with minimal latency and support for multiple concurrent requests. The model requires a GPU for inference. Which machine type should they choose for the Vertex AI endpoint?

Accepted Answer

n1-standard-4 with NVIDIA T4. Option B is correct because the n1-standard-4 with NVIDIA T4 provides the GPU acceleration required for real-time inference, while the n1 machine family supports GPU attachments on Vertex AI. The T4 GPU is optimized for low-latency inference workloads, and the n1-standard-4 offers sufficient CPU and memory for serving a custom TensorFlow model with multiple concurrent requests.
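As a rough illustration, the chosen configuration could be written as keyword arguments for the `Model.deploy` method of the Vertex AI Python SDK. The replica counts are assumed values for this sketch and are not part of the question. The deploy call is left commented out because it needs the `google-cloud-aiplatform` package and an uploaded model.

```python
# Deployment settings mirroring the accepted answer.
deploy_kwargs = {
    "machine_type": "n1-standard-4",
    "accelerator_type": "NVIDIA_TESLA_T4",
    "accelerator_count": 1,
    # Assumed values, chosen only to illustrate serving concurrent requests.
    "min_replica_count": 1,
    "max_replica_count": 3,
}

# With google-cloud-aiplatform installed and `model` an aiplatform.Model:
# endpoint = model.deploy(**deploy_kwargs)

for name, value in deploy_kwargs.items():
    print(f"{name}={value}")
```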

Answer

n2-standard-8

Answer

e2-standard-4

Answer

n1-highmem-4

Question 2

You deploy a new version of a model to a Vertex AI endpoint and want to gradually shift traffic from the old version to the new version over 24 hours. The endpoint currently serves 100% traffic to the old version. What should you do?

Accepted Answer

Update the endpoint to split traffic between the two model versions using the traffic split configuration. Option C is correct because Vertex AI endpoints support a built-in traffic split configuration that allows you to gradually shift traffic between model versions deployed to the same endpoint. By updating the endpoint's traffic split percentages (e.g., from 100% old / 0% new to 0% old / 100% new over 24 hours), you can achieve a smooth, controlled rollout without changing client code or managing multiple endpoints.
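A minimal sketch of how such a rollout schedule might be computed, assuming evenly spaced steps and two hypothetical deployed-model IDs, "old-model" and "new-model". Each dictionary has the shape of an endpoint traffic split (deployed model ID mapped to an integer percentage, summing to 100). The sketch makes no API calls.

```python
def rollout_schedule(old_id, new_id, steps):
    """Return traffic splits that move linearly from the old model to the new one."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    schedule = []
    for i in range(1, steps + 1):
        new_pct = round(100 * i / steps)
        # The two percentages always sum to 100.
        schedule.append({old_id: 100 - new_pct, new_id: new_pct})
    return schedule


# Four steps over 24 hours would mean one update every 6 hours.
schedule = rollout_schedule("old-model", "new-model", steps=4)
for split in schedule:
    print(split)
```

Each step would be applied as a separate update to the endpoint, leaving time between updates to watch error rates and latency before continuing.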

Answer

Use Vertex AI Experiments to run an A/B test between the two versions.

Answer

Deploy the new version to a separate endpoint and update your client to use the new endpoint for a percentage of requests.

Answer

Delete the old version and redeploy the new version with a different endpoint name, then update DNS.

Question 3

You have a Vertex AI endpoint serving a model for real-time predictions. The endpoint is configured with minReplicaCount=2 and maxReplicaCount=10. Over the past week, you notice that the actual number of replicas rarely exceeds 2, but the average CPU utilization is around 85%. You want to reduce costs without impacting performance. What should you do?

Accepted Answer

Decrease minReplicaCount to 1. Option B is correct because decreasing minReplicaCount to 1 allows the endpoint to scale down to a single replica when traffic is low, reducing compute costs. Since the actual replica count rarely exceeds 2, the current minReplicaCount=2 forces at least two replicas to run continuously, even when one would suffice. With average CPU utilization at 85%, the model is already efficiently handling load, so scaling down to one replica will not impact performance while saving costs.

Answer

Increase minReplicaCount to 5.

Answer

Increase maxReplicaCount to 20.

Answer

Decrease the CPU utilization target to 50%.

Question 4

You are deploying a PyTorch model for online predictions on Vertex AI. The model expects input tensors and performs GPU-accelerated inference. You want to minimize prediction latency and maximize throughput. Which approach should you use?

Accepted Answer

Deploy using a prebuilt PyTorch serving container with NVIDIA Triton Inference Server. Option B is correct because NVIDIA Triton Inference Server provides advanced features like dynamic batching, concurrent model execution, and GPU scheduling that maximize throughput and minimize latency for GPU-accelerated inference. Vertex AI's prebuilt PyTorch serving container with Triton is specifically designed to handle online prediction workloads efficiently, outperforming a plain custom container without an inference server.

Answer

Package the model in a custom container without any inference server.

Answer

Use Vertex AI Model Optimization to quantize the model to FP16 and deploy using the optimized model.

Answer

Use batch prediction instead of online prediction to reduce latency.

Question 5

Your team has deployed a model on Vertex AI endpoints. You need to monitor the prediction latency to ensure it meets a 99th percentile SLO of 500ms. You want to set up an alert if the latency exceeds this threshold. Which metric should you use?

Accepted Answer

The 99th percentile of the `prediction/online/response_latencies` metric. Option A is correct because the `prediction/online/response_latencies` metric in Vertex AI provides a distribution of latency values, allowing you to query the 99th percentile directly. This aligns with the SLO requirement to monitor the tail latency, not the average or maximum, ensuring that 99% of requests complete within 500 ms.
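To see why the average is a poor proxy for this SLO, here is a small sketch using made-up latency samples and a nearest-rank percentile. The sample values and the `percentile` helper are illustrative assumptions. In practice the percentile would be computed by the monitoring system from the distribution metric.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


SLO_MS = 500

# Made-up sample: 98 fast requests and 2 slow ones.
latencies_ms = [120] * 98 + [900, 950]

p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.1f} ms, p99={p99} ms, SLO breached: {p99 > SLO_MS}")
```

In this sample the mean stays far below 500 ms while the 99th percentile exceeds it, so an alert on the average would miss the breach.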

Answer

The number of prediction requests that timeout.

Answer

Average prediction latency from the endpoint's logs.

Answer

The maximum prediction latency from the endpoint's monitoring dashboard.

Question 6

You are using Vertex AI Matching Engine (Vector Search) to serve similarity search for an e-commerce product recommendation system. The index is updated daily with new product embeddings via a batch job. However, you notice that some new products are not appearing in the search results for up to 24 hours. You need to ensure that new products are discoverable within 1 hour of ingestion. What should you do?

Accepted Answer

Switch to a streaming update configuration for the index. Option A is correct because Vertex AI Matching Engine supports streaming updates to the index, which allows new embeddings to be added in near real-time (typically within minutes) without requiring a full batch rebuild. By switching to a streaming update configuration, new product embeddings become searchable within the 1-hour requirement, as the index is updated incrementally as data arrives.

Answer

Deploy multiple index endpoints and use traffic splitting.

Answer

Increase the frequency of the batch update job to run every 30 minutes.

Answer

Use a brute-force index instead of an ANN index.

Question 7

Your company runs a high-traffic web application that serves the same machine learning model prediction for many identical requests (e.g., product recommendations for the same user profile). You want to reduce latency and load on the prediction endpoint by caching responses. Which Google Cloud service should you use?

Accepted Answer

Cloud Memorystore. Cloud Memorystore (B) is correct because it provides a managed in-memory cache (Redis or Memcached) that can store the results of identical prediction requests, reducing latency and load on the prediction endpoint. By caching responses keyed on the user profile or request parameters, subsequent identical requests can be served directly from Memorystore with sub-millisecond latency, avoiding redundant model inference.
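The caching pattern described here can be sketched as follows. A plain dictionary stands in for the Memorystore client, and `fake_model` is a hypothetical placeholder for the call to the prediction endpoint. Both are assumptions made so the example runs on its own.

```python
import hashlib
import json


class PredictionCache:
    """Cache-aside wrapper; a dict stands in for a Memorystore (Redis) client."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._store = {}
        self.misses = 0

    @staticmethod
    def _key(request):
        # Canonical JSON so that equal requests map to the same key.
        payload = json.dumps(request, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def predict(self, request):
        key = self._key(request)
        if key not in self._store:
            # Cache miss: call the model once and remember the result.
            self.misses += 1
            self._store[key] = self._predict_fn(request)
        return self._store[key]


def fake_model(request):
    # Hypothetical stand-in for a call to the prediction endpoint.
    return {"recommendations": sorted(request["interests"])}


cache = PredictionCache(fake_model)
first = cache.predict({"user": "u1", "interests": ["shoes", "bags"]})
second = cache.predict({"interests": ["shoes", "bags"], "user": "u1"})
print(first == second, cache.misses)
```

A real deployment would also set an expiry on each entry so that cached recommendations do not outlive a model update.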

Answer

Cloud CDN

Answer

Cloud Spanner

Answer

BigQuery

Question 8

You have a TensorFlow model that you want to deploy on edge devices for real-time inference. The model was trained in Vertex AI. You need to convert it to a format suitable for on-device inference. Which approach should you use?

Accepted Answer

Convert the model to TensorFlow Lite using the TensorFlow Lite converter. Option C is correct because TensorFlow Lite is specifically designed for on-device inference on edge devices, offering optimized performance and reduced model size. The TensorFlow Lite converter transforms a TensorFlow model (e.g., from a SavedModel) into the FlatBuffer format (.tflite), which is lightweight and compatible with mobile and embedded platforms. This directly addresses the requirement for real-time inference on edge devices.

Answer

Export the model as a serialized TFX pipeline.

Answer

Export the model to a SavedModel and deploy it using Vertex AI Edge Manager.

Answer

Use Vertex AI Model Optimization to compile the model for edge devices.

Question 9

You need to run batch predictions on a large dataset stored in BigQuery using a Vertex AI model. The dataset contains 10 million rows, and each prediction takes about 100ms. You want to minimize cost and execution time. What should you do?

Accepted Answer

Use Vertex AI batch prediction with BigQuery as the source and sink. Vertex AI batch prediction natively supports BigQuery as both input and output, eliminating the need for data export or custom pipelines. For 10 million rows at 100ms each, batch prediction processes them in parallel across multiple machines, minimizing execution time while avoiding the per-node costs of online prediction or the overhead of managing Dataflow or GKE clusters.
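The arithmetic behind the need for parallelism can be sketched as below. The worker counts are assumed values, and the estimate is idealized: it ignores job startup time, batching, and uneven work distribution.

```python
ROWS = 10_000_000
SECONDS_PER_PREDICTION = 0.1


def wall_clock_hours(rows, seconds_each, parallel_workers):
    """Idealized runtime assuming perfect parallelism and no startup overhead."""
    total_seconds = rows * seconds_each
    return total_seconds / parallel_workers / 3600


# Assumed worker counts, for illustration only.
for workers in (1, 10, 50):
    hours = wall_clock_hours(ROWS, SECONDS_PER_PREDICTION, workers)
    print(f"{workers} worker(s): about {hours:.1f} hours")
```

Processed one row at a time, the job would take roughly 278 hours, which is why spreading the work across many replicas matters for execution time.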

Answer

Export the BigQuery data to CSV in GCS, then run a custom Dataflow pipeline to make predictions.

Answer

Use Vertex AI online prediction and send all rows as separate requests.

Answer

Use a custom container running on Google Kubernetes Engine to perform inference.

Question 10

You have a Vertex AI endpoint with two deployed models: a champion (v1) and a challenger (v2). You set the traffic split to 90% v1 and 10% v2. After a week, you observe that v2 has better business metrics. You want to shift all traffic to v2 gradually over 3 days to avoid any risk. What should you do?

Accepted Answer

Update the traffic split configuration on the endpoint multiple times over the 3 days to gradually increase v2's percentage. Option C is correct because Vertex AI endpoints support live traffic splitting between deployed models, allowing you to gradually shift traffic from v1 to v2 by updating the traffic split configuration multiple times over the 3-day period. This approach minimizes risk by enabling incremental rollouts and immediate rollback if issues arise, without requiring client-side changes or downtime.

Answer

Deploy v2 to a new endpoint and update your clients to use the new endpoint.

Answer

Use Vertex AI Experiments to compare v1 and v2, then redeploy v2 with 100% traffic.

Answer

Delete v1 from the endpoint so that all traffic automatically goes to v2.

Question 11

You need to deploy a model to a Vertex AI endpoint that can scale down to zero when there are no requests to minimize costs. Which feature should you enable?

Accepted Answer

Enable autoscaling with minReplicaCount=0. Option C is correct because Vertex AI endpoints support autoscaling with a `minReplicaCount` of 0, which allows the endpoint to scale down to zero instances when there are no incoming requests, thereby minimizing costs. This feature is specifically designed for serverless model serving, where the endpoint automatically scales up from zero when traffic arrives and scales down to zero during idle periods.

Answer

Deploy the model to a Compute Engine instance and use instance groups.

Answer

Use a custom metric for autoscaling.

Answer

Set maxReplicaCount to 0.

Question 12

You are using Vertex AI Matching Engine for similarity search. Your index has 10 million embeddings of 512 dimensions. The query latency requirement is under 10ms for 99th percentile. Which index type should you choose?

Accepted Answer

Approximate Nearest Neighbor (ANN) index using the ScaNN algorithm. Option B is correct because the ScaNN (Scalable Nearest Neighbors) algorithm is specifically designed for high-dimensional, large-scale similarity search with strict latency requirements. With 10 million 512-dimensional embeddings, an ANN index like ScaNN can achieve sub-10ms query latency at the 99th percentile by trading a small amount of recall for dramatic speed improvements, which is exactly what Vertex AI Matching Engine optimizes for.
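A back-of-the-envelope estimate shows why an exhaustive scan struggles with this latency budget. The throughput figure is an assumption chosen for illustration. Real hardware and implementations vary widely, so only the order of magnitude is meaningful.

```python
NUM_VECTORS = 10_000_000
DIMENSIONS = 512
LATENCY_BUDGET_MS = 10

# Assumption for illustration only: 50 billion multiply-adds per second
# sustained on a single query.
ASSUMED_MACS_PER_SECOND = 50e9


def brute_force_ms(num_vectors, dims, macs_per_second):
    """Time to score every stored vector with one dot product each."""
    macs = num_vectors * dims
    return macs / macs_per_second * 1000


estimate = brute_force_ms(NUM_VECTORS, DIMENSIONS, ASSUMED_MACS_PER_SECOND)
print(f"{estimate:.1f} ms per query vs {LATENCY_BUDGET_MS} ms budget")
```

An exhaustive scan needs about 5.12 billion multiply-adds per query, so under this assumption it lands an order of magnitude over the budget. An ANN index avoids the problem by scoring only a small fraction of the vectors.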

Answer

Brute-force index with cosine distance.

Answer

A custom distance-based index using Cloud SQL.

Answer

A tree-based index from scikit-learn deployed as a custom container.

Question 13

You are deploying a large deep learning model on Vertex AI endpoints. The model requires GPU acceleration and you want to minimize cold-start latency. Which TWO actions should you take? (Choose 2 correct answers)

Accepted Answer

Use a custom container that loads the model during startup. Option B is correct because loading the model during container startup (e.g., in the server process launched by the Dockerfile's ENTRYPOINT or CMD) ensures that the model is already in memory when the first prediction request arrives, drastically reducing cold-start latency. This is a standard practice for Vertex AI endpoints where the container must be ready to serve immediately after scaling up.
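The difference between loading at startup and loading on first request can be sketched with two toy server classes. The class names and the stand-in `load_model` function are hypothetical. A real container would load actual model weights inside its serving process.

```python
class EagerServer:
    """Loads the model at construction, before any request is served."""

    def __init__(self, load_model):
        self.model = load_model()

    def predict(self, x):
        return self.model(x)


class LazyServer:
    """Defers loading, so the first request pays the load cost."""

    def __init__(self, load_model):
        self._load_model = load_model
        self.model = None

    def predict(self, x):
        if self.model is None:
            self.model = self._load_model()
        return self.model(x)


def make_loader(log):
    def load_model():
        log.append("load")  # record when the expensive load happens
        return lambda x: x * 2  # stand-in for a real model
    return load_model


eager_log, lazy_log = [], []

eager = EagerServer(make_loader(eager_log))
eager_log.append("ready")
eager_log.append(f"predict -> {eager.predict(3)}")

lazy = LazyServer(make_loader(lazy_log))
lazy_log.append("ready")
lazy_log.append(f"predict -> {lazy.predict(3)}")

print(eager_log)
print(lazy_log)
```

In the eager variant the load is recorded before the server reports ready, so no request waits on it. In the lazy variant the load happens inside the first prediction call.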

Answer

Set minReplicaCount to 0 to allow scale-to-zero.

Answer

Increase maxReplicaCount to a high number.

Answer

Use batch prediction instead of online prediction.

Question 14

Your team has deployed a model on Vertex AI endpoints and you are planning an A/B test to compare a new challenger model (v2) against the current champion (v1). The test should measure business metrics such as click-through rate. Which THREE steps should you take to set up the A/B test correctly? (Choose 3 correct answers)

Accepted Answer

Deploy the challenger model (v2) to the same endpoint as the champion (v1). Option A is correct because deploying both v1 and v2 to the same Vertex AI endpoint allows you to use the built-in traffic splitting feature. This enables you to route a percentage of requests to each model version without managing separate endpoints or DNS changes, which is the standard approach for A/B testing on Vertex AI.

Answer

Create a new endpoint for v2 and gradually shift DNS traffic.

Answer

Use Vertex AI Experiments to compare model performance.

Question 15

You need to deploy a model that requires a large amount of memory (over 200 GB) for inference. The model is a custom PyTorch model. Vertex AI endpoints have machine type limitations. Which TWO actions can you take to handle this memory requirement? (Choose 2 correct answers)

Accepted Answer

Use a machine type from the n1-highmem series, such as n1-highmem-32 (208 GB) or higher. Option A is correct because the n1-highmem-32 machine type provides 208 GB of memory, which meets the requirement of over 200 GB. Vertex AI endpoints support this machine series, allowing you to deploy a custom PyTorch model with sufficient RAM for inference without modification.
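A small sketch of the memory check, using the Compute Engine memory sizes for a few n1-highmem machine types (6.5 GB per vCPU). Whether a given type is offered for Vertex AI online prediction is not checked here and would need to be confirmed separately. The helper name is hypothetical.

```python
# Memory in GB for selected Compute Engine n1-highmem machine types.
N1_HIGHMEM_GB = {
    "n1-highmem-16": 104,
    "n1-highmem-32": 208,
    "n1-highmem-64": 416,
    "n1-highmem-96": 624,
}


def machines_with_memory(required_gb, catalog):
    """Return machine types whose total memory exceeds the requirement."""
    return [name for name, gb in catalog.items() if gb > required_gb]


candidates = machines_with_memory(200, N1_HIGHMEM_GB)
print(candidates)
```

Note that n1-highmem-32 clears a 200 GB requirement by only 8 GB, so the operating system and serving container leave little headroom. A larger type may be the safer choice if one is available.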

Answer

Use multiple replicas and split the model across them.

Answer

Use batch prediction instead of online prediction.

Answer

Deploy the model on Cloud Run with 32 GB memory.