PMLE Serving and Scaling Models Practice Test 3 — 15 Questions

Question 1

A company deploys a model on Vertex AI Endpoints for real-time inference. They need to minimize latency for prediction requests that are identical to previous requests. Which approach should they use?

Accepted Answer

Implement a caching layer using Cloud Memorystore with request hashing. Caching identical prediction requests in Cloud Memorystore, keyed by a hash of the request payload, reduces latency by serving repeated requests directly from an in-memory cache instead of running model inference again. Cached requests bypass the model endpoint entirely, which minimizes response time when many real-time requests are identical.

Answer

Use a regional load balancer with session affinity.

Answer

Use Cloud CDN to cache prediction responses.

Answer

Enable prediction caching on Vertex AI Endpoints.
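The request-hashing cache described in the accepted answer to Question 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not a production design: a plain dict stands in for Cloud Memorystore, and `fake_endpoint_predict` is a hypothetical placeholder for the real call to the model endpoint.

```python
import hashlib
import json


class PredictionCache:
    """Caches predictions keyed by a hash of the request payload."""

    def __init__(self, predict_fn, store=None):
        self._predict_fn = predict_fn
        # A dict stands in for Cloud Memorystore (assumption for this sketch).
        self._store = {} if store is None else store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def request_key(instance):
        # Serialize with sorted keys so logically identical requests
        # produce the same hash regardless of key order.
        canonical = json.dumps(instance, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def predict(self, instance):
        key = self.request_key(instance)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # served from cache, no model call
        self.misses += 1
        result = self._predict_fn(instance)
        self._store[key] = result
        return result


def fake_endpoint_predict(instance):
    """Hypothetical stand-in for a call to the model endpoint."""
    return {"score": sum(instance["features"])}


cache = PredictionCache(fake_endpoint_predict)
first = cache.predict({"features": [1, 2, 3], "user": "a"})
# Same content with a different key order: this request hits the cache.
second = cache.predict({"user": "a", "features": [1, 2, 3]})
```

Canonicalizing the payload before hashing is the design choice that matters here: without it, two requests that differ only in field order would be treated as distinct and both would reach the model.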

Question 2

A data science team needs to serve multiple versions of the same ML model on Vertex AI Endpoints for A/B testing. They want to gradually shift traffic from the current 'champion' model to a new 'challenger' model. Which feature should they use?

Accepted Answer

Deploy both models to the same endpoint and use traffic splitting. Vertex AI Endpoints natively support traffic splitting: you can deploy multiple model versions (e.g., champion and challenger) to the same endpoint and assign a percentage of traffic to each. The endpoint routes requests according to the configured split, so traffic can be shifted gradually for A/B testing without additional infrastructure. This built-in feature is designed specifically for this use case.

Answer

Deploy the challenger to a separate endpoint and use a proxy to split traffic.

Answer

Use Cloud Load Balancing with weighted backend services.

Answer

Use Vertex AI Experiments to manage model versions.
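To make the percentage-based routing in the accepted answer to Question 2 concrete, here is a small simulation. It is only a sketch of the idea: the `validate_split` and `route` helpers and the model IDs are hypothetical, and the real endpoint performs this routing for you.

```python
import random


def validate_split(traffic_split):
    """Check that percentages are non-negative integers summing to 100."""
    if any(pct < 0 for pct in traffic_split.values()):
        raise ValueError("traffic percentages must be non-negative")
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    return traffic_split


def route(traffic_split, rng):
    """Pick a deployed model ID at random, weighted by its traffic share."""
    model_ids = list(traffic_split)
    weights = [traffic_split[model_id] for model_id in model_ids]
    return rng.choices(model_ids, weights=weights, k=1)[0]


# Hypothetical split: 90% to the champion, 10% to the challenger.
split = validate_split({"champion": 90, "challenger": 10})
rng = random.Random(0)  # seeded so the simulation is repeatable
counts = {"champion": 0, "challenger": 0}
for _ in range(10_000):
    counts[route(split, rng)] += 1
```

Over many requests the observed share of each model approaches its configured percentage, which is what makes a small challenger share a low-risk way to compare the two versions.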

Question 3

An ML engineer is optimizing a large model for deployment on Vertex AI with GPU acceleration. They want to reduce model size and improve inference latency without significant accuracy loss. Which tool should they use?

Accepted Answer

Use Vertex AI Model Optimization with TensorRT. TensorRT is designed to reduce model size and improve inference latency on NVIDIA GPUs by applying techniques such as reduced-precision quantization (FP16/INT8) and graph optimizations like layer fusion. It tunes the model for the target GPU architecture, enabling faster inference with minimal accuracy loss, which directly addresses the engineer's goals.

Answer

Use gcloud CLI to prune the model.

Answer

Use Cloud TPU for faster inference.

Answer

Use TensorFlow.js converter to optimize the model for web.

Question 4

Which Vertex AI service is designed for building and managing approximate nearest neighbor (ANN) indexes for similarity search at scale?

Accepted Answer

Vertex AI Matching Engine (Vector Search). Vertex AI Matching Engine (now Vector Search) provides ANN indexes for similarity search, enabling fast vector similarity queries at scale.

Answer

Vertex AI AutoML

Answer

Vertex AI Workbench

Answer

Vertex AI Prediction

Question 5

A company wants to run batch predictions on millions of records stored in BigQuery. They need to preprocess the data (e.g., feature engineering) before feeding it to the model. Which approach is most scalable and cost-effective?

Accepted Answer

Preprocess with Cloud Dataflow, output to Cloud Storage, then submit a Vertex AI batch prediction job. This is the most scalable and cost-effective option because Cloud Dataflow (Apache Beam) provides serverless, autoscaling preprocessing that handles large data volumes efficiently, and Vertex AI batch prediction reads its input directly from Cloud Storage, so there is no infrastructure to manage. Decoupling preprocessing from prediction lets each stage scale independently and keeps costs down by using ephemeral, pay-per-use resources.

Answer

Use a large DataProc cluster to preprocess and run batch predictions.

Answer

Preprocess inline in the batch prediction job using a custom container.

Answer

Use a custom Python script on a Compute Engine instance.
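The preprocessing stage in the accepted answer to Question 5 boils down to a per-record transform whose output is written in a format batch prediction can read, such as JSON Lines. The sketch below shows only that transform in plain Python so it stays self-contained. In a real pipeline the same function would run as a step in an Apache Beam job on Dataflow. The feature names and the `engineer_features` logic are hypothetical.

```python
import json
import math


def engineer_features(row):
    """Hypothetical feature engineering applied to one source record."""
    return {
        "log_amount": round(math.log1p(row["amount"]), 6),
        "is_weekend": 1 if row["day_of_week"] in (5, 6) else 0,
    }


def to_jsonl(rows):
    """Serialize engineered instances as JSON Lines, one instance per line."""
    lines = [json.dumps(engineer_features(row), sort_keys=True) for row in rows]
    return "\n".join(lines) + "\n"


# Two made-up records standing in for rows exported from BigQuery.
sample_rows = [
    {"amount": 0.0, "day_of_week": 2},
    {"amount": 99.0, "day_of_week": 6},
]
jsonl_payload = to_jsonl(sample_rows)
```

Keeping the transform a pure function of one record is what lets a runner such as Dataflow parallelize it across millions of rows.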

Question 6

Which of the following is a benefit of using Vertex AI Endpoints with autoscaling and scale-to-zero?

Accepted Answer

It reduces costs by scaling down to zero replicas when no requests are received. With autoscaling and scale-to-zero, the number of serving replicas on a Vertex AI endpoint adjusts dynamically to incoming traffic. When no requests are received, the endpoint can scale down to zero replicas, so you are not charged for idle compute resources. This directly reduces operational costs compared with maintaining a minimum number of always-on instances.

Answer

It eliminates the need for a load balancer.

Answer

It reduces model training time.

Answer

It automatically upgrades the model version.

Question 7

An engineer deploys a model to a Vertex AI endpoint with minReplicas=1 and maxReplicas=3. The endpoint receives a sudden traffic spike, but it does not scale up beyond 1 replica. The CPU utilization target is 60%. What is the most likely cause?

Accepted Answer

The CPU utilization is below the target threshold, so the autoscaler does not add replicas. Vertex AI's autoscaler uses CPU utilization as a metric to decide when to add replicas. If utilization stays below the 60% target, the autoscaler does not scale up, even during a traffic spike (for example, when the workload is bound by I/O or the GPU rather than the CPU). With minReplicas=1 and maxReplicas=3, the endpoint therefore stays at the minimum of one replica until the target is exceeded.

Answer

The model is not deployed correctly.

Answer

The endpoint is configured with the wrong machine type.

Answer

The endpoint is using GPU which cannot autoscale.
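The scaling behavior in the accepted answer to Question 7 can be illustrated with a target-tracking rule. The formula below is the proportional rule used by Kubernetes-style horizontal autoscalers and is an assumption for illustration. It is not a documented description of Vertex AI's internal algorithm, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(current_replicas, cpu_pct, target_pct, min_replicas, max_replicas):
    """Proportional target-tracking rule, clamped to the configured bounds.

    Assumption: desired = ceil(current * observed / target), as in
    Kubernetes-style autoscalers. Vertex AI's exact rule may differ.
    """
    raw = math.ceil(current_replicas * cpu_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))


# Traffic spike, but CPU stays at 45% against a 60% target: no scale-up.
below_target = desired_replicas(1, cpu_pct=45, target_pct=60, min_replicas=1, max_replicas=3)

# CPU at 90% against the same target: one more replica is requested.
above_target = desired_replicas(1, cpu_pct=90, target_pct=60, min_replicas=1, max_replicas=3)

# Very high utilization is still capped by maxReplicas.
capped = desired_replicas(1, cpu_pct=400, target_pct=60, min_replicas=1, max_replicas=3)
```

Under this rule the request rate never appears directly. Only the measured utilization relative to the target drives the replica count, which is why a spike that does not raise CPU leaves the endpoint at one replica.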

Question 8

A company needs to perform real-time similarity search on a dataset of 10 million embedding vectors. They expect low latency (under 10ms) and high throughput. Which index type should they use in Vertex AI Vector Search?

Accepted Answer

Approximate nearest neighbor (ANN) index with ScaNN. For large datasets that require low latency, an ANN index is appropriate because it avoids comparing the query against every stored vector. Vertex AI Vector Search builds its ANN indexes with the ScaNN (Scalable Nearest Neighbors) algorithm.

Answer

Brute-force index

Answer

Hash-based index

Answer

Tree-based index
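For contrast with the ANN index in the accepted answer to Question 8, the sketch below shows the exact brute-force baseline that an ANN index approximates. It scores the query against every vector, so its cost grows linearly with the dataset, which is what makes it impractical at 10 million vectors under a tight latency budget. The catalog and helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k):
    """Exact search: score every vector, then keep the k most similar IDs."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [vec_id for _, vec_id in scored[:k]]


# Tiny made-up catalog of 2-dimensional embeddings.
catalog = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
top = brute_force_top_k([1.0, 0.05], catalog, k=2)
```

An ANN index trades a small amount of recall for speed by searching only a subset of candidates instead of scanning the full catalog as this function does.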

Question 9

Which API is recommended for high-throughput, low-latency online prediction requests to Vertex AI endpoints?

Accepted Answer

gRPC API. gRPC is recommended for high-throughput, low-latency online prediction requests to Vertex AI endpoints because it runs over HTTP/2 with multiplexed connections, uses compact binary serialization (Protocol Buffers), and supports bidirectional streaming, all of which reduce latency and improve throughput compared with REST. Vertex AI's prediction service natively supports gRPC for real-time inference, which makes it the best choice for latency-sensitive applications.

Answer

Cloud Functions

Answer

REST API

Answer

Cloud Pub/Sub

Question 10

An organization wants to deploy a TensorFlow model on edge devices such as smartphones and IoT devices for offline inference. Which format should they export the model to?

Accepted Answer

TensorFlow Lite (TFLite). TensorFlow Lite is designed for on-device inference on mobile and edge devices, with reduced model size and optimized performance.

Answer

ONNX format

Answer

SavedModel format

Answer

HDF5 format

Question 11

A company is deploying multiple models on a single Vertex AI endpoint to reduce costs. Each model has different traffic patterns. Which configuration should they use?

Accepted Answer

Use a single endpoint with multiple deployed models and traffic allocation. Vertex AI endpoints support deploying multiple models behind a single endpoint with traffic splitting, so you can route a different percentage of requests to each model according to its traffic pattern. This can reduce infrastructure costs compared with separate endpoints, because the deployed models can share the endpoint's underlying compute resources. Traffic allocation can be adjusted dynamically to match changing model usage without redeploying.

Answer

Use Cloud Run to serve each model as a separate service.

Answer

Deploy each model to separate endpoints and use a load balancer.

Answer

Use Vertex AI Matching Engine to serve models.

Question 12

An ML engineer needs to update a model deployed on a Vertex AI endpoint without downtime. They want to gradually shift traffic to the new version while monitoring for errors. What is the correct procedure?

Accepted Answer

Deploy the new model to the same endpoint with 0% traffic initially, then gradually increase traffic while monitoring. Because the new version starts at 0% traffic, the existing version keeps serving every request until the new one is ready. Its traffic allocation is then raised in steps while you monitor for errors, so the update causes no downtime and can be reversed by shifting traffic back.

Answer

Use a canary deployment by deploying to a separate endpoint and using a load balancer with weighted routing.

Answer

Deploy the new model to a new endpoint, then update DNS to point to the new endpoint.

Answer

Delete the old model and deploy the new one with the same endpoint.
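The procedure in the accepted answer to Question 12 can be sketched as a loop over traffic splits with a health check at each step. This is an illustrative sketch only: `rollout_steps`, `run_rollout`, the model IDs, and the `error_rate_for` monitoring hook are all hypothetical, and applying each split to a real endpoint is left out.

```python
def rollout_steps(old_id, new_id, increments):
    """Build the sequence of traffic splits, starting with 0% on the new model."""
    steps = [{old_id: 100, new_id: 0}]
    for pct in increments:
        if not 0 < pct <= 100:
            raise ValueError("each increment must be in the range 1..100")
        steps.append({old_id: 100 - pct, new_id: pct})
    return steps


def run_rollout(steps, error_rate_for, max_error_rate):
    """Apply splits in order; revert to the first split if errors exceed the limit."""
    applied = []
    for split in steps:
        applied.append(split)
        if error_rate_for(split) > max_error_rate:
            applied.append(steps[0])  # roll back: all traffic to the old model
            return applied, False
    return applied, True


plan = rollout_steps("v1", "v2", [10, 50, 100])

# Healthy rollout: the hypothetical monitor always reports a low error rate.
history, succeeded = run_rollout(plan, lambda split: 0.001, max_error_rate=0.01)

# Unhealthy rollout: errors appear once the new model receives 50% of traffic.
bad_history, bad_succeeded = run_rollout(
    plan,
    lambda split: 0.05 if split["v2"] >= 50 else 0.001,
    max_error_rate=0.01,
)
```

Starting the plan with an explicit 0% step gives the rollback a well-defined target: the first split is always the known-good configuration.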

Question 13

A company wants to use Vertex AI Vector Search for real-time product recommendations based on user embeddings. They need to update the index frequently with new product embeddings without significant downtime. Which TWO options should they consider? (Choose 2)

Accepted Answer

Maintain two separate indexes: one for building and one for serving, and swap them after building. Maintaining two separate indexes lets you build a new index in the background while the existing index continues to serve queries. Once the new index is fully built and validated, you can swap it into production with minimal downtime. This is a common pattern for high-availability systems where index rebuilds are necessary.

Answer

Use batch updates to rebuild the entire index daily.

Answer

Increase the number of replicas on the deployed index.

Answer

Use a GPU-powered endpoint for faster indexing.
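The build-then-swap pattern from the accepted answer to Question 13 can be sketched in a few lines. This is an in-memory illustration under stated assumptions: a dict stands in for a real vector index, and `SwappableIndex` and its methods are hypothetical names, not part of any Vertex AI API.

```python
import threading


class SwappableIndex:
    """Serves queries from one index while a replacement is built on the side."""

    def __init__(self, initial_vectors):
        self._serving = dict(initial_vectors)
        self._lock = threading.Lock()

    def query_ids(self):
        # Grab a reference to the current serving index, then read from it.
        with self._lock:
            serving = self._serving
        return sorted(serving)

    def rebuild_and_swap(self, new_vectors, validate):
        # Build the candidate without touching the serving index.
        candidate = dict(new_vectors)
        if not validate(candidate):
            return False  # keep serving the old index
        with self._lock:
            self._serving = candidate  # the swap is a single reference change
        return True


index = SwappableIndex({"p1": [0.1, 0.2]})
swapped = index.rebuild_and_swap(
    {"p1": [0.1, 0.2], "p2": [0.3, 0.4]},
    validate=lambda candidate: len(candidate) > 0,
)
rejected = index.rebuild_and_swap({}, validate=lambda candidate: len(candidate) > 0)
```

Queries never see a half-built index because the expensive build happens on a separate object and only the final reference assignment is performed under the lock.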

Question 14

An organization is deploying a mission-critical model on Vertex AI Endpoints. They need to ensure high availability and meet a strict SLO of 99.9% uptime. Which THREE steps should they take? (Choose 3)

Accepted Answer

Set minReplicas to at least 2 to ensure redundancy within a region. To meet a 99.9% SLO, they should deploy across multiple regions for redundancy, set minimum replicas to ensure baseline capacity, and configure health checks to route traffic away from unhealthy instances.

Answer

Use Cloud CDN to cache responses.

Answer

Use a single large instance instead of multiple small ones.
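The arithmetic behind the 99.9% SLO in Question 14 helps explain why redundancy matters. The sketch below computes the downtime budget and the availability of redundant replicas. The replica formula assumes failures are independent, which is a simplification, and the 99% single-replica figure is a made-up input for illustration.

```python
def allowed_downtime_minutes(slo, days):
    """Downtime budget in minutes for a given SLO over a period of days."""
    return days * 24 * 60 * (1 - slo)


def combined_availability(single_availability, replicas):
    """Availability when at least one of N replicas must be up.

    Assumption: replica failures are independent of each other.
    """
    return 1 - (1 - single_availability) ** replicas


# A 99.9% SLO over a 30-day month leaves about 43.2 minutes of downtime.
monthly_budget = round(allowed_downtime_minutes(0.999, 30), 1)

# Two replicas that are each 99% available (hypothetical figure).
two_replicas = round(combined_availability(0.99, 2), 4)
```

With a budget this small, a single instance that takes several minutes to restart can consume a large share of the month's allowance in one incident, which is why a single large instance is a poor choice here.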

Question 15

Which TWO of the following can be used as input sources for Vertex AI batch prediction jobs? (Choose 2)

Accepted Answer

BigQuery. BigQuery is a supported input source for Vertex AI batch prediction jobs because Vertex AI can directly read data from BigQuery tables for batch predictions. This integration allows you to store your prediction requests in BigQuery and have Vertex AI write the predictions back to a BigQuery output table, streamlining the workflow for large-scale predictions without needing to export data to Cloud Storage first.

Answer

Cloud Firestore

Answer

Cloud SQL

Answer

Cloud Spanner