PMLE Serving and Scaling Models Practice Test 2 — 15 Questions

Question 1

You deployed a model to a Vertex AI endpoint with minReplicas=0 and maxReplicas=5. After sending prediction requests, you notice the endpoint takes about 30 seconds to respond initially, but subsequent requests are fast. What is the most likely cause?

Accepted Answer

Cold start occurs because the endpoint scaled down to zero. With minReplicas=0, the endpoint scales down to zero when idle. The first request after a period of inactivity triggers a cold start: the endpoint must provision a new VM instance and load the model, which causes the ~30-second delay. Subsequent requests are fast because the instance stays warm and serves them without provisioning overhead.

Answer

The model is too large for the machine type.

Answer

The VPC Service Controls are blocking the initial request.

Answer

The endpoint's autoscaling is misconfigured.

Question 2

You have a champion model serving 100% traffic on a Vertex AI endpoint. You want to deploy a challenger model and gradually shift 10% of traffic to it for A/B testing. What is the correct approach?

Accepted Answer

Deploy the challenger on the same endpoint and use the traffic split parameter to allocate 10% traffic to it. Vertex AI endpoints support traffic splitting: you deploy multiple models to one endpoint and assign each a traffic percentage. Deploy the challenger as a new deployed model on the same endpoint and set the traffic split to champion 90%, challenger 10%.

Answer

Use Cloud Run to deploy both models and use Cloud Endpoints for traffic splitting.

Answer

Deploy the challenger on a separate endpoint and use Cloud Armor to split traffic.

Answer

Create a new endpoint for the challenger and route 10% of requests via a load balancer.
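
The 90/10 split from the accepted answer to Question 2 can be sketched as a small helper that builds and validates a traffic-split map. This is a minimal sketch, not the SDK's own API: the function name `build_traffic_split` and the champion's deployed model ID are hypothetical, and it assumes that the percentages must sum to 100 and that the key "0" in a deploy request stands for the model being deployed.

```python
def build_traffic_split(champion_id: str, challenger_percent: int) -> dict:
    """Return a traffic-split map that gives challenger_percent to the new model.

    The key "0" stands for the model being deployed in the same request
    (an assumption of this sketch); champion_id is the hypothetical ID of
    the model that is already deployed on the endpoint.
    """
    if not 0 <= challenger_percent <= 100:
        raise ValueError("challenger_percent must be between 0 and 100")
    split = {"0": challenger_percent, champion_id: 100 - challenger_percent}
    # Percentages across all deployed models must add up to exactly 100.
    if sum(split.values()) != 100:
        raise ValueError("traffic split must sum to 100")
    return split


split = build_traffic_split(champion_id="1234567890", challenger_percent=10)
print(split)
```

Validating the sum before sending the request catches the most common mistake, a split that leaves some traffic unassigned.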

Question 3

You need to run batch predictions on 10 TB of text data stored in BigQuery using a custom container model hosted in Vertex AI. What is the most cost-effective and simple approach?

Accepted Answer

Use Vertex AI batch prediction with BigQuery source and sink. Vertex AI batch prediction natively supports BigQuery as both source and sink, so you can run predictions on the 10 TB of text data without exporting it or staging it in intermediate storage. This is the most cost-effective and simple approach: there is no infrastructure to manage and no online endpoint to call row by row, and the batch job scales its workers automatically.

Answer

Use Cloud Run jobs to read from BigQuery and write results back.

Answer

Export BigQuery data to GCS, then run a Dataflow pipeline to call the model's online prediction endpoint for each row.

Answer

Use Cloud Dataproc to spin up a Spark cluster and run the model inference in parallel.
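
To make the BigQuery source and sink from Question 3 concrete, the sketch below builds the request body for a batch prediction job as a plain dictionary. It is an illustration under assumptions: the helper name and every project, dataset, table and model name are hypothetical, and the field names follow the BatchPredictionJob REST resource as I understand it, so check them against the current reference before use.

```python
def build_batch_job_body(display_name: str, model: str,
                         bq_input_table: str, bq_output_dataset: str) -> dict:
    """Build a batch prediction job body that reads from and writes to BigQuery."""
    for uri in (bq_input_table, bq_output_dataset):
        # BigQuery locations are referenced with the bq:// scheme.
        if not uri.startswith("bq://"):
            raise ValueError(f"expected a bq:// URI, got {uri!r}")
    return {
        "displayName": display_name,
        "model": model,
        "inputConfig": {
            "instancesFormat": "bigquery",
            "bigquerySource": {"inputUri": bq_input_table},
        },
        "outputConfig": {
            "predictionsFormat": "bigquery",
            "bigqueryDestination": {"outputUri": bq_output_dataset},
        },
    }


body = build_batch_job_body(
    display_name="text-batch-job",                                  # hypothetical
    model="projects/my-project/locations/us-central1/models/123",   # hypothetical
    bq_input_table="bq://my-project.my_dataset.input_text",         # hypothetical
    bq_output_dataset="bq://my-project.my_dataset",                 # hypothetical
)
print(body["inputConfig"], body["outputConfig"])
```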

Question 4

Your team is deploying a large recommendation model on Vertex AI endpoints using GPUs. You need to minimise latency while optimising cost. The model serves many similar requests from the same users within short time windows. Which additional service would best reduce latency and cost?

Accepted Answer

Use Cloud Memorystore to cache prediction results. Because the same users send many similar requests within short time windows, caching identical prediction requests reduces load on the GPU-backed model and improves latency. Cloud Memorystore (Redis) can store responses keyed by a hash of the request, and the serving layer checks the cache before invoking the model.

Answer

Switch to CPU-only instances to reduce cost.

Answer

Increase maxReplicas to handle the load without caching.

Answer

Set up a Cloud CDN in front of the endpoint.
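
The check-cache-then-invoke pattern from the accepted answer to Question 4 can be sketched as follows. This is a minimal sketch with an in-memory dictionary standing in for Redis; the class `PredictionCache` and the placeholder `fake_model` are hypothetical, and a real deployment would replace the dictionary with Redis reads and writes plus an expiry time.

```python
import hashlib
import json


class PredictionCache:
    """In-memory stand-in for a Redis cache keyed by a hash of the request."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(instance: dict) -> str:
        # sort_keys makes logically equal requests hash to the same key.
        payload = json.dumps(instance, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def predict(self, instance: dict, model_fn):
        key = self.key_for(instance)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: call the model once and remember the result.
        self.misses += 1
        result = model_fn(instance)
        self._store[key] = result
        return result


def fake_model(instance: dict) -> dict:
    # Hypothetical placeholder for a call to the deployed model.
    return {"score": len(instance["user_id"])}


cache = PredictionCache()
first = cache.predict({"user_id": "u1", "k": 5}, fake_model)
second = cache.predict({"k": 5, "user_id": "u1"}, fake_model)  # same request, different key order
print(first == second, cache.hits, cache.misses)
```

Hashing a canonical JSON form rather than the raw request string means key order and whitespace differences do not cause avoidable cache misses.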

Question 5

You want to deploy a TensorFlow model to a Vertex AI endpoint and enable online predictions. The model requires GPU for inference. Which machine type should you select when deploying the model?

Accepted Answer

a2-highgpu-1g (with A100 GPU). The a2-highgpu-1g machine type is built for GPU-accelerated workloads and includes one NVIDIA A100 GPU, which meets the GPU requirement for inference with the TensorFlow model. It is the only option among the choices that comes with a GPU: E2 machine types do not support GPUs, and the N1 types listed have no GPU unless an accelerator is attached separately.

Answer

n1-standard-4

Answer

e2-standard-4

Answer

n1-highmem-8

Question 6

Your Vertex AI endpoint is experiencing high latency during traffic spikes. You have set maxReplicas=10 and minReplicas=2. The CPU utilisation target is 60%. During spikes, the endpoint never scales beyond 4 replicas. What is the most likely reason?

Accepted Answer

The maxReplicas limit is set to 10, but the cooldown period is preventing rapid scaling. The default cooldown period (usually 120 seconds) limits how quickly the endpoint adds replicas. If traffic spikes are very short, utilisation does not stay above the 60% target long enough to trigger further scale-up, so the endpoint stalls at 4 replicas even though maxReplicas allows 10.

Answer

You need to enable GPU acceleration.

Answer

The endpoint is using a legacy model framework.

Answer

The machine type is too small.
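
The reasoning in the accepted answer to Question 6 can be illustrated with a simple proportional scaling rule. This is a sketch under stated assumptions, not Vertex AI's documented algorithm: it assumes the autoscaler averages utilisation over a window before deciding, and the function name and the CPU readings are hypothetical.

```python
import math
from statistics import mean


def desired_replicas(current, utilisation_pct, target_pct, min_replicas, max_replicas):
    """Proportional scaling rule, clamped to the configured replica bounds."""
    raw = math.ceil(current * utilisation_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical per-interval CPU readings (%), averaged before the scaling decision.
short_spike = [40, 40, 120, 40]
sustained_load = [120, 120, 120, 120]

for name, samples in [("short spike", short_spike), ("sustained load", sustained_load)]:
    replicas = desired_replicas(
        current=2, utilisation_pct=mean(samples), target_pct=60,
        min_replicas=2, max_replicas=10,
    )
    print(name, replicas)
```

With these numbers the short spike averages out to the 60% target and leaves the replica count at 2, while the sustained load doubles it to 4. A brief burst can therefore pass without the endpoint ever approaching maxReplicas.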

Question 7

You need to deploy a PyTorch model for online inference on Vertex AI but the model was trained using custom ops that are not natively supported. You want to use NVIDIA Triton Inference Server for optimisation. How should you proceed?

Accepted Answer

Build a custom container with NVIDIA Triton Inference Server and deploy it to Vertex AI. Vertex AI supports serving models with NVIDIA Triton Inference Server. You build a custom container that packages Triton together with the model and its custom ops, then deploy it to a Vertex AI endpoint. This lets you use Triton's optimisations for online inference.

Answer

Convert the model to TFLite and deploy on an edge device.

Answer

Export the model to ONNX and deploy using Vertex AI's built-in TensorFlow serving.

Answer

Use Vertex AI Model Optimisation to automatically quantise the model.

Question 8

Your team has built a low-latency similarity search service using Vertex AI Matching Engine (Vector Search). The index is updated daily with new embeddings. You need to serve the latest index without downtime. What is the correct deployment strategy?

Accepted Answer

Create a new index version, deploy it to the same endpoint, and then update the endpoint to use the new index version. Vertex AI Matching Engine supports deploying a new index version to the same endpoint without downtime. You create a new index version from the updated embeddings, deploy it to the existing endpoint, and then switch the endpoint to the new version. Traffic moves to the updated index only once it is fully deployed, so the service is never interrupted.

Answer

Use streaming updates to continuously update the index.

Answer

Update the existing index in place by calling the index update API.

Answer

Delete the old index and create a new index each day, then deploy the new index to a new endpoint and update DNS.

Question 9

You need to serve a model on an edge device with low latency and offline capability. Which approach should you use?

Accepted Answer

Export the model to TensorFlow Lite and use Vertex AI Edge Manager for deployment. TensorFlow Lite is designed for on-device inference with low latency and offline capability: it converts models into a lightweight format optimised for edge hardware. Vertex AI Edge Manager adds deployment, monitoring, and management of models on edge devices, so they run efficiently without constant cloud connectivity.

Answer

Use Cloud Run for on-device inference.

Answer

Deploy the model to a Vertex AI endpoint and rely on mobile connectivity.

Answer

Use AI Platform Prediction (not Vertex AI).

Question 10

You have a Vertex AI endpoint with two deployed models: model A (champion) and model B (challenger). Traffic split is 90:10. You want to gradually increase model B's traffic to 50% over a week. What is the best way to update the traffic split?

Accepted Answer

Use the gcloud ai endpoints update command to change traffic split. The `gcloud ai endpoints update` command modifies the traffic split of an existing Vertex AI endpoint without redeploying models or recreating the endpoint. This is the intended way to shift traffic gradually between deployed models: you update the `--traffic-split` parameter in small increments, which gives a controlled rollout from 90:10 to 50:50 over the week.

Answer

Use gcloud ai models upload to overwrite model B with new settings.

Answer

Create a new endpoint and migrate traffic gradually using a load balancer.

Answer

Delete model B and redeploy with a new traffic split.
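
The week-long ramp in Question 10 can be planned as a list of daily percentages for model B. This is a minimal sketch: the helper `ramp_schedule` and the two deployed model IDs are hypothetical, and the `ID=VALUE,ID=VALUE` string is only an assumed shape for the `--traffic-split` value, to be checked against the gcloud reference.

```python
def ramp_schedule(start: int, end: int, steps: int) -> list:
    """Evenly spaced challenger percentages, ending exactly at `end`."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [round(start + (end - start) * i / steps) for i in range(1, steps + 1)]


CHAMPION_ID = "1111"    # hypothetical deployed model ID for model A
CHALLENGER_ID = "2222"  # hypothetical deployed model ID for model B

schedule = ramp_schedule(start=10, end=50, steps=7)
for day, challenger_pct in enumerate(schedule, start=1):
    # Each day's split still sums to 100.
    split_value = f"{CHAMPION_ID}={100 - challenger_pct},{CHALLENGER_ID}={challenger_pct}"
    print(f"day {day}: {split_value}")
```

Applying one step per day keeps each change small enough to compare the two models' metrics before moving on.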

Question 11

You are using Vertex AI Prediction with a custom container that requires a large model file (5 GB). Deployment takes 10 minutes to start. You want to reduce cold start latency. Which action would be MOST effective?

Accepted Answer

Set minReplicas to 1 to keep at least one instance always running. Cold start latency occurs when a new instance starts and must load the 5 GB model from disk, which can take 10 minutes. Setting minReplicas to 1 ensures that at least one instance is always running with the model already loaded in memory, so prediction requests never wait for a cold start. The other options (compressing the model, using local SSD, or switching to batch prediction) may shorten load time or avoid online serving altogether, but none of them removes the cold start penalty as effectively as keeping a warm replica.

Answer

Compress the model file and decompress on startup.

Answer

Use a machine type with local SSD to speed up model loading.

Answer

Switch to batch prediction to avoid online cold start.

Question 12

You need to query a Vertex AI Vector Search index for nearest neighbours. The index is deployed on an endpoint. Which API method should you use to perform the query?

Accepted Answer

projects.locations.indexEndpoints.findNeighbors. The correct API method to query a deployed Vertex AI Vector Search index for nearest neighbors is `projects.locations.indexEndpoints.findNeighbors`. This method is specifically designed for vector similarity search against an index endpoint, returning the nearest neighbors for a given query vector. The other options either target the wrong resource (indexes instead of indexEndpoints) or use methods intended for different purposes like model prediction.

Answer

projects.locations.indexes.match

Answer

projects.locations.indexes.query

Answer

projects.locations.endpoints.predict
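
A query against the method named in the accepted answer to Question 12 can be sketched as a resource path plus a JSON body. This is an illustration under assumptions: the helper name and all identifiers are hypothetical, and the body field names (`deployedIndexId`, `queries`, `datapoint`, `featureVector`, `neighborCount`) reflect my understanding of the request schema and should be verified against the API reference.

```python
def find_neighbors_request(project, location, index_endpoint_id,
                           deployed_index_id, query_vector, neighbor_count=10):
    """Return (resource path, JSON body) for an indexEndpoints.findNeighbors call."""
    path = (
        f"projects/{project}/locations/{location}"
        f"/indexEndpoints/{index_endpoint_id}:findNeighbors"
    )
    body = {
        "deployedIndexId": deployed_index_id,
        "queries": [
            {
                "datapoint": {"featureVector": list(query_vector)},
                "neighborCount": neighbor_count,
            }
        ],
    }
    return path, body


path, body = find_neighbors_request(
    project="my-project",              # hypothetical
    location="us-central1",            # hypothetical
    index_endpoint_id="987654321",     # hypothetical
    deployed_index_id="daily_index",   # hypothetical
    query_vector=[0.1, 0.2, 0.3],
    neighbor_count=5,
)
print(path)
print(body)
```

Note that the path targets the indexEndpoints resource rather than indexes, which is the distinction the question tests.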

Question 13

You are deploying a model for real-time inference with strict latency requirements (<100ms P99). You want to autoscale based on custom metrics. Which TWO actions should you take? (Choose 2)

Accepted Answer

Configure the endpoint to use custom metrics from Cloud Monitoring. Cloud Monitoring custom metrics let you drive autoscaling from signals that directly affect inference latency, such as request queue depth or model throughput. The autoscaler can then react to real-time demand more precisely than with CPU or memory utilisation alone, which is critical for meeting a strict P99 latency target.

Answer

Use a regional endpoint to reduce network latency.

Answer

Enable GPU acceleration for faster inference.

Answer

Set minReplicas to 0 to save cost.

Question 14

Your team is using Vertex AI Prediction for a large-scale NLP model (PyTorch, custom ops). The model currently runs on CPU but you want to optimise inference cost and performance. Which THREE approaches should you consider? (Choose 3)

Accepted Answer

Deploy the model with a GPU machine type and use TensorRT optimisation. Deploying the model on a GPU machine type (e.g., with an NVIDIA A100 or T4) and applying TensorRT optimisation can significantly accelerate inference for PyTorch models with custom ops. TensorRT performs layer fusion, precision calibration (FP16/INT8), and kernel auto-tuning, which reduces latency and improves throughput on GPU hardware. This directly addresses the goal of optimising both cost and performance for large-scale NLP models.

Answer

Convert the model to TensorFlow Lite and deploy on Vertex AI endpoint.

Answer

Switch to batch prediction to reduce cost.

Question 15

You need to deploy a model for online predictions with low latency. You want to ensure that the endpoint can handle traffic bursts without cold start. Which TWO configurations should you set? (Choose 2)

Accepted Answer

Set minReplicas to 1. To avoid cold start, you must keep at least one replica always running (minReplicas ≥ 1) and make sure the machine type has enough capacity. Setting minReplicas to 0 would reintroduce cold starts. Changing the CPU utilisation target affects when the endpoint scales, but it does not by itself prevent a cold start.

Answer

Set maxReplicas to a high number to handle bursts.

Answer

Deploy the model as a custom container.

Answer

Enable autoscaling with a target CPU utilisation of 30%.