PMLE Serving and scaling models — All Questions With Answers

Question 1 (easy, multiple choice)

A company deploys a TensorFlow model on Vertex AI Prediction with a single node. During peak hours, inference latency increases. What should they do first to reduce latency?

Question 2 (medium, multiple choice)

A data science team deploys a PyTorch model using Vertex AI Prediction. The model requires GPU for inference, but they notice high costs and underutilized GPUs during off-peak hours. What is the most cost-effective solution?

Question 3 (hard, multiple choice)

A company serves a scikit-learn model on Vertex AI Prediction but receives a 400 error with 'Prediction failed: Model evaluation error'. What is the most likely cause?

Question 4 (easy, multiple choice)

A company wants to serve a large XGBoost model that exceeds the 2GB limit for Vertex AI Prediction. What should they do?

Question 5 (medium, multiple choice)

A company deploys a model on Vertex AI Prediction with autoscaling enabled. They notice that during a traffic spike, new instances take several minutes to become available, causing high latency. What is the best solution?

Question 6 (hard, multiple choice)

A company uses Vertex AI Prediction with a custom container for a TensorFlow model. They notice that after deploying a new model version, requests still go to the old version. What is the most likely cause?

Question 7 (easy, multiple choice)

A company needs to serve a model with strict latency requirements (<100ms). They are using Vertex AI Prediction with CPU. During testing, latency is 150ms. What should they do?

Question 8 (medium, multi select)

A company is deploying a model for online predictions on Vertex AI. They want to minimize latency while also handling traffic spikes. Which TWO configurations should they choose?

Question 9 (hard, multi select)

A company trains a model using Vertex AI Training and then deploys it to Vertex AI Prediction. They notice that prediction requests fail with 'InvalidArgument: input tensor shape mismatch'. Which THREE are possible causes?
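
To make the error in this question concrete, a client could check the length of each instance before calling the endpoint. This is only a sketch: `check_instances` and the expected feature count of 2 are hypothetical names and values chosen for illustration, not part of any Vertex AI API.

```python
from typing import List, Sequence


def check_instances(instances: Sequence[Sequence[float]], expected_features: int) -> List[int]:
    """Return the indices of instances whose length differs from expected_features."""
    return [i for i, row in enumerate(instances) if len(row) != expected_features]


# A batch of two instances for a model assumed to expect 2 features per instance.
batch = [[1.0, 2.0], [1.0, 2.0, 3.0]]
bad_rows = check_instances(batch, expected_features=2)
print(bad_rows)  # [1]
```

A check like this only catches mismatches in the request payload; it says nothing about preprocessing differences between training and serving.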

Question 10 (medium, multi select)

A company wants to reduce costs for serving a model on Vertex AI Prediction without sacrificing availability. Which THREE strategies should they consider?

Question 11 (hard, multiple choice)

You are an ML engineer at a global e-commerce company. Your team has developed a deep learning model for product recommendation that runs on Vertex AI Prediction. The model is deployed on a single n1-highmem-2 instance (CPU only) with autoscaling enabled (min replicas=1, max replicas=10). During Black Friday, traffic spikes to 1000 requests per second (QPS), and you observe that latency increases from 50ms to over 5000ms, and many requests time out. You check the monitoring dashboard and see that CPU utilization is at 100% on the single instance, and autoscaling is not triggering quickly enough. The team has a budget for this service and wants to handle the spike without compromising latency. What should you do?
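
A rough capacity estimate helps frame the numbers in this scenario. The sketch below assumes, purely for illustration, that the workload is CPU-bound and that each in-flight request keeps one vCPU busy for the full 50 ms; `replicas_needed` is a hypothetical helper, and the 2 vCPUs correspond to the n1-highmem-2 machine type.

```python
def replicas_needed(qps: int, latency_ms: int, vcpus_per_replica: int) -> int:
    """Replicas needed if each in-flight request keeps one vCPU busy for latency_ms."""
    busy_vcpu_ms = qps * latency_ms          # vCPU-milliseconds of work arriving per second
    capacity_ms = 1000 * vcpus_per_replica   # vCPU-milliseconds one replica offers per second
    return -(-busy_vcpu_ms // capacity_ms)   # ceiling division


needed = replicas_needed(qps=1000, latency_ms=50, vcpus_per_replica=2)
print(needed)  # 25
```

Under these assumptions the spike would need more replicas of this machine type than the configured maximum of 10, so the estimate is a starting point for sizing rather than a measured result.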

Question 12 (easy, multiple choice)

A data science team has trained a TensorFlow model and wants to serve it online with minimal latency. Which Vertex AI deployment option should they use to ensure the model can handle traffic spikes without manual scaling?

Question 13 (medium, multiple choice)

A financial services company deploys a fraud detection model on Vertex AI. The model must make predictions in under 100ms. After deployment, latency spikes to 300ms during peak hours. The model is a large ensemble with 500MB size. Which action is most likely to reduce latency?

Question 14 (hard, multiple choice)

A company serves a PyTorch model using a custom container on Vertex AI Prediction. They notice that after a few hours, the endpoint returns 502 errors. The logs show 'Out of memory' errors. The container has a memory limit of 4GB, and the model loads a 3GB vocabulary file. What is the most likely cause and best fix?

Question 15 (medium, multiple choice)

A team deploys a model on Vertex AI that uses a custom prediction routine (CPR) with a dependency on a native library. The container crashes with 'ImportError: libcudart.so.11.0: cannot open shared object file'. How should they resolve this?

Question 16 (easy, multi select)

A company is deploying a machine learning model for real-time inference on Vertex AI. Which TWO practices improve serving performance and reliability?

Question 17 (hard, multi select)

A team is serving a large language model (LLM) on Vertex AI using a custom container. They want to reduce tail latency. Which THREE strategies should they consider?

Question 18 (medium, multiple choice)

A model deployed on Vertex AI Prediction repeatedly exits with code 137. What is the most likely cause?

Exhibit

Refer to the exhibit.
```
Log entry:
{
  "severity": "ERROR",
  "message": "Model server process exited with code 137 (SIGKILL)",
  "container": {
    "memory_usage_mb": 4096,
    "memory_limit_mb": 4096
  },
  "@type": "type.googleapis.com/google.cloud.ml.v1.PredictionLog"
}
```
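
The exit code in the exhibit can be decoded with simple arithmetic: shells conventionally report 128 plus the signal number when a process is killed by a signal, so 137 corresponds to signal 9 (SIGKILL). The sketch below shows that decoding alongside a comparison of the two memory fields from the log entry. The helper names are hypothetical and the signal table is deliberately small.

```python
import json

# Small lookup for a few common POSIX signal numbers.
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit(code: int) -> str:
    """Map a shell-style exit code above 128 to the signal that ended the process."""
    if code > 128:
        return SIGNAL_NAMES.get(code - 128, f"signal {code - 128}")
    return f"exit status {code}"


def at_memory_limit(entry: dict) -> bool:
    """True when the logged memory usage has reached the logged limit."""
    container = entry["container"]
    return container["memory_usage_mb"] >= container["memory_limit_mb"]


# Fields copied from the exhibit; the rest of the entry is omitted.
log_entry = json.loads(
    '{"severity": "ERROR", "container": {"memory_usage_mb": 4096, "memory_limit_mb": 4096}}'
)
print(describe_exit(137), at_memory_limit(log_entry))  # SIGKILL True
```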

Question 19 (hard, multiple choice)

You are a machine learning engineer at a retail company. You have deployed a product recommendation model on Vertex AI Prediction using a custom container. The model is a TensorFlow SavedModel that computes embeddings using a large lookup table. The endpoint is configured with 2 replicas on n1-standard-4 (4 vCPU, 15 GB memory) machines. After deployment, you notice that the endpoint's memory usage grows over time, eventually reaching 90% and causing requests to fail with 503 errors. The container logs show no errors, but the memory usage graph shows a steady increase. The model loads the embedding table (5 GB) at startup. You suspect a memory leak. Which course of action should you take first to diagnose and resolve the issue?

Question 20 (easy, multiple choice)

A company is deploying a machine learning model for real-time fraud detection. The model must respond to requests within 100ms. The model is a TensorFlow model and will be deployed on Google Kubernetes Engine (GKE). Which Google Cloud service should be used to serve the model to minimize latency?

Question 21 (medium, multi select)

A data science team has trained a large deep learning model using Vertex AI Workbench. They want to deploy it to Vertex AI Prediction for online serving. The model is stored in a custom container with a Python-based web server. Which TWO actions should the team take to ensure optimal performance and cost?

Question 22 (hard, multiple choice)

A data scientist deployed a model to Vertex AI Prediction. When making a prediction request as shown in the exhibit, they receive a 400 error. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
$ curl -X POST -H "Content-Type: application/json" -d '{"instances": [[1.0, 2.0, 3.0]]}' https://us-central1-aiplatform.googleapis.com/v1/projects/my-project/locations/us-central1/endpoints/123456:predict
{
  "error": {
    "code": 400,
    "message": "Prediction failed: exception during prediction: RuntimeError: Model input shape mismatch. Expected shape (None, 2) but received shape (1, 3)."
  }
}
```

Question 23 (medium, drag order)

Drag and drop the steps to implement a CI/CD pipeline for ML models using Cloud Build and Vertex AI in the correct order.

Question 24 (medium, drag order)

Drag and drop the steps to set up a feature store for ML features using Vertex AI Feature Store in the correct order.

Question 25 (medium, matching)

Match each feature engineering technique to its description.

Matches

Convert categorical variable into binary columns

Combine two or more features to capture interactions

Normalize numeric features to a standard range

Group continuous values into discrete intervals

Weight term frequency by inverse document frequency

Question 26 (medium, matching)

Match each optimization algorithm to its characteristic.

Matches

Stochastic gradient descent with constant learning rate

Adaptive moment estimation with per-parameter learning rates

Root mean square propagation, adapts learning rate per parameter

Adaptive gradient algorithm, reduces learning rate for frequent features

Accelerates SGD by adding a fraction of previous update

Question 27 (easy, multiple choice)

A company deploys a model on Vertex AI Prediction for real-time inference. Users report intermittent high latency during peak hours. The model is deployed on a single machine type with `min_replica_count=1` and `max_replica_count=5`. Autoscaling is enabled based on CPU utilization. What is the most likely cause of the latency spikes?

Question 28 (medium, multiple choice)

A team needs to serve a PyTorch model for production inference with strict latency requirements (p99 < 100ms). The model has dynamic control flow and uses custom kernels compiled with torch.jit. Which serving approach should they use?

Question 29 (hard, multiple choice)

A data science team deploys a large language model (LLM) on Vertex AI Prediction using an NVIDIA A100 GPU. The end-to-end latency is acceptable, but the cost is high due to low GPU utilization. The model is stateless and requests are independent. Which strategy would most effectively reduce cost per prediction?

Question 30 (easy, multiple choice)

A company has deployed a fraud detection model on Vertex AI Prediction. After three months, the model's accuracy has degraded, and the business is losing money due to undetected fraud. What should the team implement to proactively detect such issues?

Question 31 (medium, multiple choice)

A team wants to deploy two versions of a model (v1 and v2) on Vertex AI Endpoint to conduct an A/B test. They need to split traffic so that 10% of requests go to v2. Which configuration achieves this?
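
On a Vertex AI endpoint, a traffic split is a mapping from deployed model ID to an integer percentage, and the percentages must sum to 100. The sketch below builds such a mapping for a 10% canary. `canary_split` and both IDs are hypothetical placeholders, and the code only constructs the dictionary without calling any Google Cloud API.

```python
from typing import Dict


def canary_split(stable_id: str, canary_id: str, canary_percent: int) -> Dict[str, int]:
    """Build a two-way traffic split whose integer percentages sum to 100."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {stable_id: 100 - canary_percent, canary_id: canary_percent}


# Placeholder deployed model IDs; real IDs are assigned at deployment time.
split = canary_split("v1-deployed-id", "v2-deployed-id", canary_percent=10)
print(split)  # {'v1-deployed-id': 90, 'v2-deployed-id': 10}
```

Setting `canary_percent=0` in the same helper yields the mapping for a full rollback to the stable version.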

Question 32 (hard, multiple choice)

A model serving team notices that during a flash sale, a real-time recommendation model experiences sudden spikes in traffic, causing some requests to time out. The endpoint is configured with `min_replica_count=3`, `max_replica_count=10`, and autoscaling metric set to `target_utilization=0.6` on CPU. Despite this, autoscaling is too slow. What change will most improve the autoscaling responsiveness?
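
To see how a utilization target interacts with the replica bounds in this configuration, the sketch below applies a proportional scaling rule of the kind used by the Kubernetes Horizontal Pod Autoscaler. This is an assumption for illustration only: the source does not state the exact algorithm Vertex AI uses, and `desired_replicas` is a hypothetical helper.

```python
def desired_replicas(current: int, utilization_pct: int, target_pct: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Proportional scaling rule, clamped to the configured replica bounds."""
    raw = -((-current * utilization_pct) // target_pct)  # ceiling of current * util / target
    return max(min_replicas, min(max_replicas, raw))


# Bounds and target taken from the question: min 3, max 10, target 60% CPU.
print(desired_replicas(3, 90, 60, 3, 10))   # 5
print(desired_replicas(8, 100, 60, 3, 10))  # 10
```

A rule like this reacts only after utilization has already risen, which is one reason reactive scaling on a lagging metric can trail a sudden spike.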

Question 33 (easy, multiple choice)

A machine learning engineer wants to manage multiple model versions and facilitate collaboration across teams. The goal is to track model lineage, versioning, and approvals. Which Vertex AI service should they use?

Question 34 (medium, multiple choice)

A company needs to serve a model for low-frequency inference requests (a few hundred per month) from multiple regions. The priority is simplicity and minimal cost without maintaining infrastructure. Which serving option should they choose?

Question 35 (hard, multiple choice)

A team deploys a real-time model using a custom container on Vertex AI Prediction. The container is large (5 GB) and cold starts are causing latency spikes. The endpoint is configured with `min_replica_count=0` to reduce cost. The team wants to keep the cost low while reducing cold starts. What is the best approach?

Question 36 (easy, multi select)

Which TWO are best practices for deploying models to Vertex AI Prediction? (Choose 2.)

Question 37 (medium, multi select)

A model serving team is experiencing high latency in production. Which TWO actions should they take to diagnose the root cause? (Choose 2.)

Question 38 (hard, multi select)

Which THREE factors are critical when designing a model serving architecture for a global user base with strict latency SLAs? (Choose 3.)

Question 39 (easy, multiple choice)

A company deploys a model on Vertex AI Endpoints for real-time inference. They notice latency spikes during peak hours. Which action is most effective to reduce latency without sacrificing accuracy?

Question 40 (medium, multiple choice)

A team uses Vertex AI Prediction with a custom container. They want to perform canary deployments by sending 5% of traffic to a new model version. Which method should they use?

Question 41 (hard, multiple choice)

A machine learning engineer notices that, for a model served on Vertex AI Endpoints, the first request after an idle period (cold start) is consistently 20% slower than subsequent requests. They are using automatic scaling with min replicas=1. What is the most likely cause and best solution?

Question 42 (easy, multiple choice)

An organization needs to serve a large model (10 GB) with low latency across multiple regions. Which Vertex AI feature best meets this requirement?

Question 43 (medium, multiple choice)

A data scientist uses Vertex AI Workbench to train a model and then deploys it to an endpoint. They want to automate the retraining and redeployment pipeline when new data arrives. Which service should they use?

Question 44 (hard, multiple choice)

A model deployed on Vertex AI Endpoints returns predictions, but the performance metrics (e.g., AUC) degrade over time. The input data distribution is shifting. The team wants to detect and alert on this drift automatically. Which set of actions should they take?
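
Drift of the kind described here is usually quantified by comparing a feature's serving distribution against its training distribution. The sketch below uses Jensen-Shannon divergence over binned frequencies as one common choice; the source does not say which statistic a given monitoring service applies. The histograms and the 0.05 alert threshold are made-up values for illustration.

```python
import math
from typing import Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


training = [0.5, 0.3, 0.2]  # hypothetical binned feature frequencies at training time
serving = [0.2, 0.3, 0.5]   # hypothetical frequencies observed in production
score = js_divergence(training, serving)
print(score > 0.05)  # True for these made-up histograms
```

With base-2 logarithms the score ranges from 0 for identical distributions to 1 for distributions with no overlap, which makes a fixed alert threshold easier to interpret.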

Question 45 (easy, multiple choice)

For a low-latency real-time serving requirement, which type of Vertex AI Endpoint is appropriate?

Question 46 (medium, multiple choice)

A team deploys a model using Vertex AI Endpoint with automatic scaling. They observe that during traffic spikes, new instances take a long time to become ready, causing high latency for some requests. What should they configure to reduce this startup time?

Question 47 (hard, multiple choice)

A company uses Vertex AI Endpoints for model serving and wants to implement A/B testing between model versions. They need to gradually shift traffic from the old to the new version while monitoring performance. Which Vertex AI feature allows this with minimal operational overhead?

Question 48 (easy, multi select)

Which TWO options are best practices for reducing model serving latency on Vertex AI Endpoints? (Choose two.)

Question 49 (medium, multi select)

Which THREE factors should be considered when choosing between using Vertex AI Endpoints and Cloud Run for model serving? (Choose three.)

Question 50 (hard, multi select)

Which TWO options can help detect model performance degradation in production? (Choose two.)

Question 51 (hard, multiple choice)

Refer to the exhibit. A data scientist deploys a new model version (model_v2) to an existing endpoint with 20% traffic. After a few days, they notice that model_v2's error rate is higher than model_v1's. They want to route all traffic back to model_v1 immediately. Which command achieves this with minimal disruption?

Exhibit

```
$ gcloud ai endpoints describe my-endpoint --region=us-central1
displayName: my-endpoint
name: projects/123456/locations/us-central1/endpoints/789012
deployedModels:
- id: '1'
  model: projects/123456/locations/us-central1/models/456789
  displayName: model_v1
  createTime: '2024-01-15T10:00:00Z'
  modelDisplayName: test_model
  trafficSplit: 0.8
- id: '2'
  model: projects/123456/locations/us-central1/models/987654
  displayName: model_v2
  createTime: '2024-01-20T10:00:00Z'
  modelDisplayName: test_model_v2
  trafficSplit: 0.2
```

Question 52 (medium, multiple choice)

Refer to the exhibit. A team deploys a model with the configuration shown. They observe that during traffic spikes, the endpoint does not scale up quickly enough, causing increased latency. The average CPU utilization never exceeds 50%. What is the most likely reason for the slow scaling?

Exhibit

```
deployments:
- model: projects/my-project/locations/us-central1/models/123
  displayName: model_v1
  trafficPercentage: 100
  minReplicaCount: 2
  maxReplicaCount: 10
  machineType: n1-standard-4
  acceleratorType: NVIDIA_TESLA_T4
  acceleratorCount: 1
  strategy: manual
```

Question 53 (easy, multiple choice)

Refer to the exhibit. A team deploys a model using Cloud Run. They notice that after scaling up, the new instances take about 90 seconds to become ready and serve requests. They want to reduce this startup time. Which configuration change is most likely to help?

Exhibit

```
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-serving
spec:
  template:
    spec:
      containers:
      - image: gcr.io/my-project/model:v2
        resources:
          limits:
            cpu: '2'
            memory: 8Gi
        startupProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
      containerConcurrency: 80
```
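
The probe settings in the exhibit put a floor on how soon an instance can be reported ready. The sketch below uses a simplified model, assuming the first probe fires exactly at `initialDelaySeconds` and then once every `periodSeconds`. The 25-second application start time is a hypothetical value, and `earliest_ready_seconds` is an illustrative helper rather than part of any Cloud Run API.

```python
def earliest_ready_seconds(initial_delay: int, period: int, app_ready_at: int) -> int:
    """Time of the first probe that fires at or after the app starts listening."""
    if period <= 0:
        raise ValueError("period must be positive")
    probe_time = initial_delay
    while probe_time < app_ready_at:
        probe_time += period
    return probe_time


# Exhibit settings: the instance cannot pass a probe before the 60 s initial delay.
print(earliest_ready_seconds(initial_delay=60, period=10, app_ready_at=25))  # 60
# Same app with no initial delay: the probe at 30 s is the first to succeed.
print(earliest_ready_seconds(initial_delay=0, period=10, app_ready_at=25))   # 30
```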

Question 54 (medium, multiple choice)

A company deploys a custom TensorFlow model to Vertex AI Endpoint for online predictions. After deployment, prediction latency is consistently high (over 500ms) even under low traffic. The model is CPU-only and the default machine type (n1-standard-2) is used. Which action will most likely reduce prediction latency?

Question 55 (hard, multiple choice)

A data scientist runs a batch prediction job on Vertex AI using a custom container. The job processes a large JSONL file (10 GB) and fails with an out-of-memory error. The machine type is n1-standard-4 (15 GB memory). Which action should be taken to resolve the error while minimizing cost?
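
When a container parses a large JSONL file itself, reading it as a stream of small batches keeps memory use bounded by the batch size rather than the file size. The sketch below shows that pattern under the assumption that the custom container controls how the input is read. `iter_batches` is a hypothetical helper and the in-memory sample stands in for the real file.

```python
import io
import json
from typing import Iterator, List, TextIO


def iter_batches(lines: TextIO, batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of parsed JSONL records, holding at most batch_size in memory."""
    batch: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


# Three records read two at a time; a real job would pass an open file handle.
sample = io.StringIO('{"x": 1}\n{"x": 2}\n{"x": 3}\n')
sizes = [len(b) for b in iter_batches(sample, batch_size=2)]
print(sizes)  # [2, 1]
```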

Question 56 (easy, multiple choice)

A company needs to serve a model for real-time predictions with a strict latency SLA of 100ms at the 99th percentile. The model is lightweight and traffic patterns are highly variable with occasional spikes. Which deployment strategy best meets the SLA while controlling cost?

Question 57 (medium, multiple choice)

A machine learning team wants to perform A/B testing between two model versions (v1 and v2) on Vertex AI Endpoint. They need to gradually route 10% of traffic to v2 while monitoring performance. What is the most efficient way to achieve this?

Question 58 (hard, multiple choice)

A team deploys a TensorFlow model using a custom container to Vertex AI Endpoint. The container expects the saved model at the /model directory, but predictions fail with a 'model not found' error. The team used the default Vertex AI serving container in the past. What is the most likely cause?

Question 59 (medium, multiple choice)

A company deploys a model on Vertex AI Endpoint and expects high traffic spikes during promotional events. The current configuration uses manual scaling with 2 replicas. Which autoscaling configuration should they use to handle spikes while minimizing cost during normal traffic?

Question 60 (easy, multiple choice)

A startup wants to deploy a small machine learning model for real-time predictions but has a very limited budget. Traffic is minimal and predictable. They want to avoid paying for idle resources. Which serving option is most cost-effective?

Question 61 (hard, multiple choice)

A data engineer is troubleshooting a Vertex AI Endpoint that serves a large BERT model. After deployment, many prediction requests fail with 'Out of Memory' errors. The machine type is n1-standard-8 (30 GB memory) with no accelerator. Which action will most likely resolve the issue?

Question 62 (medium, multiple choice)

After deploying a new version of a model to a Vertex AI Endpoint, the team notices that predictions are still returning results from the old version. The deployment command used a traffic split of 100% to the new version. What is the most likely cause?

Question 63 (medium, multi select)

Which TWO actions can help reduce prediction latency for a model deployed on Vertex AI Endpoint without changing the model architecture?

Question 64 (hard, multi select)

A company deploys a model to Vertex AI Endpoint with autoscaling enabled. During a traffic spike, they observe high tail latency (99th percentile > 2s). Which TWO factors are most likely contributing to this latency?

Question 65 (medium, multi select)

A team wants to serve a large PyTorch model (3 GB) for online predictions with low latency. Which THREE actions should they take?

Question 66 (medium, multiple choice)

You deploy a PyTorch model to Vertex AI Online Prediction. After deployment, you observe that inference latency is approximately 300ms per request, but the desired SLA is under 100ms. The model uses a custom container with CPU only. Which action is most likely to reduce latency to the target?

Question 67 (hard, multiple choice)

Your team is serving a large language model on Vertex AI using a custom container. The endpoint experiences intermittent 502 errors during traffic spikes. The autoscaling configuration uses a CPU utilization target of 60% and the model is deployed on n1-standard-4 instances. The model requires significant memory. Which combination of changes is most likely to resolve the issue?

Question 68 (easy, multiple choice)

You need to serve a TensorFlow model that has a cold start latency of 20 seconds. The model is used for a real-time application with unpredictable traffic, but occasional bursts require immediate responses. What is the best deployment strategy to minimize both cold start impact and cost?

Question 69 (hard, multiple choice)

Your team deploys a multi-model endpoint on Vertex AI with two models: Model A (small, low latency) and Model B (large, high latency). You configure traffic splitting so that 90% goes to Model A and 10% to Model B. However, you notice that the latency for Model A increases when Model B receives traffic. What is the most likely cause?

Question 70 (medium, multiple choice)

You are deploying a scikit-learn model for online predictions. The model size is 200 MB. You want to minimize latency and cost. Which serving option should you choose?

Question 71 (easy, multiple choice)

A company is serving a model for their e-commerce website. They expect traffic to be low at night and very high during flash sales. They want to minimize costs while ensuring availability during spikes. Which autoscaling configuration should they use?

Question 72 (hard, multiple choice)

Your model serving endpoint on Vertex AI is experiencing increased memory usage after a recent update. The model was converted from TensorFlow to TF Lite for faster inference. You notice that the endpoint's instances occasionally get killed due to out-of-memory (OOM) errors. What is the most likely cause?

Question 73 (medium, multiple choice)

You are using Vertex AI Model Monitoring for your deployed model. You receive an alert that the prediction distribution is significantly different from the training distribution. What should you do first?

Question 74 (hard, multiple choice)

You have a model that requires GPU for efficient inference. You deploy it on Vertex AI with a single NVIDIA T4 GPU accelerator and notice that the GPU utilization hovers around 30%. The endpoint has 10 replicas. What is the best way to improve cost efficiency while maintaining throughput?

Question 75 (easy, multi select)

Which TWO actions can help reduce the latency of online prediction requests for a deep learning model served on Vertex AI?

Question 76 (medium, multi select)

Which THREE factors should you consider when deciding between online prediction and batch prediction on Vertex AI?

Question 77 (hard, multi select)

Which TWO statements are true about canary deployments for Vertex AI endpoints?

Question 78 (medium, multiple choice)

You run the command shown in the exhibit to deploy a new model version to an existing endpoint. After deployment, you observe that the endpoint's previous model version is still receiving 100% of traffic. What is the most likely reason for this?

Exhibit

Refer to the exhibit.

```
gcloud ai endpoints deploy-model $ENDPOINT_ID \
  --model=$MODEL_ID \
  --display-name=my-model \
  --machine-type=n1-standard-4 \
  --min-replica-count=2 \
  --max-replica-count=10 \
  --traffic-split=0-100
```

Question 79 (hard, multiple choice)

You are troubleshooting a Vertex AI endpoint for a customer. The exhibit shows the endpoint configuration. The customer reports that Model A is experiencing high latency during peaks. Model B runs fine. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
{
  "name": "projects/my-project/locations/us-central1/endpoints/1234",
  "displayName": "my-endpoint",
  "dedicatedEndpointEnabled": false,
  "deployedModels": [
    {
      "id": "model-a-1",
      "displayName": "model-a",
      "model": "projects/my-project/locations/us-central1/models/456",
      "dedicatedResources": {
        "minReplicaCount": 1,
        "maxReplicaCount": 5,
        "machineSpec": {
          "machineType": "n1-standard-4",
          "acceleratorType": "NVIDIA_TESLA_T4",
          "acceleratorCount": 1
        }
      }
    },
    {
      "id": "model-b-1",
      "displayName": "model-b",
      "model": "projects/my-project/locations/us-central1/models/789",
      "dedicatedResources": {
        "minReplicaCount": 1,
        "maxReplicaCount": 5,
        "machineSpec": {
          "machineType": "n1-standard-8",
          "acceleratorType": "NVIDIA_TESLA_T4",
          "acceleratorCount": 2
        }
      }
    }
  ],
  "trafficSplit": {
    "model-a-1": 50,
    "model-b-1": 50
  }
}
```

Question 80 (hard, multiple choice)

You are a machine learning engineer at a financial technology company. You have deployed a complex ensemble model consisting of three sub-models (XGBoost, TensorFlow, and PyTorch) for real-time fraud detection. The model is served on Vertex AI online prediction with a custom container that orchestrates the three models sequentially. The endpoint currently uses n1-highmem-8 machines with no accelerators. You are experiencing high latency (avg 500ms) during peak trading hours (9:30 AM - 4:00 PM EST), exceeding the 200ms SLA. The container is CPU-bound, and memory usage is around 60%. The model weights total 500 MB. You have already tried increasing the batch size per request from 1 to 4, which reduced latency slightly but not enough. The traffic pattern is very spiky, with sudden bursts of up to 1000 requests per second. Your goal is to meet the latency SLA without significantly increasing cost. Which action should you take?

Question 81 (easy, multiple choice)

A company has deployed a TensorFlow model on Vertex AI Prediction for real-time inference. They notice that during peak hours, the prediction latency increases significantly, and some requests time out. The model requires GPU acceleration. Which action should they take to reduce latency and avoid timeouts?

Question 82 (medium, multiple choice)

A data science team deploys a custom container on Vertex AI Prediction for a PyTorch model. After deployment, the model returns predictions that are consistently off by a constant factor. The model performed correctly during local testing. What is the most likely cause?

Question 83 (hard, multiple choice)

A company needs to serve a large Transformer model (5 GB) with strict latency requirements (< 50 ms) and throughput of 1000 requests per second. The model is in SavedModel format. They are considering deployment options on Google Cloud. Which approach best meets these requirements?

Question 84 (medium, multiple choice)

A machine learning engineer notices that the Vertex AI Prediction endpoint's error rate has increased over the past week. The model was retrained with new data and redeployed. Which step should the engineer take first to diagnose the issue?

Question 85mediummulti select

Read the full Serving and scaling models explanation →

A company wants to deploy a model for real-time inference with high availability across multiple Google Cloud regions. The model is small and stateless. Which two steps should they take? (Choose two.)

Question 86hardmulti select

Read the full Serving and scaling models explanation →

A company runs batch predictions on a large dataset using Vertex AI Batch Prediction. They want to reduce costs without significantly increasing processing time. Which three actions should they take? (Choose three.)

Question 87easymultiple choice

Read the full Serving and scaling models explanation →

Your team has deployed a scikit-learn model using a custom container on Vertex AI Prediction. The model receives about 100 requests per second, and the endpoint is configured with a single n1-standard-4 machine. You notice that response times are around 200 ms on average, but occasionally spike to over 10 seconds during traffic bursts. You have set the min replicas to 1 and max replicas to 10. Despite this, spikes still occur. What is the most likely cause and the best course of action?

Question 88easymultiple choice

Read the full Serving and scaling models explanation →

You are using Vertex AI Training to train a model and then automatically deploy the best candidate to a Vertex AI Prediction endpoint via the Vertex AI Model Registry. After deployment, the endpoint serves predictions from the new model, but the observed prediction quality differs significantly from the evaluation metrics computed during training. The training scripts used TensorFlow with a serving input function. What is the most likely issue and how would you fix it?

Question 89mediummultiple choice

Read the full Serving and scaling models explanation →

Your organization has a large production system that uses Vertex AI Prediction for an NLP model with a 2 GB memory footprint. The endpoint is configured with 5 replicas, each using an n1-standard-4 with a single T4 GPU. Recently, you observed an increase in 503 errors during peak hours. Cloud Monitoring shows that GPU utilization is consistently above 90% across all replicas, while CPU and memory are below 50%. You have already increased the max replicas to 10, but the errors persist because the increased replicas also become saturated. What should you do to resolve the issue?

Question 90mediummultiple choice

Read the full Serving and scaling models explanation →

You are responsible for deploying a real-time recommendation model that uses a large embedding table (5 GB) and a small neural network. The model is served through a custom container on Vertex AI Prediction. The end-to-end latency requirement is under 200 ms. During load testing with 500 QPS, you observe that latency increases linearly with batch size. You are currently using a single replica with an n1-standard-8 machine and one T4 GPU. The embedding table is loaded entirely in GPU memory. However, CPU utilization is at 100% while GPU is at 30%. What is the best approach to meet the latency requirement at scale?

Question 91hardmultiple choice

Read the full Serving and scaling models explanation →

Your team has deployed a PyTorch model using a custom container on Vertex AI Prediction. The model uses dynamic batching to combine incoming requests. You notice that the average latency is 150 ms, but the 99th percentile latency is 2 seconds. Cloud Monitoring shows that the CPU is idle much of the time, and GPU utilization is around 70%. The model is deployed on a single n1-standard-4 with a T4 GPU. You suspect the issue is related to request queuing. Which change would most effectively reduce tail latency?

Question 92hardmultiple choice

Read the full Serving and scaling models explanation →

You manage a multi-tenant serving system on Vertex AI Prediction where multiple models are deployed in a single endpoint using model versioning. One particular model version (v2) is consuming excessive resources, causing latency spikes for other versions. You need to isolate this model to prevent interference. The models are all in TensorFlow SavedModel format. What is the best approach?

Question 93hardmulti select

Read the full Serving and scaling models explanation →

An ML engineer is deploying a BERT-based natural language processing model for real-time inference on Vertex AI Prediction. The model has a large memory footprint (2 GB), and the endpoint experiences unpredictable traffic spikes of up to 10x the baseline. The engineer needs to minimize latency and cost while handling spiky traffic. Which TWO actions should the engineer take? (Choose two.)

Question 94mediummultiple choice

Read the full Serving and scaling models explanation →

An ML engineer notices that predictions are taking longer than expected under moderate traffic. Based on the endpoint configuration shown in the exhibit, what is the most likely cause of the high latency?

Exhibit

Refer to the exhibit.

gcloud ai endpoints describe projects/my-project/locations/us-central1/endpoints/456
...
deployedModels:
  - id: 'bert-model-1'
    model: projects/my-project/locations/us-central1/models/bert
    displayName: bert
    automaticResources:
      minReplicaCount: 1
      maxReplicaCount: 10
    machineType: n1-standard-4
    accelerator:
      count: 0
    enableAccessLogging: true
    ...
disableContainerLogging: true
...

Question 95easymultiple choice

Read the full Serving and scaling models explanation →

A company has deployed a computer vision model on Vertex AI Prediction using a custom container. The model processes high-resolution images and serves predictions to a mobile application. Recently, users have reported that predictions sometimes take over 10 seconds, and the application times out. The ML engineer's monitoring shows that the endpoint's CPU utilization is consistently high (above 85%) and that the request latency spikes during peak hours. The model is deployed on n1-standard-4 machines with automatic scaling set to minReplicaCount=1 and maxReplicaCount=5. The engineer has observed that the endpoint rarely scales beyond 2 replicas even during peak hours. What should the engineer do to reduce prediction latency?

Refer to the exhibit.

```
Log entry:
{
  "severity": "ERROR",
  "message": "Model server process exited with code 137 (SIGKILL)",
  "container": {
    "memory_usage_mb": 4096,
    "memory_limit_mb": 4096
  },
  "@type": "type.googleapis.com/google.cloud.ml.v1.PredictionLog"
}
```
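As a reading aid for the log entry above, the following minimal sketch decodes the exit code and compares memory usage with the limit. It assumes the common shell convention that an exit code above 128 means the process was terminated by signal number (code - 128). The helper names `describe_exit_code` and `at_memory_limit` are hypothetical and are not part of any Vertex AI API. Signal names are resolved with the standard `signal` module, so the result depends on the platform (on Linux, signal 9 is SIGKILL).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a process exit code (hypothetical helper).

    Assumes the shell convention: code > 128 means the process was
    killed by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


def at_memory_limit(container: dict) -> bool:
    """Return True if reported memory usage has reached the limit."""
    return container["memory_usage_mb"] >= container["memory_limit_mb"]


# Values copied from the exhibit's log entry.
log_entry = {
    "severity": "ERROR",
    "message": "Model server process exited with code 137 (SIGKILL)",
    "container": {"memory_usage_mb": 4096, "memory_limit_mb": 4096},
}

print(describe_exit_code(137))
print(at_memory_limit(log_entry["container"]))
```

A SIGKILL that coincides with usage equal to the limit is consistent with the container being terminated for exceeding its memory allowance, though the log entry alone does not prove it.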

Refer to the exhibit.

```
$ curl -X POST -H "Content-Type: application/json" -d '{"instances": [[1.0, 2.0, 3.0]]}' https://us-central1-aiplatform.googleapis.com/v1/projects/my-project/locations/us-central1/endpoints/123456:predict
{
  "error": {
    "code": 400,
    "message": "Prediction failed: exception during prediction: RuntimeError: Model input shape mismatch. Expected shape (None, 2) but received shape (1, 3)."
  }
}

$ gcloud ai endpoints describe my-endpoint --region=us-central1
displayName: my-endpoint
name: projects/123456/locations/us-central1/endpoints/789012
deployedModels:
  - id: '1'
    model: projects/123456/locations/us-central1/models/456789
    displayName: model_v1
    createTime: '2024-01-15T10:00:00Z'
    modelDisplayName: test_model
    trafficSplit: 0.8
  - id: '2'
    model: projects/123456/locations/us-central1/models/987654
    displayName: model_v2
    createTime: '2024-01-20T10:00:00Z'
    modelDisplayName: test_model_v2
    trafficSplit: 0.2
```
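The error message in the exhibit above reports an expected shape of (None, 2) against a received shape of (1, 3). A client can catch this kind of mismatch before sending the request. The sketch below is one possible pre-flight check using only the standard library. The function name `check_instances` and the `expected_features` parameter are hypothetical, and the expected width of 2 is taken from the error message rather than from any model metadata.

```python
import json


def check_instances(payload: dict, expected_features: int) -> list:
    """Return (index, width) for each instance whose width is wrong.

    Hypothetical client-side check; it only inspects the outer
    "instances" list of a JSON prediction request body.
    """
    problems = []
    for index, row in enumerate(payload["instances"]):
        if len(row) != expected_features:
            problems.append((index, len(row)))
    return problems


# Request body copied from the curl command in the exhibit.
payload = json.loads('{"instances": [[1.0, 2.0, 3.0]]}')

# Expected width of 2 comes from "Expected shape (None, 2)".
mismatches = check_instances(payload, expected_features=2)
for index, width in mismatches:
    print(f"instance {index} has {width} features, expected 2")
```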

deployments:
  - model: projects/my-project/locations/us-central1/models/123
    displayName: model_v1
    trafficPercentage: 100
    minReplicaCount: 2
    maxReplicaCount: 10
    machineType: n1-standard-4
    acceleratorType: NVIDIA_TESLA_T4
    acceleratorCount: 1
    strategy: manual

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-serving
spec:
  template:
    spec:
      containers:
        - image: gcr.io/my-project/model:v2
          resources:
            limits:
              cpu: '2'
              memory: 8Gi
          startupProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10
      containerConcurrency: 80

Refer to the exhibit.

gcloud ai endpoints deploy-model $ENDPOINT_ID \
  --model=$MODEL_ID \
  --display-name=my-model \
  --machine-type=n1-standard-4 \
  --min-replica-count=2 \
  --max-replica-count=10 \
  --traffic-split=0-100

Refer to the exhibit.

{
  "name": "projects/my-project/locations/us-central1/endpoints/1234",
  "displayName": "my-endpoint",
  "dedicatedEndpointEnabled": false,
  "deployedModels": [
    {
      "id": "model-a-1",
      "displayName": "model-a",
      "model": "projects/my-project/locations/us-central1/models/456",
      "dedicatedResources": {
        "minReplicaCount": 1,
        "maxReplicaCount": 5,
        "machineSpec": {
          "machineType": "n1-standard-4",
          "acceleratorType": "NVIDIA_TESLA_T4",
          "acceleratorCount": 1
        }
      }
    },
    {
      "id": "model-b-1",
      "displayName": "model-b",
      "model": "projects/my-project/locations/us-central1/models/789",
      "dedicatedResources": {
        "minReplicaCount": 1,
        "maxReplicaCount": 5,
        "machineSpec": {
          "machineType": "n1-standard-8",
          "acceleratorType": "NVIDIA_TESLA_T4",
          "acceleratorCount": 2
        }
      }
    }
  ],
  "trafficSplit": {
    "model-a-1": 50,
    "model-b-1": 50
  }
}
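An endpoint description like the one above can be sanity-checked in a few lines. The sketch below assumes the JSON has already been loaded into a Python dict and checks two things: that the traffic split percentages add up to 100, and that every key in `trafficSplit` refers to a deployed model ID. The function name `check_traffic_split` is hypothetical, and the dict is a trimmed copy of the exhibit that keeps only the fields the check reads.

```python
def check_traffic_split(endpoint: dict) -> list:
    """Return a list of human-readable issues; empty if none found.

    Hypothetical helper operating on a parsed endpoint description.
    """
    deployed_ids = {m["id"] for m in endpoint.get("deployedModels", [])}
    split = endpoint.get("trafficSplit", {})
    issues = []

    total = sum(split.values())
    if total != 100:
        issues.append(f"traffic split sums to {total}, expected 100")

    for model_id in split:
        if model_id not in deployed_ids:
            issues.append(f"traffic assigned to unknown model id {model_id!r}")

    return issues


# Trimmed copy of the exhibit: only the fields used by the check.
endpoint = {
    "displayName": "my-endpoint",
    "deployedModels": [
        {"id": "model-a-1", "displayName": "model-a"},
        {"id": "model-b-1", "displayName": "model-b"},
    ],
    "trafficSplit": {"model-a-1": 50, "model-b-1": 50},
}

issues = check_traffic_split(endpoint)
print(issues if issues else "traffic split looks consistent")
```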
