PMLE Monitoring ML solutions — All Questions With Answers

Question 1 (medium, multiple choice)

You have deployed a regression model that predicts house prices. Over the past month, the model's predictions have been consistently too high. You suspect data drift in the input features. Which monitoring metric should you prioritize to confirm this?
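
One common way to quantify drift in a single input feature is the population stability index (PSI), which compares a feature's binned training distribution with its binned serving distribution. The sketch below is illustrative only. PSI is a general-purpose statistic swapped in here, not necessarily what Vertex AI Model Monitoring computes, and the bin counts are hypothetical.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms that share the same bins."""
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Convert counts to proportions; eps avoids log(0) for empty bins.
        p_e = max(e / exp_total, eps)
        p_a = max(a / act_total, eps)
        score += (p_a - p_e) * math.log(p_a / p_e)
    return score

# Hypothetical histograms of one input feature (e.g. square footage) over the same 4 bins.
training = [250, 250, 250, 250]
serving = [100, 200, 300, 400]
print(round(psi(training, serving), 3))
```

A PSI of 0 means the two histograms are identical. Larger values mean the serving distribution has moved further away from the training distribution.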

Question 2 (hard, multiple choice)

Your team has deployed a text classification model on Vertex AI Endpoints. You notice that the model's latency has increased significantly over the last week, but the request rate has remained stable. Which of the following is the most likely cause?

Question 3 (easy, multiple choice)

You are monitoring a classification model that predicts loan default. The model was trained on data from 2020-2022. In 2023, the economic conditions changed, and the model's accuracy dropped significantly. Which monitoring approach would best help you detect this issue early?

Question 4 (medium, multiple choice)

You are responsible for monitoring a batch prediction pipeline that runs daily. Recently, the pipeline started failing intermittently with out-of-memory errors. The input data volume has not changed. What is the most likely cause?

Question 5 (easy, multiple choice)

You need to set up monitoring for a Vertex AI model that serves predictions in real-time. The model is expected to have a latency SLA of under 100ms. Which metric should you configure an alert on to ensure the SLA is met?
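
Latency SLAs are often stated in terms of a high percentile rather than the mean, because a mean can hide a slow tail. This minimal sketch uses hypothetical latency samples and the nearest-rank percentile definition. Cloud Monitoring computes percentiles from distribution metrics, so its values may differ slightly from this calculation.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Smallest sample such that at least pct% of samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies (ms): most requests are fast, a few are slow.
latencies_ms = [40] * 95 + [250] * 5
mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile_nearest_rank(latencies_ms, 99)
SLA_MS = 100
print(f"mean={mean_ms:.1f}ms, p99={p99_ms}ms, p99 over SLA={p99_ms > SLA_MS}")
```

Here the mean stays well under 100ms while the slowest 5% of requests take 250ms.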

Question 6 (hard, multiple choice)

Your company uses a custom container for model serving on Vertex AI. After a recent update, the model returns predictions but they are clearly wrong (e.g., negative probabilities for a classification model). The logs show no errors. What is the most likely cause?

Question 7 (medium, multiple choice)

You are monitoring a machine learning pipeline that runs on Vertex AI Pipelines. The pipeline occasionally fails with a 'ResourceExhausted' error when attempting to read data from BigQuery. Which action should you take to resolve this issue?

Question 8 (easy, multiple choice)

You have an online prediction model that is showing increasing prediction latency. You have already verified that the request rate and input data size are unchanged. Which of the following should you investigate next?

Question 9 (medium, multi-select)

Which TWO metrics should you monitor to detect data drift in a batch prediction pipeline?

Question 10 (hard, multi-select)

Which THREE components should you include in a comprehensive model monitoring dashboard for a production ML system?

Question 11 (easy, multi-select)

Which TWO actions are appropriate when you detect that a production model's prediction distribution has shifted significantly from the training distribution?

Question 12 (hard, multiple choice)

You are the ML engineer for a financial services company. You have deployed a fraud detection model on Vertex AI Endpoints using a custom container. The model is a gradient boosting model trained on transactional data. Over the past week, the model's precision has dropped from 95% to 80%, while recall has remained stable. The input data volume and distribution have not changed significantly. The model is served on a single endpoint with autoscaling enabled (min replicas=2, max replicas=10). You notice that the average CPU utilization of the serving containers has increased from 40% to 90%, and the p99 latency has increased from 50ms to 200ms. The model is retrained weekly using the latest data, and the last retraining was 3 days ago. The logs show no errors, and the model version is unchanged. Given these symptoms, what is the most likely cause of the precision drop?
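
Precision and recall respond to different error types, which is why one can fall while the other holds steady. The sketch below uses hypothetical confusion-matrix counts: a rise in false positives lowers precision, TP / (TP + FP), while recall, TP / (TP + FN), barely moves.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical weekly confusion counts for the fraud model.
before = precision_recall(tp=95, fp=5, fn=20)
after = precision_recall(tp=96, fp=24, fn=20)  # many more false positives, similar misses
print(f"before: precision={before[0]:.2f} recall={before[1]:.2f}")
print(f"after:  precision={after[0]:.2f} recall={after[1]:.2f}")
```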

Question 13 (easy, multiple choice)

A data science team deploys a regression model to predict house prices. After one month, the mean absolute error (MAE) on the serving data increases by 20% compared to the test set. Which monitoring strategy should the team implement first to diagnose the issue?
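
Measuring MAE on serving data requires joining predictions with ground-truth labels once they become available. Assuming those labels exist, the relative increase over the test-set MAE can be computed as in this sketch. All prices (in thousands) are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical prices in thousands: held-out test rows, and serving rows joined with late-arriving labels.
test_mae = mean_absolute_error([300, 450, 500], [310, 440, 520])
serving_mae = mean_absolute_error([320, 410, 600], [332, 398, 624])
relative_increase = (serving_mae - test_mae) / test_mae
print(f"test MAE={test_mae:.2f}, serving MAE={serving_mae:.2f}, increase={relative_increase:.0%}")
```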

Question 14 (medium, multiple choice)

An e-commerce company uses a recommendation model deployed on Vertex AI Endpoints. The model's latency increases gradually over two weeks, causing timeouts. The model is served using a custom container. What is the most likely root cause and corrective action?

Question 15 (hard, multiple choice)

A financial services firm deploys a binary classification model for fraud detection. The model's precision is 0.95 and recall is 0.60 on the test set. After deployment, the fraud rate in production is 0.5% compared to 5% in the test set. The model shows good calibration on the test set (Brier score 0.02) but poor calibration in production (Brier score 0.15). What is the most likely explanation for the calibration degradation?
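
The Brier score is the mean squared difference between predicted probabilities and observed 0/1 outcomes, so lower is better. The sketch below computes it alongside a simple calibration-in-the-large check: the mean predicted probability minus the observed positive rate. The scored transactions are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def calibration_gap(probs, labels):
    """Mean predicted probability minus observed positive rate (0 means calibrated on average)."""
    return sum(probs) / len(probs) - sum(labels) / len(labels)

# Hypothetical scored transactions (1 = fraud).
probs = [0.9, 0.1, 0.8, 0.2]
labels = [1, 0, 1, 0]
print(round(brier_score(probs, labels), 4), round(calibration_gap(probs, labels), 4))
```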

Question 16 (medium, multiple choice)

A company implements an ML pipeline using Vertex AI Pipelines. The pipeline trains a model using custom training jobs and then deploys it to an endpoint. The team notices that the endpoint occasionally serves an older model version for a few minutes after a new pipeline run completes. What is the most likely cause?

Question 17 (easy, multi-select)

A team has deployed a model on Vertex AI Prediction and wants to monitor for data drift. Which TWO metrics should they use to detect drift in numerical features?
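
For a numerical feature, one drift statistic is the Jensen-Shannon divergence between the binned training and serving histograms. To our understanding, this is the statistic the Vertex AI Model Monitoring documentation describes for numerical features, but the code below is an independent sketch, not its implementation. The histograms are hypothetical.

```python
import math

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0 because b is the mixture.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical histograms of one numerical feature (same bin edges for both).
train_hist = normalize([10, 40, 40, 10])
serve_hist = normalize([5, 25, 45, 25])
print(round(js_divergence(train_hist, serve_hist), 4))
```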

Question 18 (hard, multi-select)

A company uses Vertex AI Model Monitoring to detect training-serving skew. They have a categorical feature 'product_category' with high cardinality. The monitoring job alerts for skew, but the data scientists believe the model performance is still acceptable. Which THREE actions should the team take to investigate and resolve the alert?
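
For categorical features, a simple skew statistic is the L-infinity distance: the largest absolute difference in any single category's share between training and serving data. Reporting which category produces that maximum helps with high-cardinality features, where one new or rare category can trigger an alert on its own. This is an independent sketch with hypothetical data, not Vertex AI's implementation.

```python
from collections import Counter

def category_distribution(values):
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}

def l_infinity(p, q):
    """Largest absolute difference in category share, plus the category responsible."""
    categories = set(p) | set(q)  # a category missing on one side counts as share 0
    worst = max(categories, key=lambda c: abs(p.get(c, 0.0) - q.get(c, 0.0)))
    return abs(p.get(worst, 0.0) - q.get(worst, 0.0)), worst

# Hypothetical product_category values from training data and from recent serving traffic.
training = category_distribution(["books"] * 50 + ["toys"] * 30 + ["garden"] * 20)
serving = category_distribution(["books"] * 40 + ["toys"] * 30 + ["electronics"] * 30)
distance, category = l_infinity(training, serving)
print(f"L-infinity distance={distance:.2f}, driven by {category!r}")
```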

Question 19 (medium, multiple choice)

You are an ML engineer at a logistics company. You have deployed a deep learning model on Vertex AI Endpoints using a custom container with GPU acceleration. The model predicts delivery times based on route features. After one week, you notice that the endpoint's GPU utilization is consistently at 10%, but the prediction latency has increased by 50%. The number of prediction requests per second has remained stable. You check the container logs and see no errors. The model is served using TensorFlow Serving with batching enabled (batch size: 32, batch timeout: 100ms). The custom container uses a single NVIDIA T4 GPU. You have also set the Vertex AI endpoint to use autoscaling with minReplicaCount: 1 and maxReplicaCount: 5, and the CPU utilization target is 60%. Which action should you take to reduce latency?

Question 20 (medium, multiple choice)

A company deploys a custom ML model on Vertex AI to predict customer churn. The model retrains weekly, and predictions are served via a Vertex AI endpoint. After a recent retraining, the monitoring dashboard shows a sudden increase in prediction requests but a decrease in predicted churn probabilities. The model's accuracy on the validation set remains stable. What is the most likely cause of the observed behavior?

Question 21 (hard, multi-select)

A financial services company has deployed a classification model on Vertex AI to detect fraudulent transactions. The model is monitored using Vertex AI Model Monitoring for skew and drift detection, and also logs predictions to BigQuery for analysis. After a month, the monitoring alerts show a significant drift in one feature (transaction_amount). Which TWO actions should the team take to diagnose and address this issue?

Question 22 (medium, ordering)

Arrange the steps to set up a distributed training job on Vertex AI using a custom container in the correct order.

Question 23 (medium, matching)

Match each Google Cloud storage option to its best use case.

Unstructured object storage for any type of data

NoSQL wide-column database for low-latency, high-throughput workloads

Serverless data warehouse for analytics at scale

Relational database for OLTP workloads

NoSQL document database for mobile/web apps

Question 24 (medium, multiple choice)

A company deploys a classification model on Vertex AI for loan approval. After a month, they notice the precision has dropped significantly. What should they do first?

Question 25 (hard, multiple choice)

A team uses custom training and deploys a TensorFlow model using Vertex AI Endpoints. They set up Cloud Monitoring alerts for online prediction latency. However, they notice the latency metric shows a spike every hour, but the actual user experience is fine. What could be the cause?

Question 26 (easy, multiple choice)

A machine learning engineer wants to monitor model performance on Vertex AI for a regression model. Which metric is most appropriate to track the average prediction error?

Question 27 (medium, multiple choice)

A company uses Vertex AI Model Monitoring to detect data drift. They have a model that predicts house prices. Which dataset should they compare against the training data to detect drift?

Question 28 (hard, multiple choice)

After setting up model monitoring on Vertex AI for a classification model, the engineer sees a high number of anomaly alerts for the "age" feature. Upon investigation, the age distribution in recent predictions is similar to training data. What might be the cause?

Question 29 (easy, multiple choice)

A data scientist wants to log prediction inputs and outputs for model monitoring. Which Google Cloud service is best suited for this?

Question 30 (medium, multiple choice)

A team deploys a model using Vertex AI and wants to monitor for concept drift. What should they track?

Question 31 (hard, multiple choice)

A company uses a custom container on Vertex AI Prediction. They want to send custom metrics from their prediction container to Cloud Monitoring. Which method should they use?

Question 32 (easy, multiple choice)

A model deployed on Vertex AI Endpoints shows increasing prediction latency. What is the most scalable way to reduce latency?

Question 33 (medium, multi-select)

A company uses Vertex AI Model Monitoring. Which two configuration options can be set to reduce false positive drift alerts?

Question 34 (hard, multi-select)

A team is monitoring a batch prediction job on Vertex AI. Which two metrics should they monitor to ensure the job completes successfully without errors?

Question 35 (medium, multi-select)

A company wants to set up end-to-end monitoring for a Vertex AI model. Which three components should they include?

Question 36 (medium, multiple choice)

Refer to the exhibit. What is the purpose of this query?

Exhibit

resource.type="ml_job"
jsonPayload.@type="type.googleapis.com/google.cloud.ml.v1.PredictionError"
severity=ERROR

Question 37 (hard, multiple choice)

Refer to the exhibit. An engineer notices no drift alerts but the model performance has degraded. What is the likely cause?

Exhibit

modelMonitoringConfig:
  objectiveConfig:
    detectionConfig:
      driftThresholds:
        age: 0.3
        income: 0.1
      skewThresholds:
        age: 0.2
        income: 0.05
  featureAttributionConfig:
    enabled: True

Question 38 (easy, multiple choice)

Refer to the exhibit. What does this query return?

Exhibit

fetch ml.googleapis.com/prediction_latencies
| filter resource.model_id = "my_model"
| every 1m
| mean
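
As a rough mental model of the exhibit, `filter` narrows the data to one model, `every 1m` sets a one-minute period, and `mean` reduces the matching data to an average for each period. This standard-library sketch mimics that with hypothetical scalar samples. Real MQL operates on time series (and, for distribution-valued latency metrics, on distributions), so treat it as an analogy only.

```python
from collections import defaultdict
from statistics import mean

def align_mean(points, period_s=60):
    """Bucket (unix_seconds, value) points into fixed windows and average each window."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % period_s].append(value)  # key is the window start time
    return {start: mean(values) for start, values in sorted(buckets.items())}

# Hypothetical latency samples (ms) for one model.
samples = [(0, 100), (30, 300), (65, 50), (119, 150), (120, 80)]
print(align_mean(samples))
```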

Question 39 (medium, multiple choice)

Your team has a production ML model on Vertex AI that shows a gradual decline in accuracy over the past week. The model is retrained weekly using the latest data. Which monitoring approach should you implement to detect the issue earlier?

Question 40 (easy, multiple choice)

Your company deploys batch prediction jobs using Vertex AI Batch Prediction. You need to monitor the jobs for failures and performance. What is the recommended approach?

Question 41 (hard, multiple choice)

A real-time recommendation model deployed on Vertex AI Endpoints is experiencing increased latency, especially during peak hours. The model is hosted on a single machine with 4 CPUs. Which set of actions should you take to diagnose and resolve the issue?

Question 42 (medium, multiple choice)

Your organization has a requirement to monitor fairness of an ML model that predicts loan approvals. You need to set up alerts if the model's predictions show bias against a protected group. Which tool on Google Cloud can you use to monitor this?

Question 43 (easy, multiple choice)

A data scientist trained a model on historical data from 2020-2022 and deployed it in January 2023. In February 2023, the model's accuracy drops significantly. Which monitoring metric would most likely indicate the root cause?

Question 44 (hard, multiple choice)

You have a model that predicts equipment failure. The model is retrained every week with new data. You notice that the model's precision is stable but recall drops suddenly. Which monitoring strategy would best help you understand the cause?

Question 45 (medium, multiple choice)

Your ML pipeline uses Vertex AI Feature Store to serve features for online predictions. You need to monitor the freshness of features in the online store. Which approach is most effective?
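
Feature freshness reduces to comparing each feature's last-update time with the current time and flagging anything older than an agreed maximum age. This sketch assumes the last-update timestamps have already been collected. How they are read from the feature store is not shown, and the feature names and 24-hour limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def stale_features(last_updated, now, max_age):
    """Return names of features whose most recent update is older than max_age."""
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)

now = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
# Hypothetical last-update timestamps per feature.
last_updated = {
    "avg_order_value_7d": now - timedelta(minutes=30),
    "days_since_last_login": now - timedelta(hours=26),
    "txn_count_24h": now - timedelta(hours=2),
}
print(stale_features(last_updated, now, max_age=timedelta(hours=24)))
```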

Question 46 (easy, multiple choice)

You have deployed a text classification model using Vertex AI Endpoints. The model is performing well, but the operations team wants to be alerted if the endpoint returns an excessive number of HTTP 503 errors. What is the simplest way to achieve this?
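
An "excessive number" of 503s is often easier to reason about as a share of all responses than as a raw count, since a fixed count means different things at different traffic levels. This sketch only shows the arithmetic, with hypothetical status codes and a hypothetical 1% threshold. It is not meant as a deployment approach.

```python
def error_ratio(status_codes, code=503):
    """Fraction of responses in a window that returned the given HTTP status code."""
    if not status_codes:
        return 0.0
    return sum(1 for c in status_codes if c == code) / len(status_codes)

# Hypothetical one-minute window of endpoint response codes.
window = [200] * 97 + [503] * 3
ALERT_RATIO = 0.01  # hypothetical: alert if more than 1% of responses are 503s
print(error_ratio(window), error_ratio(window) > ALERT_RATIO)
```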

Question 47 (hard, multiple choice)

A recommendation system model is updated daily via a retraining pipeline. After each update, the online prediction latency increases significantly for about 30 minutes before returning to normal. What is the most likely cause and solution?

Question 48 (medium, multi-select)

Your team manages multiple ML models on Vertex AI. You need to implement a centralized monitoring solution to track model performance over time. Which TWO approaches should you consider? (Choose two.)

Question 49 (hard, multi-select)

You are monitoring a production model that is experiencing gradual decay in AUC. Which THREE metrics should you set up alerts for to diagnose the root cause? (Choose three.)

Question 50 (easy, multi-select)

Your team deploys a model using Vertex AI Endpoints with autoscaling. Which TWO metrics are most important to monitor in order to optimize cost and performance? (Choose two.)

Question 51 (easy, multiple choice)

A data science team has deployed a model on Vertex AI and wants to automatically detect when the distribution of a specific feature shifts significantly from the training data. Which service should they use?

Question 52 (medium, multiple choice)

A machine learning engineer notices that the online prediction latency for a custom TensorFlow model deployed on Vertex AI has increased significantly over the past week. Cloud Monitoring shows that the CPU utilization of the endpoints remains below 40%, but the number of concurrent requests has doubled. What is the most likely cause of the latency increase?

Question 53 (hard, multiple choice)

A large enterprise has multiple ML models deployed in production across different regions. They want to implement a centralized monitoring dashboard that tracks key performance indicators such as prediction accuracy, latency, and error rates for all models, with the ability to drill down into individual model versions. Which approach best meets these requirements?

Question 54 (easy, multiple choice)

An ML team is using Vertex AI Pipelines to run automated retraining workflows. They want to monitor pipeline execution and receive alerts when a pipeline run fails. Which Google Cloud service should they use to set up such alerts?

Question 55 (medium, multiple choice)

A company has deployed a model that predicts customer churn. The model's performance, as measured by AUC, has been declining over the past month. The team suspects data drift. They have enabled Vertex AI Model Monitoring, but no alerts have been triggered. What is a possible reason for the lack of alerts?

Question 56 (hard, multiple choice)

A team is monitoring a production ML system that includes multiple models and data processing pipelines. They want to set up a comprehensive alerting strategy that minimizes false positives while ensuring critical issues are promptly addressed. Which approach is the most effective?

Question 57 (easy, multiple choice)

A machine learning model deployed on Vertex AI is returning erroneous predictions. The team needs to investigate the root cause by examining the prediction request and response details. Which Google Cloud tool is best suited for this?

Question 58 (medium, multiple choice)

A team is using Vertex AI Feature Store to manage features for training and serving. They want to monitor the freshness of the features (i.e., how recently each feature was updated). Which approach should they take?

Question 59 (hard, multiple choice)

A company has deployed a machine learning model that uses a large input tensor. They notice that the prediction latency varies significantly between requests of the same size. Cloud Monitoring shows that the serving endpoint's CPU utilization is consistently below 50%, but memory utilization fluctuates between 70% and 95%. What is the most likely cause?

Question 60 (easy, multi-select)

A team is deploying a new model version. They want to ensure that they can quickly roll back if the new version performs poorly in production. Which TWO actions should they take? (Choose 2.)

Question 61 (medium, multi-select)

A team is responsible for monitoring the health of a Vertex AI pipeline that runs daily. Which THREE resources should they use to gain visibility into pipeline performance and failures? (Choose 3.)

Question 62 (hard, multi-select)

A financial institution uses a machine learning model to approve loans. They must monitor for fairness and bias. Which THREE Google Cloud tools or features can help them achieve this? (Choose 3.)

Question 63 (easy, multiple choice)

Refer to the exhibit. A Vertex AI prediction endpoint is failing with a deadline exceeded error. The log shows the following. What is the most likely cause?

Exhibit

{
  "insertId": "abc123",
  "textPayload": "Prediction request failed with deadline exceeded",
  "severity": "ERROR",
  "resource": {
    "type": "ml_model_version",
    "labels": {
      "model": "my_model",
      "version": "v2",
      "region": "us-central1"
    }
  },
  "jsonPayload": {
    "prediction_latency_ms": 8500,
    "error": "deadline_exceeded",
    "machine_type": "n1-standard-2",
    "cpu_utilization": 0.95,
    "memory_utilization": 0.9
  }
}

Question 64 (medium, multiple choice)

Refer to the exhibit. A team configured Vertex AI Model Monitoring with skew detection for feature "income" with a threshold of 0.2. However, they have not received any alerts even though they suspect data drift. What is the most likely reason?

Exhibit

modelMonitoringConfig:
  skewDetection:
    defaultThreshold: 0.3
    featureThresholds:
      income: 0.2
  driftDetection:
    defaultThreshold: 0.5
  samplingRate: 0.5
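
Reading the exhibit as a lookup table makes it clearer which threshold applies to which check. This sketch assumes, from the field names alone, that a per-feature value overrides its section's default and that skew detection and drift detection are configured independently. It models the exhibit's structure, not the Vertex AI API.

```python
# Threshold values copied from the exhibit; the override rule below is an assumption.
config = {
    "skewDetection": {"defaultThreshold": 0.3, "featureThresholds": {"income": 0.2}},
    "driftDetection": {"defaultThreshold": 0.5},
}

def effective_threshold(config, detection, feature):
    """Per-feature threshold if one is set for this detection type, else the section default."""
    section = config[detection]
    return section.get("featureThresholds", {}).get(feature, section["defaultThreshold"])

for detection in ("skewDetection", "driftDetection"):
    print(detection, "income ->", effective_threshold(config, detection, "income"))
```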

Question 65 (hard, multiple choice)

Refer to the exhibit. An alert policy is configured to trigger when prediction latency exceeds 500 ms for 5 consecutive minutes. The team is experiencing many false positive alerts during brief latency spikes. Which adjustment would most effectively reduce false positives while still detecting prolonged latency issues?

Exhibit

{
  "name": "projects/123/alertPolicies/456",
  "displayName": "High Latency",
  "conditions": [
    {
      "displayName": "Latency > 500ms",
      "conditionThreshold": {
        "filter": "metric.type=\"vertexai.googleapis.com/prediction/latency\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 500,
        "duration": "300s"
      }
    }
  ],
  "combiner": "OR"
}
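
A duration setting only suppresses short spikes when every point in the window must breach the threshold, and what counts as a "point" depends on how the series is aligned and aggregated. The sketch below is a simplified model of duration semantics using hypothetical one-minute values. It is not how Cloud Monitoring evaluates conditions internally.

```python
def condition_fires(points, threshold, required_points):
    """Simplified duration check: fire only when `required_points` consecutive
    aligned points are all above `threshold`."""
    run = 0
    for value in points:
        run = run + 1 if value > threshold else 0
        if run >= required_points:
            return True
    return False

# Hypothetical one-minute aligned latency values (ms); "duration": "300s" is modeled as 5 points.
brief_spike = [220, 910, 950, 260, 230, 240]
sustained = [620, 640, 700, 680, 650, 660]
print(condition_fires(brief_spike, 500, 5), condition_fires(sustained, 500, 5))
```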

Question 66 (medium, multiple choice)

A company deploys a batch prediction job on Vertex AI using a custom container. The job completes successfully, but the predictions are later found to be inaccurate. The ML engineer wants to set up monitoring to detect similar issues proactively. Which approach should the engineer take?

Question 67 (easy, multiple choice)

An ML team is using Vertex AI Online Prediction and wants to receive alerts when the 99th percentile latency exceeds 500ms for more than 5 minutes. What is the best practice to set up this alert in Cloud Monitoring?

Question 68 (hard, multiple choice)

An e-commerce company uses a Vertex AI endpoint for product recommendations. Recently, the click-through rate (CTR) dropped significantly. Model monitoring shows no significant data drift or skew. Logs show increased latency but no errors. Which technique should the engineer use to diagnose the issue?

Question 69 (medium, multiple choice)

A data science team uses TFX to train and deploy a model on Vertex AI. They want automated monitoring for pipeline health. Which set of metrics should they monitor to quickly detect issues in the training pipeline?

Question 70 (easy, multiple choice)

An ML engineer is monitoring a Vertex AI Feature Store used for online serving. Which metrics are most important to track for ensuring low-latency online serving?

Question 71 (hard, multiple choice)

A company uses Vertex AI Predictions with a custom container that invokes an external API for feature enrichment. The prediction response time is highly variable. The engineer wants to monitor the external API's contribution to latency. What should the engineer do?
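
One way to isolate the external API's share of latency is to time each stage of the request handler separately. This is a minimal standard-library sketch: `call_enrichment_api` and the inference placeholder are hypothetical stand-ins, and exporting the recorded durations (for example as custom metrics or trace spans) is left out.

```python
import time
from contextlib import contextmanager

stage_durations = {}  # stage name -> list of durations in seconds

@contextmanager
def timed(stage):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_durations.setdefault(stage, []).append(time.perf_counter() - start)

def call_enrichment_api(features):
    # Hypothetical stand-in for the external feature-enrichment call.
    time.sleep(0.02)
    return {**features, "enriched": True}

def predict(features):
    with timed("external_api"):
        enriched = call_enrichment_api(features)
    with timed("model_inference"):
        score = 0.5 if enriched["enriched"] else 0.0  # hypothetical placeholder for the model call
    return score

predict({"user_id": 42})
print({stage: round(max(d) * 1000, 1) for stage, d in stage_durations.items()})
```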

Question 72 (medium, multiple choice)

An MLOps team wants to set up alerts for GPU memory utilization on Vertex AI Training jobs. Which approach is most efficient?

Question 73 (easy, multiple choice)

A company deploys an online prediction model serving 100 requests per second. They are optimizing for both latency and throughput. Which monitoring strategy should they use?

Question 74 (medium, multi-select)

A data science team uses Vertex AI Model Monitoring to detect data quality issues in a production model. Which TWO metrics should they enable to identify problems with missing values in predictions? (Select TWO.)
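
Missing values appear in logged requests in two forms: a key that is present with a null value, and a key that is absent altogether. This sketch, using hypothetical request logs, counts both as missing and reports a per-feature missing rate. It does not describe Vertex AI Model Monitoring's built-in statistics.

```python
def missing_rates(instances, features):
    """Share of instances where each feature is absent or None."""
    n = len(instances)
    return {f: sum(1 for row in instances if row.get(f) is None) / n for f in features}

# Hypothetical logged prediction requests.
requests = [
    {"age": 30, "income": None},
    {"age": None, "income": 50000},
    {"age": 41},
    {"age": 29, "income": 72000},
]
print(missing_rates(requests, ["age", "income"]))
```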

Question 75 (hard, multi-select)

An ML engineer is building a monitoring dashboard for a Vertex AI pipeline that includes training, evaluation, and batch prediction. Which THREE components should be included to provide comprehensive observability? (Select THREE.)

Question 76 (easy, multi-select)

An ML team wants to monitor their recommendation model for fairness. Which TWO metrics should they track to detect potential bias? (Select TWO.)
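
Two widely used group-fairness measures are the gap in selection rate between groups (demographic parity) and the gap in true positive rate (equal opportunity). This sketch computes both from hypothetical predictions, labels, and group assignments. It treats a recommendation as a binary "shown / not shown" decision, which is a simplifying assumption, and it assumes every group has at least one positive label.

```python
def selection_rate(preds):
    return sum(preds) / len(preds)

def true_positive_rate(preds, labels):
    # Assumes the group has at least one positive label.
    hits = [p for p, y in zip(preds, labels) if y == 1]
    return sum(hits) / len(hits)

def fairness_gaps(preds, labels, groups):
    """Return (selection-rate gap, true-positive-rate gap) across groups."""
    by_group = {}
    for p, y, g in zip(preds, labels, groups):
        ps, ys = by_group.setdefault(g, ([], []))
        ps.append(p)
        ys.append(y)
    selection = [selection_rate(ps) for ps, _ in by_group.values()]
    tpr = [true_positive_rate(ps, ys) for ps, ys in by_group.values()]
    return max(selection) - min(selection), max(tpr) - min(tpr)

# Hypothetical binary "recommended" decisions, relevance labels, and group membership.
preds = [1, 1, 0, 1, 0, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 1, 0, 0]
groups = ["A"] * 4 + ["B"] * 4
print(fairness_gaps(preds, labels, groups))
```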

Question 77 (hard, multiple choice)

A global retailer has deployed a real-time product recommendation model on Vertex AI Endpoints. The model is a large neural network that runs on a single node with 8 vCPUs and 30 GB memory. Over the past week, the p99 latency has increased from 200ms to 2 seconds, and the error rate has risen to 5%. Cloud Monitoring shows that the endpoint's CPU utilization is consistently near 100%, and memory is at 80%. The ML engineer suspects the model is too large for the node, but model size has not changed. Logs show no increase in request volume (steady at 50 QPS). There are no recent model updates. The engineer has tried to increase the node to 16 vCPUs, but latency decreased only slightly. What is the most likely root cause and the best first step to resolve it?

Question 78 (medium, multiple choice)

A financial services company uses a custom container to serve a fraud detection model on Vertex AI Endpoints. The model requires a feature store lookup for each prediction. Recently, the feature store (Cloud Bigtable) experienced a brief outage, causing some predictions to fail. After the outage resolved, the endpoint's CPU utilization dropped significantly, and prediction latency improved. However, the model's false positive rate increased sharply. The ML engineer suspects the model is using stale features because the feature store outage caused missing lookups. Cloud Monitoring for the endpoint shows no errors after the outage, but the number of feature store read requests per prediction decreased by 30%. Which metric should the engineer use to confirm the hypothesis of stale features?

Question 79 (easy, multiple choice)

A startup is deploying its first machine learning model using BigQuery ML. The model is a logistic regression for churn prediction, trained on a dataset of 5 million rows. The pipeline runs every week: it exports training data from BigQuery, trains a model using BigQuery ML, and then deploys the model as a remote model for predictions. The ML engineer wants to set up basic monitoring to ensure the pipeline runs successfully and the model quality does not degrade. Which monitoring approach should the engineer implement first?

Question 80 (medium, multi-select)

A machine learning engineer is monitoring a deployed churn prediction model that has shown a gradual decline in accuracy over the past month. The engineer wants to diagnose the root cause of the performance degradation. Which TWO actions should the engineer take? (Choose two.)

Question 81 (easy, multiple choice)

A retail company has deployed a machine learning model using Vertex AI Endpoints to predict inventory demand. The model was trained on data from the past two years and has been in production for six months. The team has enabled Vertex AI Model Monitoring to track prediction drift with an alert threshold of 0.2. Last week, they received an alert that the prediction drift score reached 0.35, exceeding the threshold. The engineer checks the monitoring dashboard and sees that the distribution of predictions has shifted noticeably compared to the training data. The engineer also notices that the model's accuracy metrics, computed from weekly ground truth data, have remained within acceptable range. What should the engineer do first?

Question 82 (medium, multiple choice)

A financial services company uses a custom deep learning model on Vertex AI to automatically approve or reject credit card transactions. The model is explainable using Vertex Explainable AI, and the company monitors feature attribution drift with thresholds defined per feature. Last week, the monitoring system flagged that the mean absolute attribution score for the 'transaction_amount' feature increased from 0.35 to 0.55. The overall model accuracy, measured on a daily batch of labeled transactions, has remained around 97%. The operations team is concerned about potential compliance issues due to changing model behavior. What should the data scientist do?

Question 83 (hard, multiple choice)

A travel booking company has a real-time recommendation system that suggests hotels and flights to users. The model is served using TensorFlow Serving on a Google Kubernetes Engine (GKE) cluster with auto-scaling enabled. The cluster uses n1-standard-4 machine types. The team has set up Cloud Monitoring dashboards and alerts. Last week, during a major holiday promotion, the team noticed that the model's inference latency P99 increased from 150 ms to 450 ms over a 30-minute period, while the request throughput increased from 500 to 1,200 requests per second. CPU utilization across the cluster rose to 95%, but memory utilization remained at 60%. The model version and the serving infrastructure configuration have not changed since the last deployment. Which action should the team take to mitigate the latency issue?
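
Little's law (average concurrency = throughput x average latency) gives a quick sense of how much more in-flight work the cluster absorbed during the promotion. The sketch plugs in the scenario's throughput and P99 figures. Because P99 stands in for mean latency here, the results overstate average concurrency and are only a rough indicator.

```python
def in_flight_requests(throughput_rps, latency_s):
    # Little's law: average number in the system L = arrival rate (lambda) x time in system (W).
    return throughput_rps * latency_s

before = in_flight_requests(500, 0.150)
after = in_flight_requests(1200, 0.450)
print(f"before ~{before:.0f}, after ~{after:.0f}, ratio ~{after / before:.1f}x")
```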

Question 84 (medium, multi-select)

A financial services company has deployed a credit risk ML model on Vertex AI. They want to monitor the model for fairness across demographic groups to ensure no biased outcomes. Which TWO actions should they take as best practices? (Choose TWO.)

Question 85 (easy, multiple choice)

Refer to the exhibit. A data scientist notices that predictions from a deployed model are taking longer than expected. Which Cloud Monitoring metric should be inspected first to identify the bottleneck?

Exhibit

```
{
  "insertId": "abc123",
  "jsonPayload": {
    "predictions": [0.98, 0.12],
    "modelVersionId": "1",
    "latencyMs": 450,
    "region": "us-central1"
  },
  "resource": {
    "type": "vertex_ai_endpoint",
    "labels": {
      "endpoint_id": "1234",
      "model_id": "model-xyz"
    }
  },
  "severity": "INFO",
  "timestamp": "2024-03-15T10:30:00Z"
}
```

Question 86 (hard, multiple choice)

A retail company deployed a demand forecasting model using TensorFlow on Vertex AI Batch Prediction. The model runs weekly on a large dataset stored in BigQuery. Over the past month, the prediction accuracy has degraded significantly. The ML engineer reviews the monitoring dashboard and sees that the feature distribution for 'product_price' has shifted from a mean of $50 to $55, and the new product category 'electronics' now represents 20% of the data, whereas it was only 5% in training. The model was never retrained after initial deployment six months ago. The engineer also notices that the Vertex Explainable AI feature importance scores have changed: 'product_price' used to be the top feature (importance 0.35) but now ranks third (importance 0.20). The company requires minimal downtime and wants to improve accuracy as quickly as possible without incurring high costs from excessive retraining. Which course of action should the ML engineer take?
