Practice PMLE Monitoring ML solutions questions with full explanations on every answer.
1. You have deployed a regression model that predicts house prices. Over the past month, the model's predictions have been consistently too high. You suspect data drift in the input features. Which monitoring metric should you prioritize to confirm this?
2. Your team has deployed a text classification model on Vertex AI Endpoints. You notice that the model's latency has increased significantly over the last week, but the request rate has remained stable. Which of the following is the most likely cause?
3. You are monitoring a classification model that predicts loan default. The model was trained on data from 2020-2022. In 2023, the economic conditions changed, and the model's accuracy dropped significantly. Which monitoring approach would best help you detect this issue early?
4. You are responsible for monitoring a batch prediction pipeline that runs daily. Recently, the pipeline started failing intermittently with out-of-memory errors. The input data volume has not changed. What is the most likely cause?
5. You need to set up monitoring for a Vertex AI model that serves predictions in real-time. The model is expected to have a latency SLA of under 100ms. Which metric should you configure an alert on to ensure the SLA is met?
6. Your company uses a custom container for model serving on Vertex AI. After a recent update, the model returns predictions but they are clearly wrong (e.g., negative probabilities for a classification model). The logs show no errors. What is the most likely cause?
7. You are monitoring a machine learning pipeline that runs on Vertex AI Pipelines. The pipeline occasionally fails with a 'ResourceExhausted' error when attempting to read data from BigQuery. Which action should you take to resolve this issue?
8. You have an online prediction model that is showing increasing prediction latency. You have already verified that the request rate and input data size are unchanged. Which of the following should you investigate next?
9. Which TWO metrics should you monitor to detect data drift in a batch prediction pipeline?
10. Which THREE components should you include in a comprehensive model monitoring dashboard for a production ML system?
11. Which TWO actions are appropriate when you detect that a production model's prediction distribution has shifted significantly from the training distribution?
12. You are the ML engineer for a financial services company. You have deployed a fraud detection model on Vertex AI Endpoints using a custom container. The model is a gradient boosting model trained on transactional data. Over the past week, the model's precision has dropped from 95% to 80%, while recall has remained stable. The input data volume and distribution have not changed significantly. The model is served on a single endpoint with autoscaling enabled (min replicas=2, max replicas=10). You notice that the average CPU utilization of the serving containers has increased from 40% to 90%, and the p99 latency has increased from 50ms to 200ms. The model is retrained weekly using the latest data, and the last retraining was 3 days ago. The logs show no errors, and the model version is unchanged. Given these symptoms, what is the most likely cause of the precision drop?
13. A data science team deploys a regression model to predict house prices. After one month, the mean absolute error (MAE) on the serving data increases by 20% compared to the test set. Which monitoring strategy should the team implement first to diagnose the issue?
14. An e-commerce company uses a recommendation model deployed on Vertex AI Endpoints. The model's latency increases gradually over two weeks, causing timeouts. The model is served using a custom container. What is the most likely root cause and corrective action?
15. A financial services firm deploys a binary classification model for fraud detection. The model's precision is 0.95 and recall is 0.60 on the test set. After deployment, the fraud rate in production is 0.5% compared to 5% in the test set. The model shows good calibration on the test set (Brier score 0.02) but poor calibration in production (Brier score 0.15). What is the most likely explanation for the calibration degradation?
16. A company implements an ML pipeline using Vertex AI Pipelines. The pipeline trains a model using custom training jobs and then deploys it to an endpoint. The team notices that the endpoint occasionally serves an older model version for a few minutes after a new pipeline run completes. What is the most likely cause?
17. A team has deployed a model on Vertex AI Prediction and wants to monitor for data drift. Which TWO metrics should they use to detect drift in numerical features?
18. A company uses Vertex AI Model Monitoring to detect training-serving skew. They have a categorical feature 'product_category' with high cardinality. The monitoring job alerts for skew, but the data scientists believe the model performance is still acceptable. Which THREE actions should the team take to investigate and resolve the alert?
19. You are an ML engineer at a logistics company. You have deployed a deep learning model on Vertex AI Endpoints using a custom container with GPU acceleration. The model predicts delivery times based on route features. After one week, you notice that the endpoint's GPU utilization is consistently at 10%, but the prediction latency has increased by 50%. The number of prediction requests per second has remained stable. You check the container logs and see no errors. The model is served using TensorFlow Serving with batching enabled (batch size: 32, batch timeout: 100ms). The custom container uses a single NVIDIA T4 GPU. You have also set the Vertex AI endpoint to use autoscaling with minReplicaCount: 1 and maxReplicaCount: 5, and the CPU utilization target is 60%. Which action should you take to reduce latency?
20. A company deploys a custom ML model on Vertex AI to predict customer churn. The model retrains weekly, and predictions are served via a Vertex AI endpoint. After a recent retraining, the monitoring dashboard shows a sudden increase in prediction requests but a decrease in predicted churn probabilities. The model's accuracy on the validation set remains stable. What is the most likely cause of the observed behavior?
21. A financial services company has deployed a classification model on Vertex AI to detect fraudulent transactions. The model is monitored using Vertex AI Model Monitoring for skew and drift detection, and also logs predictions to BigQuery for analysis. After a month, the monitoring alerts show a significant drift in one feature (transaction_amount). Which TWO actions should the team take to diagnose and address this issue?
22. Drag and drop the steps to set up a distributed training job on Vertex AI using a custom container in the correct order.
23. Match each Google Cloud storage option to its best use case.
24. A company deploys a classification model on Vertex AI for loan approval. After a month, they notice the precision has dropped significantly. What should they do first?
25. A team uses custom training and deploys a TensorFlow model using Vertex AI Endpoints. They set up Cloud Monitoring alerts for online prediction latency. However, they notice the latency metric shows a spike every hour, but the actual user experience is fine. What could be the cause?
26. A machine learning engineer wants to monitor model performance on Vertex AI for a regression model. Which metric is most appropriate to track the average prediction error?
27. A company uses Vertex AI Model Monitoring to detect data drift. They have a model that predicts house prices. Which dataset should they compare against the training data to detect drift?
28. After setting up model monitoring on Vertex AI for a classification model, the engineer sees a high number of anomaly alerts for the "age" feature. Upon investigation, the age distribution in recent predictions is similar to training data. What might be the cause?
29. A data scientist wants to log prediction inputs and outputs for model monitoring. Which Google Cloud service is best suited for this?
30. A team deploys a model using Vertex AI and wants to monitor for concept drift. What should they track?
31. A company uses a custom container on Vertex AI Prediction. They want to send custom metrics from their prediction container to Cloud Monitoring. Which method should they use?
32. A model deployed on Vertex AI Endpoints shows increasing prediction latency. What is the most scalable way to reduce latency?
33. A company uses Vertex AI Model Monitoring. Which two configuration options can be set to reduce false positive drift alerts?
34. A team is monitoring a batch prediction job on Vertex AI. Which two metrics should they monitor to ensure the job completes successfully without errors?
35. A company wants to set up end-to-end monitoring for a Vertex AI model. Which three components should they include?
36. Refer to the exhibit. What is the purpose of this query?
37. Refer to the exhibit. An engineer notices no drift alerts but the model performance has degraded. What is the likely cause?
38. Refer to the exhibit. What does this query return?
39. Your team has a production ML model on Vertex AI that shows a gradual decline in accuracy over the past week. The model is retrained weekly using the latest data. Which monitoring approach should you implement to detect the issue earlier?
40. Your company deploys batch prediction jobs using Vertex AI Batch Prediction. You need to monitor the jobs for failures and performance. What is the recommended approach?
41. A real-time recommendation model deployed on Vertex AI Endpoints is experiencing increased latency, especially during peak hours. The model is hosted on a single machine with 4 CPUs. Which set of actions should you take to diagnose and resolve the issue?
42. Your organization has a requirement to monitor fairness of an ML model that predicts loan approvals. You need to set up alerts if the model's predictions show bias against a protected group. Which tool on Google Cloud can you use to monitor this?
43. A data scientist trained a model on historical data from 2020-2022 and deployed it in January 2023. In February 2023, the model's accuracy drops significantly. Which monitoring metric would most likely indicate the root cause?
44. You have a model that predicts equipment failure. The model is retrained every week with new data. You notice that the model's precision is stable but recall drops suddenly. Which monitoring strategy would best help you understand the cause?
45. Your ML pipeline uses Vertex AI Feature Store to serve features for online predictions. You need to monitor the freshness of features in the online store. Which approach is most effective?
46. You have deployed a text classification model using Vertex AI Endpoints. The model is performing well, but the operations team wants to be alerted if the endpoint returns an excessive number of HTTP 503 errors. What is the simplest way to achieve this?
47. A recommendation system model is updated daily via a retraining pipeline. After each update, the online prediction latency increases significantly for about 30 minutes before returning to normal. What is the most likely cause and solution?
48. Your team manages multiple ML models on Vertex AI. You need to implement a centralized monitoring solution to track model performance over time. Which TWO approaches should you consider? (Choose two.)
49. You are monitoring a production model that is experiencing gradual decay in AUC. Which THREE metrics should you set up alerts for to diagnose the root cause? (Choose three.)
50. Your team deploys a model using Vertex AI Endpoints with autoscaling. Which TWO metrics are most important to monitor in order to optimize cost and performance? (Choose two.)
51. A data science team has deployed a model on Vertex AI and wants to automatically detect when the distribution of a specific feature shifts significantly from the training data. Which service should they use?
52. A machine learning engineer notices that the online prediction latency for a custom TensorFlow model deployed on Vertex AI has increased significantly over the past week. Cloud Monitoring shows that the CPU utilization of the endpoints remains below 40%, but the number of concurrent requests has doubled. What is the most likely cause of the latency increase?
53. A large enterprise has multiple ML models deployed in production across different regions. They want to implement a centralized monitoring dashboard that tracks key performance indicators such as prediction accuracy, latency, and error rates for all models, with the ability to drill down into individual model versions. Which approach best meets these requirements?
54. An ML team is using Vertex AI Pipelines to run automated retraining workflows. They want to monitor pipeline execution and receive alerts when a pipeline run fails. Which Google Cloud service should they use to set up such alerts?
55. A company has deployed a model that predicts customer churn. The model's performance, as measured by AUC, has been declining over the past month. The team suspects data drift. They have enabled Vertex AI Model Monitoring, but no alerts have been triggered. What is a possible reason for the lack of alerts?
56. A team is monitoring a production ML system that includes multiple models and data processing pipelines. They want to set up a comprehensive alerting strategy that minimizes false positives while ensuring critical issues are promptly addressed. Which approach is the most effective?
57. A machine learning model deployed on Vertex AI is returning erroneous predictions. The team needs to investigate the root cause by examining the prediction request and response details. Which Google Cloud tool is best suited for this?
58. A team is using Vertex AI Feature Store to manage features for training and serving. They want to monitor the freshness of the features (i.e., how recently each feature was updated). Which approach should they take?
59. A company has deployed a machine learning model that uses a large input tensor. They notice that the prediction latency varies significantly between requests of the same size. Cloud Monitoring shows that the serving endpoint's CPU utilization is consistently below 50%, but memory utilization fluctuates between 70% and 95%. What is the most likely cause?
60. A team is deploying a new model version. They want to ensure that they can quickly roll back if the new version performs poorly in production. Which TWO actions should they take? (Choose 2.)
61. A team is responsible for monitoring the health of a Vertex AI pipeline that runs daily. Which THREE resources should they use to gain visibility into pipeline performance and failures? (Choose 3.)
62. A financial institution uses a machine learning model to approve loans. They must monitor for fairness and bias. Which THREE Google Cloud tools or features can help them achieve this? (Choose 3.)
63. Refer to the exhibit. A Vertex AI prediction endpoint is failing with a deadline exceeded error. The log shows the following. What is the most likely cause?
64. Refer to the exhibit. A team configured Vertex AI Model Monitoring with skew detection for feature "income" with a threshold of 0.2. However, they have not received any alerts even though they suspect data drift. What is the most likely reason?
65. Refer to the exhibit. An alert policy is configured to trigger when prediction latency exceeds 500 ms for 5 consecutive minutes. The team is experiencing many false positive alerts during brief latency spikes. Which adjustment would most effectively reduce false positives while still detecting prolonged latency issues?
66. A company deploys a batch prediction job on Vertex AI using a custom container. The job completes successfully, but the predictions are later found to be inaccurate. The ML engineer wants to set up monitoring to detect similar issues proactively. Which approach should the engineer take?
67. An ML team is using Vertex AI Online Prediction and wants to receive alerts when the 99th percentile latency exceeds 500ms for more than 5 minutes. What is the best practice to set up this alert in Cloud Monitoring?
68. An e-commerce company uses a Vertex AI endpoint for product recommendations. Recently, the click-through rate (CTR) dropped significantly. Model monitoring shows no significant data drift or skew. Logs show increased latency but no errors. Which technique should the engineer use to diagnose the issue?
69. A data science team uses TFX to train and deploy a model on Vertex AI. They want automated monitoring for pipeline health. Which set of metrics should they monitor to quickly detect issues in the training pipeline?
70. An ML engineer is monitoring a Vertex AI Feature Store used for online serving. Which metrics are most important to track for ensuring low-latency online serving?
71. A company uses Vertex AI Predictions with a custom container that invokes an external API for feature enrichment. The prediction response time is highly variable. The engineer wants to monitor the external API's contribution to latency. What should the engineer do?
72. An MLOps team wants to set up alerts for GPU memory utilization on Vertex AI Training jobs. Which approach is most efficient?
73. A company deploys an online prediction model serving 100 requests per second. They are optimizing for both latency and throughput. Which monitoring strategy should they use?
74. A data science team uses Vertex AI Model Monitoring to detect data quality issues in a production model. Which TWO metrics should they enable to identify problems with missing values in predictions? (Select TWO.)
75. An ML engineer is building a monitoring dashboard for a Vertex AI pipeline that includes training, evaluation, and batch prediction. Which THREE components should be included to provide comprehensive observability? (Select THREE.)
76. An ML team wants to monitor their recommendation model for fairness. Which TWO metrics should they track to detect potential bias? (Select TWO.)
77. A global retailer has deployed a real-time product recommendation model on Vertex AI Endpoints. The model is a large neural network that runs on a single node with 8 vCPUs and 30 GB memory. Over the past week, the p99 latency has increased from 200ms to 2 seconds, and the error rate has risen to 5%. Cloud Monitoring shows that the endpoint's CPU utilization is consistently near 100%, and memory is at 80%. The ML engineer suspects the model is too large for the node, but model size has not changed. Logs show no increase in request volume (steady at 50 QPS). There are no recent model updates. The engineer has tried to increase the node to 16 vCPUs, but latency decreased only slightly. What is the most likely root cause and the best first step to resolve it?
78. A financial services company uses a custom container to serve a fraud detection model on Vertex AI Endpoints. The model requires a feature store lookup for each prediction. Recently, the feature store (Cloud Bigtable) experienced a brief outage, causing some predictions to fail. After the outage resolved, the endpoint's CPU utilization dropped significantly, and prediction latency improved. However, the model's false positive rate increased sharply. The ML engineer suspects the model is using stale features because the feature store outage caused missing lookups. Cloud Monitoring for the endpoint shows no errors after the outage, but the number of feature store read requests per prediction decreased by 30%. Which metric should the engineer use to confirm the hypothesis of stale features?
79. A startup is deploying its first machine learning model using BigQuery ML. The model is a logistic regression for churn prediction, trained on a dataset of 5 million rows. The pipeline runs every week: it exports training data from BigQuery, trains a model using BigQuery ML, and then deploys the model as a remote model for predictions. The ML engineer wants to set up basic monitoring to ensure the pipeline runs successfully and the model quality does not degrade. Which monitoring approach should the engineer implement first?
80. A machine learning engineer is monitoring a deployed churn prediction model that has shown a gradual decline in accuracy over the past month. The engineer wants to diagnose the root cause of the performance degradation. Which TWO actions should the engineer take? (Choose two.)
81. A retail company has deployed a machine learning model using Vertex AI Endpoints to predict inventory demand. The model was trained on data from the past two years and has been in production for six months. The team has enabled Vertex AI Model Monitoring to track prediction drift with an alert threshold of 0.2. Last week, they received an alert that the prediction drift score reached 0.35, exceeding the threshold. The engineer checks the monitoring dashboard and sees that the distribution of predictions has shifted noticeably compared to the training data. The engineer also notices that the model's accuracy metrics, computed from weekly ground truth data, have remained within acceptable range. What should the engineer do first?
82. A financial services company uses a custom deep learning model on Vertex AI to automatically approve or reject credit card transactions. The model is explainable using Vertex Explainable AI, and the company monitors feature attribution drift with thresholds defined per feature. Last week, the monitoring system flagged that the mean absolute attribution score for the 'transaction_amount' feature increased from 0.35 to 0.55. The overall model accuracy, measured on a daily batch of labeled transactions, has remained around 97%. The operations team is concerned about potential compliance issues due to changing model behavior. What should the data scientist do?
83. A travel booking company has a real-time recommendation system that suggests hotels and flights to users. The model is served using TensorFlow Serving on a Google Kubernetes Engine (GKE) cluster with auto-scaling enabled. The cluster uses n1-standard-4 machine types. The team has set up Cloud Monitoring dashboards and alerts. Last week, during a major holiday promotion, the team noticed that the model's inference latency P99 increased from 150 ms to 450 ms over a 30-minute period, while the request throughput increased from 500 to 1,200 requests per second. CPU utilization across the cluster rose to 95%, but memory utilization remained at 60%. The model version and the serving infrastructure configuration have not changed since the last deployment. Which action should the team take to mitigate the latency issue?
84. A financial services company has deployed a credit risk ML model on Vertex AI. They want to monitor the model for fairness across demographic groups to ensure no biased outcomes. Which TWO actions should they take as best practices? (Choose TWO.)
85. Refer to the exhibit. A data scientist notices that predictions from a deployed model are taking longer than expected. Which Cloud Monitoring metric should be inspected first to identify the bottleneck?
86. A retail company deployed a demand forecasting model using TensorFlow on Vertex AI Batch Prediction. The model runs weekly on a large dataset stored in BigQuery. Over the past month, the prediction accuracy has degraded significantly. The ML engineer reviews the monitoring dashboard and sees that the feature distribution for 'product_price' has shifted from a mean of $50 to $55, and the new product category 'electronics' now represents 20% of the data, whereas it was only 5% in training. The model was never retrained after initial deployment six months ago. The engineer also notices that the Vertex Explainable AI feature importance scores have changed: 'product_price' used to be the top feature (importance 0.35) but now ranks third (importance 0.20). The company requires minimal downtime and wants to improve accuracy as quickly as possible without incurring high costs from excessive retraining. Which course of action should the ML engineer take?
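Several of the questions above turn on a drift score being compared with an alert threshold (for example, a threshold of 0.2). As a rough, non-authoritative sketch of what such a score can look like, the Python below computes two common distribution distances: Jensen-Shannon divergence over binned numerical values and L-infinity distance over categorical frequencies. The function names, the sample distributions, and the reuse of 0.2 as a cutoff are illustrative assumptions; this is not the monitoring service's implementation and not an answer key.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete
    distributions given as equal-length lists of probabilities."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms where x is zero contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def l_infinity_distance(train_counts, serve_counts):
    """Largest absolute difference in category frequency between two
    dicts that map category name to observed count."""
    categories = set(train_counts) | set(serve_counts)
    train_total = sum(train_counts.values())
    serve_total = sum(serve_counts.values())
    return max(
        abs(train_counts.get(c, 0) / train_total
            - serve_counts.get(c, 0) / serve_total)
        for c in categories
    )


# Hypothetical binned distributions for one numerical feature.
training_bins = [0.25, 0.50, 0.25]
serving_bins = [0.10, 0.40, 0.50]

# Hypothetical category counts for one categorical feature.
training_categories = {"grocery": 80, "electronics": 20}
serving_categories = {"grocery": 50, "electronics": 50}

THRESHOLD = 0.2  # assumed cutoff, borrowed from the question wording

numeric_score = js_divergence(training_bins, serving_bins)
categorical_score = l_infinity_distance(training_categories, serving_categories)

print("numeric drift score:", round(numeric_score, 4),
      "alert:", numeric_score > THRESHOLD)
print("categorical drift score:", round(categorical_score, 4),
      "alert:", categorical_score > THRESHOLD)
```

With log base 2, the Jensen-Shannon score stays between 0 (identical distributions) and 1 (no overlap), which is one reason a fixed threshold in that range is easy to reason about.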
The Monitoring ML solutions domain covers the key concepts tested in this area of the PMLE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PMLE domains — no account required.
The Courseiva PMLE question bank contains 86 questions in the Monitoring ML solutions domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
The session launcher on this page draws questions exclusively from the Monitoring ML solutions domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.