PMLE Practice Questions

Question 1

A travel booking company has a real-time recommendation system that suggests hotels and flights to users. The model is served using TensorFlow Serving on a Google Kubernetes Engine (GKE) cluster with auto-scaling enabled. The cluster uses n1-standard-4 machine types. The team has set up Cloud Monitoring dashboards and alerts. Last week, during a major holiday promotion, the team noticed that the model's inference latency P99 increased from 150 ms to 450 ms over a 30-minute period, while the request throughput increased from 500 to 1,200 requests per second. CPU utilization across the cluster rose to 95%, but memory utilization remained at 60%. The model version and the serving infrastructure configuration have not changed since the last deployment. Which action should the team take to mitigate the latency issue?

Accepted Answer

Add more nodes to the GKE cluster to increase the total CPU resources available for serving. The latency spike is caused by CPU saturation (95% utilization) under increased load (500 to 1,200 RPS), while memory (60%) is not the bottleneck. Adding nodes directly increases the total CPU available, allowing the existing TensorFlow Serving pods to handle the higher throughput without contention. This is the most immediate, infrastructure-appropriate fix: the model version and serving configuration have not changed, which points away from model-level or code-level changes as the remedy.
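A rough way to see why latency climbs so sharply as CPU approaches 100% is the textbook M/M/1 queueing model, where mean time in system is S / (1 - rho) for service time S and utilization rho. This is a simplifying assumption (TensorFlow Serving pods are multi-threaded and real traffic is not Poisson), and the 30 ms service time is a hypothetical value, but the shape of the curve carries over: queueing delay explodes near saturation, and adding CPU lowers rho.

```python
# M/M/1 idealization: one server, Poisson arrivals, exponential service times.
# Mean time in system T = S / (1 - rho), where S is the mean service time
# and rho is utilization (arrival rate / service rate).


def mm1_mean_latency(service_time_ms: float, utilization: float) -> float:
    """Return the M/M/1 mean time in system for a given utilization."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


if __name__ == "__main__":
    service_ms = 30.0  # hypothetical mean service time per request
    for rho in (0.50, 0.60, 0.80, 0.95):
        latency = mm1_mean_latency(service_ms, rho)
        print(f"rho={rho:.2f} -> mean latency {latency:.0f} ms")
```

Going from 60% to 95% utilization multiplies the modeled latency by 8, which is why spreading the same load over more nodes helps so much.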

Answer

Implement a feature engineering pipeline that compresses the input features to reduce data size and inference time.

Answer

Deploy a newer version of the model that uses a more efficient architecture to reduce computational complexity.

Answer

Increase the number of TensorFlow Serving instances by reducing the CPU request per pod in GKE to allow more pods per node.

Question 2

A global retail company uses Vertex AI Recommendations to provide product recommendations on its website. It has a large catalog and millions of users. The initial deployment works well for active users, but new users (with no purchase history) receive generic recommendations that are not personalized. The company wants to improve the cold-start experience. User demographic data (age, location) is available at sign-up. The current recommendation model is the built-in collaborative filtering model in Vertex AI Recommendations. What should the company do to improve personalization for new users?

Accepted Answer

Increase the user exploration parameter in the Vertex AI Recommendations configuration. This is correct because increasing the user exploration parameter instructs the model to allocate a higher share of recommendations to items with less historical data, which broadens what cold-start users are shown instead of repeating the same generic items. The parameter controls the balance between exploiting known user-item interactions and exploring new or less-seen items, which is the mechanism within the built-in collaborative filtering model for addressing the cold-start problem without building a custom model.

Answer

Collect more historical interaction data before showing recommendations

Answer

Disable recommendations for new users until they have at least 10 interactions

Answer

Build a custom two-tower recommendation model using Vertex AI Training

Question 3

Your team is developing a machine learning model for real-time fraud detection. The training pipeline runs on Vertex AI and uses BigQuery for feature engineering. Recently, the pipeline has been taking significantly longer to execute. Upon investigation, you find that the BigQuery query for feature extraction is being rerun every time the pipeline runs, even though the underlying data hasn't changed. The pipeline is scheduled to run every hour. You want to reduce cost and execution time without losing the ability to detect data drifts. Which approach should you take?

Accepted Answer

Move the feature extraction to a separate scheduled query in BigQuery and load the results into a table that the pipeline reads from. This is correct because it decouples feature extraction from the training pipeline: a separate scheduled BigQuery query writes its results to a table. This eliminates redundant query execution on every pipeline run, reducing cost and execution time, while the scheduled query can still run at a frequency that detects data drift (e.g., hourly). The pipeline then reads from the precomputed table, avoiding repeated full scans of the source data.

Answer

Implement a caching mechanism in the pipeline that stores the results of the BigQuery query and reuses them if the data hasn't changed.

Answer

Reduce the pipeline frequency to once a day to minimize the number of runs.

Answer

Use a conditional pipeline that checks if the data has changed before running the feature extraction step.

Question 4

A healthcare organization is building a machine learning model to predict patient readmission risk. They have sensitive data stored in BigQuery that includes protected health information (PHI). The data science team uses Vertex AI Workbench notebooks to explore the data and develop models. The organization's security policy requires that all PHI data must be encrypted at rest and in transit, and that access to the data is logged and audited. They also need to ensure that the data used for model training is de-identified to remove direct identifiers such as patient names and SSNs. The team wants to automate the de-identification process as part of the data pipeline. Which approach meets these requirements?

Accepted Answer

Create a Dataflow pipeline that reads from the original BigQuery table, applies Cloud DLP de-identification transforms, and writes to a new BigQuery table. Grant the data science team access to the de-identified table. This is correct because it uses Cloud DLP within a Dataflow pipeline to automatically de-identify PHI as it is read from the original BigQuery table and written to a new, de-identified table. This satisfies the requirement for automated de-identification, while the original table remains encrypted at rest (the BigQuery default) and in transit (TLS), and access to the data can be logged via Cloud Audit Logs. The data science team only gets access to the de-identified table, so PHI is not exposed during model development.
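To make the de-identification step concrete, here is a sketch of the kind of `deidentify_config` a Dataflow worker could send to the Cloud DLP `deidentify_content` method for each batch of rows. The choice of infoTypes and transformations (SSNs replaced with their infoType name, person names masked with `#`) is an illustrative assumption, not a recommended policy. Field names follow the DLP API's `DeidentifyConfig` message, and the surrounding Dataflow and client code is omitted.

```python
import json


def deidentify_config() -> dict:
    """Sketch of a Cloud DLP deidentify_config for direct identifiers.

    Illustrative policy (an assumption, not a recommendation):
    - US Social Security numbers are replaced with the infoType name.
    - Person names are masked character by character with '#'.
    """
    return {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                    "primitive_transformation": {
                        "replace_with_info_type_config": {}
                    },
                },
                {
                    "info_types": [{"name": "PERSON_NAME"}],
                    "primitive_transformation": {
                        "character_mask_config": {"masking_character": "#"}
                    },
                },
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps(deidentify_config(), indent=2))
```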

Answer

Enable Shielded VM on Vertex AI Workbench notebooks and use VPC-SC to restrict data access.

Answer

Use Cloud Key Management Service to encrypt the PHI columns in BigQuery, and share the encryption key with the data science team.

Answer

Use BigQuery row-level security to mask PHI columns for the data science team, and train the model directly on the original table.

Question 5

You are an ML engineer at a global e-commerce company. Your team has developed a deep learning model for product recommendation that runs on Vertex AI Prediction. The model is deployed on a single n1-highmem-2 instance (CPU only) with autoscaling enabled (min replicas=1, max replicas=10). During Black Friday, traffic spikes to 1000 requests per second (QPS), and you observe that latency increases from 50ms to over 5000ms, and many requests time out. You check the monitoring dashboard and see that CPU utilization is at 100% on the single instance, and autoscaling is not triggering quickly enough. The team has a budget for this service and wants to handle the spike without compromising latency. What should you do?

Accepted Answer

Switch to GPU instances (e.g., n1-standard-4 with T4) and set min replicas=2 with autoscaling up to 10. This is correct because moving to GPU instances (n1-standard-4 with a T4) offloads compute-intensive recommendation model inference to GPUs, significantly reducing per-request latency. Setting min replicas=2 keeps at least two instances warm at all times, providing spare capacity while autoscaling reacts instead of waiting for new replicas to start from cold. This combination addresses both the CPU bottleneck and the slow scaling trigger, bringing latency back toward the 50 ms baseline at 1,000 QPS.
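Whether min replicas=2 is enough depends on how much traffic each replica can absorb while autoscaling catches up. A simple sizing sketch: divide the peak QPS by the per-replica throughput measured in a load test, leaving some headroom. The 150 QPS per replica and 70% headroom used here are hypothetical; substitute numbers from your own load tests on the CPU and GPU machine types.

```python
import math


def replicas_for_peak(peak_qps: float, qps_per_replica: float,
                      headroom: float = 0.7) -> int:
    """Replicas needed so each runs at no more than `headroom` of capacity.

    qps_per_replica: sustainable throughput of one replica at the target
    latency, taken from a load test (hypothetical in this example).
    """
    if qps_per_replica <= 0 or not 0 < headroom <= 1:
        raise ValueError("qps_per_replica must be > 0 and headroom in (0, 1]")
    return math.ceil(peak_qps / (qps_per_replica * headroom))


if __name__ == "__main__":
    # Hypothetical: 1,000 QPS peak, 150 QPS per replica, run at 70% of capacity.
    print(replicas_for_peak(1000, 150, 0.7))
```

If the computed count is well above the minimum, the gap is the capacity that autoscaling has to add during the spike, which is what higher per-replica throughput and warm minimum replicas reduce.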

Answer

Increase min replicas to 5 to keep warm instances

Answer

Set min replicas=1 and max replicas=5 to control cost

Answer

Increase max replicas to 20 and keep CPU instances

Question 6

A financial services company uses Vertex AI AutoML Tables to build a credit risk model. The dataset contains 500,000 rows and 50 features, including loan amount, credit score, debt-to-income ratio, and employment length. The target variable is binary: 'default' (1) or 'no default' (0). The data is highly imbalanced, with only 2% defaults. The data scientist trains a model with AutoML Tables using default settings. The evaluation metrics show an AUC of 0.85, but the confusion matrix reveals that the model predicts 'no default' for almost all cases, missing most defaults. The data scientist needs to improve the model's ability to identify defaults without significantly increasing false positives. They have limited time and cannot write custom code. What should they do?

Accepted Answer

Enable weighted evaluation and set the optimization objective to maximize recall at a specific precision value, with a target precision of 0.5. This is correct because AutoML Tables lets you choose an optimization objective that handles class imbalance without custom code. With weighted evaluation enabled and the objective set to maximize recall at a target precision of 0.5, the model is tuned to catch as many defaults (recall) as possible while holding precision at the specified level, which directly addresses the need to catch more defaults without a large increase in false positives.
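To see what "maximize recall at a target precision" optimizes, here is a threshold search over a scored validation set: among all decision thresholds whose precision is at least the target, pick the one with the highest recall. This is a post-hoc illustration with toy data, not how AutoML trains internally; in Vertex AI tabular training the comparable objective is named `maximize-recall-at-precision`.

```python
from typing import List, Optional, Tuple


def max_recall_at_precision(
    scores: List[float], labels: List[int], min_precision: float
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) with the highest recall whose
    precision is at least `min_precision`, or None if none qualifies.

    A row is predicted positive (default) when its score >= threshold.
    """
    positives = sum(labels)
    if positives == 0:
        return None
    best = None
    # Every distinct score is a candidate threshold, so tp + fp >= 1 below.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        precision = tp / (tp + fp)
        recall = tp / positives
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (t, precision, recall)
    return best


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # toy model scores
    labels = [1, 0, 1, 0, 1, 0, 0, 0]                  # 1 = default
    print(max_recall_at_precision(scores, labels, 0.5))
```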

Answer

Manually split the data into a stratified train/test set to ensure the same proportion of defaults in each.

Answer

Train multiple models with different algorithms (e.g., XGBoost, Random Forest) and blend them using a custom script.

Answer

Under-sample the majority class to create a balanced dataset and retrain.

Question 7

A financial services firm deploys a binary classification model for fraud detection. The model's precision is 0.95 and recall is 0.60 on the test set. After deployment, the fraud rate in production is 0.5% compared to 5% in the test set. The model shows good calibration on the test set (Brier score 0.02) but poor calibration in production (Brier score 0.15). What is the most likely explanation for the calibration degradation?

Accepted Answer

The relationship between features and the target has changed (concept drift), causing the model's probability estimates to be misaligned with the true probabilities. The test set had a 5% fraud rate, while production has a 0.5% fraud rate. A change in the base rate (prior probability shift) changes the true posterior P(fraud | x) even when the class-conditional feature distributions stay the same, and the model's predicted probabilities are tied to the distribution it was trained and calibrated on. The accepted answer frames this as concept drift: the mapping from features to the true fraud probability is no longer the one the model learned, so its scores overstate fraud risk in production and the Brier score rises from 0.02 to 0.15.
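A worked sketch of why the base-rate change breaks calibration. Assuming pure prior shift (P(x | fraud) and P(x | not fraud) unchanged), Bayes' rule rescales a calibrated probability from the test base rate to the production base rate; the Brier score helper shows the metric the question uses. Under these assumptions, a score of 0.5 on the 5% test distribution corresponds to roughly 0.087 at a 0.5% base rate.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def adjust_for_prior_shift(p: float, train_prior: float, new_prior: float) -> float:
    """Rescale a calibrated P(y=1 | x) from the training base rate to a new one.

    Assumes pure prior (label) shift: P(x | y) is the same in both settings.
    """
    pos = p * (new_prior / train_prior)
    neg = (1 - p) * ((1 - new_prior) / (1 - train_prior))
    return pos / (pos + neg)


if __name__ == "__main__":
    # Test set base rate 5%, production base rate 0.5% (from the question).
    print(round(adjust_for_prior_shift(0.5, 0.05, 0.005), 3))
```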

Answer

The distribution of input features has shifted significantly, causing the model to produce incorrect probabilities.

Answer

The model overfits to noise in the training data, leading to poor generalization.

Answer

The production data has a different class imbalance than the training data, causing the model to be biased toward the majority class.

Question 8

You are using Vertex AI Matching Engine for similarity search. Your index has 10 million embeddings of 512 dimensions. The query latency requirement is under 10ms for 99th percentile. Which index type should you choose?

Accepted Answer

Approximate Nearest Neighbor (ANN) index using the ScaNN algorithm. This is correct because ScaNN (Scalable Nearest Neighbors) is designed for high-dimensional, large-scale similarity search with strict latency requirements. With 10 million 512-dimensional embeddings, an ANN index like ScaNN can achieve sub-10 ms query latency at the 99th percentile by trading a small amount of recall for a large speedup over exhaustive search, which is exactly what Vertex AI Matching Engine is optimized for.
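For contrast with ANN, this is what exact (brute-force) cosine search does: score every vector in the index for every query. It is a pure-Python illustration with toy vectors, not Matching Engine code, and it assumes non-zero vectors. At the question's scale the per-query work is about 5.12 billion multiply-adds, which is why an exhaustive scan struggles to meet a 10 ms P99 budget and an ANN index such as ScaNN, which skips most candidates, is preferred.

```python
import heapq
import math
from typing import List, Sequence, Tuple

# Work for one exhaustive query at the question's scale: N vectors x d dims.
MULTIPLY_ADDS_PER_QUERY = 10_000_000 * 512


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def brute_force_top_k(
    query: Sequence[float], index: Sequence[Sequence[float]], k: int
) -> List[Tuple[float, int]]:
    """Exact search: score every vector (O(N * d) work), keep the k best.

    Returns (similarity, vector_index) pairs, highest similarity first.
    """
    scored = ((cosine(query, v), i) for i, v in enumerate(index))
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    toy_index = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(brute_force_top_k([1.0, 0.1], toy_index, 2))
    print(f"{MULTIPLY_ADDS_PER_QUERY:,} multiply-adds per exhaustive query")
```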

Answer

Brute-force index with cosine distance.

Answer

A custom distance-based index using Cloud SQL.

Answer

A tree-based index from scikit-learn deployed as a custom container.

Question 9

A machine learning engineer wants to deploy a trained model to Vertex AI for online predictions. Which Vertex AI resource is required to serve the model and provide an endpoint URL?

Accepted Answer

Vertex AI Endpoint. Vertex AI Endpoint is the required resource to deploy a trained model for online predictions, as it provides a dedicated endpoint URL that accepts prediction requests and routes them to the model. Without an endpoint, the model cannot be accessed via HTTP/HTTPS for real-time inference, which is the core requirement for online serving.
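Once a model is deployed to an endpoint, clients call that endpoint's URL. This sketch builds the regional REST predict URL and a JSON request body in the `{"instances": [...]}` shape that Vertex AI online prediction accepts. The project ID, region, endpoint ID, and instance format are hypothetical placeholders; a real call also needs an OAuth 2.0 bearer token, for example from `gcloud auth print-access-token`.

```python
import json


def predict_url(project: str, region: str, endpoint_id: str) -> str:
    """Regional REST URL for an endpoint's :predict method."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/endpoints/{endpoint_id}:predict"
    )


def predict_body(instances: list) -> str:
    """JSON request body carrying the list of instances to score."""
    return json.dumps({"instances": instances})


if __name__ == "__main__":
    # Hypothetical identifiers; replace with your own project and endpoint.
    print(predict_url("my-project", "us-central1", "1234567890"))
    print(predict_body([[1.0, 2.0]]))
```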

Answer

Vertex AI Pipeline

Answer

Vertex AI Model Registry

Answer

Vertex AI Feature Store

Question 10

You have a Vertex AI endpoint serving a model for real-time predictions. The endpoint is configured with minReplicaCount=2 and maxReplicaCount=10. Over the past week, you notice that the actual number of replicas rarely exceeds 2, but the average CPU utilization is around 85%. You want to reduce costs without impacting performance. What should you do?

Accepted Answer

Decrease minReplicaCount to 1. This is correct because minReplicaCount=2 forces at least two replicas to run continuously, even during quiet periods when one would suffice, and the actual replica count rarely exceeds 2. Lowering the floor to 1 lets the endpoint scale down to a single replica when traffic is low, reducing compute costs, while autoscaling still adds replicas (up to maxReplicaCount=10) when CPU utilization rises. The trade-off is a short period of higher latency at the start of a traffic ramp while a new replica starts, so this works best when off-peak traffic is well within one replica's capacity.
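To reason about how many replicas a given load needs, a common approximation is the Kubernetes Horizontal Pod Autoscaler rule: desired = ceil(current_replicas x current_utilization / target_utilization), clamped to [min, max]. Vertex AI's autoscaler is not guaranteed to use exactly this formula, and the 60% target in the example is an assumed value, so treat this as a back-of-the-envelope sketch.

```python
import math


def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, min_replicas: int,
                     max_replicas: int) -> int:
    """HPA-style estimate: ceil(current * current_util / target_util), clamped.

    Utilizations are percentages (e.g. 85 for 85%).
    """
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two replicas averaging 85% CPU against an assumed 60% target.
    print(desired_replicas(2, 85, 60, min_replicas=1, max_replicas=10))
    # A quiet period: two replicas at 20% CPU.
    print(desired_replicas(2, 20, 60, min_replicas=1, max_replicas=10))
```

Under this rule, a sustained 85% average on two replicas would itself call for a third replica against a 60% target, so the saving from minReplicaCount=1 comes from quiet periods when utilization drops well below the target.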

Answer

Increase minReplicaCount to 5.

Answer

Increase maxReplicaCount to 20.

Answer

Decrease the CPU utilization target to 50%

Question 11

Your company runs a high-traffic web application that serves the same machine learning model prediction for many identical requests (e.g., product recommendations for the same user profile). You want to reduce latency and load on the prediction endpoint by caching responses. Which Google Cloud service should you use?

Accepted Answer

Cloud Memorystore. Cloud Memorystore is correct because it provides a managed in-memory cache (Redis or Memcached) that can store the results of identical prediction requests, reducing latency and load on the prediction endpoint. By caching responses keyed on the user profile or request parameters, subsequent identical requests can be served directly from Memorystore with sub-millisecond latency, avoiding redundant model inference.
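The cache-aside pattern behind this answer: derive a stable key from the request, return the cached prediction if present, otherwise call the model and store the result with a time-to-live. In this sketch an in-process dictionary with expiry stands in for Memorystore, and `predict` is a hypothetical callable wrapping the endpoint; with Memorystore for Redis the same logic would use a Redis client and `SET` with an `EX` expiry.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:  # expired: evict and report a miss
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self.clock() + self.ttl, value)


def cache_key(request: dict) -> str:
    """Stable key: SHA-256 of the canonical (sorted-key) JSON of the request."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return "pred:" + hashlib.sha256(canonical.encode()).hexdigest()


def cached_predict(request: dict, cache: TTLCache,
                   predict: Callable[[dict], Any]) -> Any:
    """Cache-aside: serve from cache on a hit, else call the model and store."""
    key = cache_key(request)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = predict(request)
    cache.set(key, result)
    return result
```

Hashing canonical JSON means requests that differ only in key order share one cache entry, and the TTL bounds how stale a cached recommendation can get.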

Answer

Cloud CDN

Answer

Cloud Spanner

Answer

BigQuery

Question 12

You have a Vertex AI endpoint with two deployed models: a champion (v1) and a challenger (v2). You set the traffic split to 90% v1 and 10% v2. After a week, you observe that v2 has better business metrics. You want to shift all traffic to v2 gradually over 3 days to avoid any risk. What should you do?

Accepted Answer

Update the traffic split configuration on the endpoint multiple times over the 3 days to gradually increase v2's percentage. This is correct because Vertex AI endpoints support live traffic splitting between deployed models, so you can shift traffic from v1 to v2 in steps by updating the traffic split several times over the 3-day period. This minimizes risk by enabling an incremental rollout with immediate rollback if issues arise, without client-side changes or downtime.
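A 3-day ramp can be planned as a fixed schedule of challenger percentages, for example one update every 12 hours. These helper functions are a hypothetical sketch; the only Vertex AI detail they rely on is that an endpoint's traffic split maps deployed model IDs to integer percentages that must sum to 100. The deployed model IDs are placeholders, and applying each step (through the console, gcloud, or an SDK) is left out.

```python
from typing import Dict, List


def ramp_schedule(start_pct: int, steps: int) -> List[int]:
    """Evenly spaced challenger percentages from start_pct up to 100."""
    if steps < 1 or not 0 <= start_pct <= 100:
        raise ValueError("steps must be >= 1 and start_pct in [0, 100]")
    return [start_pct + round((100 - start_pct) * i / steps)
            for i in range(1, steps + 1)]


def traffic_split(champion_id: str, challenger_id: str,
                  challenger_pct: int) -> Dict[str, int]:
    """Traffic split keyed by deployed model ID; values sum to 100."""
    return {champion_id: 100 - challenger_pct, challenger_id: challenger_pct}


if __name__ == "__main__":
    # 6 steps over 3 days = one update every 12 hours, starting from 10% on v2.
    for pct in ramp_schedule(10, 6):
        print(traffic_split("v1-deployed-model-id", "v2-deployed-model-id", pct))
```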

Answer

Deploy v2 to a new endpoint and update your clients to use the new endpoint.

Answer

Use Vertex AI Experiments to compare v1 and v2, then redeploy v2 with 100% traffic.

Answer

Delete v1 from the endpoint so that all traffic automatically goes to v2.

Question 13

Your team has deployed a model on Vertex AI endpoints and you are planning an A/B test to compare a new challenger model (v2) against the current champion (v1). The test should measure business metrics such as click-through rate. Which THREE steps should you take to set up the A/B test correctly? (Choose 3 correct answers)

Accepted Answer

Deploy the challenger model (v2) to the same endpoint as the champion (v1). This is correct because deploying both v1 and v2 to the same Vertex AI endpoint lets you use the built-in traffic splitting feature to route a percentage of requests to each model version without managing separate endpoints or DNS changes, which is the standard approach for A/B testing on Vertex AI.
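Once traffic is split, the click-through rates of the two deployed models can be compared with a standard pooled two-proportion z-test. This is a sketch with hypothetical click and view counts; it assumes each logged click can be attributed to the deployed model that served the request, and that requests are independent.

```python
import math
from typing import Tuple


def two_proportion_z_test(clicks_a: int, views_a: int,
                          clicks_b: int, views_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: CTR_A == CTR_B (pooled z-test)."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: champion 1.0% CTR, challenger 1.3% CTR.
    z, p = two_proportion_z_test(900, 90_000, 130, 10_000)
    print(f"z={z:.2f}, p={p:.4f}")
```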

Answer

Create a new endpoint for v2 and gradually shift DNS traffic.

Answer

Use Vertex AI Experiments to compare model performance.

Question 14

A company is deploying a complex model that requires GPU for inference. They want to use Vertex AI for serving. Which TWO steps are required to deploy the model with GPU support? (Choose 2)

Accepted Answer

Select a GPU-enabled machine type such as n1-standard-4 with 1 x NVIDIA Tesla T4. This is correct because Vertex AI requires a GPU-enabled machine configuration (e.g., n1-standard-4 with 1 x NVIDIA Tesla T4) when deploying a model for GPU inference. The accelerator is specified in the machine spec of the deployed model, which ensures the GPU hardware is allocated to the serving container.
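The machine configuration in this answer corresponds to the `dedicatedResources` section of a deployed model in the Vertex AI REST API. This Python sketch only builds that JSON fragment; the replica counts are illustrative defaults, and the serving container must also be one that can actually use the GPU.

```python
import json


def gpu_dedicated_resources(machine_type: str = "n1-standard-4",
                            accelerator_type: str = "NVIDIA_TESLA_T4",
                            accelerator_count: int = 1,
                            min_replicas: int = 1,
                            max_replicas: int = 2) -> dict:
    """dedicatedResources fragment of a DeployedModel (REST, camelCase keys).

    The machineSpec pairs a CPU machine type with an attached GPU.
    Replica counts here are illustrative, not recommendations.
    """
    return {
        "dedicatedResources": {
            "machineSpec": {
                "machineType": machine_type,
                "acceleratorType": accelerator_type,
                "acceleratorCount": accelerator_count,
            },
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
        }
    }


if __name__ == "__main__":
    print(json.dumps(gpu_dedicated_resources(), indent=2))
```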

Answer

Enable Vertex AI Model Optimization for automatic GPU compilation.

Answer

Increase the minimum replicas to at least 2 for GPU redundancy.

Answer

Use gRPC protocol for prediction requests to reduce latency.

Question 15

You need to deploy a model to a Vertex AI endpoint that can scale down to zero when there are no requests to minimize costs. Which feature should you enable?

Accepted Answer

Enable autoscaling with minReplicaCount=0. This is correct because setting `minReplicaCount` to 0 lets the endpoint scale down to zero instances when there are no incoming requests, minimizing costs. The endpoint then scales up from zero when traffic arrives, at the cost of cold-start latency on the first requests, and scales back down to zero during idle periods.

Answer

Deploy the model to a Compute Engine instance and use instance groups.

Answer

Use a custom metric for autoscaling

Answer

Set maxReplicaCount to 0
