Google Professional Machine Learning Engineer (PMLE): Questions 151-225

506 questions total · 7 pages · All types, answers revealed

Page 3 of 7

151
Multi-Select · hard

Which THREE actions can help improve the performance of a BigQuery ML model?

Select 3 answers
A. Increase the amount of training data
B. Replace the model with an AutoML model via export
C. Use hypertuning to optimize model parameters
D. Increase the time interval for prediction
E. Perform feature engineering in SQL
Answers: A, C, E

More data often improves model accuracy.

Why this answer

Increasing the amount of training data provides the model with more examples to learn from, which can reduce overfitting and improve generalization, especially for complex patterns. In BigQuery ML, more data often leads to better feature representation and higher accuracy, as long as the data is clean and relevant.

Exam trap

Google Cloud often tests the misconception that exporting a model to AutoML is a valid optimization step, but in reality, BigQuery ML and AutoML are separate services with incompatible model formats and training workflows.

152
Multi-Select · hard

A manufacturing company wants to predict equipment failure using sensor data. The data is highly imbalanced (only 1% failures). They are using a gradient boosted tree model with class weights. The model achieves 0.99 recall but 0.2 precision on the test set. Which two actions should they take to improve precision without significantly hurting recall? (Choose TWO)

Select 2 answers
A. Oversample the minority class using SMOTE
B. Try an anomaly detection algorithm like Isolation Forest
C. Add more features to the model
D. Increase the class weight for the minority class
E. Increase the decision threshold for classifying a positive
Answers: B, E

Anomaly detection is designed for imbalanced data and can improve precision by focusing on outliers.

Why this answer

Option B is correct because anomaly detection algorithms like Isolation Forest are designed to identify rare events by isolating anomalies rather than modeling the majority class, which can improve precision when the minority class is extremely rare (1%). Option E is correct because increasing the decision threshold for classifying a positive reduces false positives by requiring higher confidence for a positive prediction, directly improving precision while only minimally reducing recall if the model's probability scores are well-calibrated.

Exam trap

Google Cloud often tests the misconception that oversampling or adding features always improves model performance, but in highly imbalanced scenarios, these actions can degrade precision without recall benefit, and the correct approach is to adjust the decision threshold or use anomaly detection.
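
The effect of raising the decision threshold can be checked with a small calculation. The following is a minimal sketch on made-up scores and labels (both are assumptions for illustration, not data from the question); it only shows the mechanics of how a higher threshold removes false positives.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when score >= threshold is predicted positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores: 2 true failures (label 1) among 10 readings.
scores = [0.95, 0.90, 0.70, 0.65, 0.60, 0.55, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

low = precision_recall(scores, labels, threshold=0.5)
high = precision_recall(scores, labels, threshold=0.8)
print("threshold 0.5:", low)
print("threshold 0.8:", high)
```

In this toy data the failures score well above the false alarms, so the higher threshold raises precision without losing recall. On real data the two usually trade off, which is why the threshold is typically chosen from a precision-recall curve.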

153
MCQ · medium

An ML engineer needs to run batch predictions on tens of petabytes of data using a trained model. The data is stored in Cloud Storage. Which service should they choose?

A. Cloud Dataflow with the model as a side input
B. Cloud Dataproc running Spark ML
C. Cloud Run with multiple revisions
D. Vertex AI Batch Prediction
Answer: D

Batch Prediction scales to petabytes and integrates with Cloud Storage.

Why this answer

Vertex AI Batch Prediction is the correct choice because it is a managed service specifically designed for high-throughput, large-scale batch inference on data stored in Cloud Storage. It automatically handles sharding, scaling, and resource management for tens of petabytes, without requiring the engineer to manage infrastructure or write custom distributed processing code.

Exam trap

Google Cloud often tests the distinction between batch inference and data processing pipelines, so the trap here is that candidates confuse Cloud Dataflow (a data processing tool) with a batch prediction service, not realizing that Vertex AI Batch Prediction is the dedicated service for running models on large static datasets.

How to eliminate wrong answers

Option A is wrong because Cloud Dataflow with the model as a side input is optimized for stream and batch data processing pipelines, not for running a trained model's predictions on petabytes of static data; side inputs are not designed for large model inference and would cause severe performance bottlenecks and memory issues. Option B is wrong because Cloud Dataproc running Spark ML requires the engineer to manually manage clusters, configure Spark jobs for inference, and handle scaling, which adds operational overhead and is less efficient than a purpose-built batch prediction service for petabyte-scale data. Option C is wrong because Cloud Run is a serverless container platform for request-driven, low-latency applications, not for batch processing of tens of petabytes; it has a maximum request timeout of 60 minutes and cannot handle the volume or duration required.

154
MCQ · easy

A company wants to serve a large XGBoost model that exceeds the 2GB limit for Vertex AI Prediction. What should they do?

A. Reduce model size by removing features
B. Compress the model using gzip and upload
C. Deploy the model on Cloud Run Functions
D. Use a custom container to serve the model
Answer: D

Custom containers have no size limit.

Why this answer

Vertex AI Prediction has a 2GB limit for the model artifact when using pre-built containers. A custom container bypasses this limit because you package the model and serving code into a Docker image, which can be arbitrarily large. This allows you to serve XGBoost models exceeding 2GB without size constraints imposed by the managed serving infrastructure.

Exam trap

Google Cloud often tests the misconception that compression (gzip) or feature reduction can circumvent hard platform limits, when in fact the correct solution is to use a custom container that bypasses the artifact size restriction entirely.

How to eliminate wrong answers

Option A is wrong because removing features reduces model accuracy and does not address the core issue of the 2GB artifact limit; Vertex AI still enforces the limit on the remaining model file. Option B is wrong because gzip compression is not transparent to Vertex AI's pre-built containers: the model must be decompressed at load time, and the 2GB limit applies to the uncompressed artifact, so compression does not bypass the restriction. Option C is wrong because Cloud Run Functions are designed for stateless, short-lived, event-driven functions with constrained memory, not for hosting large ML models, so they are a poor fit for XGBoost inference at scale.

155
MCQ · easy

A data scientist wants to share a trained model with the team for review before deployment. The model is stored in Vertex AI Model Registry. What is the recommended way to grant the team read access to the model?

A. Grant the IAM role 'roles/aiplatform.admin' to the team members.
B. Export the model as a local file and share it via a shared drive.
C. Grant the IAM role 'roles/aiplatform.viewer' to the team members on the project.
D. Add the team members to the Cloud Storage bucket ACL with 'READER' access.
Answer: C

This role allows viewing models in Vertex AI.

Why this answer

Option C is correct because Vertex AI Model Registry uses Cloud IAM, and granting the 'roles/aiplatform.viewer' role provides read access to all model versions. Option A is wrong because 'roles/aiplatform.admin' is too broad for a read-only review. Option B is wrong because exporting the model to a shared drive moves it outside the Model Registry and its IAM controls.

Option D is wrong because Cloud Storage access is separate from Vertex AI IAM; a bucket ACL does not grant access to the model in Vertex AI.

156
MCQ · hard

You have a model that predicts equipment failure. The model is retrained every week with new data. You notice that the model's precision is stable but recall drops suddenly. Which monitoring strategy would best help you understand the cause?

A. Monitor feature drift for all input features.
B. Monitor the distribution of the model's predicted probabilities and compare to the empirical failure rate over time.
C. Compare the number of predictions per day with previous weeks.
D. Check the request latency at the endpoint.
Answer: B

This helps detect concept drift: if predicted probabilities shift relative to actual outcomes, recall may drop.

Why this answer

Option B is correct because a drop in recall (more false negatives) while precision stays stable suggests the model's decision threshold may be misaligned with the current data distribution. Monitoring the distribution of predicted probabilities against the empirical failure rate over time directly reveals if the model's confidence calibration has shifted, indicating concept drift or a change in the underlying failure rate that requires threshold recalibration.

Exam trap

Google Cloud often tests the distinction between data drift (feature drift) and concept drift (label/prior shift), and the trap here is that candidates assume any performance degradation must be due to feature drift, ignoring that a stable precision with dropping recall specifically signals a threshold or label distribution issue best diagnosed via probability calibration monitoring.

How to eliminate wrong answers

Option A is wrong because monitoring feature drift for all input features is too broad and may not directly explain a recall drop; feature drift can cause both precision and recall to change, but a stable precision with dropping recall points to a threshold or label distribution issue, not necessarily input feature drift. Option C is wrong because comparing the number of predictions per day with previous weeks only detects volume anomalies (e.g., traffic spikes), which do not affect recall directly and would not explain a systematic increase in false negatives. Option D is wrong because checking request latency at the endpoint measures infrastructure performance (e.g., network delays, compute bottlenecks), which has no causal link to model prediction quality like recall degradation.
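
One way to compare predicted probabilities with the empirical failure rate is to track the gap between the two per time window. The sketch below is a simplified illustration with made-up weekly data; the function name `calibration_gap` and the window layout are assumptions, not part of any Vertex AI API.

```python
def calibration_gap(predicted_probs, outcomes):
    """Mean predicted failure probability minus the observed failure rate."""
    mean_predicted = sum(predicted_probs) / len(predicted_probs)
    observed_rate = sum(outcomes) / len(outcomes)
    return mean_predicted - observed_rate


# Hypothetical weekly windows: (predicted probabilities, actual outcomes).
weeks = {
    "week_1": ([0.1, 0.2, 0.9, 0.1], [0, 0, 1, 0]),
    "week_2": ([0.1, 0.2, 0.3, 0.1], [0, 1, 1, 0]),
}

gaps = {name: calibration_gap(p, y) for name, (p, y) in weeks.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:+.3f}")
```

A gap that turns strongly negative, as in the second window, means the model is predicting fewer failures than actually occur, which is consistent with rising false negatives and falling recall.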

157
MCQ · medium

A team needs to serve a PyTorch model for production inference with strict latency requirements (p99 < 100ms). The model has dynamic control flow and uses custom kernels compiled with torch.jit. Which serving approach should they recommend?

A. Build a custom container with PyTorch JIT and deploy it on Vertex AI Prediction.
B. Convert the model to TensorFlow SavedModel and serve it on Vertex AI Prediction with TensorFlow Serving.
C. Use Cloud Functions with a PyTorch wrapper to handle inference requests.
D. Deploy the model on Vertex AI Prediction using the prebuilt PyTorch container.
Answer: A

Custom container allows fine-grained optimization and inclusion of custom kernels.

Why this answer

Option A is correct because a custom container with a PyTorch JIT server offers full control over model execution, can include the custom kernels, and avoids the overhead of generic servers. Option B is wrong because TensorFlow Serving does not support PyTorch natively, so the model would first have to be converted. Option D is wrong because Vertex AI Prediction does support PyTorch through the prebuilt container, but a custom container is the better fit for dynamic control flow and custom kernels.

Option C is wrong because Cloud Functions are not suitable for real-time inference at scale with strict latency requirements.

158
MCQ · medium

A team is using Vertex AI Feature Store to manage features for training and serving. They want to monitor the freshness of the features (i.e., how recently each feature was updated). Which approach should they take?

A. Use Cloud Logging to track feature updates
B. Use Vertex AI Feature Store's monitoring dashboard
C. Create a custom Cloud Monitoring metric based on feature ingestion timestamps
D. Use Cloud Audit Logs to monitor API calls
Answer: C

By exporting timestamps as custom metrics, the team can monitor feature freshness in Cloud Monitoring and set alerts.

Why this answer

Vertex AI Feature Store does not provide a built-in monitoring dashboard for feature freshness. To track how recently each feature was updated, you must create a custom Cloud Monitoring metric based on feature ingestion timestamps, which allows you to define alerting thresholds and visualize freshness over time.

Exam trap

The trap here is that candidates assume Vertex AI Feature Store has a built-in freshness monitoring dashboard, but it only provides monitoring for distribution drift and skew, not for update timestamps.

How to eliminate wrong answers

Option A is wrong because Cloud Logging captures log entries but is not designed for real-time metric-based monitoring of feature freshness; it would require parsing logs and creating custom metrics, which is less direct than using Cloud Monitoring. Option B is wrong because Vertex AI Feature Store's monitoring dashboard focuses on feature value distribution drift and skew, not on freshness or update timestamps. Option D is wrong because Cloud Audit Logs record API calls for compliance and security, not the actual data update timestamps needed to measure feature freshness.
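
The value such a custom metric would carry is simply the age of the latest ingestion per feature. The following sketch computes that staleness with the standard library only; the feature names, timestamps, and the one-hour limit are hypothetical, and writing the value to Cloud Monitoring is deliberately left out.

```python
from datetime import datetime, timezone


def staleness_seconds(last_ingested, now):
    """Seconds elapsed since the feature was last ingested."""
    return (now - last_ingested).total_seconds()


def stale_features(ingestion_times, now, max_age_seconds):
    """Names of features whose last ingestion is older than the allowed age."""
    return sorted(
        name
        for name, ts in ingestion_times.items()
        if staleness_seconds(ts, now) > max_age_seconds
    )


# Hypothetical ingestion timestamps for two features.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ingestion_times = {
    "avg_basket_value": datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc),
    "days_since_last_order": datetime(2023, 12, 31, 12, 0, tzinfo=timezone.utc),
}

stale = stale_features(ingestion_times, now, max_age_seconds=3600)
print(stale)
```

Reporting the per-feature staleness as a gauge, rather than only a stale/fresh flag, lets the alert threshold be tuned later without changing the exporter.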

159
MCQ · easy

A data scientist wants to log prediction inputs and outputs for model monitoring. Which Google Cloud service is best suited for this?

A. Cloud Monitoring
B. Cloud Storage
C. Cloud Logging
D. BigQuery
Answer: C

Cloud Logging can ingest and store prediction logs.

Why this answer

Cloud Logging is the best choice because it is designed to ingest, store, and analyze log data, including custom log entries from applications. The data scientist can use the Cloud Logging API to write structured log entries containing prediction inputs and outputs, then query them using Logs Explorer or export them for further analysis. This aligns with the requirement to log prediction inputs and outputs for model monitoring, as Cloud Logging provides a centralized, scalable, and queryable log management service.

Exam trap

Google Cloud often tests the distinction between logging (Cloud Logging) and monitoring (Cloud Monitoring), where candidates mistakenly choose Cloud Monitoring because they think 'monitoring' includes logging, but Cloud Monitoring is for metrics and alerts, not for storing and querying log data.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring is focused on collecting metrics, uptime checks, and alerting on system performance (e.g., CPU utilization, latency), not on storing and querying arbitrary log data like prediction inputs and outputs. Option B is wrong because Cloud Storage is an object storage service for unstructured data (e.g., images, backups), not a log management service; it lacks native querying capabilities for log entries and is not designed for real-time log ingestion and search. Option D is wrong because BigQuery is a serverless data warehouse for analytical queries on large structured datasets, not a log management service; while it can store logs exported from Cloud Logging, it is not the primary service for ingesting and querying log entries in real time.
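
A structured log entry for one prediction might look like the sketch below. It only builds the JSON payload with the standard library; the field names (`model_version`, `input`, `output`) and the sample values are illustrative assumptions, and the step that actually sends the entry to Cloud Logging is not shown.

```python
import json
from datetime import datetime, timezone


def prediction_log_entry(model_version, instance, prediction, logged_at):
    """Serialize one prediction as a single JSON line (field names are illustrative)."""
    return json.dumps(
        {
            "severity": "INFO",
            "timestamp": logged_at.isoformat(),
            "model_version": model_version,
            "input": instance,
            "output": prediction,
        },
        sort_keys=True,
    )


entry = prediction_log_entry(
    model_version="v3",  # hypothetical version label
    instance={"temperature": 71.5, "vibration": 0.02},
    prediction={"failure_probability": 0.12},
    logged_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(entry)
```

Keeping inputs and outputs as nested objects rather than a formatted string makes the entries filterable by field once they are stored.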

160
MCQ · medium

A retailer uses BigQuery ML to build a linear regression model for sales forecasting. The model's evaluation shows high RMSE. Which step should they take first?

A. Use a more complex model like XGBoost
B. Increase the number of features
C. Set a larger training budget
D. Examine the data for outliers and missing values
Answer: D

Correct: Data quality inspection is the first step.

Why this answer

High RMSE in a linear regression model often indicates issues with data quality, such as outliers or missing values, which can disproportionately skew the model's predictions. BigQuery ML's linear regression is sensitive to such anomalies, so examining and cleaning the data is the most appropriate first step before considering model complexity or feature engineering.

Exam trap

Google Cloud often tests the misconception that high RMSE is always a model complexity issue, leading candidates to jump to advanced algorithms or feature engineering without considering fundamental data quality checks.

How to eliminate wrong answers

Option A is wrong because switching to a more complex model like XGBoost without first addressing data quality issues can amplify overfitting and does not fix the root cause of high RMSE. Option B is wrong because blindly increasing the number of features can introduce noise and multicollinearity, potentially worsening RMSE rather than improving it. Option C is wrong because setting a larger training budget in BigQuery ML does not improve model accuracy; it only allocates more resources for training, which is irrelevant when the issue stems from data problems.
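
A first-pass data quality check can be as simple as counting missing values and flagging points outside the usual 1.5 × IQR fences. The sketch below runs on a made-up list of daily sales in plain Python; in BigQuery the same check would normally be written in SQL, and the IQR rule is one common convention rather than the only choice.

```python
import statistics


def data_quality_report(values):
    """Count missing entries and list values outside the 1.5 * IQR fences."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < lower or v > upper]
    return {"missing": len(values) - len(present), "outliers": outliers}


# Hypothetical daily sales with two gaps and one suspicious spike.
daily_sales = [100, 102, 98, None, 101, 99, 103, 5000, 97, None]

report = data_quality_report(daily_sales)
print(report)
```

A single extreme value like the one flagged here can dominate a squared-error metric such as RMSE, which is why this inspection comes before any change of model.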

161
MCQ · hard

Two teams independently develop two different versions of a model for the same use case. They both deploy to the same Vertex AI endpoint, causing conflicts. What is the best way to manage multiple model versions and avoid conflicts in a collaborative environment?

A. Have each team work on a separate Google Cloud project
B. Use custom metadata to tag each version and rely on team coordination
C. Deploy each team's model to a separate endpoint
D. Use Vertex AI Model Registry with staging and production channels, and implement CI/CD to control promotions
Answer: D

Model registry with staging/production allows controlled version management and rollback.

Why this answer

Option D is correct because using a model registry with separate staging and production channels, with promotions gated by CI/CD, controls which version is promoted. Option C is wrong because deploying to different endpoints increases management overhead. Option B is wrong because versioning metadata does not enforce deployment order.

Option A is wrong because separate projects create silos and increase cost.

162
MCQ · easy

A company needs to extract entities (e.g., names, dates) from customer emails using a pre-trained model. Which service should they use?

A. Translation API
B. Natural Language API
C. Dialogflow
D. Vision API
Answer: B

Natural Language API can extract entities from text.

Why this answer

The Natural Language API provides entity extraction as a pre-trained model. Vision API is for images, Translation API for translation, and Dialogflow for conversational agents.

163
MCQ · hard

A team deploys a TensorFlow model using a custom container to Vertex AI Endpoint. The container expects the saved model at the /model directory, but predictions fail with a 'model not found' error. The team used the default Vertex AI serving container in the past. What is the most likely cause?

A. The container does not have a GPU accelerator configured.
B. The model artifact must be downloaded from Cloud Storage and placed in /gcs.
C. The container reads from a fixed directory /model, but Vertex AI mounts the model at /tmp/model.
D. The model was saved in a different format (e.g., SavedModel vs. HDF5).
Answer: C

Custom containers must adapt to the Vertex AI model mount point.

Why this answer

Option C is correct because the container hard-codes /model, while Vertex AI makes the model artifact available at the location given by the environment variable AIP_STORAGE_URI. The custom container must read from that location or copy the model into the directory it expects. Option D is wrong because the model format is not the issue.

Option B is wrong because Vertex AI does not require the model to be in a Cloud Storage bucket mounted at /gcs in this context. Option A is wrong because the container can run with or without a GPU; the error is about a file not being found.
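
In a custom container, the serving code can look up the artifact location at startup instead of assuming a fixed path. The following is a minimal sketch that only resolves the location; the helper name `resolve_model_location` and the fallback to /model are assumptions for illustration, and fetching or loading the model from that location is not shown.

```python
import os


def resolve_model_location(environ, default="/model"):
    """Prefer the location passed in AIP_STORAGE_URI over a hard-coded path."""
    return environ.get("AIP_STORAGE_URI") or default


# At container startup, read the real environment.
model_location = resolve_model_location(os.environ)
print("Loading model from:", model_location)
```

Passing the environment in as an argument keeps the lookup easy to test locally, where the variable is normally unset and the default path applies.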

164
MCQ · easy

An organization needs to serve a large model (10 GB) with low latency across multiple regions. Which Vertex AI feature best meets this requirement?

A. Private endpoints
B. Batch prediction
C. Model Monitoring
D. Global endpoints
Answer: D

Global endpoints automatically route to the closest region, providing low latency across regions.

Why this answer

Option D is correct because Vertex AI Global Endpoints automatically route traffic to the nearest region with capacity, reducing latency for geographically distributed users. Option B is for batch jobs, not real-time serving. Option A is for private access within a VPC, which does not address multi-region latency.

Option C is for monitoring, not serving.

165
MCQ · medium

You are an ML engineer at a fintech company. You have a prototype credit risk model built using XGBoost that achieves high accuracy on historical data. The model is trained on a dataset with 500,000 rows and 50 features. The company wants to deploy this model to production to score loan applications in real-time. The production environment must handle a peak load of 100 requests per second with a latency under 200ms. You have decided to use Vertex AI for deployment. After deploying the model as a Vertex AI endpoint with a single n1-standard-4 machine, you notice that latency exceeds 500ms at peak load and some requests time out. You have verified that the model prediction itself (excluding network overhead) takes about 50ms on average. What should you do to meet the latency and throughput requirements?

A. Change the machine type to a GPU-accelerated machine like n1-standard-4 with a T4 GPU.
B. Prune the model to reduce size and improve prediction speed.
C. Enable autoscaling with a minimum of 2 replicas and use a larger machine type (e.g., n1-standard-8) to handle more concurrent requests.
D. Switch from online prediction to batch prediction using Vertex AI Batch Prediction.
Answer: C

Autoscaling increases replicas to handle load, and a larger machine can process more requests concurrently, reducing queueing time.

Why this answer

Option C is correct because the latency bottleneck is not the model inference time (50ms) but the inability of a single n1-standard-4 machine to handle 100 concurrent requests per second without queuing. By enabling autoscaling with a minimum of 2 replicas and upgrading to n1-standard-8, you increase both the number of concurrent requests the endpoint can process and the CPU/memory resources per replica, reducing queue wait times and keeping total latency under 200ms. This directly addresses the throughput and latency requirements without changing the model or switching to batch processing.

Exam trap

The trap here is that candidates assume latency issues are always due to model inference speed (leading them to choose GPU or model pruning), when in fact the bottleneck is often the lack of horizontal scaling to handle concurrent requests under load.

How to eliminate wrong answers

Option A is wrong because adding a GPU (e.g., T4) does not reduce latency for XGBoost inference; XGBoost is CPU-optimized and GPU acceleration typically adds overhead for tree-based models, making latency worse. Option B is wrong because pruning the model (e.g., reducing tree depth or number of trees) would only marginally improve the 50ms prediction time, but the primary issue is queuing due to insufficient replicas to handle 100 requests per second, not the raw inference speed. Option D is wrong because batch prediction is designed for offline, asynchronous processing and cannot meet the real-time requirement of under 200ms latency per request; it also does not solve the concurrency problem for online scoring.
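
The queuing argument can be made concrete with Little's law: requests in flight = arrival rate × service time. The sketch below is a back-of-the-envelope estimate, assuming one in-flight request per vCPU and a 60% target utilization; both are simplifying assumptions chosen for illustration, not measured properties of the endpoint.

```python
import math


def replicas_needed(peak_rps, service_time_s, workers_per_replica, target_utilization=0.6):
    """Estimate replicas so that busy workers stay below the target utilization."""
    in_flight = peak_rps * service_time_s  # Little's law: L = lambda * W
    usable_workers = workers_per_replica * target_utilization
    return math.ceil(in_flight / usable_workers)


# 100 requests/s at 50 ms each keeps about 5 requests in flight at any moment.
in_flight = 100 * 0.05

# Assumption: an n1-standard-4 serves 4 requests at once, an n1-standard-8 serves 8.
with_n1_standard_4 = replicas_needed(100, 0.05, workers_per_replica=4)
with_n1_standard_8 = replicas_needed(100, 0.05, workers_per_replica=8)
print(in_flight, with_n1_standard_4, with_n1_standard_8)
```

Under these assumptions a single 4-vCPU replica is saturated, since about 5 requests are in flight against 4 workers, so requests queue and latency climbs well above the 50ms inference time. Two 8-vCPU replicas leave headroom.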

166
Multi-Select · hard

Which TWO actions should be taken to ensure reproducibility of ML experiments when collaborating across teams on Vertex AI?

Select 2 answers
A. Lock dependency versions in a container image used for training
B. Share notebooks via Colab Enterprise with real-time editing
C. Version control datasets using DVC or Vertex AI ML Metadata
D. Allow each team to use their own preferred environment
E. Always use random seeds for all random operations
Answers: A, C

Container images with fixed versions ensure environment reproducibility.

Why this answer

Locking dependency versions in a container image ensures that the exact same software environment (e.g., Python packages, CUDA libraries, system tools) is used every time a training job runs. This eliminates variability from package updates or OS patches, which is a fundamental requirement for reproducibility across teams. Vertex AI supports custom containers for training, making this a direct and reliable method.

Exam trap

The trap here is that candidates often think 'always use random seeds' is a safe blanket rule, but in practice, seeds must be explicitly set and logged per run, and some operations (e.g., certain GPU kernels) are inherently non-deterministic, making this option an oversimplification that is not a guaranteed action for reproducibility.
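
Setting and logging a seed per run can be sketched with the standard library alone. The example below uses a hypothetical `run_experiment` function and Python's `random` module as a stand-in for a training job; it does not address non-deterministic GPU kernels, which need framework-specific settings.

```python
import json
import random


def run_experiment(seed):
    """Run a toy 'experiment' with an explicit, recorded seed."""
    rng = random.Random(seed)  # isolated generator, no hidden global state
    sample = [rng.random() for _ in range(3)]
    return {"seed": seed, "sample": sample}


first = run_experiment(42)
second = run_experiment(42)

# Log the seed together with the result so the run can be repeated later.
print(json.dumps(first))
print("reproducible:", first == second)
```

Returning the seed with the result means the value that produced a run is stored next to its output rather than living only in someone's notebook.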

167
Multi-Select · hard

Which THREE of the following are valid ways to share a Vertex AI model across two different Google Cloud projects?

Select 3 answers
A. Use Vertex AI Model Registry's cross-project sharing feature with IAM conditions.
B. Publish the model to Google Cloud Marketplace.
C. Export the model to a Cloud Storage bucket accessible by both projects and import into the second project.
D. Use IAM to grant the second project's service account Vertex AI User role on the model resource.
E. Use the gcloud ai models copy command to copy the model across projects.
Answers: A, C, D

Model Registry supports sharing model versions across projects with fine-grained IAM.

Why this answer

Option A is correct because Vertex AI Model Registry supports cross-project sharing by allowing you to grant IAM roles with conditions on the model resource. This enables a model registered in one project to be accessed by a service account from another project without moving or copying the model artifacts.

Exam trap

The trap here is that candidates may assume a dedicated copy command exists for moving models across projects, but Vertex AI relies on IAM-based sharing or export/import workflows instead.

168
MCQ · easy

A small business wants to build a sentiment analysis model for customer reviews without writing any code. They have a small labeled dataset with 500 positive and 500 negative reviews. Which Google Cloud service should they use?

A. AutoML Natural Language
B. Natural Language API
C. Vertex AI custom training with PyTorch
D. BigQuery ML with logistic regression
Answer: A

Allows training a custom model with a GUI and no code.

Why this answer

AutoML Natural Language is the correct choice because it allows the business to train a custom sentiment analysis model using their own labeled dataset without writing any code. It provides a low-code interface for uploading data, training, and deploying the model, which aligns with the requirement of no coding and a small labeled dataset.

Exam trap

The trap here is that candidates often confuse the pre-trained Natural Language API with AutoML Natural Language, assuming the API can be customized with labeled data, but the API is fixed and cannot be retrained, while AutoML is designed for custom model training without code.

How to eliminate wrong answers

Option B is wrong because the Natural Language API is a pre-trained model that cannot be fine-tuned with custom labeled data; it only offers general sentiment analysis and would not leverage the business's specific 500/500 dataset. Option C is wrong because Vertex AI custom training with PyTorch requires writing code to define the model architecture and training loop, which violates the 'without writing any code' constraint. Option D is wrong because BigQuery ML with logistic regression is designed for structured tabular data, not for text sentiment analysis, and it would require feature engineering and SQL-based model definition, which is not a true no-code solution for text.

169
MCQ · medium

A data scientist is using Vertex AI Workbench user-managed notebooks. They need to collaborate with a colleague on the same notebook. The colleague should be able to edit the notebook simultaneously. What should they do?

A. Store the notebook in Cloud Source Repositories and have the colleague clone it
B. Share the underlying Compute Engine VM's SSH access with the colleague
C. Export the notebook to Colab and share the link
D. Share the notebook instance URL with the colleague; both can edit simultaneously
Answer: D

Vertex AI Workbench supports real-time collaboration through the same instance.

Why this answer

Vertex AI Workbench user-managed notebooks support real-time collaboration by sharing the notebook instance URL. When you share the URL with a colleague, both users can edit the notebook simultaneously because the underlying JupyterLab environment is multi-user and supports concurrent editing sessions. This is the intended method for synchronous collaboration on the same notebook instance.

Exam trap

Google Cloud often tests the misconception that version control (like Cloud Source Repositories) is the correct way to collaborate on notebooks, but the question specifically asks for simultaneous editing, which requires a real-time collaboration feature like sharing the notebook instance URL.

How to eliminate wrong answers

Option A is wrong because Cloud Source Repositories is a Git-based version control system for storing code, not a real-time collaboration tool; cloning a notebook does not allow simultaneous editing. Option B is wrong because sharing SSH access to the Compute Engine VM would give the colleague full system-level access, which is insecure and unnecessary for notebook editing, and it does not enable simultaneous editing in JupyterLab. Option C is wrong because exporting to Colab creates a separate copy of the notebook in a different environment, breaking the connection to the original Vertex AI Workbench instance and preventing simultaneous editing on the same notebook.

170
Multi-Select · easy

Which TWO actions can help reduce the latency of online prediction requests for a deep learning model served on Vertex AI?

Select 2 answers
A. Increase the number of CPU vCPUs per machine.
B. Set min_replica_count to 0 to avoid idle instances.
C. Use a GPU accelerator for the deployed model.
D. Decrease the number of replicas to reduce resource contention.
E. Enable request batching to process multiple inputs together.
Answers: C, E

GPU accelerates deep learning inference.

Why this answer

Using a GPU accelerator speeds up deep learning inference, and batching requests reduces the overhead per request. Reducing replicas (options B and D) does not help latency, and adding CPU vCPUs (option A) does not always help when a GPU is the better fit for the model.

171
MCQ · hard

A data scientist uses Vertex AI Pipelines to orchestrate an ML workflow. They want to reuse a component from Google's curated repository. What is the recommended way to incorporate it?

A. Import the component from Google Cloud Build
B. Use the 'aiplatform' Python SDK to define the component
C. Use prebuilt components from the Google Cloud Pipeline Components repository
D. Copy the component code into the pipeline definition
Answer: C

These are officially maintained and can be directly used in Vertex AI Pipelines.

Why this answer

Option C is correct because Google provides a curated set of prebuilt components in the Google Cloud Pipeline Components repository, which are designed to be directly imported and used within Vertex AI Pipelines. These components encapsulate common ML tasks (e.g., model training, deployment) and are maintained by Google, ensuring compatibility and reducing custom code. Using them is the recommended approach to avoid reinventing the wheel and to leverage Google's best practices.

Exam trap

The trap here is that candidates may confuse the 'aiplatform' SDK (used for direct API calls) with the pipeline components SDK, or assume that copying code is acceptable for reusability, when Google specifically recommends using the curated prebuilt components to ensure compatibility and reduce maintenance overhead.

How to eliminate wrong answers

Option A is wrong because Google Cloud Build is a CI/CD service for building and testing code, not a repository for reusable pipeline components; importing a component from Cloud Build would not provide the curated, prebuilt component logic needed for Vertex AI Pipelines. Option B is wrong because the 'aiplatform' Python SDK is used to interact with Vertex AI services (e.g., creating datasets, jobs) but does not define or import prebuilt pipeline components; defining a component from scratch would bypass the curated repository. Option D is wrong because copying component code into the pipeline definition defeats the purpose of reusability and maintainability, and it is not the recommended method; the curated repository provides versioned, tested components that should be referenced rather than duplicated.

172
Multi-Selecteasy

Which TWO actions are appropriate when you detect that a production model's prediction distribution has shifted significantly from the training distribution?

Select 2 answers
A.Immediately roll back to the previous model version
B.Increase logging for future predictions
C.Retrain the model using the most recent data
D.Investigate the cause of the shift before taking corrective action
E.Reduce the traffic to the model to minimize impact
AnswersC, D

Adapts model to new distribution.

Why this answer

Option C is correct because retraining the model on the most recent data directly addresses the distribution shift by adapting the model to the new data patterns. This is a standard practice in MLOps when the shift is confirmed and the cause is understood, ensuring the model remains accurate and reliable in production.

Exam trap

Google Cloud often tests the misconception that immediate rollback or traffic reduction is the correct first action, when in fact the proper response is to investigate the cause before taking corrective action like retraining.

173
MCQhard

A team is using Vertex AI AutoML to train a forecasting model. They need to retrain the model weekly and only if the new week's data significantly changes the data distribution. What is the most efficient way to achieve this?

A.Use a scheduled pipeline that always retrains
B.Use Cloud Monitoring alerts on data drift to trigger retraining
C.Use Vertex AI Model Monitoring to detect drift and trigger a pipeline
D.Use Cloud Functions on schedule to compare distributions
AnswerC

Native drift detection that can trigger retraining only when needed.

Why this answer

Option C is correct because Vertex AI Model Monitoring can be configured to detect data drift on the model's input features, and when drift exceeds a threshold, it can trigger a Cloud Function or a Vertex AI pipeline to retrain the model. This approach avoids unnecessary retraining when the data distribution has not changed significantly, which is more efficient than always retraining. The integration with Cloud Functions or Pub/Sub allows for a serverless, event-driven retraining pipeline that only runs when needed.

Exam trap

Google Cloud often tests the distinction between infrastructure monitoring (Cloud Monitoring) and model-specific monitoring (Vertex AI Model Monitoring), and candidates mistakenly choose Cloud Monitoring because they think it can detect data drift, but it lacks the statistical algorithms needed for feature-level distribution comparison.

How to eliminate wrong answers

Option A is wrong because a scheduled pipeline that always retrains ignores the requirement to retrain only when the new week's data significantly changes the data distribution, leading to wasted compute resources and potential model instability from unnecessary retraining. Option B is wrong because Cloud Monitoring alerts are designed for infrastructure and application metrics (e.g., CPU, latency), not for detecting data drift in model features; data drift detection requires feature-level statistical analysis, which is not a built-in capability of Cloud Monitoring. Option D is wrong because using Cloud Functions on a schedule to compare distributions would require custom code to compute statistical tests (e.g., KS test) and manage state, which is less efficient and more error-prone than using Vertex AI Model Monitoring's managed drift detection service.
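
As a rough illustration of the drift-gated retraining idea in this question, the sketch below compares two binned feature distributions with the Jensen-Shannon divergence and flags retraining only when the score crosses a threshold. The function names, the histogram counts, and the 0.1 threshold are all assumptions made for illustration. This is not Vertex AI Model Monitoring's implementation, which is a managed service configured with its own thresholds.

```python
import math

def normalize(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one observation")
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in bits; bins where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(baseline_counts, current_counts):
    """Symmetric, bounded distance between two histograms over the same bins."""
    p = normalize(baseline_counts)
    q = normalize(current_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def should_retrain(baseline_counts, current_counts, threshold=0.1):
    """Hypothetical gate: retrain only when drift exceeds the threshold."""
    return js_divergence(baseline_counts, current_counts) > threshold

baseline = [100, 300, 400, 200]      # hypothetical training-time histogram
same_week = [95, 310, 390, 205]      # nearly identical distribution
shifted_week = [500, 300, 150, 50]   # mass moved toward the first bin

print("stable week  -> retrain?", should_retrain(baseline, same_week))
print("shifted week -> retrain?", should_retrain(baseline, shifted_week))
```

In a weekly workflow, a check like this would run on each new week's data, and only a positive result would start the retraining pipeline.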

174
MCQhard

You are a machine learning engineer at a financial technology company. You have deployed a complex ensemble model consisting of three sub-models (XGBoost, TensorFlow, and PyTorch) for real-time fraud detection. The model is served on Vertex AI online prediction with a custom container that orchestrates the three models sequentially. The endpoint currently uses n1-highmem-8 machines with no accelerators. You are experiencing high latency (avg 500ms) during peak trading hours (9:30 AM - 4:00 PM EST), exceeding the 200ms SLA. The container is CPU-bound, and memory usage is around 60%. The model weights total 500 MB. You have already tried increasing the batch size per request from 1 to 4, which reduced latency slightly but not enough. The traffic pattern is very spiky, with sudden bursts of up to 1000 requests per second. Your goal is to meet the latency SLA without significantly increasing cost. Which action should you take?

A.Add a NVIDIA T4 GPU accelerator to the existing machine type.
B.Reduce the min_replica_count to 0 to allow scaling down aggressively and add more replicas during spikes.
C.Increase the machine type to n1-highmem-16 with more vCPUs.
D.Switch the model to Vertex AI batch prediction and run predictions every hour.
AnswerA

GPU accelerates the deep learning parts, reducing total latency.

Why this answer

Adding a GPU accelerator (e.g., NVIDIA T4) to the instances can significantly speed up the TensorFlow and PyTorch components, which are deep learning models. The XGBoost part runs on CPU but the overall latency bottleneck is likely the deep learning models. GPU will accelerate inference of those models, reducing total latency.

Increasing the vCPU count (Option C) would help only marginally because, per this explanation, the bottleneck is the deep learning inference itself. Reducing min_replica_count (Option B) may increase cold starts and request queuing during spikes. Switching to batch prediction (Option D) turns a real-time service into a batch job, which cannot meet the latency requirement.

175
MCQmedium

A company implements an ML pipeline using Vertex AI Pipelines. The pipeline trains a model using custom training jobs and then deploys it to an endpoint. The team notices that the endpoint occasionally serves an older model version for a few minutes after a new pipeline run completes. What is the most likely cause?

A.The new model artifact is temporarily unavailable, so the endpoint falls back to the previous version.
B.The prediction cache is returning cached results from the old model.
C.The pipeline failed to update the endpoint with the new model ID.
D.The endpoint is configured with a canary traffic split, and the old model is still receiving a fraction of traffic during the rollout.
AnswerD

Canary deployments gradually shift traffic, so some requests hit the old model until the rollout is complete.

Why this answer

D is correct because Vertex AI endpoints can be configured with a canary (gradual) traffic rollout strategy. When a new model is deployed, traffic is shifted incrementally from the old model to the new one over a specified duration. During this rollout window, the old model continues to serve a fraction of requests, which explains why users occasionally see the older model version for a few minutes after the pipeline completes.

Exam trap

The trap here is that candidates confuse a canary rollout with a deployment failure or caching issue, assuming the old model persists due to an error rather than recognizing it as an intentional traffic-splitting mechanism during a gradual rollout.

How to eliminate wrong answers

Option A is wrong because Vertex AI endpoints do not automatically fall back to a previous model version when a new artifact is temporarily unavailable; instead, the deployment would fail or the endpoint would return an error. Option B is wrong because Vertex AI endpoints do not have a built-in prediction cache that returns cached results from an old model; caching is not a default behavior for model serving. Option C is wrong because if the pipeline failed to update the endpoint with the new model ID, the endpoint would consistently serve the old model, not just occasionally for a few minutes.

176
MCQeasy

You have an online prediction model that is showing increasing prediction latency. You have already verified that the request rate and input data size are unchanged. Which of the following should you investigate next?

A.Check if the model was recently updated to a larger version
B.Check the monitoring dashboard configuration
C.Check if the feature engineering logic was changed
D.Check the geographic location of the endpoint
AnswerA

Larger model increases inference latency.

Why this answer

If request rate and input data size are unchanged, increased prediction latency often points to a change in the model itself. A larger model (e.g., deeper neural network, more parameters) requires more computation per inference, directly increasing latency. This is a common root cause when monitoring ML pipelines, as model version updates can silently alter performance characteristics.

Exam trap

Google Cloud often tests the distinction between network-level latency (e.g., geographic location) and compute-level latency (e.g., model size), tempting candidates to pick the geographic option when the root cause is model-related.

How to eliminate wrong answers

Option B is wrong because the monitoring dashboard configuration only affects how metrics are displayed or alerted, not the underlying latency of predictions. Option C is wrong because feature engineering logic changes would alter input data size or structure, but the question states input data size is unchanged. Option D is wrong because the geographic location of the endpoint affects network latency, not the model's prediction latency (which is server-side compute time).

177
MCQhard

A company uses Vertex AI Prediction with a custom container for a TensorFlow model. They notice that after deploying a new model version, requests still go to the old version. What is the most likely cause?

A.The custom container is not compatible with Vertex AI
B.The model is cached and needs cache invalidation
C.Traffic is not split to the new model version
D.The new model version was not deployed to the same endpoint
AnswerC

Traffic splitting must be adjusted to route to the new version.

Why this answer

In Vertex AI Prediction, when you deploy a new model version to an existing endpoint, you must explicitly allocate traffic to it. By default, the new version receives 0% traffic, so all requests continue to be served by the old version. The correct fix is to update the endpoint's traffic split, for example via the console or the `gcloud ai endpoints update` command with the `--traffic-split` flag.

Exam trap

Google Cloud often tests the misconception that deploying a new model version automatically replaces the old one, when in fact Vertex AI requires an explicit traffic split update to shift requests to the new version.

How to eliminate wrong answers

Option A is wrong because Vertex AI supports custom containers for TensorFlow models as long as they implement the required HTTP health check and prediction endpoints; incompatibility would cause deployment failure, not silent routing to an old version. Option B is wrong because Vertex AI does not cache model predictions at the endpoint level; caching is not a factor in traffic routing between model versions. Option D is wrong because deploying to the same endpoint is exactly what the user did; the issue is that the new version was deployed but not given any traffic share, not that it was deployed to a different endpoint.
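
To make the traffic-split behaviour concrete, here is a minimal sketch in plain Python that models an endpoint's split as a mapping from deployed-model ID to percentage. The `route` and `count_routes` helpers and the IDs `old-v1` and `new-v2` are hypothetical. The real split is stored on the Vertex AI endpoint and changed through the console, gcloud, or the SDK, not by code like this.

```python
def validate_split(split):
    """A split maps deployed-model IDs to integer percentages summing to 100."""
    if any(p < 0 for p in split.values()):
        raise ValueError("percentages must be non-negative")
    if sum(split.values()) != 100:
        raise ValueError("percentages must sum to 100")

def route(split, request_index):
    """Deterministically map a request to a model in proportion to the split."""
    validate_split(split)
    slot = request_index % 100
    upper = 0
    for model_id, percent in split.items():
        upper += percent
        if slot < upper:
            return model_id
    raise AssertionError("unreachable when the split sums to 100")

def count_routes(split, n_requests):
    """Count how many of n_requests each deployed model would serve."""
    counts = {model_id: 0 for model_id in split}
    for i in range(n_requests):
        counts[route(split, i)] += 1
    return counts

# New version deployed but given no traffic: every request still hits the old one.
after_deploy = {"old-v1": 100, "new-v2": 0}
# Split updated explicitly: the new version now serves all requests.
after_update = {"old-v1": 0, "new-v2": 100}

print(count_routes(after_deploy, 1000))
print(count_routes(after_update, 1000))
```

The first call shows the symptom from the question: the new version exists on the endpoint but serves nothing until the split changes.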

178
MCQmedium

You are monitoring a machine learning pipeline that runs on Vertex AI Pipelines. The pipeline occasionally fails with a 'ResourceExhausted' error when attempting to read data from BigQuery. Which action should you take to resolve this issue?

A.Switch from BigQuery to Cloud Storage for data source
B.Increase the memory allocated to the pipeline step
C.Reduce the complexity of the BigQuery query or increase the reservation size
D.Reduce the batch size of the data being read
AnswerC

ResourceExhausted error is due to BigQuery limits; simplifying query or increasing slots can help.

Why this answer

The 'ResourceExhausted' error when reading from BigQuery indicates that the query is consuming more resources than the BigQuery reservation allows. Option C is correct because reducing query complexity (e.g., using fewer JOINs, aggregations, or partitions) or increasing the reservation size directly addresses the root cause by either lowering resource demand or allocating more capacity. Other options like switching to Cloud Storage or adjusting pipeline memory do not fix the BigQuery-specific quota or slot exhaustion.

Exam trap

Google Cloud often tests the misconception that memory or batch size adjustments in the pipeline environment can fix backend service quota errors, when in fact the error is specific to BigQuery's resource management (slots/queries) and requires query optimization or reservation changes.

How to eliminate wrong answers

Option A is wrong because switching to Cloud Storage does not resolve the BigQuery resource exhaustion; it changes the data source but introduces new latency and format compatibility issues without addressing the query's resource consumption. Option B is wrong because increasing memory allocated to the pipeline step only affects the compute environment (e.g., the container running the pipeline), not the BigQuery service's slot or query quota limits. Option D is wrong because reducing the batch size of data being read may reduce memory pressure on the pipeline but does not affect the BigQuery query's resource usage; the error originates from BigQuery's backend, not from the client-side read volume.

179
MCQhard

A company has a prototype ML model that achieves 85% accuracy on historical data. In production, accuracy drops to 70% after two weeks due to data drift. They need an automated retraining pipeline with minimal manual oversight. Which solution is most cost-effective?

A.Use Cloud Functions to trigger a Dataflow job that trains the model using custom containers
B.Deploy the model on a GPU-equipped Compute Engine VM and run retraining every time new data arrives
C.Set up Vertex AI Model Monitoring to detect drift, which triggers a Cloud Function that submits a Vertex AI Training job with new data
D.Schedule a weekly Cloud Composer DAG that runs a new training job with all available data
AnswerC

Monitoring detects drift, automation triggers retraining with new data, cost-effective.

Why this answer

Option C is correct because it combines automated drift detection via Vertex AI Model Monitoring with a serverless retraining trigger (Cloud Function) that submits a Vertex AI Training job, minimizing manual oversight while only incurring costs when drift is detected. This avoids the expense of continuous retraining or always-on GPU instances, making it the most cost-effective solution for the described scenario.

Exam trap

The trap here is that candidates often choose scheduled retraining (Option D) as the simplest automation, overlooking the cost savings and precision of event-driven retraining triggered by actual drift detection, which is a key concept in the PMLE exam for scaling prototypes to production.

How to eliminate wrong answers

Option A is wrong because using Cloud Functions to trigger a Dataflow job for training with custom containers introduces unnecessary complexity and cost for batch processing of training data, whereas Vertex AI Training is purpose-built for ML model training and integrates seamlessly with drift detection. Option B is wrong because deploying a GPU-equipped Compute Engine VM for retraining every time new data arrives incurs high costs for idle GPU time and requires manual management of the VM lifecycle, contradicting the requirement for minimal manual oversight. Option D is wrong because scheduling a weekly Cloud Composer DAG to retrain with all available data ignores the cost savings of event-driven retraining triggered by actual drift, and may waste resources retraining when no drift has occurred.

180
Multi-Selecthard

An ML engineer is building a monitoring dashboard for a Vertex AI pipeline that includes training, evaluation, and batch prediction. Which THREE components should be included to provide comprehensive observability? (Select THREE.)

Select 3 answers
A.Pipeline execution status, duration, and failure rates for each component.
B.Compute engine CPU and memory logs for each pipeline step.
C.Model evaluation metrics (e.g., accuracy, AUC) after training and validation.
D.Data validation reports showing anomaly counts and feature statistics.
E.Online prediction latency and request count from the deployed model endpoint.
AnswersA, C, D

Core pipeline health metrics.

Why this answer

Option A is correct because pipeline execution status, duration, and failure rates are fundamental metrics for monitoring the health and performance of a Vertex AI pipeline. These metrics allow the ML engineer to quickly identify bottlenecks, track overall workflow progress, and detect failures in training, evaluation, or batch prediction steps, which is essential for comprehensive observability.

Exam trap

The trap here is that candidates often confuse infrastructure monitoring (CPU/memory logs) or serving-layer metrics (online prediction latency) with pipeline-specific observability, leading them to select options that are relevant to different stages of the ML lifecycle rather than the pipeline itself.
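
The pipeline-health metrics in Option A can be derived from run records with simple aggregation. The sketch below assumes a hypothetical record shape (`component`, `status`, `duration_s`) and made-up values. Real run metadata would come from the pipeline service, and its field names may differ.

```python
from collections import defaultdict

# Hypothetical run records; the field names are assumptions for illustration.
runs = [
    {"component": "training", "status": "SUCCEEDED", "duration_s": 1200},
    {"component": "training", "status": "FAILED", "duration_s": 300},
    {"component": "evaluation", "status": "SUCCEEDED", "duration_s": 90},
    {"component": "batch_prediction", "status": "SUCCEEDED", "duration_s": 600},
    {"component": "batch_prediction", "status": "SUCCEEDED", "duration_s": 660},
]

def summarize(records):
    """Group records by component and compute run count, failure rate, mean duration."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["component"]].append(record)
    summary = {}
    for component, items in grouped.items():
        failures = sum(1 for r in items if r["status"] == "FAILED")
        summary[component] = {
            "runs": len(items),
            "failure_rate": failures / len(items),
            "mean_duration_s": sum(r["duration_s"] for r in items) / len(items),
        }
    return summary

summary = summarize(runs)
for component, stats in summary.items():
    print(component, stats)
```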

181
MCQeasy

Refer to the exhibit. A team runs this command to upload a model to Vertex AI. They want to create this model as a new version under an existing model named 'my_model'. What is missing from the command?

A.--description='Second version'
B.--version=v2
C.--labels=team=ml
D.--service-account=sa@project.iam.gserviceaccount.com
E.--parent-model=my_model
AnswerE

The --parent-model flag indicates the existing model to add a version to.

Why this answer

Option E is correct because the `--parent-model` flag is required when uploading a new model version to an existing model in Vertex AI. Without specifying the parent model name, the command would attempt to create a brand-new model rather than adding a version to the existing 'my_model'. The `gcloud ai models upload` command uses this flag to associate the new version with the specified parent model.

Exam trap

Google Cloud often tests the distinction between creating a new model versus adding a version to an existing model, and the trap here is that candidates assume a `--version` flag exists (like in some other services) instead of recognizing the required `--parent-model` parameter.

How to eliminate wrong answers

Option A is wrong because `--description` is an optional metadata field and does not affect versioning or parent-model association. Option B is wrong because Vertex AI does not support a `--version` flag; model versions are automatically assigned by the service based on the order of uploads under the same parent model. Option C is wrong because `--labels` are optional key-value pairs for organizing resources and have no role in version creation. Option D is wrong because `--service-account` is used for specifying a custom service account for model deployment, not for versioning or parent-model linkage.

182
MCQeasy

To enable collaboration on notebook-based experiments across teams, what is the recommended approach in Google Cloud?

A.Use Colab Enterprise notebooks with shared runtimes and IAM permissions
B.Share Docker images containing the notebook environment
C.Each team member works on their own local Jupyter notebook and shares screenshots
D.Store notebooks in a Cloud Storage bucket and open them with Vertex AI Workbench
AnswerA

Colab Enterprise enables collaborative editing and shared compute resources.

Why this answer

Colab Enterprise notebooks with shared runtimes and IAM permissions is the recommended approach because it provides a fully managed, collaborative environment where multiple users can work on the same notebook simultaneously, with fine-grained access control via IAM and consistent runtime configurations. This eliminates version conflicts and environment drift, which are common in distributed notebook workflows.

Exam trap

Google Cloud often tests the misconception that shared storage (like Cloud Storage) alone is sufficient for collaboration, but the key requirement is shared runtimes and concurrent editing, which only Colab Enterprise provides among the options.

How to eliminate wrong answers

Option B is wrong because sharing Docker images containing the notebook environment addresses environment reproducibility but does not enable real-time collaboration or shared runtime execution; each user would still need to launch their own instance and manually sync changes. Option C is wrong because each team member working on their own local Jupyter notebook and sharing screenshots is a manual, non-scalable approach that lacks version control, concurrent editing, and centralized data access, making it unsuitable for team collaboration. Option D is wrong because storing notebooks in a Cloud Storage bucket and opening them with Vertex AI Workbench provides shared storage but does not inherently support shared runtimes or concurrent editing; Vertex AI Workbench instances are typically single-user, and multiple users would need to coordinate access to avoid conflicts.

183
MCQhard

A retail company deployed a demand forecasting model using TensorFlow on Vertex AI Batch Prediction. The model runs weekly on a large dataset stored in BigQuery. Over the past month, the prediction accuracy has degraded significantly. The ML engineer reviews the monitoring dashboard and sees that the feature distribution for 'product_price' has shifted from a mean of $50 to $55, and the new product category 'electronics' now represents 20% of the data, whereas it was only 5% in training. The model was never retrained after initial deployment six months ago. The engineer also notices that the Vertex Explainable AI feature importance scores have changed: 'product_price' used to be the top feature (importance 0.35) but now ranks third (importance 0.20). The company requires minimal downtime and wants to improve accuracy as quickly as possible without incurring high costs from excessive retraining. Which course of action should the ML engineer take?

A.Increase the complexity of the model by switching from a feedforward neural network to a gradient boosted tree ensemble, and then deploy without retraining.
B.Route all predictions to human reviewers until the model can be re-evaluated, and then manually correct the outputs.
C.Retrain the model using the most recent 3 months of data, including all new product categories, and deploy the updated model via a new Vertex AI endpoint.
D.Adjust the prediction threshold for the 'product_price' feature to account for the price shift, and monitor for another month.
AnswerC

Retraining with recent data addresses both covariate shift and concept drift, and is the standard approach for maintaining accuracy.

Why this answer

The correct action is to retrain the model with the latest data (Option C) because the feature distributions and data composition have changed significantly (covariate shift and concept drift). Simply switching to a more complex model (Option A) may overfit without addressing the underlying drift. Adjusting thresholds (Option D) is insufficient because the model's predictions are likely inaccurate. Routing all predictions to human reviewers (Option B) is costly and not scalable; retraining is the proper response.
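
The shift described in this scenario can be quantified with simple statistics. The sketch below computes the relative change in the `product_price` mean and the L-infinity distance between the training and serving category proportions. Only the figures $50, $55, 5%, and 20% come from the question. Lumping every other category into a single `other` bucket, and the function names, are assumptions for illustration.

```python
def relative_mean_shift(train_mean, serving_mean):
    """Relative change of a feature mean between training and serving data."""
    return (serving_mean - train_mean) / train_mean

def linf_distance(train_props, serving_props):
    """Largest absolute difference in proportion across all categories."""
    categories = set(train_props) | set(serving_props)
    return max(
        abs(train_props.get(c, 0.0) - serving_props.get(c, 0.0))
        for c in categories
    )

# 'other' is an assumed bucket holding every category except electronics.
train_categories = {"electronics": 0.05, "other": 0.95}
serving_categories = {"electronics": 0.20, "other": 0.80}

price_shift = relative_mean_shift(50.0, 55.0)
category_shift = linf_distance(train_categories, serving_categories)

print(f"product_price mean shift: {price_shift:.0%}")
print(f"category L-infinity distance: {category_shift:.2f}")
```

Both numbers are well away from zero, which is consistent with the explanation's conclusion that the model should be retrained on recent data.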

184
MCQhard

A large organization uses a multi-project setup with a central data lake. Different teams manage their own models. To enable cross-team sharing of features, they want to use Vertex AI Feature Store. What is the best practice to manage access?

A.Create a single Feature Store in a central project and grant fine-grained IAM roles
B.Export features to Cloud Storage
C.Create separate Feature Stores per team project
D.Use BigQuery authorized views
AnswerA

A central Feature Store with IAM enables sharing while controlling access.

Why this answer

Creating a single Feature Store in a central project with fine-grained IAM roles is the best practice because it centralizes feature management while allowing cross-team access control at the feature group or feature level. Vertex AI Feature Store supports IAM roles like `roles/aiplatform.featurestoreAdmin` and `roles/aiplatform.featurestoreDataViewer` to grant granular permissions, enabling teams to share features without duplicating data or exposing sensitive information. This approach avoids data silos and ensures consistent governance across the organization.

Exam trap

Google Cloud often tests the misconception that separate Feature Stores per team are needed for isolation, but the correct approach is to use a single Feature Store with fine-grained IAM to enable sharing while maintaining security.

How to eliminate wrong answers

Option B is wrong because exporting features to Cloud Storage introduces data duplication, latency, and manual synchronization overhead, defeating the purpose of a centralized feature store for real-time serving. Option C is wrong because creating separate Feature Stores per team project creates data silos, preventing cross-team sharing and requiring complex cross-project networking or data replication. Option D is wrong because BigQuery authorized views are designed for table-level access control in BigQuery, not for managing access to Vertex AI Feature Store entities like feature groups or online/offline stores, and they lack the low-latency serving capabilities of Feature Store.

185
Multi-Selecthard

Which THREE of the following are valid best practices when using Vertex AI AutoML for tabular data?

Select 3 answers
A.Normalize the data into multiple tables to reduce data size
B.Enable automatic feature engineering to improve model performance
C.Disable early stopping for best model quality if budget allows
D.Use the max time budget parameter to control costs
E.Keep training data with heavy class imbalance as-is to let AutoML correct it
AnswersB, C, D

Creates cross features and handles missing values.

Why this answer

Option B is correct because enabling automatic feature engineering in Vertex AI AutoML for tabular data allows the service to automatically create, select, and transform features (e.g., polynomial combinations, cross features, and numerical transformations) to improve model accuracy without manual intervention. This is a built-in capability that leverages Google's AutoML algorithms to discover the most predictive feature representations from the raw data.

Exam trap

Google Cloud often tests whether candidates can separate training-budget controls from data-preparation mistakes. Per the answer key, capping the training budget (Option D) and disabling early stopping when the budget allows (Option C) both count as valid practices, whereas splitting the data into multiple normalized tables (Option A) and leaving heavy class imbalance untouched (Option E) do not.

186
MCQmedium

A retail company wants to build a product recommendation system using BigQuery ML for their e-commerce platform. The data includes customer purchase history, product metadata, and clickstream logs. The ML engineer needs to minimize manual feature engineering and leverage pre-built solutions. Which approach should the engineer take?

A.Use a pre-built recommendation model from Vertex AI Model Garden and deploy it to an endpoint.
B.Write a custom TensorFlow model using the Vertex AI Training service and deploy it via Vertex AI Prediction.
C.Export the data to CSV and use AutoML Tables to train a recommendation model.
D.Use BigQuery ML's matrix factorization model (CREATE MODEL with model_type='matrix_factorization') to train directly on historical interaction data.
AnswerD

BigQuery ML provides low-code matrix factorization for recommendations.

Why this answer

Option D is correct because BigQuery ML's matrix factorization model (model_type='matrix_factorization') is purpose-built for recommendation systems using implicit or explicit feedback data. It trains directly on historical interaction data (e.g., user-item purchases) without requiring manual feature engineering, which matches the goal of minimizing manual effort with a low-code, pre-built solution. This approach leverages BigQuery's native SQL interface and scales automatically, making it ideal for the described e-commerce scenario.

Exam trap

The trap here is that candidates may assume Vertex AI Model Garden (Option A) is the go-to for pre-built ML, but it does not offer a pre-trained recommendation model that can be directly deployed without custom training on the company's data.

How to eliminate wrong answers

Option A is wrong because Vertex AI Model Garden provides pre-built models for tasks like vision or NLP, not a ready-to-use recommendation model that can be directly deployed without training on the company's specific interaction data. Option B is wrong because writing a custom TensorFlow model and training it via Vertex AI Training contradicts the requirement to minimize manual feature engineering and leverage pre-built solutions. Option C is wrong because exporting data to CSV and using AutoML Tables would require additional data preparation and does not natively handle the user-item interaction structure as efficiently as BigQuery ML's matrix factorization, which operates directly on the data in place.
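
For intuition about what a matrix factorization recommender learns, the following toy sketch factorizes a small user-item rating table with stochastic gradient descent in plain Python. It is not how BigQuery ML trains its `matrix_factorization` model, and the ratings, hyperparameters, and function names are invented for illustration.

```python
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Learn k-dimensional user and item vectors whose dot product approximates ratings."""
    rng = random.Random(seed)
    users = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_users)]
    items = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(users[u], items[i]))
            err = r - pred
            for f in range(k):
                uf, itf = users[u][f], items[i][f]
                # Gradient step on squared error with L2 regularization.
                users[u][f] += lr * (err * itf - reg * uf)
                items[i][f] += lr * (err * uf - reg * itf)
    return users, items

def predict(users, items, u, i):
    """Predicted rating is the dot product of the user and item vectors."""
    return sum(a * b for a, b in zip(users[u], items[i]))

def rmse(ratings, users, items):
    """Root mean squared error over the observed ratings."""
    se = sum((r - predict(users, items, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

# Hypothetical (user, item, rating) triples: users 0-1 like items 0-1, users 2-3 like items 2-3.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

users, items = train_mf(ratings, n_users=4, n_items=4)
print("training RMSE:", round(rmse(ratings, users, items), 3))
print("predicted rating for unseen pair (user 0, item 3):", round(predict(users, items, 0, 3), 2))
```

The point of the sketch is that the model needs only interaction triples as input, which is why this model type fits the "minimal feature engineering" requirement.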

187
MCQmedium

A logistics company uses a regression model to predict delivery times. The model currently uses features: distance (km), traffic index, weather condition, and time of day. The data scientist notices that the model's predictions are systematically too low for deliveries during peak traffic hours. Which action would best address this issue?

A.Switch to a deep neural network model
B.Remove the traffic index feature as it is causing bias
C.Add a cross-feature that multiplies distance by traffic index
D.Collect more training data during peak traffic hours
AnswerC

This interaction term allows the model to capture the combined effect.

Why this answer

The model's systematic underestimation during peak traffic hours indicates a missing interaction effect between distance and traffic. Adding a cross-feature (distance × traffic index) allows a linear model to capture the non-linear relationship where traffic disproportionately increases delivery time over longer distances. This directly addresses the bias without discarding useful data or unnecessarily complicating the model.

Exam trap

Google Cloud often tests the misconception that systematic bias is always due to insufficient data or the wrong model type, when in fact it is frequently caused by missing feature interactions that can be fixed with simple feature engineering.

How to eliminate wrong answers

Option A is wrong because switching to a deep neural network is overkill and does not guarantee fixing systematic bias; it may even introduce overfitting without addressing the root cause of missing feature interactions. Option B is wrong because removing the traffic index feature would eliminate a key predictor entirely, likely worsening the model's accuracy and increasing bias rather than correcting it. Option D is wrong because collecting more data during peak hours would not fix the model's inability to model the interaction between distance and traffic; the model would still systematically underpredict unless the feature representation is improved.
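
The effect of the cross-feature can be checked on synthetic data. In the sketch below, delivery time is generated as `2*distance + 3*distance*traffic` (an assumed ground truth, not data from the question), and two least-squares models are fitted in plain Python: one with additive features only, one with the added `distance * traffic` term. It shows a simplified version of the bias: the additive model under-predicts long deliveries in heavy traffic.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(rows, targets):
    """Least squares via the normal equations (X^T X) w = X^T y."""
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(n)]
    return solve(xtx, xty)

def predict(weights, row):
    return sum(w * v for w, v in zip(weights, row))

# Synthetic ground truth (an assumption): minutes = 2*distance + 3*distance*traffic
data = [(d, t, 2 * d + 3 * d * t) for d in (5, 10, 20, 40) for t in (0.0, 0.5, 1.0)]
targets = [y for _, _, y in data]

additive = fit_ols([[1.0, d, t] for d, t, _ in data], targets)
crossed = fit_ols([[1.0, d, t, d * t] for d, t, _ in data], targets)

d, t, actual = 40, 1.0, 200.0  # a long delivery at peak traffic
additive_pred = predict(additive, [1.0, d, t])
crossed_pred = predict(crossed, [1.0, d, t, d * t])
print("actual minutes:", actual)
print("additive model:", round(additive_pred, 2))
print("with cross-feature:", round(crossed_pred, 2))
```

Because the synthetic target contains exactly the interaction term, the second model recovers it almost perfectly; on real data the improvement would be smaller and should be verified on a held-out set.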

188
MCQmedium

A machine learning engineer notices that the online prediction latency for a custom TensorFlow model deployed on Vertex AI has increased significantly over the past week. Cloud Monitoring shows that the CPU utilization of the endpoints remains below 40%, but the number of concurrent requests has doubled. What is the most likely cause of the latency increase?

A.Data skew causing longer inference time
B.Memory leak in the serving container
C.Insufficient number of replicas for autoscaling
D.Model overfitting
AnswerC

If the number of replicas is not scaling fast enough to match increased concurrency, requests queue up, leading to higher latency while each replica's CPU is underutilized.

Why this answer

Option C is correct because the CPU utilization remains below 40% while concurrent requests have doubled, indicating that the existing replicas are not saturated on CPU but are bottlenecked by request queuing or thread contention. Vertex AI autoscaling scales based on CPU utilization by default; if the threshold is not crossed, new replicas are not provisioned, causing requests to queue and latency to spike. The engineer should verify the autoscaling configuration and consider scaling on request count or reducing the CPU utilization target.

Exam trap

Google Cloud often tests the misconception that low CPU utilization always means there is spare capacity, when in reality the bottleneck can be request queuing or thread pool exhaustion that does not raise CPU usage.

How to eliminate wrong answers

Option A is wrong because data skew would cause a persistent increase in per-request inference time, but the observation shows CPU utilization is low and latency increased only after request volume doubled, not due to a change in data distribution. Option B is wrong because a memory leak would manifest as increasing memory usage over time, potentially causing OOM kills or garbage collection pauses, but the described symptom is low CPU and doubled concurrency, not memory pressure. Option D is wrong because model overfitting affects prediction accuracy, not inference latency; overfitting does not change the computational cost of a forward pass.
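The scaling behaviour described for this question can be illustrated with a minimal sketch. It assumes a simplified proportional rule in which the desired replica count is the current count multiplied by the ratio of the observed metric to its target. The function name `desired_replicas` and all numbers are hypothetical, and the real Vertex AI autoscaler may differ in detail.

```python
import math

def desired_replicas(current_replicas: int, observed: float, target: float) -> int:
    """Simplified proportional autoscaling rule (an assumption, not Vertex AI's exact algorithm)."""
    return max(1, math.ceil(current_replicas * observed / target))

# CPU-based scaling: 40% observed utilization against a 60% target.
# The ratio is below 1, so the replica count does not grow.
cpu_based = desired_replicas(current_replicas=2, observed=0.40, target=0.60)

# Request-based scaling: 20 concurrent requests per replica against a target of 10.
# The ratio is 2, so the replica count doubles.
request_based = desired_replicas(current_replicas=2, observed=20, target=10)

print(cpu_based, request_based)
```

Under these assumptions the CPU signal never asks for more replicas even though concurrency has doubled, which is why a request-based metric or a lower CPU target is worth considering.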

189
MCQeasy

A data scientist needs to share a BigQuery dataset with a colleague in a different team so they can run queries. What is the simplest and most secure way to grant access?

A.Export the dataset to Cloud Storage and share the bucket
B.Add the colleague's account as a BigQuery Data Viewer on the dataset
C.Share the service account key of a BigQuery job user with the colleague
D.Add the colleague's account as a Project Viewer on the entire project
AnswerB

Direct IAM binding on the dataset provides least-privilege access.

Why this answer

Option B is correct because granting the BigQuery Data Viewer role on the dataset (a dataset-level IAM binding) gives fine-grained access to that specific dataset only. Option A is wrong because exporting to Cloud Storage adds unnecessary complexity and leaves the colleague with stale data. Option C is wrong because sharing a service account key is a security risk.

Option D is wrong because Project Viewer on the entire project gives too much access.

190
MCQeasy

Your team is using Vertex AI Pipelines to build an automated training pipeline. You need to share the pipeline definition with another team so they can run it in their own project. Which format should you use?

A.Copy the pipeline artifacts to a Cloud Storage bucket and share the bucket.
B.Package the pipeline as a Docker container image.
C.Share the Python code that compiles the pipeline.
D.Export the pipeline as a YAML file using the Kubeflow Pipelines SDK.
AnswerD

YAML file defines the pipeline graph and components.

Why this answer

Option D is correct because Vertex AI Pipelines is built on Kubeflow Pipelines, and the standard way to share a pipeline definition is to export it as a YAML file using the Kubeflow Pipelines SDK (`kfp.compiler.Compiler().compile()`). This YAML file contains the complete pipeline specification, including all components, dependencies, and execution order, and can be uploaded and run in any Vertex AI project without requiring the original Python code or build environment.

Exam trap

Google Cloud often tests the misconception that sharing the Python code or Docker images is sufficient for pipeline portability, but the exam expects you to recognize that the compiled YAML is the portable, self-contained artifact that decouples pipeline definition from the build environment.

How to eliminate wrong answers

Option A is wrong because copying pipeline artifacts (such as intermediate outputs or model files) does not share the pipeline definition itself; the other team would need the pipeline specification to recreate the execution graph. Option B is wrong because a Docker container image packages the runtime environment and code for a single component, not the entire pipeline DAG (directed acyclic graph) definition. Option C is wrong because sharing the Python compilation code requires the other team to have the exact same dependencies, SDK versions, and build environment to reproduce the pipeline, which is error-prone and not the intended portable format.

191
MCQeasy

An ML engineer is monitoring a Vertex AI Feature Store used for online serving. Which metrics are most important to track for ensuring low-latency online serving?

A.Number of feature stores and feature values.
B.Storage utilization and write throughput to the feature store.
C.Batch export duration and number of exported features.
D.Feature value retrieval latency (p99) and error rate.
AnswerD

These directly affect online serving performance.

Why this answer

For online serving, the primary concern is the latency and reliability of feature value retrieval at inference time. The p99 retrieval latency directly measures the worst-case delay experienced by users, while the error rate captures failures that could cause serving disruptions. Other metrics like storage utilization or batch export duration are relevant for offline or batch pipelines, not real-time serving.

Exam trap

The trap here is that candidates confuse metrics for offline batch operations (like export duration) with those for online serving, or assume that storage-level metrics (like utilization) are sufficient for performance monitoring, when in fact only retrieval latency and error rate directly reflect the serving quality.

How to eliminate wrong answers

Option A is wrong because the number of feature stores and feature values does not directly impact serving latency; it is a capacity planning metric, not a performance indicator. Option B is wrong because storage utilization and write throughput are important for data ingestion and maintenance, but they do not measure the online retrieval performance that affects inference latency. Option C is wrong because batch export duration and number of exported features pertain to offline batch serving or data export jobs, not the low-latency online serving path.
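As a rough illustration of the two metrics named in the correct answer, the sketch below computes a nearest-rank p99 latency and an error rate from raw request samples. The helper names, the sample values, and the choice to count status codes of 500 and above as errors are assumptions made for the example. In practice these figures would normally come from Cloud Monitoring rather than hand-rolled code.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, giving a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def error_rate(status_codes):
    """Fraction of requests whose status code is 500 or above."""
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)

latencies = list(range(1, 101))      # hypothetical samples: 1..100 ms
statuses = [200] * 98 + [503] * 2    # hypothetical: 2 failures in 100 requests

print(p99(latencies), error_rate(statuses))
```

A mean latency would hide the slow tail that p99 exposes, which is why the tail percentile is the more useful alerting signal for online serving.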

192
MCQeasy

A data science team uses Vertex AI Workbench and wants to share notebooks with version history. Which service should they use?

A.Artifact Registry
B.Cloud Storage
C.Data Catalog
D.Cloud Source Repositories
AnswerD

Cloud Source Repositories provides Git-based version control for notebooks and code.

Why this answer

Cloud Source Repositories (CSR) is the correct choice because it provides Git-based version control for notebooks, enabling teams to track changes, collaborate, and maintain a full version history. Vertex AI Workbench integrates natively with CSR, allowing users to clone, commit, and push notebook files directly from the JupyterLab interface, which is essential for collaborative development with revision tracking.

Exam trap

Google Cloud often tests the distinction between storage services (Cloud Storage) and version control services (Cloud Source Repositories), leading candidates to choose Cloud Storage because it has object versioning, but it lacks the collaborative Git workflow required for notebook version history.

How to eliminate wrong answers

Option A is wrong because Artifact Registry is designed for storing and managing container images and ML artifacts (e.g., models, packages), not for version-controlling notebook files or providing a Git-based history. Option B is wrong because Cloud Storage is an object store for unstructured data; it supports object versioning but lacks the branching, merging, and collaborative workflow features of a Git repository, making it unsuitable for notebook version history. Option C is wrong because Data Catalog is a metadata management service for discovering and tagging assets (e.g., datasets, models), not a version control system for code or notebooks.

193
MCQmedium

An organization uses Cloud Composer to orchestrate ML workflows. A DAG that triggers Vertex AI training jobs fails because the training job exceeds the 7-day maximum runtime. What is the best way to handle long-running training jobs in Cloud Composer?

A.Increase the DAG execution timeout to 14 days in the Airflow configuration
B.Use Vertex AI Pipeline to manage the training job asynchronously
C.Refactor the training job to run on Dataflow, which supports longer runtimes
D.Set max_active_runs=1 in the DAG to prevent overlapping runs
AnswerB

Vertex AI Pipeline can handle long-running jobs independently of the DAG runtime.

Why this answer

Option B is correct because Vertex AI Pipelines natively supports asynchronous execution, allowing Cloud Composer to trigger a pipeline and monitor its status without blocking the Airflow worker for the entire duration of the training job. This decouples the DAG execution timeout from the training runtime, enabling workflows that exceed the 7-day Airflow task timeout limit.

Exam trap

The trap here is that candidates assume increasing the Airflow execution timeout is a valid solution, but the PMLE exam tests understanding that Cloud Composer's architecture imposes practical limits on synchronous task execution, and the correct approach is to use asynchronous orchestration with services like Vertex AI Pipelines.

How to eliminate wrong answers

Option A is wrong because increasing the DAG execution timeout to 14 days does not address the underlying issue: the training job still exceeds the 7-day maximum runtime stated in the question, and a task that blocks for that long (even with a larger `execution_timeout`) ties up an Airflow worker, which can lead to resource exhaustion and scheduler instability. Option C is wrong because Dataflow is a stream and batch processing service, not designed for long-running ML training jobs, and refactoring to Dataflow would not solve the runtime limit issue. Option D is wrong because `max_active_runs=1` prevents overlapping DAG runs but does nothing to extend the maximum runtime of a single task; the training job would still fail after 7 days.

194
MCQeasy

A data scientist has trained a model using Vertex AI Training and wants to deploy it to a Vertex AI Endpoint for online predictions. Which orchestration service should be used to automate the deployment step after training completes?

A.Vertex AI Pipelines
B.App Engine
C.Cloud Functions
D.Cloud Build
AnswerA

Vertex AI Pipelines allows you to define a pipeline with training and deployment components, automating the workflow.

Why this answer

Vertex AI Pipelines is the correct orchestration service because it is purpose-built for automating and managing end-to-end ML workflows on Google Cloud. It allows you to define a pipeline that includes both the training step (using Vertex AI Training) and the subsequent deployment step (creating or updating a Vertex AI Endpoint) as a single, repeatable, and monitored workflow. This ensures that after training completes, the model is automatically deployed without manual intervention, leveraging the pipeline's ability to pass artifacts and trigger conditional logic.

Exam trap

Google Cloud often tests the distinction between general-purpose compute services (Cloud Functions, App Engine) and ML-specific orchestration tools (Vertex AI Pipelines), trapping candidates who think any serverless or CI/CD tool can handle the unique requirements of ML workflow automation.

How to eliminate wrong answers

Option B (App Engine) is wrong because it is a platform-as-a-service (PaaS) for building and hosting web applications, not an ML pipeline orchestrator; it lacks native integration with Vertex AI Training and Endpoint APIs for automated model deployment. Option C (Cloud Functions) is wrong because it is a serverless compute service for event-driven, single-purpose functions, not designed for orchestrating multi-step ML workflows with dependencies and artifact tracking. Option D (Cloud Build) is wrong because it is a CI/CD service primarily for building, testing, and deploying software artifacts (e.g., container images), not for orchestrating ML pipelines that involve training jobs and endpoint deployments with state management.

195
MCQeasy

You are using Cloud Datalab for collaborative data exploration with your team. However, some team members cannot access the Datalab instances. What is the most likely issue?

A.The Datalab instances have been deleted by another team member.
B.The team members need to install the Cloud Datalab SDK locally.
C.The team members have not been granted the necessary IAM roles (e.g., roles/datalab.user) on the project.
D.The Datalab instances were created using an incompatible notebook type.
AnswerC

IAM roles control access to Datalab instances.

Why this answer

Cloud Datalab uses IAM permissions to control access to instances. The most common reason team members cannot access Datalab instances is that they lack the necessary IAM role, such as `roles/datalab.user`, which grants permission to view and connect to Datalab instances. Without this role, even if the instances exist and are running, users will receive permission-denied errors when trying to access them via the Datalab UI or API.

Exam trap

Google Cloud often tests the misconception that Cloud Datalab requires local software installation or that instance deletion is the cause, when in fact the core issue is almost always IAM permissions, specifically the `roles/datalab.user` role.

How to eliminate wrong answers

Option A is wrong because if Datalab instances were deleted, all team members would lose access, not just some, and the error would be a 'not found' rather than an access-denied error. Option B is wrong because Cloud Datalab is a managed service accessed through a web browser; no local SDK installation is required—users simply need the correct IAM permissions and a browser. Option D is wrong because Datalab instances are based on Jupyter notebooks, and there is no concept of 'incompatible notebook type' that would prevent access; the instance type (e.g., machine size) does not affect authentication or authorization.

196
MCQmedium

A team has deployed a model with autoscaling configured as shown. They notice that during off-peak hours, the endpoint consistently runs 3 instances instead of scaling down to 1. What is the most likely cause?

A.There is a sustained request rate that prevents scaling down.
B.The `enableAccessLogging` flag increases resource usage.
C.The `minReplicaCount` is set too high.
D.The model is too large to fit on a single instance.
AnswerA

Autoscaler keeps instances if load requires them, even if low.

Why this answer

The autoscaling configuration is likely based on a target metric (e.g., requests per second or CPU utilization). During off-peak hours, if there is a sustained but low request rate that still exceeds the scale-down threshold, the model will not reduce instances below the number needed to handle that load. The endpoint runs 3 instances because the sustained request rate prevents the scaling-down logic from triggering, even though the traffic is lower than peak.

Exam trap

Google Cloud often tests the misconception that scaling is purely based on instance count or model size, when in reality it is driven by sustained request rates and metric thresholds that prevent scale-down actions.

How to eliminate wrong answers

Option B is wrong because `enableAccessLogging` only controls whether request/response logs are written to Cloud Logging, which does not directly affect compute resource usage or scaling behavior. Option C is wrong because if `minReplicaCount` were set too high, the endpoint would always run at least that many instances, but the question states it runs 3 instances instead of scaling down to 1, implying the minimum is 1 and the scaling logic is failing to reduce further. Option D is wrong because model size affects instance memory and startup time, but it does not prevent scaling down; a large model can still run on a single instance if the instance type supports it.

197
MCQeasy

A marketing team wants to use a pre-built natural language processing (NLP) model from Vertex AI Model Garden to analyze customer feedback. They need to extract sentiment from text data stored in Cloud Storage. The team has no experience with model serving infrastructure. Which deployment option minimizes operational overhead?

A.Deploy the model as a Cloud Function invoked by Cloud Storage events.
B.Deploy the model as a Cloud Run service using a custom Docker container.
C.Deploy the model on App Engine flexible environment.
D.Deploy the model to a Vertex AI Endpoint directly from Model Garden.
AnswerD

Simplest deployment with managed infrastructure.

Why this answer

Option D is correct because deploying directly to a Vertex AI Endpoint from Model Garden eliminates all infrastructure management. Vertex AI handles model serving, scaling, and monitoring automatically, which is ideal for a team with no experience in model serving infrastructure. This is a fully managed, serverless deployment that requires no containerization or server configuration.

Exam trap

The trap here is that candidates often assume Cloud Functions or Cloud Run are simpler because they are 'serverless,' but they fail to recognize that deploying a large NLP model requires specialized infrastructure (GPUs, model serving frameworks) that these services do not natively provide without significant custom work.

How to eliminate wrong answers

Option A is wrong because Cloud Functions are designed for lightweight, stateless event-driven code, not for hosting large NLP models with significant memory and GPU requirements; they also lack built-in model serving capabilities like autoscaling for inference. Option B is wrong because deploying as a Cloud Run service with a custom Docker container requires the team to containerize the model, manage dependencies, and configure scaling, which introduces significant operational overhead for a team with no serving experience. Option C is wrong because App Engine flexible environment still requires the team to build a custom runtime, manage instances, and handle model dependencies, and it is not optimized for ML inference workloads like Vertex AI endpoints.

198
MCQeasy

An organization wants to implement continuous training for a model that serves predictions via Vertex AI Endpoints. Which approach best automates the retrain-deploy cycle?

A.Schedule a Vertex AI Pipeline to retrain and conditionally deploy
B.Use Vertex AI Model Registry to auto-deploy on new model upload
C.Manually retrain and deploy monthly
D.Use Cloud Composer to schedule retraining only
E.Use a Cloud Function to retrain the model and update the endpoint
AnswerA

Automates the full cycle.

Why this answer

Option A is correct because Vertex AI Pipelines can be scheduled to run a retraining workflow and include a conditional step that deploys the new model to the endpoint only if it passes validation (e.g., evaluation metrics meet a threshold). This fully automates the retrain-deploy cycle without manual intervention, leveraging the pipeline's orchestration capabilities.

Exam trap

Google Cloud often tests the distinction between partial automation (e.g., only retraining or only deploying) and full end-to-end automation; the trap here is that candidates may choose an option that automates only one part of the cycle (like retraining with Cloud Composer or auto-deployment with Model Registry) and miss that the question requires both retraining and deployment to be automated in a single, orchestrated workflow.

How to eliminate wrong answers

Option B is wrong because Vertex AI Model Registry stores and versions models that have already been trained and uploaded; it does not trigger retraining, so it does not automate the retrain step. Option C is wrong because manual retraining and deployment monthly is not automated and defeats the purpose of continuous training. Option D is wrong because, as worded, Cloud Composer (Airflow) is used to schedule retraining only; deployment requires an additional step, so it does not fully automate the cycle.

Option E is wrong because a Cloud Function can trigger retraining and update an endpoint, but it lacks built-in orchestration for complex workflows like conditional deployment based on model evaluation, and it is less robust for managing dependencies and state compared to a pipeline.
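The "retrain, then conditionally deploy" logic can be sketched in plain Python. This is not the Kubeflow Pipelines or Vertex AI SDK: the functions `should_deploy` and `run_cycle`, the AUC metric, and the 0.80 floor are all hypothetical stand-ins for pipeline components and a validation rule.

```python
def should_deploy(candidate_auc: float, serving_auc: float, min_auc: float = 0.80) -> bool:
    """Deploy only if the candidate clears a floor and is at least as good as the serving model."""
    return candidate_auc >= min_auc and candidate_auc >= serving_auc

def run_cycle(train_fn, evaluate_fn, deploy_fn, serving_auc: float) -> str:
    """One scheduled run: train, evaluate, then deploy only when the gate passes."""
    model = train_fn()
    auc = evaluate_fn(model)
    if should_deploy(auc, serving_auc):
        deploy_fn(model)
        return "deployed"
    return "skipped"

# Stand-in callables play the role of training, evaluation, and deployment steps.
deployed = []
result_good = run_cycle(lambda: "model-v2", lambda m: 0.91, deployed.append, serving_auc=0.88)
result_bad = run_cycle(lambda: "model-v3", lambda m: 0.70, deployed.append, serving_auc=0.88)

print(result_good, result_bad, deployed)
```

In a real pipeline each callable would be a pipeline step and the `if` would be a conditional branch, so a weak retrain never reaches the endpoint.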

199
Multi-Selecthard

A team is architecting a low-code ML system for real-time predictions with AutoML. Which THREE considerations are critical for production?

Select 3 answers
A.Enable autoscaling for the endpoint
B.Set up alerts for model performance degradation
C.Monitor prediction drift with Vertex AI Model Monitoring
D.Use a custom container for prediction
E.Use global model endpoints for low latency everywhere
AnswersA, B, C

Correct: Essential for handling variable traffic.

Why this answer

A is correct because autoscaling ensures the prediction endpoint can handle variable request loads without manual intervention, which is critical for production real-time systems. In Vertex AI, you can configure autoscaling with a target utilization level (e.g., 60%) to automatically adjust the number of compute nodes based on incoming traffic, preventing both over-provisioning and latency spikes.

Exam trap

The trap here is that candidates confuse 'low-code' with 'no-code' and assume custom containers (Option D) are always required for production, when AutoML actually abstracts away container management, and they also mistakenly think a single global endpoint inherently provides low latency, ignoring the need for regional deployment and traffic routing.

200
Multi-Selecteasy

Which TWO options are best practices for reducing model serving latency on Vertex AI Endpoints? (Choose two.)

Select 2 answers
A.Use a larger machine type with more memory
B.Optimize the model using quantization or pruning
C.Deploy the model in the same region as the clients
D.Use batch prediction instead of online prediction
E.Enable model caching at the endpoint
AnswersB, C

Reduces model size and inference time, lowering latency with minimal accuracy impact.

Why this answer

Options B and C are correct. Optimizing the model (quantization/pruning) reduces compute time without major accuracy loss. Deploying in the same region as clients reduces network latency.

Option A increases cost but does not necessarily reduce latency. Option D increases latency because batch prediction processes requests in bulk rather than in real time. Option E is not a feature.
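To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit quantization on a handful of weights. It is a toy illustration under stated assumptions, not how TensorFlow or Vertex AI implements quantization; the function names and the weight values are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]   # hypothetical float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q, scale, max_error)
```

Each weight now fits in one byte instead of the four bytes of a float32, at the cost of a rounding error of at most half the scale per weight.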

201
MCQeasy

A user receives this error when deploying an AutoML model. What should they do?

A.Change the region to us-west1
B.Use machine type n1-highmem-2
C.Increase the min-replica-count to 2
D.Remove the traffic-split flag
AnswerB

Correct: n1-highmem-2 is a supported machine type for AutoML.

Why this answer

The error occurs because AutoML models require a machine type with sufficient memory to load the model and perform predictions. The default machine type may not have enough memory for the model's size, leading to an out-of-memory (OOM) error. Using `n1-highmem-2` provides higher memory per core, which resolves the memory constraint without changing other deployment parameters.

Exam trap

The trap here is that candidates often confuse scaling (increasing replicas) with resource allocation (increasing memory per replica), leading them to choose Option C instead of addressing the per-instance memory bottleneck.

How to eliminate wrong answers

Option A is wrong because changing the region to `us-west1` does not affect the memory capacity of the machine type; the error is due to insufficient memory, not regional availability or latency. Option C is wrong because increasing `min-replica-count` to 2 only adds more replicas for scaling, but each replica still uses the same underpowered machine type, so the OOM error persists. Option D is wrong because removing the `traffic-split` flag would disrupt traffic routing but does not address the root cause of insufficient memory for model loading.

202
MCQhard

A company has a pipeline that uses Vertex AI Pipelines to fetch data from BigQuery, preprocess it with Dataflow, train an AutoML model, and deploy it. However, they want to reduce cloud costs. The pipeline runs hourly. Which change will most reduce compute costs while maintaining throughput?

A.Decrease the AutoML training budget from 10 to 1 node hour
B.Replace Dataflow preprocessing with a Cloud Function that runs on each file upload
C.Increase Dataflow batch size to process more data per worker
D.Switch from Vertex AI Pipelines to Cloud Composer for orchestration
AnswerC

Reduces the number of worker instances needed.

Why this answer

Option C is correct because increasing the Dataflow batch size allows each worker to process more data per batch, reducing the number of workers needed and the total compute time for the same throughput. This directly lowers Dataflow's compute cost without affecting the pipeline's hourly schedule or the AutoML training budget.

Exam trap

The trap here is that candidates assume reducing AutoML node hours (Option A) is the most direct way to cut costs, but the question specifies 'maintaining throughput' and the pipeline runs hourly, so Dataflow preprocessing is the dominant cost driver, not the model training budget.

How to eliminate wrong answers

Option A is wrong because decreasing the AutoML training budget from 10 to 1 node hour would severely degrade model quality, as AutoML requires sufficient training time to converge, and this does not address the main cost driver (Dataflow preprocessing). Option B is wrong because replacing Dataflow with a Cloud Function triggered on file upload is event-driven and not suitable for the hourly batch pipeline; Cloud Functions (1st gen) have a 9-minute timeout and are not built for large-scale preprocessing, so throughput would drop and costs could increase due to per-invocation overhead. Option D is wrong because switching from Vertex AI Pipelines to Cloud Composer (managed Airflow) adds orchestration complexity and cost (e.g., environment nodes) without reducing compute costs for Dataflow or AutoML; the orchestration layer is not the primary cost driver.

203
MCQhard

An organization uses Vertex AI Pipelines to automate a model training workflow. They want to reuse previously trained models if the data hasn't changed. Which pipeline component best achieves this?

A.Use a caching mechanism in Vertex AI Pipelines
B.Use a Cloud Function to check BigQuery update time
C.Use Artifact Registry to store model versions
D.Use a conditional component that checks data hash
AnswerD

Correct: Conditional logic can skip training when data unchanged.

Why this answer

A conditional component can check whether the data has changed and skip training when it has not. Execution caching exists, but it works per step rather than for the whole pipeline; a Cloud Function check is external to the pipeline; and Artifact Registry stores model versions but does not decide whether to retrain.
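The data-hash check behind the correct answer can be sketched as follows. This is plain Python rather than a pipeline component, and the function names, the row format, and the use of SHA-256 over `repr(row)` are assumptions made for illustration.

```python
import hashlib

def data_fingerprint(rows):
    """SHA-256 over the rows in order; any change to the data changes the digest."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def should_retrain(rows, last_trained_fingerprint):
    """Retrain only when the current fingerprint differs from the stored one."""
    return data_fingerprint(rows) != last_trained_fingerprint

rows_v1 = [("sensor-a", 1.0), ("sensor-b", 2.5)]   # hypothetical training rows
fp = data_fingerprint(rows_v1)

unchanged = should_retrain(rows_v1, fp)
changed = should_retrain(rows_v1 + [("sensor-c", 0.1)], fp)

print(unchanged, changed)
```

In a pipeline, the stored fingerprint would live alongside the last trained model, and the conditional step would branch to "reuse model" when `should_retrain` returns `False`.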

204
MCQmedium

A data science team uses TFX to train and deploy a model on Vertex AI. They want automated monitoring for pipeline health. Which set of metrics should they monitor to quickly detect issues in the training pipeline?

A.Prediction request count, latency, and error rate on the serving endpoint.
B.Pipeline execution status (success/failure), component completion times, and data validation anomalies.
C.Number of pipeline runs, average CPU utilization, and memory usage.
D.Model accuracy, precision, and recall on the evaluation dataset.
AnswerB

Directly monitors pipeline health including data quality.

Why this answer

Option B is correct because the question specifically asks about monitoring the training pipeline's health, not the serving infrastructure. Pipeline execution status directly indicates whether the pipeline ran successfully, component completion times help identify bottlenecks or failures, and data validation anomalies catch data quality issues early in the pipeline — all of which are essential for detecting issues in the training pipeline itself.

Exam trap

The trap here is that candidates confuse serving endpoint metrics (like latency and error rate) with pipeline health metrics, because both are part of an ML system, but the question explicitly asks about the training pipeline, not the serving infrastructure.

How to eliminate wrong answers

Option A is wrong because prediction request count, latency, and error rate are metrics for monitoring the serving endpoint (model serving), not the training pipeline. Option C is wrong because number of pipeline runs, average CPU utilization, and memory usage are infrastructure-level metrics that do not directly indicate pipeline health or data quality issues. Option D is wrong because model accuracy, precision, and recall are evaluation metrics for model performance, not for detecting issues in the training pipeline's execution or data validation.

205
Multi-Selectmedium

Which THREE actions are best practices for managing ML models in production on Google Cloud? (Choose 3)

Select 3 answers
A.Manually tune hyperparameters for each retraining run.
B.Monitor model performance and data drift continuously.
C.Use a central model registry for model governance.
D.Version all model artifacts and training datasets.
E.Store all raw training data indefinitely for auditability.
AnswersB, C, D

Correct: monitoring helps detect degradation.

Why this answer

Option B is correct because continuous monitoring of model performance and data drift is essential for maintaining prediction accuracy in production. Google Cloud's Vertex AI Model Monitoring automatically detects skew and drift by comparing serving data against training data distributions, alerting you to degradation before it impacts business outcomes.

Exam trap

Google Cloud often tests the misconception that manual hyperparameter tuning is acceptable for production, when in fact automation (e.g., Vertex AI Vizier) is the recommended practice to ensure reproducibility and efficiency.

206
Multi-Selecthard

A company runs batch predictions on a large dataset using Vertex AI Batch Prediction. They want to reduce costs without significantly increasing processing time. Which three actions should they take? (Choose three.)

Select 3 answers
A.Use preemptible VMs for the batch prediction job.
B.Use a larger machine type to reduce the number of workers.
C.Use custom machine types with only the necessary resources (vCPU and memory).
D.Use TPUs instead of GPUs to accelerate processing.
E.Tune the batch size to maximize throughput per worker.
AnswersA, C, E

Preemptible VMs are significantly cheaper and suitable for batch jobs.

Why this answer

Options A, C, and E are correct. A uses preemptible VMs, which are cheaper. C uses custom machine types to avoid overprovisioning.

E tunes batch size to maximize throughput per worker. Option B increases machine size, which may increase cost per worker. Option D uses TPUs, which are more expensive and may not be beneficial for all model types.

207
Multi-Selectmedium

Which TWO metrics should you monitor to detect data drift in a batch prediction pipeline?

Select 2 answers
A.Model accuracy on recent labeled data
B.Model prediction latency
C.Feature distribution drift (e.g., KS test)
D.Prediction distribution drift
E.Training data size
AnswersC, D

Directly measures input drift.

Why this answer

Feature distribution drift (C) is correct because it directly measures changes in the input data distribution over time using statistical tests like the Kolmogorov-Smirnov (KS) test, which compares the cumulative distribution of a feature in the current batch against a reference baseline. This is a primary indicator of data drift, as shifts in feature distributions can degrade model performance even if labels are not immediately available.

Exam trap

Google Cloud often tests the distinction between monitoring for data drift (input distribution changes) versus monitoring for model performance degradation (accuracy), leading candidates to incorrectly select accuracy as a drift metric when it is actually a downstream effect.
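The feature-drift check mentioned above can be illustrated with a small two-sample Kolmogorov-Smirnov statistic written in pure Python. This sketch computes only the statistic (the maximum gap between the two empirical CDFs), not a p-value. The function name and sample values are hypothetical, and a production pipeline would more likely rely on a statistics library or Vertex AI Model Monitoring.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Maximum absolute difference between the empirical CDFs of two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    points = sorted(set(ref) | set(cur))
    d = 0.0
    for x in points:
        cdf_ref = bisect_right(ref, x) / len(ref)   # share of reference values <= x
        cdf_cur = bisect_right(cur, x) / len(cur)   # share of current values <= x
        d = max(d, abs(cdf_ref - cdf_cur))
    return d

baseline = [1, 2, 3, 4]                          # hypothetical training-time feature values
no_drift = ks_statistic(baseline, [1, 2, 3, 4])
partial_drift = ks_statistic(baseline, [3, 4, 5, 6])
full_drift = ks_statistic(baseline, [10, 11, 12, 13])

print(no_drift, partial_drift, full_drift)
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so an alert threshold between those bounds can flag a drifting feature without needing labels.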

208
MCQmedium

A company uses Vertex AI AutoML to train a vision model, but the model has low accuracy. What should they do first?

A.Add more labeled images to the dataset
B.Switch to a custom model
C.Increase the training budget
D.Reduce image size to speed up training
AnswerA

More data often improves model accuracy.

Why this answer

Adding more labeled images directly addresses the most common cause of low accuracy in AutoML vision models: insufficient or unrepresentative training data. Vertex AI AutoML relies on transfer learning from pre-trained models, and its performance is heavily dependent on the quality and quantity of labeled examples. Before adjusting hyperparameters or infrastructure, the first step should always be to improve the dataset, as AutoML is designed to handle model architecture and training budget automatically.

Exam trap

Google Cloud often tests the misconception that AutoML models are 'black boxes' where tuning budgets or switching to custom models is the first fix, when in reality the platform is optimized to handle those aspects automatically, and the primary lever is data quality.

How to eliminate wrong answers

Option B is wrong because switching to a custom model would require manual architecture design and hyperparameter tuning, which contradicts the low-code premise of AutoML and is not the first troubleshooting step. Option C is wrong because increasing the training budget (e.g., node hours) only helps if the model has not converged; with low accuracy, the root cause is typically data quality, not insufficient training time. Option D is wrong because reducing image size may speed up training but can discard critical features, further degrading accuracy; AutoML already handles resizing internally.

209
MCQmedium

An ML team is scaling a prototype to production. The data pipeline currently reads from Cloud Storage and transforms data with a custom Python script. They need to handle higher throughput and add monitoring. Which approach should they take?

A.Deploy the Python script on a large Compute Engine instance with a cron job
B.Migrate the pipeline to Apache Beam on Dataflow with Cloud Monitoring
C.Rewrite the pipeline to use Pub/Sub and Cloud Functions for processing
D.Use Cloud Composer to orchestrate the Python script at scale
AnswerB

Dataflow is serverless, auto-scales, and integrates with Cloud Monitoring for observability.

Why this answer

Apache Beam on Dataflow provides a unified programming model for batch and streaming data processing, enabling automatic scaling to handle higher throughput. Cloud Monitoring integrates natively with Dataflow to track pipeline metrics, latency, and error rates, addressing the monitoring requirement. This approach is purpose-built for production-grade data pipelines, unlike ad-hoc solutions.

Exam trap

Google Cloud often tests the distinction between orchestration (Cloud Composer) and execution (Dataflow), leading candidates to choose an orchestrator when a dedicated processing engine is required for scaling and monitoring.

How to eliminate wrong answers

Option A is wrong because deploying a Python script on a single large Compute Engine instance with a cron job does not provide horizontal scaling, fault tolerance, or built-in monitoring; it creates a single point of failure and cannot handle throughput spikes. Option C is wrong because rewriting the pipeline to use Pub/Sub and Cloud Functions is suitable for event-driven, lightweight processing but not for complex data transformations or high-throughput batch workloads; Cloud Functions have execution timeouts (for example, 9 minutes for 1st gen functions) and lack stateful processing capabilities. Option D is wrong because Cloud Composer (managed Apache Airflow) is an orchestration tool, not a data processing engine; it would still rely on the Python script's execution, inheriting its scaling and monitoring limitations without addressing the core transformation throughput.

210
MCQhard

A company has a Vertex AI pipeline that trains a model on streaming data from Pub/Sub. The pipeline is triggered by a Cloud Function when new data arrives. Recently, jobs have been failing with 'ResourceExhausted: Quota limit exceeded for regional CPUs in us-central1.' The team needs to ensure successful job execution while minimizing changes. Which approach should they take?

A.Request a quota increase from Google Cloud Support.
B.Change the pipeline to run in a different region with available quota.
C.Reduce the number of parallel pipeline runs by using a Cloud Tasks queue with rate limiting.
D.Configure the pipeline's training job to use preemptible VMs (which count toward a separate, usually higher quota).
AnswerD

Preemptible VMs have a separate quota and are cheaper.

Why this answer

Option D is correct because preemptible VMs count toward a separate, often higher quota for 'Preemptible CPUs' rather than the standard regional CPU quota. By configuring the training job to use preemptible VMs, the team can bypass the exhausted quota without requesting a limit increase or changing the pipeline architecture. This minimizes changes because only the training job's compute configuration needs to be updated; the trigger and the rest of the pipeline stay as they are.

Exam trap

Google Cloud often tests the misconception that rate limiting (Option C) solves quota exhaustion, but the trap here is that quota limits are per-resource (e.g., regional CPUs) and rate limiting does not change the per-job resource consumption, so it only delays the inevitable failure.

How to eliminate wrong answers

Option A is wrong because requesting a quota increase from Google Cloud Support is a manual, time-consuming process that does not minimize changes and may not be approved quickly, especially if the quota is already at a high default limit. Option B is wrong because changing the pipeline to run in a different region introduces significant architectural changes, potential latency issues, and may require reconfiguring data sources like Pub/Sub topics and Cloud Functions, which contradicts the goal of minimizing changes. Option C is wrong because reducing the number of parallel pipeline runs with a Cloud Tasks queue addresses concurrency but does not resolve the underlying regional CPU quota exhaustion; the quota limit is still hit per run, and rate limiting only delays failures rather than preventing them.

211
MCQhard

You are a machine learning engineer at a retail company. You have deployed a product recommendation model on Vertex AI Prediction using a custom container. The model is a TensorFlow SavedModel that computes embeddings using a large lookup table. The endpoint is configured with 2 replicas on n1-standard-4 (4 vCPU, 15 GB memory) machines. After deployment, you notice that the endpoint's memory usage grows over time, eventually reaching 90% and causing requests to fail with 503 errors. The container logs show no errors, but the memory usage graph shows a steady increase. The model loads the embedding table (5 GB) at startup. You suspect a memory leak. Which course of action should you take first to diagnose and resolve the issue?

A.Profile the container's memory usage locally with memory_profiler to find the leak, then fix the code.
B.Reduce the number of replicas to 1 to reduce memory contention.
C.Increase the machine memory to n1-standard-8 (30 GB).
D.Restart the endpoint every hour using a Cloud Scheduler job.
AnswerA

Identifies root cause for permanent fix.

Why this answer

Option A is correct because the steady memory growth despite a fixed 5 GB embedding table indicates a memory leak in the custom container code, not a capacity issue. Profiling locally with memory_profiler allows you to trace object allocations and identify the leak source before modifying the serving code, which is the most direct diagnostic step.

Exam trap

Google Cloud often tests the distinction between scaling up resources (Option C) and fixing the root cause (Option A), tempting candidates to choose a quick capacity increase instead of proper debugging.

How to eliminate wrong answers

Option B is wrong because reducing replicas to 1 does not address the memory leak; it only reduces total cluster memory, making the leak more severe per replica. Option C is wrong because increasing machine memory to n1-standard-8 (30 GB) merely postpones the failure by providing more headroom, but the leak will eventually consume that memory as well. Option D is wrong because restarting the endpoint every hour via Cloud Scheduler is a workaround that masks the symptom without fixing the underlying code defect, and it introduces request downtime during restarts.

212
MCQhard

Refer to the exhibit. A data scientist deploys a new model version (model_v2) to an existing endpoint with 20% traffic. After a few days, they notice that model_v2's error rate is higher than model_v1's. They want to route all traffic back to model_v1 immediately. Which command achieves this with minimal disruption?

A.gcloud ai endpoints update my-endpoint --region=us-central1 --remove-deployed-model=model_v2
B.gcloud ai endpoints undepoly-model my-endpoint --region=us-central1 --model=model_v2
C.gcloud ai endpoints update my-endpoint --region=us-central1 --traffic-split=model_v1=1,model_v2=0
D.gcloud ai endpoints update-traffic my-endpoint --region=us-central1 --model=model_v1 --traffic-percentage=100
AnswerC

This command updates the traffic split to direct 100% traffic to model_v1 and 0% to model_v2, a zero-downtime change.

Why this answer

Option C is correct. The gcloud ai endpoints update command with --traffic-split allows setting the traffic split to 100% for model_v1 and 0% for model_v2, routing all traffic to the stable model without redeploying. Option A removes the model, which may cause temporary unavailability.

Option B uses a misspelled command. Option D changes the endpoint's update time but not traffic.

213
MCQmedium

You have deployed a regression model that predicts house prices. Over the past month, the model's predictions have been consistently too high. You suspect data drift in the input features. Which monitoring metric should you prioritize to confirm this?

A.Monitor prediction drift (prediction distribution)
B.Monitor feature distribution drift using a divergence metric like Jensen-Shannon divergence
C.Monitor feature attribution drift using SHAP values
D.Monitor residual distribution drift
AnswerB

Feature drift measures input distribution change.

Why this answer

Option B is correct because the question describes a scenario where predictions are consistently too high, which is a symptom of data drift—a change in the distribution of input features. Monitoring feature distribution drift using a divergence metric like Jensen-Shannon divergence directly measures whether the input data has shifted from the training distribution, which would cause the model to make biased predictions. This is the most direct way to confirm data drift in the input features.

Exam trap

Google Cloud often tests the distinction between monitoring prediction drift (output) and feature drift (input), trapping candidates who assume that a change in predictions automatically implies data drift without verifying the input distributions.

How to eliminate wrong answers

Option A is wrong because monitoring prediction drift (prediction distribution) only tells you that the outputs have changed, not why; it does not isolate whether the cause is data drift in features or other issues like concept drift. Option C is wrong because monitoring feature attribution drift using SHAP values measures changes in feature importance, not changes in the feature distributions themselves; it can indicate which features are driving predictions differently but does not directly confirm data drift. Option D is wrong because monitoring residual distribution drift focuses on the errors (residuals) between predictions and actual values, which can be influenced by both data drift and concept drift; it does not specifically confirm data drift in input features.
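For a concrete sense of the divergence metric named in option B, the sketch below computes the Jensen-Shannon divergence between two binned feature distributions in plain Python. The bin proportions are hypothetical (for instance, a square-footage feature split into three bins); with log base 2 the result lies between 0 (identical) and 1 (no overlap).

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with log base 2."""
    # The mixture m is positive wherever p or q is, so the KL terms are defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical binned proportions of one input feature.
training_hist = [0.25, 0.50, 0.25]
serving_hist = [0.10, 0.40, 0.50]

print(js_divergence(training_hist, training_hist))
print(js_divergence(training_hist, serving_hist))
```

Unlike plain KL divergence, the score is symmetric and stays finite when a bin is empty in one of the two distributions, which makes it convenient for a fixed alerting threshold.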

214
Matchingmedium

Match each Google Cloud AI/ML service to its primary purpose.

Concepts
Matches

End-to-end ML platform for building, deploying, and managing models

Train high-quality custom ML models with minimal effort

Managed service for distributed training of ML models

Custom ASIC for accelerating ML training workloads

Create and execute ML models using SQL queries

Why these pairings

These are core Google Cloud AI/ML services tested in the PMLE exam.

215
MCQmedium

A retail company wants to build a product recommendation system using customer purchase history and product attributes. They have limited ML expertise and want to minimize custom code. Which approach should they choose?

A.Use BigQuery ML to create a matrix factorization model.
B.Use Vertex AI Vizier for hyperparameter tuning on a pre-built recommendation model.
C.Use Vertex AI AutoML Tables to train a recommendation model.
D.Use TensorFlow with Keras to build a custom collaborative filtering model.
AnswerC

AutoML Tables can build a recommendation model from tabular data with minimal code.

Why this answer

Vertex AI AutoML Tables is the correct choice because it enables building a recommendation model with minimal ML expertise and custom code, leveraging automated feature engineering, model selection, and hyperparameter tuning on tabular data (customer purchase history and product attributes). It requires no custom code, unlike TensorFlow/Keras, and provides a managed service that handles data preprocessing and training, aligning with the company's limited ML expertise and desire to minimize custom code.

Exam trap

Google Cloud often tests the distinction between model training services (AutoML, BigQuery ML) and optimization/tuning services (Vizier), leading candidates to confuse Vizier as a complete model-building solution when it only tunes hyperparameters for an existing model.

How to eliminate wrong answers

Option A is wrong because BigQuery ML's matrix factorization model is designed for explicit feedback (e.g., ratings) and requires structured SQL-based feature engineering, which still demands ML knowledge and custom SQL code, not a fully low-code solution. Option B is wrong because Vertex AI Vizier is a hyperparameter tuning service, not a model training service; it cannot build a recommendation model on its own and requires a pre-built model to tune, which the company lacks. Option D is wrong because TensorFlow with Keras requires significant custom code and ML expertise to implement collaborative filtering, contradicting the requirement to minimize custom code and limited ML expertise.

216
Multi-Selectmedium

A company uses Cloud Scheduler to trigger Cloud Functions that submit Vertex AI training jobs. They want to ensure fault tolerance and minimize manual intervention. Which TWO practices should they implement?

Select 2 answers
A.Store training hyperparameters in Cloud Firestore for reproducibility.
B.Use Cloud Run jobs as an alternative execution environment.
C.Use Cloud Tasks with retries to handle failed triggers.
D.Implement a fallback that runs the job on Compute Engine if Vertex AI fails.
E.Set up Cloud Monitoring alerts on failed pipeline runs.
AnswersC, E

Cloud Tasks can schedule and retry HTTP requests to the Cloud Function, providing fault tolerance.

Why this answer

Option C is correct because Cloud Tasks provides built-in retry logic with exponential backoff, which can reliably handle transient failures when triggering Cloud Functions from Cloud Scheduler. By configuring a Cloud Tasks queue with retry parameters, the system automatically retries failed triggers without manual intervention, ensuring fault tolerance for Vertex AI training job submissions.
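The retry behavior described above can be illustrated with a generic capped exponential backoff loop. This is a minimal sketch of the pattern, not Cloud Tasks' actual algorithm or API; `FlakyTrigger`, `call_with_retries`, and the delay values are hypothetical names and numbers introduced for the example.

```python
def backoff_delays(max_attempts, base_delay=1.0, max_delay=60.0):
    """Delay before each retry: base_delay doubled each time, capped at max_delay."""
    return [min(base_delay * 2 ** n, max_delay) for n in range(max_attempts - 1)]


def call_with_retries(trigger, max_attempts=5, sleep=lambda seconds: None):
    """Call trigger until it succeeds or attempts run out; return (result, attempts)."""
    delays = backoff_delays(max_attempts)
    for attempt in range(1, max_attempts + 1):
        try:
            return trigger(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure (e.g. to alerting)
            sleep(delays[attempt - 1])  # no-op by default; pass time.sleep for real waits


class FlakyTrigger:
    """Hypothetical trigger that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.remaining = failures

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise RuntimeError("transient failure")
        return "job submitted"


result, attempts = call_with_retries(FlakyTrigger(failures=2))
print(result, attempts)
print(backoff_delays(8))
```

Retries absorb transient failures, while the re-raised error after the final attempt is the case that an alert on failed runs is meant to catch.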

Exam trap

Google Cloud often tests the distinction between fault tolerance (retry mechanisms) and other concerns like reproducibility or alternative compute; the trap here is that candidates may confuse storing hyperparameters (reproducibility) or switching to Compute Engine (fallback) with actual fault tolerance for trigger failures.

217
Multi-Selectmedium

A team has trained a sentiment analysis model using PyTorch on Vertex AI Training. They now want to deploy it for online predictions with low latency. Which TWO actions should they take? (Choose 2)

Select 2 answers
A.Create multiple model versions for A/B testing.
B.Use a machine type with a GPU for faster inference.
C.Enable batch prediction instead of online prediction.
D.Convert the model to TensorFlow SavedModel format.
E.Package the model in a custom container with a web server (e.g., FastAPI).
AnswersB, E

GPUs can accelerate inference for deep learning models.

Why this answer

Option B is correct because GPU-accelerated inference significantly reduces latency for deep learning models like sentiment analysis, especially when using PyTorch, which has native CUDA support. Vertex AI Prediction supports GPU machine types (e.g., n1-standard-4 with NVIDIA T4) that can process batched requests faster than CPUs, directly addressing the low-latency requirement.

Exam trap

Google Cloud often tests the misconception that converting to TensorFlow SavedModel is required for Vertex AI, but the platform supports PyTorch natively via custom containers, making conversion an unnecessary and potentially error-prone step.

218
MCQhard

A large e-commerce company deploys a recommendation model on Vertex AI with autoscaling enabled. During Black Friday, traffic spikes rapidly. The autoscaler adds new instances, but new instances take several minutes to become ready (cold start). As a result, many requests time out. What should they do to mitigate this issue?

A.Use a larger machine type to reduce the number of instances needed.
B.Configure the autoscaler to use CPU utilization metric instead of request count.
C.Increase the health check grace period for new instances.
D.Set a higher minimum number of instances to handle the expected peak.
AnswerD

Pre-warms instances to absorb traffic spikes without cold start.

Why this answer

Option D is correct because setting a higher minimum number of instances ensures that a baseline capacity is always running and ready to serve traffic. This pre-warms instances, eliminating the cold-start latency during rapid traffic spikes, such as Black Friday, because new instances do not need to initialize from scratch.

Exam trap

The trap here is that candidates confuse scaling metrics or instance readiness with the fundamental need for pre-provisioned capacity, leading them to choose options that adjust autoscaling behavior without eliminating the cold-start latency.

How to eliminate wrong answers

Option A is wrong because using a larger machine type reduces the number of instances needed but does not address the cold-start delay; each new instance still takes minutes to become ready. Option B is wrong because switching to CPU utilization metric does not solve the cold-start problem; the autoscaler still adds instances that take time to initialize, and CPU utilization may not react as quickly to a sudden traffic surge as request count. Option C is wrong because increasing the health check grace period only delays when the load balancer considers an instance healthy, but the instance still takes the same time to become ready; requests will still time out during the cold-start window.

219
Multi-Selectmedium

A data scientist needs to scale a prototype deep learning model to train on a massive dataset using multiple GPUs. Which three strategies are essential for efficient distributed training? (Select THREE)

Select 3 answers
A.Use a single large batch size across all workers.
B.Implement data parallelism.
C.Ensure that the input pipeline is not a bottleneck by using tf.data.Dataset with prefetching and parallel reads.
D.Use synchronous gradient updates.
E.Use asynchronous gradient updates to reduce communication overhead.
AnswersB, C, D

Scales training by splitting data across workers.

Why this answer

Options B, C, and D are correct. Data parallelism (B) is the foundation for scaling across GPUs. Synchronous gradient updates (D) are commonly used to maintain convergence quality.

An optimized input pipeline (C) prevents I/O bottlenecks. Option E is wrong because asynchronous updates can cause convergence issues and are not essential. Option A is wrong because using a single large batch size across all workers is not essential; per-worker batch size must be tuned.
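To show why data parallelism with synchronous updates preserves convergence behavior, the sketch below simulates two workers in plain Python: each computes a gradient on its own shard, the gradients are averaged (the role an all-reduce plays), and every worker applies the same update. The toy linear model and data are hypothetical; with equal-sized shards the averaged gradient matches the full-batch gradient.

```python
import math


def gradient(w, b, batch):
    """Gradient of mean squared error for the model y = w * x + b over one batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def synchronous_step(w, b, shards, learning_rate):
    """One data-parallel step: per-shard gradients, averaged, then one shared update."""
    grads = [gradient(w, b, shard) for shard in shards]  # one entry per worker
    grad_w = sum(g[0] for g in grads) / len(grads)  # stands in for an all-reduce mean
    grad_b = sum(g[1] for g in grads) / len(grads)
    return w - learning_rate * grad_w, b - learning_rate * grad_b


# Toy data generated from y = 2x + 1, split evenly across two simulated workers.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
shards = [data[:2], data[2:]]

full_grad = gradient(0.0, 0.0, data)
shard_grads = [gradient(0.0, 0.0, shard) for shard in shards]
averaged_grad = (
    sum(g[0] for g in shard_grads) / len(shard_grads),
    sum(g[1] for g in shard_grads) / len(shard_grads),
)
print(full_grad, averaged_grad)

w, b = 0.0, 0.0
for _ in range(2000):
    w, b = synchronous_step(w, b, shards, learning_rate=0.05)
print(round(w, 3), round(b, 3))
```

Asynchronous updates drop the averaging barrier, so workers may apply gradients computed from stale parameters, which is the convergence risk the explanation refers to.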

220
MCQeasy

An MLOps team wants to automate the retraining of a model each time new data arrives in a BigQuery table. What is the most efficient Google Cloud service to orchestrate this pipeline?

A.Cloud Composer with an Airflow DAG
B.Dataflow pipeline with a periodic trigger
C.Cloud Functions triggered by BigQuery events
D.Vertex AI Pipelines with a schedule trigger
AnswerD

Vertex AI Pipelines natively supports scheduled triggers and is the recommended service for ML pipeline orchestration.

Why this answer

Vertex AI Pipelines is purpose-built for orchestrating ML workflows, including model retraining. It integrates natively with BigQuery for data ingestion and supports schedule triggers so that retraining runs automatically as new data accumulates, making it the most efficient and managed option for this ML-specific task.

Exam trap

The trap here is that candidates assume BigQuery emits native row-level or table-level event notifications, leading them to incorrectly choose Cloud Functions or Dataflow, while Vertex AI Pipelines provides the most integrated and efficient orchestration for ML retraining workflows.

How to eliminate wrong answers

Option A is wrong because Cloud Composer (Airflow) is a general-purpose workflow orchestrator that adds unnecessary overhead and complexity for a simple retraining pipeline, and it is not optimized for ML-specific operations like model versioning and deployment. Option B is wrong because Dataflow is a stream/batch data processing service, not an orchestrator; a periodic trigger would require additional services (e.g., Cloud Scheduler) and does not natively handle model retraining or pipeline orchestration. Option C is wrong because BigQuery does not emit native events when new data lands in a table, so a Cloud Function cannot be triggered directly by it; this option reflects a misunderstanding of BigQuery's event capabilities.

221
Multi-Selecteasy

A company is deploying a machine learning model for real-time inference on Vertex AI. Which TWO practices improve serving performance and reliability?

Select 2 answers
A.Use batch prediction for all requests.
B.Enable autoscaling to handle traffic variations.
C.Use manual scaling with a fixed number of replicas.
D.Deploy all models on the same machine type for consistency.
E.Set up model monitoring for prediction drift and data quality.
AnswersB, E

Autoscaling adjusts resources dynamically.

Why this answer

Option B is correct because Vertex AI's autoscaling dynamically adjusts the number of replicas based on incoming request traffic, ensuring low latency during spikes and cost savings during lulls. This is critical for real-time inference, where consistent response times are required and manual scaling would either over-provision or under-provision resources. Autoscaling uses metrics like CPU utilization or request count to scale up or down, directly improving serving performance and reliability.

Exam trap

Google Cloud often tests the distinction between batch and real-time serving, trapping candidates who think batch prediction can be used for low-latency inference, or who assume that manual scaling is more reliable than autoscaling for variable workloads.

222
MCQhard

Refer to the exhibit. An ML engineer runs this Vertex AI pipeline. After execution, the "train" task fails with a resource exhaustion error. The task consumes more memory than allocated. Which step should the engineer take to fix this issue without increasing the overall quota cost?

A.Add a 'memory' field to the train task specification.
B.Configure the 'train-exec' executor to use a machine type with higher memory.
C.Increase the memory of the train task to 32 GiB.
D.Set 'acceleratorType' to 'NVIDIA_TESLA_T4' on the train task.
AnswerB

The executor defines the machine type, and modifying it to use a higher-memory machine (e.g., n1-highmem-8) will provide more memory without changing other quota.

Why this answer

In Vertex AI pipelines, the executorLabel maps to a predefined executor that defines machine type. To increase memory for the train task, the engineer must modify the executor specification (e.g., 'machine_type: n1-highmem-*'). Increasing the task's memory directly is not supported; it's done via the executor.

Adding an accelerator does not address memory exhaustion.

223
Multi-Selecteasy

Which TWO are best practices for deploying models to Vertex AI Prediction? (Choose 2.)

Select 2 answers
A.Monitor prediction latency and error rates with Cloud Monitoring alerts.
B.Log all raw prediction inputs and outputs for every request for auditing.
C.Use a dedicated service account with minimal permissions for the endpoint.
D.Always deploy the model in the same environment as training to avoid incompatibility.
E.Use the default model version alias 'default' for all deployments to simplify updates.
AnswersA, C

Essential for detecting performance issues.

Why this answer

Options A and C are correct. Option D is wrong because the exact same environment may not be available. Option E is wrong because version aliases should be used for easy rollback.

Option B is wrong because logging all inputs may cause privacy issues.

224
MCQhard

A hospital wants to deploy a machine learning model for detecting anomalies in patient vital signs. The model was trained on historical data but must comply with HIPAA regulations. The model serving must be low-latency (under 100 ms) and handle up to 1000 requests per second. Which architecture should they use on Google Cloud?

A.Use Vertex AI Batch Prediction to run predictions in batch jobs every hour
B.Use BigQuery ML to run predictions directly from a BigQuery table
C.Deploy the model as a container on Cloud Run with a load balancer
D.Deploy the model to Vertex AI Prediction with a private endpoint and use VPC Service Controls for data isolation
AnswerD

Vertex AI Prediction with private endpoints offers low latency and VPC-SC provides HIPAA-compliant data boundaries.

Why this answer

Vertex AI Prediction with a private endpoint and VPC Service Controls meets all requirements: it provides low-latency (sub-100ms) online predictions for up to 1000 QPS, enforces HIPAA compliance by isolating the model within a VPC and preventing data exfiltration, and supports autoscaling. Batch Prediction (A) cannot meet the latency requirement, BigQuery ML (B) is designed for analytical queries not real-time serving, and Cloud Run (C) lacks native HIPAA-compliant data isolation controls.

Exam trap

Google Cloud often tests the distinction between batch and online prediction, and candidates mistakenly choose Cloud Run because it offers low latency, but they overlook the HIPAA data isolation requirement that VPC Service Controls uniquely satisfy in a managed ML context.

How to eliminate wrong answers

Option A is wrong because Vertex AI Batch Prediction processes predictions in batch jobs with latency of minutes to hours, not sub-100ms, and cannot handle real-time requests at 1000 QPS. Option B is wrong because BigQuery ML runs predictions via SQL queries on BigQuery tables, which incurs query execution latency (typically seconds) and is not designed for low-latency online serving. Option C is wrong because Cloud Run, while capable of low-latency serving, does not provide built-in VPC Service Controls or private endpoints for HIPAA-compliant data isolation; additional configuration would be needed and it lacks the managed ML serving optimizations of Vertex AI Prediction.

225
MCQhard

You are an ML engineer at a global e-commerce company. Your team has developed a deep learning model for product recommendation that runs on Vertex AI Prediction. The model is deployed on a single n1-highmem-2 instance (CPU only) with autoscaling enabled (min replicas=1, max replicas=10). During Black Friday, traffic spikes to 1000 requests per second (QPS), and you observe that latency increases from 50ms to over 5000ms, and many requests time out. You check the monitoring dashboard and see that CPU utilization is at 100% on the single instance, and autoscaling is not triggering quickly enough. The team has a budget for this service and wants to handle the spike without compromising latency. What should you do?

A.Switch to GPU instances (e.g., n1-standard-4 with T4) and set min replicas=2 with autoscaling up to 10
B.Increase min replicas to 5 to keep warm instances
C.Set min replicas=1 and max replicas=5 to control cost
D.Increase max replicas to 20 and keep CPU instances
AnswerA

GPUs accelerate inference, reducing per-request latency; warm instances handle spike.

Why this answer

Option A is correct because switching to GPU instances (n1-standard-4 with T4) offloads compute-intensive recommendation model inference to GPUs, significantly reducing per-request latency. Setting min replicas=2 ensures that at least two instances are always warm, reducing cold-start delays and allowing autoscaling to handle traffic spikes more responsively. This combination addresses both the CPU bottleneck and the slow scaling trigger, helping keep latency low even at 1000 QPS.

Exam trap

Google Cloud often tests the misconception that simply increasing the number of CPU instances or adjusting autoscaling parameters can solve a CPU-bound latency problem, when the real fix is to change the compute architecture (e.g., GPU) to match the workload's computational profile.

How to eliminate wrong answers

Option B is wrong because increasing min replicas to 5 on CPU-only instances does not resolve the fundamental CPU bottleneck; the model still runs on CPU, so each request will still suffer high latency under load, and the cost increases without performance gain. Option C is wrong because setting max replicas to 5 limits the maximum capacity to only 5 CPU instances, which cannot handle 1000 QPS without severe latency, and min replicas=1 still risks cold-start delays. Option D is wrong because increasing max replicas to 20 on CPU instances only adds more CPU-bound nodes, which still cannot process requests fast enough per instance due to the CPU bottleneck, leading to continued high latency and timeouts.
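The capacity reasoning behind these eliminations can be made concrete with a back-of-the-envelope calculation. The per-replica throughput figures and the target utilization below are hypothetical assumptions for illustration only (the question does not state them); real numbers would come from load testing the deployed model.

```python
import math


def replicas_needed(peak_qps, per_replica_qps, target_utilization=0.6):
    """Replicas required to serve peak_qps while leaving utilization headroom."""
    return math.ceil(peak_qps / (per_replica_qps * target_utilization))


PEAK_QPS = 1000  # from the scenario

# Hypothetical sustained throughput per replica; measure these in practice.
CPU_REPLICA_QPS = 20
GPU_REPLICA_QPS = 250

cpu_replicas = replicas_needed(PEAK_QPS, CPU_REPLICA_QPS)
gpu_replicas = replicas_needed(PEAK_QPS, GPU_REPLICA_QPS)

print("CPU replicas needed:", cpu_replicas)
print("GPU replicas needed:", gpu_replicas)
print("CPU fits within max replicas=20:", cpu_replicas <= 20)
print("GPU fits within max replicas=10:", gpu_replicas <= 10)
```

Under these assumed figures, no CPU replica ceiling offered in the options covers the peak, while the GPU configuration fits inside the existing maximum of 10.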
