PMLE Exam Questions and Answers

A startup has developed a prototype ML model using scikit-learn on a single machine. They now need to scale it to handle larger datasets and deploy it for real-time predictions. The team is small and wants minimal operational overhead. Which Google Cloud service should they use?

AI Platform Prediction

Vertex AI

Vertex AI provides managed training, deployment, and autoscaling with minimal operational overhead.

Cloud Functions

Compute Engine with TensorFlow Serving

Why: Vertex AI (option B) is the correct choice because it provides a unified, fully managed MLOps platform that integrates model training, deployment, and scaling with minimal operational overhead. It supports scikit-learn models natively, offers auto-scaling for real-time predictions, and eliminates the need to manage infrastructure, making it ideal for a small team transitioning from a prototype.

A data science team has trained a TensorFlow model on-premises using a large dataset. When they try to deploy the model to Vertex AI for online predictions, the deployed model fails to start with a ‘MemoryError’. The model artifact is 2 GB, and the machine type is n1-standard-4 (15 GB RAM). What is the most likely cause?

The model is stored in a regional bucket and the Vertex AI endpoint is in a different region.

The machine type does not support TensorFlow models larger than 1 GB.

The model is too large for the machine's memory, causing an out-of-memory (OOM) error during loading.

The 2 GB model may require more than 15 GB RAM during loading due to overhead and intermediate structures.

The model file is corrupted or missing dependencies, causing a crash.

Why: Option C is correct because the model artifact is 2 GB, and loading it into memory on an n1-standard-4 machine (15 GB RAM) can still cause a MemoryError. TensorFlow models often require additional memory for graph construction, intermediate tensors, and framework overhead, which can easily exceed the available RAM, especially when the model is loaded entirely into memory before serving.

A company has a prototype ML model that works well on historical data, but when deployed to production, the model performance degrades over time. The data distribution shifts gradually. Which strategy should they implement to maintain model accuracy?

Increase the regularization strength to prevent overfitting.

Increase the amount of training data by using more historical records.

Implement a retraining pipeline that periodically retrains the model on recent data.

Periodic retraining with fresh data helps the model adapt to gradual distribution shifts.

Switch to a more complex model architecture to better capture patterns.

Why: Option C is correct because a gradual shift in the data distribution (data drift) means the patterns the model learned from historical data slowly stop matching production. A retraining pipeline that periodically retrains on recent data keeps the model aligned with the current production distribution, directly addressing the degradation caused by drift without relying on static historical data.
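
As a rough illustration of how a retraining pipeline might decide that recent data has drifted, the sketch below computes a population stability index (PSI) for one numeric feature. The function names, the bin count, and the 0.2 threshold are assumptions chosen for the example, not part of any Google Cloud API.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float], recent: Sequence[float], bins: int = 10
) -> float:
    """Compare two samples of one feature; larger values mean more drift."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    base_frac, recent_frac = fractions(baseline), fractions(recent)
    return sum((r - b) * math.log(r / b) for b, r in zip(base_frac, recent_frac))


def should_retrain(psi: float, threshold: float = 0.2) -> bool:
    # 0.2 is a commonly quoted rule of thumb, used here as an assumption.
    return psi > threshold
```

A scheduled job could run a check like this per feature and start retraining only when the score crosses the threshold, or simply retrain on a fixed schedule and use the score for alerting.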

An ML engineer is scaling a prototype to production using Vertex AI Pipelines. The pipeline includes data validation, preprocessing, training, and deployment steps. They want to ensure that the pipeline can be reproduced and audited. What is the best practice?

Define the pipeline using Kubeflow Pipelines SDK and run it on Vertex AI Pipelines.

Vertex AI Pipelines automatically tracks artifacts, parameters, and lineage.

Use a Docker container with fixed tags and manually record runs.

Store all data and models in a single Cloud Storage bucket with no versioning.

Pin all library versions in a requirements.txt file.

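Why: Option A is correct because defining the pipeline with the Kubeflow Pipelines SDK and running it on Vertex AI Pipelines, a fully managed service, automatically tracks artifacts, parameters, and lineage, ensuring reproducibility and auditability. Option B (fixed Docker tags with manually recorded runs) gives environment consistency but no built-in tracking. Option C (a single unversioned bucket) keeps no history at all. Option D (pinning library versions) fixes dependencies but does not cover pipeline orchestration or lineage.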

A team has trained a sentiment analysis model using PyTorch on Vertex AI Training. They now want to deploy it for online predictions with low latency. Which TWO actions should they take? (Choose 2)

Create multiple model versions for A/B testing.

Use a machine type with a GPU for faster inference.

GPUs can accelerate inference for deep learning models.

Enable batch prediction instead of online prediction.

Convert the model to TensorFlow SavedModel format.

Package the model in a custom container with a web server (e.g., FastAPI).

Custom containers allow deploying PyTorch models on Vertex AI.

Why: Option B is correct because GPU-accelerated inference significantly reduces latency for deep learning models like sentiment analysis, especially when using PyTorch, which has native CUDA support. Vertex AI Prediction supports GPU machine types (e.g., n1-standard-4 with NVIDIA T4) that can process batched requests faster than CPUs, directly addressing the low-latency requirement. Option E is correct because a custom container with a web server (e.g., FastAPI) lets the team deploy the PyTorch model on Vertex AI as-is, without converting it to another format.
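
To make the custom-container option concrete, here is a minimal sketch of the HTTP contract such a container has to satisfy. It uses the standard library's `http.server` instead of FastAPI purely to stay self-contained. The `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE`, and `AIP_PREDICT_ROUTE` variables are set by Vertex AI in the serving container; the `score` function is a hypothetical stand-in for the real PyTorch model, and the local default routes are assumptions.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Vertex AI sets these variables in the container; the defaults are for local runs.
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")


def score(text: str) -> float:
    """Hypothetical stand-in for the PyTorch model's forward pass."""
    return 1.0 if "good" in text.lower() else 0.0


def handle_predict(body: bytes) -> bytes:
    """Map a {"instances": [...]} request to a {"predictions": [...]} response."""
    instances = json.loads(body)["instances"]
    return json.dumps({"predictions": [score(t) for t in instances]}).encode()


class Handler(BaseHTTPRequestHandler):
    def _send(self, status: int, payload: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self) -> None:
        # The health route should return 200 once the model is ready to serve.
        self._send(200 if self.path == HEALTH_ROUTE else 404, b"{}")

    def do_POST(self) -> None:
        if self.path != PREDICT_ROUTE:
            self._send(404, b"{}")
            return
        length = int(self.headers.get("Content-Length", 0))
        self._send(200, handle_predict(self.rfile.read(length)))


def serve() -> None:
    """Container entry point: block and serve requests on the configured port."""
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

In a real image the model would be loaded once at startup rather than per request, and `serve()` would be the container's entry point.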

A company has a prototype ML model that predicts equipment failure. They want to deploy it to production using Vertex AI. The model must be retrained weekly with new data. They also need to monitor for data drift and model performance. Which THREE components should they include in their MLOps pipeline? (Choose 3)

A scheduled training pipeline that retrains the model weekly.

Scheduled retraining is essential for keeping the model up-to-date.

A manual QA step where data scientists approve each deployment.

A manual review of new data before it is used for training.

An automated trigger that redeploys the model when performance drops below a threshold.

Automated redeployment based on performance ensures quick recovery.

A monitoring system that checks for data drift and triggers alerts.

Monitoring is critical for detecting when the model degrades.

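Why: Option A is correct because the requirement specifies weekly retraining, which is best implemented as a scheduled training pipeline in Vertex AI, using Cloud Scheduler or a recurring Vertex AI Pipelines run. This automates the retraining process without manual intervention, ensuring the model stays current with new data. Option D is correct because an automated trigger that redeploys the model when performance drops below a threshold ensures quick recovery. Option E is correct because a monitoring system that checks for data drift and raises alerts is how the team detects that the model is degrading in the first place.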
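
The monitoring and automated-trigger components can be pictured as a small decision function that runs after each evaluation. This is only a sketch: the `MonitoringSnapshot` type, the action names, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitoringSnapshot:
    accuracy: float     # latest evaluation metric on labelled production data
    drift_score: float  # distance between training and serving distributions


def decide_actions(
    snapshot: MonitoringSnapshot,
    min_accuracy: float = 0.90,  # hypothetical threshold
    max_drift: float = 0.2,      # hypothetical threshold
) -> List[str]:
    """Return the follow-up actions implied by one monitoring snapshot."""
    actions = []
    if snapshot.drift_score > max_drift:
        actions.append("alert_drift")
    if snapshot.accuracy < min_accuracy:
        actions.append("trigger_retrain_and_redeploy")
    return actions
```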

Domain 2: Automating and orchestrating ML pipelines

An MLOps team is implementing a CI/CD pipeline for a TensorFlow model on Vertex AI. The model training job takes 2 hours and produces a SavedModel. The team wants to automatically trigger a new pipeline run whenever a change is pushed to the 'main' branch of their source repository. The pipeline should include training, evaluation, and if metrics exceed a threshold, deploy the model to a Vertex AI endpoint. Which trigger configuration should they use?

Use Eventarc to listen for Cloud Source Repository push events and invoke a Cloud Run service that starts the pipeline.

Use an Artifact Registry trigger to detect new model images and then start the pipeline.

Set up a Cloud Scheduler job that runs every 2 hours and triggers a Vertex AI Pipeline run.

Configure a Cloud Build trigger that watches the 'main' branch of Cloud Source Repositories; in the build config, use steps to run the pipeline via the Vertex AI API.

Cloud Build triggers are designed for source code events and can orchestrate ML pipelines.

Why: Option D is correct because a Cloud Build trigger can watch a specific branch (e.g., 'main') in Cloud Source Repositories and automatically execute a build configuration on every push. Within that build config, `gcloud` or `curl` steps can call the Vertex AI API to start the pipeline run, which then performs the training, evaluation, and conditional deployment. This directly matches the requirement for a branch-based push trigger that orchestrates the full ML pipeline.

A data science team is deploying a PyTorch model for real-time inference using Vertex AI Endpoints. The model requires a custom container with specific CUDA drivers and Python packages. They have created a Docker image and pushed it to Artifact Registry. The pipeline should automatically retrain the model every week and deploy the new version if it passes validation. However, the deployment step fails intermittently with the error 'The container image is not compatible with the machine type.' What is the most likely cause?

The service account does not have permission to pull the container from Artifact Registry.

The container image requires GPU support but the machine type specified in the endpoint is a CPU-only machine.

CUDA drivers require GPU machines; using a CPU machine causes compatibility error.

The container's health check endpoint is not responding correctly.

The model artifact size exceeds the maximum allowed for the machine type.

Why: The error 'The container image is not compatible with the machine type' indicates a mismatch between the container's hardware requirements and the machine type selected for the Vertex AI Endpoint. Since the custom container requires specific CUDA drivers, it is built for GPU acceleration. If the endpoint is configured with a CPU-only machine type (e.g., n1-standard-4), the container will fail to run because the GPU drivers cannot initialize, triggering this incompatibility error.

An ML engineer is using Vertex AI Pipelines with Kubeflow Pipelines SDK (KFP) to orchestrate a training and deployment workflow. They want to reuse a custom component across multiple pipelines. The component is defined in a Python file 'preprocess.py' that includes a function decorated with @kfp.components.create_component_from_func. How should they package this component for reuse?

Import the preprocess module and call create_component_from_func on the function, then use the resulting component in pipeline definitions.

This allows the component to be defined once and reused.

Save the component as a YAML file using kfp.components.ComponentStore and load it in other pipelines.

Compile the pipeline that uses the component into a JSON file and upload it to Vertex AI.

Build a custom container image with the function and use it as a base image in other pipelines.

Why: Option A is correct because the recommended way to reuse a custom component defined via `@kfp.components.create_component_from_func` is to import the Python module containing the decorated function and call `create_component_from_func` on that function in each pipeline definition. This creates a reusable component object that can be used directly in the pipeline's `@dsl.pipeline` definition without additional packaging steps. The KFP SDK treats the function as the source of truth, and re-importing ensures the component logic is always current.

A company has a Vertex AI pipeline that trains a model on streaming data from Pub/Sub. The pipeline is triggered by a Cloud Function when new data arrives. Recently, jobs have been failing with 'ResourceExhausted: Quota limit exceeded for regional CPUs in us-central1.' The team needs to ensure successful job execution while minimizing changes. Which approach should they take?

Request a quota increase from Google Cloud Support.

Change the pipeline to run in a different region with available quota.

Reduce the number of parallel pipeline runs by using a Cloud Tasks queue with rate limiting.

Configure the pipeline's training job to use preemptible VMs (which count toward a separate, usually higher quota).

Preemptible VMs have a separate quota and are cheaper.

Why: Option D is correct because preemptible VMs count toward a separate, often higher quota for 'Preemptible CPUs' rather than the standard regional CPU quota. By configuring the training job to use preemptible VMs, the team can work around the exhausted quota without requesting a limit increase or changing the pipeline architecture, which keeps changes to a minimum. The trade-off is that preemptible VMs can be reclaimed at any time, so the training job should checkpoint and tolerate restarts.

An ML team is designing an automated pipeline to retrain a recommendation model every day using new user interaction data stored in BigQuery. The pipeline must be cost-efficient, scalable, and require minimal manual intervention. Which two approaches should they consider?

Deploy a custom Kubernetes cron job on GKE to run the training script directly.

Use Cloud Composer (Airflow) to schedule the pipeline with a DAG.

Use Cloud Scheduler to publish a Pub/Sub message daily, which triggers a Cloud Function that starts the Vertex AI Pipeline.

This provides automated daily triggering with minimal overhead.

Use Dataflow to continuously read from BigQuery and trigger training when new data arrives.

Use Vertex AI Pipelines to define the workflow and preemptible VMs for training to reduce cost.

Preemptible VMs are cost-effective and Vertex AI Pipelines orchestrates the workflow.

Why: Option C is correct because Cloud Scheduler triggers a Pub/Sub message that invokes a Cloud Function, which starts a Vertex AI Pipeline. This serverless approach is cost-efficient (no idle compute), scales automatically, and requires minimal manual intervention. Option E is correct because Vertex AI Pipelines natively orchestrates ML workflows, and using preemptible VMs reduces training costs by up to 80% while maintaining scalability.
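
A minimal sketch of the Cloud Function side of this flow follows. It assumes a Pub/Sub-triggered background function with the `(event, context)` signature, where the message body arrives base64-encoded under `data`. The `make_handler` helper and the injected `submit_pipeline` callable are hypothetical; the callable stands in for whatever actually starts the Vertex AI pipeline run.

```python
import base64
import json
from typing import Any, Callable, Dict


def make_handler(submit_pipeline: Callable[[Dict[str, Any]], str]):
    """Build a Pub/Sub-triggered entry point around an injected submit function."""

    def on_message(event: Dict[str, Any], context: Any) -> str:
        # Pub/Sub delivers the message body base64-encoded under "data".
        raw = base64.b64decode(event.get("data", "")).decode("utf-8") or "{}"
        params = json.loads(raw)
        # submit_pipeline is where a real function would start the pipeline run.
        return submit_pipeline(params)

    return on_message
```

Injecting the submit function keeps the handler testable without cloud credentials; in a deployed function it might wrap the Vertex AI SDK's `PipelineJob`.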

You are an ML engineer at a large e-commerce company. Your team has developed a product recommendation model using TensorFlow and deployed it on Vertex AI Endpoints for real-time inference. The model is retrained weekly using a Vertex AI Pipeline that reads new user interaction data from BigQuery, trains the model, evaluates it, and deploys the new version to the endpoint with a traffic split: 10% to the new model and 90% to the previous champion model. Recently, the team noticed that the new model's online prediction latency has increased significantly (from 50ms to 200ms) after deployment, causing timeouts for some requests. The training code has not changed, and the model size is similar. The pipeline uses a custom container with the same TensorFlow Serving image as before. The deployment step uses the same machine type (n1-standard-4) for the endpoint. What is the most likely cause of the latency increase?

The endpoint is using a machine type that is not optimized for the new model's computation.

The new model has a significantly different architecture that requires more computation.

The pipeline now includes a data validation step that modifies the SavedModel's serving signature, adding an extra preprocessing operation.

A data validation step might have inadvertently added preprocessing ops, increasing latency.

The new model is experiencing data skew because the training data distribution has changed.

Why: Option C is correct because the pipeline now includes a data validation step that modifies the SavedModel's serving signature, adding an extra preprocessing operation. This additional operation runs during inference on Vertex AI Endpoints, increasing the per-request latency from 50ms to 200ms, even though the model architecture and size remain unchanged. The custom container and machine type are identical, so the latency increase must stem from a change in the serving graph itself.

Domain 3: Collaborating within and across teams to manage data and models

A data science team uses a shared Cloud Storage bucket to store training datasets. They notice that some team members accidentally overwrite existing datasets, causing issues with reproducibility. Which approach best prevents accidental overwrites while maintaining collaboration?

Use a single shared service account with strict IAM roles that allow only append operations.

Require team members to manually rename files before uploading.

Set bucket permissions to read-only for all team members except the data owner.

Enable object versioning on the bucket and use lifecycle rules to manage versions.

Versioning allows recovery of previous versions if overwritten.

Why: Option D is correct because enabling object versioning on a Cloud Storage bucket preserves all versions of an object, so even if a team member overwrites a dataset, the previous version remains accessible. This maintains collaboration (anyone can upload) while preventing permanent data loss. Lifecycle rules can then be used to manage storage costs by automatically deleting old versions after a specified period.

A machine learning engineer needs to share a trained model with the product team for integration. The model is stored in Cloud Storage, and the product team’s service account needs read access. The engineer wants to follow the principle of least privilege. Which IAM configuration should be used?

Generate a signed URL with read access and share it with the product team.

Grant the product team's service account the roles/storage.objectViewer role at the bucket level.

Bucket-level grants read access to objects in that bucket only, following least privilege.

Grant the product team's service account the roles/storage.objectAdmin role at the bucket level.

Grant the product team's service account the roles/storage.objectViewer role at the project level.

Why: Option B is correct because granting the product team's service account the roles/storage.objectViewer role at the bucket level provides read-only access to objects in that specific bucket, adhering to the principle of least privilege. This role allows the service account to list and read objects without granting broader permissions, such as modifying or deleting them, and scoping it to the bucket prevents unnecessary access to other buckets in the project.

A team is using Vertex AI Pipelines to automate their ML workflow. They want to ensure that pipeline runs are reproducible and that artifacts are tracked. Which feature should they use?

Vertex AI Feature Store

Vertex AI Experiments

Experiments track parameters, metrics, and artifacts for each run.

Vertex AI Model Registry

Vertex AI Endpoints

Why: Vertex AI Experiments is the correct feature because it captures parameters, metrics, and artifacts for each pipeline run, enabling reproducibility and lineage tracking. This directly supports the team's need to ensure runs are reproducible and artifacts are tracked, as Experiments automatically logs metadata for every execution.

A team of data scientists and ML engineers is collaborating on a project using Vertex AI Workbench. They need to share notebooks and code, but want to avoid conflicts and maintain a history of changes. Which approach should they use?

Email notebook files to each other and manually merge changes.

Store notebooks in a shared Cloud Storage bucket and access them simultaneously.

Use Vertex AI Experiments to share notebook outputs.

Use a git repository (e.g., Cloud Source Repositories) to manage code and notebooks.

Git provides branching, merging, and history.

Why: Option D is correct because using a git repository (e.g., Cloud Source Repositories) provides version control, branching, and a full history of changes, which is essential for collaborative development. This approach avoids conflicts by allowing team members to work on separate branches and merge changes systematically, unlike shared storage or manual methods that lack conflict resolution and audit trails.

A machine learning team is deploying a model for real-time predictions using Vertex AI. They need to ensure that the deployment follows best practices for collaboration and governance. Which TWO actions should they take?

Use a continuous integration/continuous deployment (CI/CD) pipeline to deploy model versions.

CI/CD ensures consistent, repeatable deployments.

Store all model artifacts in a local file system to reduce latency.

Enable model monitoring to detect data drift and performance degradation.

Monitoring is essential for ongoing model quality.

Manually configure autoscaling parameters for the endpoint.

Allow any team member to deploy directly to production without review.

Why: Option A is correct because using a CI/CD pipeline for deploying model versions ensures automated, repeatable, and auditable deployments, which is a best practice for collaboration and governance. This approach enforces version control, testing, and approval gates, reducing the risk of errors and enabling rollback if needed. Option C is correct because model monitoring detects data drift and performance degradation after deployment, which is essential for ongoing model quality.

A financial services company uses Vertex AI Pipelines to train and deploy models for fraud detection. The ML team consists of data scientists who develop models and ML engineers who deploy them. They use a CI/CD pipeline with Cloud Build to build and push Docker images to Artifact Registry, then trigger Vertex AI Pipelines. Recently, the team noticed that a model deployed to production was trained on a dataset that had not been approved by the data governance team. Upon investigation, they found that a data scientist accidentally used an unapproved version of the training data by specifying a Cloud Storage path that was not the latest approved dataset. The company needs to enforce that only approved datasets are used in training jobs. Which approach should they take?

Implement a manual approval process where data scientists request dataset paths from the data governance team before each training run.

After training, run a validation step that checks if the dataset used matches the latest approved version, and roll back if not.

Use a curated dataset registry in BigQuery or Cloud Storage with IAM conditions that allow access only to datasets tagged as 'approved'. Modify the CI/CD pipeline to pass only approved dataset references to the training job.

This automates governance by restricting training to approved datasets via IAM and pipeline configuration.

Restrict all Cloud Storage buckets to be read-only for the data scientists, and have ML engineers copy approved datasets to a separate bucket.

Why: Option C is correct because it enforces governance at the source by using IAM conditions to restrict access to only approved datasets, preventing unauthorized data from being used in training. This approach integrates with the CI/CD pipeline to automatically pass only approved dataset references, eliminating the risk of human error in specifying Cloud Storage paths.
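
One way such an IAM condition might look is sketched below: a binding that grants read access only to objects under an `approved/` prefix. Using a name prefix to mark approved datasets is an assumption made for this example (the question speaks of datasets "tagged" as approved), and the bucket, member, and title values are hypothetical. Conditions on Cloud Storage bindings also require uniform bucket-level access on the bucket.

```python
from typing import Any, Dict


def approved_only_binding(
    bucket: str, member: str, prefix: str = "approved/"
) -> Dict[str, Any]:
    """Build an IAM binding that grants read access only under one object prefix."""
    resource_prefix = f"projects/_/buckets/{bucket}/objects/{prefix}"
    return {
        "role": "roles/storage.objectViewer",
        "members": [member],
        "condition": {
            "title": "approved-datasets-only",
            # CEL expression evaluated against the full object resource name.
            "expression": f'resource.name.startsWith("{resource_prefix}")',
        },
    }
```

The resulting dictionary has the shape of one entry in a bucket IAM policy's `bindings` list; the CI/CD pipeline would then pass only paths under the approved prefix to the training job.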

Domain 4: Architecting low-code ML solutions

A retail company wants to build a product recommendation system using BigQuery ML for their e-commerce platform. The data includes customer purchase history, product metadata, and clickstream logs. The ML engineer needs to minimize manual feature engineering and leverage pre-built solutions. Which approach should the engineer take?

Use a pre-built recommendation model from Vertex AI Model Garden and deploy it to an endpoint.

Write a custom TensorFlow model using the Vertex AI Training service and deploy it via Vertex AI Prediction.

Export the data to CSV and use AutoML Tables to train a recommendation model.

Use BigQuery ML's matrix factorization model (CREATE MODEL with model_type='matrix_factorization') to train directly on historical interaction data.

BigQuery ML provides low-code matrix factorization for recommendations.

Why: Option D is correct because BigQuery ML's matrix factorization model (model_type='matrix_factorization') is purpose-built for recommendation systems using implicit or explicit feedback data. It trains directly on historical interaction data (e.g., user-item purchases) without requiring manual feature engineering, which matches the goal of a low-code solution with minimal manual effort. This approach leverages BigQuery's native SQL interface and scales automatically, making it ideal for the described e-commerce scenario.
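
For reference, a matrix factorization training statement might look like the one generated below. The helper only builds the SQL text; the model name, table name, and column names (`user_id`, `product_id`, `interaction_score`) are hypothetical, and implicit feedback is assumed because the data consists of purchases and clicks rather than explicit ratings.

```python
def matrix_factorization_sql(model: str, source_table: str) -> str:
    """Return a BigQuery ML CREATE MODEL statement for an implicit-feedback recommender."""
    return f"""
CREATE OR REPLACE MODEL `{model}`
OPTIONS (
  model_type = 'matrix_factorization',
  feedback_type = 'implicit',
  user_col = 'user_id',
  item_col = 'product_id',
  rating_col = 'interaction_score'
) AS
SELECT user_id, product_id, interaction_score
FROM `{source_table}`
""".strip()
```

The string can be pasted into the BigQuery console or submitted through any BigQuery client; once trained, recommendations are typically read back with `ML.RECOMMEND`.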

A data scientist wants to quickly train a binary classification model on a tabular dataset stored in BigQuery without writing any code. They have limited ML experience. Which Google Cloud service should they use?

Vertex AI Workbench with a built-in scikit-learn notebook.

Dataflow with a TensorFlow pipeline.

BigQuery ML with CREATE MODEL statement using SQL.

BigQuery ML enables model creation with SQL, no coding required.

AutoML Tables with a direct BigQuery connection.

Why: Option C is correct because BigQuery ML allows a data scientist to train a binary classification model directly in BigQuery using a `CREATE MODEL` SQL statement, without writing any code or moving data. This is the fastest low-code approach for users with limited ML experience, as it leverages familiar SQL syntax and runs entirely within BigQuery's serverless infrastructure.

A company uses Vertex AI Pipelines to orchestrate their ML training workflow. The pipeline includes a BigQuery ML training step, a model evaluation step, and a deployment step to Vertex AI Endpoints. The engineer notices that the pipeline fails intermittently due to a quota exceeded error on Vertex AI Endpoints during model deployment. What is the best long-term solution to prevent this failure?

Run the pipeline steps sequentially with longer wait times.

Add retry logic with exponential backoff to the deployment step in the pipeline.

Handles transient quota errors gracefully without manual intervention.

Switch to deploying models using a custom container on Compute Engine.

Request a permanent quota increase for Vertex AI Endpoints.

Why: Option B is correct because implementing retry logic with exponential backoff is a resilient pattern for transient quota errors. Option A is wrong because running steps sequentially with longer waits does not prevent quota errors. Option C is wrong because deploying a custom container on Compute Engine does not address quota limits. Option D is wrong because a quota increase requires a support request and may not be granted immediately.
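
The retry pattern itself can be sketched as a small wrapper around the deployment call. `QuotaExceededError` is a hypothetical stand-in for whatever exception the real deploy call raises on a quota error, and the delay values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the quota error raised by the deploy call."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying quota errors with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the pipeline
            # Delay doubles each attempt, is capped, and gets jitter to spread retries.
            delay = min(base_delay * 2 ** attempt, max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the wrapper can be exercised in tests without actually waiting.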

A manufacturing company wants to predict equipment failure using sensor data stored in BigQuery. They have limited ML expertise and want to use AutoML Tables. The data includes timestamps, numerical sensor readings, and a boolean 'failure' column. The dataset is highly imbalanced with only 1% failure cases. Which of the following is the most effective approach to handle the imbalance in AutoML Tables?

Let AutoML Tables handle the imbalance automatically; it has built-in techniques for class imbalance.

AutoML Tables automatically adjusts for imbalance.

Downsample the majority class to balance the dataset.

Use a custom loss function in the training configuration.

Oversample the minority class using SQL before training.

Why: AutoML Tables has built-in techniques to handle class imbalance, such as automatically adjusting class weights and using stratified sampling during training. This allows the model to learn from the minority class without requiring manual data preprocessing, making it the most effective and simplest approach for users with limited ML expertise.
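
To show what "adjusting class weights" means in general, the sketch below computes the common inverse-frequency weighting, `n_samples / (n_classes * class_count)`. It illustrates the idea only and is not a description of what AutoML Tables does internally.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}
```

With 1% failure cases, each failure row ends up weighted roughly a hundred times more heavily than a non-failure row, so the minority class contributes about as much to the loss as the majority class.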

A marketing team wants to use a pre-built natural language processing (NLP) model from Vertex AI Model Garden to analyze customer feedback. They need to extract sentiment from text data stored in Cloud Storage. The team has no experience with model serving infrastructure. Which deployment option minimizes operational overhead?

Deploy the model as a Cloud Function invoked by Cloud Storage events.

Deploy the model as a Cloud Run service using a custom Docker container.

Deploy the model on App Engine flexible environment.

Deploy the model to a Vertex AI Endpoint directly from Model Garden.

Simplest deployment with managed infrastructure.

Why: Option D is correct because deploying directly to a Vertex AI Endpoint from Model Garden eliminates all infrastructure management. Vertex AI handles model serving, scaling, and monitoring automatically, which is ideal for a team with no experience in model serving infrastructure. This is a fully managed, serverless deployment that requires no containerization or server configuration.

A financial institution uses BigQuery ML to train a linear regression model to predict loan default risk. The model is trained on a dataset with 100 million rows and 50 features. During inference, the engineer uses the ML.PREDICT function. However, the query takes several minutes to run and times out frequently. The data is static and updated monthly. What is the most cost-effective and low-code solution to improve prediction latency?

Export the trained model as a SQL function using the EXPORT MODEL statement, then use it for predictions.

Exports model as a persistent function for faster inference.

Create a Dataflow pipeline to precompute predictions and store them in a separate table.

Use a materialized view to precompute the prediction features.

Increase the BigQuery compute capacity by reserving more slots.

Why: Option A is correct because exporting the trained model as a SQL function via `EXPORT MODEL` converts the linear regression coefficients into a persistent SQL UDF, eliminating the overhead of model loading and serialization during each `ML.PREDICT` call. This approach is low-code (no external pipeline) and cost-effective since predictions are executed as standard SQL without consuming BigQuery ML slot resources for model inference.

Domain 5: Collaborating to manage data and models

A data science team uses BigQuery to store raw data and Vertex AI for model training. They want to ensure that only authorized users can access training data, and that model artifacts are automatically versioned and tracked. Which combination of Google Cloud services should they use?

Dataflow for data access control and Vertex AI Experiments for model tracking

Cloud Storage with bucket-level IAM and Cloud Build for versioning

Cloud Composer for data access control and Cloud Source Repositories for model versioning

Vertex AI Feature Store with access control and Vertex AI ML Metadata for model versioning

Vertex AI Feature Store provides controlled access to features, and ML Metadata tracks model artifacts and versions.

Why: Vertex AI Feature Store provides fine-grained access control to training data, ensuring only authorized users can access it. Vertex AI ML Metadata automatically tracks and versions model artifacts, lineage, and parameters, which aligns with the requirement for automated versioning and tracking.

An ML team uses Vertex AI Pipelines to automate model retraining. The pipeline includes a step that queries BigQuery to create a training dataset. The team notices that the pipeline fails intermittently with a '403 Exceeded rate limits' error. What is the most likely cause and solution?

The pipeline is issuing too many concurrent queries; use a BigQuery reservation to guarantee slot capacity

Reservations provide dedicated slots, avoiding API rate limits.

The training dataset is too large; partition the table and query only the latest partition

The pipeline step timeout is too short; increase the timeout to 30 minutes

The SQL query is inefficient; rewrite it using materialized views

Why: The 403 'Exceeded rate limits' error in BigQuery indicates that the project is hitting the concurrent query rate limit or the rate of bytes read per second. Using a BigQuery reservation guarantees dedicated slot capacity, which prevents rate-limit errors by ensuring the pipeline has consistent compute resources regardless of other workloads in the project. This is the most direct solution because rate limits are enforced at the project level based on available slots, and a reservation provides a fixed number of slots that bypass those limits.

A company stores training data in Cloud Storage and uses Vertex AI Training for model training. They want to implement a data validation pipeline to detect data drift before retraining. Which service should they use?

Vertex AI Model Monitoring

Vertex AI Model Monitoring can detect data drift by comparing distributions.

BigQuery ML

Cloud Data Loss Prevention

Dataflow

Why: Vertex AI Model Monitoring is designed specifically to detect data drift and feature skew in production ML models by continuously comparing prediction requests against a baseline training dataset. It provides automated alerts when statistical distributions shift beyond a defined threshold, making it the correct choice for a data validation pipeline before retraining.

A team uses Vertex AI Feature Store to serve features for real-time predictions. They notice that feature values are frequently updated from multiple source systems, leading to inconsistencies. They need to ensure that feature values are consistent across all serving endpoints. What should they do?

Use batch ingestion with weekly updates to reduce update frequency

Increase the offline storage TTL to retain historical feature values

Implement a manual approval process for feature updates

Use a streaming ingestion pipeline with exactly-once semantics

Exactly-once streaming ensures each update is applied exactly once, maintaining consistency.

Why: Option D is correct because streaming ingestion with exactly-once semantics ensures that each feature update is applied precisely once, preventing duplicates or missed updates that cause inconsistencies. This approach synchronizes feature values across all serving endpoints in near real-time, directly addressing the problem of frequent updates from multiple source systems.
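
One common way to get exactly-once effects on the writing side is to make every update idempotent and to deduplicate by event ID. The following is a minimal sketch under that assumption; `FeatureUpdate` and `FeatureTable` are hypothetical names for illustration and are not part of the Vertex AI Feature Store API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureUpdate:
    event_id: str      # unique per update, assumed to be set by the source system
    entity_id: str
    feature: str
    value: float
    event_time: float  # seconds since epoch


@dataclass
class FeatureTable:
    """In-memory stand-in for an online store (hypothetical)."""
    values: dict = field(default_factory=dict)  # (entity, feature) -> (event_time, value)
    seen: set = field(default_factory=set)      # event IDs already processed

    def apply(self, update: FeatureUpdate) -> bool:
        """Apply an update at most once; return True if the stored value changed."""
        if update.event_id in self.seen:
            return False  # duplicate delivery, ignore it
        self.seen.add(update.event_id)
        key = (update.entity_id, update.feature)
        current = self.values.get(key)
        # Keep the newest value by event time so a late arrival from another
        # source system cannot overwrite fresher data.
        if current is None or update.event_time >= current[0]:
            self.values[key] = (update.event_time, update.value)
            return True
        return False


table = FeatureTable()
u1 = FeatureUpdate("evt-1", "user_42", "avg_basket", 31.5, event_time=100.0)
u2 = FeatureUpdate("evt-2", "user_42", "avg_basket", 29.0, event_time=90.0)  # stale
results = [table.apply(u1), table.apply(u1), table.apply(u2)]
final_value = table.values[("user_42", "avg_basket")][1]
```

The redelivered `u1` and the stale `u2` are both rejected, so every reader of the table sees the same value no matter how many times or in which order the sources send their updates.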

An organization uses Cloud Composer to orchestrate ML workflows. A DAG that triggers Vertex AI training jobs fails because the training job exceeds the 7-day maximum runtime. What is the best way to handle long-running training jobs in Cloud Composer?

Increase the DAG execution timeout to 14 days in the Airflow configuration

Use Vertex AI Pipelines to manage the training job asynchronously

Vertex AI Pipelines can handle long-running jobs independently of the DAG runtime.

Refactor the training job to run on Dataflow, which supports longer runtimes

Set max_active_runs=1 in the DAG to prevent overlapping runs

Why: Option B is correct because Vertex AI Pipelines natively supports asynchronous execution, allowing Cloud Composer to trigger a pipeline and monitor its status without blocking the Airflow worker for the entire duration of the training job. This decouples the DAG execution timeout from the training runtime, enabling workflows that exceed the 7-day Airflow task timeout limit.

A team wants to share a trained model with other teams within the organization. They need to provide access to the model artifact in Vertex AI Model Registry and ensure that only authorized teams can deploy the model. What should they do?

Grant the other teams access to the Cloud Storage bucket where the model is stored

Set the model to public in Vertex AI Model Registry

Use Cloud Key Management Service to encrypt the model and share the decryption key

Use IAM to grant the 'aiplatform.models.deploy' role to the other teams on the model resource

IAM roles provide fine-grained access control within Vertex AI.

Why: Option D is correct because Vertex AI Model Registry uses IAM to control access to model resources. By granting the 'aiplatform.models.deploy' role on the specific model resource, you ensure that only authorized teams can deploy the model, while other operations (like viewing or updating) remain restricted. This follows the principle of least privilege and avoids exposing the model artifact broadly.

Domain 6: Serving and scaling models

A company deploys a TensorFlow model on Vertex AI Prediction with a single node. During peak hours, inference latency increases. What should they do first to reduce latency?

Enable autoscaling for the deployment

Autoscaling adds nodes during peak traffic, reducing latency.

Increase the machine type of the node

Decrease the min replicas to 0

Enable automatic batching of requests

Why: Enabling autoscaling for the deployment is the correct first step because it allows Vertex AI Prediction to dynamically adjust the number of replicas based on incoming traffic. During peak hours, autoscaling can add more nodes to distribute the inference load, directly reducing latency without requiring manual intervention or over-provisioning.

A data science team deploys a PyTorch model using Vertex AI Prediction. The model requires GPU for inference, but they notice high costs and underutilized GPUs during off-peak hours. What is the most cost-effective solution?

Move the model to Cloud Functions

Use a GPU instance with a fixed number of replicas

Use a GPU instance with min replicas=0 and autoscaling

Scales down to zero when unused, saving costs.

Switch to a CPU-only machine type

Why: Option C is correct because setting min replicas to 0 allows Vertex AI Prediction to scale down to zero instances during off-peak hours, eliminating GPU costs when no requests are being served. Combined with autoscaling, the deployment spins up GPU-backed instances on demand only when traffic arrives, which directly addresses the underutilization issue. The trade-off is cold-start latency: the first requests after an idle period wait while a replica is provisioned and the model is loaded.

A company serves a scikit-learn model on Vertex AI Prediction but receives a 400 error with 'Prediction failed: Model evaluation error'. What is the most likely cause?

The input data format is incorrect

The model was trained with a different framework

The model uses a scikit-learn version not supported by Vertex AI

Version mismatch causes evaluation failure.

The endpoint is overloaded and timing out

Why: Vertex AI Prediction supports specific versions of scikit-learn for serving models. If the model was trained with a version that is not in the supported list (e.g., 0.19, 0.20, 0.22, 0.23, 0.24, 1.0, 1.1), the prediction endpoint will fail with a 'Model evaluation error' because the underlying runtime cannot load the serialized model (e.g., pickle or joblib file). This is the most likely cause of a 400 error when the input format is otherwise correct.

A company wants to serve a large XGBoost model that exceeds the 2GB limit for Vertex AI Prediction. What should they do?

Reduce model size by removing features

Compress the model using gzip and upload

Deploy the model on Cloud Run Functions

Use a custom container to serve the model

Custom containers have no size limit.

Why: Vertex AI Prediction has a 2GB limit for the model artifact when using pre-built containers. A custom container bypasses this limit because you package the model and serving code into a Docker image, which can be arbitrarily large. This allows you to serve XGBoost models exceeding 2GB without size constraints imposed by the managed serving infrastructure.

A company deploys a model on Vertex AI Prediction with autoscaling enabled. They notice that during a traffic spike, new instances take several minutes to become available, causing high latency. What is the best solution?

Disable autoscaling and use a fixed number of replicas

Increase the max replicas setting

Decrease the machine type to reduce provisioning time

Set a higher min replicas to maintain a baseline of warm instances

Warm instances reduce latency during spikes.

Why: Option D is correct because setting a higher min replicas ensures that a baseline number of instances are always warm and ready to serve traffic. During a traffic spike, new instances still take time to provision (cold start), but the warm instances handle the initial surge without latency spikes. This directly addresses the observed high latency during spikes.

A company uses Vertex AI Prediction with a custom container for a TensorFlow model. They notice that after deploying a new model version, requests still go to the old version. What is the most likely cause?

The custom container is not compatible with Vertex AI

The model is cached and needs cache invalidation

Traffic is not split to the new model version

Traffic splitting must be adjusted to route to the new version.

The new model version was not deployed to the same endpoint

Why: In Vertex AI Prediction, when you deploy a new model version to an existing endpoint, you must explicitly allocate traffic to it. By default, the new version receives 0% traffic, so all requests continue to be served by the old version. The correct fix is to update the endpoint's traffic split, for example via the console or the `gcloud ai endpoints update` command with the `--traffic-split` flag.
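
A traffic split is a mapping from deployed model ID to a percentage, and the percentages must sum to 100. The sketch below shows one way to compute a new split before applying it to the endpoint. The `shift_traffic` helper and the model IDs are hypothetical and only illustrate the arithmetic; they are not part of any Google Cloud SDK.

```python
def shift_traffic(current: dict[str, int], new_id: str, percent: int) -> dict[str, int]:
    """Give `percent` of traffic to `new_id` and scale the other deployed
    models proportionally so that the result still sums to 100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    others = {k: v for k, v in current.items() if k != new_id}
    remaining = 100 - percent
    total = sum(others.values())
    if total == 0 and remaining > 0:
        raise ValueError("no existing model has traffic to absorb the remainder")
    scaled = {k: (v * remaining // total if total else 0) for k, v in others.items()}
    # Integer division can leave a few points unassigned; give them to the
    # model that currently holds the largest share.
    leftover = remaining - sum(scaled.values())
    if leftover:
        largest = max(scaled, key=scaled.get)
        scaled[largest] += leftover
    scaled[new_id] = percent
    return scaled


# State right after deploying a new version: it receives no traffic yet.
after_deploy = {"old-model-id": 100, "new-model-id": 0}
canary = shift_traffic(after_deploy, "new-model-id", 10)
cutover = shift_traffic(canary, "new-model-id", 100)
```

Starting with a small canary share and moving to a full cutover only after the new version looks healthy keeps a rollback as simple as restoring the previous mapping.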

Domain 7: Monitoring ML solutions

You have deployed a regression model that predicts house prices. Over the past month, the model's predictions have been consistently too high. You suspect data drift in the input features. Which monitoring metric should you prioritize to confirm this?

Monitor prediction drift (prediction distribution)

Monitor feature distribution drift using a divergence metric like Jensen-Shannon divergence

Feature drift measures input distribution change.

Monitor feature attribution drift using SHAP values

Monitor residual distribution drift

Why: Option B is correct because the question describes a scenario where predictions are consistently too high, which is a symptom of data drift—a change in the distribution of input features. Monitoring feature distribution drift using a divergence metric like Jensen-Shannon divergence directly measures whether the input data has shifted from the training distribution, which would cause the model to make biased predictions. This is the most direct way to confirm data drift in the input features.
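
To make the metric concrete, here is a minimal sketch of Jensen-Shannon divergence between two histograms of the same feature, one from training data and one from recent serving traffic. The bin counts are invented for illustration, and any alert threshold would be a project-specific assumption.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two histograms.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlap.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")

    def normalize(counts):
        total = sum(counts)
        return [c / total for c in counts]

    def kl(a, b):
        # Terms with a == 0 contribute nothing by convention.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    p, q = normalize(p), normalize(q)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical bin counts for one input feature (e.g. square footage).
training_hist = [120, 300, 400, 150, 30]
serving_hist = [40, 150, 380, 300, 130]
drift = js_divergence(training_hist, serving_hist)
```

Because the divergence is symmetric and bounded, the same threshold can be reused across features with very different scales.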

Your team has deployed a text classification model on Vertex AI Endpoints. You notice that the model's latency has increased significantly over the last week, but the request rate has remained stable. Which of the following is the most likely cause?

A sudden increase in the number of prediction requests

The model was replaced with a larger version without updating the endpoint

A change in the preprocessing logic that now includes a computationally expensive step

This increases per-request latency without changing request rate.

A misconfiguration in the autoscaling policy

Why: A computationally expensive preprocessing step directly increases per-request latency on the inference path, even when request rate is stable. Vertex AI Endpoints execute user-provided preprocessing code before model inference, so adding a heavy operation (e.g., large regex, image resizing, or external API call) will linearly increase response time for every prediction.

You are monitoring a classification model that predicts loan default. The model was trained on data from 2020-2022. In 2023, the economic conditions changed, and the model's accuracy dropped significantly. Which monitoring approach would best help you detect this issue early?

Monitor the accuracy of the model on the latest batch of labeled data

Monitor feature distribution drift using KS test

Monitor the prediction distribution for significant shift from training distribution

Prediction distribution shift can indicate concept drift even without labels.

Monitor the freshness of the training data

Why: Option C is correct because monitoring the prediction distribution for a significant shift from the training distribution directly detects changes in the model's output behavior, which is the earliest indicator of concept drift or data drift caused by economic changes. Unlike accuracy monitoring, this approach does not require labeled data, enabling real-time detection of performance degradation before ground truth labels become available.
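
A label-free check of this kind can be sketched with the Population Stability Index (PSI) over binned prediction scores. PSI is a swapped-in technique chosen for illustration and is not named in the question; the scores below are invented, and the 0.25 alert level is a commonly quoted rule of thumb rather than a fixed standard.

```python
import math


def psi(baseline_scores, recent_scores, bins=10, eps=1e-6):
    """Population Stability Index between two sets of scores in [0, 1]."""

    def proportions(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(scores), eps) for c in counts]

    expected = proportions(baseline_scores)
    actual = proportions(recent_scores)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical default-probability scores: mostly low at training time,
# mostly high after economic conditions change.
baseline = [0.05] * 50 + [0.55] * 30 + [0.95] * 20
recent = [0.05] * 20 + [0.55] * 30 + [0.95] * 50

shift = psi(baseline, recent)
alert = shift > 0.25  # assumed threshold
```

No ground-truth labels are needed here, which is why a check like this can fire weeks before an accuracy metric could be computed.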

You are responsible for monitoring a batch prediction pipeline that runs daily. Recently, the pipeline started failing intermittently with out-of-memory errors. The input data volume has not changed. What is the most likely cause?

A recent code change that loads the entire dataset into memory before processing

This could cause OOM for large datasets.

Increase in model size due to retraining

Decrease in the number of worker machines

Increase in input data size

Why: Option A is correct because a code change that loads the entire dataset into memory before processing would directly cause out-of-memory (OOM) errors, even if the input data volume remains unchanged. In batch prediction pipelines, data is typically streamed or processed in chunks to manage memory efficiently. A change that bypasses this pattern and loads all data at once can exceed the available heap or container memory, leading to intermittent failures depending on data characteristics or concurrent loads.
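
The chunked pattern described above can be sketched with a generator that keeps only one batch in memory at a time. The `predict` function is a hypothetical stand-in for the real model call.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(records: Iterable, size: int) -> Iterator[list]:
    """Yield lists of at most `size` records without materializing the input."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def predict(batch):
    # Hypothetical stand-in for the model; doubles the input value.
    return [record["x"] * 2 for record in batch]


def run(records, size=1000):
    """Stream predictions; peak memory is bounded by one batch."""
    for chunk in batched(records, size):
        yield from predict(chunk)


# The source is itself a generator, so nothing is loaded up front.
source = ({"x": i} for i in range(10))
outputs = list(run(source, size=4))
```

Replacing `batched(records, size)` with `list(records)` is the kind of small change that reintroduces the out-of-memory failure.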

You need to set up monitoring for a Vertex AI model that serves predictions in real-time. The model is expected to have a latency SLA of under 100ms. Which metric should you configure an alert on to ensure the SLA is met?

p50 latency of prediction requests

Prediction drift score

p99 latency of prediction requests

p99 captures tail latency critical for SLA.

Number of prediction requests per second

Why: Option C is correct because p99 latency is the value at or below which 99% of requests complete, so it captures the tail latency that an SLA like under 100ms is meant to bound. Alerting on p99 flags slow outliers that p50 hides: the median can stay well under 100ms while a meaningful share of requests exceed it. Note that p99 still allows the slowest 1% of requests to exceed the threshold; a stricter guarantee would need a higher percentile or the maximum.
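
The difference between the median and the tail is easy to see on a small sample. This sketch uses the nearest-rank percentile definition and invented latency values; monitoring systems may interpolate differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: most requests are fast, two are slow.
latencies_ms = [20] * 98 + [150, 400]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
sla_breached = p99 > 100
```

An alert on p50 would stay silent for this sample, while an alert on p99 fires.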

Your company uses a custom container for model serving on Vertex AI. After a recent update, the model returns predictions but they are clearly wrong (e.g., negative probabilities for a classification model). The logs show no errors. What is the most likely cause?

The preprocessing code in the container was updated but the model was not retrained on the new preprocessing

Feature transformation mismatch leads to incorrect predictions.

The model file is corrupted

The model file was accidentally replaced with a different model

The container is using an incompatible version of the serving framework

Why: Option A is correct because the most likely cause of a model returning predictions without errors, but with clearly wrong outputs like negative probabilities, is a mismatch between the preprocessing logic used during training and inference. If the preprocessing code in the container was updated (e.g., scaling, normalization, or feature engineering steps changed) but the model was not retrained on data processed with that new logic, the model receives inputs that are out of distribution, leading to nonsensical outputs. Vertex AI containers run inference with the deployed code, so any change in preprocessing directly affects the input tensor values without raising runtime errors.

Domain 8: Solving business challenges with ML

A retail company wants to forecast weekly sales for each of its 500 stores. The data includes historical sales, promotions, holidays, and local weather. The company needs to update forecasts every week with new data. Which ML approach should they use?

Use BigQuery ML to create a linear regression model on historical data

Use Vertex AI Forecasting to train a time-series model with holiday and weather features

Vertex AI Forecasting is designed for time series with multiple features and supports automatic retraining.

Export data to AutoML Tables and train a regression model

Build a custom LSTM model using TensorFlow on Vertex AI Workbench

Why: Vertex AI Forecasting is purpose-built for time-series forecasting with support for exogenous features like holidays and weather, making it the ideal choice for weekly sales predictions across 500 stores. It handles multiple time series automatically and integrates with the required weekly retraining cycle, unlike generic regression models that lack temporal awareness.

A media company uses a custom Python script on a Compute Engine VM to run batch predictions with a large ML model. The script loads the model from Cloud Storage, processes records from a Pub/Sub pull subscription, and writes results to BigQuery. Predictions are taking too long and the VM often runs out of memory. Which two changes should the company implement to improve performance and scalability? (Choose TWO)

Deploy the model on Vertex AI Prediction for batch prediction

Change Pub/Sub to a push subscription that sends messages to a load-balanced group of VMs

Push subscriptions with load balancing allow horizontal scaling across multiple VMs.

Use Dataflow to read from Pub/Sub, run predictions using the model, and write to BigQuery

Dataflow provides distributed processing with auto-scaling, handling large volumes efficiently.

Switch to a larger VM with more memory

Store results in Cloud SQL instead of BigQuery

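Why: Option B is correct because switching to a push subscription with a load-balanced group of VMs distributes the message processing load across multiple instances, preventing any single VM from being overwhelmed. This directly addresses the memory exhaustion issue by parallelizing the work and allowing horizontal scaling. Option C is correct because Dataflow reads from Pub/Sub, runs the predictions, and writes to BigQuery as a distributed pipeline with autoscaling, so throughput grows with message volume instead of being bounded by one machine.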

A hospital wants to deploy a machine learning model for detecting anomalies in patient vital signs. The model was trained on historical data but must comply with HIPAA regulations. The model serving must be low-latency (under 100 ms) and handle up to 1000 requests per second. Which architecture should they use on Google Cloud?

Use Vertex AI Batch Prediction to run predictions in batch jobs every hour

Use BigQuery ML to run predictions directly from a BigQuery table

Deploy the model as a container on Cloud Run with a load balancer

Deploy the model to Vertex AI Prediction with a private endpoint and use VPC Service Controls for data isolation

Vertex AI Prediction with private endpoints offers low latency and VPC-SC provides HIPAA-compliant data boundaries.

Why: Vertex AI Prediction with a private endpoint and VPC Service Controls meets all requirements: it provides low-latency (sub-100ms) online predictions for up to 1000 QPS, enforces HIPAA compliance by isolating the model within a VPC and preventing data exfiltration, and supports autoscaling. Batch Prediction (A) cannot meet the latency requirement, BigQuery ML (B) is designed for analytical queries not real-time serving, and Cloud Run (C) lacks native HIPAA-compliant data isolation controls.

A data scientist deployed a TensorFlow model for sentiment analysis to Vertex AI Prediction. The model expects input key 'text' but the client sends requests with key 'review_text'. Which step should the data scientist take to resolve the error without retraining the model?

Use a Cloud Function to strip the 'review_text' key and replace it with 'text'

Retrain the model with input key 'review_text'

Create a new Vertex AI Endpoint with an alias mapping 'review_text' to 'text'

Modify the client code to send requests with input key 'text'

This aligns the request with the model's expected signature without changing the model.

Why: Option D is correct because the most straightforward and reliable solution is to modify the client code to send the request with the expected input key 'text'. This avoids any additional infrastructure, latency, or complexity, and does not require retraining the model or altering the deployed endpoint. Vertex AI Prediction serves the model as-is, so aligning the client's request format with the model's expected input is the simplest and most maintainable fix.

A logistics company uses a regression model to predict delivery times. The model currently uses features: distance (km), traffic index, weather condition, and time of day. The data scientist notices that the model's predictions are systematically too low for deliveries during peak traffic hours. Which action would best address this issue?

Switch to a deep neural network model

Remove the traffic index feature as it is causing bias

Add a cross-feature that multiplies distance by traffic index

This interaction term allows the model to capture the combined effect.

Collect more training data during peak traffic hours

Why: The model's systematic underestimation during peak traffic hours indicates a missing interaction effect between distance and traffic. Adding a cross-feature (distance × traffic index) allows a linear model to capture the non-linear relationship where traffic disproportionately increases delivery time over longer distances. This directly addresses the bias without discarding useful data or unnecessarily complicating the model.
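
The effect of the cross-feature can be checked on synthetic data. In the sketch below the data-generating formula is an assumption made for illustration (traffic adds more delay the longer the route is), and `fit_least_squares` is a small hand-written solver, not a library call.

```python
def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, row):
    return sum(w * x for w, x in zip(weights, row))


# Synthetic deliveries: (distance_km, traffic_index) -> minutes.
samples = [(d, t) for d in (1, 5, 10, 20) for t in (1, 2, 3, 4, 5)]
minutes = [10 + 2 * d + 0.5 * d * t for d, t in samples]

additive = [[1.0, d, t] for d, t in samples]         # no interaction term
crossed = [[1.0, d, t, d * t] for d, t in samples]   # adds distance x traffic

w_add = fit_least_squares(additive, minutes)
w_cross = fit_least_squares(crossed, minutes)

# A long delivery at peak traffic; the true time is 100 minutes.
pred_add = predict(w_add, [1.0, 20, 5])
pred_cross = predict(w_cross, [1.0, 20, 5, 20 * 5])
```

On this data the additive model predicts about 89 minutes for the long peak-hour delivery, an 11-minute underestimate, while the model with the cross-feature recovers the true 100 minutes.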

An e-commerce company uses a recommendation model that suggests products based on user browsing history. The model was trained on data from the past year and has high accuracy on the test set. However, after deployment, the click-through rate (CTR) on recommendations is much lower than expected. Which three steps should the data scientist take to diagnose and improve the model? (Choose THREE)

Run offline evaluation on a holdout dataset to confirm accuracy

Set up an A/B experiment comparing the model's recommendations against a baseline

A/B testing validates the model's real-world performance and identifies issues.

Retrain the model on the most recent three months of data to capture recent trends

User preferences may have shifted; retraining on recent data addresses concept drift.

Check the distribution of predictions versus the training set to detect drift

Monitoring prediction drift helps identify if the model is seeing different inputs than during training.

Increase the training dataset size by including data from two years ago

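Why: Option B is correct because an A/B experiment directly measures the model's real-world impact by comparing its CTR against a baseline (e.g., random or popularity-based recommendations). This isolates the model's performance from confounding factors like seasonality or user behavior changes, providing a causal estimate of its effectiveness. Option C is correct because user preferences may have shifted since the training window, and retraining on the most recent three months of data lets the model capture current trends. Option D is correct because comparing the distribution of live predictions with the training set shows whether the model is seeing different inputs in production than it saw during training.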
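
Whether a CTR difference in an A/B experiment is more than noise can be checked with a two-proportion z-test. This is a minimal sketch using the pooled standard error; the click and impression counts are invented, and real experiments also need a pre-planned sample size and stopping rule.

```python
import math


def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference CTR_b - CTR_a."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical results: arm A is a popularity baseline, arm B is the model.
z, p_value = two_proportion_z(clicks_a=500, views_a=10_000,
                              clicks_b=400, views_b=10_000)
model_is_worse = z < 0 and p_value < 0.05
```

A negative and significant `z` here would mean the deployed model really does underperform the baseline, which points toward retraining and drift checks rather than more offline evaluation.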
