PMLE Practice Test 37 — 15 Questions

Question 1

You are a machine learning engineer at a financial technology company. You have deployed a complex ensemble model consisting of three sub-models (XGBoost, TensorFlow, and PyTorch) for real-time fraud detection. The model is served on Vertex AI online prediction with a custom container that orchestrates the three models sequentially. The endpoint currently uses n1-highmem-8 machines with no accelerators. You are experiencing high latency (avg 500ms) during peak trading hours (9:30 AM - 4:00 PM EST), exceeding the 200ms SLA. The container is CPU-bound, and memory usage is around 60%. The model weights total 500 MB. You have already tried increasing the batch size per request from 1 to 4, which reduced latency slightly but not enough. The traffic pattern is very spiky, with sudden bursts of up to 1000 requests per second. Your goal is to meet the latency SLA without significantly increasing cost. Which action should you take?

Accepted Answer

Add an NVIDIA T4 GPU accelerator to the existing machine type. A GPU such as the NVIDIA T4 can significantly speed up the TensorFlow and PyTorch components, which are deep learning models. The XGBoost component runs on CPU, but the overall latency bottleneck is likely the deep learning models, so accelerating their inference reduces total latency. Adding vCPUs helps only marginally because the bottleneck is model compute. Reducing the minimum replica count increases cold starts and queuing during bursts. Switching to batch prediction abandons real-time serving, so it cannot meet the latency requirement.

Answer

Reduce the min_replica_count to 0 to allow scaling down aggressively and add more replicas during spikes.

Answer

Increase the machine type to n1-highmem-16 with more vCPUs.

Answer

Switch the model to Vertex AI batch prediction and run predictions every hour.
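
The explanation above says the deep learning models are "likely" the bottleneck. One way to check that before paying for a GPU is to time each stage of the sequential container. The sketch below is a minimal illustration; the two stage functions are hypothetical stand-ins, not the real XGBoost, TensorFlow, or PyTorch models.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def profile_stages(stages: List[Stage], payload: Any, repeats: int = 20) -> Dict[str, float]:
    """Return the mean wall-clock milliseconds spent in each sequential stage."""
    totals = {name: 0.0 for name, _ in stages}
    for _ in range(repeats):
        value = payload
        for name, fn in stages:
            start = time.perf_counter()
            value = fn(value)  # the output of one stage feeds the next
            totals[name] += (time.perf_counter() - start) * 1000.0
    return {name: total / repeats for name, total in totals.items()}


# Hypothetical stand-ins for the real sub-models (assumptions for illustration).
def fake_tree_model(features: Any) -> Any:
    return features


def fake_deep_model(features: Any) -> Any:
    sum(i * i for i in range(20000))  # simulate heavier compute
    return features


if __name__ == "__main__":
    stages = [("xgboost", fake_tree_model), ("tensorflow", fake_deep_model)]
    for name, ms in profile_stages(stages, {"amount": 42.0}).items():
        print(f"{name}: {ms:.3f} ms")
```

If the deep learning stages dominate the measured time, the GPU option is well supported; if the tree model or the orchestration code dominates, a GPU would help less.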

Question 2

Your organization has a large production system that uses Vertex AI Prediction for an NLP model with a 2 GB memory footprint. The endpoint is configured with 5 replicas, each using an n1-standard-4 with a single T4 GPU. Recently, you observed an increase in 503 errors during peak hours. Cloud Monitoring shows that GPU utilization is consistently above 90% across all replicas, while CPU and memory are below 50%. You have already increased the max replicas to 10, but the errors persist because the increased replicas also become saturated. What should you do to resolve the issue?

Accepted Answer

Switch to a larger GPU such as V100 or A100 to increase per-replica throughput. Cloud Monitoring shows that the GPU is the saturated resource (consistently above 90% utilization) while CPU and memory stay below 50%, so each replica needs more GPU compute. A high-memory or higher-CPU machine type adds capacity to resources that are not the bottleneck. Adding more replicas of the same configuration has already been tried, and the new replicas saturate as well.

Answer

Use a high-memory machine type like n1-highmem-16 to reduce memory pressure.

Answer

Implement request batching in the custom container to improve GPU utilization efficiency.

Answer

Enable model parallelism across multiple GPUs within each replica.

Question 3

A large e-commerce company deploys multiple ML models on Vertex AI Endpoints. They use Vertex AI Model Registry to manage model versions. Recently, a team accidentally deployed an unvalidated model to production, causing a service outage. They want to implement a governance process where models must pass certain validation checks before deployment. The validation includes unit tests, fairness checks, and performance benchmarks. They use CI/CD pipelines (Cloud Build). They also need to allow manual approval for critical models. Which combination of Vertex AI features and Cloud Build steps would enforce the required governance?

Accepted Answer

Implement Cloud Build triggers that run validation steps, then use Vertex AI Model Registry 'state' to mark models as 'validated' before allowing deployment to endpoints. This is correct because it combines Cloud Build triggers that run the validation steps (unit tests, fairness checks, performance benchmarks) with the Model Registry 'state' marker, which is set to 'validated' only after those checks pass. The deployment pipeline then treats that state as a gate, so only validated models can be deployed to Vertex AI Endpoints. Manual approval for critical models can be added as a Cloud Build approval step before the state is set to 'validated'.

Answer

Use Vertex AI Experiments to log validation results and require manual checks before deployment.

Answer

Set up Cloud Armor to block deployment of unvalidated models.

Answer

Use Vertex AI Continuous Monitoring to automatically detect issues and roll back deployments.
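
The gating logic in the accepted answer can be sketched in plain Python. This is an illustration of the control flow only: `ModelVersion`, its `state` field, and the check functions are hypothetical names, not Vertex AI or Cloud Build APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModelVersion:
    name: str
    critical: bool = False
    state: str = "unvalidated"  # hypothetical marker mirroring the 'state' in the answer
    approved_by: str = ""       # filled in by a manual approval step for critical models


Check = Callable[[ModelVersion], bool]


def run_validation(model: ModelVersion, checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every check; mark the model validated only if all pass and approval is in place."""
    results = {check_name: check(model) for check_name, check in checks.items()}
    needs_approval = model.critical and not model.approved_by
    if all(results.values()) and not needs_approval:
        model.state = "validated"
    return results


def can_deploy(model: ModelVersion) -> bool:
    """The deployment step refuses anything that is not marked validated."""
    return model.state == "validated"


if __name__ == "__main__":
    checks = {
        "unit_tests": lambda m: True,
        "fairness": lambda m: True,
        "benchmark": lambda m: True,
    }
    model = ModelVersion("fraud-v2", critical=True)
    run_validation(model, checks)
    print(can_deploy(model))  # critical model without approval stays blocked
```

Keeping the approval check inside the validation step means the deployment step needs to test only one condition.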

Question 4

A healthcare startup is developing a diagnostic model using sensitive patient data. They use Vertex AI to manage the training pipeline. They need to ensure that the data is encrypted both at rest and in transit. Additionally, they want to prevent the ML engineers from seeing raw data but still allow them to train models. They use Cloud Storage with CMEK and VPC-SC. They plan to use Vertex AI Training with a custom service account. The data stored in Cloud Storage is encrypted with CMEK. What additional step is needed to allow Vertex AI Training to access the encrypted data?

Accepted Answer

Grant the Cloud Storage service agent the Cloud KMS CryptoKey Decrypter role. Cloud Storage, not the caller, decrypts CMEK-protected objects, so its service agent must be authorized to use the key. The custom service account that the Vertex AI Training job runs as needs only read access to the objects (for example, Storage Object Viewer); it does not need its own Cloud KMS role to read them. If the service agent cannot use the key, the objects remain unreadable regardless of the caller's Storage permissions.

Answer

Use a service account with the 'Storage Admin' role and 'Cloud KMS CryptoKey Decrypter' role.

Answer

Grant the custom service account the Cloud KMS CryptoKey Decrypter role.

Answer

Disable encryption for the training data to simplify access.

Question 5

A large e-commerce company uses Vertex AI to train a recommendation model daily. The training pipeline is built with Vertex AI Pipelines and involves three steps: data preprocessing, training, and model evaluation. The pipeline is triggered by a Cloud Scheduler job every morning at 8 AM. Recently, the pipeline has been failing intermittently during the data preprocessing step, with an error message indicating 'ResourceExhausted: Quota limits exceeded for read api requests.' The team has checked and confirmed that the quota for BigQuery read requests is not exceeded at the project level. The preprocessing step reads data from a BigQuery table with billions of rows. The team has also noticed that the pipeline runs on a custom machine type (n1-standard-4) with a persistent disk. What is the most likely cause of this error?

Accepted Answer

The preprocessing component is using a BigQuery client library that does not use exponential backoff for retries. This is correct because the error 'ResourceExhausted: Quota limits exceeded for read api requests' indicates that the BigQuery API is throttling the client's requests even though the project-level quota is not exceeded. Without exponential backoff, the preprocessing component retries immediately and repeatedly, and this burst of requests exhausts the per-client or per-connection quota. With exponential backoff, the client waits progressively longer between retries, which keeps the request rate under the limit.

Answer

The BigQuery table is partitioned on a date column, and the pipeline is querying a specific partition that exceeds the quota.

Answer

The Cloud Scheduler job is triggering multiple pipeline runs that overlap, causing concurrent quota usage.

Answer

The pipeline is using a shared VPC that has traffic shaping limits.
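
Exponential backoff, as named in the accepted answer, can be sketched as a small retry wrapper. This is a generic illustration under stated assumptions: `ResourceExhausted` here is a locally defined stand-in for the client library's throttling error, and the delay values are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ResourceExhausted(Exception):
    """Hypothetical stand-in for the throttling error raised by a client library."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying throttled calls with exponentially growing, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return fn()
        except ResourceExhausted:
            attempt += 1
            if attempt >= max_attempts:
                raise  # out of attempts: surface the original error
            # Delay cap doubles each retry: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0.0, cap))  # "full jitter" spreads out retries
```

The `sleep` parameter is injectable so that the wrapper can be tested without real waiting. The random jitter keeps parallel workers from retrying at the same instant.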

Question 6

A financial services company uses a custom deep learning model on Vertex AI to automatically approve or reject credit card transactions. The model is explainable using Vertex Explainable AI, and the company monitors feature attribution drift with thresholds defined per feature. Last week, the monitoring system flagged that the mean absolute attribution score for the 'transaction_amount' feature increased from 0.35 to 0.55. The overall model accuracy, measured on a daily batch of labeled transactions, has remained around 97%. The operations team is concerned about potential compliance issues due to changing model behavior. What should the data scientist do?

Accepted Answer

Investigate whether there has been a shift in the distribution of 'transaction_amount' values in the recent transaction data, which could explain the attribution change. This is correct because a shift in the distribution of the 'transaction_amount' feature (for example, due to seasonality or a new customer segment) can change its attribution score without indicating model degradation. Feature attributions depend on the input values being explained, measured against a baseline, so when the inputs shift, the attribution assigned to a feature can legitimately increase. Investigating the distribution shift is the first diagnostic step before adjusting thresholds or retraining, because stable accuracy does not rule out data drift that could lead to compliance issues.

Answer

Tune the alert threshold for 'transaction_amount' to 0.6 to avoid future false alarms.

Answer

Retrain the model by increasing regularization to reduce the importance of the 'transaction_amount' feature.

Answer

Disable the feature attribution drift monitoring for 'transaction_amount' since the model accuracy is stable.
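
One common way to quantify the distribution shift mentioned in the accepted answer is the population stability index (PSI). The sketch below is an illustration, not the method Vertex AI uses internally; the bin edges and sample values are hypothetical, and both samples are assumed to be non-empty.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin: (-inf, e0), [e0, e1), ..., [e_last, +inf)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # number of edges at or below v
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def population_stability_index(
    baseline: Sequence[float],
    recent: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (q - p) * ln(q / p); eps avoids log(0) for empty bins."""
    p = bin_fractions(baseline, edges)
    q = bin_fractions(recent, edges)
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))


if __name__ == "__main__":
    baseline_amounts = [10.0, 20.0, 30.0, 40.0] * 25   # hypothetical training-era sample
    recent_amounts = [v * 3 for v in baseline_amounts]  # hypothetical shifted sample
    edges = [25.0, 50.0, 100.0]
    print(population_stability_index(baseline_amounts, recent_amounts, edges))
```

A PSI of zero means the binned distributions match; larger values indicate a larger shift in 'transaction_amount' between the two windows.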

Question 7

You are an ML engineer at a logistics company. You have deployed a deep learning model on Vertex AI Endpoints using a custom container with GPU acceleration. The model predicts delivery times based on route features. After one week, you notice that the endpoint's GPU utilization is consistently at 10%, but the prediction latency has increased by 50%. The number of prediction requests per second has remained stable. You check the container logs and see no errors. The model is served using TensorFlow Serving with batching enabled (batch size: 32, batch timeout: 100ms). The custom container uses a single NVIDIA T4 GPU. You have also set the Vertex AI endpoint to use autoscaling with minReplicaCount: 1 and maxReplicaCount: 5, and the CPU utilization target is 60%. Which action should you take to reduce latency?

Accepted Answer

Increase the batch size to 64 and batch timeout to 200ms to improve GPU utilization. The core symptom is low GPU utilization (10%) alongside increased latency, which indicates that the GPU is underutilized and the bottleneck is likely in batching or data pipeline overhead. Increasing the batch size to 64 and the batch timeout to 200ms lets TensorFlow Serving accumulate more requests per batch, which improves GPU throughput and makes better use of GPU parallelism. This directly addresses the mismatch between low GPU utilization and high latency.

Answer

Increase the minReplicaCount to 3 to handle requests in parallel.

Answer

Reduce the CPU utilization target to 40% to trigger more aggressive autoscaling.

Answer

Quantize the model to FP16 to reduce compute time per inference.
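
How batch size and batch timeout interact with the request rate can be worked out with simple arithmetic. The sketch below assumes evenly spaced arrivals, which is a simplification, and the traffic rates passed in are hypothetical because the question does not state the actual rate.

```python
def fill_rate_needed(batch_size: int, timeout_ms: float) -> float:
    """Requests per second needed to fill a batch just as the timeout expires."""
    return batch_size * 1000.0 / timeout_ms


def max_batching_wait_ms(batch_size: int, timeout_ms: float, requests_per_second: float) -> float:
    """Longest wait for the first request in a batch, assuming evenly spaced arrivals."""
    time_to_fill_ms = (batch_size - 1) * 1000.0 / requests_per_second
    return min(timeout_ms, time_to_fill_ms)


if __name__ == "__main__":
    print(fill_rate_needed(32, 100))  # original settings from the question
    print(fill_rate_needed(64, 200))  # settings from the accepted answer
    print(max_batching_wait_ms(32, 100, requests_per_second=50))  # hypothetical rate
```

Both the original settings (32, 100 ms) and the proposed ones (64, 200 ms) need about 320 requests per second to fill a batch before the timeout fires. Below that rate the timeout bounds the wait, so it is worth measuring the actual request rate before tuning either value.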

Question 8

A financial services company uses Vertex AI Pipelines to train and deploy models for fraud detection. The ML team consists of data scientists who develop models and ML engineers who deploy them. They use a CI/CD pipeline with Cloud Build to build and push Docker images to Artifact Registry, then trigger Vertex AI Pipelines. Recently, the team noticed that a model deployed to production was trained on a dataset that had not been approved by the data governance team. Upon investigation, they found that a data scientist accidentally used an unapproved version of the training data by specifying a Cloud Storage path that was not the latest approved dataset. The company needs to enforce that only approved datasets are used in training jobs. Which approach should they take?

Accepted Answer

Use a curated dataset registry in BigQuery or Cloud Storage with IAM conditions that allow access only to datasets tagged as 'approved'. Modify the CI/CD pipeline to pass only approved dataset references to the training job. This is correct because it enforces governance at the source: IAM conditions restrict access to approved datasets, so unapproved data cannot be read by a training job. Having the CI/CD pipeline pass only approved dataset references removes the human error of hand-typing Cloud Storage paths.

Answer

Implement a manual approval process where data scientists request dataset paths from the data governance team before each training run.

Answer

After training, run a validation step that checks if the dataset used matches the latest approved version, and roll back if not.

Answer

Restrict all Cloud Storage buckets to be read-only for the data scientists, and have ML engineers copy approved datasets to a separate bucket.
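
The pipeline-side half of the accepted answer (passing only approved dataset references) can be sketched as a lookup that refuses anything not marked approved. The registry structure, dataset names, and `gs://example-bucket` paths below are hypothetical; in practice the IAM conditions named in the answer would enforce the same rule at the storage layer.

```python
from typing import Dict


class UnapprovedDatasetError(ValueError):
    """Raised when a training job asks for a dataset that is not approved."""


def resolve_approved_dataset(name: str, registry: Dict[str, Dict[str, str]]) -> str:
    """Return the URI for an approved dataset, or raise before any training starts."""
    entry = registry.get(name)
    if entry is None:
        raise UnapprovedDatasetError(f"dataset {name!r} is not in the registry")
    if entry.get("status") != "approved":
        raise UnapprovedDatasetError(f"dataset {name!r} has status {entry.get('status')!r}")
    return entry["uri"]


# Hypothetical registry contents, for illustration only.
EXAMPLE_REGISTRY = {
    "transactions-2024q1": {"status": "approved", "uri": "gs://example-bucket/approved/2024q1/"},
    "transactions-draft": {"status": "pending", "uri": "gs://example-bucket/staging/draft/"},
}

if __name__ == "__main__":
    print(resolve_approved_dataset("transactions-2024q1", EXAMPLE_REGISTRY))
```

Data scientists then refer to datasets by registry name, and the pipeline resolves the path, so a mistyped or stale Cloud Storage path never reaches the training job.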

Question 9

A healthcare startup is building a diagnostic tool that uses a deep learning model to classify medical images. The model is trained on TensorFlow and deployed on Vertex AI Prediction. The startup has strict latency requirements: predictions must return within 200 ms for 95% of requests. Current performance shows p95 latency of 350 ms. The team has already tried using a smaller model, but accuracy dropped below acceptable levels. The traffic pattern is spiky: low load during nights but bursts of 1000 requests per second during business hours. Currently, they use a single n1-highmem-8 VM with a GPU attached. They have a budget for additional resources but need to optimize cost. The model is about 500 MB and requires GPU for inference. Which course of action should they take to meet the latency requirement while managing costs?

Accepted Answer

Create a Vertex AI Prediction endpoint with an accelerator (GPU) and enable autoscaling (min 1, max 5 nodes). This is correct because autoscaling on GPU-accelerated nodes handles the spiky traffic efficiently: the endpoint scales from 1 to 5 nodes to meet the 200 ms p95 latency requirement. It keeps cost low during quiet periods while providing burst capacity for the peak of 1000 requests per second, so it addresses both the latency and budget constraints without compromising model accuracy.

Answer

Upgrade to an n1-highmem-16 VM with a more powerful GPU.

Answer

Switch to batch prediction using Vertex AI Batch Prediction and store results in a database for retrieval.

Answer

Deploy the model as a Cloud Function using TensorFlow Serving.

Question 10

You are an ML engineer at a logistics company. The company uses a Vertex AI Pipeline with BigQuery ML to train a model that predicts delivery delays based on weather, traffic, and historical order data. The pipeline runs daily and includes steps: (1) data extraction from BigQuery, (2) feature engineering using Dataflow, (3) model training with BigQuery ML (logistic regression), (4) model evaluation, and (5) conditional deployment to a Vertex AI Endpoint if accuracy > 0.85. Recently, the pipeline has been failing at step 5 with the error: "Vertex AI Endpoint creation failed: Quota limit of 1 endpoint per region exceeded." The company has already created one endpoint in the same region for another model. The pipeline is configured to create a new endpoint each time a model is deployed. The engineer needs to fix this with minimal changes to the pipeline code. Which course of action should the engineer take?

Accepted Answer

Modify the deployment step to check if an endpoint already exists and, if so, deploy a new model version to the existing endpoint instead of creating a new one. This is correct because it addresses the root cause: the pipeline fails because it tries to create a new endpoint on every run, which exceeds the regional quota of one endpoint. Checking for an existing endpoint and deploying the new model version to it avoids the quota error without altering the pipeline's core logic or waiting on external approvals. An endpoint can host more than one deployed model, so this fits the requirement for minimal code changes.

Answer

Submit a quota increase request to Google Cloud for Vertex AI Endpoints in the current region.

Answer

Change the region in the pipeline configuration to a region with available endpoint quota.

Answer

Remove the accuracy threshold and deploy every model automatically to a pre-created endpoint.
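
The "reuse if it exists, otherwise create" logic from the accepted answer can be sketched as follows. `FakeEndpointClient` is an in-memory stand-in written for this illustration, not the Vertex AI SDK, and the endpoint and model names are hypothetical.

```python
from typing import Dict, List, Optional


class FakeEndpointClient:
    """In-memory stand-in for an endpoint service with a per-region quota."""

    def __init__(self, quota: int = 1) -> None:
        self.quota = quota
        self.endpoints: Dict[str, List[str]] = {}

    def find(self, display_name: str) -> Optional[str]:
        return display_name if display_name in self.endpoints else None

    def create(self, display_name: str) -> str:
        if len(self.endpoints) >= self.quota:
            raise RuntimeError("Quota limit of endpoints per region exceeded")
        self.endpoints[display_name] = []
        return display_name

    def deploy(self, endpoint: str, model: str) -> None:
        self.endpoints[endpoint].append(model)


def deploy_to_existing_or_new(client: FakeEndpointClient, display_name: str, model: str) -> str:
    """Look up the endpoint first; create one only when none exists."""
    endpoint = client.find(display_name)
    if endpoint is None:
        endpoint = client.create(display_name)
    client.deploy(endpoint, model)
    return endpoint


if __name__ == "__main__":
    client = FakeEndpointClient(quota=1)
    deploy_to_existing_or_new(client, "delivery-delay", "model-day-1")
    deploy_to_existing_or_new(client, "delivery-delay", "model-day-2")  # reuses the endpoint
    print(client.endpoints)
```

The only change to the pipeline is the lookup before creation, which is why this option involves minimal code changes.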

Question 11

A financial services company uses Vertex AI to deploy multiple models for fraud detection. The ML team has set up a CI/CD pipeline using Cloud Build and Cloud Deploy. The pipeline builds a custom container with the trained model, pushes it to Artifact Registry, and deploys it to a Vertex AI Endpoint. Recently, a new regulation requires that all model deployments be audited and approved by the compliance team before going live. The compliance team wants to review the model's evaluation metrics and approve the deployment via a ticketing system. Currently, the CI/CD pipeline automatically deploys after the container is built. The team needs to implement a gating process without slowing down the development cycle. What should they do?

Accepted Answer

Modify the CI/CD pipeline to use Cloud Deploy's approval gate feature, requiring a manual approval from the compliance team before the deployment step. This is correct because Cloud Deploy provides a native approval feature that can be added to a delivery pipeline to require manual sign-off before a rollout proceeds. The compliance team can review the model's evaluation metrics and approve through the ticketing system without any change to the build stage, which preserves development velocity. The rollout pauses at the gated stage until approval is granted and then continues under Cloud Deploy's normal rollout management.

Answer

Use Cloud Composer to orchestrate the deployment and add a sensor that waits for approval from the ticketing system via a custom operator.

Answer

Use Cloud Build's built-in approval gate feature to require compliance team sign-off before deployment.

Answer

Store the model artifacts in Cloud Storage and have the compliance team deploy manually using the gcloud command.

Question 12

You are the ML engineer for a financial services company. You have deployed a fraud detection model on Vertex AI Endpoints using a custom container. The model is a gradient boosting model trained on transactional data. Over the past week, the model's precision has dropped from 95% to 80%, while recall has remained stable. The input data volume and distribution have not changed significantly. The model is served on a single endpoint with autoscaling enabled (min replicas=2, max replicas=10). You notice that the average CPU utilization of the serving containers has increased from 40% to 90%, and the p99 latency has increased from 50ms to 200ms. The model is retrained weekly using the latest data, and the last retraining was 3 days ago. The logs show no errors, and the model version is unchanged. Given these symptoms, what is the most likely cause of the precision drop?

Accepted Answer

A recent change in the preprocessing code in the container transformed features differently than what the model expects, causing incorrect predictions. This is correct because a precision drop with no change in input distribution or recall points to a systematic error in predictions rather than a data shift. A preprocessing change in the custom container means the model receives features transformed differently than during training, which produces incorrect probability estimates. The increased CPU utilization and latency are consistent with the container performing additional or different preprocessing work, not with autoscaling issues or a model version change.

Answer

The autoscaling policy is not scaling up fast enough, causing increased latency and prediction errors.

Answer

The model is overfitting to recent transaction patterns due to weekly retraining.

Answer

The model was replaced with a different version without updating the endpoint.
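
A training/serving mismatch of the kind described in the accepted answer can be detected by running the same raw records through both preprocessing paths and comparing the outputs. The two transform functions below are hypothetical examples, with a deliberate bug in the serving path.

```python
import math
from typing import Callable, Dict, List

Features = Dict[str, float]
Transform = Callable[[Features], Features]


def find_skewed_features(
    samples: List[Features], train_fn: Transform, serve_fn: Transform, tol: float = 1e-9
) -> List[str]:
    """Return the feature names whose training and serving values disagree."""
    skewed = set()
    for raw in samples:
        expected, actual = train_fn(raw), serve_fn(raw)
        for key in expected.keys() | actual.keys():
            a, b = expected.get(key), actual.get(key)
            if a is None or b is None or not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
                skewed.add(key)
    return sorted(skewed)


# Hypothetical transforms: the serving path forgot the log transform on 'amount'.
def train_preprocess(raw: Features) -> Features:
    return {"amount": math.log1p(raw["amount"]), "hour": raw["hour"] / 23.0}


def serve_preprocess_buggy(raw: Features) -> Features:
    return {"amount": raw["amount"], "hour": raw["hour"] / 23.0}


if __name__ == "__main__":
    samples = [{"amount": 120.0, "hour": 14.0}, {"amount": 8.5, "hour": 3.0}]
    print(find_skewed_features(samples, train_preprocess, serve_preprocess_buggy))
```

Running a check like this in CI whenever the container's preprocessing code changes would catch the mismatch before deployment.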

Question 13

Your team manages a production ML pipeline on Google Cloud that trains a fraud detection model every 6 hours using new transaction data. The pipeline steps are: (1) Cloud Function triggered by new files in Cloud Storage to validate data, (2) Dataflow job for feature engineering, (3) Vertex AI CustomJob for training, (4) Cloud Function to deploy the model to a Vertex AI endpoint after evaluation. You notice that the pipeline sometimes fails during the Dataflow job step with an error: 'Workflow failed. Causes: The job encountered a system error. Please try again later.' The error occurs sporadically, and retrying the pipeline manually usually succeeds. The team needs a reliable automated solution. What should you do?

Accepted Answer

Orchestrate the pipeline using Cloud Composer with retry policies on the Dataflow operator. This is correct because Cloud Composer (Apache Airflow) supports retry policies on its Dataflow operators, so the Dataflow job is retried automatically when it fails with a transient system error. This handles the sporadic failures without manual intervention and keeps the pipeline running reliably every 6 hours.

Answer

Schedule the pipeline to run less frequently to reduce load on the Dataflow service.

Answer

Use Cloud Tasks to queue the Dataflow job and retry on failure.

Answer

Increase the number of Dataflow workers and use flexRS to handle transient errors.
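
In Airflow, retry behavior is configured through task-level arguments that operators accept. The fragment below shows those arguments as a plain dictionary; `retries`, `retry_delay`, `retry_exponential_backoff`, and `max_retry_delay` are standard Airflow operator arguments, while the specific values are assumptions chosen for illustration.

```python
from datetime import timedelta

# Arguments that could be passed to the Dataflow task (or set as DAG default_args).
# The values are illustrative, not recommendations.
dataflow_task_args = {
    "retries": 3,                              # retry the task up to 3 times
    "retry_delay": timedelta(minutes=5),       # wait before the first retry
    "retry_exponential_backoff": True,         # lengthen the wait on each retry
    "max_retry_delay": timedelta(minutes=30),  # upper bound on the wait
}

if __name__ == "__main__":
    for key, value in dataflow_task_args.items():
        print(f"{key} = {value}")
```

With settings like these, a transient "system error" triggers an automatic retry of only the Dataflow step rather than a manual rerun of the whole pipeline.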

Question 14

You are an ML engineer at a fintech company. You have a prototype credit risk model built using XGBoost that achieves high accuracy on historical data. The model is trained on a dataset with 500,000 rows and 50 features. The company wants to deploy this model to production to score loan applications in real-time. The production environment must handle a peak load of 100 requests per second with a latency under 200ms. You have decided to use Vertex AI for deployment. After deploying the model as a Vertex AI endpoint with a single n1-standard-4 machine, you notice that latency exceeds 500ms at peak load and some requests time out. You have verified that the model prediction itself (excluding network overhead) takes about 50ms on average. What should you do to meet the latency and throughput requirements?

Accepted Answer

Enable autoscaling with a minimum of 2 replicas and use a larger machine type (e.g., n1-standard-8) to handle more concurrent requests. This is correct because the bottleneck is not the model inference time (50ms) but the inability of a single n1-standard-4 machine to serve 100 requests per second without queuing. Enabling autoscaling with a minimum of 2 replicas and upgrading to n1-standard-8 increases both the number of requests the endpoint can process concurrently and the CPU and memory available per replica, which reduces queue wait times and keeps total latency under 200ms. This meets the throughput and latency requirements without changing the model or switching to batch processing.

Answer

Change the machine type to a GPU-accelerated machine like n1-standard-4 with a T4 GPU.

Answer

Prune the model to reduce size and improve prediction speed.

Answer

Switch from online prediction to batch prediction using Vertex AI Batch Prediction.
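
The queuing argument can be made concrete with Little's law: the average number of requests in service equals the arrival rate times the service time. The sketch below uses the figures from the question (100 requests per second, 50 ms per prediction) and two loudly labeled assumptions: one request is served per vCPU at a time, and replicas are sized for 60% utilization to leave headroom for bursts.

```python
import math


def busy_workers(requests_per_second: float, service_time_s: float) -> float:
    """Little's law: average requests in service = arrival rate x service time."""
    return requests_per_second * service_time_s


def replicas_needed(
    requests_per_second: float,
    service_time_s: float,
    workers_per_replica: int,
    target_utilization: float = 0.6,  # assumed headroom, not a figure from the question
) -> int:
    """Replicas required so that average utilization stays at or below the target."""
    demand = busy_workers(requests_per_second, service_time_s)
    return math.ceil(demand / (workers_per_replica * target_utilization))


if __name__ == "__main__":
    print(busy_workers(100, 0.05))                            # workers kept busy at peak
    print(replicas_needed(100, 0.05, workers_per_replica=4))  # n1-standard-4 (4 vCPUs)
    print(replicas_needed(100, 0.05, workers_per_replica=8))  # n1-standard-8 (8 vCPUs)
```

At peak, about 5 workers are busy on average, which is more than the 4 vCPUs of a single n1-standard-4, so requests queue and latency grows. Under the assumptions above, the same load fits on 2 replicas of n1-standard-8, in line with the accepted answer.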

Question 15

You are a machine learning engineer at a retail company. You have deployed a product recommendation model on Vertex AI Prediction using a custom container. The model is a TensorFlow SavedModel that computes embeddings using a large lookup table. The endpoint is configured with 2 replicas on n1-standard-4 (4 vCPU, 15 GB memory) machines. After deployment, you notice that the endpoint's memory usage grows over time, eventually reaching 90% and causing requests to fail with 503 errors. The container logs show no errors, but the memory usage graph shows a steady increase. The model loads the embedding table (5 GB) at startup. You suspect a memory leak. Which course of action should you take first to diagnose and resolve the issue?

Accepted Answer

Profile the container's memory usage locally with memory_profiler to find the leak, then fix the code. This is correct because steady memory growth with a fixed 5 GB embedding table indicates a memory leak in the custom container code, not a capacity issue. Profiling locally with memory_profiler lets you trace allocations and identify the source of the leak before changing the serving code, which makes it the most direct diagnostic step.

Answer

Reduce the number of replicas to 1 to reduce memory contention.

Answer

Increase the machine memory to n1-standard-8 (30 GB).

Answer

Restart the endpoint every hour using a Cloud Scheduler job.
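
The accepted answer names memory_profiler, a third-party package. The sketch below swaps in the standard library's `tracemalloc` to show the same idea without extra dependencies: take a snapshot, replay requests, and see which source lines grew. The leaky handler is a hypothetical example of the kind of bug being hunted.

```python
import tracemalloc
from typing import Callable, List

_request_cache: List[bytes] = []  # hypothetical bug: a cache that is never evicted


def leaky_handler(payload: bytes) -> int:
    _request_cache.append(payload * 1000)  # retains about 100 KB per request
    return len(payload)


def top_growth(handler: Callable[[bytes], int], requests: int = 50, limit: int = 3):
    """Replay requests and return the source lines whose allocations changed the most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(requests):
        handler(b"x" * 100)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to returns StatisticDiff entries, largest differences first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth(leaky_handler):
        print(stat)
```

The top entry points at the line that keeps allocating, which is where to look for the missing eviction or cleanup.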