How many troubleshooting scenario questions are on this page?

This page has 12 troubleshooting scenario questions for the PMLE exam, each with detailed explanations and wrong-answer analysis.

How should I approach PMLE scenario questions?

Read the full scenario before looking at the answer options. Identify the constraint or requirement in the scenario, then eliminate options that are generally true but wrong for this specific case. Scenario questions reward careful reading over pattern matching.


Scenario-based practice

Troubleshooting Scenario Questions

Practise Google Professional Machine Learning Engineer exam questions: original exam-style scenarios covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.


12 scenario questions
Exam code: PMLE
Vendor: Google Cloud

Scenario guide

How to approach troubleshooting scenario questions

These questions describe a symptom in an ML system, such as rising latency, a failed deployment, or a monitoring alert, and ask you to identify the root cause or the correct fix. They reward systematic thinking over memorisation. The best candidates follow a consistent troubleshooting framework even under time pressure.

Quick answer

Troubleshooting scenario questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Practice scenarios

Question 1 (easy, multiple choice)

You have an online prediction model that is showing increasing prediction latency. You have already verified that the request rate and input data size are unchanged. Which of the following should you investigate next?

A
Check if the model was recently updated to a larger version
A larger model needs more computation per request, which increases inference latency.
B
Check the monitoring dashboard configuration
Why wrong: The dashboard configuration only affects what is displayed; it does not affect serving latency.
C
Check if the feature engineering logic was changed
Why wrong: Feature engineering change would be part of model update.
D
Check the geographic location of the endpoint
Why wrong: Location affects network latency, not compute latency.

Question 2 (hard, multiple choice)

A data science team has trained a TensorFlow model on-premises using a large dataset. When they try to deploy the model to Vertex AI for online predictions, the deployed model fails to start with a ‘MemoryError’. The model artifact is 2 GB, and the machine type is n1-standard-4 (15 GB RAM). What is the most likely cause?

A
The model is stored in a regional bucket and the Vertex AI endpoint is in a different region.
Why wrong: Cross-region access is allowed, though not optimal; it would not cause MemoryError.
B
The machine type does not support TensorFlow models larger than 1 GB.
Why wrong: No such limitation exists; TensorFlow can load larger models given sufficient memory.
C
The model is too large for the machine's memory, causing an out-of-memory (OOM) error during loading.
The 2 GB model may require more than 15 GB RAM during loading due to overhead and intermediate structures.
D
The model file is corrupted or missing dependencies, causing a crash.
Why wrong: Corruption typically causes import errors, not MemoryError.

Question 3 (hard, multi-select)

A company uses Vertex AI Model Monitoring to detect training-serving skew. They have a categorical feature 'product_category' with high cardinality. The monitoring job alerts for skew, but the data scientists believe the model performance is still acceptable. Which THREE actions should the team take to investigate and resolve the alert?

A
Examine which categories have the largest distribution changes to understand the nature of the shift.
Identifying specific categories helps assess whether the drift is due to seasonal effects or other benign causes.
B
Adjust the alerting threshold based on historical drift patterns to reduce noise.
Tuning thresholds helps filter out inconsequential drift.
C
Compare model performance metrics (e.g., AUC) on the drifted segment vs. the non-drifted segment.
Segment-level performance analysis determines if drift is actually harmful.
D
Remove the drifted categories from the feature set to eliminate the alert.
Why wrong: Removing categories reduces model information and could degrade performance.
E
Ignore the alert because the model is performing well; monitoring alerts are often false positives.
Why wrong: Ignoring alerts is not recommended; the team should investigate to confirm it's a false positive.
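
The first action, finding which categories moved the most, can be sketched in a few lines of plain Python. This is a minimal illustration, not the Vertex AI implementation: the category counts are invented and `category_shift` is a hypothetical helper. It normalises training and serving counts into distributions, reports the change per category, and takes the largest absolute change (the L-infinity distance, which is the measure the Vertex AI Model Monitoring documentation describes for categorical features).

```python
from collections import Counter


def category_shift(train_values, serving_values):
    """Return per-category change in relative frequency (serving minus training)."""
    train_counts = Counter(train_values)
    serving_counts = Counter(serving_values)
    train_total = sum(train_counts.values())
    serving_total = sum(serving_counts.values())
    categories = set(train_counts) | set(serving_counts)
    return {
        c: serving_counts[c] / serving_total - train_counts[c] / train_total
        for c in categories
    }


# Hypothetical sample of the 'product_category' feature.
train = ["toys"] * 50 + ["books"] * 30 + ["garden"] * 20
serving = ["toys"] * 20 + ["books"] * 30 + ["garden"] * 50

shift = category_shift(train, serving)
# Sort so the categories with the largest absolute change come first.
ranked = sorted(shift.items(), key=lambda kv: abs(kv[1]), reverse=True)
l_infinity = max(abs(delta) for delta in shift.values())

for category, delta in ranked:
    print(f"{category}: {delta:+.2f}")
print(f"L-infinity distance: {l_infinity:.2f}")
```

A ranking like this shows whether the alert is driven by one or two categories (for example a seasonal product line) or by a broad shift, which in turn informs how the threshold should be tuned.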

Question 4 (hard, multiple choice)

A Vertex AI pipeline is triggered from Cloud Build using the configuration above. The pipeline fails with an error: 'Unable to submit build: The source code is not available.' What is the most likely cause?


A
The Docker build step failed silently due to a missing dependency.
Why wrong: Docker build completed, as push step ran.
B
The 'gcloud builds submit' command does not have access to the source code in the Cloud Build environment.
The source code must be provided or referenced explicitly; using 'gcloud builds submit' in a step requires the source to be available via a trigger or artifact.
C
The Docker image tag does not include a hash, causing the push to fail.
Why wrong: The push succeeded; the error is later.
D
The Cloud Build service account lacks permission to access the Vertex AI Pipeline API.
Why wrong: Error message specifically says source code not available, not permission denied.

Question 5 (easy, multiple choice)

Refer to the exhibit. The team notices that the pipeline fails to read data from the specified Cloud Storage path. What is the most likely issue?

Exhibit

pipeline:
  execution_config:
    runner: DataflowRunner
    project: my-project
    region: us-central1
  components:
    - component_type: CsvExampleGen
      component_name: example_gen
      arguments:
        input_basedir: gs://my-bucket/data/

A
The bucket does not exist
Why wrong: If the bucket didn't exist, the error would be clearer; permission is more likely.
B
The pipeline runner is incorrect
Why wrong: DataflowRunner is valid for this pipeline.
C
The region is mismatched
Why wrong: Region mismatch affects Dataflow execution, not reading from GCS.
D
The service account lacks the `roles/storage.objectViewer` role
The Dataflow worker service account needs read access to the objects in the Cloud Storage bucket.

Question 6 (medium, multiple choice)

A data scientist trains an XGBoost model on Vertex AI with a custom container. The model performs well on a held-out test set but fails to generalize in production. They suspect data leakage between training and validation. What is the best practice to prevent this?

A
Store and serve features using Vertex AI Feature Store with point-in-time correctness
Feature Store provides consistent feature values for each timestamp, preventing leakage.
B
Implement feature engineering in Vertex AI Pipelines to ensure temporal ordering
Why wrong: Pipelines orchestrate but don't enforce feature consistency by themselves.
C
Store all features in BigQuery and join on timestamp during training and serving
Why wrong: Manual joins are error-prone and may still introduce leakage.
D
Use Vertex AI AutoML instead of custom training
Why wrong: AutoML may not solve custom feature engineering leakage.
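
Point-in-time correctness means that each training row sees only the feature values that were already known at that row's event time. The sketch below illustrates the idea with the standard library only. It is not the Vertex AI Feature Store API; `feature_history` and `value_as_of` are hypothetical names and the values are invented.

```python
from bisect import bisect_right

# Hypothetical feature history for one entity: (timestamp, value), sorted by timestamp.
feature_history = [(1, 10.0), (5, 12.0), (9, 30.0)]


def value_as_of(history, event_time):
    """Return the latest feature value recorded at or before event_time, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # No value was known yet at event_time.
    return history[idx - 1][1]


# Label events at times 0, 4 and 6 must not see the value written at time 9.
training_rows = [(t, value_as_of(feature_history, t)) for t in (0, 4, 6)]
print(training_rows)
```

A naive join on the latest value would attach 30.0 to every row, which leaks information from the future into training. That is the kind of mistake option C warns about with manual joins.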

Question 7 (hard, multiple choice)

A company deploys a model to Vertex AI Prediction with autoscaling enabled. During a flash sale, traffic spikes 10x, but the endpoint fails to scale fast enough, causing high latency. What is the most likely cause and solution?

A
The min_nodes setting is too low; increase min_nodes to handle baseline traffic
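Replicas that are already running can absorb the start of the spike while the autoscaler provisions additional ones, so a higher minimum shortens the period of high latency.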
B
Switch to preemptible VMs to reduce cost and allow more instances
Why wrong: Preemptible VMs are not supported for Vertex AI Prediction.
C
The model container is too large; rebuild with a smaller image
Why wrong: Image size affects start time but not autoscaling responsiveness.
D
Use Cloud Functions to pre-warm instances before the sale
Why wrong: Cloud Functions cannot pre-warm Vertex AI endpoints.
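
The capacity gap behind option A can be estimated with simple arithmetic. The sketch below assumes a baseline request rate and a per-replica throughput; neither number comes from the scenario, and `replicas_needed` is a hypothetical helper. Only the 10x spike factor is taken from the question.

```python
import math


def replicas_needed(requests_per_second, per_replica_capacity):
    """Replicas required to serve a given request rate (rounded up)."""
    return math.ceil(requests_per_second / per_replica_capacity)


# Hypothetical numbers: neither figure comes from the scenario.
baseline_rps = 200
per_replica_rps = 100
spike_factor = 10  # the scenario's 10x flash-sale spike

baseline = replicas_needed(baseline_rps, per_replica_rps)
peak = replicas_needed(baseline_rps * spike_factor, per_replica_rps)
print(f"baseline replicas: {baseline}, peak replicas: {peak}")
print(f"replicas to add during the spike if the minimum stays at baseline: {peak - baseline}")
```

Under these assumptions the autoscaler would have to start 18 new replicas while traffic is already at its peak. Raising the minimum before a known event shrinks that gap.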

Question 8 (hard, multi-select)

A team is troubleshooting a Vertex AI Pipelines run that keeps failing at the model evaluation step. The pipeline includes steps: data preprocessing, training, evaluation, and deployment. Which THREE actions should they take to diagnose the issue?

A
Verify that the training step output is correctly linked as input to evaluation.
Mismatched outputs are a common pipeline failure cause.
B
Run the evaluation code locally with the same input data.
Reproducing locally helps isolate environment-specific issues.
C
Increase the memory of the evaluation step's machine.
Why wrong: Premature fix without knowing if memory is the issue.
D
Check the logs of the evaluation step in Cloud Logging.
Logs provide error details essential for diagnosis.
E
Replace the evaluation step with a Vertex AI Model Evaluation service.
Why wrong: Changes the process instead of diagnosing the current failure.

Question 9 (hard, multiple choice)

You are troubleshooting a Vertex AI endpoint for a customer. The exhibit shows the endpoint configuration. The customer reports that Model A is experiencing high latency during peaks. Model B runs fine. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
  "name": "projects/my-project/locations/us-central1/endpoints/1234",
  "displayName": "my-endpoint",
  "dedicatedEndpointEnabled": false,
  "deployedModels": [
    {
      "id": "model-a-1",
      "displayName": "model-a",
      "model": "projects/my-project/locations/us-central1/models/456",
      "dedicatedResources": {
        "minReplicaCount": 1,
        "maxReplicaCount": 5,
        "machineSpec": {
          "machineType": "n1-standard-4",
          "acceleratorType": "NVIDIA_TESLA_T4",
          "acceleratorCount": 1
        }
      }
    },
    {
      "id": "model-b-1",
      "displayName": "model-b",
      "model": "projects/my-project/locations/us-central1/models/789",
      "dedicatedResources": {
        "minReplicaCount": 1,
        "maxReplicaCount": 5,
        "machineSpec": {
          "machineType": "n1-standard-8",
          "acceleratorType": "NVIDIA_TESLA_T4",
          "acceleratorCount": 2
        }
      }
    }
  ],
  "trafficSplit": {
    "model-a-1": 50,
    "model-b-1": 50
  }
}

A
Model A is not autoscaling properly due to minReplicaCount=1.
Why wrong: minReplicaCount=1 only sets the floor; autoscaling can still add replicas up to maxReplicaCount, and Model B has the same setting.
B
Model A's machine type has insufficient CPU and GPU for the load.
Model A uses n1-standard-4 with 1 GPU, while Model B uses n1-standard-8 with 2 GPUs.
C
Dedicated endpoint is disabled, causing resource sharing between models.
Why wrong: Even with shared endpoint, resources are per deployed model.
D
The traffic split is unevenly balanced, causing Model A to receive more requests.
Why wrong: Split is 50-50, so even.
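
One systematic way to read an exhibit like this is to put the two deployments side by side and list only the fields that differ. The sketch below does that with the standard `json` module on a trimmed copy of the exhibit; the `summary` and `differing` names are illustrative.

```python
import json

# Trimmed copy of the exhibit: only the fields compared below are kept.
endpoint_json = """
{
  "deployedModels": [
    {"id": "model-a-1", "dedicatedResources": {"minReplicaCount": 1, "maxReplicaCount": 5,
      "machineSpec": {"machineType": "n1-standard-4", "acceleratorCount": 1}}},
    {"id": "model-b-1", "dedicatedResources": {"minReplicaCount": 1, "maxReplicaCount": 5,
      "machineSpec": {"machineType": "n1-standard-8", "acceleratorCount": 2}}}
  ],
  "trafficSplit": {"model-a-1": 50, "model-b-1": 50}
}
"""

endpoint = json.loads(endpoint_json)

summary = {}
for deployed in endpoint["deployedModels"]:
    resources = deployed["dedicatedResources"]
    spec = resources["machineSpec"]
    summary[deployed["id"]] = {
        "traffic_percent": endpoint["trafficSplit"][deployed["id"]],
        "machine_type": spec["machineType"],
        "gpus": spec["acceleratorCount"],
        "replica_range": (resources["minReplicaCount"], resources["maxReplicaCount"]),
    }

for model_id, row in summary.items():
    print(model_id, row)

# Fields that differ between the two deployments are the candidates for the root cause.
a, b = summary["model-a-1"], summary["model-b-1"]
differing = sorted(key for key in a if a[key] != b[key])
print("differs:", differing)
```

The traffic split and replica range are identical, so the machine type and GPU count are the only things that separate Model A from Model B. That rules out options A and D and points to option B.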

Question 10 (hard, multiple choice)

A data engineer is troubleshooting a Vertex AI Endpoint that serves a large BERT model. After deployment, many prediction requests fail with 'Out of Memory' errors. The machine type is n1-standard-8 (30 GB memory) with no accelerator. Which action will most likely resolve the issue?

A
Change the machine type to n1-highmem-16 (104 GB memory).
Increasing memory directly resolves out-of-memory errors.
B
Use batch prediction instead of online prediction.
Why wrong: Batch prediction does not solve the underlying memory issue for the model.
C
Add a GPU accelerator (e.g., NVIDIA T4) to offload computation.
Why wrong: GPU helps compute but does not increase available RAM.
D
Quantize the model from FP32 to INT8.
Why wrong: Quantization reduces model size but intermediate activations still consume memory; OOM may persist.
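
The effect of quantization in option D can be made concrete with a rough calculation of weight memory. The sketch below assumes a parameter count of about 340 million (roughly the size of BERT-large); the scenario does not state the model size, so treat the figure as illustrative. It counts weights only and ignores activations, framework overhead, and concurrent requests.

```python
BYTES_PER_PARAM = {"fp32": 4, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumed parameter count, roughly the size of BERT-large; not given in the scenario.
num_params = 340_000_000

fp32_gb = weight_memory_gb(num_params, "fp32")
int8_gb = weight_memory_gb(num_params, "int8")
print(f"fp32 weights: {fp32_gb:.2f} GB, int8 weights: {int8_gb:.2f} GB")
```

Quantization shrinks the weights by a factor of four, but the weights are only a lower bound on serving memory. Everything the estimate leaves out still has to fit, which is why the explanation says the OOM may persist.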

Question 11 (medium, multiple choice)

A model deployed on Vertex AI Prediction is returning high latency for real-time requests. The model is a small TensorFlow model. Which troubleshooting step should the team take first?

A
Retrain the model with a larger batch size
Why wrong: Batch size affects throughput, not latency per request.
B
Check if the machine type is too small and enable autoscaling
An undersized machine type or too few replicas is a common cause of high latency even for a small model, and it is the quickest thing to check.
C
Use a custom container with optimized runtime
Why wrong: Optimization may help but is more effort than checking scaling.
D
Enable Cloud Armor to reduce traffic
Why wrong: Cloud Armor protects against DDoS and web attacks; it does not reduce serving latency for legitimate requests.

Question 12 (easy, multiple choice)

A machine learning model deployed on Vertex AI is returning erroneous predictions. The team needs to investigate the root cause by examining the prediction request and response details. Which Google Cloud tool is best suited for this?

A
Cloud Monitoring
Why wrong: Cloud Monitoring provides metrics and dashboards but does not store detailed request/response payloads.
B
Cloud Debugger
Why wrong: Cloud Debugger captures code state in production but is not designed for analyzing prediction request/response data.
C
Cloud Logging
Cloud Logging can capture structured logs from Vertex AI predictions, including request and response data for analysis.
D
Cloud Trace
Why wrong: Cloud Trace is for latency tracing across services, not for inspecting individual prediction inputs/outputs.


These PMLE practice questions are part of Courseiva's free Google Cloud certification practice question bank. Courseiva provides original exam-style PMLE questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.


Related PMLE topic practice pages

Scaling prototypes into ML models practice questions

Automating and orchestrating ML pipelines practice questions

Collaborating within and across teams to manage data and models practice questions

Architecting low-code ML solutions practice questions

Collaborating to manage data and models practice questions

Serving and scaling models practice questions

Monitoring ML solutions practice questions

Solving business challenges with ML practice questions

PMLE fundamentals practice questions

PMLE scenario practice questions

PMLE troubleshooting practice questions
