How many troubleshooting scenario questions are on this page?

This page has 8 troubleshooting scenario questions for the PDE exam, each with detailed explanations and wrong-answer analysis.

How should I approach PDE scenario questions?

Read the full scenario before looking at the answer options. Identify the constraint or requirement in the scenario, then eliminate options that are generally true but wrong for this specific case. Scenario questions reward careful reading over pattern matching.

Scenario-based practice

Troubleshooting Scenario Questions

Practise with original exam-style Google Professional Data Engineer scenarios covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.

Scenario questions: 8 · Exam code: PDE · Vendor: Google Cloud

Scenario guide

How to approach troubleshooting scenario questions

These questions describe a symptom in a data pipeline or ML deployment and ask you to identify the root cause or the correct fix. Questions of this kind appear across certification exams and reward systematic thinking over memorisation. The best candidates follow a consistent troubleshooting framework even under time pressure.

Quick answer

Troubleshooting scenario questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Practice scenarios

Question 1 (hard, multiple choice)

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

A
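Check Cloud Logging (formerly Stackdriver Logging) for error messages.
Why correct: The logs identify the root cause before any configuration is changed.
B
Disable exactly-once processing in Dataflow.
Why wrong: Relaxing the delivery guarantee risks duplicate records and does nothing to diagnose the failure.
C
Increase the number of Dataflow workers.
Why wrong: Adding workers may mask the underlying issue without identifying it.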
D
Switch to BigQuery streaming inserts.
Why wrong: Design change, not troubleshooting.
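
Checking the logs first can be sketched as a small triage step. The example below is a minimal sketch, assuming the log entries have already been exported as dictionaries with `severity` and `textPayload` keys (the field names follow the Cloud Logging LogEntry format). The `summarize_errors` helper and the sample entries are hypothetical and only illustrate the idea of surfacing the most frequent error before changing the pipeline.

```python
from collections import Counter

# Severities treated as failures; names follow Cloud Logging's LogSeverity values.
ERROR_SEVERITIES = {"ERROR", "CRITICAL", "ALERT", "EMERGENCY"}


def summarize_errors(entries):
    """Count distinct error messages so the most frequent failure surfaces first."""
    counts = Counter(
        entry.get("textPayload", "<no message>")
        for entry in entries
        if entry.get("severity") in ERROR_SEVERITIES
    )
    return counts.most_common()


# Hypothetical sample entries, for illustration only.
sample_entries = [
    {"severity": "INFO", "textPayload": "Worker started"},
    {"severity": "ERROR", "textPayload": "Deserialization failed for message"},
    {"severity": "ERROR", "textPayload": "Deserialization failed for message"},
    {"severity": "CRITICAL", "textPayload": "Worker out of memory"},
]

for message, count in summarize_errors(sample_entries):
    print(f"{count}x {message}")
```

Grouping by message rather than reading entries one by one makes a dominant failure stand out, which is the point of starting with logs instead of scaling or redesigning.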

Question 2 (hard, multiple choice)

Refer to the exhibit. A Dataflow pipeline writes to BigQuery table employee_records. The pipeline was working yesterday but fails today. What is the most likely cause?

Exhibit

Error log from Dataflow job:

"""
Workflow failed. Causes: S3D3: BigQueryIO.Write/BatchLoads/Loads/AllocateLoadTable/ParDo(AllocateLoadTable) failed.
org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write$BigQueryWriteException: BigQuery insertion failed: Response JSON: {
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
      }
    ],
    "code": 400,
    "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
  }
}
"""

A
The pipeline dropped the last_name field entirely.
Why wrong: Dropping a field would cause a missing field error, not type mismatch.
B
The pipeline code was changed to send an integer for the last_name field.
Why correct: The error states that the table field last_name has type STRING while the pipeline provided INTEGER, so the type sent by the pipeline is what changed.
C
The BigQuery table quota was exceeded.
Why wrong: Quota errors have different error messages.
D
The BigQuery table schema was changed from STRING to INTEGER for last_name.
Why wrong: If the table had been changed to INTEGER, the error would report the field's table type as INTEGER. It reports the table type as STRING and the provided type as INTEGER.
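
The reasoning in this question amounts to comparing the table's field types with the types the pipeline supplies. The sketch below shows that comparison in plain Python, assuming both schemas are available as simple name-to-type dictionaries. The `find_type_mismatches` helper and the `employee_id` field are hypothetical; only `last_name` and its two types come from the exhibit.

```python
def find_type_mismatches(table_schema, provided_schema):
    """Return (field, table_type, provided_type) for each field whose types differ."""
    mismatches = []
    for field, provided_type in provided_schema.items():
        table_type = table_schema.get(field)
        # Fields absent from the table are a different failure mode and are skipped here.
        if table_type is not None and table_type != provided_type:
            mismatches.append((field, table_type, provided_type))
    return mismatches


# Schemas mirroring the exhibit: the table stores last_name as STRING,
# while the pipeline now supplies INTEGER. employee_id is a made-up extra field.
table_schema = {"employee_id": "INTEGER", "last_name": "STRING"}
provided_schema = {"employee_id": "INTEGER", "last_name": "INTEGER"}

for field, table_type, provided_type in find_type_mismatches(table_schema, provided_schema):
    print(f"Field {field} has type {table_type} but provided type {provided_type}")
```

Reading the error in this "table type versus provided type" form is what separates option B from option D.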

Question 3 (medium, multi select)

A company uses Cloud Composer to orchestrate data pipelines. They have a DAG that runs hourly and processes files from Cloud Storage. The DAG is triggered by a Pub/Sub message sent from a Cloud Storage bucket notification. Recently, some DAG runs are not starting even though the Pub/Sub messages are published. Which two likely causes should the team investigate? (Choose TWO.)

A
The Cloud Storage bucket notification is not sending messages to the correct Pub/Sub topic, or the subscription's ack deadline is too short.
Why correct: A misconfigured notification or subscription can cause message loss.
B
The DAG's start_date is set in the past and catchup is set to False, so DAG runs are only triggered on schedule.
Why wrong: The DAG is sensor-triggered, not schedule-triggered.
C
The total number of DAGs in the environment exceeds the maximum limit of 100, causing DAG processing to stop.
Why wrong: The default limit is 500 DAGs, and even if it is exceeded, Composer does not stop processing.
D
The DAG's schedule interval is set too frequently, so the executor queue fills up and new runs are skipped.
Why correct: If a DAG run takes longer than the interval, subsequent runs may be queued and skipped.
E
The Cloud Composer environment is using a pull subscription instead of a push subscription for the Pub/Sub sensor.
Why wrong: Cloud Composer's Pub/Sub sensors use pull subscriptions, so a pull subscription is the expected configuration.

Question 4 (medium, multi select)

A Dataflow streaming job is processing data from Pub/Sub and writing to BigQuery. The job is stuck with the message 'No progress has been made' for several minutes. Which TWO actions should the team take to troubleshoot and resolve the issue? (Choose TWO.)

A
Set the updateCompatibility flag to true and restart the pipeline.
Why wrong: updateCompatibility is for checking the compatibility of a new pipeline version, not for diagnosing stuck jobs.
B
Increase the persistent disk size for all workers to reduce I/O contention.
Why wrong: Disk size is rarely the cause of no progress in Dataflow.
C
Examine the worker logs in Cloud Logging for any error messages or exceptions.
Why correct: Logs can reveal the root cause, such as out-of-memory errors or stuck transforms.
D
Force stop the pipeline and update it with a new version using the --update flag.
Why wrong: Force stopping may cause data loss; it is better to diagnose without stopping.
E
Enable Dataflow Streaming Engine to move state to the backend and reduce worker load.
Why correct: Streaming Engine can alleviate progress issues caused by state contention.

Question 5 (easy, multiple choice)

A team deployed a model to a Vertex AI endpoint and notices latency spikes during peak hours. What should they investigate first?

A
Switch to batch prediction
Why wrong: Batch prediction is not for real-time use cases.
B
Reduce number of features
Why wrong: Reducing features may impact model accuracy and does not address scaling issues.
C
Increase machine type
Why wrong: Increasing machine type addresses capacity but not scaling logic.
D
Check if autoscaling is enabled and configured correctly
Why correct: Autoscaling misconfiguration is a common cause of latency spikes during traffic surges.
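
To see why an autoscaling setting can produce peak-hour latency, it helps to work through a simplified scaling rule. The sketch below uses a generic proportional rule (replicas scale with the ratio of observed to target utilization, clamped to configured bounds). This is an illustrative assumption, not Vertex AI's documented algorithm, and the `desired_replicas` function and all numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas):
    """Proportional scaling rule, clamped to the configured replica bounds."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    wanted = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, wanted))


# With max_replicas pinned to 1, the endpoint cannot scale out during a peak.
print(desired_replicas(1, 0.95, 0.60, min_replicas=1, max_replicas=1))
# Raising the ceiling lets the same load spread across more replicas.
print(desired_replicas(1, 0.95, 0.60, min_replicas=1, max_replicas=5))
```

Under this model, a maximum replica count equal to the minimum means the extra demand never turns into extra capacity, which is the kind of misconfiguration option D asks the team to rule out first.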

Question 6 (medium, multiple choice)

A data science team wants to deploy a model that requires a custom container with a specific NVIDIA CUDA version. They build the image and push it to Artifact Registry. When deploying to Vertex AI, the model fails to load with an error: 'Failed to start container: invalid ELF header'. What is the most likely cause?

A
The container image was built for a different CPU architecture (e.g., ARM64) than the Vertex AI machine (x86_64)
Why correct: An invalid ELF header indicates that the binary is incompatible with the platform architecture.
B
The model file (saved as .pkl) is corrupted
Why wrong: Corrupted model files would cause loading errors within the container, not the container failing to start.
C
The CUDA version in the container is incompatible with the GPU on the machine
Why wrong: CUDA incompatibility would produce a different error, not invalid ELF header.
D
The container does not have the necessary permissions to access the model file in Cloud Storage
Why wrong: Permission issues would cause access denied errors, not container startup failure.
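
The architecture a binary was built for is recorded in its ELF header, so the mismatch described in option A can be checked directly. The sketch below reads the `e_machine` field from the first 20 bytes of an ELF file using only the standard library. The `elf_architecture` helper and the hand-built sample header are hypothetical; the offsets and machine codes follow the ELF specification.

```python
import struct

# e_machine values from the ELF specification.
ELF_MACHINES = {0x3E: "x86_64", 0xB7: "aarch64"}


def elf_architecture(header: bytes) -> str:
    """Return the target architecture named in an ELF header's e_machine field."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after the 16-bit e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown (0x{machine:X})")


# Hand-built 20-byte header for a 64-bit little-endian ARM64 executable.
arm64_header = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 0xB7)
print(elf_architecture(arm64_header))
```

Against a real binary, the same function could be given the first 20 bytes read from the file in binary mode. A result other than `x86_64` for an image destined for an x86_64 machine points to the build platform rather than to CUDA, the model file, or permissions.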

Question 7 (easy, multiple choice)

Refer to the exhibit. A subscriber is unable to pull messages from the topic. What is the most likely cause?

Exhibit

gcloud pubsub topics get-iam-policy my-topic
Bindings:
- members:
  - serviceAccount:sa@project.iam.gserviceaccount.com
  role: roles/pubsub.subscriber

A
The service account has the subscriber role but the topic is not configured correctly.
Why wrong: The subscriber role on the topic is necessary but not sufficient without a subscription.
B
The service account needs roles/pubsub.viewer to list subscriptions.
Why wrong: The viewer role is not required for pulling; the subscriber role is sufficient.
C
No subscription has been created for the topic.
Why correct: Messages are pulled from a subscription, not directly from a topic, so a subscription must exist before the subscriber can pull.
D
The service account lacks roles/pubsub.publisher.
Why wrong: The subscriber does not need the publisher role.

Question 8 (hard, multiple choice)

Your company runs a real-time recommendation system for a popular e-commerce website using a machine learning model deployed on Vertex AI Endpoints. The model takes user features and product catalog data as input and returns top-10 product recommendations. The system uses a feature store to serve user embeddings and product embeddings. Recently, the recommender team retrained the model with a new algorithm and deployed it as a new version. Since the deployment, the latency for recommendation requests has increased from 100ms to 500ms on average, exceeding the 200ms SLO. The model accuracy is acceptable, and there are no errors. The endpoint uses an n1-standard-8 machine with a single GPU. The new model is larger but still fits on the GPU. You investigate and find that the GPU utilization remains low (<20%), but CPU utilization is high (90%). What should you do to reduce latency while maintaining accuracy?

A
Upgrade the machine type to one with more GPU memory (e.g., n1-standard-8 with a larger GPU) to reduce model inference time.
Why wrong: GPU memory is not the issue; GPU utilization is low, so inference is fast. More GPU memory won't help CPU bottleneck.
B
Change the batch size in the model serving code to process multiple requests together, improving GPU utilization.
Why wrong: Batching increases latency for individual requests as they wait for a batch to fill. It would not reduce CPU bottleneck and could hurt latency SLO.
C
Increase the number of replicas (nodes) to parallelize the CPU-bound preprocessing work.
Why correct: Adding replicas distributes the CPU-bound preprocessing load across more CPUs, which reduces per-request latency if the load balancer dispatches requests evenly. The trade-off is higher cost.
D
Offload preprocessing to a dedicated Cloud Run service that runs asynchronously and returns precomputed feature vectors.
Why wrong: Asynchronous preprocessing would not help if the model needs features inline; synchronous processing is required. Offloading could add network latency.

These PDE practice questions are part of Courseiva's free Google Cloud certification practice question bank. Courseiva provides original exam-style PDE questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.
