
Scenario-based practice

Refer to the Exhibit Practice Questions

Practise Google Professional Data Engineer exam-style questions: original scenarios covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.

15 scenario questions | Exam code: PDE | Vendor: Google Cloud

Scenario guide

How to approach "Refer to the exhibit" practice questions

Practise exhibit-style questions that ask you to read a topology, table, command output or diagram before choosing the best answer.

Quick answer

Exhibit-style questions test whether you can read a topology, command output, diagram or table before choosing the best answer.

How to extract the relevant detail from an exhibit.

How topology, command output or routing information affects the answer.

How to avoid answering from memory before reading the evidence.

How to map the exhibit back to the exam objective.


Practice set

Practice scenarios

Question 1 (easy, multiple choice)

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

Exhibit

Refer to the exhibit.

```
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
```
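One way to read a policy exhibit like this before looking at the answer options is to list exactly which roles are bound to the member in question. The sketch below does that with the policy text copied from the exhibit; the helper name `roles_for_member` is hypothetical, and the code only inspects direct bindings in this one policy (it does not resolve inherited or project-level grants).

```python
import json

# Policy text copied from the exhibit above.
POLICY_JSON = """
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
"""


def roles_for_member(policy, member):
    """Return the roles this policy binds directly to `member` (hypothetical helper)."""
    return sorted(
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    )


policy = json.loads(POLICY_JSON)
member = "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
granted = roles_for_member(policy, member)
print(granted)
```

Listing the bindings first makes it easier to compare what is already granted at the dataset level against what each answer option would add.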
Question 2 (hard, multiple choice)

In the Vertex AI Pipeline component YAML exhibit, the component is designed to evaluate a model and produce metrics. If the threshold_accuracy is set to 0.85, what is the expected behavior of this component?

Exhibit

Refer to the exhibit.

```
# Vertex AI Pipeline component YAML
name: model-evaluation
inputs:
  model_path:
    type: String
  test_data_path:
    type: String
  threshold_accuracy:
    type: Float
    default: 0.85
outputs:
  evaluation_metrics:
    type: Metrics
implementation:
  container:
    image: gcr.io/my-project/eval:latest
    args: [
      --model_path, {inputValue: model_path},
      --test_data_path, {inputValue: test_data_path},
      --threshold_accuracy, {inputValue: threshold_accuracy},
      --output_path, {outputPath: evaluation_metrics}
    ]
```
Question 3 (hard, multiple choice)

A company runs a Cloud Dataflow streaming pipeline that reads from Cloud Pub/Sub, applies 10-second fixed windows, joins with a slowly changing dimension table stored in Cloud Bigtable, and writes results to BigQuery. The pipeline has been running for months but recently started exhibiting increasing latency and occasional data loss. The pipeline uses default settings with autoscaling enabled (min 2, max 20 workers). The Bigtable cluster has 3 nodes. The dimensions are updated infrequently. The latency has grown from seconds to minutes. In the Dataflow monitoring UI, the 'System Lag' metric is increasing and some windows are not being emitted. The CPU utilization on Bigtable nodes is below 50%. There are no errors in the logs. Which action is most likely to resolve the issue?

Question 4 (hard, multiple choice)

The exhibit shows a Spark job submitted to Dataproc that fails with an out-of-memory error. Which change should be made to the submission command to resolve the issue?

Exhibit

Refer to the exhibit.

```
[...] --cluster=my-cluster --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  [...]
Job [job-id] submitted.
```
Question 5 (medium, multiple choice)

The exhibit shows a Cloud Logging query result. A data engineer sees this log for a streaming Dataflow job. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
resource.type="dataflow_step"
resource.labels.job_id="2023-01-01_000000-12345678"
"worker pool exhausted"
```
Question 6 (medium, multiple choice)

Refer to the exhibit. A Dataflow streaming pipeline subscribes to this Pub/Sub subscription. The pipeline occasionally takes more than 10 seconds to process a message. Which behavior will occur?

Exhibit

Refer to the exhibit.

Cloud Pub/Sub subscription configuration:

{
  "name": "projects/my-project/subscriptions/my-sub",
  "topic": "projects/my-project/topics/my-topic",
  "pushConfig": {},
  "ackDeadlineSeconds": 10,
  "messageRetentionDuration": "86400s",
  "expirationPolicy": {
    "ttl": "604800s"
  },
  "enableMessageOrdering": false,
  "retryPolicy": {
    "minimumBackoff": "10s",
    "maximumBackoff": "600s"
  },
  "deadLetterPolicy": {
    "deadLetterTopic": "projects/my-project/topics/dead-letter-topic",
    "maxDeliveryAttempts": 5
  }
}
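A configuration exhibit like this is easier to reason about once the fields the question depends on are pulled out and converted to readable units. The following is a minimal sketch using the JSON copied from the exhibit; the `seconds` helper and the 12-second processing time are hypothetical values chosen only to illustrate "more than 10 seconds". It does not model how Pub/Sub or the Dataflow connector actually manages acknowledgements.

```python
import json

# Subscription configuration copied from the exhibit above.
SUBSCRIPTION_JSON = """
{
  "name": "projects/my-project/subscriptions/my-sub",
  "topic": "projects/my-project/topics/my-topic",
  "pushConfig": {},
  "ackDeadlineSeconds": 10,
  "messageRetentionDuration": "86400s",
  "expirationPolicy": {"ttl": "604800s"},
  "enableMessageOrdering": false,
  "retryPolicy": {"minimumBackoff": "10s", "maximumBackoff": "600s"},
  "deadLetterPolicy": {
    "deadLetterTopic": "projects/my-project/topics/dead-letter-topic",
    "maxDeliveryAttempts": 5
  }
}
"""


def seconds(duration):
    """Convert a duration string such as "86400s" to whole seconds (hypothetical helper)."""
    return int(duration.rstrip("s"))


sub = json.loads(SUBSCRIPTION_JSON)
summary = {
    # An empty pushConfig means the subscription is pull-based.
    "delivery_type": "push" if sub["pushConfig"] else "pull",
    "ack_deadline_s": sub["ackDeadlineSeconds"],
    "retention_hours": seconds(sub["messageRetentionDuration"]) / 3600,
    "expiration_days": seconds(sub["expirationPolicy"]["ttl"]) / 86400,
    "max_delivery_attempts": sub["deadLetterPolicy"]["maxDeliveryAttempts"],
}

# Hypothetical processing time standing in for "more than 10 seconds".
processing_time_s = 12
exceeds_ack_deadline = processing_time_s > summary["ack_deadline_s"]

print(summary)
print("processing exceeds ack deadline:", exceeds_ack_deadline)
```

The comparison at the end isolates the detail the question is built around: the processing time relative to `ackDeadlineSeconds`, alongside the retry and dead-letter settings that govern what happens next.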
Question 7 (hard, multiple choice)

Refer to the exhibit. A Dataflow pipeline writes to BigQuery table employee_records. The pipeline was working yesterday but fails today. What is the most likely cause?

Exhibit

Refer to the exhibit.

Error log from Dataflow job:

"""
Workflow failed. Causes: S3D3: BigQueryIO.Write/BatchLoads/Loads/AllocateLoadTable/ParDo(AllocateLoadTable) failed.
org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write$BigQueryWriteException: BigQuery insertion failed: Response JSON: {
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
      }
    ],
    "code": 400,
    "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
  }
}
"""
Question 8 (easy, multiple choice)

Based on the exhibit, what is the most likely cause of the out-of-memory error?

Exhibit

Refer to the exhibit.

```
# Dataflow pipeline error log:
Workflow failed. Causes: S02:ReadPubSub/Read+Transform/ParDo(ExtractTimestamps)+ ... (4b9c3d2e)
The job failed because a worker experienced a "out of memory" error.
```

Pipeline configuration:
- Streaming engine: disabled
- Worker machine type: n1-standard-4 (4 vCPU, 15 GB memory)
- Number of workers: 2 (autoscaling enabled, max 10)
- Input: Pub/Sub topic with 1000 messages/sec, each message ~50 KB
- Transform: Parse JSON, enrich with external API call, window into 1-minute fixed windows, write to BigQuery
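Before choosing an answer, it can help to turn the pipeline configuration into rough numbers. The sketch below is a back-of-the-envelope calculation using only the figures in the exhibit; it assumes decimal units (1 MB = 1000 KB) and counts raw payload bytes only, so real in-memory usage after JSON parsing and enrichment would likely be higher.

```python
# Figures taken from the pipeline configuration in the exhibit.
MESSAGES_PER_SECOND = 1000
MESSAGE_SIZE_KB = 50
WINDOW_SECONDS = 60          # 1-minute fixed windows
INITIAL_WORKERS = 2
WORKER_MEMORY_GB = 15        # n1-standard-4

# Assumption: decimal units, raw payload only.
ingest_mb_per_s = MESSAGES_PER_SECOND * MESSAGE_SIZE_KB / 1000
window_payload_gb = ingest_mb_per_s * WINDOW_SECONDS / 1000
per_worker_mb_per_s = ingest_mb_per_s / INITIAL_WORKERS
total_initial_memory_gb = INITIAL_WORKERS * WORKER_MEMORY_GB

print(f"ingest rate: {ingest_mb_per_s} MB/s")
print(f"payload per 1-minute window: {window_payload_gb} GB")
print(f"per-worker ingest at {INITIAL_WORKERS} workers: {per_worker_mb_per_s} MB/s")
print(f"total memory at {INITIAL_WORKERS} workers: {total_initial_memory_gb} GB")
```

Having the ingest rate and per-window volume in hand makes it easier to judge which configuration line (Streaming Engine, machine type, worker count, or the transform itself) each answer option is pointing at.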
Question 9 (hard, multiple choice)

Based on the exhibit, what is the most likely cause of duplicate rows despite using the same event_id as insertId?

Exhibit

Refer to the exhibit.

```
# BigQuery table schema and sample data
Table: mydataset.events
Columns:
  event_id: STRING (REQUIRED)
  event_timestamp: TIMESTAMP (REQUIRED)
  event_data: STRING (NULLABLE)
  user_id: STRING (REQUIRED)
Partitioned by: event_timestamp (daily)
Clustered by: user_id

Job: Dataflow pipeline writing 1000 events/second to this table using streaming inserts with insertId = event_id.

Monitoring shows intermittent 'duplicate rows' in queries that count distinct event_ids.
```
Question 10 (hard, multiple choice)

A Dataflow pipeline as described in the exhibit has increasing lag. Which optimization is most likely to reduce the lag?

Exhibit

Refer to the exhibit.

Pipeline description:
- Source: PubSubIO.read()
- Transform: ParDo(Process)
- Window: Window.into(FixedWindows of 1 minute)
- Transform: GroupByKey
- Sink: Write to BigQuery using StreamingInserts
- Estimated throughput: 10MB/s
- Observed lag: increasing
Question 11 (medium, multiple choice)

Refer to the exhibit. What is the most likely cause of the error?

Exhibit

Error: Vertex AI.Exception: 400 Failed to deploy model to endpoint projects/.../endpoints/1234. Details: The resource 'projects/.../models/5678' is missing an artifact URI. Please upload the model artifact to Cloud Storage and create a new model version.
Question 12 (medium, multiple choice)

Refer to the exhibit. A BigQuery dataset has the IAM policy shown above. An analyst is trying to run a SELECT query on a table in this dataset but receives an 'Access Denied' error. What is the most likely reason?

Exhibit

Refer to the exhibit.
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "user:analyst@example.com"
      ]
    },
    {
      "role": "roles/bigquery.metadataviewer",
      "members": [
        "user:analyst@example.com"
      ]
    }
  ],
  "etag": "BwXX2Yz7k0Q="
}
Question 13 (easy, multiple choice)

Refer to the exhibit. An auditor sees the following output from `gcloud ai models list`. What can they conclude about versioning?

Exhibit

MODEL_ID: my_model
VERSION_ID: v1
DISPLAY_NAME: my_model_v1
STATE: READY
VERSION_UPDATE_TIME: 2023-01-10T12:00:00

MODEL_ID: my_model
VERSION_ID: v2
DISPLAY_NAME: my_model_v2
STATE: READY
VERSION_UPDATE_TIME: 2023-01-15T12:00:00
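Command output exhibits are often easiest to interpret once the records are grouped by the field the question asks about. The following sketch parses the output exactly as printed in the exhibit and groups version IDs by model ID; `parse_records` is a hypothetical helper written for this text layout, not part of any Google Cloud tooling.

```python
# Output text copied from the exhibit above.
OUTPUT = """
MODEL_ID: my_model
VERSION_ID: v1
DISPLAY_NAME: my_model_v1
STATE: READY
VERSION_UPDATE_TIME: 2023-01-10T12:00:00

MODEL_ID: my_model
VERSION_ID: v2
DISPLAY_NAME: my_model_v2
STATE: READY
VERSION_UPDATE_TIME: 2023-01-15T12:00:00
"""


def parse_records(text):
    """Split blank-line-separated "KEY: value" blocks into dicts (hypothetical helper)."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            # Split on the first ": " only, so timestamps keep their colons.
            key, _, value = line.partition(": ")
            record[key] = value
        records.append(record)
    return records


records = parse_records(OUTPUT)

# Group version IDs under the model they belong to.
versions_by_model = {}
for record in records:
    versions_by_model.setdefault(record["MODEL_ID"], []).append(record["VERSION_ID"])

# ISO 8601 timestamps in the same format sort correctly as plain strings.
latest = max(records, key=lambda r: r["VERSION_UPDATE_TIME"])

print(versions_by_model)
print("most recently updated:", latest["VERSION_ID"])
```

Grouping this way shows at a glance whether the rows describe separate models or several versions under one model ID, which is the observation the question turns on.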
Question 14 (hard, multiple choice)

Refer to the exhibit. A team is trying to run a custom prediction container on Vertex AI Endpoint. They get this error when the container starts. What is the most likely cause?

Exhibit

Log: "Container failed with error: exec format error. Ensure the container has an entry point."
Question 15 (medium, multiple choice)

Refer to the exhibit. An ML engineer sees this error when invoking a Vertex AI endpoint. What is the most likely cause?

Exhibit

Refer to the exhibit.
{
  "error": {
    "code": 400,
    "message": "Prediction failed: Exception during run: Input tensor shape mismatch. Expected: [1, 128, 128, 3]. Got: [1, 256, 256, 3] in model 'resnet50'.",
    "status": "INVALID_ARGUMENT"
  }
}
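For an error exhibit like this one, the key step is to extract the two shapes and see which axes disagree. The sketch below does this with the response copied from the exhibit; the regular expression is an assumption tailored to this particular message format and is not a general parser for Vertex AI errors.

```python
import json
import re

# Error response copied from the exhibit above.
RESPONSE_JSON = """
{
  "error": {
    "code": 400,
    "message": "Prediction failed: Exception during run: Input tensor shape mismatch. Expected: [1, 128, 128, 3]. Got: [1, 256, 256, 3] in model 'resnet50'.",
    "status": "INVALID_ARGUMENT"
  }
}
"""

message = json.loads(RESPONSE_JSON)["error"]["message"]

# Assumption: shapes appear as bracketed, comma-separated integers,
# with the expected shape first and the received shape second.
raw_shapes = re.findall(r"\[([\d,\s]+)\]", message)
expected, got = [tuple(int(dim) for dim in raw.split(",")) for raw in raw_shapes]

# Axes where the request differs from what the model expects.
mismatched_axes = [
    axis for axis, (e, g) in enumerate(zip(expected, got)) if e != g
]

print("expected:", expected)
print("got:", got)
print("mismatched axes:", mismatched_axes)
```

Seeing that the batch and channel axes agree while only the two middle axes differ narrows down which part of the request the answer options should be addressing.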

These PDE practice questions are part of Courseiva's free Google Cloud certification practice question bank. Courseiva provides original exam-style PDE questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.