PDE Building and operationalizing data processing systems — All Questions With Answers

Question 1 (medium, multiple choice)

A company is migrating its on-premises Apache Spark jobs to Dataproc. The jobs read from and write to Cloud Storage. After migration, the jobs are slower than expected. The Dataproc cluster uses standard worker machines with local SSDs. What is the most likely cause of the performance degradation?

Question 2 (hard, multiple choice)

A data pipeline ingests real-time events from Cloud Pub/Sub into BigQuery using Dataflow. The pipeline uses a sliding window of 5 minutes with a 1-minute period to aggregate event counts. Recently, the pipeline started failing with 'The worker failed to provide a heartbeat.' The Dataflow logs show high CPU usage on the workers. What is the best course of action to resolve the issue?
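
To make the windowing in this scenario concrete, the following is a minimal sketch in plain Python (not Beam code) of how a 5-minute sliding window with a 1-minute period assigns one event to several overlapping windows. The function name and the epoch-second timestamps are assumptions made for illustration only.

```python
from typing import List, Tuple

WINDOW_SIZE_S = 5 * 60   # 5-minute window length, from the question
WINDOW_PERIOD_S = 60     # a new window starts every minute, from the question


def sliding_windows_for(event_ts: int,
                        size: int = WINDOW_SIZE_S,
                        period: int = WINDOW_PERIOD_S) -> List[Tuple[int, int]]:
    """Return the [start, end) windows that contain event_ts (epoch seconds)."""
    # Latest window start at or before the event, aligned to the period.
    start = event_ts - (event_ts % period)
    windows = []
    # Walk backwards while the window still covers the event.
    while start > event_ts - size:
        windows.append((start, start + size))
        start -= period
    return windows


if __name__ == "__main__":
    for window in sliding_windows_for(1000):
        print(window)
```

With these settings every event lands in size / period = 5 windows, so each incoming element is aggregated five times.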

Question 3 (easy, multiple choice)

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

Question 4 (hard, multiple choice)

A financial services company uses Cloud Composer to orchestrate a daily workflow that includes a Dataproc job for risk analysis. The workflow sometimes fails because the Dataproc cluster creation times out. The cluster creation typically takes 3 minutes, but occasionally takes over 10 minutes. What is the most effective way to handle this variability?

Question 5 (medium, multiple choice)

A company is using Dataflow to stream data from Cloud Pub/Sub to BigQuery. The pipeline includes a custom ParDo transformation that enriches the data with external API calls. The pipeline is experiencing high latency and occasional failures due to API timeouts. What strategy should be employed to improve reliability and performance?

Question 6 (easy, multiple choice)

A data engineer needs to process a large dataset (500 TB) stored in Cloud Storage using Dataproc. The processing job requires reading the entire dataset and writing results back to Cloud Storage. The job is expected to run for 6 hours. Which configuration minimizes cost?
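
As a rough sizing aid for this scenario, the sketch below computes the aggregate read throughput implied by scanning 500 TB in 6 hours. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even read rate, neither of which the question states.

```python
DATASET_BYTES = 500 * 10**12      # 500 TB, decimal units (assumption)
RUNTIME_SECONDS = 6 * 60 * 60     # 6 hours, from the question


def required_read_throughput_gb_s(total_bytes: int, seconds: int) -> float:
    """Average cluster-wide read rate in GB/s needed to scan the data once."""
    return total_bytes / seconds / 10**9


if __name__ == "__main__":
    rate = required_read_throughput_gb_s(DATASET_BYTES, RUNTIME_SECONDS)
    print(f"{rate:.1f} GB/s aggregate read throughput")
```

Under these assumptions the cluster as a whole has to sustain roughly 23 GB/s of reads, before counting the writes back to Cloud Storage.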

Question 7 (hard, multiple choice)

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

Question 8 (medium, multiple choice)

A company uses Cloud Composer to orchestrate data pipelines. One DAG fails intermittently with the error: 'Task received SIGTERM signal.' The task runs a long-running Dataproc job. What is the most likely cause?

Question 9 (medium, multi select)

Which TWO factors should be considered when choosing between Cloud Dataflow and Dataproc for a batch processing pipeline?

Question 10 (hard, multi select)

Which THREE best practices should be followed when designing a Dataflow pipeline for real-time data processing?

Question 11 (easy, multi select)

Which TWO actions can reduce the cost of running a Dataproc cluster for a nightly batch job?

Question 12 (hard, multiple choice)

Refer to the exhibit. A Dataflow pipeline writes to BigQuery table employee_records. The pipeline was working yesterday but fails today. What is the most likely cause?

Exhibit

Error log from Dataflow job:

```
Workflow failed. Causes: S3D3: BigQueryIO.Write/BatchLoads/Loads/AllocateLoadTable/ParDo(AllocateLoadTable) failed.
org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write$BigQueryWriteException: BigQuery insertion failed: Response JSON: {
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
      }
    ],
    "code": 400,
    "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
  }
}
```
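
To show what the error message in the exhibit is reporting, here is a minimal sketch of a field-by-field type comparison between a table schema and the schema a writer provides. It is plain Python, not BigQuery client code; the helper name and every field other than `last_name` are hypothetical.

```python
from typing import Dict, List


def find_type_mismatches(table_schema: Dict[str, str],
                         provided_schema: Dict[str, str]) -> List[str]:
    """Describe each field whose provided type differs from the table's type."""
    mismatches = []
    for field, provided_type in provided_schema.items():
        table_type = table_schema.get(field)
        # Only fields present in both schemas can have a type conflict.
        if table_type is not None and table_type != provided_type:
            mismatches.append(
                f"Field {field} has type {table_type} "
                f"but provided type {provided_type}"
            )
    return mismatches


if __name__ == "__main__":
    table = {"employee_id": "INTEGER", "last_name": "STRING"}      # hypothetical
    provided = {"employee_id": "INTEGER", "last_name": "INTEGER"}  # hypothetical
    for line in find_type_mismatches(table, provided):
        print(line)
```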

Question 13 (medium, multiple choice)

Refer to the exhibit. A Dataflow streaming pipeline subscribes to this Pub/Sub subscription. The pipeline occasionally takes more than 10 seconds to process a message. Which behavior will occur?

Exhibit

Cloud Pub/Sub subscription configuration:

{
  "name": "projects/my-project/subscriptions/my-sub",
  "topic": "projects/my-project/topics/my-topic",
  "pushConfig": {},
  "ackDeadlineSeconds": 10,
  "messageRetentionDuration": "86400s",
  "expirationPolicy": {
    "ttl": "604800s"
  },
  "enableMessageOrdering": false,
  "retryPolicy": {
    "minimumBackoff": "10s",
    "maximumBackoff": "600s"
  },
  "deadLetterPolicy": {
    "deadLetterTopic": "projects/my-project/topics/dead-letter-topic",
    "maxDeliveryAttempts": 5
  }
}
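
For readers who find the duration strings hard to scan, the sketch below parses the subscription configuration from the exhibit and restates its key settings in more familiar units. The helper names are assumptions made for this example; the JSON is copied from the exhibit.

```python
import json

SUBSCRIPTION_JSON = """
{
  "name": "projects/my-project/subscriptions/my-sub",
  "topic": "projects/my-project/topics/my-topic",
  "pushConfig": {},
  "ackDeadlineSeconds": 10,
  "messageRetentionDuration": "86400s",
  "expirationPolicy": {"ttl": "604800s"},
  "enableMessageOrdering": false,
  "retryPolicy": {"minimumBackoff": "10s", "maximumBackoff": "600s"},
  "deadLetterPolicy": {
    "deadLetterTopic": "projects/my-project/topics/dead-letter-topic",
    "maxDeliveryAttempts": 5
  }
}
"""


def to_seconds(duration: str) -> int:
    """Convert a duration string such as '86400s' to an integer second count."""
    return int(duration.rstrip("s"))


def summarize(config: dict) -> dict:
    """Restate the settings that matter for redelivery in readable units."""
    return {
        "ack_deadline_s": config["ackDeadlineSeconds"],
        "retention_hours": to_seconds(config["messageRetentionDuration"]) / 3600,
        "expiration_days": to_seconds(config["expirationPolicy"]["ttl"]) / 86400,
        # An empty pushConfig means the subscription is a pull subscription.
        "is_pull": not config.get("pushConfig"),
        "max_delivery_attempts": config["deadLetterPolicy"]["maxDeliveryAttempts"],
    }


if __name__ == "__main__":
    print(summarize(json.loads(SUBSCRIPTION_JSON)))
```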

Question 14 (hard, multiple choice)

A company runs a critical real-time data pipeline using Dataflow that ingests events from Cloud Pub/Sub, performs aggregations using sliding windows, and writes results to BigQuery. The pipeline is deployed in us-central1. The pipeline's latency has increased recently, and the Dataflow monitoring shows that the 'system lag' metric is consistently above 5 minutes. The pipeline is using Streaming Engine and has 10 workers with 4 vCPUs each. The pipeline processes approximately 100,000 events per second. The team has verified that the source Pub/Sub topic has sufficient publish throughput and the BigQuery table has no quota issues. The pipeline logs show that some workers are experiencing GC overhead limit exceeded errors. The pipeline code uses stateful processing with a custom keyed state for deduplication. What is the most likely cause of the increased latency?
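
The figures in this scenario can be turned into a per-worker load with simple arithmetic. The sketch below assumes events are spread evenly across workers, which the question does not guarantee (keyed state can concentrate load on a few workers).

```python
EVENTS_PER_SECOND = 100_000   # from the question
WORKERS = 10                  # from the question
VCPUS_PER_WORKER = 4          # from the question


def load_per_unit(events_per_second: int, workers: int, vcpus_per_worker: int):
    """Return (events/s per worker, events/s per vCPU) under an even spread."""
    per_worker = events_per_second / workers
    per_vcpu = per_worker / vcpus_per_worker
    return per_worker, per_vcpu


if __name__ == "__main__":
    per_worker, per_vcpu = load_per_unit(EVENTS_PER_SECOND, WORKERS, VCPUS_PER_WORKER)
    print(f"{per_worker:.0f} events/s per worker, {per_vcpu:.0f} events/s per vCPU")
```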

Question 15 (medium, multiple choice)

A company uses Cloud Composer to orchestrate a daily ETL pipeline that includes multiple Dataproc jobs. The pipeline processes sensitive financial data. The security team requires that all data in transit be encrypted and that all Cloud Storage buckets used by the pipeline have uniform bucket-level access enabled and sit inside a VPC Service Controls perimeter. The pipeline currently uses a single Cloud Composer environment in us-east1. The Dataproc clusters are created using the standard image and use custom service accounts with minimal permissions. The pipeline runs successfully during testing, but in production, the Dataproc jobs fail with 'Access Denied' errors when trying to write to a Cloud Storage bucket. The bucket has uniform bucket-level access enabled and is inside a VPC Service Controls perimeter. The Dataproc service account has the Storage Object Admin role at the project level. What is the most likely cause of the access denied error?

Question 16 (medium, multiple choice)

A company is building a real-time streaming pipeline using Pub/Sub and Dataflow to process clickstream data. The pipeline writes aggregated metrics to BigQuery every 10 seconds using a fixed window. During peak traffic, some windows produce duplicate rows in BigQuery. What is the most likely cause?

Question 17 (hard, multiple choice)

A data engineer is designing a batch ETL pipeline that reads CSV files from Cloud Storage, transforms them using Dataproc, and writes the results to BigQuery. The data volume is expected to grow 10x in the next year. Which design approach best balances cost and performance?

Question 18 (easy, multi select)

A data team uses Cloud Composer to orchestrate Airflow DAGs. They need to ensure that a downstream task runs only if at least two out of three upstream sensor tasks succeed. Which TWO configurations should they combine?

Question 19 (medium, multiple choice)

A Dataflow pipeline reads log files from Cloud Storage, parses them into LogEvent objects, and writes to BigQuery. The pipeline fails with the errors shown in the exhibit. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
# Dataflow pipeline log snippet
2024-03-15 10:00:00 ERROR Transform 'ParseLogs': org.apache.beam.sdk.util.WindowedValue$CoderLoadingException: Unable to load coder for class com.example.LogEvent
2024-03-15 10:00:01 ERROR Transform 'ParseLogs': java.lang.NoSuchMethodError: com.example.LogEvent: method <init>()V not found
```

Question 20 (hard, multiple choice)

A financial services company runs a batch Dataflow pipeline daily to process transaction data. The pipeline reads from Cloud Storage, performs complex transformations, and writes to BigQuery. Recently, the pipeline has been failing intermittently with the error: 'Workflow failed. Causes: (9c3f7a2b1d4e): The worker missed 2000 data samples in the last 30 seconds. This can be caused by a variety of factors, including slow work items, network issues, or resource contention.' The team has already increased the number of workers and tried using e2-standard-8 machine types, but the issue persists. The pipeline processes approximately 500 GB of data per run and uses approximately 200 workers. The team suspects that the issue might be related to shuffle operations. What should the team do next to resolve the issue?

Question 21 (medium, drag order)

Drag and drop the steps to set up a BigQuery dataset with a scheduled query into the correct order.

Question 22 (medium, drag order)

Drag and drop the steps to set up Cloud IAP (Identity-Aware Proxy) for an App Engine app into the correct order.

Question 23 (medium, drag order)

Drag and drop the steps to create a Cloud Function triggered by Cloud Storage events into the correct order.

Question 24 (medium, matching)

Match each data pipeline term to its definition.

Matches

Extract, Transform, Load

Extract, Load, Transform

Raw data storage in native format

Optimized storage for structured analytics

Question 25 (medium, matching)

Match each data storage term to its characteristic.

Matches

Atomicity, Consistency, Isolation, Durability

Basically Available, Soft state, Eventual consistency

Consistency, Availability, Partition tolerance trade-off

Horizontal partitioning of data across databases

Question 26 (medium, matching)

Match each data lifecycle stage to its description.

Matches

Collecting data from various sources

Persisting data in a durable system

Transforming and analyzing data

Making data available for consumption

Moving data to long-term, low-cost storage

Question 27 (easy, multiple choice)

A company uses Cloud Dataflow to process streaming data from Pub/Sub into BigQuery. The pipeline uses a side input from a Cloud Bigtable table containing user profile information to enrich the events. The side input is updated every hour. Which approach should the company use to ensure that the pipeline uses the latest profile data without causing high memory usage?

Question 28 (medium, multiple choice)

A data engineering team uses Cloud Pub/Sub to ingest clickstream events and Cloud Dataflow to process them. They need to maintain strict event ordering per user session, and the processing output must be written to a BigQuery table with exactly-once semantics. Which configuration should the team implement?

Question 29 (hard, multiple choice)

A company runs a Dataproc cluster for nightly batch jobs. The cluster uses preemptible workers for cost savings. Recently, the jobs have been failing intermittently with 'Disk quota exceeded' errors on the persistent disks attached to the preemptible workers. The cluster is configured with a master node and 10 worker nodes, each with a 100 GB persistent disk. The preemptible workers are dynamically added and removed. What is the most likely cause and the best long-term solution?

Question 30 (easy, multiple choice)

A company uses BigQuery for real-time analytics. They stream data from IoT devices into a BigQuery table. After a few hours, some of the recent data becomes visible in the table although it was streamed less than 10 minutes ago. The data team confirms that no one ran any manual queries. What is the most likely reason for the data visibility?

Question 31 (medium, multi select)

A company uses Cloud Composer to orchestrate data pipelines. They have a DAG that runs hourly and processes files from Cloud Storage. The DAG is triggered by a Pub/Sub message sent from a Cloud Storage bucket notification. Recently, some DAG runs are not starting even though the Pub/Sub messages are published. Which two likely causes should the team investigate? (Choose TWO.)

Question 32 (hard, multiple choice)

A company wants to replicate a Cloud SQL (PostgreSQL) database to BigQuery in near real-time for analytics. The volume is about 10 GB per day with frequent updates and deletes. They need to capture changes with low latency and ensure exactly-once delivery to BigQuery. Which approach should they use?

Question 33 (easy, multiple choice)

A company stores raw data files in Cloud Storage in a bucket named 'raw-data'. After processing, the files are moved to a 'processed' bucket. To reduce costs, they want to automatically delete raw data older than 30 days. What should they do?

Question 34 (medium, multiple choice)

A company needs to grant analysts access to a BigQuery table that contains sensitive PII columns. The analysts should be able to run aggregate queries on the entire dataset but must not see individual PII values. Which approach should the team use?

Question 35 (hard, multiple choice)

A Dataflow streaming pipeline processes events from Pub/Sub and writes to BigQuery using a dynamically generated table destination based on the event type. The pipeline is experiencing high latency, and the worker CPU utilization is low. Which action is most likely to reduce latency?

Question 36 (medium, multi select)

A Dataflow streaming job is processing data from Pub/Sub and writing to BigQuery. The job is stuck with the message 'No progress has been made' for several minutes. Which TWO actions should the team take to troubleshoot and resolve the issue? (Choose TWO.)

Question 37 (hard, multi select)

A company is migrating their on-premises Apache Spark jobs to Google Cloud Dataproc. They want to minimize operational overhead and cost for jobs that run only a few times per day. Which TWO strategies should they adopt? (Choose TWO.)

Question 38 (easy, multi select)

Which THREE Google Cloud services are considered fully managed serverless data processing services? (Choose THREE.)

Question 39 (medium, multiple choice)

Refer to the exhibit. A BigQuery dataset has the IAM policy shown in the exhibit. An analyst is trying to run a SELECT query on a table in this dataset but receives an 'Access Denied' error. What is the most likely reason?

Exhibit

{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "user:analyst@example.com"
      ]
    },
    {
      "role": "roles/bigquery.metadataViewer",
      "members": [
        "user:analyst@example.com"
      ]
    }
  ],
  "etag": "BwXX2Yz7k0Q="
}
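
When reading a policy like the one in the exhibit, it can help to list the roles bound to a single member. The sketch below does that in plain Python; the function name is an assumption, and the sample policy mirrors the bindings in the exhibit.

```python
from typing import List

POLICY = {
    "bindings": [
        {"role": "roles/bigquery.dataViewer",
         "members": ["user:analyst@example.com"]},
        {"role": "roles/bigquery.metadataViewer",
         "members": ["user:analyst@example.com"]},
    ],
    "etag": "BwXX2Yz7k0Q=",
}


def roles_for(policy: dict, member: str) -> List[str]:
    """Return the sorted list of roles that the policy binds to the member."""
    return sorted(
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    )


if __name__ == "__main__":
    for role in roles_for(POLICY, "user:analyst@example.com"):
        print(role)
```

The output lists only what this dataset-level policy grants; roles granted at the project level would not appear here.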

Question 40 (easy, multiple choice)

A company is ingesting real-time sensor data from thousands of devices into Cloud Pub/Sub. They need to process this data with low latency (seconds) and exactly-once semantics. Which data processing service should they use?

Question 41 (medium, multiple choice)

A Dataflow pipeline is processing a high-volume data stream. The job is lagging behind by 30 minutes, and the Dataflow monitoring UI shows high system latency with low CPU utilization. Which action should be taken to improve throughput?

Question 42 (hard, multiple choice)

A company processes large volumes of GPS sensor data stored in Cloud Storage. Each hour, they run an Apache Spark job that aggregates the data by geohash region. The job must be cost-effective and scale automatically. Currently, they are using a Dataproc cluster with preemptible workers. Which improvement would best reduce costs while maintaining performance?

Question 43 (easy, multiple choice)

You are designing a streaming Dataflow pipeline that reads from Cloud Pub/Sub. Some data may arrive late due to network delays. You need to ensure that late-arriving data is still processed, but after a certain point, it should be discarded to avoid unbounded state. What is the best practice?

Question 44 (medium, multiple choice)

A team needs to orchestrate a complex ETL workflow that includes conditional branching (if new data arrives, run transformation A, else run transformation B), error handling, and coordination across multiple services. Which service should they use?

Question 45 (hard, multiple choice)

You are optimizing a BigQuery query that runs on a large table (hundreds of TB). The table is partitioned by date and frequently queried with filters on a specific customer_id column and date range. Queries are slow even after partitioning. Which optimization should you apply?

Question 46 (easy, multiple choice)

You are monitoring a Dataflow streaming job and need to track the freshness of data being processed. What metric should you alert on?

Question 47 (medium, multiple choice)

A streaming Dataflow pipeline ingests events from Cloud Pub/Sub and writes to BigQuery. The event schema evolves occasionally (new columns added). The pipeline fails when new columns appear. What is the best long-term solution?

Question 48 (hard, multiple choice)

You are designing a disaster recovery strategy for a critical streaming data processing pipeline. The pipeline reads from Cloud Pub/Sub, processes with Dataflow streaming, and writes to BigQuery. The required RPO is less than 1 minute, and RTO is less than 5 minutes. Which architecture should you implement?

Question 49 (medium, multi select)

Which TWO security best practices should be applied to secure data in transit for a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery? (Choose 2)

Question 50 (hard, multi select)

A Dataflow batch job frequently fails with 'OutOfMemoryError'. Which THREE are common causes? (Choose 3)

Question 51 (easy, multi select)

Which TWO options can help reduce costs for a Dataflow batch pipeline that processes 100 GB of data daily from Cloud Storage? (Choose 2)

Question 52 (easy, multiple choice)

Your company wants to analyze real-time user clickstream data from a website. The data arrives as JSON messages via an HTTP endpoint. The pipeline should be able to handle spikes in traffic, provide low-latency insights, and store the raw data in a data lake for historical analysis. Which Google Cloud service should you use to ingest and process the streaming data?

Question 53 (medium, multiple choice)

Your Dataflow streaming pipeline is reading from Cloud Pub/Sub and writing to BigQuery. Users report occasional data duplication in the BigQuery table. You verify the pipeline uses exactly-once processing and idempotent writes. The Dataflow monitoring shows no errors, but the pipeline has occasional worker restarts. What is the most likely cause of the duplicates?

Question 54 (hard, multiple choice)

A company is designing a data lake on Google Cloud. The data lake will store raw, curated, and analytics-ready data. Security requirements include: data must be encrypted at rest and in transit, access must be controlled based on data sensitivity (public, internal, confidential), and all access to sensitive data must be audited. The company also wants to minimize data transfer costs for frequently accessed curated datasets. Which combination of services and configurations best meets these requirements?

Question 55 (easy, multiple choice)

You are operating a streaming data pipeline that uses Cloud Pub/Sub and Dataflow. The data source sometimes emits events that are delayed by several minutes due to network issues. Your pipeline must produce accurate aggregations (e.g., counts per minute) even for late data, but you also need to avoid waiting for a long time before emitting results. Which approach should you use?
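
To pin down what "late" means in this scenario, the sketch below assigns an event to its one-minute window and measures how far a watermark has already moved past that window's end. It is plain Python rather than Beam code, and the function names and epoch-second timestamps are assumptions for illustration.

```python
from typing import Tuple

WINDOW_SIZE_S = 60  # counts per minute, from the question


def minute_window(event_ts: int) -> Tuple[int, int]:
    """Return the [start, end) one-minute window containing event_ts."""
    start = event_ts - (event_ts % WINDOW_SIZE_S)
    return start, start + WINDOW_SIZE_S


def lateness_seconds(event_ts: int, watermark_ts: int) -> int:
    """Seconds by which the watermark has passed the event's window end (0 if on time)."""
    _, end = minute_window(event_ts)
    return max(0, watermark_ts - end)


if __name__ == "__main__":
    # An event stamped at t=125 s arrives when the watermark is at t=480 s.
    print(minute_window(125), lateness_seconds(125, 480))
```

An event delayed by several minutes, as described in the question, therefore arrives after its window would normally have closed.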

Question 56 (medium, multiple choice)

Your team runs a weekly batch ETL pipeline using Cloud Dataproc. The pipeline reads raw data from Cloud Storage, transforms it with Apache Spark, and writes results to BigQuery. Recently, the pipeline has been failing with the error 'Out of Memory' during the shuffle phase. The cluster uses standard worker nodes (n1-standard-4). What is the most effective way to resolve this without increasing total cost?

Question 57 (hard, multiple choice)

A company is migrating their on-premises Hadoop cluster to Google Cloud. The existing cluster runs HDFS, Hive, and Spark jobs. The migration must minimize changes to existing job code and configuration. The data volume is 50 TB and growing. The team expects to run both batch and interactive SQL queries. Which architecture should they use?

Question 58 (easy, multiple choice)

Your team needs to store time-series data from millions of IoT devices. Each device sends a reading every 5 minutes, and the total data volume is about 2 TB per month. The most common query pattern is retrieving all readings for a specific device over a time range (e.g., last 24 hours). Which storage service should you choose?

Question 59 (medium, multiple choice)

Your Dataflow streaming pipeline is processing financial transactions and writing results to BigQuery. You need to monitor the pipeline for data freshness (end-to-end latency) and alert if it exceeds 5 minutes. The pipeline uses fixed windows of 1 minute. Which metrics should you use for alerting?

Question 60 (hard, multiple choice)

You are designing a streaming data pipeline that must guarantee exactly-once processing semantics for financial transactions. The pipeline reads from Cloud Pub/Sub and writes to Cloud Bigtable. Each transaction has a unique transaction ID. Which features do you need to implement to ensure exactly-once semantics end-to-end?

Question 61 (medium, multi select)

You are optimizing a Dataflow pipeline that performs a group-by-key transformation on a large, skewed dataset. The pipeline is experiencing high latency due to data skew (some keys have many more values). Which TWO actions can help mitigate the skew? (Choose two.)

Question 62 (hard, multi select)

Your company is building a data processing system that ingests sensor data from millions of devices, processes it in near real-time to detect anomalies, and stores raw and processed data for long-term analytics. The system must meet a 99.9% uptime SLA and minimize data loss. Which THREE design choices are best? (Choose three.)

Question 63 (easy, multi select)

Your company is evaluating managed messaging services for a new event-driven application. The application requires pub/sub semantics, high throughput (millions of messages per second), and integration with Google Cloud services like Cloud Functions and Dataflow. Which TWO services should you consider? (Choose two.)

Question 64 (medium, multiple choice)

A company runs a Dataflow pipeline that reads from Pub/Sub, aggregates events in a 10-minute fixed window, and writes to BigQuery. Recently, the pipeline has been failing with 'high uncommitted bytes' errors during periods of high traffic. What is the most likely cause and recommended action?

Question 65 (easy, multiple choice)

A data engineer needs to process large CSV files (hundreds of GB) stored in Cloud Storage using Spark on a Dataproc cluster. The job performs a series of transformations and aggregations. Which configuration is most cost-effective and operationally efficient?

Question 66 (hard, multiple choice)

A company uses Cloud Composer (Airflow) to orchestrate a data pipeline. One DAG has many tasks that run in parallel and dependencies that span multiple days. Recently, the DAG started failing with 'DagRun already exists' errors. What is the most likely cause?

Question 67 (medium, multiple choice)

A team wants to ingest streaming data from millions of IoT devices and store historical data in BigQuery for analysis. They need near real-time analytics on the most recent data, with sub-second latency. Which architecture should they use?

Question 68 (easy, multiple choice)

A data engineer needs to design a batch processing pipeline using Cloud Data Fusion. The pipeline should read data from Cloud Storage, perform transformations (join, filter, aggregate), and write to BigQuery. What is the most efficient way to handle the transformations?

Question 69 (hard, multiple choice)

A company's Dataflow pipeline uses the PubSubIO source to read messages and writes to BigQuery via the BigQueryIO sink. The pipeline is running in streaming mode with exactly-once semantics enabled. Occasionally, duplicate rows appear in BigQuery. What is the most likely reason?

Question 70 (medium, multiple choice)

A retail company is building a recommendation engine that requires processing customer clickstream data in near real-time. The data is ingested via Pub/Sub, and must be joined with a lookup table of product details (updated daily) before being used for model inference. Which design pattern should they use?

Question 71 (easy, multiple choice)

A team is setting up a Dataflow pipeline for a time-sensitive ETL job that must complete within a specific time window. Which monitoring metric should they use to determine if the pipeline is on track to finish on time?

Question 72 (hard, multiple choice)

An organization uses Cloud Dataproc to run Spark jobs that process sensitive data. They need to ensure data is encrypted at rest and that only specific service accounts can access the data on cluster disks. What should they do?

Question 73 (medium, multi select)

Which TWO actions should be taken to optimize a Dataflow streaming pipeline that is experiencing high system lag and backpressure? (Choose two.)

Question 74 medium multi select

Which THREE features of Cloud Pub/Sub guarantee at-least-once delivery and enable exactly-once processing downstream? (Choose three.)

Question 75 hard multi select

Which THREE practices are recommended when designing a Cloud Data Fusion pipeline to ensure efficient execution and monitoring? (Choose three.)

Question 76 medium multiple choice

A Dataflow batch job fails consistently with the error shown. The job uses a custom container image and runs in a VPC with a private IP. What should the engineer do to resolve the issue?

Exhibit

Refer to the exhibit.

```
# error log from Dataflow job
Worker failed to start: Operation timed out after 30.0 seconds.
Possible causes:
- Insufficient CPU quota in the region.
- Networking issues preventing VM creation.
- Stale custom image.
- gRPC connection failure to the Dataflow service.
```

Question 77 hard multiple choice

Based on the exhibit, what is the most likely cause of duplicate rows despite using the same event_id as insertId?

Exhibit

Refer to the exhibit.

```
# BigQuery table schema and sample data
Table: mydataset.events
Columns:
  event_id: STRING (REQUIRED)
  event_timestamp: TIMESTAMP (REQUIRED)
  event_data: STRING (NULLABLE)
  user_id: STRING (REQUIRED)
Partitioned by: event_timestamp (daily)
Clustered by: user_id

Job: Dataflow pipeline writing 1000 events/second to this table using streaming inserts with insertId = event_id.

Monitoring shows intermittent 'duplicate rows' in queries that count distinct event_ids.
```
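
For context, BigQuery documents insertId de-duplication on streaming inserts as best effort rather than guaranteed, so confirming which event_id values are actually duplicated is a common first step. The following is a minimal sketch, assuming the rows have already been fetched into Python dictionaries; the sample rows and the helper name `duplicate_event_ids` are hypothetical and not part of the exhibit.

```python
from collections import Counter


def duplicate_event_ids(rows):
    """Return {event_id: occurrences} for event_ids seen more than once."""
    counts = Counter(row["event_id"] for row in rows)
    return {event_id: n for event_id, n in counts.items() if n > 1}


# Hypothetical rows shaped like mydataset.events (all values are made up).
sample_rows = [
    {"event_id": "evt-1", "event_timestamp": "2024-01-01T00:00:00Z", "user_id": "u1"},
    {"event_id": "evt-2", "event_timestamp": "2024-01-01T00:00:01Z", "user_id": "u2"},
    {"event_id": "evt-1", "event_timestamp": "2024-01-01T00:00:00Z", "user_id": "u1"},
]

print(duplicate_event_ids(sample_rows))
```

The helper only reports duplicates; it does not say why they occurred.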

Question 78 easy multiple choice

A data engineer notices that Spark jobs on the Dataproc cluster shown often fail with executor lost errors. What is the most likely reason?

Exhibit

Refer to the exhibit.

```
# gcloud dataproc clusters describe output
clusterName: my-cluster
config:
  softwareConfig:
    imageVersion: '2.0-debian10'
  gceClusterConfig:
    zoneUri: projects/my-project/zones/us-central1-a
    internalIpOnly: false
  masterConfig:
    machineTypeUri: n1-standard-4
    numInstances: 1
  workerConfig:
    machineTypeUri: n1-standard-4
    numInstances: 10
    preemptibility: ON
  secondaryWorkerConfig:
    numInstances: 0
status:
  state: RUNNING
```

Question 79 easy multiple choice

Your company uses Cloud Dataflow to process streaming data from Pub/Sub. The pipeline occasionally fails with a 'worker terminated unexpectedly' error. What is the most likely cause of this error?

Question 80 medium multiple choice

A data pipeline uses Cloud Composer (Airflow) to orchestrate Dataproc jobs. Each job submits a Spark application that reads from BigQuery and writes to Cloud Storage. The pipeline runs nightly and takes 6 hours. Management wants to reduce costs. Which approach is most effective?

Question 81 hard multiple choice

You are designing a streaming pipeline using Cloud Dataflow with exactly-once semantics. The source is Pub/Sub and the sink is Cloud Bigtable. The pipeline must handle late data up to 10 minutes. You need to minimize cost while maintaining correctness. Which configuration should you use?

Question 82 easy multiple choice

Your team uses Cloud Dataproc to run a Spark ML training job. The job is failing with an error: 'Container killed by YARN for exceeding memory limits.' What should you do to fix this?

Question 83 medium multiple choice

A company has a Cloud Functions function that triggers on new files in Cloud Storage and writes a message to Pub/Sub for downstream processing. Recently, the function has been timing out after 60 seconds. The downstream processing is critical. What is the best solution?

Question 84 hard multiple choice

You are implementing a data pipeline that reads Parquet files from Cloud Storage, transforms the data with Cloud Dataflow, and writes to BigQuery. The pipeline runs on a batch schedule every hour. You notice that the Dataflow job takes 10 minutes, but the overall pipeline latency is 15 minutes due to file availability and scheduling. The business requires latency under 5 minutes. Which change should you make?

Question 85 easy multiple choice

Your Cloud Dataflow pipeline is failing with a 'Permission denied' error when writing to a BigQuery table. The error persists even though the service account has the bigquery.dataEditor role. What is the most likely missing permission?

Question 86 medium multiple choice

A company uses Cloud Composer (Airflow) to orchestrate a daily batch job that runs a custom Python script on a Compute Engine instance. The process is slow because the instance takes 2 minutes to boot. How can you reduce the total runtime?

Question 87 hard multiple choice

You are designing a data pipeline that must process sensitive customer data with strict access controls. The data is ingested via Cloud Pub/Sub, processed by Cloud Dataflow, and stored in BigQuery. The security team requires that data is encrypted at rest and in transit, and that access is limited to specific service accounts. Which implementation strategy meets all requirements?

Question 88 medium multi select

Which TWO are best practices for managing a Cloud Dataflow pipeline in production?

Question 89 hard multi select

Which THREE actions reduce the cost of a Cloud Composer environment?

Question 90 easy multi select

Which TWO are valid approaches to handle late-arriving data in a Cloud Dataflow streaming pipeline?
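
As a rough illustration of how allowed lateness interacts with the watermark, the sketch below classifies an element that belongs to a fixed window. It is a simplified model in plain Python, not the Apache Beam API: timestamps are integer seconds, and the function name and its parameters are hypothetical.

```python
def classify_element(event_ts, watermark, window_size, allowed_lateness):
    """Classify an element for a fixed window in a simplified watermark model.

    All arguments are integer seconds. The element's window is
    [window_start, window_end). The element is "on_time" while the watermark
    has not reached window_end, "late" while the watermark is still within
    allowed_lateness of window_end, and "dropped" after that.
    """
    window_start = event_ts - (event_ts % window_size)
    window_end = window_start + window_size
    if watermark < window_end:
        return "on_time"
    if watermark < window_end + allowed_lateness:
        return "late"
    return "dropped"


# Hypothetical values: 60-second windows with 10 minutes of allowed lateness.
for wm in (50, 120, 700):
    print(wm, classify_element(event_ts=30, watermark=wm,
                               window_size=60, allowed_lateness=600))
```

In this model, a larger allowed lateness keeps window state alive longer, which is the usual trade-off between completeness and resource use.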

Question 91 hard multiple choice

Your company runs a batch data processing pipeline using Cloud Dataproc and Cloud Composer. The pipeline processes hundreds of terabytes of data daily. Recently, the pipeline has been failing intermittently due to Dataproc cluster creation errors: 'Insufficient resources to create cluster in zone us-central1-f.' The project has a global quota of 1000 vCPUs for Compute Engine. The team usually uses n2-standard-8 (8 vCPU) worker nodes. You notice that the error occurs during peak usage times. You need to ensure the pipeline runs reliably without increasing the global quota. Which action should you take?

Question 92 medium multiple choice

A financial services firm uses Cloud Pub/Sub to ingest real-time market data. The data is processed by a Cloud Dataflow streaming pipeline that aggregates trades per symbol and writes to BigQuery. The pipeline currently uses a single global window with a trigger that fires every minute. The firm now needs to support late data up to 5 minutes and also wants to reduce the number of writes to BigQuery to avoid hitting the table limit of 1,500 inserts per second. Writing once per minute keeps the pipeline within that limit, but adding late data handling doubles the number of writes. How can you redesign the pipeline to handle late data while keeping write volume low?

Question 93 easy multiple choice

Your organization has a data lake on Cloud Storage with millions of small files (average 10 KB). You need to build a batch processing pipeline using Cloud Dataproc that runs a Spark job to transform the data and output results to BigQuery. The pipeline currently takes 4 hours to run because Spark spends a large amount of time listing files and managing tasks. You want to reduce the run time without changing the cluster size. Which action should you take?
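
To get a feel for the scale involved, the sketch below estimates how many objects Spark would need to list if the same data were stored in larger files. The 10 KB average comes from the question; the file count of 5 million and the 128 MB target size are assumptions chosen only for illustration, and the helper name is hypothetical.

```python
import math


def compacted_file_count(num_files, avg_file_kb, target_file_mb):
    """Estimate how many files remain after compacting to a target size."""
    total_kb = num_files * avg_file_kb
    target_kb = target_file_mb * 1024
    return math.ceil(total_kb / target_kb)


# Assumed inputs: 5 million files (the question only says "millions"),
# 10 KB average size (from the question), 128 MB target size (assumed).
print(compacted_file_count(5_000_000, 10, 128))
```

Under these assumptions the object count drops from millions to a few hundred, which is why per-file listing and task overhead dominates the original run time.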

Question 94 medium multiple choice

A company is building a real-time streaming pipeline to ingest clickstream events from web servers, enrich them with user profile data from Cloud Bigtable, and aggregate metrics into BigQuery. The expected throughput is 10,000 events per second with occasional spikes up to 50,000. The data must be processed with low latency (seconds) and exactly-once semantics. Which Google Cloud service should be the core processing engine?

Question 95 medium multi select

Your team is running a Dataflow streaming pipeline that reads from Pub/Sub, transforms data, and writes to BigQuery. You notice that the pipeline's backlog is growing and the processing latency has increased from seconds to minutes. You need to diagnose and resolve the issue. Which TWO actions should you take? (Choose two.)

Question 96 hard multiple choice

You are a data engineer at a financial services company. You manage a batch pipeline that processes daily trade settlement reports. The pipeline runs on Cloud Dataproc using PySpark jobs triggered by Cloud Composer (Airflow). Recent trades have increased by 3x, and the pipeline now frequently fails with 'OutOfMemoryError' in the executor logs. You have already increased the executor memory from 4g to 8g, but the problem persists. The cluster uses standard worker nodes (n1-standard-4) with 15 GB RAM per node. You need to make the pipeline stable and cost-efficient. What should you do?
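
The memory figures in this scenario can be checked with some quick arithmetic. The sketch below estimates how many executors fit on one worker. It assumes YARN is given roughly 80% of node RAM (an assumption for illustration, not a value stated in the question) and uses Spark's documented default executor memory overhead of 10% with a 384 MiB floor; the function name is hypothetical.

```python
def executors_per_node(node_ram_gb, executor_memory_gb,
                       yarn_share_pct=80, overhead_pct=10, min_overhead_mib=384):
    """Estimate how many Spark executors fit in one node's YARN memory.

    yarn_share_pct is an assumed share of node RAM available to YARN.
    overhead_pct and min_overhead_mib mirror Spark's default
    spark.executor.memoryOverhead rule (10%, at least 384 MiB).
    """
    node_mib = node_ram_gb * 1024
    yarn_mib = node_mib * yarn_share_pct // 100
    executor_mib = executor_memory_gb * 1024
    overhead_mib = max(min_overhead_mib, executor_mib * overhead_pct // 100)
    return yarn_mib // (executor_mib + overhead_mib)


# n1-standard-4 workers with 15 GB RAM, as described in the question.
print(executors_per_node(15, 4))  # executor memory before the change
print(executors_per_node(15, 8))  # executor memory after the change
```

Under these assumptions, raising executor memory from 4g to 8g halves the number of executors that fit on each worker, so total parallelism drops even though each executor has more headroom.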

Question 97 medium multiple choice

A retail company uses Cloud Dataflow for a streaming pipeline that aggregates sales events from thousands of stores. The pipeline writes aggregated results to BigQuery every 5 minutes. Recently, the Dataflow job has been restarting multiple times a day with the error 'Worker ran out of memory' in the logs. Streaming Engine is enabled. The pipeline uses keyed state (ParDo with stateful processing) to maintain per-store counters. The average event size is 2 KB, and the throughput is 2,000 events/sec. You need to resolve the out-of-memory issues without losing data. What should you do?

Question 98 hard multiple choice

A healthcare analytics company runs a nightly Dataproc workflow that reads radiology reports from Cloud Storage (CSV files), transforms them using PySpark, and writes results to BigQuery. The workflow is orchestrated by Cloud Composer. Recently, the job has started failing with 'Disk quota exceeded' errors on the worker nodes. The data volume has grown 5x over the past month. Currently, the cluster uses 5 n1-standard-4 workers, each with a 10 GB persistent disk. The PySpark jobs rely heavily on intermediate shuffles. You need a cost-effective solution that avoids future failures as data grows. What should you do?

Question 99 easy multiple choice

Your team is using Cloud Data Fusion to build batch ETL pipelines that load data from Cloud Storage into BigQuery. You have several pipelines that run daily. Recently, one pipeline started failing with a 'Permission denied' error when trying to read a new CSV file uploaded to a specific Cloud Storage bucket. Other pipelines using the same bucket succeed. The failing pipeline has a Cloud Storage source plugin that uses a service account with the roles/storage.objectViewer role. The bucket has uniform bucket-level access enabled. What is likely causing the issue?

Question 100 medium multiple choice

A gaming company uses Cloud Pub/Sub to ingest player activity events. A Dataflow streaming pipeline consumes these events, performs stateful processing to compute session metrics, and writes results to Cloud Bigtable for low-latency queries. Recently, the pipeline's processing latency increased, and the Bigtable write throughput dropped. Monitoring shows that the pipeline is experiencing a high rate of 'out-of-order' messages and 'duplicate' events. The Pub/Sub subscription is configured with exactly-once delivery. The Dataflow job uses a GlobalWindow with a trigger that fires every 10 seconds. What is the most likely cause and solution?

Question 101 easy multiple choice

A company needs to process streaming data from IoT devices with sub-second latency and exactly-once processing guarantees. Which Google Cloud service should they use?

Question 102 hard multiple choice

A Dataflow streaming pipeline that uses global windows and triggers every 5 seconds is experiencing increasing lag and high system latency. The pipeline reads from Pub/Sub, transforms data with a ParDo, and writes to BigQuery. Which action is most likely to reduce lag?

Question 103 medium multi select

Which THREE of the following are best practices when designing a Cloud Dataflow pipeline for batch processing? (Choose three.)

Question 104 hard multiple choice

A company runs a daily batch data processing pipeline using Cloud Dataproc. The pipeline reads 10 TB of CSV files from Cloud Storage, performs a heavy aggregation (GroupBy) and joins with a small reference table, then writes the results to BigQuery. The cluster consists of 20 n1-standard-8 nodes, including 10 preemptible workers for cost savings. Recently, the job completion time has doubled from 30 minutes to over an hour. The job logs show many tasks being retried, and the Shuffle spill ratio is high. No significant data volume change was observed. What is the most likely root cause?

Refer to the exhibit. Error log from Dataflow job:

"""
Workflow failed. Causes: S3D3: BigQueryIO.Write/BatchLoads/Loads/AllocateLoadTable/ParDo(AllocateLoadTable) failed.
org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write$BigQueryWriteException: BigQuery insertion failed: Response JSON:
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
      }
    ],
    "code": 400,
    "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
  }
}
"""

Refer to the exhibit. Cloud Pub/Sub subscription configuration: { "name": "projects/my-project/subscriptions/my-sub", "topic": "projects/my-project/topics/my-topic", "pushConfig": {}, "ackDeadlineSeconds": 10, "messageRetentionDuration": "86400s", "expirationPolicy": { "ttl": "604800s" }, "enableMessageOrdering": false, "retryPolicy": { "minimumBackoff": "10s", "maximumBackoff": "600s" }, "deadLetterPolicy": { "deadLetterTopic": "projects/my-project/topics/dead-letter-topic", "maxDeliveryAttempts": 5 } }

Refer to the exhibit.

```
# Dataflow pipeline log snippet
2024-03-15 10:00:00 ERROR Transform 'ParseLogs': org.apache.beam.sdk.util.WindowedValue$CoderLoadingException: Unable to load coder for class com.example.LogEvent
2024-03-15 10:00:01 ERROR Transform 'ParseLogs': java.lang.NoSuchMethodError: com.example.LogEvent: method <init>()V not found
```

Refer to the exhibit. { "bindings": [ { "role": "roles/bigquery.dataViewer", "members": [ "user:analyst@example.com" ] }, { "role": "roles/bigquery.metadataviewer", "members": [ "user:analyst@example.com" ] } ], "etag": "BwXX2Yz7k0Q=" }
