Building and operationalizing data processing systems Questions

1
Multi-Select · medium

A Dataflow streaming job is processing data from Pub/Sub and writing to BigQuery. The job is stuck with the message 'No progress has been made' for several minutes. Which TWO actions should the team take to troubleshoot and resolve the issue? (Choose TWO.)

Select 2 answers
A.Set the updateCompatibility flag to true and restart the pipeline.
B.Increase the persistent disk size for all workers to reduce I/O contention.
C.Examine the worker logs in Cloud Logging for any error messages or exceptions.
D.Force stop the pipeline and update it with a new version using the --update flag.
E.Enable Dataflow Streaming Engine to move state to the backend and reduce worker load.
Answers: C, E

C is correct because logs can reveal the root cause, such as out-of-memory errors or stuck transforms.

Why this answer

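Option C is correct because examining worker logs in Cloud Logging is the first step to identify the root cause of a stuck pipeline. Common issues like out-of-memory errors, serialization failures, or worker crashes are logged there, and without inspecting logs, troubleshooting is guesswork. Option E is correct because Streaming Engine moves shuffle and state storage off the worker VMs and into the Dataflow service backend, which relieves the worker CPU, memory, and disk pressure that can stall a streaming job.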

Exam trap

Google Cloud often tests the misconception that increasing resources (like disk size) or restarting the pipeline is the default fix, when in reality the first step is always to inspect logs to understand the failure mode.

2
Multi-Select · easy

Which TWO actions can reduce the cost of running a Dataproc cluster for a nightly batch job?

Select 2 answers
A.Increase the number of worker nodes for faster processing.
B.Use high-memory machine types for master node.
C.Use preemptible VMs for worker nodes.
D.Attach local SSDs to all nodes.
E.Delete the cluster after the job completes.
Answers: C, E

Preemptible VMs are much cheaper.

Why this answer

Preemptible VMs (Option C) are significantly cheaper than standard VMs because Compute Engine can terminate them at any time, making them ideal for fault-tolerant, stateless batch jobs like nightly data processing on Dataproc. Deleting the cluster after the job completes (Option E) eliminates ongoing compute costs for idle resources, which is a best practice for ephemeral workloads.

Exam trap

Google Cloud often tests the misconception that scaling up resources (more nodes or faster hardware) always reduces cost by shortening runtime, but in reality, the increased per-hour cost usually outweighs the time savings for batch jobs.
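The cost effect of options C and E can be sketched with simple arithmetic. This is a rough model only: the hourly rate, the preemptible discount, and the function name `nightly_cost` are hypothetical placeholders, not real Dataproc prices.

```python
def nightly_cost(workers, hourly_rate, job_hours,
                 preemptible_share=0.0, preemptible_discount=0.0,
                 delete_after_job=True):
    """Daily worker cost for one nightly run under a simple linear pricing model.

    All rates and discounts are hypothetical inputs, not published prices.
    """
    # A cluster that is not deleted keeps billing for the full day.
    billed_hours = job_hours if delete_after_job else 24
    # Blend the full rate with the discounted rate for the preemptible share.
    blended_rate = hourly_rate * (1 - preemptible_share * preemptible_discount)
    return workers * blended_rate * billed_hours


# Illustrative comparison with made-up numbers: 10 workers, 2-hour job.
always_on = nightly_cost(10, 1.0, 2, delete_after_job=False)
ephemeral = nightly_cost(10, 1.0, 2)
ephemeral_preemptible = nightly_cost(
    10, 1.0, 2, preemptible_share=0.5, preemptible_discount=0.6
)
print(always_on, ephemeral, ephemeral_preemptible)
```

Under these assumed inputs, deleting the cluster removes the idle hours and the preemptible share lowers the rate for the hours that remain.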

3
MCQ · hard

A company runs a daily batch data processing pipeline using Cloud Dataproc. The pipeline reads 10 TB of CSV files from Cloud Storage, performs a heavy aggregation (GroupBy) and joins with a small reference table, then writes the results to BigQuery. The cluster consists of 20 n1-standard-8 nodes, including 10 preemptible workers for cost savings. Recently, the job completion time has doubled from 30 minutes to over an hour. The job logs show many tasks being retried, and the Shuffle spill ratio is high. No significant data volume change was observed. What is the most likely root cause?

A.The cluster's HDFS is running out of space due to intermediate shuffle data.
B.Data skew has developed, causing a few tasks to process most of the data.
C.Preemptible workers are being reclaimed, causing YARN container failures and task retries.
D.The reference table has increased in size, causing more data to be broadcast to all workers.
Answer: C

Preemptible nodes can be taken at any time; Shuffle-heavy jobs suffer greatly from lost intermediate data.

Why this answer

The correct answer is C because Compute Engine can reclaim preemptible workers at any time, which kills their YARN containers and forces the affected tasks to be retried. Shuffle output held on a reclaimed worker is lost and must be recomputed, which lengthens the job and concentrates shuffle work on the remaining workers. The doubling of job time without data volume change strongly points to infrastructure instability rather than data or configuration issues.

Exam trap

The trap here is that candidates may attribute high shuffle spill and task retries to data skew or HDFS space, but the key clue is the unchanged data volume and the use of preemptible workers, which directly cause container failures and retries.

How to eliminate wrong answers

Option A is wrong because Spark writes intermediate shuffle data to the local disks of the worker nodes, not to HDFS, so HDFS capacity is not the bottleneck for shuffle-heavy stages. Option B is wrong because data skew would cause a few tasks to process most data, but the symptom of many tasks being retried and high shuffle spill ratio is more consistent with container failures, not skew; skew typically manifests as a few long-running tasks, not widespread retries. Option D is wrong because the reference table is described as small, and even if it increased, broadcasting more data would not cause task retries or high shuffle spill; it would instead increase memory pressure on executors, not trigger widespread failures.

4
MCQ · easy

A company uses Cloud Dataflow to process streaming data from Pub/Sub into BigQuery. The pipeline uses a side input from a Cloud Bigtable table containing user profile information to enrich the events. The side input is updated every hour. Which approach should the company use to ensure that the pipeline uses the latest profile data without causing high memory usage?

A.Use a side input that is periodically refreshed by reading the Cloud Bigtable table at a regular interval.
B.For each incoming event, read the corresponding profile from Cloud Bigtable using a synchronous call.
C.Use a CoGroupByKey transform to join the stream with a bounded PCollection created from the Cloud Bigtable table.
D.Stream the profile updates into a separate BigQuery table and use a BigQuery streaming query to join in real-time.
Answer: A

A is correct because side inputs with periodic refreshes provide a fresh snapshot of the reference data without high memory overhead.

Why this answer

Option A is correct because Cloud Dataflow supports periodically refreshing side inputs by reading from an external source like Cloud Bigtable at a specified interval. This approach keeps the profile data up-to-date without storing the entire side input in memory for the lifetime of the pipeline; instead, the side input is rebuilt and cached only when refreshed, controlling memory usage.

Exam trap

Google Cloud often tests the misconception that side inputs are static and cannot be updated, leading candidates to choose per-element lookups (Option B) or complex joins (Option C), when in fact Dataflow's side input refresh mechanism is the correct, efficient solution for periodically updated reference data.

How to eliminate wrong answers

Option B is wrong because making a synchronous call to Cloud Bigtable for every incoming event would introduce high latency and potentially overwhelm Bigtable with thousands of read requests per second, leading to performance degradation and increased cost. Option C is wrong because a bounded PCollection created from the Cloud Bigtable table is read once when the pipeline starts, so the join would never see the hourly profile updates; CoGroupByKey also requires both inputs to share compatible windowing, which complicates joining an unbounded Pub/Sub stream against a one-time snapshot. Option D is wrong because streaming profile updates into a separate BigQuery table and using a streaming query to join in real-time would add unnecessary complexity and latency, and BigQuery is not designed for high-frequency per-event joins in a streaming pipeline.
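The refresh idea behind option A can be illustrated with a plain-Python analogy. This is not the Beam or Dataflow API: `RefreshingLookup` and `load_snapshot` are hypothetical names, and a real pipeline would express the same pattern as a periodically rebuilt side input.

```python
import time


class RefreshingLookup:
    """Holds one snapshot of reference data and reloads it after a fixed interval."""

    def __init__(self, load_snapshot, refresh_seconds, clock=time.monotonic):
        self._load_snapshot = load_snapshot    # callable returning a dict-like snapshot
        self._refresh_seconds = refresh_seconds
        self._clock = clock                    # injectable so the refresh can be tested
        self._snapshot = None
        self._loaded_at = None

    def get(self, key, default=None):
        now = self._clock()
        stale = (
            self._loaded_at is None
            or now - self._loaded_at >= self._refresh_seconds
        )
        if stale:
            # One bulk read per interval instead of one read per event.
            self._snapshot = dict(self._load_snapshot())
            self._loaded_at = now
        return self._snapshot.get(key, default)
```

Each event is enriched from the in-memory snapshot, and the backing store is read only once per refresh interval, which is the trade-off the question is pointing at.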

5
MCQ · medium

A team wants to ingest streaming data from millions of IoT devices and store historical data in BigQuery for analysis. They need near real-time analytics on the most recent data, with sub-second latency. Which architecture should they use?

A.Use Pub/Sub to receive data, then stream directly into BigQuery using the streaming API, and use standard SQL queries for real-time analytics.
B.Use Pub/Sub, then a Dataflow pipeline that filters and transforms data, writing to Cloud Bigtable for real-time queries and to Cloud Storage for periodic BigQuery loads.
C.Use Pub/Sub to ingest data into a Dataproc Spark Streaming job that writes to both Bigtable and BigQuery.
D.Use Cloud SQL to store the latest data and periodically move historical data to BigQuery via cron jobs.
Answer: B

Bigtable provides single-digit millisecond latency for real-time queries, and BigQuery handles large-scale analytics.

Why this answer

Option B is correct because it uses Cloud Bigtable for sub-second latency on recent data, which is ideal for near real-time analytics on streaming IoT data. Dataflow provides the necessary stream processing, filtering, and transformation before writing to Bigtable for low-latency queries and to Cloud Storage for periodic batch loads into BigQuery for historical analysis. This architecture decouples real-time and historical paths, meeting both latency and storage requirements.

Exam trap

Google Cloud often tests the misconception that BigQuery's streaming API can provide sub-second query latency, but in reality, BigQuery is a columnar analytics engine optimized for large scans, not for low-latency point reads, which is why a separate low-latency store like Bigtable is required for real-time access.

How to eliminate wrong answers

Option A is wrong because streaming directly into BigQuery via the streaming API does not guarantee sub-second latency for queries; BigQuery is optimized for analytical queries on large datasets, not for real-time point lookups or low-latency access to the most recent data. Option C is wrong because Dataproc Spark Streaming adds unnecessary operational overhead and latency compared to a managed service like Dataflow, and writing directly to both Bigtable and BigQuery from Spark can cause contention and complexity without the built-in exactly-once semantics and auto-scaling of Dataflow. Option D is wrong because Cloud SQL is not designed for high-throughput streaming ingestion from millions of devices and cannot handle the scale; also, periodic cron jobs to move data to BigQuery introduce latency that violates the sub-second requirement for near real-time analytics.

6
Matching · medium

Match each data pipeline term to its definition.

Concepts
Matches

Extract, Transform, Load

Extract, Load, Transform

Raw data storage in native format

Optimized storage for structured analytics

Why these pairings

Common data pipeline concepts and their meanings.

7
MCQ · medium

A company has a Cloud Functions function that triggers on new files in Cloud Storage and writes a message to Pub/Sub for downstream processing. Recently, the function has been timing out after 60 seconds. The downstream processing is critical. What is the best solution?

A.Replace Cloud Functions with a Cloud Run job that has longer timeout
B.Increase the function memory to 2 GB to speed up execution
C.Reduce the function timeout to 30 seconds to force faster execution
D.Increase function timeout to 540 seconds and delegate heavy processing to Cloud Dataflow
Answer: D

This addresses both timeout and heavy processing.

Why this answer

Option D is correct because event-driven Cloud Functions, such as this Cloud Storage-triggered function, can run for up to 540 seconds (9 minutes), so raising the timeout from 60 seconds lets the function complete its work. Delegating heavy processing to Cloud Dataflow offloads the computationally intensive tasks, preventing future timeouts and ensuring scalable, reliable downstream processing for critical workloads.

Exam trap

Google Cloud often tests the misconception that increasing memory or reducing timeout directly solves performance issues, but the real solution is to extend the timeout and delegate heavy processing to a scalable service like Dataflow.

How to eliminate wrong answers

Option A is wrong because Cloud Run jobs are designed for batch workloads that run to completion, not for event-driven triggers like Cloud Storage; replacing Cloud Functions with a Cloud Run job would require a different invocation pattern and does not directly solve the timeout issue. Option B is wrong because increasing memory may improve performance for memory-bound tasks but does not guarantee faster execution for I/O-bound or CPU-bound operations, and the function still has a 60-second timeout limit. Option C is wrong because reducing the timeout to 30 seconds would force the function to fail even faster, making the timeout problem worse and potentially losing critical messages.

8
MCQ · medium

Refer to the exhibit. A Dataflow streaming pipeline subscribes to this Pub/Sub subscription. The pipeline occasionally takes more than 10 seconds to process a message. Which behavior will occur?

A.The message will be sent to the dead letter topic immediately.
B.The message will be retried with exponential backoff as per retry policy.
C.The message will be redelivered after 10 seconds if not acknowledged.
D.The message will be dropped after 10 seconds due to expiration policy.
Answer: C

The ack deadline is 10 seconds; if processing exceeds that, Pub/Sub redelivers the message.

Why this answer

Option C is correct because Pub/Sub delivery requires an acknowledgment within the configurable `ackDeadlineSeconds` (default 10 seconds). If the pipeline takes longer than the ack deadline to process a message, Pub/Sub considers the message unacknowledged and redelivers it. This is the standard behavior for at-least-once delivery in Google Cloud Pub/Sub.

Exam trap

Google Cloud often tests the distinction between ack deadline expiration and dead letter topics, trapping candidates who assume any processing delay immediately triggers a dead letter or that Pub/Sub applies exponential backoff by default.

How to eliminate wrong answers

Option A is wrong because a dead letter topic is only triggered after a message has been retried the maximum number of times (configurable via `maxDeliveryAttempts`), not immediately upon exceeding the ack deadline. Option B is wrong because Pub/Sub applies exponential backoff only when the subscription is configured with a retry policy that sets `minimumBackoff` and `maximumBackoff`; with the default immediate-redelivery policy, the message becomes eligible for redelivery as soon as the ack deadline expires. Option D is wrong because the expiration policy controls when an inactive subscription is deleted, and `messageRetentionDuration` controls how long unacknowledged messages are retained (7 days by default); neither drops a message after 10 seconds.
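The redelivery behavior can be sketched with a toy timeline. This is a simplified model, not the Pub/Sub client library: it assumes the subscriber never extends the ack deadline, redelivery is immediate once the deadline expires, and the eventual late ack is accepted. The function name `delivery_times` is hypothetical.

```python
def delivery_times(processing_seconds, ack_deadline_seconds=10):
    """Return the times (in seconds) at which one message is delivered.

    The message is first delivered at t=0 and is redelivered every time the
    ack deadline expires before the subscriber has finished processing it.
    """
    if ack_deadline_seconds <= 0:
        raise ValueError("ack_deadline_seconds must be positive")
    times = []
    t = 0
    while True:
        times.append(t)
        t += ack_deadline_seconds
        # Stop once the ack (sent at processing_seconds) lands before the
        # next deadline would expire.
        if t >= processing_seconds:
            return times
```

In this model a message that takes 25 seconds to process is delivered three times with a 10-second deadline, while one that takes 8 seconds is delivered once, which is why slow processing shows up as duplicates rather than as lost messages.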

9
Matching · medium

Match each data lifecycle stage to its description.

Concepts
Matches

Collecting data from various sources

Persisting data in a durable system

Transforming and analyzing data

Making data available for consumption

Moving data to long-term, low-cost storage

Why these pairings

Common stages in a data lifecycle.

10
MCQ · easy

You are monitoring a Dataflow streaming job and need to track the freshness of data being processed. What metric should you alert on?

A.Output throughput (elements/sec)
B.Error count
C.Data freshness (seconds)
D.CPU utilization
Answer: C

Data freshness measures how far the pipeline's output lags behind real time, which directly indicates pipeline delay.

Why this answer

Data freshness (seconds) is the correct metric to alert on because it directly measures the lag between when an event occurs and when it is processed by the Dataflow pipeline. Dataflow derives it from the gap between the current time and the output watermark and exposes it in Cloud Monitoring as `job/data_watermark_age`. It is distinct from system lag (`job/system_lag`), which tracks how long an element has been waiting to be processed. Alerting on data freshness ensures that the pipeline is meeting service-level agreements (SLAs) for real-time or near-real-time processing.

Exam trap

Google Cloud often tests the distinction between throughput and latency metrics, and the trap here is that candidates confuse high throughput with low latency, not realizing that a pipeline can process many elements per second while still having stale data due to watermark delays or unprocessed late data.

How to eliminate wrong answers

Option A is wrong because output throughput (elements/sec) measures processing rate, not timeliness; a pipeline can have high throughput but still be processing stale data due to backlog or watermark delays. Option B is wrong because error count tracks failures (e.g., exceptions, dropped elements) but does not indicate how current the processed data is; a pipeline with zero errors could still have high latency. Option D is wrong because CPU utilization is a resource metric that reflects compute efficiency, not data freshness; high CPU might cause delays, but it is an indirect indicator and not the direct measure of data staleness.
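As a minimal sketch of what the metric expresses, freshness can be computed as the gap between the current time and the output watermark. The function names and the alert threshold below are illustrative assumptions, not part of any Dataflow API.

```python
from datetime import datetime, timedelta, timezone


def data_freshness_seconds(now, output_watermark):
    """Seconds between wall-clock time and the pipeline's output watermark."""
    return (now - output_watermark).total_seconds()


def should_alert(freshness_seconds, threshold_seconds):
    """True when the output is staler than the (hypothetical) SLA threshold."""
    return freshness_seconds > threshold_seconds


# A pipeline can have high throughput and still be five minutes behind.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
watermark = now - timedelta(minutes=5)
freshness = data_freshness_seconds(now, watermark)
print(freshness, should_alert(freshness, threshold_seconds=120))
```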

11
MCQ · hard

A company is migrating their on-premises Hadoop cluster to Google Cloud. The existing cluster runs HDFS, Hive, and Spark jobs. The migration must minimize changes to existing job code and configuration. The data volume is 50 TB and growing. The team expects to run both batch and interactive SQL queries. Which architecture should they use?

A.Keep HDFS on persistent Cloud Dataproc clusters and use BigQuery for SQL queries.
B.Use Cloud Dataflow for all batch processing and BigQuery for storage and querying.
C.Migrate HDFS to Cloud Storage, create a Cloud Dataproc cluster for Spark jobs, and use BigQuery for interactive SQL queries via a Hive metastore linked to BigQuery.
D.Use Cloud Dataproc with ephemeral clusters and Cloud Storage (instead of HDFS) for data storage. Run Spark jobs directly, and use Cloud Dataproc's built-in Hive on Cloud Dataproc for SQL queries.
Answer: D

Cloud Dataproc can use Cloud Storage as the data layer; most Spark and Hive jobs need minimal changes (e.g., file path prefix). Ephemeral clusters reduce cost. This preserves existing code.

Why this answer

Option D is correct because Cloud Storage, accessed through the Cloud Storage connector, serves as an HDFS-compatible storage layer, so existing Spark and Hive jobs run with little more than a path change from `hdfs://` to `gs://`. Ephemeral Dataproc clusters reduce costs, and Hive on Dataproc serves the SQL queries, meeting both batch and interactive requirements with minimal changes to job code and configuration.

Exam trap

Google Cloud often tests the misconception that BigQuery must be used for all SQL queries in a migration, ignoring that Dataproc's Hive can directly query data in Cloud Storage without code changes, making it a simpler path for interactive SQL on existing Hive workloads.

How to eliminate wrong answers

Option A is wrong because keeping HDFS on persistent Dataproc clusters does not leverage Cloud Storage's scalability and cost benefits, and using BigQuery for SQL queries would require significant code changes to redirect queries away from Hive. Option B is wrong because Cloud Dataflow is not designed for Spark job compatibility, and using BigQuery for storage would break existing HDFS-based job code and configurations. Option C is wrong because linking a Hive metastore to BigQuery requires modifying the Hive configuration and does not support running Spark jobs directly on BigQuery storage without additional connectors, increasing complexity and potential code changes.
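The "path prefix" change mentioned for option D can be sketched as a small URI rewrite. The helper name `hdfs_to_gcs` and the bucket name are hypothetical, and the sketch assumes the directory layout is copied unchanged into the bucket.

```python
from urllib.parse import urlparse


def hdfs_to_gcs(uri, bucket):
    """Rewrite an hdfs:// URI to the equivalent gs:// URI in the given bucket."""
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError("expected an hdfs:// URI, got %r" % uri)
    # Drop the NameNode host and port; keep the directory layout as-is.
    return "gs://%s%s" % (bucket, parsed.path)


print(hdfs_to_gcs("hdfs://namenode:8020/warehouse/events/2024", "example-bucket"))
```

Job logic stays the same; only the input and output locations passed to the Spark or Hive job change.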

12
Multi-Select · medium

You are optimizing a Dataflow pipeline that performs a group-by-key transformation on a large, skewed dataset. The pipeline is experiencing high latency due to data skew (some keys have many more values). Which TWO actions can help mitigate the skew? (Choose two.)

Select 2 answers
A.Use hot key detection and split the hot key into multiple sub-keys (e.g., append a random number).
B.Enable the Dataflow service's automatic reshuffling feature.
C.Use CoGroupByKey to reduce the number of keys.
D.Increase the number of worker machines.
E.Use Combine.perKey with a combiner to aggregate values locally before shuffling.
Answers: A, E

Splitting a hot key distributes its values across multiple workers, reducing bottleneck.

Why this answer

Option A is correct because splitting a hot key into multiple sub-keys (e.g., by appending a random number) distributes the values across multiple shards during the shuffle phase, reducing the load on any single worker. This technique, often called "salting," is a standard pattern in Dataflow and Apache Beam to handle data skew by breaking the bottleneck caused by a single key with disproportionately many values.

Exam trap

Google Cloud often tests the misconception that simply adding more workers (Option D) or enabling automatic reshuffling (Option B) can fix data skew, when in fact these actions do not address the root cause of a single key being processed by one shard.
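Key salting can be sketched in plain Python as a two-stage aggregation. This is an illustration of the technique, not Beam code: `salted_sum` is a hypothetical helper, and the approach only applies to aggregations that are associative and commutative, such as sums and counts.

```python
import random
from collections import defaultdict


def salted_sum(pairs, num_salts=4, seed=0):
    """Sum values per key in two stages, spreading each key over num_salts sub-keys."""
    rng = random.Random(seed)

    # Stage 1: aggregate per (key, salt), so a hot key is split across sub-keys
    # that different workers could process in parallel.
    partials = defaultdict(int)
    for key, value in pairs:
        salt = rng.randrange(num_salts)
        partials[(key, salt)] += value

    # Stage 2: drop the salt and merge the much smaller partial results.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partials.items():
        totals[key] += subtotal
    return dict(partials), dict(totals)
```

The first stage does the heavy lifting on many small groups, and the second stage only merges at most `num_salts` partial results per key, so no single group carries the whole hot key.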

13
MCQ · hard

A company processes large volumes of GPS sensor data stored in Cloud Storage. Each hour, they run an Apache Spark job that aggregates the data by geohash region. The job must be cost-effective and scale automatically. Currently, they are using a Dataproc cluster with preemptible workers. Which improvement would best reduce costs while maintaining performance?

A.Use a larger Dataproc cluster with standard workers
B.Migrate the job to BigQuery scheduled queries
C.Switch to Dataflow batch pipeline with Apache Beam
D.Use Dataproc Serverless Spark
Answer: D

Dataproc Serverless Spark runs Spark jobs without cluster management, scales automatically, and you pay only for resources used, reducing cost.

Why this answer

Dataproc Serverless Spark (Option D) eliminates the need to manage a cluster, automatically scaling resources to match job demand and charging only for the resources consumed during execution. This removes the overhead of preemptible worker management and idle cluster costs, directly reducing expenses while maintaining performance for the hourly aggregation job.

Exam trap

Google Cloud often tests the misconception that migrating to a different processing engine (like Dataflow or BigQuery) is always the best cost-saving move, when in fact reusing existing Spark code on a serverless platform avoids migration costs and leverages the same API.

How to eliminate wrong answers

Option A is wrong because using a larger cluster with standard workers increases costs due to higher per-hour instance pricing and potential idle time, without addressing the cost inefficiency of preemptible workers. Option B is wrong because BigQuery scheduled queries are designed for SQL-based analytics on data already in BigQuery, not for processing large volumes of GPS sensor data stored in Cloud Storage with Apache Spark aggregations; migrating would require rewriting the Spark logic and may incur high BigQuery slot costs. Option C is wrong because while Dataflow batch pipelines with Apache Beam can process data cost-effectively, they require rewriting the existing Spark job into Beam, introducing development overhead and potential performance differences, whereas Dataproc Serverless Spark directly runs the existing Spark code without migration.

14
Drag & Drop · medium

Drag and drop the steps to create a Cloud Function triggered by Cloud Storage events into the correct order.

Steps
Order

Why this order

Cloud Functions can respond to changes in Cloud Storage buckets.

15
MCQ · medium

A data pipeline uses Cloud Composer (Airflow) to orchestrate Dataproc jobs. Each job submits a Spark application that reads from BigQuery and writes to Cloud Storage. The pipeline runs nightly and takes 6 hours. Management wants to reduce costs. Which approach is most effective?

A.Use preemptible VMs for the Dataproc cluster
B.Switch to Cloud Dataproc billing per second instead of per minute
C.Increase the memory of the driver node to improve performance
D.Upgrade the Cloud Storage class from Standard to Nearline
Answer: A

Preemptible VMs are cheaper and suitable for batch jobs.

Why this answer

Preemptible VMs are significantly cheaper (up to 80% discount) than standard VMs and are ideal for fault-tolerant, batch workloads like nightly Dataproc jobs. Since the pipeline runs nightly and takes 6 hours, it can tolerate the occasional preemption of worker nodes by using Spark's built-in resilience (e.g., task retries). This directly reduces compute cost without sacrificing completion, assuming the cluster is configured with enough preemptible workers to handle the workload.

Exam trap

Google Cloud often tests the misconception that 'upgrading' storage class or changing billing granularity saves money, when in fact the correct answer involves leveraging cheaper compute resources (preemptible VMs) that are designed for fault-tolerant batch jobs.

How to eliminate wrong answers

Option B is wrong because Dataproc already bills per second after a 1-minute minimum, so switching to per-second billing is not a change that reduces costs further. Option C is wrong because increasing driver memory does not reduce costs; it may actually increase costs by requiring a larger, more expensive VM, and performance gains are unlikely if the bottleneck is not driver memory. Option D is wrong because upgrading from Standard to Nearline storage increases cost (Nearline has higher retrieval and minimum storage duration fees) and is intended for infrequently accessed data, not for nightly write workloads where data is read soon after writing.

16
Multi-Select · hard

A Dataflow batch job frequently fails with 'OutOfMemoryError'. Which THREE are common causes? (Choose 3)

Select 3 answers
A.Too many parallel workers
B.Inefficient GroupByKey with hot keys
C.Too many side inputs
D.Too large window accumulation in streaming mode
E.Using Dataflow Shuffle
Answers: B, C, D

Hot keys cause all values to be processed by a single worker, leading to memory exhaustion.

Why this answer

Option B is correct because a hot key in a GroupByKey operation causes all values for that key to be processed by a single worker, leading to memory exhaustion when the key's associated data exceeds the worker's memory capacity. This is a common cause of OutOfMemoryError in Dataflow batch jobs, as the SDK buffers all values for a key before emitting the result.

Exam trap

Google Cloud often tests the misconception that increasing parallelism (Option A) always reduces memory errors, but in Dataflow, hot keys cause memory issues regardless of worker count because the hot key's data is processed by a single worker.
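The memory difference between buffering all values for a key and keeping a running accumulator can be sketched as follows. These are plain-Python stand-ins with hypothetical names, not the Beam transforms themselves; the second return value is the largest number of values held for any one key.

```python
from collections import defaultdict


def group_then_sum(pairs):
    """Buffer every value per key before summing, as a grouping step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    peak_buffered = max((len(values) for values in groups.values()), default=0)
    return {key: sum(values) for key, values in groups.items()}, peak_buffered


def combine_sum(pairs):
    """Keep a single running accumulator per key, as a combiner would."""
    accumulators = defaultdict(int)
    for key, value in pairs:
        accumulators[key] += value
    peak_buffered = 1 if accumulators else 0
    return dict(accumulators), peak_buffered
```

Both functions return the same totals, but the buffered size of the grouping version grows with the hot key while the combining version stays constant per key.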

17
MCQ · medium

A team needs to orchestrate a complex ETL workflow that includes conditional branching (if new data arrives, run transformation A, else run transformation B), error handling, and coordination across multiple services. Which service should they use?

A.Cloud Functions
B.Cloud Composer (Apache Airflow)
C.Cloud Workflows
D.Cloud Scheduler
Answer: B

Airflow natively supports branching, dependencies, and error handling in Python DAGs, ideal for complex orchestration.

Why this answer

Cloud Composer (Apache Airflow) is the correct choice because it is designed for orchestrating complex, multi-step ETL workflows with conditional branching, error handling, and cross-service coordination. Airflow's directed acyclic graphs (DAGs) natively support conditional logic (e.g., BranchPythonOperator), retries, and dependency management across heterogeneous services, making it ideal for this use case.

Exam trap

Google Cloud often tests the distinction between orchestration (Cloud Composer) and simple scheduling or event-driven compute (Cloud Scheduler, Cloud Functions), leading candidates to pick Cloud Functions for its event-driven nature or Cloud Workflows for its branching capability, without recognizing that Airflow is the only service purpose-built for complex, multi-step ETL orchestration with conditional logic and error handling.

How to eliminate wrong answers

Option A is wrong because Cloud Functions is a serverless compute service for single-purpose, event-driven functions, not a workflow orchestrator; it lacks native support for conditional branching, retry policies, and multi-step coordination across services. Option C is wrong because Cloud Workflows is a low-code orchestration service that can handle branching and error handling, but it is designed for simpler, synchronous workflows and does not provide the same level of scheduling, retry, and monitoring capabilities as Airflow for complex ETL pipelines. Option D is wrong because Cloud Scheduler is a cron job service that triggers tasks on a schedule, but it cannot manage conditional branching, error handling, or multi-service coordination within a single workflow.
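The branching and retry behavior described for Airflow can be sketched without Airflow itself. In a real DAG, the callable given to `BranchPythonOperator` returns the `task_id` of the task to follow; the task names and the `run_with_retries` helper below are hypothetical stand-ins for what the scheduler does.

```python
def choose_branch(new_data_arrived):
    """Return the id of the downstream task to run, as a branch callable would."""
    return "transformation_a" if new_data_arrived else "transformation_b"


def run_with_retries(task, retries=2):
    """Call task(), retrying on failure up to `retries` additional times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure to the caller


# Hypothetical task bodies keyed by task id.
TASKS = {
    "transformation_a": lambda: "ran A",
    "transformation_b": lambda: "ran B",
}

print(run_with_retries(TASKS[choose_branch(new_data_arrived=True)]))
```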

18
MCQ · hard

A company wants to replicate a Cloud SQL (PostgreSQL) database to BigQuery in near real-time for analytics. The volume is about 10GB per day with frequent updates and deletes. They need to capture changes with low latency and ensure exactly-once delivery to BigQuery. Which approach should they use?

A.Export the entire database to Cloud Storage as CSV files every hour and load them into BigQuery using a load job with WRITE_TRUNCATE.
B.Use a Dataflow pipeline with JDBCIO to read from Cloud SQL every minute and write changes to BigQuery using upserts.
C.Use Cloud Data Fusion with a Debezium streaming source to capture CDC from Cloud SQL and a BigQuery sink with exactly-once mode.
D.Use Cloud SQL's change data capture feature to write changes to a Pub/Sub topic and use a Dataflow pipeline to stream into BigQuery.
Answer: C

C is correct because Data Fusion with Debezium provides near real-time CDC with exactly-once semantics.

Why this answer

Option C is correct because Cloud Data Fusion with a Debezium streaming source provides native change data capture (CDC) from PostgreSQL, capturing inserts, updates, and deletes with low latency. The BigQuery sink in exactly-once mode ensures no duplicate records, meeting the requirement for near real-time analytics with frequent updates and deletes.

Exam trap

Google Cloud often tests the misconception that Cloud SQL has a native CDC feature to write to Pub/Sub, but in reality, it requires an external CDC tool like Debezium or Datastream to capture changes.

How to eliminate wrong answers

Option A is wrong because exporting the entire database as CSV files every hour and using WRITE_TRUNCATE overwrites the entire BigQuery table, losing all historical data and failing to capture updates and deletes in near real-time; it also does not provide exactly-once delivery. Option B is wrong because JDBCIO reads snapshots of the table at each poll interval, not change data capture, so it cannot capture deletes and may miss updates between polls; it also does not guarantee exactly-once semantics for upserts in BigQuery. Option D is wrong because Cloud SQL does not have a built-in change data capture feature that writes directly to Pub/Sub; this option describes a non-existent capability, as Cloud SQL requires a separate CDC tool such as Debezium or Google Cloud's Datastream to capture changes.
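Why change events handle deletes and duplicates better than snapshot polling can be sketched with an idempotent apply step. The event shape used here (`op`, `key`, `seq`, `row`) is a hypothetical simplification, not the actual Debezium record format.

```python
def apply_change_events(rows, last_seq, events):
    """Apply insert/update/delete events to `rows`, skipping replays.

    rows:     dict mapping primary key -> current row
    last_seq: dict mapping primary key -> highest sequence number applied
    events:   iterable of dicts with keys 'op', 'key', 'seq', 'row'
    """
    for event in events:
        key, seq = event["key"], event["seq"]
        if seq <= last_seq.get(key, -1):
            continue  # duplicate or stale event: applying it again would be wrong
        if event["op"] == "delete":
            rows.pop(key, None)   # deletes are explicit events, unlike in a snapshot
        else:
            rows[key] = event["row"]
        last_seq[key] = seq
    return rows
```

Replaying the same batch of events leaves the table unchanged, which is the property an exactly-once sink relies on when the source delivers an event more than once.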

19
MCQ · easy

A data engineer needs to design a batch processing pipeline using Cloud Data Fusion. The pipeline should read data from Cloud Storage, perform transformations (join, filter, aggregate), and write to BigQuery. What is the most efficient way to handle the transformations?

A.Use Data Fusion Wrangler to visually design the transformations and then run the pipeline on a Dataproc cluster.
B.Use SQL queries in BigQuery to perform the transformations after loading raw data into staging tables.
C.Use custom Python scripts in a Cloud Function triggered after the files land in Cloud Storage.
D.Use Apache Spark on Dataproc to code the transformations manually, bypassing Data Fusion.
Answer: A

Wrangler provides a UI for transformations and Data Fusion executes them on Dataproc.

Why this answer

Option A is correct because Cloud Data Fusion Wrangler provides a visual, no-code interface for designing transformations (join, filter, aggregate) that are then compiled into an Apache Spark or MapReduce program and executed on a Dataproc cluster. This approach leverages Data Fusion's native integration with Dataproc for efficient, scalable batch processing without manual coding, while keeping the pipeline fully managed within the Data Fusion ecosystem.

Exam trap

Google Cloud often tests the misconception that Cloud Data Fusion is only a visual tool and that transformations must be coded manually in Spark or SQL, when in fact Wrangler generates optimized Spark code under the hood and integrates seamlessly with Dataproc for execution.

How to eliminate wrong answers

Option B is wrong because it bypasses Data Fusion entirely, requiring raw data to be loaded into BigQuery staging tables first, which adds latency and storage costs; transformations in BigQuery are better suited for analytics queries, not as a primary ETL step in a Data Fusion pipeline. Option C is wrong because Cloud Functions have a maximum timeout of 9 minutes (540 seconds) and limited memory (up to 8 GB), making them unsuitable for large-scale batch transformations like joins and aggregations on datasets that may be gigabytes or terabytes in size. Option D is wrong because it suggests manually coding Spark on Dataproc, which defeats the purpose of using Data Fusion's visual design and managed execution; while Spark can be used, Data Fusion already abstracts and optimizes the Spark execution, so manual coding adds unnecessary complexity and maintenance overhead.

20
MCQeasy

You are operating a streaming data pipeline that uses Cloud Pub/Sub and Dataflow. The data source sometimes emits events that are delayed by several minutes due to network issues. Your pipeline must produce accurate aggregations (e.g., counts per minute) even for late data, but you also need to avoid waiting for a long time before emitting results. Which approach should you use?

A.Use processing-time windows and ignore the event timestamps entirely.
B.Use event-time processing with allowed lateness and a trigger that fires early to provide speculative results.
C.Use global windows and hold all data for 24 hours before processing to ensure completeness.
D.Use event-time processing and discard any data that arrives after the window ends.
AnswerB

Dataflow supports allowed lateness and triggers; you can set a trigger to emit early results every minute, and then a final result after the allowed lateness period, ensuring both low latency and eventual accuracy.

Why this answer

Option B is correct because it uses event-time processing to handle late data via allowed lateness, combined with early triggers to emit speculative results before the window closes. This balances accuracy for delayed events with low latency for downstream consumers, which is a common requirement in streaming pipelines using Cloud Pub/Sub and Dataflow.

Exam trap

Google Cloud often tests the distinction between processing-time and event-time semantics, and the trap here is that candidates may choose processing-time windows (Option A) thinking they are simpler, not realizing they sacrifice correctness for late data.

How to eliminate wrong answers

Option A is wrong because processing-time windows ignore event timestamps entirely, so late-arriving data would be assigned to the wrong window, producing inaccurate aggregations. Option C is wrong because global windows with a 24-hour hold would cause unbounded latency and memory pressure, violating the requirement to avoid waiting a long time before emitting results. Option D is wrong because discarding late data after the window ends would lose delayed events, failing the requirement for accurate aggregations even with late data.
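
How early results and allowed lateness interact can be sketched without any streaming framework. This is a simplified model, not the Dataflow or Apache Beam API: the arrival time stands in for the watermark, a pane is emitted for every accepted element, and the window size and lateness values are assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW = 60             # fixed event-time window size in seconds (assumed)
ALLOWED_LATENESS = 300  # accept data up to 5 minutes after the window end (assumed)


def run(events):
    """events: list of (event_time, arrival_time), sorted by arrival_time.

    Returns a list of (window_start, label, count) panes. The arrival time is
    used as a stand-in for the watermark, which is a simplification.
    """
    counts = defaultdict(int)
    panes = []
    for event_time, arrival in events:
        start = event_time - event_time % WINDOW
        end = start + WINDOW
        if arrival >= end + ALLOWED_LATENESS:
            # Too late: the window's state has been discarded, count is unchanged.
            panes.append((start, "dropped", counts[start]))
            continue
        counts[start] += 1
        # Before the window end the count is speculative; afterwards it is a late update.
        label = "early" if arrival < end else "late"
        panes.append((start, label, counts[start]))
    return panes


panes = run([(5, 6), (30, 31), (50, 200), (10, 400)])
for pane in panes:
    print(pane)
```

The first two elements produce speculative counts, the element arriving at second 200 refines the result within the allowed lateness, and the element arriving at second 400 is too late to change it.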

21
MCQmedium

A streaming Dataflow pipeline ingests events from Cloud Pub/Sub and writes to BigQuery. The event schema evolves occasionally (new columns added). The pipeline fails when new columns appear. What is the best long-term solution?

A.Configure the BigQuery sink to use stored 'dynamic' schema by setting create_disposition to CREATE_NEVER and writing to a temporary table with schema auto-detection
B.Stop the pipeline and update the BigQuery schema manually whenever a new column appears
C.Switch to Dataproc to process the data with Spark and write to BigQuery using the Avro format
D.Use a Cloud Function to transform the data and add null columns for missing fields
AnswerA

Using schema auto-detection on a temporary table and then merging into the main table with wildcard tables or using BigQuery's schema flexibility can handle new columns.

Why this answer

Option A is correct because it leverages BigQuery's schema auto-detection with a temporary table to handle schema evolution dynamically. By setting create_disposition to CREATE_NEVER, the pipeline writes to a table that already exists, while the temporary table with auto-detection allows the pipeline to infer new columns from the incoming data. This approach avoids pipeline failures when new columns appear, as the sink can adapt without manual intervention or pipeline restarts.

Exam trap

Google Cloud often tests the misconception that manual schema updates or external transformations are acceptable long-term solutions, when in fact the correct answer leverages a built-in BigQuery feature (schema auto-detection) to handle schema evolution dynamically without pipeline downtime.

How to eliminate wrong answers

Option B is wrong because it requires manual intervention to stop the pipeline and update the BigQuery schema each time a new column appears, which is not a long-term solution and defeats the purpose of a streaming pipeline that needs to handle schema evolution automatically. Option C is wrong because switching to Dataproc with Spark and Avro format does not inherently solve the schema evolution problem; it adds unnecessary complexity and still requires handling schema changes in the Spark job or BigQuery sink. Option D is wrong because using a Cloud Function to transform data and add null columns for missing fields is a brittle workaround that requires maintaining a separate function and does not scale well with frequent schema changes; it also introduces additional latency and cost.
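
The idea of detecting new columns from incoming data can be sketched as follows. This is not BigQuery's auto-detection logic; `infer_fields`, `added_columns`, the type mapping, and the sample message are hypothetical and only show how a new field might be noticed before it breaks a fixed schema.

```python
import json


def infer_fields(record):
    """Map each top-level key to a coarse, BigQuery-style type name."""
    type_names = {bool: "BOOLEAN", int: "INTEGER", float: "FLOAT", str: "STRING"}
    return {k: type_names.get(type(v), "STRING") for k, v in record.items()}


def added_columns(known_schema, record):
    """Return fields present in the record but absent from the known schema."""
    inferred = infer_fields(record)
    return {k: t for k, t in inferred.items() if k not in known_schema}


known = {"event_id": "STRING", "amount": "FLOAT"}
message = json.loads('{"event_id": "e1", "amount": 9.5, "currency": "EUR"}')

print(added_columns(known, message))
```

Here the `currency` field is reported as new, so it could be added to the destination schema as a nullable column instead of failing the write.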

22
Multi-Selectmedium

Which TWO factors should be considered when choosing between Cloud Dataflow and Dataproc for a batch processing pipeline?

Select 2 answers
A.Dataproc allows custom Docker containers, while Dataflow does not.
B.Dataflow is built for data processing patterns, while Dataproc is better for general-purpose compute.
C.Dataproc supports Python, while Dataflow only supports Java.
D.Dataflow provides auto-scaling, while Dataproc requires manual cluster sizing.
E.Dataflow supports Java and Python, while Dataproc only supports Java.
AnswersB, D

Dataflow is specialized for data pipelines.

Why this answer

Option B is correct because Dataflow is purpose-built for data processing patterns like batch and stream processing with unified programming models (Apache Beam), while Dataproc is optimized for general-purpose compute workloads such as running custom Spark, Hadoop, or ML jobs. Option D is correct because Dataflow provides automatic horizontal autoscaling based on pipeline throughput, whereas Dataproc requires manual cluster sizing or configuration of autoscaling policies, which are not as granular or reactive as Dataflow's.

Exam trap

Google Cloud often tests the misconception that Dataflow only supports Java and that Dataproc requires manual scaling, when in fact both services support multiple languages and Dataproc offers optional autoscaling, but Dataflow's autoscaling is more dynamic and fine-grained.

23
MCQeasy

Your company wants to analyze real-time user clickstream data from a website. The data arrives as JSON messages via an HTTP endpoint. The pipeline should be able to handle spikes in traffic, provide low-latency insights, and store the raw data in a data lake for historical analysis. Which Google Cloud service should you use to ingest and process the streaming data?

A.Cloud Pub/Sub combined with Dataflow
B.Cloud Dataproc
C.Cloud Functions
D.Cloud IoT Core
AnswerA

Cloud Pub/Sub provides reliable, scalable ingestion; Dataflow enables stream processing with exactly-once semantics and can write to Cloud Storage.

Why this answer

Cloud Pub/Sub is the correct ingestion service because it provides a highly scalable, fully managed message queue that can handle traffic spikes by decoupling producers from consumers. Dataflow (Apache Beam) then processes the streaming data with low latency, supports exactly-once semantics, and can write raw data to a data lake like Cloud Storage for historical analysis. This combination meets all requirements: spike handling, low-latency insights, and raw data storage.

Exam trap

Google Cloud often tests the misconception that Cloud Functions can handle streaming ingestion due to its HTTP trigger, but its 9-minute timeout and lack of native streaming support make it unsuitable for high-throughput, low-latency pipelines.

How to eliminate wrong answers

Option B (Cloud Dataproc) is wrong because it is a managed Hadoop/Spark service designed for batch and stream processing but requires manual cluster management and autoscaling configuration, making it less suitable for handling unpredictable traffic spikes with low latency compared to the serverless Pub/Sub + Dataflow pipeline. Option C (Cloud Functions) is wrong because it is a lightweight, event-driven compute service with a maximum timeout of 9 minutes and limited throughput, making it unsuitable for high-volume, real-time streaming ingestion and processing. Option D (Cloud IoT Core) is wrong because it is specifically designed for ingesting data from IoT devices using MQTT/HTTP protocols, not for general web clickstream data from an HTTP endpoint, and it lacks the native streaming analytics capabilities needed for low-latency insights.

24
Multi-Selecthard

A company is migrating their on-premises Apache Spark jobs to Google Cloud Dataproc. They want to minimize operational overhead and cost for jobs that run only a few times per day. Which TWO strategies should they adopt? (Choose TWO.)

Select 2 answers
A.Configure HDFS replication factor to 3 to ensure data durability during cluster restarts.
B.Rewrite the Spark jobs as Dataflow pipelines to take advantage of serverless processing.
C.Store all data in Cloud Storage instead of HDFS, and use the Cloud Storage connector to access it.
D.Create an ephemeral Dataproc cluster for each job and delete it after completion.
E.Use a small persistent cluster that runs continuously and submit jobs to it.
AnswersC, D

C is correct because Cloud Storage is durable and eliminates HDFS management.

Why this answer

Option C is correct because storing data in Cloud Storage decouples storage from compute, allowing ephemeral clusters to be spun up and down without data loss. The Cloud Storage connector provides Hadoop-compatible file system access, eliminating the need for HDFS replication and reducing costs by avoiding persistent cluster storage.

Exam trap

Google Cloud often tests the misconception that persistent clusters are necessary for data durability, but the correct approach for intermittent workloads is to use ephemeral clusters with Cloud Storage to minimize cost and operational overhead.
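
A back-of-the-envelope comparison shows why ephemeral clusters suit jobs that run only a few times per day. The hourly rate, run count, and run duration below are made-up numbers, and cluster startup time is ignored.

```python
def monthly_cost(hourly_rate, hours_per_day, days=30):
    """Cost of a cluster that is billed for hours_per_day hours every day."""
    return hourly_rate * hours_per_day * days


HOURLY_RATE = 2.0            # hypothetical cluster cost in USD per hour
JOB_HOURS_PER_DAY = 3 * 0.5  # assumed: three runs a day, 30 minutes each

persistent = monthly_cost(HOURLY_RATE, 24)                # cluster never deleted
ephemeral = monthly_cost(HOURLY_RATE, JOB_HOURS_PER_DAY)  # cluster exists only during runs

print(f"persistent: ${persistent:.2f}, ephemeral: ${ephemeral:.2f}")
```

Under these assumptions the persistent cluster is billed for 24 hours a day while doing 1.5 hours of work, which is the idle cost that Option D removes.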

25
MCQeasy

A data engineer notices that Spark jobs on the Dataproc cluster shown often fail with executor lost errors. What is the most likely reason?

A.All 10 workers are preemptible and can be reclaimed by Compute Engine at any time.
B.The master node has only 4 vCPUs, which may be insufficient for job coordination.
C.The cluster is in a single zone, so a zone failure could cause all workers to shut down.
D.Autoscaling is enabled and scaling down is causing workers to be removed during job execution.
AnswerA

Preemptible VMs can be reclaimed at any time and run for at most 24 hours; Spark executors are lost when their workers are preempted.

Why this answer

Preemptible VMs in Google Compute Engine can be terminated at any time due to resource contention or other factors, with only 30 seconds notice. If all 10 worker nodes are preemptible, Spark executors running on them will be frequently lost, causing job failures. This is the most direct cause of 'executor lost' errors in a Dataproc cluster.

Exam trap

The trap here is that candidates may overlook the 'all 10 workers are preemptible' detail and instead focus on common misconfigurations like single-zone risk or autoscaling, but the explicit mention of preemptible VMs is the key indicator of frequent, unpredictable executor loss.

How to eliminate wrong answers

Option B is wrong because the master node's vCPUs (4) are typically sufficient for job coordination; executor lost errors are not caused by insufficient master resources but by worker instability. Option C is wrong because a single-zone cluster does not cause frequent executor losses; zone failures are rare and would cause complete cluster failure, not intermittent executor lost errors. Option D is wrong because autoscaling removes workers gracefully, allowing Spark to reschedule tasks before termination; it does not cause the abrupt 'executor lost' errors seen here.

26
MCQeasy

A data engineer needs to process large CSV files (hundreds of GB) stored in Cloud Storage using Spark on a Dataproc cluster. The job performs a series of transformations and aggregations. Which configuration is most cost-effective and operationally efficient?

A.Use a cluster with 10 high-memory (n1-highmem-8) VMs as workers to improve shuffle performance.
B.Use a cluster with a standard master node and 10 preemptible worker nodes (n1-standard-4).
C.Use a single-node cluster with a high-memory machine type.
D.Use a cluster with 10 standard (n1-standard-4) VMs as master and worker nodes, all non-preemptible.
AnswerB

Preemptible workers are cost-effective and suitable for fault-tolerant jobs like Spark.

Why this answer

Option B is correct because preemptible workers are significantly cheaper (about 80% discount) and ideal for batch processing of large CSV files where fault tolerance is built into Spark via RDD lineage. Using standard nodes for the master ensures cluster stability, while preemptible workers handle the distributed transformations and aggregations cost-effectively. This configuration balances cost and operational efficiency for ephemeral, fault-tolerant workloads.

Exam trap

Google Cloud often tests the misconception that preemptible VMs are unreliable for all workloads, but in Spark batch processing with fault tolerance, they are both cost-effective and operationally efficient, unlike stateful or latency-sensitive applications.

How to eliminate wrong answers

Option A is wrong because using high-memory VMs (n1-highmem-8) for all workers increases cost unnecessarily; shuffle performance is better addressed by tuning Spark parameters (e.g., spark.sql.shuffle.partitions) and using local SSDs, not by over-provisioning memory. Option C is wrong because a single-node cluster cannot process hundreds of GB of data efficiently due to lack of parallelism and memory constraints, and it gives up the distributed processing that Spark is designed for. Option D is wrong because using only non-preemptible standard VMs (n1-standard-4) for both master and workers gives up the cost savings of preemptible instances; the job pays full price for guaranteed capacity that a fault-tolerant Spark batch job does not need.

27
Multi-Selecteasy

Which TWO are valid approaches to handle late-arriving data in a Cloud Dataflow streaming pipeline?

Select 2 answers
A.Change to processing time windows instead of event time windows
B.Set allowed lateness on the window
C.Use a side input with a fixed window to join late data
D.Discard any events that arrive after the window closes
E.Use a trigger that fires every second
AnswersB, C

Allowed lateness tells the pipeline how long to wait for late data.

Why this answer

Option B is correct because setting allowed lateness on a window in Cloud Dataflow allows the pipeline to wait for late-arriving data within a specified duration after the watermark passes the window end. This is a standard mechanism to handle out-of-order or delayed events without discarding them, ensuring completeness of windowed aggregations.

Exam trap

Google Cloud often tests the misconception that processing time windows are a valid substitute for handling late data, but they fundamentally change the semantics from event-time to processing-time, which is not a proper solution for late-arriving events.

28
MCQmedium

A retail company uses Cloud Dataflow for a streaming pipeline that aggregates sales events from thousands of stores. The pipeline writes aggregated results to BigQuery every 5 minutes. Recently, the Dataflow job has been restarting multiple times a day with the error: 'Worker ran out of memory' in the logs. The streaming engine is enabled. The pipeline uses keyed state (ParDo with stateful processing) to maintain per-store counters. The average event size is 2KB, and the throughput is 2,000 events/sec. You need to resolve the out-of-memory issues without losing data. What should you do?

A.Disable stateful processing and use side inputs from BigQuery to get per-store aggregates.
B.Modify the pipeline to use sliding windows with a shorter duration to reduce the state size.
C.Increase the number of workers in the pipeline configuration and ensure the maximum worker count is set higher to allow better distribution of state.
D.Reduce the number of workers to limit the overhead of data shuffling.
AnswerC

More workers spread the stateful processing and reduce memory per worker.

Why this answer

Option C is correct because increasing the number of workers distributes the keyed state (per-store counters) across more VMs, reducing the memory pressure on each individual worker. Even with Streaming Engine enabled, workers still cache state and buffer in-flight elements in memory for low-latency access, so adding workers is the direct way to spread that footprint. This avoids data loss because the pipeline continues processing with exactly-once semantics and state is preserved via checkpointing.

Exam trap

The trap here is that candidates may confuse window-based state (which can be reduced by shortening windows) with keyed state (which is independent of window duration), leading them to incorrectly choose option B.

How to eliminate wrong answers

Option A is wrong because disabling stateful processing and using side inputs from BigQuery would introduce significant latency and inconsistency (BigQuery is not designed for real-time per-record lookups), and it would break the streaming aggregation logic. Option B is wrong because sliding windows do not reduce state size for keyed state (ParDo with stateful processing uses per-key state, not windowed state); changing window duration has no effect on the memory used by the per-store counters. Option D is wrong because reducing the number of workers would concentrate more state on fewer VMs, worsening the out-of-memory issue and increasing the risk of worker crashes.
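
Why more workers means less keyed state per worker can be sketched with generic hash partitioning. How Dataflow actually assigns keys to workers is internal to the service; the `worker_for` function, the store IDs, and the worker counts below are assumptions for illustration only.

```python
import hashlib
from collections import Counter


def worker_for(key, num_workers):
    """Assign a key to a worker with a stable hash (generic partitioning, not Dataflow's)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def max_keys_per_worker(keys, num_workers):
    """Number of keys (and therefore state cells) held by the busiest worker."""
    load = Counter(worker_for(k, num_workers) for k in keys)
    return max(load.values())


store_ids = [f"store-{i}" for i in range(5000)]  # hypothetical per-store keys

few = max_keys_per_worker(store_ids, 4)
many = max_keys_per_worker(store_ids, 16)
print(f"busiest worker holds {few} keys with 4 workers, {many} keys with 16 workers")
```

Each key lives on exactly one worker, so raising the worker count lowers the number of counters any single worker has to keep.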

29
MCQhard

A Dataflow streaming pipeline that uses global windows and triggers every 5 seconds is experiencing increasing lag and high system latency. The pipeline reads from Pub/Sub, transforms data with a ParDo, and writes to BigQuery. Which action is most likely to reduce lag?

A.Use a session window to group related events.
B.Replace the global window with a sliding window of 1 minute.
C.Change the trigger to processing time instead of event time.
D.Increase the number of workers manually.
AnswerB

A sliding window reduces the number of elements per trigger and improves latency by distributing state across workers.

Why this answer

B is correct because sliding windows of 1 minute allow the pipeline to process data in overlapping fixed-size windows, which can reduce the buildup of data in memory compared to global windows. Global windows with frequent triggers (every 5 seconds) can cause unbounded state growth and high latency as the pipeline must maintain state for all elements until the trigger fires, whereas sliding windows naturally bound the data per window and enable more efficient watermark and trigger management in Dataflow.

Exam trap

Google Cloud often tests the misconception that increasing workers or changing trigger timing alone can fix lag caused by inappropriate windowing strategy, when the real issue is that global windows with frequent triggers create unbounded state that overwhelms the pipeline's memory and shuffle capacity.

How to eliminate wrong answers

Option A is wrong because session windows group events based on inactivity gaps, which does not address the core issue of unbounded state from global windows and can actually increase state size if sessions are long. Option C is wrong because changing the trigger to processing time instead of event time does not reduce lag; it may cause data to be processed based on when it arrives rather than when it occurred, potentially increasing latency due to watermark misalignment and still requiring global window state. Option D is wrong because manually increasing the number of workers can help with throughput but does not fix the fundamental design flaw of using global windows with frequent triggers, which leads to excessive state accumulation and shuffling; autoscaling in Dataflow already handles worker count based on backlog.
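
How a sliding window bounds the data each result covers can be sketched with a small assignment function. The question only gives the 1-minute window size; the 30-second period is an assumption, and this is a plain-Python illustration rather than the Beam `SlidingWindows` implementation.

```python
def sliding_windows(timestamp, size=60, period=30):
    """Return the [start, end) windows, in seconds, that contain the timestamp.

    A new window of length `size` starts every `period` seconds.
    """
    last_start = timestamp - timestamp % period
    starts = range(last_start, timestamp - size, -period)
    return [(start, start + size) for start in starts]


windows = sliding_windows(75)
print(windows)
```

Each element belongs to size / period windows (two here), and every window covers only one minute of event time, unlike a global window that never closes.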

30
MCQmedium

A company uses Cloud Composer (Airflow) to orchestrate a daily batch job that runs a custom Python script on a Compute Engine instance. The process is slow because the instance takes 2 minutes to boot. How can you reduce the total runtime?

A.Switch to Dataproc Serverless to avoid VM boot time
B.Use a larger machine type for faster provisioning
C.Create a custom image with the script and dependencies pre-installed
D.Use a GPU-accelerated instance to speed up the script
AnswerC

Custom image reduces boot time by avoiding package installs.

Why this answer

Option C is correct because creating a custom image with the script and dependencies pre-installed eliminates the need to install packages or configure the environment at boot time. In Cloud Composer, when a Compute Engine instance is provisioned via a BashOperator or SSHOperator, the boot process includes OS initialization and package installation. A custom image bypasses these steps, reducing boot time from minutes to seconds, directly addressing the 2-minute boot delay.

Exam trap

The trap here is that candidates may assume 'faster provisioning' means a larger machine type (Option B) or a serverless service (Option A), but the question specifically targets the boot time caused by environment setup, which is solved by pre-installing dependencies in a custom image.

How to eliminate wrong answers

Option A is wrong because Dataproc Serverless is designed for Apache Spark and Hadoop workloads, not for running arbitrary Python scripts on a single Compute Engine instance; it introduces overhead for job submission and cluster management that is not suitable for this use case. Option B is wrong because a larger machine type does not reduce boot time; boot time is dominated by OS initialization and package installation, not by CPU or memory size. Option D is wrong because GPU-accelerated instances are intended for compute-intensive tasks like machine learning or rendering, not for reducing boot time; the script's slowness is due to boot delay, not computational performance.

31
MCQmedium

A Dataflow pipeline is processing a high-volume streaming data stream. The job is lagging behind by 30 minutes, and the Dataflow monitoring UI shows high system latency with low CPU utilization. Which action should be taken to improve throughput?

A.Enable Streaming Engine
B.Increase the number of workers
C.Enable Dataflow Shuffle
D.Disable hot key detection
AnswerC

Dataflow Shuffle offloads shuffle operations to a managed service, reducing worker overhead and improving throughput when shuffle is the bottleneck.

Why this answer

Option C is correct because high system latency with low CPU utilization indicates a bottleneck in data shuffling, not in processing capacity. Enabling Dataflow Shuffle offloads the shuffle operation to Google-managed resources, reducing disk I/O and network overhead, which directly improves throughput in streaming pipelines.

Exam trap

Google Cloud often tests the misconception that low CPU utilization always means more workers are needed, but the trap here is that shuffle bottlenecks cause high latency without saturating CPU, so the correct fix is to offload shuffle operations rather than scale workers.

How to eliminate wrong answers

Option A is wrong because Streaming Engine is designed to reduce streaming latency by moving state management from workers to backend services, but the issue here is low CPU utilization and high latency due to shuffle bottlenecks, not state management. Option B is wrong because increasing workers would add more processing capacity, but with low CPU utilization, the bottleneck is elsewhere (shuffle), so more workers would not resolve the shuffle contention and could increase cost without benefit. Option D is wrong because disabling hot key detection would remove the ability to identify and optimize for skewed keys, which could worsen the shuffle bottleneck; hot key detection helps in redistributing load, not causing the latency issue.

32
MCQhard

A company uses Cloud Composer (Airflow) to orchestrate a data pipeline. One DAG has many tasks that run in parallel and dependencies that span multiple days. Recently, the DAG started failing with 'DagRun already exists' errors. What is the most likely cause?

A.The DAG has a large number of tasks, overwhelming the Airflow scheduler.
B.The DAG has max_active_runs_per_dag set to a low number, causing overlapping runs to be rejected.
C.The DAG's schedule interval is too short, causing task instances to be created with duplicate run IDs.
D.The DAG has a depends_on_past set to True, causing upstream failures to block new runs.
AnswerB

If max_active_runs_per_dag is too low, a new DAG run cannot start while the previous one is active.

Why this answer

The 'DagRun already exists' error occurs when Airflow attempts to create a new DAG run for a logical date that already has an active or completed run, and the DAG's concurrency settings prevent overlapping runs. Setting max_active_runs_per_dag to a low number (e.g., 1) restricts the number of concurrent runs, so if a previous run hasn't finished or been cleared, a new run for the same or overlapping schedule interval is rejected with this error. This is the most likely cause given the DAG has dependencies spanning multiple days, which can cause runs to overlap if not properly configured.

Exam trap

Google Cloud often tests the distinction between DAG-level concurrency settings (max_active_runs_per_dag) and task-level parallelism (e.g., pool, task concurrency), leading candidates to confuse the 'DagRun already exists' error with scheduler overload or task dependency issues.

How to eliminate wrong answers

Option A is wrong because a large number of tasks may cause scheduler performance issues or resource exhaustion, but it does not directly produce a 'DagRun already exists' error; that error is related to DAG run creation, not task-level parallelism. Option C is wrong because a short schedule interval does not create duplicate run IDs; Airflow derives the run ID of a scheduled run from its logical date (execution_date), and each scheduled interval produces a unique logical date, so duplicate run IDs would only occur if the same logical date is triggered twice (e.g., via manual backfill or API). Option D is wrong because depends_on_past=True causes tasks to wait for previous task instances to succeed, but it does not prevent the creation of new DAG runs; the 'DagRun already exists' error occurs at the DAG run level, not at the task dependency level.
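
The reasoning about run IDs and logical dates can be made concrete with a toy model. This is not Airflow code: the `create_run` helper and the exception class are hypothetical, and the `scheduled__<ISO timestamp>` naming is used here as an assumed convention.

```python
from datetime import datetime, timedelta, timezone


class DagRunAlreadyExists(Exception):
    """Raised when a run with the same run ID is created twice (toy version)."""


def scheduled_run_id(logical_date):
    # Assumed naming convention for scheduled runs: prefix plus ISO timestamp.
    return f"scheduled__{logical_date.isoformat()}"


def create_run(existing_ids, logical_date):
    """Register a run, refusing a second run for the same logical date."""
    run_id = scheduled_run_id(logical_date)
    if run_id in existing_ids:
        raise DagRunAlreadyExists(run_id)
    existing_ids.add(run_id)
    return run_id


existing = set()
start = datetime(2024, 1, 1, tzinfo=timezone.utc)

# A very short schedule interval still yields a distinct logical date per run.
for i in range(3):
    create_run(existing, start + timedelta(minutes=i))

# Re-creating a run for an already used logical date is what collides.
try:
    create_run(existing, start)
    duplicate_rejected = False
except DagRunAlreadyExists:
    duplicate_rejected = True

print(len(existing), duplicate_rejected)
```

In this model, one-minute intervals never collide with each other; only a second request for the same logical date is rejected.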

33
MCQhard

A financial services company runs a batch Dataflow pipeline daily to process transaction data. The pipeline reads from Cloud Storage, performs complex transformations, and writes to BigQuery. Recently, the pipeline has been failing intermittently with the error: 'Workflow failed. Causes: (9c3f7a2b1d4e): The worker missed 2000 data samples in the last 30 seconds. This can be caused by a variety of factors, including slow work items, network issues, or resource contention.' The team has already increased the number of workers and tried using e2-standard-8 machine types, but the issue persists. The pipeline processes approximately 500 GB of data per run and uses approximately 200 workers. The team suspects that the issue might be related to shuffle operations. What should the team do next to resolve the issue?

A.Enable Streaming Engine for the pipeline.
B.Increase the persistent disk size per worker to 100 GB.
C.Reduce the number of workers to 100 to decrease shuffle overhead.
D.Use Cloud Storage as a shuffle sink.
AnswerB

Provides more space for shuffle data, reducing disk contention.

Why this answer

The error indicates that workers are missing data samples due to slow shuffle operations, often caused by insufficient disk I/O. Increasing the persistent disk size per worker to 100 GB provides more local scratch space for Dataflow's shuffle, reducing disk contention and allowing the shuffle to complete within the 30-second window. This directly addresses the root cause without changing the worker count or machine type.

Exam trap

Google Cloud often tests the misconception that increasing workers or machine type always solves performance issues, when in fact shuffle-bound pipelines require adequate local disk I/O, not just more CPU or memory.

How to eliminate wrong answers

Option A is wrong because Streaming Engine is designed for streaming pipelines, not batch pipelines, and enabling it would not resolve shuffle disk I/O issues in a batch Dataflow job. Option C is wrong because reducing the number of workers from 200 to 100 would increase the data volume each worker must shuffle, worsening the disk contention and likely increasing the number of missed samples. Option D is wrong because Cloud Storage as a shuffle sink is not a supported configuration in Dataflow; Dataflow uses persistent disk for shuffle by default, and switching to an external sink would introduce network latency and not fix the local disk bottleneck.
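
The arithmetic behind eliminating Option C is simple enough to check directly. The sketch assumes the shuffled volume is roughly the 500 GB input and that it is spread evenly across workers, neither of which is stated in the question.

```python
def shuffle_gb_per_worker(total_gb, workers):
    """Average shuffle data per worker, assuming an even spread."""
    return total_gb / workers


current = shuffle_gb_per_worker(500, 200)  # the pipeline as described
reduced = shuffle_gb_per_worker(500, 100)  # Option C: halve the worker count

print(f"{current} GB per worker now, {reduced} GB per worker with 100 workers")
```

Halving the worker count doubles the data each worker's disk has to absorb, which makes disk contention worse rather than better.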

34
Multi-Selectmedium

Which THREE of the following are best practices when designing a Cloud Dataflow pipeline for batch processing? (Choose three.)

Select 3 answers
A.Use mutable state within ParDo to track running totals.
B.Use side inputs to hold a large lookup table that is read in every element.
C.Always insert a Reshuffle transform after every GroupByKey to redistribute data.
D.Create separate pipelines for independent jobs to allow independent scaling.
E.Tune the batch size in Write transforms to optimize BigQuery streaming inserts.
AnswersB, D, E

Side inputs enable efficient broadcast of static data to all workers.

Why this answer

Option B is correct because side inputs in Cloud Dataflow are designed to efficiently broadcast a read-only dataset (like a lookup table) to all parallel workers. When the side input is a large but static dataset, Dataflow can cache it in memory or on disk across workers, avoiding repeated external lookups and reducing per-element processing overhead. This pattern is especially effective for batch processing where the side input is read once and reused across all elements.

Exam trap

Google Cloud often tests the misconception that mutable state is acceptable in Dataflow's ParDo for batch processing, but the correct understanding is that Dataflow's execution model requires stateless transforms to ensure fault tolerance and exactly-once processing.

35
Multi-Selecthard

Your company is building a data processing system that ingests sensor data from millions of devices, processes it in near real-time to detect anomalies, and stores raw and processed data for long-term analytics. The system must meet a 99.9% uptime SLA and minimize data loss. Which THREE design choices are best? (Choose three.)

Select 3 answers
A.Use Cloud Pub/Sub as the ingestion layer with a dead-letter topic to capture unprocessed messages.
B.Store raw data in Cloud Bigtable and processed data in Cloud Storage.
C.Use Dataflow with at-least-once processing guarantees and perform deduplication downstream.
D.Use Cloud Storage for raw data archival and BigQuery for processed analytics data.
E.Use a global Cloud Load Balancer in front of the Dataflow workers.
AnswersA, C, D

Dead-letter topics prevent data loss by storing messages that cannot be processed after retries.

Why this answer

Cloud Pub/Sub with a dead-letter topic ensures that messages that cannot be processed are captured and not lost, directly supporting the requirement to minimize data loss. The dead-letter topic allows for later reprocessing or analysis of failed messages, which is critical for meeting a 99.9% uptime SLA by preventing message backlogs from blocking the ingestion pipeline.

Exam trap

Google Cloud often tests the misconception that a load balancer is needed to scale Dataflow workers, when in fact Dataflow auto-scales its own workers and uses Pub/Sub's pull subscriptions to distribute messages evenly across workers without a separate load balancer.
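The dead-letter idea can be sketched in plain Python. This is a toy model, not the Pub/Sub client library: `deliver`, `parse_reading`, and the limit of 5 attempts are assumptions made for illustration. It shows how a message that keeps failing is set aside after a bounded number of attempts instead of being dropped or blocking the messages behind it.

```python
MAX_DELIVERY_ATTEMPTS = 5  # hypothetical limit, chosen for illustration


def deliver(messages, handler, max_attempts=MAX_DELIVERY_ATTEMPTS):
    """Try each message up to max_attempts times; park failures in a dead-letter list."""
    processed, dead_letter = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(message))
                break  # success: stop retrying this message
            except ValueError:
                if attempt == max_attempts:
                    # Retries exhausted: keep the message instead of dropping it.
                    dead_letter.append(message)
    return processed, dead_letter


def parse_reading(message):
    """Hypothetical handler: fails on payloads that are not numeric."""
    return float(message)


processed, dead_letter = deliver(["21.5", "not-a-number", "19.0"], parse_reading)
```

The malformed payload ends up in `dead_letter` for later inspection, while the two valid readings are processed normally.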

36
MCQmedium

A company is migrating its on-premises Apache Spark jobs to Dataproc. The jobs read from and write to Cloud Storage. After migration, the jobs are slower than expected. The Dataproc cluster uses standard worker machines with local SSDs. What is the most likely cause of the performance degradation?

A.The Spark shuffle service is not enabled on the cluster.
B.The local SSDs are not mounted or are misconfigured.
C.The Cloud Storage connector is not using the gRPC protocol.
D.The jobs use the Cloud Storage connector instead of HDFS, causing network latency.
AnswerD

Reading from Cloud Storage over network is slower than local HDFS reads.

Why this answer

D is correct because the performance degradation is most likely due to network latency when using the Cloud Storage connector instead of HDFS. Cloud Storage is an object store accessed over the network, while HDFS leverages local SSDs for data locality and faster I/O. In Dataproc, jobs that read/write to Cloud Storage incur higher latency compared to using HDFS on local SSDs, especially for shuffle-heavy Spark workloads.

Exam trap

Google Cloud often tests the misconception that local SSDs or connector protocols are the bottleneck, when the real issue is the inherent latency of using a remote object store (Cloud Storage) versus a distributed filesystem (HDFS) with data locality.

How to eliminate wrong answers

Option A is wrong because the Spark shuffle service is enabled by default on Dataproc clusters and is not related to Cloud Storage I/O performance. Option B is wrong because local SSDs are automatically mounted and configured by Dataproc; misconfiguration would cause failures, not just slower performance. Option C is wrong because the Cloud Storage connector uses HTTP/HTTPS by default, and while gRPC can improve performance, it is not the primary cause of degradation compared to the fundamental latency difference between object storage and HDFS.

37
MCQmedium

A gaming company uses Cloud Pub/Sub to ingest player activity events. A Dataflow streaming pipeline consumes these events, performs stateful processing to compute session metrics, and writes results to Cloud Bigtable for low-latency queries. Recently, the pipeline's processing latency increased, and the Bigtable write throughput dropped. Monitoring shows that the pipeline is experiencing a high rate of 'out-of-order' messages and 'duplicate' events. The Pub/Sub subscription is configured with exactly-once delivery. The Dataflow job uses a GlobalWindow with a trigger that fires every 10 seconds. What is the most likely cause and solution?

A.The Bigtable instance is under-provisioned; add more nodes to increase write throughput.
B.Change the Pub/Sub subscription from exactly-once to at-least-once delivery to avoid redelivery overhead.
C.The pipeline's trigger is too frequent; increase the trigger interval to 30 seconds and set allowed lateness to 1 minute to handle out-of-order events.
D.The streaming engine is disabled; enable Streaming Engine to reduce worker memory pressure.
AnswerC

A longer trigger allows more events to be processed before firing, reducing duplicates and correcting out-of-order handling.

Why this answer

Option C is correct because the high rate of out-of-order and duplicate events indicates that the pipeline's trigger is firing too frequently, causing the stateful processing to attempt to commit partial windows before all events arrive. Increasing the trigger interval to 30 seconds and setting allowed lateness to 1 minute allows the pipeline to buffer more events, reduce the number of speculative triggers, and handle late-arriving data within the lateness bound, which directly reduces processing latency and Bigtable write contention.

Exam trap

Google Cloud often tests the misconception that increasing Bigtable nodes or changing Pub/Sub delivery mode will fix pipeline latency, when the real issue is the trigger configuration causing excessive speculative windowing and state churn.

How to eliminate wrong answers

Option A is wrong because the root cause is not Bigtable provisioning; the low write throughput is a downstream effect of the pipeline's trigger behavior, not a capacity issue. Option B is wrong because changing from exactly-once to at-least-once delivery would increase duplicates, not reduce them, and the subscription's exactly-once mode is not causing the redelivery overhead; the problem is the trigger frequency. Option D is wrong because nothing in the scenario indicates that Streaming Engine is disabled, and enabling it would not fix the trigger-induced out-of-order and duplicate events.

38
MCQhard

A financial services company uses Cloud Composer to orchestrate a daily workflow that includes a Dataproc job for risk analysis. The workflow sometimes fails because the Dataproc cluster creation times out. The cluster creation typically takes 3 minutes, but occasionally takes over 10 minutes. What is the most effective way to handle this variability?

A.Create a long-running Dataproc cluster that remains idle and reuse it for each workflow.
B.Implement a retry loop with exponential backoff in the DAG.
C.Use preemptible VMs for the cluster to reduce cost and improve creation speed.
D.Increase the cluster creation timeout in the Airflow configuration.
AnswerA

Reusing an existing cluster eliminates the creation step and associated timeout.

Why this answer

Option A is correct because creating a long-running Dataproc cluster and reusing it eliminates the variable cluster creation time that causes timeouts. Cloud Composer (Airflow) can manage cluster lifecycle separately from the workflow, ensuring the cluster is always available when the Dataproc job runs. This approach decouples cluster provisioning from job execution, making the workflow resilient to creation delays.

Exam trap

The trap here is that candidates often assume retries or timeout adjustments are sufficient for infrastructure variability, but the most effective solution is to eliminate the variable step entirely by reusing a persistent cluster.

How to eliminate wrong answers

Option B is wrong because retry loops with exponential backoff only handle transient failures after a timeout occurs, but they do not address the root cause—the variable cluster creation time—and can lead to long delays or eventual failure if creation consistently exceeds the timeout. Option C is wrong because preemptible VMs are designed to reduce cost, not improve creation speed; they are actually more likely to be reclaimed and can cause cluster creation to fail or take longer due to availability constraints. Option D is wrong because increasing the cluster creation timeout in Airflow configuration merely extends the wait time without solving the underlying variability; it can mask the problem and lead to longer workflow execution times without guaranteeing success.

39
MCQhard

Refer to the exhibit. A Dataflow pipeline writes to BigQuery table employee_records. The pipeline was working yesterday but fails today. What is the most likely cause?

A.The pipeline dropped the last_name field entirely.
B.The pipeline code was changed to send an integer for the last_name field.
C.The BigQuery table quota was exceeded.
D.The BigQuery table schema was changed from STRING to INTEGER for last_name.
AnswerB

The error clearly states that an integer was provided for a string field.

Why this answer

Option B is correct because if the pipeline code was changed to send an integer for the last_name field, BigQuery will reject the write due to a schema mismatch. BigQuery enforces strict type checking at ingestion time; an integer value cannot be written into a STRING column unless the schema explicitly allows coercion. Since the pipeline was working yesterday, the most likely change is in the data type being sent, not the schema itself.

Exam trap

Google Cloud often tests the misconception that schema changes in BigQuery are the primary cause of pipeline failures, when in fact the most common cause is a code change that alters the data type of a field being written, especially in streaming or batch pipelines where schema enforcement is strict.

How to eliminate wrong answers

Option A is wrong because dropping the last_name field entirely would produce a missing-field error rather than a type mismatch, and the error reports an integer supplied for a string field. Option C is wrong because an exceeded BigQuery quota would affect all writes, not just this pipeline, and would typically produce a 'quota exceeded' error message, not a schema mismatch failure. Option D is wrong because if the table schema had been changed from STRING to INTEGER for last_name, the error would report a string supplied for an integer field, which is the reverse of what the error states; moreover, schema changes are typically controlled and would be noticed, whereas a code change is a common oversight.
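The type check that rejects the write can be sketched in a few lines of plain Python. The schema below is hypothetical, since the exhibit is not reproduced in the text, and `find_type_mismatches` is an illustrative helper rather than part of any BigQuery client. It shows why a row that was valid yesterday fails once the code starts sending an integer for a string column.

```python
# Hypothetical table schema; the exhibit itself is not reproduced in the text.
SCHEMA = {"employee_id": int, "first_name": str, "last_name": str}


def find_type_mismatches(row, schema):
    """Return (field, expected, actual) for every value whose type differs from the schema."""
    mismatches = []
    for field, expected in schema.items():
        if field in row and not isinstance(row[field], expected):
            mismatches.append((field, expected.__name__, type(row[field]).__name__))
    return mismatches


yesterday_row = {"employee_id": 7, "first_name": "Ada", "last_name": "Lovelace"}
today_row = {"employee_id": 7, "first_name": "Ada", "last_name": 12345}

yesterday_errors = find_type_mismatches(yesterday_row, SCHEMA)
today_errors = find_type_mismatches(today_row, SCHEMA)
```

Running a check like this in a pipeline's unit tests is one way to catch the code change before it reaches production.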

40
Multi-Selectmedium

Your team is running a Dataflow streaming pipeline that reads from Pub/Sub, transforms data, and writes to BigQuery. You notice that the pipeline's backlog is growing and the processing latency has increased from seconds to minutes. You need to diagnose and resolve the issue. Which TWO actions should you take? (Choose two.)

Select 2 answers
A.Stop the pipeline, increase the number of workers in the streaming engine configuration, and restart it.
B.Increase the batch size in the WriteToBigQuery transform to reduce I/O operations.
C.Configure a dead-letter queue in Cloud Storage for failed messages to reduce reprocessing load.
D.Increase the maximum number of workers in the pipeline's autoscaling configuration to allow more compute resources.
E.Examine the Dataflow monitoring dashboard for metrics like system lag, data freshness, and worker throughput.
AnswersD, E

Allowing more workers can reduce backlog if the pipeline is CPU-bound.

Why this answer

Option D is correct because increasing the maximum number of workers in the autoscaling configuration allows Dataflow to scale out horizontally, adding more compute resources to handle the increased backlog and reduce processing latency. Dataflow's autoscaling algorithm uses metrics like backlog bytes and CPU utilization to decide when to add workers, but it is capped by the max workers setting. Raising this cap enables the pipeline to allocate more VMs, thus processing more messages per second and reducing the backlog.

Exam trap

Google Cloud often tests the misconception that you must stop a streaming pipeline to change worker count or that increasing batch size always improves throughput, when in fact Dataflow supports live autoscaling and larger batches can worsen latency.

41
MCQhard

A company runs a Dataproc cluster for nightly batch jobs. The cluster uses preemptible workers for cost savings. Recently, the jobs have been failing intermittently with 'Disk quota exceeded' errors on the persistent disks attached to the preemptible workers. The cluster is configured with a master node and 10 worker nodes, each with a 100 GB persistent disk. The preemptible workers are dynamically added and removed. What is the most likely cause and the best long-term solution?

A.The persistent disks of the preemptible workers are too small. Resize the persistent disks to 200 GB each.
B.The preemptible workers are using local SSDs that are not recreated on reclaim. Use non-preemptible workers with local SSDs instead.
C.The preemptible workers are exceeding the project's persistent disk quota in the region because every time a preempted worker restarts, it tries to attach a new disk. Increase the disk quota.
D.The preemptible workers do not have enough persistent disk space to store intermediate shuffle data. Switch to standard workers to avoid this issue.
AnswerC

C is correct because preemptible workers can cause disk quota exhaustion due to rapid creation/deletion of persistent disks.

Why this answer

Option C is correct because the intermittent 'Disk quota exceeded' errors on preemptible workers are caused by the project's regional persistent disk quota being exhausted. When a preemptible worker is reclaimed, the cluster attempts to attach a new persistent disk to the replacement worker, but the old disk is not immediately deleted, leading to a buildup of unattached disks that consume quota. The best long-term solution is to increase the persistent disk quota in the region to accommodate the temporary disks from preempted workers.

Exam trap

The trap here is that candidates mistakenly attribute the error to insufficient disk size or shuffle data capacity, rather than recognizing it as a regional quota exhaustion issue caused by orphaned disks from preempted workers.

How to eliminate wrong answers

Option A is wrong because resizing disks to 200 GB does not address the quota exhaustion issue; the error is about quota, not disk size, and increasing disk size would actually consume more quota per disk. Option B is wrong because local SSDs are ephemeral and not recommended for preemptible workers, as they are lost on preemption, and the error is about persistent disk quota, not local SSD recreation. Option D is wrong because the error is not about insufficient disk space for shuffle data; it is a quota limit error, and switching to standard workers would increase costs without fixing the underlying quota issue.

42
MCQmedium

A company needs to grant analysts access to a BigQuery table that contains sensitive PII columns. The analysts should be able to run aggregate queries on the entire dataset but must not see individual PII values. Which approach should the team use?

A.Create a user-defined function (UDF) that aggregates the data and grant analysts permission to call the UDF.
B.Use BigQuery row-level security to restrict access to non-PII rows only.
C.Create an authorized view that does not include the PII columns and grant analysts access to the view.
D.Use BigQuery column-level security with data masking to mask the PII columns for the analysts' role.
AnswerD

D is correct because column-level masking dynamically masks data based on user permissions without changing the table structure.

Why this answer

Option D is correct because BigQuery column-level security with data masking allows you to define masking policies on specific PII columns (e.g., a predefined rule such as default masking value or SHA-256 hash) that automatically transform the data for analysts' roles while still permitting aggregate queries over the entire dataset. This approach ensures analysts never see individual PII values, yet they can still run aggregate queries such as `COUNT`, `SUM`, and `AVG` over the dataset, meeting both requirements.

Exam trap

Google Cloud often tests the distinction between row-level security (filtering rows) and column-level security (masking or hiding columns), and candidates mistakenly choose row-level security when the requirement is to hide specific column values across all rows.

How to eliminate wrong answers

Option A is wrong because a UDF that aggregates data would still require analysts to have access to the underlying table to call the UDF, and the UDF cannot prevent analysts from querying the raw table directly if they have table-level permissions. Option B is wrong because row-level security filters entire rows based on a condition (e.g., `user_email = SESSION_USER()`), but here the requirement is to hide specific columns (PII) across all rows, not to exclude entire rows. Option C is wrong because an authorized view that omits PII columns would prevent analysts from seeing those columns, but it also prevents them from running aggregate queries that include PII columns (e.g., `AVG(salary)`), which the requirement explicitly allows as long as individual values are hidden.
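To see why a masked column can still support some aggregates, consider a deterministic hash mask. The sketch below uses Python's `hashlib` and is only an illustration of the idea, not BigQuery's implementation; `sha256_mask` and the sample email addresses are hypothetical. Because equal inputs hash to equal outputs, a distinct count over the masked values matches the distinct count over the raw values, while the raw values themselves never appear.

```python
import hashlib


def sha256_mask(value):
    """Replace a value with its SHA-256 hex digest (a deterministic, one-way mask)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical PII column with one repeated value.
emails = ["ann@example.com", "bob@example.com", "ann@example.com"]
masked = [sha256_mask(e) for e in emails]

distinct_raw = len(set(emails))
distinct_masked = len(set(masked))
```

Other masking rules, such as replacing every value with a default, hide more information but also preserve fewer aggregates, so the choice of rule depends on which queries analysts need.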

43
MCQeasy

A company stores raw data files in Cloud Storage in a bucket named 'raw-data'. After processing, the files are moved to a 'processed' bucket. To reduce costs, they want to automatically delete raw data older than 30 days. What should they do?

A.Enable object versioning on the 'raw-data' bucket and configure a lifecycle rule to delete noncurrent versions.
B.Configure a lifecycle rule on the 'raw-data' bucket to delete objects older than 30 days.
C.Set a retention policy on the 'raw-data' bucket to expire objects after 30 days.
D.Use a bucket policy that denies read access to objects older than 30 days.
AnswerB

B is correct because lifecycle management can automatically delete objects based on age.

Why this answer

Option B is correct because Cloud Storage lifecycle management allows you to set a rule that automatically deletes objects after a specified number of days from their creation time. By configuring a lifecycle rule on the 'raw-data' bucket to delete objects older than 30 days, the company can achieve cost reduction without manual intervention. This directly addresses the requirement to remove raw data files that have been processed and are no longer needed.

Exam trap

The trap here is confusing lifecycle deletion rules with retention policies or versioning: candidates often think retention policies delete data after a period, but they actually prevent deletion, while versioning with noncurrent deletion only removes old versions, not the current object.

How to eliminate wrong answers

Option A is wrong because enabling object versioning and deleting noncurrent versions does not delete the current (original) objects; it only removes older versions, so raw data files would remain in the bucket indefinitely. Option C is wrong because a retention policy (e.g., using Object Hold or Retention Policy) prevents deletion or modification of objects for a specified duration, which would keep the data for at least 30 days, not delete it after 30 days. Option D is wrong because a bucket policy that denies read access does not delete the objects; the files would still exist and incur storage costs, failing to meet the cost-reduction goal.
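A lifecycle rule of this kind is a small JSON document. The sketch below builds one in Python and serializes it; the variable names are illustrative, and the 30-day age comes from the question. It pairs a `Delete` action with an `age` condition, which Cloud Storage evaluates in days since object creation.

```python
import json

# Lifecycle configuration in the JSON shape Cloud Storage accepts:
# a list of rules, each pairing an action with a condition.
lifecycle_config = {
    "rule": [
        {
            "action": {"type": "Delete"},
            "condition": {"age": 30},  # age is measured in days since creation
        }
    ]
}

lifecycle_json = json.dumps(lifecycle_config, indent=2)
```

Saved to a file, this configuration could then be applied to the 'raw-data' bucket, for example with `gsutil lifecycle set` or the `--lifecycle-file` option of `gcloud storage buckets update`.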

44
MCQmedium

Your team runs a weekly batch ETL pipeline using Cloud Dataproc. The pipeline reads raw data from Cloud Storage, transforms it with Apache Spark, and writes results to BigQuery. Recently, the pipeline has been failing with the error 'Out of Memory' during the shuffle phase. The cluster uses standard worker nodes (n1-standard-4). What is the most effective way to resolve this without increasing total cost?

A.Increase the number of Spark partitions by setting spark.sql.shuffle.partitions to a higher value.
B.Increase the number of worker nodes by adding more n1-standard-4 instances.
C.Enable dynamic allocation and use preemptible VMs for some workers.
D.Switch worker nodes to n1-highmem-4 instances to provide more memory.
AnswerA

More partitions mean less data per partition, reducing memory usage per task. This can resolve OOM without added cost.

Why this answer

The 'Out of Memory' error during the shuffle phase indicates that individual executor tasks are processing too much data per partition. Increasing `spark.sql.shuffle.partitions` reduces the amount of data each task handles, lowering memory pressure per executor without adding more nodes or upgrading hardware. This directly addresses the shuffle memory bottleneck while keeping the total cluster cost unchanged.

Exam trap

The trap here is that candidates often assume memory errors must be solved by adding more memory (Option D) or more nodes (Option B), ignoring the cost constraint and the fact that repartitioning can resolve the issue without additional resources.

How to eliminate wrong answers

Option B is wrong because adding more worker nodes increases total cost, which violates the constraint of not increasing cost. Option C is wrong because enabling dynamic allocation and using preemptible VMs does not resolve the per-executor memory shortage; it only changes cluster scaling and cost structure, but the shuffle memory issue persists on the existing nodes. Option D is wrong because switching to n1-highmem-4 instances increases per-node memory but also increases cost per node, raising total cost unless the number of nodes is reduced, which is not specified and may not be feasible without losing parallelism.
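The effect of raising `spark.sql.shuffle.partitions` is simple arithmetic. The sketch below assumes a hypothetical 400 GiB of shuffle data and an even key distribution; neither figure comes from the question. It compares the average partition size at Spark's default of 200 shuffle partitions with the size at 2000 partitions.

```python
def mib_per_shuffle_partition(shuffle_bytes, num_partitions):
    """Average shuffle data per partition in MiB, assuming an even key distribution."""
    return shuffle_bytes / num_partitions / (1024 ** 2)


SHUFFLE_BYTES = 400 * 1024 ** 3  # hypothetical 400 GiB shuffled per run

# Spark's default is 200 shuffle partitions.
default_size = mib_per_shuffle_partition(SHUFFLE_BYTES, 200)
# Ten times more partitions means roughly one tenth of the data per task.
tuned_size = mib_per_shuffle_partition(SHUFFLE_BYTES, 2000)
```

Under these assumptions each task drops from about 2 GiB to about 205 MiB of shuffle data, with no change to the cluster. Skewed keys would weaken this effect, because a single hot key still lands in one partition.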

45
MCQhard

You are implementing a data pipeline that reads from Cloud Storage (parquet files), transforms data with Cloud Dataflow, and writes to BigQuery. The pipeline runs on a batch schedule every hour. You notice that the Dataflow job takes 10 minutes, but the overall pipeline latency is 15 minutes due to file availability and scheduling. The business requires latency under 5 minutes. Which change should you make?

A.Switch to streaming pipeline with .watchForNewFiles() and process files as they arrive
B.Batch the hourly data into a single larger hourly run
C.Use a larger machine type for the Dataflow workers
D.Increase the number of workers and use smaller input files
AnswerA

This reduces latency by triggering processing immediately.

Why this answer

The root cause of the latency is file availability and scheduling delay, not the processing time. Switching to a streaming pipeline with `.watchForNewFiles()` (or the equivalent `FileIO.match().continuously()`) allows Dataflow to process files as soon as they arrive in Cloud Storage, eliminating the batch scheduling wait and reducing overall latency to near the processing time.

Exam trap

Google Cloud often tests the distinction between reducing processing time (compute optimization) and reducing scheduling/availability latency (pipeline architecture change), leading candidates to mistakenly choose worker scaling or batching options.

How to eliminate wrong answers

Option B is wrong because batching the hourly data into a single larger run would increase the processing time and does not address the file availability and scheduling delay that cause the 5-minute overhead. Option C is wrong because using a larger machine type for Dataflow workers would only reduce the 10-minute processing time, not the 5-minute scheduling and file availability delay. Option D is wrong because increasing the number of workers and using smaller input files could reduce processing time but does not eliminate the scheduling wait or the delay waiting for files to become available.

46
MCQmedium

Your Dataflow streaming pipeline is reading from Cloud Pub/Sub and writing to BigQuery. Users report occasional data duplication in the BigQuery table. You verify the pipeline uses exactly-once processing and idempotent writes. The Dataflow monitoring shows no errors, but the pipeline has occasional worker restarts. What is the most likely cause of the duplicates?

A.The pipeline is using a global window with an early trigger, causing late data to be reprocessed.
B.The Pub/Sub subscription is configured with at-least-once delivery, causing duplicate messages.
C.The BigQuery table has a time-based partitioning column that is not aligned with the event timestamp.
D.The pipeline does not set the insertId parameter in the BigQuery streaming output.
AnswerD

BigQuery streaming inserts use insertId for deduplication. Without it, retried inserts may create duplicate rows.

Why this answer

Option D is correct because BigQuery's streaming API uses the `insertId` parameter to deduplicate records within the streaming buffer. Without a unique `insertId`, BigQuery cannot detect and discard duplicate inserts that may occur when Dataflow retries a write after a worker restart. Even with exactly-once processing in the pipeline, the BigQuery streaming endpoint itself is at-least-once, so the `insertId` is essential for deduplication.

Exam trap

Google Cloud often tests the misconception that exactly-once processing in the pipeline (Dataflow) automatically guarantees exactly-once delivery to the sink (BigQuery), ignoring that the sink itself may require explicit deduplication parameters like `insertId`.

How to eliminate wrong answers

Option A is wrong because a global window with an early trigger would cause multiple emissions per window, but the pipeline uses exactly-once processing and idempotent writes, so any late data would be handled without duplication. Option B is wrong because Pub/Sub subscriptions are inherently at-least-once, but Dataflow's exactly-once processing (via checkpointing and deduplication) handles this; the issue is downstream at BigQuery. Option C is wrong because time-based partitioning misalignment would cause data to land in the wrong partition, not duplicate rows; duplication is a separate concern related to insert identification.
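The role of an insert ID can be shown with a toy sink. This is a plain-Python model, not the BigQuery client: `DedupingSink` and the ID `evt-001` are hypothetical, and the real service's deduplication is best-effort rather than guaranteed. It shows that a retried write is only recognizable as a repeat when it carries the same identifier as the first attempt.

```python
class DedupingSink:
    """Toy sink that drops rows whose insert ID it has already seen."""

    def __init__(self):
        self.rows = []
        self._seen_ids = set()

    def insert(self, row, insert_id=None):
        if insert_id is not None:
            if insert_id in self._seen_ids:
                return False  # duplicate of an earlier attempt: ignore it
            self._seen_ids.add(insert_id)
        self.rows.append(row)
        return True


row = {"event": "login", "user": "u1"}

with_ids = DedupingSink()
with_ids.insert(row, insert_id="evt-001")
with_ids.insert(row, insert_id="evt-001")  # retry after a worker restart

without_ids = DedupingSink()
without_ids.insert(row)
without_ids.insert(row)  # the same retry, but nothing identifies it as a repeat
```

The sink with IDs stores one row; the sink without IDs stores two, which matches the duplication described in the scenario.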

47
MCQmedium

Refer to the exhibit. A BigQuery dataset has the IAM policy shown above. An analyst is trying to run a SELECT query on a table in this dataset but receives an 'Access Denied' error. What is the most likely reason?

A.The analyst does not have permission to list datasets in the project.
B.The analyst only has the roles/bigquery.metadataViewer role, which does not allow reading table data.
C.The table is in a different region than the dataset, and the analyst's query is not cross-region compatible.
D.The analyst has not been granted the 'bigquery.jobs.create' permission to run queries.
AnswerB

B is correct because the metadataViewer role only allows viewing metadata, not querying data.

Why this answer

The roles/bigquery.metadataViewer role grants permissions to view table and dataset metadata (e.g., table names, schemas) but does not include the bigquery.tables.getData permission required to read table rows. Therefore, when the analyst runs a SELECT query, BigQuery denies access because the role lacks the data-reading privilege. This is the most likely reason for the 'Access Denied' error.

Exam trap

Google Cloud often tests the distinction between metadata-viewing roles and data-reading roles, trapping candidates who assume that being able to see table names and schemas implies permission to query the data.

How to eliminate wrong answers

Option A is wrong because listing datasets is not required to run a SELECT query; the error is about reading table data, not dataset enumeration. Option C is wrong because BigQuery does not enforce cross-region compatibility at the dataset-table level; tables reside within the same dataset and region, and cross-region queries are allowed with appropriate permissions. Option D is wrong because the 'bigquery.jobs.create' permission is needed to submit a query job, but the error specifically indicates a data access issue, not a job creation failure; the analyst likely has this permission if they can attempt a query.
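The distinction between metadata access and data access comes down to one permission. The sketch below models it in plain Python with deliberately partial permission sets; the real roles contain more permissions than are listed, and `can_read_rows` is a hypothetical helper. It checks whether any granted role carries `bigquery.tables.getData`.

```python
# Simplified, partial permission sets for illustration; the real roles
# contain more permissions than are listed here.
ROLE_PERMISSIONS = {
    "roles/bigquery.metadataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.list",
    },
    "roles/bigquery.dataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.list",
        "bigquery.tables.getData",
    },
}


def can_read_rows(roles):
    """True if any granted role carries the permission needed to read table data."""
    return any(
        "bigquery.tables.getData" in ROLE_PERMISSIONS.get(role, set())
        for role in roles
    )


analyst_today = can_read_rows(["roles/bigquery.metadataViewer"])
analyst_fixed = can_read_rows(["roles/bigquery.dataViewer"])
```

Granting a role that includes the data-reading permission, such as the data viewer role on the dataset, is the kind of change that would resolve the error in this scenario.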

48
MCQeasy

Your Cloud Dataflow pipeline is failing due to a 'Permission denied' error when writing to a BigQuery table. The error persists even though the service account has bigquery.dataEditor role. What is the most likely missing permission?

A.pubsub.topics.publish on a notification topic
B.storage.objects.create on the staging bucket
C.bigquery.tables.get on the table
D.bigquery.tables.create on the dataset
AnswerD

Dataflow requires create permission if table is created automatically.

Why this answer

Option D is correct because Dataflow needs bigquery.tables.create if the destination table doesn't exist and the pipeline has to create it. Option A is wrong because Pub/Sub permissions are not needed to write to BigQuery. Option B is wrong because bucket permissions are for staging, not for writing to BigQuery. Option C is wrong because read permissions are not needed for writing.

49
Multi-Selectmedium

Which TWO are best practices for managing a Cloud Dataflow pipeline in production?

Select 2 answers
A.Always use batch mode for streaming data to reduce cost
B.Disable autoscaling to keep compute costs predictable
C.Set up Cloud Monitoring alerts based on Dataflow job metrics
D.Use pipeline updates (update) to modify running streaming pipelines
E.Restart the pipeline when code changes are needed
AnswersC, D

Alerts help detect issues proactively.

Why this answer

Option C is correct because Cloud Monitoring alerts on Dataflow job metrics (e.g., system lag, watermark delay, or element count) enable proactive detection of pipeline health issues such as backpressure or stuck workers. This is a best practice for production pipelines to ensure reliability and timely intervention.

Exam trap

Google Cloud often tests the misconception that disabling autoscaling or restarting pipelines is acceptable for cost control or simplicity, when in fact these actions violate production best practices for reliability and data integrity.

50
MCQhard

An organization uses Cloud Dataproc to run Spark jobs that process sensitive data. They need to ensure data is encrypted at rest and that only specific service accounts can access the data on cluster disks. What should they do?

A.Rely on the default encryption at rest and use VPC Service Controls to limit data exfiltration.
B.Use customer-supplied encryption keys (CSEK) and write a startup script to mount encrypted disks.
C.Enable encryption at rest using Google-managed encryption keys and grant all users the Dataproc Editor role.
D.Use customer-managed encryption keys (CMEK) for the cluster's persistent disks and assign a dedicated service account to the cluster with minimal IAM roles.
AnswerD

CMEK provides control over keys, and a dedicated service account restricts data access.

Why this answer

Option D is correct because using customer-managed encryption keys (CMEK) allows the organization to control and manage the encryption keys for persistent disks attached to the Dataproc cluster, ensuring data at rest is encrypted. Assigning a dedicated service account with minimal IAM roles ensures that only that service account can access the data on the cluster disks, following the principle of least privilege.

Exam trap

The trap here is that candidates often confuse CSEK (keys the customer supplies with each request) with CMEK (keys the customer manages in Cloud KMS), or assume that default encryption combined with VPC Service Controls is sufficient for granular access control to disk data.

How to eliminate wrong answers

Option A is wrong because default encryption at rest uses Google-managed keys, which does not allow the organization to control key access or restrict which service accounts can access data on cluster disks; VPC Service Controls prevent data exfiltration but do not enforce service-account-level access to disk data. Option B is wrong because customer-supplied encryption keys (CSEK) are not the supported way to encrypt Dataproc cluster disks, where the customer-controlled option is CMEK; mounting encrypted disks via a startup script is not a supported or recommended method for Dataproc clusters. Option C is wrong because granting all users the Dataproc Editor role would allow any user to access and modify cluster resources, violating the requirement that only specific service accounts can access data on cluster disks.

51
MCQmedium

A Dataflow pipeline reads log files from Cloud Storage, parses them into LogEvent objects, and writes to BigQuery. The pipeline fails with the above errors. What is the most likely cause?

A.The LogEvent class does not have a no-argument constructor.
B.The pipeline is missing required import statements for LogEvent.
C.The BigQuery table schema does not match the LogEvent fields.
D.The log files are not in the expected format, causing parsing failures.
AnswerA

Beam requires a no-arg constructor for Avro or Serializable coders.

Why this answer

Apache Beam must be able to decode custom PCollection element types (like LogEvent) on remote workers during distributed processing, and coders that instantiate the class reflectively need a no-argument constructor to do so, especially when using the Dataflow runner. Without it, the pipeline fails at runtime with a serialization error because a reflection-based coder such as AvroCoder cannot reconstruct the object.

Exam trap

The trap here is that candidates confuse runtime serialization errors with compile-time import issues or schema mismatches, overlooking the fundamental requirement for a no-argument constructor in Beam's default coders.

How to eliminate wrong answers

Option B is wrong because missing import statements would cause a compile-time error, not a runtime pipeline failure with the described errors. Option C is wrong because a BigQuery table schema mismatch would produce a write-time error (e.g., schema mismatch), not a serialization failure during parsing. Option D is wrong because parsing failures from malformed log files would result in exceptions during the parse step, not a serialization error related to the LogEvent class itself.

52
MCQhard

A Dataflow streaming pipeline processes events from Pub/Sub and writes to BigQuery using a dynamically generated table destination based on the event type. The pipeline is experiencing high latency, and the worker CPU utilization is low. Which action is most likely to reduce latency?

A.Increase the batch size parameter in the BigQuery sink to write larger batches.
B.Reduce the number of workers to increase CPU utilization per worker.
C.Enable Dataflow Streaming Engine to improve throughput and reduce latency.
D.Increase the worker disk size to reduce I/O wait time.
AnswerC

C is correct because Streaming Engine moves state to the backend, reducing worker overhead and improving latency.

Why this answer

Option C is correct because Dataflow Streaming Engine moves state and computation from worker VMs to the backend service, reducing per-worker overhead and enabling better resource utilization. This directly addresses the symptom of high latency with low CPU utilization, which indicates workers are bottlenecked on shuffle or state management rather than compute.

Exam trap

The trap here is that candidates often assume low CPU utilization means workers are underutilized and should be scaled down (Option B), when in fact low CPU with high latency indicates a bottleneck in shuffle or state management that is not compute-bound.

How to eliminate wrong answers

Option A is wrong because increasing batch size in the BigQuery sink can actually increase latency for streaming pipelines, as larger batches require more time to fill before writing, and the issue here is not sink throughput but worker inefficiency. Option B is wrong because reducing the number of workers would decrease parallelism and likely worsen latency, and low CPU utilization suggests workers are not compute-bound but rather waiting on I/O or shuffle. Option D is wrong because increasing worker disk size does not reduce I/O wait time for streaming pipelines; disk I/O is not the bottleneck when CPU is low and the pipeline uses Pub/Sub and BigQuery, which are network-bound.

53
Multi-Selecthard

Which THREE best practices should be followed when designing a Dataflow pipeline for real-time data processing?

Select 3 answers
A.Set up monitoring alerts for system lag and data freshness.
B.Use static side inputs that are loaded once at pipeline start.
C.Implement watermark estimation to handle late data.
D.Use global windows with early triggers for low latency.
E.Use idempotent sinks to ensure exactly-once processing.
AnswersA, C, E

Monitoring is critical for streaming pipelines.

Why this answer

Option A is correct because monitoring alerts for system lag and data freshness are essential for maintaining operational visibility in real-time Dataflow pipelines. System lag (the time between data ingestion and processing) and data freshness (how current the processed output is) directly impact the pipeline's ability to meet latency SLAs. Without these alerts, issues like worker backpressure or Pub/Sub subscription backlog can go unnoticed, leading to stale or lost data.

Exam trap

Google Cloud often tests the misconception that static side inputs are acceptable for streaming pipelines, but they are only appropriate for batch or bounded data; real-time pipelines require side inputs that can be periodically refreshed (e.g., via a streaming source or a periodic lookup).
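
The idempotent sink in Option E can be illustrated with a minimal sketch. It assumes every event carries a unique ID (the `event_id` field and the `IdempotentSink` class are hypothetical names, not part of any Dataflow or BigQuery API) and shows why a redelivered event does not create a duplicate row when writes are keyed by that ID.

```python
class IdempotentSink:
    """Toy in-memory sink: writes are upserts keyed by a unique event ID."""

    def __init__(self):
        self.rows = {}

    def write(self, event_id, payload):
        # Writing the same event twice overwrites the same key,
        # so a retry or replay leaves the final state unchanged.
        self.rows[event_id] = payload


sink = IdempotentSink()
# "e1" is delivered twice, as can happen with at-least-once delivery.
for event_id, payload in [("e1", 10), ("e2", 20), ("e1", 10)]:
    sink.write(event_id, payload)

print(len(sink.rows))
```

A real sink would rely on the destination's own upsert or deduplication mechanism; the point here is only that the write is keyed, not appended.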

54
MCQmedium

A company uses Cloud Composer to orchestrate a daily ETL pipeline that includes multiple Dataproc jobs. The pipeline processes sensitive financial data. The security team requires that all data in transit be encrypted, and all Cloud Storage buckets used by the pipeline should have uniform bucket-level access enabled and VPC Service Controls. The pipeline currently uses a single Cloud Composer environment in us-east1. The Dataproc clusters are created using the standard image and use custom service accounts with minimal permissions. The pipeline runs successfully during testing, but in production, the Dataproc jobs fail with 'Access Denied' errors when trying to write to a Cloud Storage bucket. The bucket has uniform bucket-level access enabled and is inside a VPC Service Controls perimeter. The Dataproc service account has the Storage Object Admin role at the project level. What is the most likely cause of the access denied error?

A.The service account does not have the Storage Object Admin role on the bucket.
B.Data in transit encryption is not enabled for the Cloud Storage bucket.
C.Uniform bucket-level access prevents writes from service accounts.
D.The Dataproc cluster is not in the VPC Service Controls perimeter.
AnswerD

VPC Service Controls deny access from resources outside the perimeter.

Why this answer

The Dataproc cluster is created outside the VPC Service Controls perimeter, so even though the service account has the Storage Object Admin role at the project level, requests from the cluster are blocked by the perimeter's ingress/egress rules. VPC Service Controls enforce a security boundary that prevents resources outside the perimeter from accessing protected services like Cloud Storage, regardless of IAM permissions. The 'Access Denied' error in production, despite successful testing, strongly indicates a perimeter configuration mismatch.
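
The reasoning above can be reduced to a toy decision model: a request succeeds only if IAM grants the role and the perimeter check passes. This is a deliberately simplified sketch (the function and parameter names are hypothetical, and real perimeters also have ingress/egress rules and access levels that are ignored here).

```python
def request_allowed(iam_grants_role, caller_in_perimeter, resource_in_perimeter):
    """Simplified model: IAM and the service perimeter are independent checks."""
    # A resource outside any perimeter is not subject to the perimeter check.
    perimeter_ok = (not resource_in_perimeter) or caller_in_perimeter
    return iam_grants_role and perimeter_ok


# Testing: cluster and bucket are both inside the perimeter.
print(request_allowed(True, caller_in_perimeter=True, resource_in_perimeter=True))
# Production: the role is granted, but the cluster sits outside the perimeter.
print(request_allowed(True, caller_in_perimeter=False, resource_in_perimeter=True))
```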

Exam trap

Google Cloud often tests the distinction between IAM permissions and VPC Service Controls boundaries, tricking candidates into thinking a project-level IAM role is sufficient when the real blocker is network-level perimeter enforcement.

How to eliminate wrong answers

Option A is wrong because the service account has the Storage Object Admin role at the project level, which grants write access to all buckets in the project, including this one; uniform bucket-level access does not override project-level IAM roles. Option B is wrong because data in transit encryption is automatically enforced by Google Cloud for all API calls to Cloud Storage (using HTTPS/TLS); it is not a bucket setting that can be left disabled, so it cannot be the cause of the error. Option C is wrong because uniform bucket-level access does not prevent writes from service accounts; it simply disables ACLs and requires all access decisions to be made via IAM policies, which the service account already has via its project-level role.

55
Multi-Selectmedium

Which TWO actions should be taken to optimize a Dataflow streaming pipeline that is experiencing high system lag and backpressure? (Choose two.)

Select 2 answers
A.Use a higher memory machine type for all workers.
B.Increase the number of worker threads by adjusting the streaming worker's parallelism hint.
C.Enable autoscaling and increase the maximum number of workers.
D.Reduce the number of workers to decrease cost.
E.Set maxNumWorkers to 1 to force single-worker processing.
AnswersB, C

More threads can increase throughput per worker.

Why this answer

Option B is correct because increasing the parallelism hint allows each worker to process more bundles concurrently, which can reduce backpressure by improving throughput without adding more workers. Option C is correct because enabling autoscaling and increasing the maximum number of workers allows the pipeline to dynamically scale out to handle increased load, directly mitigating high system lag and backpressure.

Exam trap

Google Cloud often tests the misconception that simply adding more memory or reducing workers will solve backpressure, when in fact the correct approaches involve increasing parallelism or scaling out the worker pool.

56
Drag & Dropmedium

Drag and drop the steps to set up Cloud IAP (Identity-Aware Proxy) for an App Engine app into the correct order.

Why this order

IAP verifies identity and authorization before allowing access to the application.

57
MCQhard

A data pipeline ingests real-time events from Cloud Pub/Sub into BigQuery using Dataflow. The pipeline uses a sliding window of 5 minutes with a 1-minute period to aggregate event counts. Recently, the pipeline started failing with 'The worker failed to provide a heartbeat.' The Dataflow logs show high CPU usage on the workers. What is the best course of action to resolve the issue?

A.Increase the number of workers and enable autoscaling to distribute the load.
B.Reduce the number of workers to minimize coordination overhead.
C.Use a global window with a trigger to reduce state size.
D.Change the windowing to a fixed 5-minute window to reduce computations.
AnswerA

More workers spread the CPU load of the sliding-window aggregation, so each worker can keep sending heartbeats.

Why this answer

The 'worker failed to provide a heartbeat' error combined with high CPU usage indicates that workers are overloaded and cannot process data fast enough to maintain their heartbeat to the Dataflow service. Increasing the number of workers and enabling autoscaling distributes the computational load across more machines, reducing per-worker CPU pressure and allowing heartbeats to be sent on time. This directly addresses the root cause of resource exhaustion.

Exam trap

Google Cloud often tests the misconception that reducing workers or changing window types is a universal fix for resource exhaustion, when in fact the immediate solution for heartbeat failures due to high CPU is to scale out the worker pool.

How to eliminate wrong answers

Option B is wrong because reducing the number of workers would concentrate the same workload on fewer machines, increasing per-worker CPU usage and worsening the heartbeat failure. Option C is wrong because using a global window with a trigger does not reduce state size; it would accumulate all events into a single unbounded window, potentially increasing memory pressure and CPU overhead. Option D is wrong because switching to a fixed 5-minute window changes the semantics of the aggregation (non-overlapping windows, so results update every 5 minutes instead of every minute); it would cut per-element work, since each event would land in one window instead of five, but only by altering what the pipeline computes, and workers may still be overloaded if the underlying load is high.
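
To see how the windowing choice affects per-element work, the sketch below assigns one timestamp to windows by hand. It is plain Python rather than Beam code, and the function names are illustrative; it assumes windows are aligned to multiples of the period and are half-open intervals `[start, end)`.

```python
def sliding_windows(ts, size, period):
    """Return (start, end) for every sliding window containing timestamp ts (seconds)."""
    start = ts - ts % period  # latest window start at or before ts
    windows = []
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return windows


def fixed_window(ts, size):
    """Return the single fixed window containing timestamp ts (seconds)."""
    start = ts - ts % size
    return [(start, start + size)]


event_ts = 754
# 5-minute windows sliding every minute versus one fixed 5-minute window.
print(len(sliding_windows(event_ts, size=300, period=60)))
print(len(fixed_window(event_ts, size=300)))
```

With a 300-second size and 60-second period, each event is counted in size / period = 5 windows, which is the extra aggregation work the workers must absorb.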

58
MCQmedium

A company uses Cloud Composer to orchestrate data pipelines. One DAG fails intermittently with the error: 'Task received SIGTERM signal.' The task runs a long-running Dataproc job. What is the most likely cause?

A.The Dataproc cluster was preempted by Google Cloud.
B.The Dataproc job failed due to an error in the code.
C.The Cloud Composer environment ran out of disk space.
D.The Airflow task timed out due to the default execution timeout.
AnswerD

SIGTERM indicates the task was killed, possibly due to timeout.

Why this answer

The default Airflow task execution timeout is 28 days in Cloud Composer, but individual tasks can have a shorter `execution_timeout` set in the DAG definition. When a long-running Dataproc job exceeds this timeout, Airflow sends a SIGTERM signal to the task to kill it, resulting in the observed error. This is the most likely cause because the error message directly indicates a forced termination by the Airflow scheduler, not an infrastructure or code failure.

Exam trap

The trap here is that candidates often attribute SIGTERM errors to infrastructure issues like cluster preemption or disk space, when in fact the error is a direct result of Airflow's task timeout mechanism, which is a common misconfiguration in long-running pipeline tasks.

How to eliminate wrong answers

Option A is wrong because Dataproc cluster preemption would cause a different error (e.g., 'Cluster not found' or 'Job failed due to node loss'), not a SIGTERM signal from Airflow. Option B is wrong because a code error in the Dataproc job would produce a job failure status and a different error message (e.g., 'Job failed with exit code 1'), not a SIGTERM from the orchestrator. Option C is wrong because running out of disk space in the Cloud Composer environment would cause worker crashes or DAG parsing errors, not a targeted SIGTERM to a specific task.

59
MCQhard

You are designing a disaster recovery strategy for a critical streaming data processing pipeline. The pipeline reads from Cloud Pub/Sub, processes with Dataflow streaming, and writes to BigQuery. The required RPO is less than 1 minute, and RTO is less than 5 minutes. Which architecture should you implement?

A.Use cross-region replication with two separate Dataflow pipelines reading from a Pub/Sub cross-region subscription and writing to a BigQuery cross-region dataset
B.Run the pipeline using Dataflow batch mode with a 1-minute trigger and store intermediate results in Cloud Storage
C.Deploy resources in a single region with regular backups to Cloud Storage
D.Use a single Dataflow pipeline with a standby cluster in another region, but failover is manual
AnswerA

Cross-region replication ensures data is available in another region with minimal latency, meeting RPO and RTO.

Why this answer

Option A is correct because cross-region replication for Pub/Sub ensures messages are available in a secondary region with sub-second latency, and a separate Dataflow pipeline reading from a cross-region subscription provides active-active processing. BigQuery cross-region dataset replication (using the 'cross-region' dataset location, e.g., EU or US multi-region, or a specific dual-region configuration) ensures data durability and availability within the RPO of <1 minute. This architecture meets both RPO and RTO by eliminating single points of failure and enabling automatic failover without manual intervention.

Exam trap

The trap here is that candidates often assume a single pipeline with a standby cluster is sufficient, but they overlook that manual failover cannot meet the strict RTO of <5 minutes, and that cross-region replication must be active-active (not active-passive) to achieve sub-minute RPO.

How to eliminate wrong answers

Option B is wrong because Dataflow batch mode with a 1-minute trigger cannot achieve sub-minute RPO; batch processing introduces inherent latency and does not provide continuous streaming, so the RPO of <1 minute is not guaranteed. Option C is wrong because deploying in a single region with regular backups to Cloud Storage fails to meet the RTO of <5 minutes; restoring from backups takes significantly longer than 5 minutes, and there is no active standby to fail over to. Option D is wrong because a manual failover process cannot achieve the RTO of <5 minutes; manual intervention introduces unpredictable delays, and a standby cluster without automatic failover violates the RTO requirement.

60
MCQhard

You are designing a data pipeline that must process sensitive customer data with strict access controls. The data is ingested via Cloud Pub/Sub, processed by Cloud Dataflow, and stored in BigQuery. The security team requires that data is encrypted at rest and in transit, and that access is limited to specific service accounts. Which implementation strategy meets all requirements?

A.Use Cloud KMS for BigQuery only; leave Dataflow with default encryption
B.Use VPC Service Controls and Cloud Armor for network security
C.Use default Google-managed encryption keys and IAM roles only
D.Use CMEK for Pub/Sub, Dataflow, and BigQuery, and VPC-SC with per-service service accounts
AnswerD

CMEK ensures encryption control; VPC-SC and service accounts enforce access.

Why this answer

Option D is correct because it combines Customer-Managed Encryption Keys (CMEK) for all three services (Pub/Sub, Dataflow, BigQuery) to ensure data is encrypted at rest with keys controlled by the customer, and uses VPC Service Controls (VPC-SC) with per-service service accounts to enforce perimeter security and least-privilege access. CMEK covers the encryption-at-rest requirement; encryption in transit is already provided by the TLS that Google Cloud applies to traffic between these services; and the strict access controls are met via service accounts and VPC-SC.

Exam trap

Google Cloud often tests the misconception that network security tools like VPC Service Controls or Cloud Armor alone satisfy encryption requirements, or that default encryption is sufficient when customer-managed keys are explicitly required.

How to eliminate wrong answers

Option A is wrong because it only applies Cloud KMS to BigQuery, leaving Dataflow with default Google-managed encryption, which does not meet the requirement for customer-controlled encryption at rest across all services. Option B is wrong because VPC Service Controls and Cloud Armor provide network security and perimeter controls but do not address data encryption at rest or in transit, which is a separate requirement. Option C is wrong because default Google-managed encryption keys and IAM roles alone do not provide customer-controlled encryption keys (CMEK) or the granular access controls enforced by VPC-SC with per-service service accounts.

61
Multi-Selectmedium

Which TWO security best practices should be applied to secure data in transit for a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery? (Choose 2)

Select 2 answers
A.Use Cloud Key Management Service (Cloud KMS) to encrypt data in transit
B.Enable TLS encryption on all endpoints
C.Use VPC Service Controls to create a service perimeter
D.Use Cloud Armor to protect against DDoS
E.Use private IP addresses for Dataflow workers
AnswersB, C

TLS ensures data encryption between Google Cloud services, which is already enabled by default but should be verified.

Why this answer

Option B is correct because TLS (Transport Layer Security) encryption ensures that data is encrypted during transmission between endpoints, such as between Cloud Pub/Sub and Dataflow workers, and between Dataflow workers and BigQuery. This is a fundamental security best practice for protecting data in transit against eavesdropping and man-in-the-middle attacks.

Exam trap

The trap here is that candidates often confuse encryption at rest (Cloud KMS) with encryption in transit, or assume that using private IPs alone secures data in transit without needing TLS.

62
MCQeasy

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

A.Dataproc Serverless with PySpark
B.Dataflow with batch mode
C.Cloud Data Fusion
D.BigQuery Data Transfer Service
AnswerA

Dataproc Serverless is cost-effective and suitable for batch processing of large CSVs.

Why this answer

Dataproc Serverless with PySpark is the most cost-effective choice because it eliminates cluster management overhead and automatically scales resources based on workload, charging only for the processing time used. For 10 GB CSV files processed daily within a 24-hour window, the serverless model avoids the fixed costs of a persistent cluster, making it ideal for batch, non-time-sensitive jobs. PySpark's native support for CSV parsing and BigQuery integration via the Spark BigQuery connector ensures efficient data loading without additional services.

Exam trap

The trap here is that candidates often choose Dataflow (Option B) because it is a popular batch processing service, but they overlook that Dataproc Serverless is more cost-effective for non-time-sensitive, large CSV batch jobs due to its serverless pricing model and native Spark support for CSV processing.

How to eliminate wrong answers

Option B is wrong because Dataflow with batch mode, while capable, uses a streaming-optimized runner that incurs higher per-job overhead and cost for simple batch CSV processing, especially when the data is not time-sensitive and can tolerate longer processing windows. Option C is wrong because Cloud Data Fusion is a visual ETL tool designed for complex data pipelines and integration scenarios, not for cost-effective batch processing of large CSV files; it adds unnecessary abstraction and cost for a straightforward load operation. Option D is wrong because BigQuery Data Transfer Service is designed for scheduled imports from SaaS applications (e.g., Google Ads, YouTube) and scheduled loads from Cloud Storage; it can load CSV files as they are, but it offers no step for custom transformations or PySpark logic, making it unsuitable when the raw CSV files must be processed before loading.

63
MCQmedium

A company is building a real-time streaming pipeline to ingest clickstream events from web servers, enrich them with user profile data from Cloud Bigtable, and aggregate metrics into BigQuery. The expected throughput is 10,000 events per second with occasional spikes up to 50,000. The data must be processed with low latency (seconds) and exactly-once semantics. Which Google Cloud service should be the core processing engine?

A.Cloud Dataflow (Apache Beam runner)
B.Cloud Pub/Sub with Cloud Functions
C.Cloud Dataproc with Apache Spark Streaming
D.Cloud Data Fusion
AnswerA

Dataflow provides auto-scaling, exactly-once semantics, low latency, and native integration with BigQuery and Bigtable.

Why this answer

Cloud Dataflow, as a managed Apache Beam runner, is the correct choice because it provides exactly-once processing semantics, low-latency streaming (sub-second to seconds), and autoscaling to handle throughput spikes from 10,000 to 50,000 events per second. Its unified batch and streaming model allows you to enrich clickstream events with user profile data from Cloud Bigtable via side inputs or asynchronous lookups, and write aggregated metrics to BigQuery with exactly-once guarantees using the Beam BigQuery I/O connector.

Exam trap

Google Cloud often tests the misconception that Cloud Pub/Sub with Cloud Functions is sufficient for low-latency streaming, but candidates overlook that Cloud Functions lacks stateful processing and exactly-once semantics, making it unsuitable for aggregation and enrichment at high throughput.

How to eliminate wrong answers

Option B (Cloud Pub/Sub with Cloud Functions) is wrong because Cloud Functions has a maximum timeout of 9 minutes and does not support exactly-once processing semantics; it is at-least-once by default and lacks checkpointing for stateful operations like aggregation. Option C (Cloud Dataproc with Apache Spark Streaming) is wrong because Spark Streaming's micro-batch architecture introduces a minimum latency of several seconds (typically 5-10 seconds), which does not meet the 'seconds' low-latency requirement, and managing exactly-once semantics requires additional configuration (e.g., Kafka offsets) that is not natively handled by the managed service. Option D (Cloud Data Fusion) is wrong because it is a visual ETL tool designed for batch-oriented data integration and does not support real-time streaming ingestion or exactly-once processing; its pipelines are not suitable for sub-second latency or high-throughput event streams.

64
Multi-Selectmedium

Which THREE features of Cloud Pub/Sub guarantee at-least-once delivery and enable exactly-once processing downstream? (Choose three.)

Select 3 answers
A.Subscriber-retry policy with exponential backoff.
B.Exactly-once delivery source feature (enabled by default in current gcloud).
C.Message ordering by message key.
D.Cloud Dataproc integration for message replay.
E.Acknowledgment deadlines and message persistence.
AnswersA, B, E

Retries ensure messages are eventually delivered on failure.

Why this answer

Option A is correct because a subscriber-retry policy with exponential backoff ensures that messages that fail to be processed are retried with increasing delays, preventing transient failures from causing message loss. This mechanism, combined with Pub/Sub's persistent storage, guarantees that each message is delivered at least once, as the subscriber will keep retrying until it acknowledges the message.
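
The shape of an exponential backoff schedule can be sketched as follows. This is an illustration of the general technique, not Pub/Sub's internal algorithm; the 10-second minimum, 600-second maximum, and doubling factor are assumptions chosen for the example.

```python
def backoff_delays(attempts, minimum=10.0, maximum=600.0, factor=2.0):
    """Return the delay in seconds before each retry, capped at `maximum`."""
    delays = []
    delay = minimum
    for _ in range(attempts):
        delays.append(min(delay, maximum))
        delay *= factor
    return delays


# Delays grow geometrically until they hit the cap, then stay there.
print(backoff_delays(8))
```

Because retries continue until the message is acknowledged, the same message can reach the subscriber more than once, which is why downstream processing still needs to be idempotent.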

Exam trap

Google Cloud often tests the misconception that message ordering or replay features contribute to delivery guarantees, when in fact ordering is about sequence and replay through Cloud Dataproc is not a Pub/Sub capability; the key trap is confusing 'exactly-once delivery' (a Pub/Sub subscription-level setting) with 'exactly-once processing' (which requires subscriber-side idempotency).

65
MCQeasy

Your team uses Cloud Dataproc to run a Spark ML training job. The job is failing with an error: 'Container killed by YARN for exceeding memory limits.' What should you do to fix this?

A.Increase the spark.executor.memory property
B.Use preemptible VMs for faster execution
C.Increase the number of worker nodes
D.Enable the external shuffle service
AnswerA

This directly addresses the memory limit for each executor.

Why this answer

The error 'Container killed by YARN for exceeding memory limits' indicates that the Spark executor process is using more memory than the YARN container allows. Increasing `spark.executor.memory` allocates a larger YARN container for each executor, providing the necessary headroom for the Spark application's memory demands, including overhead for off-heap memory and JVM internals.
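
The relationship between executor memory and container size can be worked through numerically. The sketch below assumes Spark's documented default for `spark.executor.memoryOverhead` (10% of executor memory, with a 384 MiB floor) and ignores YARN's rounding to its allocation increment; the function name is illustrative.

```python
def yarn_container_mib(executor_memory_mib, overhead_mib=None):
    """Approximate the YARN container request for one Spark executor, in MiB."""
    if overhead_mib is None:
        # Assumed default: max(384 MiB, 10% of executor memory).
        overhead_mib = max(384, int(executor_memory_mib * 0.10))
    return executor_memory_mib + overhead_mib


for executor_memory in (2048, 4096, 8192):
    print(executor_memory, "->", yarn_container_mib(executor_memory))
```

If the executor's heap is adequate but off-heap usage is what exceeds the limit, raising the overhead explicitly (the `overhead_mib` argument here, `spark.executor.memoryOverhead` in Spark) is the more targeted change.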

Exam trap

The trap here is that candidates often confuse scaling horizontally (adding nodes) with scaling vertically (increasing per-node resources), and assume more nodes will fix memory limits when the issue is per-container allocation.

How to eliminate wrong answers

Option B is wrong because preemptible VMs are cheaper but can be terminated at any time, which does not address memory limits and can actually cause more failures due to preemption. Option C is wrong because increasing the number of worker nodes adds more executors but does not increase the memory per executor; the existing executors will still exceed their container limits. Option D is wrong because the external shuffle service helps with shuffle data persistence and reduces executor memory pressure during shuffle operations, but it does not increase the per-executor memory allocation; the root cause is insufficient container memory, not shuffle management.

66
MCQeasy

Your team needs to store time-series data from millions of IoT devices. Each device sends a reading every 5 minutes, and the total data volume is about 2 TB per month. The most common query pattern is retrieving all readings for a specific device over a time range (e.g., last 24 hours). Which storage service should you choose?

A.Cloud Storage (objects per device per time interval)
B.BigQuery
C.Cloud Bigtable
D.Cloud Spanner
AnswerC

Bigtable is ideal for time-series data with high write throughput and row-key-based range scans for device/time.

Why this answer

Cloud Bigtable is a fully managed, scalable NoSQL database designed for high-throughput, low-latency time-series data. It supports single-row key lookups and range scans, making it ideal for retrieving all readings for a specific device over a time range (e.g., last 24 hours) from millions of IoT devices generating 2 TB/month. Its row key design (e.g., device_id + timestamp) enables efficient time-range queries without full table scans, unlike object storage or analytical warehouses.
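
A row key of the form device_id + timestamp, and the range scan it enables, might look like the sketch below. It only builds the key strings and does not call the Bigtable client; the `#` separator, the zero-padded epoch seconds, and the device ID are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def row_key(device_id, ts):
    """Build 'device#epoch' with zero padding so string order matches time order."""
    return f"{device_id}#{int(ts.timestamp()):010d}"


def scan_range(device_id, end, hours=24):
    """Return (start_key, end_key) covering the last `hours` for one device."""
    start = end - timedelta(hours=hours)
    return row_key(device_id, start), row_key(device_id, end)


now = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
start_key, end_key = scan_range("device-0042", now)
print(start_key, end_key)
```

Leading with the device ID keeps each device's readings contiguous for the range scan, and avoids a key that begins with the timestamp, which would send all current writes to the same part of the key space.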

Exam trap

Google Cloud often tests the misconception that BigQuery is suitable for operational, low-latency time-series queries, but the trap here is that BigQuery is an analytical warehouse optimized for large-scale batch queries, not for repeated, sub-second per-device range scans, which is a classic NoSQL (Bigtable) workload.

How to eliminate wrong answers

Option A is wrong because Cloud Storage (object storage) is optimized for immutable blob storage and lacks native indexing for time-range queries; retrieving all readings for a device over a time range would require listing and filtering millions of objects, which is slow and costly. Option B is wrong because BigQuery is a serverless data warehouse designed for analytical SQL queries on large datasets, not for real-time, high-throughput point lookups or range scans with millisecond-level latency; it would incur high query costs and latency for repeated per-device time-range retrievals. Option D is wrong because Cloud Spanner is a globally distributed relational database with strong consistency and ACID transactions, which is overkill for time-series IoT data and would be prohibitively expensive and slower for high-volume, simple key-value range scans compared to Bigtable.

67
MCQmedium

A Dataflow batch job fails consistently with the error shown. The job uses a custom container image and runs in a VPC with a private IP. What should the engineer do to resolve the issue?

A.Request a CPU quota increase in the region.
B.Verify that the VPC has Private Google Access enabled and that Cloud NAT is configured for outbound internet access if needed.
C.Rebuild the custom container image and upload it to Container Registry.
D.Check that the custom image is based on the latest Dataflow SDK version.
AnswerB

In a private VPC, workers need connectivity to Dataflow API and container registry.

Why this answer

The error indicates that the Dataflow batch job cannot access required resources (e.g., container image, dependencies) because the VPC with private IPs lacks outbound internet connectivity. Option B is correct because enabling Private Google Access allows the VMs to reach Google APIs (like Container Registry) via the Google network, and Cloud NAT provides outbound internet access for non-Google APIs or external dependencies. Without these, the job fails to pull the custom container image or download necessary artifacts.

Exam trap

The trap here is that candidates often assume the error is due to the container image or SDK version, overlooking the VPC networking prerequisites (Private Google Access and Cloud NAT) that are required for Dataflow jobs using private IPs.

How to eliminate wrong answers

Option A is wrong because a CPU quota increase would not resolve connectivity issues; the error is about network access, not resource limits. Option C is wrong because rebuilding the container image does not fix the underlying network configuration problem; the image itself is not the cause of the failure. Option D is wrong because the Dataflow SDK version in the custom image is irrelevant to VPC networking; the job fails due to lack of outbound connectivity, not SDK compatibility.

68
MCQeasy

A company needs to process streaming data from IoT devices with sub-second latency and exactly-once processing guarantees. Which Google Cloud service should they use?

A.BigQuery
B.Cloud Dataproc
C.Cloud Dataflow
D.Cloud Pub/Sub
AnswerC

Dataflow supports streaming with auto-scaling and exactly-once processing, meeting the requirements.

Why this answer

Cloud Dataflow is the correct choice because it provides a unified stream and batch processing model with exactly-once processing guarantees and sub-second latency via its Apache Beam SDK. It supports event-time processing, watermarks, and triggers to handle out-of-order data from IoT devices while ensuring each record is processed exactly once, even in the case of failures.

Exam trap

Google Cloud often tests the distinction between data ingestion (Pub/Sub) and data processing (Dataflow), so the trap here is that candidates confuse Pub/Sub's streaming ingestion capability with the processing guarantees needed for exactly-once semantics.

How to eliminate wrong answers

Option A is wrong because BigQuery is a serverless data warehouse designed for analytical queries on large datasets, not for real-time stream processing with sub-second latency and exactly-once guarantees; it can ingest streaming data but does not provide the fine-grained per-record processing semantics required. Option B is wrong because Cloud Dataproc is a managed Hadoop/Spark service that can process streaming data via Spark Streaming, but it does not natively guarantee exactly-once processing out of the box and typically has higher latency due to micro-batching. Option D is wrong because Cloud Pub/Sub is a messaging and ingestion service that provides at-least-once delivery by default and does not perform data processing; it is a transport layer, not a processing engine.

69
MCQmedium

A financial services firm uses Cloud Pub/Sub to ingest real-time market data. The data is processed by a Cloud Dataflow streaming pipeline that aggregates trades per symbol and writes to BigQuery. The pipeline currently uses a single global window with a trigger that fires every minute. The firm now needs to support late data up to 5 minutes and also wants to reduce the number of writes to BigQuery to avoid hitting the table limit of 1,500 inserts per second. The current pipeline writes every minute, which is acceptable for inserts per second, but after adding late data handling, the number of writes doubles. How can you redesign the pipeline to handle late data while keeping write volume low?

A.Use fixed windows of 5 minutes with allowed lateness 5 minutes and trigger every 30 seconds
B.Increase the global window duration to 10 minutes and keep the same trigger
C.Discard all late data and keep the current windowing
D.Use session windows with a gap duration of 5 minutes and a count-based trigger that fires after accumulating 1000 elements
AnswerD

Session windows group events; count-based trigger reduces writes by batching.

Why this answer

Option D is correct because session windows naturally group events into bursts of activity separated by a gap duration (5 minutes), which reduces the number of writes by accumulating many trades per symbol before emitting a pane. Adding a count-based trigger that fires after 1000 elements further limits write frequency, keeping the insert rate well below BigQuery's 1,500 per second limit while still allowing late data up to the gap duration. This design handles late data implicitly within the session gap and avoids the write amplification seen with fixed windows and frequent triggers.

Exam trap

The trap here is that candidates assume fixed windows with allowed lateness are the only way to handle late data, overlooking that session windows naturally accommodate late arrivals while reducing write frequency through event grouping and count-based triggers.

How to eliminate wrong answers

Option A is wrong because fixed windows of 5 minutes with a trigger every 30 seconds would increase the number of writes (10 on-time panes per window per key, since 300 s / 30 s = 10, plus any late firings) rather than reduce them, exacerbating the BigQuery insert rate issue. Option B is wrong because increasing the global window duration to 10 minutes does not change the trigger frequency (still every minute), so the number of writes remains the same and late data handling is not addressed. Option C is wrong because discarding late data violates the requirement to support late data up to 5 minutes and is not a valid redesign for the stated need.
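
The write-volume argument against Option A comes down to simple arithmetic. The sketch below is only an illustration: `panes_per_window` is a hypothetical helper, and it assumes a purely time-based repeating trigger that fires once per interval while the window is open, ignoring any extra firings for late data.

```python
def panes_per_window(window_seconds: int, trigger_interval_seconds: int) -> int:
    """On-time trigger firings for one key during one window (assumed model)."""
    if window_seconds <= 0 or trigger_interval_seconds <= 0:
        raise ValueError("durations must be positive")
    return window_seconds // trigger_interval_seconds


# Option A: 5-minute fixed windows with a trigger every 30 seconds.
option_a_panes = panes_per_window(5 * 60, 30)

# Hypothetical comparison: the same window emitting a single pane when it closes.
single_pane = panes_per_window(5 * 60, 5 * 60)

print(option_a_panes, single_pane)
```

Under this model, every pane is a separate write per key, so a shorter trigger interval multiplies writes rather than reducing them.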

70
MCQhard

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

A.Use a Cloud Bigtable table as a side input via a RichSDF.
B.Use a side input from a PCollection and broadcast it.
C.Increase the number of workers to distribute the side input.
D.Increase the worker memory to 16 GB per worker.
AnswerA

Bigtable provides scalable key-value lookups without loading all data into memory.

Why this answer

Option A is correct because using a Cloud Bigtable table as a side input via a RichSDF (Rich Splittable DoFn) allows the pipeline to perform point lookups on the large (10 GB) lookup table without loading it entirely into worker memory. This avoids OOM errors and reduces latency by leveraging Bigtable's low-latency, scalable key-value storage, which is ideal for high-throughput streaming pipelines that require frequent, random access to a large, frequently updated dataset.

Exam trap

The trap here is that candidates often assume increasing resources (memory or workers) is the solution to memory pressure, but the real issue is the architectural pattern of broadcasting a large, frequently updated dataset—requiring a shift to an external, queryable store like Bigtable.

How to eliminate wrong answers

Option B is wrong because broadcasting a 10 GB PCollection as a side input would require every worker to hold the entire lookup table in memory, causing OOM errors and high latency due to serialization and shuffle overhead. Option C is wrong because increasing the number of workers does not reduce the per-worker memory footprint of a broadcast side input; each worker still needs to load the full 10 GB table, so OOM errors persist. Option D is wrong because simply increasing worker memory to 16 GB per worker is a temporary workaround that does not scale—if the lookup table grows or multiple side inputs are used, OOM errors will recur, and it does not address the fundamental issue of loading the entire dataset into memory.
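
The difference between broadcasting the whole table and looking up one key at a time can be sketched without any cloud client. In the example below, `BoundedLookup` and `fake_store` are hypothetical names; the `fetch` callable stands in for a point read against an external store such as Bigtable, and the small LRU cache is an assumed optimization rather than something the question prescribes.

```python
from collections import OrderedDict
from typing import Callable, Optional


class BoundedLookup:
    """Per-key lookups against an external store, with a small LRU cache."""

    def __init__(self, fetch: Callable[[str], Optional[str]], max_entries: int = 1000):
        self._fetch = fetch              # stand-in for a point read against the store
        self._max_entries = max_entries  # upper bound on entries held in memory
        self._cache: "OrderedDict[str, Optional[str]]" = OrderedDict()
        self.fetch_calls = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        self.fetch_calls += 1
        value = self._fetch(key)
        self._cache[key] = value
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value


# Hypothetical in-memory "store" standing in for the external lookup table.
fake_store = {"AAPL": "tech", "XOM": "energy", "JPM": "finance"}
lookup = BoundedLookup(fake_store.get, max_entries=2)

results = [lookup.get(k) for k in ["AAPL", "AAPL", "XOM", "JPM", "AAPL"]]
print(results, lookup.fetch_calls)
```

Memory use is bounded by `max_entries` instead of by the table size. Because the table in the question changes hourly, a real pipeline would probably also need cache entries to expire; that is left out of this sketch.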

71
MCQeasy

Your team is using Cloud Data Fusion to build batch ETL pipelines that load data from Cloud Storage into BigQuery. You have several pipelines that run daily. Recently, one pipeline started failing with a 'Permission denied' error when trying to read a new CSV file uploaded to a specific Cloud Storage bucket. Other pipelines using the same bucket succeed. The failing pipeline has a Cloud Storage source plugin that uses a service account with the roles/storage.objectViewer role. The bucket has uniform bucket-level access enabled. What is likely causing the issue?

A.Create a custom IAM role with storage.buckets.get and storage.objects.get permissions and assign it to the service account.
B.Check that the service account used by the failing pipeline's Data Fusion instance has the correct permissions, and ensure that the service account is the same as the one used by working pipelines.
C.Disable uniform bucket-level access and add bucket ACLs for the service account.
D.Add the service account as a member of the Cloud Storage bucket with the roles/storage.objectViewer role.
AnswerB

The root cause is likely a different service account or misconfiguration in the failing pipeline's Data Fusion instance.

Why this answer

The correct answer is B because the error is likely due to the Data Fusion instance's service account, not the source plugin's service account. In Cloud Data Fusion, the pipeline execution uses the service account attached to the Data Fusion instance itself to access Cloud Storage, even if the source plugin specifies a different service account. Since other pipelines using the same bucket succeed, the issue is that the failing pipeline's Data Fusion instance uses a service account that lacks the roles/storage.objectViewer role on the bucket, while working pipelines use an instance with the correct permissions.

Exam trap

Google Cloud often tests the misconception that the service account specified in a plugin (e.g., Cloud Storage source) is the one used for authentication, when in fact the Data Fusion instance's service account is the effective identity for all pipeline operations.

How to eliminate wrong answers

Option A is wrong because the roles/storage.objectViewer role already includes storage.objects.get permission, and storage.buckets.get is not required for reading objects; adding a custom role is unnecessary and does not address the root cause. Option C is wrong because disabling uniform bucket-level access and using ACLs is an outdated approach that contradicts best practices; the issue is not about access control mode but about which service account is being used. Option D is wrong because the service account used by the source plugin already has roles/storage.objectViewer on the bucket (as stated), but the pipeline fails because the Data Fusion instance's service account, not the plugin's service account, is the one making the request.

72
MCQmedium

Your Dataflow streaming pipeline is processing financial transactions and writing results to BigQuery. You need to monitor the pipeline for data freshness (end-to-end latency) and alert if it exceeds 5 minutes. The pipeline uses fixed windows of 1 minute. Which metrics should you use for alerting?

A.System Lag metric from Dataflow monitoring.
B.Data Freshness metric from BigQuery monitoring.
C.Element Count metric from Dataflow monitoring.
D.Worker Threads Utilization metric from Dataflow monitoring.
AnswerA

System Lag tracks the delay between event time and processing time; if it exceeds 5 minutes, alert.

Why this answer

Dataflow's 'System Lag' metric measures the difference between event time and processing time, indicating how far behind the pipeline is. For windowed pipelines, this reflects overall latency. Option C (Element Count) shows throughput, not latency.

Option B (Data Freshness from BigQuery monitoring) reflects table currency rather than pipeline lag. Option D (Worker Threads Utilization) relates to parallelism.
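
The alert condition itself is a threshold check on lag samples. The sketch below is not a Cloud Monitoring alert policy; `should_alert` is a hypothetical helper, and requiring several consecutive samples above the 5-minute threshold is an added assumption meant to avoid firing on a brief spike.

```python
from typing import Sequence


def should_alert(lag_samples_seconds: Sequence[float],
                 threshold_seconds: float = 300.0,
                 min_consecutive: int = 3) -> bool:
    """True if the most recent `min_consecutive` samples all exceed the threshold."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(lag_samples_seconds) < min_consecutive:
        return False
    recent = lag_samples_seconds[-min_consecutive:]
    return all(sample > threshold_seconds for sample in recent)


brief_spike = [40, 55, 420, 60, 45]   # one sample above 5 minutes, then recovery
sustained = [200, 310, 350, 400]      # lag stays above 5 minutes

print(should_alert(brief_spike), should_alert(sustained))
```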

73
MCQhard

You are designing a streaming data pipeline that must guarantee exactly-once processing semantics for financial transactions. The pipeline reads from Cloud Pub/Sub and writes to Cloud Bigtable. Each transaction has a unique transaction ID. Which features do you need to implement to ensure exactly-once semantics end-to-end?

A.Use Cloud Pub/Sub with synchronous pull and manually commit offsets after successfully writing to Bigtable.
B.Use Dataflow with exactly-once processing, and ensure the Bigtable sink uses idempotent mutations based on the transaction ID.
C.Use Dataflow with at-least-once processing and implement deduplication in a windowed transform using the transaction ID.
D.Use Cloud Pub/Sub with exactly-once delivery enabled, and write to Bigtable using single-row transactions.
AnswerB

Dataflow deduplicates records using unique identifiers; Bigtable idempotent writes (e.g., using CheckAndMutate) ensure that even if a mutation is retried, the result is the same.

Why this answer

Option B is correct because Dataflow's exactly-once processing guarantees that each record is processed precisely once, and idempotent Bigtable mutations (keyed by transaction ID) ensure that even if a mutation is retried, the result is the same. This combination provides end-to-end exactly-once semantics: Dataflow handles source-side deduplication and checkpointing, while Bigtable's idempotent writes prevent duplicates at the sink.
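
Why keying the write on the transaction ID makes retries harmless can be shown with a toy sink. `FakeTable` below is an in-memory stand-in, not a Bigtable client, and the record fields are hypothetical; the only point is that repeating the same keyed write leaves the stored state unchanged.

```python
from typing import Dict


class FakeTable:
    """In-memory stand-in for a key-value sink; NOT a Bigtable client."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}

    def upsert(self, row_key: str, cells: Dict[str, str]) -> None:
        # Writing the same row key with the same cells overwrites in place,
        # so a retried write leaves the table unchanged.
        self.rows[row_key] = dict(cells)


def write_transaction(table: FakeTable, txn: Dict[str, str]) -> None:
    # The transaction ID is used as the row key, so it identifies the write.
    table.upsert(txn["transaction_id"],
                 {"amount": txn["amount"], "symbol": txn["symbol"]})


table = FakeTable()
txn = {"transaction_id": "txn-001", "amount": "125.50", "symbol": "GOOG"}

write_transaction(table, txn)
snapshot = dict(table.rows)
write_transaction(table, txn)  # simulated retry or redelivery

print(len(table.rows), table.rows == snapshot)
```

In a real sink, other details may also need to be deterministic for a write to be truly idempotent (for example, cell timestamps chosen by the writer rather than by the server); the sketch ignores those.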

Exam trap

Google Cloud often tests the misconception that turning on Pub/Sub's exactly-once delivery, or managing acknowledgments manually, is enough to achieve end-to-end exactly-once semantics. Pub/Sub delivers at-least-once by default, and its optional exactly-once delivery setting covers only delivery to pull subscribers, not the write to the sink; a processing framework like Dataflow combined with an idempotent sink is still required end-to-end.

How to eliminate wrong answers

Option A is wrong because Cloud Pub/Sub synchronous pull with manual acknowledgment after the write does not guarantee exactly-once processing; under Pub/Sub's default at-least-once delivery model duplicates can still occur, and a failure between the Bigtable write and the acknowledgment causes the message to be redelivered. Option C is wrong because at-least-once processing in Dataflow inherently allows duplicates, and windowed deduplication using transaction ID is not sufficient for end-to-end exactly-once semantics: it only handles duplicates within a window and does not address failures during checkpointing or sink writes. Option D is wrong because Pub/Sub's exactly-once delivery applies only to delivery to the subscriber; a failure after the Bigtable write but before the acknowledgment still leads to redelivery, and single-row transactions in Bigtable do not by themselves make a repeated write idempotent.

74
MCQeasy

A data engineer needs to process a large dataset (500 TB) stored in Cloud Storage using Dataproc. The processing job requires reading the entire dataset and writing results back to Cloud Storage. The job is expected to run for 6 hours. Which configuration minimizes cost?

A.Use a single-node cluster with standard VMs.
B.Use a cluster with local SSDs for faster I/O.
C.Use a cluster with a mix of standard and preemptible VMs.
D.Use a cluster with n1-highmem-32 instances and 1000 cores.
AnswerC

Preemptible VMs reduce cost significantly while providing sufficient compute.

Why this answer

Option C is correct because preemptible VMs cost about 80% less than standard VMs, and mixing them with standard VMs provides fault tolerance for the job's 6-hour duration. Since the job reads and writes to Cloud Storage (not local HDFS), local SSDs are unnecessary, and a single-node cluster would lack the parallelism needed to process 500 TB efficiently within 6 hours. Using a mix of standard (for critical master/worker nodes) and preemptible VMs (for worker nodes) minimizes cost while ensuring job completion.
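
The saving can be estimated with relative units instead of real prices. The sketch below is an assumption-laden illustration: `relative_cluster_cost` is a hypothetical helper, one standard VM-hour is defined as 1.0 cost unit, the 80% discount is the approximate figure quoted above, the worker counts are invented for the example, and the cost of rerunning preempted work is ignored.

```python
def relative_cluster_cost(standard_workers: int,
                          preemptible_workers: int,
                          hours: float,
                          preemptible_discount: float = 0.80) -> float:
    """Cost in 'standard VM-hour' units; one standard VM for one hour costs 1.0."""
    if not 0.0 <= preemptible_discount <= 1.0:
        raise ValueError("preemptible_discount must be between 0 and 1")
    standard_cost = standard_workers * hours
    preemptible_cost = preemptible_workers * hours * (1.0 - preemptible_discount)
    return standard_cost + preemptible_cost


# Hypothetical 100-worker cluster running for the 6-hour job.
all_standard = relative_cluster_cost(100, 0, 6)
mixed = relative_cluster_cost(20, 80, 6)

print(f"all standard: {all_standard:.0f} units, mixed: {mixed:.0f} units")
```

With these invented worker counts the mixed cluster costs roughly a third of the all-standard one, which is the intuition behind Option C.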

Exam trap

Google Cloud often tests the misconception that local SSDs always improve performance for data processing jobs, but in Dataproc, when data resides in Cloud Storage, the bottleneck is network throughput, not local disk speed, making SSDs an unnecessary cost.

How to eliminate wrong answers

Option A is wrong because a single-node cluster cannot process 500 TB in 6 hours due to limited CPU and memory resources, and it lacks fault tolerance if the node fails. Option B is wrong because local SSDs add cost without benefit when reading/writing from Cloud Storage, as the bottleneck is network I/O, not disk I/O; Dataproc uses Cloud Storage as the primary data source, not HDFS. Option D is wrong because using 1000 cores with n1-highmem-32 instances is over-provisioned and expensive, and the job's 6-hour runtime does not justify such a large cluster; it also ignores the cost savings of preemptible VMs.

75
MCQeasy

A team is setting up a Dataflow pipeline for a time-sensitive ETL job that must complete within a specific time window. Which monitoring metric should they use to determine if the pipeline is on track to finish on time?

A.The number of failed elements and retries.
B.The system lag metric, which measures the time between event occurrence and processing.
C.The number of elements processed in the current window.
D.The job's estimated time to completion shown in the Dataflow monitoring interface.
AnswerD

This metric directly estimates remaining time based on throughput.

Why this answer

Option D is correct because the Dataflow monitoring interface provides an estimated time to completion for the pipeline, which is the most direct metric for determining if the job will finish within the required time window. This estimate is calculated based on current throughput, backlog, and resource utilization, making it the appropriate choice for time-sensitive ETL jobs. Other metrics like system lag or element counts do not directly predict job completion time.
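
A rough version of such an estimate is remaining work divided by current throughput. The sketch below is not how Dataflow computes its figure; both function names and the sample numbers are hypothetical, and it assumes throughput stays constant for the rest of the job.

```python
def estimated_seconds_remaining(remaining_elements: int,
                                elements_per_second: float) -> float:
    """Naive estimate: backlog divided by the current processing rate."""
    if elements_per_second <= 0:
        raise ValueError("elements_per_second must be positive")
    return remaining_elements / elements_per_second


def on_track(remaining_elements: int,
             elements_per_second: float,
             seconds_until_deadline: float) -> bool:
    """True if the naive estimate fits inside the time left before the deadline."""
    return estimated_seconds_remaining(
        remaining_elements, elements_per_second) <= seconds_until_deadline


# Hypothetical job: 3.6 million elements left, one hour until the deadline.
fast = on_track(3_600_000, 2000.0, 3600)   # 1800 s needed
slow = on_track(3_600_000, 500.0, 3600)    # 7200 s needed

print(fast, slow)
```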

Exam trap

Google Cloud often tests the distinction between metrics that measure current performance (like system lag or element count) versus metrics that predict future completion (like estimated time to completion), leading candidates to pick a metric that sounds relevant but does not answer the specific question about finishing on time.

How to eliminate wrong answers

Option A is wrong because the number of failed elements and retries indicates data quality or processing errors, not the pipeline's progress toward completion within a time window. Option B is wrong because system lag measures the delay between event occurrence and processing, which is useful for streaming latency but does not provide an estimated finish time for a batch or bounded pipeline. Option C is wrong because the number of elements processed in the current window shows throughput but not whether the remaining workload can be completed before the deadline, as it ignores the backlog and processing rate.
