Knowledge + Practice

CCNA Building and operationalizing data processing systems Questions

29 of 104 questions · Page 2/2 · Building and operationalizing data processing systems · Answers revealed

76

MCQ · easy

Your organization has a data lake on Cloud Storage with millions of small files (average 10 KB). You need to build a batch processing pipeline using Cloud Dataproc that runs a Spark job to transform the data and output results to BigQuery. The pipeline currently takes 4 hours to run because Spark spends a large amount of time listing files and managing tasks. You want to reduce the run time without changing the cluster size. Which action should you take?

A. Convert the input files from CSV to Parquet format

B. Use Spark coalesce to reduce the number of output partitions

C. Increase the number of Spark partitions to process more files in parallel

D. Enable the Spark Dynamic Resource Allocation and combine small files using a separate job before the main transformation

Answer: D

Combining files reduces task count and listing overhead.

Why this answer

Option D is correct because the primary bottleneck is the overhead of listing millions of small files and managing many Spark tasks. By combining small files into larger ones using a separate job before the main transformation, you reduce the number of files Spark must list and the number of tasks required, which directly cuts the 4-hour runtime. Enabling Spark Dynamic Resource Allocation ensures resources are used efficiently during this preprocessing step without changing the cluster size.

Exam trap

The trap here is that candidates focus on data format or partitioning tuning (A, B, C) instead of recognizing that the root cause is the sheer number of small files causing excessive file listing and task overhead, which requires a preprocessing step to consolidate files.

How to eliminate wrong answers

Option A is wrong because converting CSV to Parquet improves read performance and compression but does not address the overhead of listing millions of small files or the task management cost; the bottleneck is file count, not format. Option B is wrong because using Spark coalesce reduces the number of output partitions, which only affects the write phase to BigQuery and does nothing to reduce the input file listing or task scheduling overhead. Option C is wrong because increasing the number of Spark partitions would create even more tasks, exacerbating the overhead from managing millions of small files and likely increasing runtime, not reducing it.
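
To make the effect of consolidation concrete, here is a minimal sketch of the planning step only, assuming a greedy grouping into outputs of roughly 128 MB. The bucket path, the target size, and the `plan_compaction` helper are hypothetical; a real job would read file sizes from a Cloud Storage listing and do the actual merge in Spark.

```python
from typing import Iterable, List, Tuple


def plan_compaction(files: Iterable[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily group (path, size_bytes) pairs into batches of about target_bytes."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Close the current batch once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


# One million hypothetical 10 KB files packed into outputs of about 128 MB.
small_files = [(f"gs://example-bucket/raw/part-{i}.csv", 10 * 1024) for i in range(1_000_000)]
batches = plan_compaction(small_files, target_bytes=128 * 1024 * 1024)
print(len(small_files), "input files ->", len(batches), "compacted files")
```

Each compacted file then maps to a handful of Spark tasks instead of thousands, which is where the listing and scheduling savings come from.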

77

MCQ · hard

Based on the exhibit, what is the most likely cause of duplicate rows despite using the same event_id as insertId?

A. BigQuery's streaming buffer deduplication is best-effort and may not catch duplicates within a short time window.

B. The Dataflow pipeline is retrying inserts due to network errors, and the same event_id is not being used in retries.

C. The pipeline is writing more than 100,000 rows per second, exceeding BigQuery's streaming quota.

D. The table is partitioned by timestamp, so BigQuery cannot deduplicate across partitions.

Answer: A

Deduplication on insertId is best-effort, so retried or closely spaced inserts with the same ID can still produce duplicate rows.

Why this answer

BigQuery's streaming buffer uses best-effort deduplication based on the `insertId` field. With `event_id` mapped to `insertId`, most retried rows are dropped, but BigQuery remembers an ID only for a short time (documented as at least one minute) and can miss duplicates under high throughput or network retries. This is a documented limitation of BigQuery streaming inserts, which do not guarantee exactly-once semantics.

Exam trap

Google Cloud often tests the misconception that BigQuery's streaming deduplication is a strong guarantee, when in fact it is best-effort and can fail under concurrent writes or short time windows.

How to eliminate wrong answers

Option B is wrong because if the same `event_id` is not used in retries, BigQuery would treat them as distinct rows and not deduplicate, but the question states the same `event_id` is used as `insertId`; the issue is that deduplication is best-effort, not that the ID is missing. Option C is wrong because exceeding the streaming quota (default 100,000 rows per second per table) would cause ingestion errors or throttling, not duplicate rows; duplicates arise from the buffer's deduplication behavior, not quota limits. Option D is wrong because BigQuery can deduplicate across partitions within the streaming buffer; partitioning does not disable deduplication, and duplicates can occur even in a single partition due to the buffer's best-effort nature.
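
The "best-effort" behavior can be pictured with a toy model. This is not BigQuery's implementation: the `BestEffortDeduper` class and the 60-second memory window are assumptions chosen only to show how a retry that arrives after the ID has been forgotten is stored a second time.

```python
from typing import Dict, List


class BestEffortDeduper:
    """Toy model: remembers an insert ID only for `window_s` seconds."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._last_seen: Dict[str, float] = {}
        self.rows: List[str] = []

    def insert(self, insert_id: str, now_s: float) -> bool:
        """Store the row unless the same ID was seen within the window."""
        seen_at = self._last_seen.get(insert_id)
        self._last_seen[insert_id] = now_s
        if seen_at is not None and now_s - seen_at <= self.window_s:
            return False  # dropped as a duplicate
        self.rows.append(insert_id)
        return True


table = BestEffortDeduper(window_s=60)
table.insert("evt-1", now_s=0)    # first write: stored
table.insert("evt-1", now_s=5)    # retry inside the window: dropped
table.insert("evt-1", now_s=300)  # retry after the window: stored again
print(table.rows)
```

Because of this, pipelines that need strict uniqueness usually deduplicate again downstream rather than relying on `insertId` alone.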

78

Multi-Select · hard

Which THREE practices are recommended when designing a Cloud Data Fusion pipeline to ensure efficient execution and monitoring? (Choose three.)

Select 3 answers

A. Manually partition input files to control parallelism.

B. Limit the memory and disk usage per stage to avoid Dataproc node resource exhaustion.

C. Use a dedicated Dataproc cluster for each production pipeline to avoid resource contention.

D. Schedule pipeline runs using Cloud Scheduler and Pub/Sub triggers to avoid manual starts.

E. Set up custom metrics and alerts for pipeline backpressure and latency.

Answers: B, C, E

Resource limits prevent OOM errors and improve stability.

Why this answer

Option B is correct because Cloud Data Fusion pipelines run on Dataproc clusters, and limiting memory and disk usage per stage prevents resource exhaustion on worker nodes. This ensures that no single stage consumes all available resources, which could cause the pipeline to fail or degrade performance. Proper resource limits help maintain stable execution and avoid out-of-memory errors.

Exam trap

Google Cloud often tests the misconception that manual partitioning (Option A) gives better control, but Cloud Data Fusion's auto-partitioning is more efficient and recommended; candidates may also overlook that scheduling (Option D) is about automation, not execution efficiency or monitoring.

79

MCQ · easy

You are designing a streaming Dataflow pipeline that reads from Cloud Pub/Sub. Some data may arrive late due to network delays. You need to ensure that late-arriving data is still processed, but after a certain point, it should be discarded to avoid unbounded state. What is the best practice?

A. Switch to a batch pipeline

B. Use fixed windows without allowed lateness

C. Discard all late-arriving data

D. Set a watermark and allowed lateness

Answer: D

Allowed lateness enables processing of late data within a configurable period, balancing completeness and latency.

Why this answer

Option D is correct because in streaming Dataflow pipelines, setting a watermark and allowed lateness provides a mechanism to handle late-arriving data from Pub/Sub without unbounded state growth. The watermark defines the point after which data is considered late, and allowed lateness specifies how long to wait for late data before discarding it, balancing completeness and state management.

Exam trap

The trap here is that candidates often confuse 'allowed lateness' with simply discarding late data, failing to recognize that it provides a controlled buffer for late arrivals while still bounding state growth.

How to eliminate wrong answers

Option A is wrong because switching to a batch pipeline would lose the streaming, low-latency processing requirement and cannot handle late-arriving data in real time. Option B is wrong because fixed windows without allowed lateness would immediately discard any data arriving after the window end, even if it is only slightly delayed, leading to data loss. Option C is wrong because discarding all late-arriving data is too aggressive and ignores the need to process data that arrives within a reasonable delay, which is common in distributed systems like Pub/Sub.
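
The interaction between the watermark and allowed lateness can be sketched in plain Python. This is a simplified model of the rule, not the Beam API: the `classify` helper, the 60-second fixed windows, and the 120-second allowed lateness are illustrative assumptions, and times are plain integer seconds.

```python
def window_end(event_time_s: int, window_size_s: int) -> int:
    """End (exclusive) of the fixed window that contains event_time_s."""
    return (event_time_s // window_size_s + 1) * window_size_s


def classify(event_time_s: int, watermark_s: int,
             window_size_s: int, allowed_lateness_s: int) -> str:
    """Decide what happens to an element given the current watermark."""
    end = window_end(event_time_s, window_size_s)
    if watermark_s < end:
        return "on_time"        # the window has not closed yet
    if watermark_s < end + allowed_lateness_s:
        return "late_accepted"  # window closed, but state is still retained
    return "dropped"            # state was garbage-collected; element discarded


# An event stamped t=30 belongs to the window [0, 60).
for watermark in (45, 100, 200):
    print(watermark, classify(30, watermark, window_size_s=60, allowed_lateness_s=120))
```

The third case is what bounds state: once the watermark passes the window end plus the allowed lateness, the window's state can be released.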

80

MCQ · medium

A retail company is building a recommendation engine that requires processing customer clickstream data in near real-time. The data is ingested via Pub/Sub, and must be joined with a lookup table of product details (updated daily) before being used for model inference. Which design pattern should they use?

A. Enrich the stream by querying BigQuery for each event using a Cloud Function.

B. Use a Dataflow pipeline that reads from Pub/Sub and uses a side input from a regularly refreshed PCollection loaded from Cloud Storage.

C. Store product details in Cloud Memorystore (Redis) and have the streaming application look up each event.

D. Write events to BigQuery and use scheduled queries to join with the product table in batch.

Answer: B

Side inputs enable efficient streaming-batch joins within Dataflow.

Why this answer

Option B is correct because Dataflow can read streaming data from Pub/Sub and use a side input from a regularly refreshed PCollection loaded from Cloud Storage. This pattern allows the product lookup table (updated daily) to be periodically reloaded into the pipeline as a side input, enabling efficient, low-latency enrichment of each event without per-event external calls or batch delays.

Exam trap

Google Cloud often tests the distinction between streaming enrichment patterns that require external lookups (which add latency and cost) versus using side inputs for static or slowly-changing reference data, leading candidates to mistakenly choose a cache-based solution like Redis when the data is already available in Cloud Storage.

How to eliminate wrong answers

Option A is wrong because querying BigQuery for each event via a Cloud Function would introduce high latency and cost due to per-event query overhead, and BigQuery is not designed for real-time point lookups. Option C is wrong because while Cloud Memorystore (Redis) provides low-latency lookups, it requires managing a separate cache and does not natively integrate with the daily-updated Cloud Storage file; the pattern also lacks the automatic refresh mechanism that side inputs provide. Option D is wrong because writing events to BigQuery and using scheduled queries for batch joins introduces significant latency (minutes to hours), which violates the near real-time requirement for the recommendation engine.
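
The idea behind a periodically refreshed side input can be illustrated with a plain-Python stand-in. This is not Beam code: `RefreshingLookup`, `enrich`, and the injected clock are hypothetical names, and the daily TTL mirrors the "updated daily" product table in the question.

```python
import time
from typing import Callable, Dict, Optional


class RefreshingLookup:
    """In-memory lookup table that reloads itself after `ttl_s` seconds."""

    def __init__(self, loader: Callable[[], Dict[str, dict]], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._table: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str) -> Optional[dict]:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl_s:
            self._table = self._loader()  # one bulk load, not one call per event
            self._loaded_at = now
        return self._table.get(key)


def enrich(event: dict, products: RefreshingLookup) -> dict:
    """Attach product details to a click event from the in-memory table."""
    return {**event, "product": products.get(event["product_id"])}


loads = []
fake_now = [0.0]  # controllable clock so the example needs no sleeping


def load_products() -> Dict[str, dict]:
    loads.append(1)  # stands in for reading the daily file from Cloud Storage
    return {"p1": {"name": "kettle"}}


products = RefreshingLookup(load_products, ttl_s=86_400, clock=lambda: fake_now[0])
events = [{"user": "u1", "product_id": "p1"}, {"user": "u2", "product_id": "p1"}]
enriched = [enrich(e, products) for e in events]
print(enriched, "loads:", len(loads))
```

Every event is enriched from memory, and the reference data is reloaded once per refresh interval rather than once per event.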

81

Matching · medium

Match each data storage term to its characteristic.

Concepts and matches

ACID: Atomicity, Consistency, Isolation, Durability

BASE: Basically Available, Soft state, Eventual consistency

CAP theorem: Consistency, Availability, Partition tolerance trade-off

Sharding: Horizontal partitioning of data across databases

Why these pairings

Fundamental concepts in data storage systems.

82

MCQ · hard

Your company runs a batch data processing pipeline using Cloud Dataproc and Cloud Composer. The pipeline processes hundreds of terabytes of data daily. Recently, the pipeline has been failing intermittently due to Dataproc cluster creation errors: 'Insufficient resources to create cluster in zone us-central1-f.' The project has a global quota of 1000 vCPUs for Compute Engine. The team usually uses n2-standard-8 (8 vCPU) worker nodes. You notice that the error occurs during peak usage times. You need to ensure the pipeline runs reliably without increasing the global quota. Which action should you take?

A. Increase the global Compute Engine quota to 2000 vCPUs

B. Switch to using preemptible VMs only, which have higher availability

C. Use fewer workers with larger machine types, such as n2-standard-64

D. Configure the Dataproc cluster to use multiple zones via the --zone argument with a zonal list

Answer: D

Spreading across zones avoids zonal capacity issues.

Why this answer

Option D is correct because the error is a zonal capacity shortage, not a quota problem: the project still has vCPU quota, but us-central1-f cannot supply the requested VMs at peak times. Allowing the cluster to be placed in other zones of the same region, rather than pinning it to us-central1-f, avoids the 'Insufficient resources' error without requiring a global quota increase. The cluster itself still runs in a single zone; the benefit is that a zone with available capacity is used.

Exam trap

The trap here is that candidates often assume the only solution to resource exhaustion is to increase quotas or switch to preemptible VMs, overlooking the zonal distribution feature that directly addresses the 'Insufficient resources' error without changing the global quota.

How to eliminate wrong answers

Option A is wrong because the question explicitly states you must not increase the global quota; raising it to 2000 vCPUs would violate that constraint and does not address the zonal resource exhaustion issue. Option B is wrong because preemptible VMs have lower availability (they can be reclaimed at any time) and are not suitable as the only worker type for a reliable production pipeline processing hundreds of terabytes daily; they also do not solve the zone-specific capacity shortage. Option C is wrong because using fewer workers with larger machine types (e.g., n2-standard-64) does not reduce the total vCPU count required for the workload; it may even increase the risk of hitting the global quota per cluster creation request and does not mitigate the zonal resource exhaustion.
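
The arithmetic behind the rejection of Option C is short enough to check directly. This sketch uses only the figures from the question (a 1000 vCPU quota, 8 vCPUs per n2-standard-8 worker, 64 vCPUs per n2-standard-64 worker); `max_workers` is a hypothetical helper.

```python
def max_workers(quota_vcpus: int, vcpus_per_worker: int) -> int:
    """Largest number of workers that fits inside the vCPU quota."""
    return quota_vcpus // vcpus_per_worker


for name, vcpus in [("n2-standard-8", 8), ("n2-standard-64", 64)]:
    workers = max_workers(1000, vcpus)
    print(f"{name}: {workers} workers, {workers * vcpus} vCPUs")
```

Either machine type consumes roughly the same total vCPUs for the same workload, so changing the worker size touches neither the quota nor the capacity available in the zone.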

83

Multi-Select · easy

Your company is evaluating managed messaging services for a new event-driven application. The application requires pub/sub semantics, high throughput (millions of messages per second), and integration with Google Cloud services like Cloud Functions and Dataflow. Which TWO services should you consider? (Choose two.)

Select 2 answers

A. Cloud Pub/Sub

B. Cloud Scheduler

C. Cloud Pub/Sub Lite

D. Cloud Functions

E. Cloud Tasks

Answers: A, C

Cloud Pub/Sub is a fully managed, scalable pub/sub messaging service with native Google Cloud integration.

Why this answer

Cloud Pub/Sub (A) is correct because it provides fully managed, highly scalable pub/sub messaging with at-least-once delivery by default and support for millions of messages per second. It integrates natively with Cloud Functions and Dataflow, making it ideal for event-driven architectures requiring high throughput and decoupled communication. Cloud Pub/Sub Lite (C) is the other messaging service in the list: it offers the same pub/sub semantics at high throughput, with capacity that you provision yourself.

Exam trap

Google Cloud often tests the distinction between managed messaging services (Pub/Sub vs. Pub/Sub Lite) and other Google Cloud services like Cloud Tasks or Cloud Scheduler, where candidates mistakenly select compute or scheduling services for messaging needs.

84

MCQ · hard

You are a data engineer at a financial services company. You manage a batch pipeline that processes daily trade settlement reports. The pipeline runs on Cloud Dataproc using PySpark jobs triggered by Cloud Composer (Airflow). Recent trades have increased by 3x, and the pipeline now frequently fails with 'OutOfMemoryError' in the executor logs. You have already increased the executor memory from 4g to 8g, but the problem persists. The cluster uses standard worker nodes (n1-standard-4) with 15 GB RAM per node. You need to make the pipeline stable and cost-efficient. What should you do?

A. Use n1-highmem-4 instances for the cluster to get 26 GB RAM per node and increase executor memory to 12g.

B. Migrate the PySpark jobs to Cloud Dataflow with the Apache Beam SDK to benefit from auto-scaling.

C. Increase the number of executors and reduce the executor memory to 4g, then add preemptible secondary workers to lower cost.

D. Enable cluster autoscaling and set minimum to 5 workers, maximum to 20 workers.

Answer: C

Adding more executor instances distributes memory and reduces per executor load; preemptible workers lower costs.

Why this answer

Option C is correct because the OutOfMemoryError persists even after increasing executor memory to 8g, indicating that the issue is not simply insufficient memory per executor but rather that the total memory across all executors is insufficient for the 3x data volume. By increasing the number of executors (parallelism) and reducing executor memory back to 4g, you distribute the data processing load across more JVMs, reducing the memory pressure per executor. Adding preemptible secondary workers lowers cost while providing the additional compute capacity needed to handle the increased data volume efficiently.

Exam trap

Google Cloud often tests the misconception that increasing executor memory alone solves OutOfMemoryErrors, when in reality the issue is often insufficient parallelism or misconfigured memory overhead, and the correct solution involves balancing executor count, memory, and cost-efficient instance types like preemptible VMs.

How to eliminate wrong answers

Option A is wrong because simply using n1-highmem-4 instances with 26 GB RAM and increasing executor memory to 12g does not address the root cause—the pipeline needs more parallelism, not just more memory per executor; the OutOfMemoryError can still occur if the data skew or shuffle operations overwhelm a single executor. Option B is wrong because migrating to Cloud Dataflow with Apache Beam SDK is a significant architectural change that does not directly solve the memory issue; Dataflow auto-scaling can help with throughput but does not guarantee stability if the pipeline's memory configuration is fundamentally misaligned with the data volume. Option D is wrong because enabling cluster autoscaling with a minimum of 5 workers and maximum of 20 workers does not address the executor memory configuration; autoscaling adds nodes but if the executor memory is still too high per node (e.g., 8g on a 15 GB node), the system may still run out of memory due to overhead from the OS, YARN, and other daemons, and it does not optimize cost by using preemptible instances.
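
The trade-off between executor memory and executor count can be sketched numerically. The overhead rule below (10% of executor memory, with a 384 MB floor) is Spark's default for `spark.executor.memoryOverhead`; the 12 GB of YARN-allocatable memory on a 15 GB node is an assumption for illustration, as are the helper names.

```python
def container_mb(executor_memory_mb: int, overhead_factor: float = 0.10,
                 min_overhead_mb: int = 384) -> int:
    """YARN container size: executor heap plus Spark's default memory overhead."""
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead


def executors_per_node(yarn_memory_mb: int, executor_memory_mb: int) -> int:
    """How many executor containers fit in the memory YARN can allocate on a node."""
    return yarn_memory_mb // container_mb(executor_memory_mb)


YARN_MEMORY_MB = 12 * 1024  # assumed share of a 15 GB n1-standard-4 node

for heap_gb in (4, 8):
    fit = executors_per_node(YARN_MEMORY_MB, heap_gb * 1024)
    print(f"{heap_gb}g executors: {fit} per node")
```

Under these assumptions a node holds two 4g executors but only one 8g executor, so dropping back to 4g and adding workers raises the executor count that the data is spread across.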

85

Multi-Select · easy

A data team uses Cloud Composer to orchestrate Airflow DAGs. They need to ensure that a downstream task runs only if at least two out of three upstream sensor tasks succeed. Which TWO configurations should they combine?

Select 2 answers

A. Set trigger_rule to 'none_failed_or_skipped' and use a condition.

B. Set trigger_rule to 'one_success'.

C. Set trigger_rule to 'all_done'.

D. Set trigger_rule to 'none_failed'.

E. Use a PythonOperator to check the number of successes.

Answers: A, E

Combined with a condition, this ensures at least two succeeded.

Why this answer

Option A is correct because the 'none_failed_or_skipped' trigger rule (renamed 'none_failed_min_one_success' in Airflow 2.2) fires once every upstream task has finished, none has failed, and at least one has succeeded. On its own that is not a two-of-three check, so it is combined with a condition (e.g., a PythonOperator or BranchPythonOperator, Option E) that counts how many of the three sensor tasks succeeded and lets the downstream task run only when at least two did. This approach combines Airflow's built-in trigger rules with conditional logic to implement a quorum-based dependency.

Exam trap

Google Cloud often tests the misconception that a single trigger rule like 'one_success' or 'none_failed' can directly enforce a quorum condition, when in fact you must combine a trigger rule with explicit conditional logic to count successes.
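
The counting logic that such a condition would run is simple. The sketch below is plain Python with no Airflow imports: `quorum_met` is a hypothetical function, and fetching the upstream task states from the DAG run is assumed to happen elsewhere in the callable.

```python
from typing import Iterable


def quorum_met(upstream_states: Iterable[str], required: int = 2) -> bool:
    """Return True when at least `required` upstream tasks ended in 'success'."""
    successes = sum(1 for state in upstream_states if state == "success")
    return successes >= required


print(quorum_met(["success", "skipped", "success"]))  # two of three succeeded
print(quorum_met(["success", "skipped", "skipped"]))  # only one succeeded
```

A PythonOperator-style callable would return or act on this boolean so that the downstream task proceeds only when the quorum holds.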

86

MCQ · easy

Your company uses Cloud Dataflow to process streaming data from Pub/Sub. The pipeline occasionally fails with a 'worker terminated unexpectedly' error. What is the most likely cause of this error?

A. Insufficient memory per worker causing OOM errors

B. Incorrect VPC firewall rules blocking internal communication

C. Staging location bucket lacks write permissions

D. Pub/Sub subscription throughput quota exceeded

Answer: A

OOM errors cause workers to terminate unexpectedly.

Why this answer

The 'worker terminated unexpectedly' error in Cloud Dataflow typically indicates that a worker process ran out of memory (OOM) and was killed by the operating system. This occurs when the pipeline's memory requirements exceed the configured worker machine type's memory capacity, often due to large windowing accumulations, skewed data, or inefficient state handling.

Exam trap

Google Cloud often tests the distinction between infrastructure-level errors (like OOM) and configuration or permission errors, so candidates may incorrectly attribute the generic 'worker terminated' message to network or IAM issues rather than resource exhaustion.

How to eliminate wrong answers

Option B is wrong because VPC firewall rules blocking internal communication would cause connectivity errors like 'unable to connect to shuffle service' or 'worker cannot reach Dataflow service', not a generic termination error. Option C is wrong because staging location bucket lacking write permissions would cause a pipeline submission failure with a permission denied error, not a runtime worker termination. Option D is wrong because Pub/Sub subscription throughput quota exceeded would result in Pub/Sub-specific errors such as 'RESOURCE_EXHAUSTED' or backlog buildup, not a worker termination.

87

MCQ · hard

You are optimizing a BigQuery query that runs on a large table (hundreds of TB). The table is partitioned by date and frequently queried with filters on a specific customer_id column and date range. Queries are slow even after partitioning. Which optimization should you apply?

A. Increase the number of BigQuery slots

B. Columnar clustering on customer_id

C. Create materialized views for each customer

D. Denormalize the table to reduce joins

Answer: B

Clustering sorts data within each partition by customer_id, enabling block pruning for queries filtering on that column.

Why this answer

Clustering on customer_id sorts the data within each date partition, so BigQuery can prune blocks that cannot contain the requested customer. Partitioning alone prunes only by date and does not help with the customer_id filter. Materialized views can speed up pre-aggregated queries, but one view per customer does not scale and does not help ad hoc customer_id filters.

Denormalization reduces join cost, which is not the bottleneck here. Increasing slots is expensive and does not address the data layout.
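
Block pruning can be illustrated with a toy model. This is not how BigQuery stores data internally: the 100-row blocks, the synthetic customer IDs, and the helper names are assumptions chosen to show why sorted data lets a filter skip most blocks.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (min customer_id, max customer_id) stored in the block


def make_blocks(ids: List[int], block_size: int) -> List[Block]:
    """Split rows into fixed-size blocks and record each block's min/max ID."""
    chunks = (ids[i:i + block_size] for i in range(0, len(ids), block_size))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def blocks_scanned(blocks: List[Block], customer_id: int) -> int:
    """Count blocks whose ID range could contain customer_id."""
    return sum(1 for low, high in blocks if low <= customer_id <= high)


# 10,000 hypothetical rows for 1,000 customers, interleaved in arrival order.
ids = [(i * 7919) % 1000 for i in range(10_000)]
unclustered = make_blocks(ids, block_size=100)
clustered = make_blocks(sorted(ids), block_size=100)  # rows sorted by customer_id

print("unclustered blocks scanned:", blocks_scanned(unclustered, 42))
print("clustered blocks scanned:  ", blocks_scanned(clustered, 42))
```

With the rows sorted, only the block whose range covers the requested customer needs to be read; in arrival order, many blocks have ranges wide enough that they cannot be skipped.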

88

MCQ · hard

A company runs a critical real-time data pipeline using Dataflow that ingests events from Cloud Pub/Sub, performs aggregations using sliding windows, and writes results to BigQuery. The pipeline is deployed in us-central1. The pipeline's latency has increased recently, and the Dataflow monitoring shows that the 'system lag' metric is consistently above 5 minutes. The pipeline is using Streaming Engine and has 10 workers with 4 vCPUs each. The pipeline processes approximately 100,000 events per second. The team has verified that the source Pub/Sub topic has sufficient publish throughput and the BigQuery table has no quota issues. The pipeline logs show that some workers are experiencing GC overhead limit exceeded errors. The pipeline code uses stateful processing with a custom keyed state for deduplication. What is the most likely cause of the increased latency?

A. The number of workers is insufficient; increasing to 20 workers will reduce latency.

B. The stateful processing is causing large state sizes that lead to GC overhead; use a more efficient state backend or increase worker memory.

C. The sliding window duration is too long; reducing it to 1 minute will improve performance.

D. The deduplication logic is causing a bottleneck; removing it will reduce latency.

Answer: B

GC overhead indicates memory pressure from large state; increasing memory or using a more efficient state backend like Cloud Bigtable can help.

Why this answer

The GC overhead limit exceeded errors indicate that workers are spending too much time garbage collecting, which is a classic symptom of excessive heap memory usage. Stateful processing with custom keyed state for deduplication can cause large per-key state sizes, especially with sliding windows that maintain overlapping state for each key. This forces the JVM to constantly garbage collect, increasing system lag beyond 5 minutes.

Reducing the state footprint (for example, expiring deduplication keys with timers or using a built-in deduplication transform instead of custom keyed state) or increasing worker memory directly addresses the root cause.

Exam trap

Google Cloud often tests the misconception that scaling workers (Option A) is the universal fix for latency, when in reality memory-related issues like GC overhead require tuning state management or worker resources, not just parallelism.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers does not fix the GC overhead issue; it adds cost without changing how much state each key retains. Option C is wrong because shortening the sliding window does not shrink the custom keyed deduplication state that drives the GC pressure, and it changes the aggregation results. Option D is wrong because removing deduplication would compromise data correctness; the bottleneck is not the logic itself but the memory footprint of the state, which can be mitigated without removing the feature.
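
How sliding windows multiply per-key state can be shown with a small window-assignment function. This is a plain-Python sketch of the usual assignment rule, not the Beam API; the 300-second window, the 60-second period, and the function name are illustrative assumptions, and window offsets are ignored.

```python
from typing import List, Tuple


def sliding_windows(ts: int, size_s: int, period_s: int) -> List[Tuple[int, int]]:
    """Return [start, end) for every sliding window that contains timestamp ts."""
    windows = []
    start = ts - (ts % period_s)  # latest window start at or before ts
    while start > ts - size_s:
        windows.append((start, start + size_s))
        start -= period_s
    return windows


# A 5-minute window sliding every minute: each element lands in several windows.
print(len(sliding_windows(125, size_s=300, period_s=60)))
# A window as long as its period behaves like a fixed window.
print(len(sliding_windows(125, size_s=60, period_s=60)))
```

Each element is held in size divided by period windows at once, which is separate from, and in addition to, whatever the custom deduplication state keeps per key.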

89

MCQ · hard

A company's Dataflow pipeline uses the PubSubIO source to read messages and writes to BigQuery via the BigQueryIO sink. The pipeline is running in Streaming mode with exactly-once semantics enabled. Occasionally, duplicate rows appear in BigQuery. What is the most likely reason?

A. The user-provided record ID for deduplication in BigQuery's streaming inserts is not being set for all messages, leading to duplicate rows.

B. The pipeline is using the WriteResult method with WRITE_APPEND in batch mode, which can cause duplicates if retries happen.

C. The pipeline is experiencing the 'dataflow streaming log processing' bug, causing duplicate logs to be written.

D. The PubSubIO source is configured with a dead-letter queue and messages are being redelivered without proper deduplication.

Answer: A

BigQueryIO uses insertId for deduplication; if it's missing or inconsistent, duplicates can occur.

Why this answer

In Dataflow streaming pipelines with exactly-once semantics, BigQuery's streaming inserts use user-provided record IDs for deduplication. If the record ID is not set for all messages, BigQuery cannot identify duplicates, and retries or redeliveries from Pub/Sub can result in duplicate rows. This is the most common cause of duplicates in this scenario.

Exam trap

Google Cloud often tests the misconception that exactly-once semantics in Dataflow automatically deduplicates at the sink, but in reality, BigQuery requires explicit user-provided record IDs for deduplication during streaming inserts.

How to eliminate wrong answers

Option B is wrong because WRITE_APPEND in batch mode is not relevant to a streaming pipeline with exactly-once semantics; the question specifies streaming mode, and batch mode duplicates would not explain streaming-specific behavior. Option C is wrong because there is no known 'dataflow streaming log processing' bug that causes duplicate logs; this is a fabricated term. Option D is wrong because a dead-letter queue handles failed messages after retries are exhausted, not redelivery; Pub/Sub redelivery without deduplication is already addressed by the user-provided record ID mechanism, and the dead-letter queue does not cause duplicates.
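
One way to make sure every message carries a record ID is to derive it deterministically from the message content. The sketch below is an illustration only: `insert_id_for` is a hypothetical helper, and hashing the canonical JSON payload is an assumed scheme, reasonable only when the payload itself identifies the event.

```python
import hashlib
import json


def insert_id_for(payload: dict) -> str:
    """Derive a stable ID from the payload so a retried row gets the same ID."""
    # Canonical JSON (sorted keys, fixed separators) keeps the hash independent
    # of dictionary key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = insert_id_for({"event_id": "evt-1", "amount": 10})
retry = insert_id_for({"amount": 10, "event_id": "evt-1"})  # same data, new attempt
other = insert_id_for({"event_id": "evt-2", "amount": 10})
print(first == retry, first == other)
```

A random ID generated per attempt would defeat the purpose, since the retry would look like a new row.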

90

Multi-Select · hard

Which THREE actions reduce the cost of a Cloud Composer environment?

Select 3 answers

A. Delete old and unused DAG files to reduce scheduler load

B. Use standard network tier instead of premium

C. Set up a maintenance window to shut down the environment during idle hours

D. Use a smaller environment size (e.g., small instead of medium)

E. Increase the number of schedulers for higher throughput

Answers: A, C, D

Less load means fewer resources needed.

Why this answer

Option A is correct because deleting old and unused DAG files reduces the number of DAGs the scheduler must parse and evaluate. The Cloud Composer scheduler scans the DAG folder every 30 seconds by default; fewer DAG files mean lower CPU and memory consumption, directly reducing the cost of the environment's compute resources.

Exam trap

The trap here is that candidates confuse scaling up (Option E) with cost optimization, not realizing that adding schedulers increases resource consumption and cost, while the correct cost-saving actions involve reducing resource usage or shutting down idle capacity.

91

MCQ · hard

A data engineer is designing a batch ETL pipeline that reads CSV files from Cloud Storage, transforms them using Dataproc, and writes the results to BigQuery. The data volume is expected to grow 10x in the next year. Which design approach best balances cost and performance?

A. Create a single large persistent Dataproc cluster to handle the peak load.

B. Use Cloud Data Fusion to visually design the pipeline and run it on Dataproc.

C. Use a Dataproc cluster with preemptible worker nodes and autoscaling enabled.

D. Migrate the pipeline to Dataflow with Apache Beam and use flexRS for cost savings.

Answer: C

Preemptible VMs are cost-effective, and autoscaling handles growth.

Why this answer

Option C is correct because preemptible worker nodes significantly reduce cost (up to 80% discount) while autoscaling dynamically adjusts cluster size to match the growing workload, ensuring performance without over-provisioning. This combination handles the 10x data growth efficiently by scaling out during peak loads and scaling in during lulls, using preemptible instances for fault-tolerant tasks like transformation.

Exam trap

The trap here is that candidates often choose Dataflow (Option D) assuming it is always the best for cost and performance, but the question specifically involves Dataproc and batch ETL from Cloud Storage to BigQuery, where preemptible nodes with autoscaling provide a more direct and cost-effective solution without requiring a pipeline rewrite.

How to eliminate wrong answers

Option A is wrong because a single large persistent cluster incurs high costs even when idle, and cannot efficiently handle a 10x growth without manual resizing, leading to either underutilization or performance bottlenecks. Option B is wrong because Cloud Data Fusion is a visual design tool that adds complexity and cost (via Dataproc provisioning) without inherent autoscaling or preemptible node benefits, and is not optimized for batch ETL cost control. Option D is wrong because Dataflow with flexRS is designed for batch workloads with flexible scheduling, but it requires rewriting the pipeline in Apache Beam, which adds migration overhead and may not leverage existing Dataproc investments; flexRS offers cost savings but with potential execution delays, making it less balanced for immediate performance needs.

92

MCQ · easy

A company is ingesting real-time sensor data from thousands of devices into Cloud Pub/Sub. They need to process this data with low latency (seconds) and exactly-once semantics. Which data processing service should they use?

A. Cloud Run with Pub/Sub push

B. Cloud Functions triggered by Pub/Sub

C. Dataflow streaming with exactly-once processing

D. Dataproc with Spark Streaming

Answer: C

Dataflow provides exactly-once processing for streaming data with low latency, ideal for real-time sensor data.

Why this answer

Dataflow streaming with exactly-once processing is the correct choice because it provides exactly-once semantics for Pub/Sub sources via checkpointing and idempotent sinks, and it meets the low-latency (seconds) requirement through its streaming engine that minimizes per-element overhead. Cloud Dataflow's integration with Pub/Sub ensures that each message is processed exactly once, even in the presence of failures, by using snapshots and consistent state management.

Exam trap

Google Cloud often tests the misconception that serverless services like Cloud Functions or Cloud Run inherently provide exactly-once processing, when in fact they rely on Pub/Sub's at-least-once delivery and require additional logic to achieve exactly-once semantics.

How to eliminate wrong answers

Option A is wrong because Cloud Run with Pub/Sub push does not guarantee exactly-once processing; Pub/Sub push delivery is at-least-once, and Cloud Run's stateless containers cannot enforce exactly-once semantics without external coordination. Option B is wrong because Cloud Functions triggered by Pub/Sub also uses at-least-once delivery from Pub/Sub and lacks built-in mechanisms for exactly-once processing; it is designed for lightweight, event-driven tasks, not for stateful streaming with exactly-once guarantees. Option D is wrong because Dataproc with Spark Streaming provides at-least-once or exactly-once semantics only with additional configuration (e.g., checkpointing and idempotent sinks), but it introduces higher latency (typically seconds to minutes) due to micro-batching and is not optimized for sub-second or low-latency streaming compared to Dataflow's streaming engine.
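
The "additional logic" that options A and B would need can be sketched as an idempotent handler. This is a minimal illustration in plain Python, not Cloud Run or Cloud Functions code: the `IdempotentHandler` class, the `message_id` field name, and the in-memory set are all assumptions made for the example. A real service would need a shared, durable store for the seen IDs.

```python
from typing import Callable, Dict, List, Set


class IdempotentHandler:
    """Wraps a handler so that redelivered messages are processed only once."""

    def __init__(self, process: Callable[[Dict], None]) -> None:
        self._process = process
        # Assumption: an in-memory set stands in for a durable, shared store.
        self._seen_ids: Set[str] = set()

    def handle(self, message: Dict) -> bool:
        """Returns True if the message was processed, False if it was skipped."""
        message_id = message["message_id"]
        if message_id in self._seen_ids:
            return False  # duplicate delivery, already handled
        self._process(message)
        self._seen_ids.add(message_id)
        return True


processed: List[Dict] = []
handler = IdempotentHandler(processed.append)

# At-least-once delivery: message m1 arrives twice.
deliveries = [
    {"message_id": "m1", "data": "21.5"},
    {"message_id": "m2", "data": "22.0"},
    {"message_id": "m1", "data": "21.5"},
]
results = [handler.handle(m) for m in deliveries]
```

Here the third delivery is skipped, so only two messages reach the wrapped handler. Dataflow performs this kind of bookkeeping for you, which is why it is the simpler fit for the stated requirements.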

93

MCQmedium

A data engineering team uses Cloud Pub/Sub to ingest clickstream events and Cloud Dataflow to process them. They need to maintain strict event ordering per user session, and the processing output must be written to a BigQuery table with exactly-once semantics. Which configuration should the team implement?

A.Enable message ordering in Pub/Sub with a session ID as the ordering key, and in Dataflow use a global window with a custom trigger that fires on watermark and uses a BigQuery sink with 'exactly-once' mode enabled.

B.Use a Pub/Sub pull subscription with a subscriber that acknowledges messages immediately after processing, and a Dataflow pipeline with a sliding window.

C.Assign a unique session ID as the message ordering key in Pub/Sub, use a Dataflow pipeline with session windows and .withAllowedLateness(0), and write to BigQuery using a batch load.

D.Use a Pub/Sub push subscription with an acknowledgment deadline of 600 seconds and enable exactly-once delivery on the subscription.

AnswerA

A is correct because Pub/Sub ordering keys maintain order per session, and Dataflow's exactly-once sink to BigQuery prevents duplicates when combined with deterministic triggers.

Why this answer

Option A is correct because it combines Pub/Sub message ordering (using a session ID as the ordering key) with Dataflow's exactly-once sink to BigQuery. The global window with a watermark-based trigger ensures all events for a session are processed in order before writing, while the BigQuery 'exactly-once' mode prevents duplicate rows even if the pipeline retries. This satisfies both strict per-session ordering and exactly-once semantics.

Exam trap

Google Cloud often tests the misconception that Pub/Sub's exactly-once delivery subscription alone guarantees end-to-end exactly-once processing, ignoring that Dataflow's sink configuration and windowing strategy are required for ordering and deduplication in the output.

How to eliminate wrong answers

Option B is wrong because acknowledging messages immediately after processing (auto-ack) can cause message loss if the pipeline fails before writing to BigQuery, breaking exactly-once semantics; sliding windows also do not maintain per-session ordering. Option C is wrong because session windows in Dataflow group events by gaps in activity, not by a fixed ordering key, and .withAllowedLateness(0) drops late events, risking incomplete sessions; batch loads to BigQuery do not provide exactly-once write semantics (they can produce duplicates on retry). Option D is wrong because Pub/Sub exactly-once delivery is supported only on pull subscriptions, not push subscriptions, and even where it applies it governs delivery from Pub/Sub to the subscriber, not processing downstream; a 600-second acknowledgment deadline does not guarantee ordering or exactly-once writes to BigQuery.
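
To see why session windows group by inactivity gaps and not by an ordering key, the following sketch splits one user's event timestamps into sessions. It is a plain-Python approximation of the idea, not Beam's implementation; the `sessionize` function and the sample timestamps are hypothetical.

```python
from typing import List


def sessionize(timestamps: List[int], gap_seconds: int) -> List[List[int]]:
    """Groups timestamps (seconds) into sessions separated by an inactivity gap."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        # Extend the current session only if the event arrives within the gap.
        if sessions and ts - sessions[-1][-1] < gap_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical click times for a single user, in seconds.
events = [0, 20, 45, 200, 230, 600]
sessions = sessionize(events, gap_seconds=60)
```

With a 60-second gap the six events fall into three sessions. The boundaries depend only on the time between events, so a late event that is dropped (as with an allowed lateness of 0) can change or truncate a session.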

94

Multi-Selecteasy

Which THREE Google Cloud services are considered fully managed serverless data processing services? (Choose THREE.)

Select 3 answers

A.Cloud Dataproc

B.Cloud Functions

C.Cloud Composer

D.Cloud Data Fusion

E.Cloud Dataflow

AnswersB, D, E

B is correct because Cloud Functions is a serverless compute service often used for data transformation.

Why this answer

Cloud Functions is a fully managed serverless data processing service because it executes code in response to events without requiring any server provisioning or management. It automatically scales from zero to thousands of instances based on incoming requests, and you pay only for compute time used while your code runs. This makes it ideal for lightweight, event-driven data processing tasks such as transforming data in Cloud Storage or reacting to Pub/Sub messages.

Exam trap

Google Cloud often tests the distinction between 'fully managed' and 'serverless'—the trap here is that Cloud Dataproc and Cloud Composer are fully managed (Google handles infrastructure) but still require you to manage cluster resources or worker nodes, so they are not serverless; candidates mistakenly equate 'fully managed' with 'serverless'.

95

MCQhard

A company is designing a data lake on Google Cloud. The data lake will store raw, curated, and analytics-ready data. Security requirements include: data must be encrypted at rest and in transit, access must be controlled based on data sensitivity (public, internal, confidential), and all access to sensitive data must be audited. The company also wants to minimize data transfer costs for frequently accessed curated datasets. Which combination of services and configurations best meets these requirements?

A.Use Cloud Storage with default encryption, bucket policies, and Cloud Audit Logs. For frequent access, use Cloud CDN.

B.Use Cloud Storage with CMEK, and use Cloud HSM for key storage. Use Cloud Audit Logs. Avoid caching to ensure security.

C.Use Cloud Storage with SSE-C, bucket policies, and Cloud Audit Logs. Use Cloud Load Balancing for caching.

D.Use Cloud Storage with CMEK, bucket-level IAM, and object ACLs. Use Cloud Data Loss Prevention API to classify data. Enable Cloud Audit Logs. Use Cloud CDN to cache curated datasets.

AnswerD

CMEK ensures customer-controlled encryption; IAM+ACLs give granular access; DLP inspects and classifies; audit logs capture access; CDN caches data for lower latency and cost.

Why this answer

Option D is correct because it combines CMEK for customer-controlled encryption at rest, bucket-level IAM and object ACLs for granular access control based on data sensitivity, the Cloud Data Loss Prevention API to classify data by sensitivity, Cloud Audit Logs for auditing access to sensitive data, and Cloud CDN to cache curated datasets, reducing data transfer costs for frequently accessed data. This configuration meets all security requirements (encryption at rest and in transit, access control, auditing) while optimizing cost for frequent access.

Exam trap

Google Cloud often tests the misconception that caching (Cloud CDN) is inherently insecure or that it cannot be used with sensitive data, but in reality, Cloud CDN can be secured with signed URLs, IAM, and encryption, and it is the correct way to reduce data transfer costs for frequently accessed data.

How to eliminate wrong answers

Option A is wrong because default encryption uses Google-managed keys, not customer-managed keys (CMEK), which may not satisfy compliance requirements for controlling encryption keys; Cloud CDN caches content at edge locations but does not reduce data transfer costs from Cloud Storage to the same region (it reduces egress for global distribution, not for frequent access within a region). Option B is wrong because 'Avoid caching to ensure security' contradicts the requirement to minimize data transfer costs for frequently accessed curated datasets; caching with Cloud CDN is secure when properly configured (e.g., signed URLs, IAM), and avoiding it increases costs. Option C is wrong because SSE-C (Server-Side Encryption with Customer-Provided Keys) requires the client to manage keys and is not integrated with Cloud HSM or Cloud KMS; Cloud Load Balancing does not cache data (it distributes traffic), so it does not reduce data transfer costs for frequent access.

96

MCQhard

You are designing a streaming pipeline using Cloud Dataflow with exactly-once semantics. The source is Pub/Sub and the sink is Cloud Bigtable. The pipeline must handle late data up to 10 minutes. You need to minimize cost while maintaining correctness. Which configuration should you use?

A.Fixed windows of 1 minute with allowed lateness 10 minutes and accumulating fired panes

B.Sliding windows of 1 minute with allowed lateness 10 minutes and accumulating fired panes

C.Global window with allowed lateness 10 minutes and trigger=afterWatermark with early firings

D.Session windows of 5 minutes with gap duration 1 minute and discarding fired panes

AnswerC

Global window with watermark-based triggers handles late data efficiently.

Why this answer

Option C is correct because a global window with an after-watermark trigger and early firings is the most cost-effective way to handle unbounded data from Pub/Sub with exactly-once semantics, while allowing up to 10 minutes of lateness. Fixed or sliding windows would create many small window states, increasing Bigtable write costs and shuffle overhead. The global window minimizes state and processing, and the trigger ensures results are emitted promptly without accumulating panes.

Exam trap

Google Cloud often tests the misconception that windowing is always required for streaming pipelines, but here the sink (Bigtable) stores individual records, so a global window with triggers is the most efficient and correct choice, not fixed or sliding windows.

How to eliminate wrong answers

Option A is wrong because fixed windows of 1 minute with accumulating panes would create a new window every minute, leading to excessive state and write amplification in Bigtable, increasing cost without benefit for a global sink. Option B is wrong because sliding windows of 1 minute would create overlapping windows, multiplying state and processing overhead even more than fixed windows, which is wasteful for a use case that doesn't require windowed aggregations. Option D is wrong because session windows (here with a 1-minute gap duration) and discarding panes are designed for grouping events by activity sessions, not for a simple streaming pipeline to Bigtable; discarding panes also risks losing late data that arrives within the 10-minute allowed lateness, violating correctness.
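
The state multiplication attributed to sliding windows can be made concrete by counting how many windows a single element lands in. The sketch below is plain Python, not Beam code. Option B does not state a slide period, so the 10-second period used here is an assumption for illustration, as are the function names.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in seconds


def fixed_windows(ts: int, size: int) -> List[Window]:
    """An element belongs to exactly one fixed window."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """An element belongs to every sliding window whose interval contains it."""
    windows: List[Window] = []
    start = ts - ts % period  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return windows


fixed = fixed_windows(125, size=60)
sliding = sliding_windows(125, size=60, period=10)
```

Under these assumed parameters, the element at second 125 is assigned to one fixed window but six sliding windows, so any per-window state is held six times over.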

97

MCQhard

A healthcare analytics company runs a nightly Dataproc workflow that reads radiology reports from Cloud Storage (CSV files), transforms them using PySpark, and writes results to BigQuery. The workflow is orchestrated by Cloud Composer. Recently, the job has started failing with 'Disk quota exceeded' errors on the worker nodes. The data volume has grown 5x over the past month. Currently, the cluster uses 5 n1-standard-4 workers (each 10GB persistent disk). The PySpark jobs heavily use intermediate shuffles. You need a cost-effective solution that avoids future failures as data grows. What should you do?

A.Upgrade the worker machine type to n1-standard-8 with local SSDs for shuffle storage.

B.Increase the persistent disk size on each worker node to 100 GB.

C.Add more preemptible workers to the cluster and keep boot disk size at 10GB.

D.Use Cloud Dataflow instead of Dataproc, as it handles disk management transparently.

AnswerB

More disk space per worker allows shuffles to complete without quota errors.

Why this answer

The 'Disk quota exceeded' error occurs because the 10 GB persistent disks on the n1-standard-4 workers are too small to accommodate the intermediate shuffle data, which has grown 5x. Increasing the persistent disk size to 100 GB directly addresses the storage bottleneck without changing the machine type or incurring the cost of local SSDs, making it a cost-effective solution that scales with data growth.

Exam trap

The trap here is that candidates may over-engineer the solution by upgrading machine types or switching to a different service (Dataflow) when the root cause is simply insufficient disk space for shuffle data, which is easily fixed by increasing the persistent disk size.

How to eliminate wrong answers

Option A is wrong because upgrading to n1-standard-8 with local SSDs is overkill and more expensive; the issue is disk space for shuffle data, not CPU or memory, and local SSDs are ephemeral and not cost-effective for persistent storage needs. Option C is wrong because adding more preemptible workers does not increase the persistent disk size per worker; each worker still has only 10 GB, so shuffle data will still exceed the disk quota on those nodes. Option D is wrong because migrating to Cloud Dataflow is a significant architectural change that incurs migration costs and learning curve, and it does not address the immediate disk quota issue in the existing Dataproc workflow; Dataflow also has its own disk management limits.

98

Multi-Selectmedium

A company uses Cloud Composer to orchestrate data pipelines. They have a DAG that runs hourly and processes files from Cloud Storage. The DAG is triggered by a Pub/Sub message sent from a Cloud Storage bucket notification. Recently, some DAG runs are not starting even though the Pub/Sub messages are published. Which two likely causes should the team investigate? (Choose TWO.)

Select 2 answers

A.The Cloud Storage bucket notification is not sending messages to the correct Pub/Sub topic, or the subscription's ack deadline is too short.

B.The DAG's start_date is set in the past and catchup is set to False, so DAG runs are only triggered on schedule.

C.The total number of DAGs in the environment exceeds the maximum limit of 100, causing DAG processing to stop.

D.The DAG's schedule interval is set too frequently, causing the executor queue to be full and new runs are skipped.

E.The Cloud Composer environment is using a pull subscription instead of a push subscription for the Pub/Sub sensor.

AnswersA, D

A is correct because misconfiguration of the notification or subscription can cause message loss.

Why this answer

Option A is correct because if the Cloud Storage bucket notification is misconfigured to send messages to the wrong Pub/Sub topic, the Pub/Sub sensor in the DAG will never receive the trigger message, causing DAG runs to not start. Additionally, if the subscription's ack deadline is too short, the message may be acknowledged before the sensor processes it, leading to message loss and missed triggers. Both issues directly prevent the DAG from being triggered by Pub/Sub messages.

Exam trap

Google Cloud often tests the misconception that a push subscription is required for Pub/Sub sensors in Cloud Composer, when in fact the sensor uses a pull subscription and the ack deadline is the critical parameter to manage.

99

Drag & Dropmedium

Drag and drop the steps to set up a BigQuery dataset with a scheduled query into the correct order.

Why this order

Scheduled queries allow automating recurring data transformations and loads.

100

MCQeasy

A company uses BigQuery for real-time analytics. They stream data from IoT devices into a BigQuery table. After a few hours, some of the recent data becomes visible in the table although it was streamed less than 10 minutes ago. The data team confirms that no one ran any manual queries. What is the most likely reason for the data visibility?

A.The data was stored in the streaming buffer for more than 24 hours, and BigQuery automatically flushes it to the table.

B.BigQuery time travel allows querying data from the past, including data still in the streaming buffer.

C.The table has an expiration set, and the data is made visible as soon as the table is about to expire.

D.The streaming buffer reached its maximum capacity (default 90 minutes) and automatically flushed the data to the table.

AnswerD

D is correct because the streaming buffer flushes data approximately every 90 minutes, making it visible.

Why this answer

Option D is correct because BigQuery's streaming buffer has a maximum capacity limit, typically around 90 minutes. When the buffer reaches this capacity, BigQuery automatically flushes the buffered data to the table, making it visible. This explains why data streamed less than 10 minutes ago became visible after a few hours.

Exam trap

The trap here is that candidates often assume streaming data is immediately visible or that time travel is responsible for visibility, but BigQuery's streaming buffer has a finite capacity that triggers automatic flushes, making data visible after a delay.

How to eliminate wrong answers

Option A is wrong because the streaming buffer does not have a 24-hour retention; data is flushed automatically within about 90 minutes or when the buffer reaches capacity, not after 24 hours. Option B is wrong because BigQuery time travel allows querying historical data within a 7-day window, but it does not cause data in the streaming buffer to become visible; it only affects how you query already-committed data. Option C is wrong because table expiration settings control when the table is deleted, not when streaming data becomes visible; data visibility is independent of table expiration.

101

MCQmedium

A company runs a Dataflow pipeline that reads from Pub/Sub, aggregates events in a 10-minute fixed window, and writes to BigQuery. Recently, the pipeline has been failing with 'high uncommitted bytes' errors during periods of high traffic. What is the most likely cause and recommended action?

A.Reduce the window size from 10 minutes to 1 minute to decrease the amount of data per window.

B.Increase the number of worker machines to handle higher throughput.

C.Use a global window with a trigger that fires early based on element count to reduce the number of open windows.

D.Set a maximum number of workers and use a Pub/Sub flow control setting to limit incoming messages.

AnswerC

A global window with early triggers can reduce the number of panes and mitigate the high uncommitted bytes problem.

Why this answer

The 'high uncommitted bytes' error in Dataflow occurs when the system holds too much data in memory across many open windows, exceeding the default 200 MB limit. Using a global window with an early trigger based on element count reduces the number of simultaneous open windows and allows data to be committed more frequently, preventing memory pressure. This approach is recommended over reducing window size or scaling workers because the root cause is window fan-out, not throughput or parallelism.

Exam trap

Google Cloud often tests the misconception that scaling workers or reducing window size solves memory pressure, when the real issue is the number of open windows in a stateful pipeline.

How to eliminate wrong answers

Option A is wrong because reducing the window size from 10 minutes to 1 minute increases the number of open windows (from 6 per hour to 60 per hour), which would worsen the 'high uncommitted bytes' issue by creating more in-memory state. Option B is wrong because increasing worker machines does not address the fundamental problem of excessive open windows consuming memory; it may temporarily mask the issue but will not reduce the per-worker uncommitted bytes. Option D is wrong because setting a maximum number of workers and Pub/Sub flow control limits incoming messages but does not reduce the number of open windows or the memory used by uncommitted data; it may cause backpressure and data loss without fixing the window state explosion.
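
The idea behind an element-count early trigger can be sketched as a buffer that emits a pane every N elements instead of holding everything until the window closes. This is a toy model in plain Python, not Beam's trigger API; the `CountTriggerBuffer` class and the value of `fire_every` are hypothetical.

```python
from typing import Any, List, Optional


class CountTriggerBuffer:
    """Toy model of a count-based early trigger with discarding panes."""

    def __init__(self, fire_every: int) -> None:
        self.fire_every = fire_every
        self.pending: List[Any] = []  # uncommitted elements held in memory

    def add(self, element: Any) -> Optional[List[Any]]:
        """Buffers one element; returns a pane once the count is reached."""
        self.pending.append(element)
        if len(self.pending) >= self.fire_every:
            pane, self.pending = self.pending, []
            return pane
        return None


buffer = CountTriggerBuffer(fire_every=3)
panes = [p for p in (buffer.add(i) for i in range(7)) if p is not None]
```

After seven elements, two panes of three have been emitted and only one element is still pending, so the amount of uncommitted data stays bounded by the chosen count.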

102

MCQmedium

A company is using Dataflow to stream data from Cloud Pub/Sub to BigQuery. The pipeline includes a custom ParDo transformation that enriches the data with external API calls. The pipeline is experiencing high latency and occasional failures due to API timeouts. What strategy should be employed to improve reliability and performance?

A.Remove the enrichment step and store raw data in BigQuery.

B.Use a global window to accumulate all data before enrichment.

C.Use a DoFn with stateful processing and batch API calls using asynchronous HTTP client.

D.Increase the number of workers to parallelize API calls.

AnswerC

Batching and async calls reduce per-element latency and handle timeouts gracefully.

Why this answer

Option C is correct because using a DoFn with stateful processing and an asynchronous HTTP client allows the pipeline to batch API calls and handle timeouts without blocking the main processing thread. This reduces latency by enabling concurrent requests and improves reliability through retry logic and state management, which is essential for external API enrichment in Dataflow.

Exam trap

Google Cloud often tests the misconception that scaling workers (Option D) is a universal fix for performance issues, but the trap here is that API timeouts are often caused by the external service's capacity, not the pipeline's parallelism, and stateful batching with async calls is the correct architectural pattern.

How to eliminate wrong answers

Option A is wrong because removing the enrichment step defeats the purpose of the pipeline and does not address the underlying issue of API call reliability. Option B is wrong because using a global window to accumulate all data before enrichment would introduce unbounded state and memory pressure, and it does not solve API timeout problems; it would also break the streaming nature of the pipeline. Option D is wrong because simply increasing the number of workers does not fix API timeouts; it may even exacerbate the problem by overwhelming the external API with more concurrent requests, leading to more failures.
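
The batching-plus-async pattern from option C can be sketched with the standard `asyncio` module. This is an illustration only, not a Beam DoFn: `fake_enrichment_api` is a hypothetical stand-in for the external service, and the batch size, timeout, and retry count are assumed values. In a real pipeline the grouping would be done by a stateful DoFn or a batching transform.

```python
import asyncio
from typing import Dict, List


async def fake_enrichment_api(batch: List[str]) -> Dict[str, str]:
    """Hypothetical external API: enriches a whole batch in one call."""
    await asyncio.sleep(0.01)  # simulated network latency
    return {key: key.upper() for key in batch}


async def enrich_batch(batch: List[str], timeout_s: float, retries: int) -> Dict[str, str]:
    """Calls the API with a timeout, retrying with a simple backoff."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(fake_enrichment_api(batch), timeout_s)
        except asyncio.TimeoutError:
            if attempt == retries:
                raise  # give up after the last retry
            await asyncio.sleep(0.01 * 2 ** attempt)
    raise RuntimeError("unreachable")


async def enrich_all(elements: List[str], batch_size: int) -> Dict[str, str]:
    """Splits elements into batches and issues the batch calls concurrently."""
    batches = [elements[i:i + batch_size] for i in range(0, len(elements), batch_size)]
    results = await asyncio.gather(
        *(enrich_batch(b, timeout_s=1.0, retries=2) for b in batches)
    )
    merged: Dict[str, str] = {}
    for result in results:
        merged.update(result)
    return merged


enriched = asyncio.run(enrich_all(["a", "b", "c", "d", "e"], batch_size=2))
```

Five elements result in three API calls instead of five, and the calls overlap in time, which is the latency and load reduction the answer describes.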

103

Multi-Selecteasy

Which TWO options can help reduce costs for a Dataflow batch pipeline that processes 100 GB of data daily from Cloud Storage? (Choose 2)

Select 2 answers

A.Use Dataflow Prime (now Dataflow Runner v2)

B.Use high-memory machine types

C.Use Streaming Engine

D.Use FlexRS (Flexible Resource Scheduling)

E.Use preemptible VMs for Dataflow workers

AnswersD, E

FlexRS offers discounted pricing for batch jobs that are flexible on start time.

Why this answer

FlexRS (Flexible Resource Scheduling) allows you to run batch workloads on a discounted, flexible schedule. It reduces costs by offering lower prices in exchange for the job being able to wait up to 6 hours for resources to become available. This is ideal for a daily 100 GB batch pipeline that can tolerate some scheduling delay.

Exam trap

Google Cloud often tests the distinction between batch and streaming optimizations, so the trap here is that candidates might select Streaming Engine (Option C) thinking it reduces costs in batch pipelines, when it is only relevant for streaming.

104

MCQmedium

A company is building a real-time streaming pipeline using Pub/Sub and Dataflow to process clickstream data. The pipeline writes aggregated metrics to BigQuery every 10 seconds using a fixed window. During peak traffic, some windows produce duplicate rows in BigQuery. What is the most likely cause?

A.Dataflow is retrying BigQuery streaming inserts after a timeout, and the retries succeed even though the original insert succeeded.

B.The pipeline uses default triggers instead of after-watermark triggers.

C.The fixed window duration is too short, causing overlapping windows.

D.The pipeline is using too many Dataflow workers, causing load balancing issues.

AnswerA

This is a known scenario: BigQuery streaming inserts are not idempotent, and retries can lead to duplicates.

Why this answer

Option A is correct because Dataflow uses at-least-once semantics for streaming inserts into BigQuery. When a streaming insert times out, Dataflow retries the insert, and if the original insert actually succeeded but the acknowledgment was lost, the retry produces a duplicate row. This is a known behavior of BigQuery streaming inserts with retry logic.

Exam trap

The trap here is that candidates often confuse trigger behavior (Option B) with the root cause of duplicates, not realizing that duplicates stem from retry semantics in the sink, not from windowing or parallelism.

How to eliminate wrong answers

Option B is wrong because the default trigger in Dataflow already fires when the watermark passes the end of the window, so it is itself an after-watermark trigger; triggers affect when results are emitted, not whether the sink receives duplicates. Option C is wrong because fixed windows of 10 seconds do not overlap by design; overlapping windows would require a sliding window, not a fixed window. Option D is wrong because using too many Dataflow workers can cause resource inefficiency or shuffle issues, but it does not directly cause duplicate rows in BigQuery output.
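
The retry scenario in option A can be simulated with a toy sink. This sketch is plain Python and does not call BigQuery: `FakeSink`, `write_with_retry`, and the insert ID value are hypothetical. It assumes the first insert is stored but its acknowledgment is lost, and it shows how a per-row insert ID lets the sink drop the retried row.

```python
from typing import Dict, List, Set


class FakeSink:
    """Toy stand-in for a streaming sink; not the BigQuery API."""

    def __init__(self, dedupe: bool) -> None:
        self.rows: List[Dict] = []
        self._dedupe = dedupe
        self._seen: Set[str] = set()

    def insert(self, row: Dict, insert_id: str, lose_ack: bool = False) -> bool:
        """Stores the row; returns False when the acknowledgment is 'lost'."""
        if not (self._dedupe and insert_id in self._seen):
            self.rows.append(row)
            self._seen.add(insert_id)
        return not lose_ack


def write_with_retry(sink: FakeSink, row: Dict, insert_id: str) -> None:
    """First attempt is stored but unacknowledged, so the writer retries."""
    acked = sink.insert(row, insert_id, lose_ack=True)
    if not acked:
        sink.insert(row, insert_id)


row = {"window_end": "12:00:10", "clicks": 42}

plain = FakeSink(dedupe=False)
write_with_retry(plain, row, "w-120010")

deduped = FakeSink(dedupe=True)
write_with_retry(deduped, row, "w-120010")
```

The sink without de-duplication ends up with two copies of the row, while the one keyed on insert ID keeps a single copy.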
