CCNA Design Data Systems Questions


76
MCQ · easy

A team needs to migrate an existing on-premises Hadoop Hive workload to Google Cloud. They want to minimize code changes and use a managed service for transient clusters. Which service should they choose?

A. Cloud Dataflow
B. Cloud Dataprep
C. Cloud Dataproc
D. BigQuery
Answer: C

Dataproc is fully compatible with Hadoop/Hive and offers ephemeral clusters with minimal code changes.

Why this answer

Cloud Dataproc is the correct choice because it is a managed Spark and Hadoop service that supports Hive workloads natively, allowing you to run existing Hive scripts with minimal changes. It also supports transient (ephemeral) clusters, which are created for a job and deleted when it finishes, matching the stated requirement.

Exam trap

The trap here is that candidates often confuse Cloud Dataflow's ability to process batch data with Hadoop compatibility, but Dataflow does not support Hive or transient Hadoop clusters, making Dataproc the only correct option for minimizing code changes.

How to eliminate wrong answers

Option A is wrong because Cloud Dataflow is a unified stream and batch data processing service based on Apache Beam, not designed for Hive workloads or transient Hadoop clusters. Option B is wrong because Cloud Dataprep is a data preparation and cleaning service (based on Trifacta) that does not run Hive or provide transient clusters. Option D is wrong because BigQuery is a serverless data warehouse that does not support Hive execution engines or transient clusters; migrating Hive to BigQuery would require significant code changes.

77
Multi-Select · medium

A data engineer is monitoring a Dataflow streaming pipeline and notices that the 'System Lag' metric is increasing. Which TWO actions should be taken to diagnose the issue?

Select 2 answers
A. Check the Dataflow monitoring UI for each stage's throughput and backlog.
B. Cancel the pipeline and restart with a larger initial worker count.
C. Increase the maximum number of workers to handle backlog.
D. Examine the worker logs for error messages or stack traces.
E. Increase the BigQuery quota for streaming inserts.
Answers: A, D

Identifies bottleneck stages.

Why this answer

Option A is correct because the Dataflow monitoring UI provides per-stage metrics such as throughput and backlog, which directly indicate where data is accumulating. By examining these metrics, you can identify the specific stage causing the increasing system lag, enabling targeted troubleshooting without unnecessary pipeline changes. Option D is correct because worker logs surface exceptions, repeated retries, and stack traces that explain why the slow stage is stalling.

Exam trap

Google Cloud often tests the distinction between diagnostic actions and remedial actions; the trap here is that candidates confuse scaling up workers (a fix) with diagnosing the root cause of the lag.

78
MCQ · hard

A Dataflow streaming job is processing high-volume sensor data from thousands of IoT devices. The job uses global windows with a 10-minute processing time trigger. Recently, the job's CPU utilization is nearly 100% and it is falling behind. Which action is most likely to reduce CPU load while maintaining data freshness?

A. Increase the number of workers to distribute the load.
B. Change the trigger to event time with a 10-minute allowed lateness.
C. Replace GroupByKey with Combine.globally and use a fanout.
D. Use side inputs to broadcast a static lookup table to all workers.
Answer: C

Combine.globally with fanout pre-aggregates elements on intermediate workers before the final merge, so no single worker has to combine the full stream.

Why this answer

Option C is correct because using `Combine.globally` with a fanout reduces the amount of data shuffled and merged in a single worker, lowering CPU load. In Dataflow, `GroupByKey` triggers a full shuffle and per-key aggregation, which is expensive for high-volume sensor data; `Combine.globally` with a fanout performs partial aggregation on each worker before a final merge, reducing network I/O and CPU cycles. This maintains data freshness because the 10-minute processing time trigger still fires on time, but with less per-element overhead.
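The two-stage aggregation described above can be sketched in plain Python. This is only an illustration of the idea, not Beam code: the function names `partial_sums` and `combine_with_fanout` and the fanout value are hypothetical, and in a real pipeline the runner decides how elements are spread across the intermediate combiners.

```python
from typing import Iterable, List


def partial_sums(readings: Iterable[float], fanout: int) -> List[float]:
    """Stage 1: spread elements over `fanout` intermediate accumulators."""
    accumulators = [0.0] * fanout
    for i, value in enumerate(readings):
        # Round-robin assignment stands in for the runner's own sharding.
        accumulators[i % fanout] += value
    return accumulators


def combine_with_fanout(readings: Iterable[float], fanout: int = 4) -> float:
    """Stage 2: merge the short list of partial results into one value."""
    return sum(partial_sums(readings, fanout))
```

The final merge only touches `fanout` partial results rather than every element, which is where the reduction in per-worker load comes from.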

Exam trap

Google Cloud often tests the misconception that scaling out workers (Option A) is the universal fix for performance issues, but the trap here is that the real bottleneck is the shuffle-heavy `GroupByKey` operation, not worker count.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers distributes load but does not address the root cause—the high CPU cost of per-key grouping and shuffling in `GroupByKey`; it may temporarily reduce backlog but adds cost and can still hit scaling limits. Option B is wrong because changing to event time with allowed lateness does not reduce CPU utilization; it only changes watermark semantics and may increase state size, worsening CPU pressure. Option D is wrong because using side inputs to broadcast a static lookup table does not reduce the CPU cost of the aggregation step; it adds memory overhead and does not address the shuffle bottleneck.

79
Multi-Select · hard

A payment processing company needs to detect fraudulent transactions in real time. The system must have sub-second latency for high-value transactions and use a machine learning model. Which two components should be part of the architecture? (Choose TWO.)

Select 2 answers
A. Cloud Storage for transaction logs
B. Bigtable to store user profiles and transaction history for fast lookups
C. Dataflow for stream processing with sliding windows
D. Cloud SQL to store reference data
E. Cloud Functions for long-running batch model training
Answers: B, C

Bigtable offers single-digit millisecond latency for point lookups, essential for real-time fraud scoring.

Why this answer

Bigtable is a fully managed, scalable NoSQL database that provides consistent sub-10ms latency for high-throughput read/write operations, making it ideal for real-time lookups of user profiles and transaction history in fraud detection. Its ability to handle large volumes of data with low latency supports the sub-second requirement for high-value transactions. Dataflow supplies the stream-processing layer: it consumes transactions as they arrive, computes features over sliding windows, and passes them to the model for scoring.

Exam trap

Google Cloud often tests the distinction between storage services optimized for real-time access (Bigtable) versus batch/archive (Cloud Storage) and between stream processing (Dataflow) versus batch processing or short-lived compute (Cloud Functions).

80
MCQ · easy

A startup wants to build a data lake on Google Cloud using Cloud Storage. They need to store raw data in its original format for future analysis. Which storage class should they use to optimize for cost given that data will be accessed occasionally after the first month?

A. Nearline storage class
B. Coldline storage class
C. Standard storage class
D. Archive storage class
Answer: A

Optimized for data accessed less than once a month, cost-effective.

Why this answer

Nearline storage class is the optimal choice because it offers low-cost storage for data accessed less than once a month, with a 30-day minimum storage duration. Since the data is accessed occasionally after the first month, Nearline provides significant cost savings over Standard while still offering low-latency access (milliseconds) suitable for analytics. Coldline and Archive have lower storage costs but impose higher retrieval fees and minimum storage durations (90 and 365 days respectively), making them more expensive for data that is accessed occasionally within the first year.

Exam trap

Google Cloud often tests the misconception that lower storage cost always means lower total cost, ignoring the impact of retrieval fees and minimum storage duration penalties, which can make Coldline or Archive more expensive for data accessed occasionally within the first year.

How to eliminate wrong answers

Option B (Coldline) is wrong because it is designed for data accessed less than once a quarter (90-day minimum storage duration) and has higher retrieval costs, making it more expensive than Nearline for data accessed occasionally after the first month. Option C (Standard) is wrong because it is optimized for frequently accessed data (no minimum storage duration) and has the highest storage cost, which is not cost-effective for data that is only accessed occasionally. Option D (Archive) is wrong because it is intended for long-term archival data accessed less than once a year (365-day minimum storage duration). Archive objects are still readable within milliseconds, but the class has the highest retrieval fees and early-deletion charges, making it unsuitable for occasional access within a year.

81
MCQ · medium

A data engineer is designing a batch data pipeline that reads Avro files from Cloud Storage, transforms data using Apache Beam, and writes to BigQuery. The pipeline must handle daily runs and backfills. Which runner should they use?

A. FlinkRunner
B. DataflowRunner
C. SparkRunner
D. DirectRunner
Answer: B

DataflowRunner is a fully managed service that supports batch pipelines, backfills, and direct integration with GCS and BigQuery.

Why this answer

DataflowRunner is the correct choice because it is the fully managed service runner for Apache Beam on Google Cloud, optimized for batch and streaming pipelines. It automatically handles scaling, resource management, and exactly-once processing semantics, which are essential for reliable daily runs and backfills with Avro files from Cloud Storage and BigQuery sinks.

Exam trap

The trap here is that candidates may confuse the runner with the execution engine, assuming that any distributed runner (Flink, Spark) is suitable for production, when the question specifically tests knowledge of Google Cloud-native services and the need for managed infrastructure for batch pipelines with backfills.

How to eliminate wrong answers

Option A is wrong because FlinkRunner runs Beam pipelines on Apache Flink clusters, which the team would have to provision and manage itself instead of relying on the serverless Dataflow service. Option C is wrong because SparkRunner runs Beam pipelines on Apache Spark, which likewise means operating a Spark cluster (for example on Dataproc) rather than using a fully managed Beam service. Option D is wrong because DirectRunner is intended for local testing and development only, not for production workloads or handling large-scale daily runs and backfills.

82
MCQ · hard

A financial services company uses Cloud Pub/Sub with ordering keys to process transactions in order. Some messages are failing processing and getting stuck. The team wants to ensure that if a message fails, it can be reprocessed later without blocking subsequent messages. What should they implement?

A. Create multiple subscriptions for the same topic
B. Use a pull subscription with flow control settings
C. Configure a dead letter topic and handle the failed message separately
D. Increase the acknowledgment deadline to 600 seconds
Answer: C

Dead letter topics isolate failing messages by forwarding them elsewhere for later reprocessing.

Why this answer

Option C is correct because a dead letter topic (DLT) allows failed messages to be moved aside after exhausting retry attempts, so they do not block the processing of subsequent ordered messages. In Cloud Pub/Sub, ordering keys require messages with the same key to be delivered in order; if a message fails and is not acknowledged, it blocks all later messages with the same key. By configuring a dead letter topic, the failed message is automatically forwarded to the DLT once it reaches the configured maximum number of delivery attempts (5 by default), and the original subscription can continue processing the next messages in order.

The team can then reprocess the failed message from the DLT separately, without affecting the order of other messages.
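The unblocking behavior can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the Pub/Sub client API: `drain_ordered`, `handler`, and the returned lists are hypothetical names, and in the real service the forwarding to the dead letter topic is done by Pub/Sub itself.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain_ordered(
    messages: List[str],
    handler: Callable[[str], bool],
    max_delivery_attempts: int = 5,
) -> Tuple[List[str], List[str]]:
    """Process the messages of one ordering key strictly in order.

    Returns (processed, dead_lettered). A message that keeps failing is
    retried up to `max_delivery_attempts` times, then set aside so that
    later messages with the same key are no longer blocked.
    """
    queue: Deque[str] = deque(messages)
    processed: List[str] = []
    dead_lettered: List[str] = []
    while queue:
        message = queue.popleft()
        for _attempt in range(max_delivery_attempts):
            if handler(message):  # True stands for a successful ack
                processed.append(message)
                break
        else:
            # Every attempt failed: forward to the dead-letter list.
            dead_lettered.append(message)
    return processed, dead_lettered
```

Without the `else` branch, the loop would have to keep retrying the failing message forever, which is the stuck state the question describes.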

Exam trap

Google Cloud often tests the misconception that increasing the acknowledgment deadline or adding flow control can resolve stuck messages with ordering keys, but the real solution is to use a dead letter topic to offload the failing message and unblock the ordered stream.

How to eliminate wrong answers

Option A is wrong because creating multiple subscriptions for the same topic does not solve the blocking issue; each subscription independently receives all messages, but within a single subscription, ordering keys still cause a failed message to block subsequent messages with the same key. Option B is wrong because pull subscriptions with flow control settings only limit the rate of message delivery and do not handle failed messages that are stuck; they do not provide a mechanism to move failed messages out of the way to unblock ordering. Option D is wrong because increasing the acknowledgment deadline to 600 seconds only gives the subscriber more time to process a message before it is redelivered, but it does not prevent a persistently failing message from blocking subsequent ordered messages indefinitely.

83
MCQ · medium

A retail company processes real-time clickstream data using Cloud Pub/Sub and Dataflow. The pipeline aggregates events by user session and writes to Bigtable for low-latency queries. However, users report that session data is sometimes missing or duplicated. What is the most likely cause?

A. Session windowing is configured with too short a gap duration.
B. Bigtable schema design causes row key collisions.
C. Dataflow's default behavior discards late-arriving data.
D. Pub/Sub provides at-least-once delivery, and Dataflow does not deduplicate by default.
Answer: D

At-least-once delivery produces duplicates unless the pipeline deduplicates.

Why this answer

D is correct because Pub/Sub offers at-least-once delivery, meaning the same message may be delivered multiple times. Dataflow does not automatically deduplicate messages unless explicitly configured (e.g., using idempotent sinks or custom deduplication logic). Without deduplication, the same session event can be processed more than once, leading to duplicate session data in Bigtable.
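A minimal sketch of custom deduplication logic, assuming each event carries a stable message ID. The function name `deduplicate` and the `(message_id, payload)` tuple shape are hypothetical, and a real streaming pipeline would need to bound the set of remembered IDs (for example with time-limited state) rather than keep them all in memory.

```python
from typing import Dict, Iterable, List, Set, Tuple


def deduplicate(
    events: Iterable[Tuple[str, Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Keep the first payload seen for each message ID, in arrival order."""
    seen_ids: Set[str] = set()
    unique: List[Dict[str, str]] = []
    for message_id, payload in events:
        if message_id in seen_ids:
            continue  # redelivery of a message that was already handled
        seen_ids.add(message_id)
        unique.append(payload)
    return unique
```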

Exam trap

Google Cloud often tests the misconception that Pub/Sub provides exactly-once delivery or that Dataflow automatically deduplicates messages from Pub/Sub, when in fact Pub/Sub is at-least-once and Dataflow requires explicit deduplication for idempotent processing.

How to eliminate wrong answers

Option A is wrong because a short gap duration would cause sessions to be split prematurely, leading to fragmented or seemingly missing data (events not grouped into the same session), not duplicates. Option B is wrong because writes that share a row key in Bigtable update the same row rather than creating a second one, so a key collision would overwrite data instead of duplicating it. Option C is wrong because late-data handling depends on the windowing configuration: with the default allowed lateness of zero, data that arrives after the watermark passes the end of its window is dropped, which could explain missing data but not duplicates.

84
Multi-Select · medium

A data engineer is designing a BigQuery table for time-series data that will be queried frequently by time range and also by a customer_id. Which TWO design decisions will improve query performance and manage costs? (Choose two.)

Select 2 answers
A. Partition the table by day on the timestamp column
B. Cluster the table on customer_id
C. Disable automatic reclustering to save costs
D. Set partition expiration to 1 year
E. Use nested repeated fields for customer data
Answers: A, B

Enables partition pruning for time-range queries.

Why this answer

Partitioning the table by day on the timestamp column allows BigQuery to prune partitions when queries filter by a time range, scanning only the relevant partitions instead of the entire table. This directly reduces the amount of data read, improving query performance and lowering costs. Clustering on customer_id then sorts the data within each partition, so filters on customer_id read fewer blocks.

Exam trap

Google Cloud often tests the misconception that disabling automatic reclustering saves costs. In reality, automatic reclustering is free and essential for maintaining clustering benefits, while partition expiration is a lifecycle management feature, not a performance optimization.

85
MCQ · easy

A company needs to process large files (100GB each) from Cloud Storage using Dataproc. They want to minimize job execution time. Which configuration is most appropriate?

A. Use a single-node cluster
B. Use a cluster with preemptible worker nodes and high-CPU machine types
C. Use HDFS for input data to avoid network latency
D. Use a cluster with many standard worker nodes
Answer: B

Preemptible VMs reduce cost; high-CPU machine types speed up compute-bound work.

Why this answer

Option B is correct because preemptible worker nodes are significantly cheaper than standard nodes, allowing you to scale out the cluster with many more workers for the same cost, which directly reduces job execution time for embarrassingly parallel data processing tasks. High-CPU machine types suit compute-intensive Dataproc jobs such as data transformation or machine learning because they provide more vCPUs relative to memory. This combination maximizes parallelism and minimizes wall-clock time for large-scale batch jobs.

Exam trap

The trap here is that candidates often assume standard worker nodes are always better for performance, ignoring the cost-benefit of preemptible nodes that allow scaling to many more workers for the same budget, which directly reduces execution time for parallelizable jobs.

How to eliminate wrong answers

Option A is wrong because a single-node cluster lacks parallelism, so processing 100GB files would be severely bottlenecked by a single machine's CPU and memory, leading to long execution times. Option C is wrong because HDFS is not used for input data from Cloud Storage; Dataproc reads directly from Cloud Storage via the gs:// connector, and using HDFS would require copying data first, adding network latency and storage overhead. Option D is wrong because using many standard worker nodes is less cost-effective than using preemptible nodes; standard nodes are more expensive, so for the same budget you can provision fewer workers, resulting in longer job execution times compared to a larger cluster of preemptible nodes.

86
Multi-Select · hard

A company uses Cloud Composer to orchestrate Dataproc and BigQuery jobs. They need to implement retry logic for transient failures. Which THREE features can help?

Select 3 answers
A. Dataflow pipeline retries
B. DAG retry_delay
C. BigQuery job retries
D. Cloud Composer high availability
E. Task retries and retry_delay
Answers: B, C, E

Setting `retry_delay` at the DAG level (typically through `default_args`) controls how long Airflow waits between task retries.

Why this answer

Option B is correct because Cloud Composer (Apache Airflow) allows setting `retry_delay` at the DAG level to define the time delay between task retries. This is a native Airflow feature that helps handle transient failures by automatically retrying failed tasks after a specified delay, reducing manual intervention.
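A minimal sketch of the retry settings mentioned above. The keys `retries` and `retry_delay` are Airflow's task-level argument names; the values and the helper `total_retry_wait` are illustrative assumptions. In a DAG file, a dictionary like this would typically be passed as `default_args` so that every task inherits it.

```python
from datetime import timedelta

# Illustrative values: retry each failed task up to 3 times,
# waiting 5 minutes between attempts.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}


def total_retry_wait(args: dict) -> timedelta:
    """Worst-case time spent waiting between attempts if every retry fails."""
    return args["retries"] * args["retry_delay"]
```

Computing the worst-case wait is a quick sanity check that the retry budget still fits inside the pipeline's scheduling window.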

Exam trap

The trap here is confusing infrastructure-level high availability (Option D) with application-level retry logic, leading candidates to select HA as a retry mechanism when it only ensures environment uptime, not task-level failure recovery.

87
MCQ · medium

An e-commerce company runs a daily batch pipeline that processes clickstream data from Cloud Storage using Cloud Dataproc with Spark. The pipeline includes a join between a large fact table and a small dimension table. The dimension table is stored in Cloud Storage as a CSV file. The join is slow due to shuffling. The data engineer considers broadcasting the dimension table. However, the dimension table is updated daily and the pipeline reads the latest version. What is the best approach to implement this optimization?

A. Use DataFrame.join with broadcast hint on the dimension DataFrame
B. Read the fact table and dimension table into separate DataFrames and use standard join
C. Read the dimension table as an RDD and collect as a map, then use map-side join
D. Increase the spark.sql.autoBroadcastJoinThreshold to a large value
Answer: A

Forces broadcast join regardless of table size.

Why this answer

Option A is correct because broadcasting the small dimension table using the broadcast hint (e.g., `broadcast(dimensionDF)`) forces Spark to replicate the dimension data to all executor nodes, eliminating the need for a shuffle during the join. This is ideal when the dimension table is small enough to fit in executor memory, and since the pipeline reads the latest CSV daily, the broadcast will automatically use the updated data without additional code changes.

Exam trap

The trap here is that candidates may think increasing `spark.sql.autoBroadcastJoinThreshold` is a safe global fix, but it can cause memory pressure and does not guarantee a broadcast join if the table size fluctuates, whereas the explicit broadcast hint provides deterministic behavior.

How to eliminate wrong answers

Option B is wrong because a standard join without any hint or optimization will trigger a full shuffle of both datasets, which is exactly the performance problem described. Option C is wrong because manually collecting the dimension table as an RDD and using a map-side join is an outdated, error-prone approach that bypasses Spark SQL's Catalyst optimizer and broadcast join optimizations; it also requires manual handling of updates and memory management. Option D is wrong because increasing `spark.sql.autoBroadcastJoinThreshold` globally may cause the dimension table to be broadcast automatically, but it does not guarantee the join uses a broadcast if the table size exceeds the threshold, and it can lead to out-of-memory errors if the threshold is set too high without considering executor memory limits.

88
MCQ · easy

A data engineer tries to grant a service account read access to a Cloud Storage bucket using the IAM policy above. The service account still cannot read objects. What is the most likely reason?

A. The role does not include the necessary permission
B. The condition prevents access because the request time is after 2023
C. The service account is misspelled
D. The role should be roles/storage.admin
Answer: B

The condition expression requires request.time before 2023, which is likely no longer true.

Why this answer

Option B is correct because the IAM condition explicitly restricts access to requests made before January 1, 2023. Since the current time is after that date, the condition evaluates to false, denying the service account's read access regardless of the role binding. IAM conditions are evaluated at request time, and if the condition is not met, the permission is not granted.
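The policy itself is not reproduced here, so the following is only a sketch of how such a time-bound condition behaves, assuming the expression is equivalent to `request.time < timestamp("2023-01-01T00:00:00Z")`. The names `CUTOFF` and `condition_allows` are hypothetical.

```python
from datetime import datetime, timezone

# Assumed cutoff, taken from the explanation above (start of 2023, UTC).
CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)


def condition_allows(request_time: datetime) -> bool:
    """Mimic a condition that is true only for requests before the cutoff."""
    return request_time < CUTOFF
```

Because the check runs on every request, a binding that worked in 2022 stops granting access once the clock passes the cutoff, even though the role and member are unchanged.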

Exam trap

Google Cloud often tests the subtlety that IAM conditions are evaluated at request time and can override a valid role binding, leading candidates to mistakenly focus on the role's permissions rather than the condition's effect.

How to eliminate wrong answers

Option A is wrong because roles/storage.objectViewer includes the storage.objects.get permission required to read objects, so the role does include the necessary permission. Option C is wrong because a misspelled service account would result in the role not being bound at all, but the question states the policy was applied, implying the service account name is correct. Option D is wrong because roles/storage.admin is an overly permissive role that includes many additional permissions beyond read access; the issue is not the role's permissions but the condition blocking access.

89
Multi-Select · medium

Which TWO statements are correct about designing a data pipeline using Cloud Dataflow for processing unbounded data?

Select 2 answers
A. Watermarks are used to measure the progress of event time.
B. Triggers can only emit results at the end of a window.
C. Dataflow guarantees exactly-once processing for streaming pipelines.
D. Cloud Pub/Sub is the recommended source for streaming pipelines.
E. Fixed windows are always based on processing time.
Answers: A, D

Watermarks track event time progress.

Why this answer

Watermarks in Cloud Dataflow measure the progress of event time, indicating when all data up to a certain timestamp is expected to have arrived. This allows the pipeline to handle late-arriving data and determine when to close windows for unbounded data streams.
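As a simplified model of how the watermark decides when a window closes, the check below treats all times as seconds of event time. The function name `window_accepts` and its parameters are hypothetical, and real pipelines also involve triggers and panes that this sketch ignores.

```python
def window_accepts(
    window_end: float,
    watermark: float,
    allowed_lateness: float = 0.0,
) -> bool:
    """Return True while a window can still accept elements.

    The window is treated as closed once the watermark has moved past
    the window end plus the allowed lateness.
    """
    return watermark <= window_end + allowed_lateness
```

For example, with a window ending at 600 and the watermark at 650, an element is accepted only if the allowed lateness is at least 50.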

Exam trap

Google Cloud often tests the misconception that triggers only fire at window boundaries, when in fact Dataflow supports early, on-time, and late firings for flexible result emission.

90
MCQ · medium

A data team runs regular analytical queries on a BigQuery table that stores 2 years of sales data (approximately 10 TB). Queries frequently filter on a `sale_date` column and also group by `product_id`. To optimize cost and performance, which design approach is most effective?

A. Do not partition; only cluster by `sale_date`.
B. Partition by `sale_date` and set a table expiration of 90 days.
C. Partition the table by `sale_date` and cluster by `product_id`.
D. Partition by `product_id` and cluster by `sale_date`.
Answer: C

Partitioning by date enables partition elimination on date filters; clustering by product_id co-locates rows with the same product_id within each partition, improving GROUP BY performance.

Why this answer

Option C is correct because partitioning by `sale_date` allows BigQuery to perform partition pruning, eliminating scans of irrelevant date ranges, while clustering by `product_id` physically co-locates rows with the same product ID within each partition. This combination minimizes the data scanned for queries that filter on `sale_date` and group by `product_id`, directly reducing both cost (bytes billed) and query latency.
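The effect of partition pruning can be imitated with a toy in-memory table. This is a sketch of the idea rather than how BigQuery stores data: the `Table` layout and the functions `rows_scanned` and `total_by_product` are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Tuple[str, float]          # (product_id, amount)
Table = Dict[date, List[Row]]    # one entry per sale_date partition


def rows_scanned(table: Table, start: date, end: date) -> int:
    """Count rows read when only partitions in [start, end] are opened."""
    return sum(len(rows) for day, rows in table.items() if start <= day <= end)


def total_by_product(table: Table, start: date, end: date) -> Dict[str, float]:
    """Group by product_id, skipping partitions outside the date filter."""
    totals: Dict[str, float] = {}
    for day, rows in table.items():
        if not (start <= day <= end):
            continue  # pruned partition: its rows are never read
        for product_id, amount in rows:
            totals[product_id] = totals.get(product_id, 0.0) + amount
    return totals
```

A query over one day touches only that day's rows, however many days the table holds, which is why the date filter lowers the bytes billed.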

Exam trap

Google Cloud often tests the misconception that partitioning by a high-cardinality column like `product_id` is acceptable, but the trap here is that BigQuery enforces a hard limit of 4,000 partitions per table, making such a design infeasible and forcing candidates to recognize that clustering is the correct mechanism for high-cardinality grouping columns.

How to eliminate wrong answers

Option A is wrong because without partitioning, BigQuery must scan the entire 10 TB table even for queries filtering on a narrow date range, leading to unnecessarily high costs and slower performance. Option B is wrong because setting a table expiration of 90 days would delete historical data needed for 2-year analysis, and partitioning alone without clustering does not optimize the GROUP BY on `product_id` within each partition. Option D is wrong because partitioning by `product_id` (a high-cardinality column) would create millions of tiny partitions, exceeding BigQuery's partition limit (4,000 partitions per table) and causing poor performance and management overhead.

91
Drag & Drop · medium

Drag and drop the steps to create a Cloud Storage bucket with uniform bucket-level access into the correct order.


Why this order

Uniform bucket-level access simplifies permissions by using IAM policies at the bucket level instead of ACLs.

92
Multi-Select · hard

A company is designing a data lake on Cloud Storage for analytics. They need to store data in various formats (Avro, Parquet, CSV) and enable efficient querying with BigQuery and Dataproc. Which THREE practices should they follow?

Select 3 answers
A. Use BigLake to create BigQuery tables that reference Cloud Storage data.
B. Store data in columnar formats like Parquet for analytics workloads.
C. Disable encryption on the bucket to improve read performance.
D. Partition data by date in a logical folder structure (e.g., /data/yyyy/mm/dd).
E. Store all data in CSV format for simplicity.
Answers: A, B, D

Enables querying data without loading.

Why this answer

BigLake allows you to create BigQuery tables that reference data stored in Cloud Storage, enabling unified governance and fine-grained access control without moving data. This is essential for a data lake architecture where BigQuery and Dataproc need to query the same underlying data in various formats like Avro, Parquet, and CSV. Columnar formats such as Parquet let analytical queries read only the columns they need, and a date-based folder structure lets jobs read only the days they need.
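The date-based folder layout from option D can be generated with a small helper. This is a sketch: the function name `date_prefix` and the `data` root are assumptions based on the `/data/yyyy/mm/dd` example in the option text.

```python
from datetime import date


def date_prefix(day: date, root: str = "data") -> str:
    """Build an object-name prefix of the form root/yyyy/mm/dd/."""
    return f"{root}/{day:%Y/%m/%d}/"
```

Zero-padded month and day components keep the prefixes in chronological order when listed lexicographically.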

Exam trap

Google Cloud often tests the misconception that disabling encryption improves performance, but Cloud Storage encryption is transparent and has no measurable impact on read throughput, so candidates should recognize that security controls are non-negotiable in cloud data lakes.

93
MCQ · easy

An organization wants to automate their batch data processing pipeline using Cloud Composer. The pipeline consists of multiple tasks: extract from Cloud Storage, transform with Dataflow, and load into BigQuery. Which Airflow operator should be used to run Dataflow jobs?

A. BigQueryInsertJobOperator
B. DataflowCreatePythonJobOperator
C. GCSToBigQueryOperator
D. DataprocSubmitJobOperator
Answer: B

This operator submits a Dataflow job written in Python.

Why this answer

B is correct because the DataflowCreatePythonJobOperator is specifically designed to submit and manage Apache Beam pipelines written in Python as Dataflow jobs in Google Cloud. This operator handles the creation of a Dataflow job from a Python file, which aligns with the requirement to run Dataflow transformations within a Cloud Composer DAG.

Exam trap

Google Cloud often tests the distinction between Dataflow and Dataproc operators, so the trap here is that candidates might confuse DataprocSubmitJobOperator (for Hadoop/Spark) with Dataflow operators, especially when the question mentions 'transform' without specifying the processing framework.

How to eliminate wrong answers

Option A is wrong because BigQueryInsertJobOperator is used to run BigQuery jobs (e.g., queries, load jobs), not to submit Dataflow pipelines. Option C is wrong because GCSToBigQueryOperator loads data directly from Cloud Storage to BigQuery without using Dataflow for transformation, bypassing the required transform step. Option D is wrong because DataprocSubmitJobOperator submits jobs to Dataproc (Hadoop/Spark clusters), not to Dataflow, which is a different processing service.

94
MCQ · medium

A BigQuery table contains streaming data from Cloud Pub/Sub. The table is partitioned by ingestion time. A user runs a query that accesses data from the last 5 minutes and gets correct results. After 90 minutes, the user runs the same query again but notices that some rows are missing. What is the most likely cause?

A. The query is using time travel to a snapshot before the streaming buffer was committed
B. The query is using cached results that exclude recent data
C. The schema of the table was modified after the initial query
D. The table has a partition expiration of 30 days
Answer: A

Time travel queries return data from a snapshot; if the snapshot is before the buffer is flushed, recent data is missing.

Why this answer

Option A is correct because BigQuery's streaming buffer provides low-latency access to recently ingested data, but this data is not immediately committed to managed storage. After the streaming buffer is flushed (typically within 90 minutes), the data becomes available in the table's base storage. If the user runs a query using time travel (e.g., `FOR SYSTEM_TIME AS OF`) to a snapshot taken before the buffer was committed, the query will only see data that was in managed storage at that snapshot time, missing rows that were still in the streaming buffer at that point.

Exam trap

Google Cloud often tests the misconception that cached results or schema changes are responsible for data inconsistencies, when the real issue is the separation between BigQuery's streaming buffer and managed storage, and how time travel queries only see committed data.

How to eliminate wrong answers

Option B is wrong because BigQuery caches query results only for identical queries within a 24-hour period, but the user ran the same query after 90 minutes; if cached results were used, they would include the same rows as the initial query, not missing rows. Option C is wrong because schema modifications do not cause rows to disappear from query results; they may affect column access or data types but do not remove existing rows. Option D is wrong because a partition expiration of 30 days would only remove partitions older than 30 days, not affect data from the last 5 minutes or 90 minutes.

95
MCQhard

A team is using BigQuery to analyze petabyte-scale data. They notice that queries are slow and expensive due to full table scans. They have already partitioned by date. What additional optimization should they implement?

A.Use materialized views
B.Cluster by frequently filtered columns
C.Convert to native tables
D.Use query caching
AnswerB

Clustering reduces bytes read when filtering on those columns.

Why this answer

Clustering by frequently filtered columns (option B) organizes data within each partition based on the sort order of those columns. This allows BigQuery to prune blocks during query execution, significantly reducing the amount of data scanned and improving both performance and cost. Since the table is already partitioned by date, clustering adds a secondary ordering that targets the most common filter predicates, avoiding full table scans within each partition.
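
As a rough illustration of the block pruning described above, the sketch below models a partition as a list of storage blocks that each record the minimum and maximum value of a clustered column. The `Block` type, the key ranges, and the block sizes are hypothetical, and this is a toy model rather than a description of BigQuery's actual storage format.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A toy storage block with min/max metadata for one clustered column."""
    min_key: str
    max_key: str
    size_bytes: int


def bytes_scanned(blocks, wanted_key):
    """Sum the sizes of blocks whose [min_key, max_key] range could hold wanted_key."""
    return sum(b.size_bytes for b in blocks if b.min_key <= wanted_key <= b.max_key)


# Clustered layout: rows are sorted, so each block covers a narrow key range.
clustered = [Block("a", "f", 100), Block("g", "m", 100), Block("n", "z", 100)]
# Unclustered layout: any block may contain any key, so the ranges overlap fully.
unclustered = [Block("a", "z", 100), Block("a", "z", 100), Block("a", "z", 100)]

clustered_scan = bytes_scanned(clustered, "h")      # only the middle block qualifies
unclustered_scan = bytes_scanned(unclustered, "h")  # every block must be read
```

In this toy layout the clustered table reads one block where the unclustered table reads three, which is the effect the explanation attributes to clustering within a partition.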

Exam trap

Google Cloud often tests the distinction between partitioning and clustering, where candidates mistakenly believe partitioning alone is sufficient for all filtering scenarios, but clustering is required to avoid full scans on non-date columns.

How to eliminate wrong answers

Option A is wrong because materialized views precompute and store query results, which can speed up repeated aggregations but do not reduce the scan cost of ad-hoc filters on raw data; they are not a substitute for physical data organization like clustering. Option C is wrong because BigQuery tables are already native (managed) tables; converting to native tables is not a valid operation and does not address scan efficiency. Option D is wrong because query caching only returns results for identical queries run within 24 hours, but it does not reduce the scan cost or improve performance for new or slightly different queries that still trigger full table scans.

96
Multi-Selecteasy

Which TWO approaches are recommended for handling late-arriving data in a streaming Dataflow pipeline?

Select 2 answers
A.Use side inputs to provide default values for late data.
B.Use fixed windows with a duration of 1 second to minimize lateness.
C.Configure allowed lateness on the window to accept late data.
D.Set the trigger to fire only at the end of the window.
E.Use a filter transform to drop late-arriving elements.
AnswersA, C

Side inputs can supply missing data.

Why this answer

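Option A is correct because side inputs in Apache Beam (the programming model underlying Dataflow) allow you to provide default values or supplementary data to handle late-arriving elements gracefully. When a late element arrives after the window has been emitted, a side input can supply a fallback value, ensuring the pipeline can still process the data without discarding it. This approach is recommended for handling late data in streaming pipelines where completeness is not critical. Option C is correct because allowed lateness keeps a window's state available for a configured period after the watermark passes the end of the window, so elements that arrive within that period are still assigned to the window and can produce an updated result instead of being dropped.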

Exam trap

Google Cloud often tests the misconception that simply using small windows or dropping late data is a valid handling strategy, when in fact the recommended approaches involve configuring allowed lateness and using side inputs for graceful fallback.
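
Allowed lateness can be pictured with a small model. The sketch below is a simplification under stated assumptions: times are plain integers in seconds, windows are fixed-size, and the `classify` function and its boundary rules are hypothetical rather than Apache Beam's actual implementation.

```python
def classify(event_time, watermark, window_size, allowed_lateness):
    """Classify an element as 'on_time', 'late_accepted', or 'dropped'.

    The element belongs to the fixed window [start, start + window_size)
    that contains event_time. The watermark is the pipeline's estimate of
    how far event time has progressed when the element arrives.
    """
    window_end = (event_time // window_size + 1) * window_size
    if watermark < window_end:
        return "on_time"            # the window has not closed yet
    if watermark < window_end + allowed_lateness:
        return "late_accepted"      # closed, but still within allowed lateness
    return "dropped"                # window state has been discarded


# An element with event time 50 falls in the window ending at 60.
early = classify(event_time=50, watermark=55, window_size=60, allowed_lateness=30)
late = classify(event_time=50, watermark=70, window_size=60, allowed_lateness=30)
too_late = classify(event_time=50, watermark=95, window_size=60, allowed_lateness=30)
```

With no allowed lateness configured (a value of 0 in this model), the second element would be dropped along with the third.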

97
MCQhard

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

A.Check Stackdriver logging for error messages.
B.Disable exactly-once processing in Dataflow.
C.Increase the number of Dataflow workers.
D.Switch to BigQuery streaming inserts.
AnswerA

Identifies root cause.

Why this answer

Option A is correct because Stackdriver (now Cloud Logging) is the first place to investigate when a Dataflow pipeline experiences high latency and data loss. Dataflow automatically logs errors, worker failures, and system messages to Cloud Logging, which can reveal root causes such as insufficient resources, stuck steps, or Pub/Sub subscription issues. Checking logs first avoids premature scaling or configuration changes that may not address the actual problem.

Exam trap

Google Cloud often tests the principle of 'diagnose before you optimize' — the trap here is that candidates jump to scaling or switching technologies (options C and D) without first checking logs, which is the fundamental first step in any troubleshooting workflow.

How to eliminate wrong answers

Option B is wrong because disabling exactly-once processing in Dataflow would not fix high latency or data loss; it could actually increase data duplication and make debugging harder, while the core issue remains unaddressed. Option C is wrong because increasing the number of Dataflow workers without first diagnosing the bottleneck (e.g., a hot key, slow transform, or Pub/Sub backlog) can waste resources and may not resolve the underlying cause of latency or loss. Option D is wrong because switching to BigQuery streaming inserts does not address pipeline-level failures; streaming inserts have their own quotas, error handling, and latency characteristics, and the problem likely lies in the Dataflow processing logic or resource allocation, not the sink.

98
MCQmedium

A company uses BigQuery to run reporting queries on a table that is partitioned by date and clustered by customer_id. Queries filtering by customer_id and a date range are performing poorly. What is the most likely cause?

A.The project lacks sufficient BigQuery slot capacity
B.The table is too large for BigQuery
C.Clustering column order should be date first, then customer_id
D.The date range filter is too wide, causing scans of many partitions
AnswerD

Wide date ranges nullify the benefit of clustering; BigQuery scans many partitions.

Why this answer

Option D is correct because when a table is partitioned by date and clustered by customer_id, queries that filter on both columns can still perform poorly if the date range filter is too wide, causing BigQuery to scan many partitions. Even with clustering, scanning a large number of partitions negates the benefit of clustering, as clustering only reduces the data scanned within each partition. The query optimizer must read all partitions that fall within the date range, and if that range is broad, the scan overhead dominates.
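
The effect of a wide date range can be made concrete with a little arithmetic. The sketch below assumes daily partitions and a hypothetical figure of 5 blocks read per partition after clustering has pruned the rest; both numbers are illustrative only.

```python
from datetime import date


def partitions_in_range(start, end):
    """Number of daily partitions touched by an inclusive date-range filter."""
    return (end - start).days + 1


narrow = partitions_in_range(date(2024, 1, 1), date(2024, 1, 7))
wide = partitions_in_range(date(2023, 1, 1), date(2023, 12, 31))

# Hypothetical figure: clustering leaves 5 blocks to read in each partition.
blocks_per_partition_after_clustering = 5
narrow_blocks = narrow * blocks_per_partition_after_clustering
wide_blocks = wide * blocks_per_partition_after_clustering
```

Clustering reduces the work per partition by the same factor in both cases, but the year-long filter still touches roughly fifty times as many partitions as the week-long one.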

Exam trap

The trap here is that candidates often assume clustering alone guarantees fast queries on any filter combination, without understanding that partition pruning happens first and a wide date range undermines the benefit of clustering.

How to eliminate wrong answers

Option A is wrong because insufficient slot capacity would cause slow query execution or queuing, not specifically poor performance on partitioned and clustered tables; the issue here is data scanning inefficiency, not resource contention. Option B is wrong because BigQuery is designed to handle tables of any size, and 'too large' is not a meaningful limitation; the problem is query design, not table size. Option C is wrong because the table is already partitioned by date, so date filtering is handled by partition pruning, and customer_id is the right leading clustering column for this query pattern; placing date first in the clustering order would not help queries that filter on customer_id, because clustering is most effective when the filter covers the leading clustering column.

99
Multi-Selecthard

A streaming pipeline uses Cloud Pub/Sub and Dataflow to process financial transactions. The pipeline must guarantee that each transaction is processed exactly once and in order per customer key. Which two configurations are necessary? (Choose two.)

Select 2 answers
A.Use a session window with max gap duration
B.Use a keyed state with a value state per customer
C.Use Dataflow stateful processing with event time ordering
D.Use a Pub/Sub topic with ordering keys
E.Use a global window with a trigger
AnswersC, D

Dataflow stateful processing with event time ordering allows processing events per key in the order they were generated, with exactly-once guarantees.

Why this answer

Options C and D are correct. Pub/Sub ordering keys (D) ensure messages with the same ordering key are delivered in order. Dataflow stateful processing with event time ordering (C) allows processing events in order while maintaining exactly-once semantics.

Option E (global window with trigger) does not guarantee order. Option B (keyed state) is required but is encompassed by C. Option A (session window) is not about ordering.
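
The combination this question asks for, per-key ordering plus exactly-once handling, can be illustrated with a toy stateful processor. The sketch below is plain Python, not Dataflow code, and it assumes each event carries a per-key sequence number starting at 0; the class name and the sample transactions are hypothetical.

```python
from collections import defaultdict


class PerKeyOrderedProcessor:
    """Releases events for each key in sequence order, skipping duplicates."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # per-key state: next sequence to emit
        self.pending = defaultdict(dict)   # per-key state: buffered early arrivals
        self.output = []

    def process(self, key, seq, payload):
        if seq < self.next_seq[key] or seq in self.pending[key]:
            return  # duplicate delivery: already emitted or already buffered
        self.pending[key][seq] = payload
        # Emit every buffered event that is now contiguous with what was emitted.
        while self.next_seq[key] in self.pending[key]:
            n = self.next_seq[key]
            self.output.append((key, n, self.pending[key].pop(n)))
            self.next_seq[key] = n + 1


proc = PerKeyOrderedProcessor()
for key, seq, payload in [
    ("alice", 1, "debit 5"),    # arrives before seq 0, so it is buffered
    ("bob", 0, "credit 9"),
    ("alice", 0, "credit 20"),  # unblocks alice's buffered event
    ("alice", 0, "credit 20"),  # redelivery, ignored
]:
    proc.process(key, seq, payload)
```

Ordering is enforced only within a key: bob's event is emitted as soon as it arrives, while alice's second transaction waits for her first.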

100
MCQhard

A company runs a streaming data pipeline on Google Cloud using Cloud Pub/Sub, Cloud Dataflow, and BigQuery. The pipeline processes real-time sensor data for predictive maintenance. Recently, the Dataflow job's lag has increased from seconds to minutes, and the system shows backpressure. The pipeline uses fixed windows of 1 minute and writes results to BigQuery. The data volume has doubled. The team has already increased the number of workers. What should they do next?

A.Enable Streaming Engine and use Upsert to BigQuery
B.Decrease the window duration
C.Use session windows instead of fixed windows
D.Use Cloud Storage as temporary sink
AnswerA

Streaming Engine reduces overhead and Upsert makes BigQuery writes more efficient.

Why this answer

The correct answer is A because enabling Streaming Engine offloads the heavy shuffle and state management from the worker VMs to the backend service, reducing the impact of backpressure. Using Upsert to BigQuery allows the pipeline to handle late-arriving data within the fixed windows without requiring a full table rewrite, which is critical when data volume has doubled and lag has increased.

Exam trap

The trap here is that candidates often assume increasing workers or changing window sizes will fix backpressure, but the real bottleneck is often the shuffle and state management in Dataflow, which Streaming Engine directly addresses.

How to eliminate wrong answers

Option B is wrong because decreasing the window duration would increase the number of windows and the frequency of writes, exacerbating the backpressure and lag rather than solving it. Option C is wrong because session windows are designed for grouping events based on gaps of inactivity, which is not relevant to the fixed-window requirement for predictive maintenance sensor data; they would not reduce backpressure. Option D is wrong because using Cloud Storage as a temporary sink adds an extra write step and does not address the root cause of backpressure in the Dataflow pipeline; it would increase latency and complexity.

101
MCQeasy

A streaming Dataflow job is processing messages from Cloud Pub/Sub. The job is underutilizing resources and the throughput is lower than expected. Which parameter should be adjusted to increase parallelism?

A.Change the workerMachineType to a higher CPU machine
B.Increase the number of workers via maxNumWorkers
C.Set the streaming engine to Dataflow Streaming Engine
D.Set autoscalingAlgorithm to THROUGHPUT_BASED
AnswerB

More workers allow more parallelism.

Why this answer

The job is underutilizing resources, meaning the existing workers are not fully loaded. Increasing the number of workers via maxNumWorkers directly increases parallelism by allowing Dataflow to distribute work across more VMs, which can increase throughput without changing the per-worker resource profile. This parameter controls the upper bound on the number of workers, enabling the autoscaler to scale out when there is backlog.

Exam trap

Google Cloud often tests the misconception that increasing per-worker resources (CPU/memory) is the primary way to improve throughput in a streaming job, when in fact underutilization indicates the need to scale out workers rather than scale up individual workers.

How to eliminate wrong answers

Option A is wrong because changing workerMachineType to a higher CPU machine increases per-worker compute capacity but does not address underutilization; if workers are idle, adding more CPU per worker will not increase parallelism or throughput. Option C is wrong because Dataflow Streaming Engine is a service that offloads shuffle and state management to the backend, reducing per-worker overhead and improving scalability, but it does not directly increase parallelism; it changes the execution model. Option D is wrong because setting autoscalingAlgorithm to THROUGHPUT_BASED is already the default for streaming jobs; it enables autoscaling based on throughput metrics, but without adjusting maxNumWorkers, the autoscaler cannot scale beyond the default limit, so throughput remains capped.
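
For reference, the parameters named in this question are written in the Java SDK's camelCase style. The sketch below assembles the corresponding flags in the snake_case spelling used by the Apache Beam Python SDK, to the best of this guide's knowledge; the helper function is hypothetical, the worker count of 20 is arbitrary, and the flags are only collected in a list rather than submitted to Dataflow.

```python
def dataflow_scaling_flags(max_workers, machine_type=None):
    """Build command-line flags that bound streaming autoscaling."""
    flags = [
        "--runner=DataflowRunner",
        "--streaming",
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        f"--max_num_workers={max_workers}",  # upper bound for the autoscaler
    ]
    if machine_type is not None:
        # Scaling up each worker is a separate, optional choice.
        flags.append(f"--worker_machine_type={machine_type}")
    return flags


flags = dataflow_scaling_flags(max_workers=20)
```

Raising the worker ceiling and changing the machine type are kept as independent arguments here to mirror the scale-out versus scale-up distinction the question tests.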

102
MCQmedium

A data engineering team needs to process a large volume of CSV files stored in Cloud Storage using Dataproc. The files are generated hourly and each contains millions of rows. They want to minimize the number of Dataproc cluster nodes to reduce cost while processing within an hour. Which configuration should they recommend?

A.Use a cluster with preemptible worker nodes only.
B.Use a cluster with local SSDs for temporary storage.
C.Use a cluster with a few large worker nodes and use Spark static allocation.
D.Use a cluster with many small worker nodes and use Spark dynamic allocation.
AnswerD

Dynamic allocation adjusts resources based on workload; small nodes provide granular scaling.

Why this answer

Option D is correct because using many small worker nodes with Spark dynamic allocation allows the cluster to scale resources precisely to the workload, minimizing idle capacity and cost. Dynamic allocation enables executors to be added or removed based on the processing demands of the hourly CSV files, ensuring the job completes within the hour without over-provisioning nodes.

Exam trap

Google Cloud often tests the misconception that larger nodes are always more cost-effective for big data processing, but in practice, many small nodes with dynamic allocation reduce idle resource waste and better match the parallelism needs of distributed file processing.

How to eliminate wrong answers

Option A is wrong because preemptible worker nodes only can be terminated at any time by Google Cloud, risking job failure or delays when processing millions of rows per hour, and they cannot be the sole worker nodes for a reliable Dataproc cluster. Option B is wrong because local SSDs improve I/O performance for shuffle operations but do not directly reduce the number of nodes or cost; they add cost per node and are not a configuration for minimizing node count. Option C is wrong because using a few large worker nodes with Spark static allocation reserves a fixed number of executors regardless of actual workload, leading to underutilization and higher cost if the job does not need all resources, and it does not adapt to the hourly data volume variations.

103
Drag & Dropmedium

Drag and drop the steps to set up a Pub/Sub topic with a push subscription to an HTTPS endpoint into the correct order.

Why this order

Push subscriptions send messages to a configured HTTPS endpoint.

104
Multi-Selecthard

You are designing a streaming pipeline that must guarantee exactly-once processing. Which three services or features can help achieve this? (Choose THREE.)

Select 3 answers
A.Cloud Functions for post-processing
B.BigQuery streaming inserts with a unique key for deduplication
C.Cloud Spanner for deduplication state across the pipeline
D.Cloud Pub/Sub with duplicate detection (using message IDs)
E.Dataflow with idempotent write operations to BigQuery
AnswersC, D, E

Using Cloud Spanner as a global state store allows tracking processed event IDs for deduplication.

Why this answer

Cloud Spanner is correct because it provides globally distributed, strongly consistent transactions that can be used to maintain deduplication state across the entire streaming pipeline. By storing a unique key for each processed event in Spanner, the pipeline can atomically check and record whether an event has already been handled, ensuring exactly-once semantics even in the face of retries or failures.
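
The check-and-record pattern described above can be sketched with an in-memory stand-in. The `DedupStore` class below is hypothetical and uses a lock where a real pipeline would rely on a database transaction; it does not use the Cloud Spanner client library.

```python
import threading


class DedupStore:
    """In-memory stand-in for a strongly consistent store of processed event IDs."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, event_id):
        """Atomically record event_id; return True only for the first caller."""
        with self._lock:
            if event_id in self._seen:
                return False
            self._seen.add(event_id)
            return True


def handle(store, event_id, amount, ledger):
    """Apply the side effect only if this event has not been handled before."""
    if store.claim(event_id):
        ledger.append(amount)


store, ledger = DedupStore(), []
for event_id, amount in [("e1", 10), ("e2", 5), ("e1", 10)]:  # e1 is retried
    handle(store, event_id, amount, ledger)
```

In this sketch the claim and the side effect are two separate steps; a production design would need them to commit together, otherwise a crash between the two could lose an event.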

Exam trap

Google Cloud often tests the misconception that BigQuery streaming inserts can guarantee exactly-once processing via a unique key, when in fact BigQuery only supports at-least-once delivery and requires external deduplication mechanisms like Cloud Spanner or Dataflow with idempotent writes.

105
MCQeasy

A company processes CSV files that are uploaded to Cloud Storage by external partners. Each file is around 500 MB, and they need to be parsed and loaded into BigQuery. The processing must start as soon as the file arrives. What is the most efficient serverless architecture?

A.Cloud Storage triggers a Cloud Function that publishes events to Pub/Sub; a Dataflow streaming pipeline reads from Pub/Sub and writes to BigQuery.
B.Use Cloud Scheduler to periodically check for new files and process them with Dataflow batch jobs.
C.Cloud Storage triggers a Dataproc job that reads the file and loads it into BigQuery.
D.Cloud Storage triggers a Cloud Function that directly loads the data into BigQuery using the BigQuery API.
AnswerA

Serverless and scales well with file uploads.

Why this answer

Option A is correct because it combines Cloud Storage event-driven triggers with Pub/Sub for reliable asynchronous message delivery, and uses Dataflow streaming with autoscaling to handle 500 MB files efficiently. This serverless architecture ensures processing starts immediately upon file arrival, scales to handle large files without manual intervention, and leverages BigQuery's streaming inserts for near-real-time data loading.

Exam trap

Google Cloud often tests the misconception that Cloud Functions can handle large file processing directly, but the 9-minute timeout and memory limits make them unsuitable for files over a few hundred MB, pushing candidates toward the seemingly simpler Option D.

How to eliminate wrong answers

Option B is wrong because Cloud Scheduler polling introduces latency and inefficiency, as it checks for new files on a fixed schedule rather than reacting instantly, which violates the requirement that processing must start as soon as the file arrives. Option C is wrong because Dataproc is a managed Hadoop/Spark service that requires cluster provisioning and startup time, adding overhead for a simple CSV-to-BigQuery load; it is not serverless and not the most efficient for this use case. Option D is wrong because Cloud Functions have a 9-minute timeout and 2 GB memory limit, making them unsuitable for parsing and loading a 500 MB CSV file directly via the BigQuery API, which would likely exceed these constraints and cause failures.

106
MCQmedium

A gaming company uses Avro schemas for its streaming event data. They anticipate adding new optional fields to events over time. They need to ensure backward compatibility so that existing pipelines continue to work. Which strategy should they adopt?

A.Use Avro with a schema registry that enforces backward-compatible changes
B.Use JSON instead of Avro and ignore unknown fields
C.Use Protocol Buffers with breaking changes
D.Use FlatBuffers for performance
AnswerA

Avro's schema evolution rules allow adding optional fields without breaking existing consumers, and a schema registry enables version management.

Why this answer

Option A is correct because Avro, combined with a schema registry, allows schema evolution with backward compatibility. The registry enforces rules such as adding optional fields with defaults, ensuring that consumers using older schemas can still deserialize new data without breaking. This directly addresses the requirement for existing pipelines to continue working as new optional fields are added.
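
The rule of adding optional fields with defaults can be shown with a simplified model of schema resolution. The sketch below writes two versions of a hypothetical `GameEvent` schema as Avro-style dictionaries and uses a hand-written `resolve` function; it does not use an Avro library or a schema registry, and it covers only the add-a-field case.

```python
# Version 1 of a hypothetical event schema, written as an Avro-style dict.
SCHEMA_V1 = {
    "type": "record",
    "name": "GameEvent",
    "fields": [
        {"name": "player_id", "type": "string"},
        {"name": "score", "type": "int"},
    ],
}

# Version 2 adds an optional field: a union with null plus a default.
SCHEMA_V2 = {
    "type": "record",
    "name": "GameEvent",
    "fields": SCHEMA_V1["fields"] + [
        {"name": "region", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, reader_schema):
    """Project a decoded record onto reader_schema, filling declared defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


old_record = {"player_id": "p1", "score": 42}
new_record = {"player_id": "p2", "score": 7, "region": "eu"}

upgraded = resolve(old_record, SCHEMA_V2)    # new reader, old data: default fills in
downgraded = resolve(new_record, SCHEMA_V1)  # old reader, new data: extra field ignored
```

Removing the `default` from the new field would make `resolve(old_record, SCHEMA_V2)` raise, which is the kind of change a compatibility check is meant to reject.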

Exam trap

Google Cloud often tests the misconception that any serialization format (like JSON or Protocol Buffers) inherently supports backward compatibility, but the key is the combination of a schema registry with enforced evolution rules, which only Avro explicitly provides in this context.

How to eliminate wrong answers

Option B is wrong because JSON lacks a schema enforcement mechanism; while ignoring unknown fields is possible, JSON does not provide built-in compatibility guarantees or schema evolution rules, making it error-prone in large-scale streaming systems. Option C is wrong because Protocol Buffers can support backward compatibility, but the option specifies 'breaking changes,' which would violate the requirement for backward compatibility. Option D is wrong because FlatBuffers prioritize performance (zero-copy deserialization) but do not inherently enforce backward-compatible schema evolution, and they are less suited for streaming event data with frequent schema changes.

107
MCQeasy

A company wants to implement a data lake on Google Cloud to store raw sensor data (unstructured binary files) and allow data scientists to run SQL queries on processed data. They expect to store terabytes of data and have different access patterns. Which combination of GCP services best meets these requirements?

A.Bigtable for raw data and Cloud Spanner for processed data
B.Cloud Storage for both raw and processed data
C.Cloud SQL for raw data and Cloud Dataproc for processing
D.Cloud Storage for raw data and BigQuery for processed data
AnswerD

Cloud Storage stores any file type cost-effectively, and BigQuery provides fast SQL queries on structured data.

Why this answer

Cloud Storage is the ideal service for storing raw, unstructured binary sensor data at petabyte scale, offering low-cost, durable object storage with multiple access tiers. BigQuery is a serverless, highly scalable data warehouse that allows data scientists to run SQL queries on processed data, with features like columnar storage and automatic optimization for analytical workloads. This combination directly addresses the need for raw storage and SQL-based analytics on processed data.

Exam trap

Google Cloud often tests the misconception that Cloud Storage can serve as a queryable database for SQL, when in fact it requires an external query engine like BigQuery or Dataproc for SQL access.

How to eliminate wrong answers

Option A is wrong because Bigtable is a NoSQL wide-column database optimized for real-time, low-latency access, not for storing raw unstructured binary files, and Cloud Spanner is a globally distributed relational database for transactional workloads, not for analytical SQL queries on processed data. Option B is wrong because while Cloud Storage can store both raw and processed data, it does not natively support SQL queries; data scientists would need an additional service like BigQuery or Dataproc to run SQL. Option C is wrong because Cloud SQL is a relational database for structured data, not designed for raw unstructured binary files, and Cloud Dataproc is a managed Spark/Hadoop service for processing, not a SQL query engine for processed data.

108
MCQmedium

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

A.Use Cloud Dataproc Serverless for all Spark jobs.
B.Migrate jobs to Cloud Dataflow.
C.Run Spark on Compute Engine instances with startup scripts.
D.Use Dataproc clusters with auto-scaling and preemptible VMs.
AnswerD

Reduces cost and operational overhead.

Why this answer

Option D is correct because Dataproc clusters with auto-scaling and preemptible VMs directly address the need to reduce operational overhead and minimize costs for on-premises Spark migrations. Auto-scaling dynamically adjusts cluster size based on workload, while preemptible VMs (which cost 60-80% less than standard VMs) handle fault-tolerant tasks, making this the most cost-effective and operationally efficient architecture for Spark on Dataproc.

Exam trap

The trap here is that candidates often choose Cloud Dataproc Serverless (Option A) thinking it eliminates all operational overhead, but they overlook that it lacks the cost-saving benefits of preemptible VMs and may not support all Spark features, making auto-scaling clusters with preemptible VMs the more appropriate choice for minimizing costs in a migration scenario.

How to eliminate wrong answers

Option A is wrong because Cloud Dataproc Serverless is designed for batch Spark workloads without cluster management, but it lacks the flexibility and cost optimization of preemptible VMs for long-running or complex jobs, and may not support all Spark configurations or libraries used in on-premises environments. Option B is wrong because Cloud Dataflow is a different processing engine (Apache Beam) that requires rewriting Spark jobs into Beam pipelines, adding migration complexity and operational overhead, not reducing it. Option C is wrong because running Spark on Compute Engine instances with startup scripts requires manual cluster management, scaling, and fault tolerance, increasing operational overhead and negating the benefits of a managed service like Dataproc.

109
MCQhard

What is the root cause of this error and the correct solution?

A.The BigQuery table requires authorized view access.
B.The user running the job needs the BigQuery Admin role.
C.The Dataflow service account needs the BigQuery User role.
D.The Dataflow worker service account needs the BigQuery Data Viewer role.
AnswerD

BigQuery Data Viewer includes the required getData permission.

Why this answer

Option D is correct because Dataflow workers execute under a specific service account (the Compute Engine default service account or a custom one), and that service account must have the BigQuery Data Viewer role to read data from BigQuery tables. Without this permission, the workers cannot access the source data, causing the job to fail with access errors. The BigQuery User role is insufficient for reading table data, and the BigQuery Admin role is overly permissive and not required for this task.

Exam trap

Google Cloud often tests the distinction between the Dataflow service account (the service agent that manages the job) and the Dataflow worker service account (which performs data operations), leading candidates to incorrectly assign permissions to the service agent instead of the worker account.

How to eliminate wrong answers

Option A is wrong because authorized view access is a mechanism to share query results without granting direct table access, but the error here is about the Dataflow service account lacking read permissions on the BigQuery table, not about view authorization. Option B is wrong because the BigQuery Admin role grants full control over BigQuery resources, which is excessive and not necessary; the user running the job does not need admin rights—only the worker service account needs read access. Option C is wrong because the BigQuery User role allows running queries and creating datasets but does not grant read access to table data; the Dataflow service account (which orchestrates the job) does not directly read data—the worker service account does.

110
MCQhard

A data pipeline uses Cloud Pub/Sub to ingest events, then a Dataflow job writes to Cloud Storage in Avro format. The Dataflow job uses Global windows with a 10-minute trigger. The data is later loaded into BigQuery. They notice duplicate rows in BigQuery because the trigger produced multiple panes. What should the Dataflow pipeline change to eliminate duplicates?

A.Enable exactly-once sink to BigQuery via Dataflow
B.Use a sharded output to Cloud Storage with unique filenames
C.Write to a staging table and use a MERGE statement in BigQuery
D.Use a session window instead of global window
AnswerA

Dataflow's exactly-once sink to BigQuery uses record IDs to deduplicate, preventing duplicates caused by trigger panes.

Why this answer

Option A is correct because enabling exactly-once sinks in Dataflow ensures that each record is written to the sink only once, even if the pipeline produces multiple panes due to triggers. In this scenario, the 10-minute trigger on a global window causes multiple output panes, leading to duplicate rows in BigQuery. Exactly-once sinks use idempotent writes and deduplication mechanisms to prevent duplicates, directly addressing the issue without changing the windowing or trigger logic.

Exam trap

Google Cloud often tests the misconception that changing windowing or output file naming can solve duplicate data issues, when the real solution is to enable exactly-once processing guarantees at the sink level.

How to eliminate wrong answers

Option B is wrong because sharded output with unique filenames only prevents file-level collisions in Cloud Storage, but does not eliminate duplicate rows within the Avro files; duplicates from multiple panes still exist. Option C is wrong because writing to a staging table and using a MERGE statement is a workaround that does not fix the root cause in the Dataflow pipeline; it adds complexity and latency, and is not a Dataflow-native solution. Option D is wrong because session windows group events based on activity gaps, not time intervals; they do not prevent duplicate panes from triggers and are inappropriate for a global-windowed pipeline that needs to deduplicate across all data.

111
MCQeasy

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

A.roles/bigquery.admin
B.roles/bigquery.user
C.roles/bigquery.jobUser
D.roles/bigquery.dataEditor
AnswerD

Includes bigquery.tables.get.

Why this answer

The error indicates the service account lacks the `bigquery.tables.get` permission, which is required to read table metadata. `roles/bigquery.dataEditor` includes this permission along with `bigquery.tables.getData`, `bigquery.tables.update`, and `bigquery.tables.export`, making it the least-privileged of the listed roles that resolves the access denied error for a Dataflow job reading from a BigQuery table.

Exam trap

Google Cloud often tests the misconception that `roles/bigquery.user` or `roles/bigquery.jobUser` provide sufficient read access for Dataflow jobs, when in fact they lack the specific `bigquery.tables.get` permission needed for table metadata retrieval.

How to eliminate wrong answers

Option A is wrong because `roles/bigquery.admin` grants full control over BigQuery resources, including dataset deletion and IAM policy management, which is excessive and violates the principle of least privilege for a Dataflow job that only needs to read table data. Option B is wrong because `roles/bigquery.user` provides `bigquery.datasets.get` and `bigquery.jobs.create` but does not include `bigquery.tables.get`, so it would not resolve the specific permission error. Option C is wrong because `roles/bigquery.jobUser` only allows creating and managing jobs (e.g., queries) but does not grant any direct table read permissions like `bigquery.tables.get`.

112
MCQhard

A company uses Cloud Dataproc to run Spark jobs on ephemeral clusters. The input data is in Cloud Storage and output is also to Cloud Storage. The cluster is created and deleted daily. The cost is high due to spinning up nodes. Which change can reduce cost without sacrificing performance?

A.Use standard VMs with a larger number of smaller machines
B.Use Cloud Dataflow instead
C.Use a combination of standard and preemptible VMs for worker nodes
D.Use preemptible VMs for all nodes
AnswerC

Preemptible VMs for workers reduce cost significantly; standard VMs for the master and a few worker nodes ensure reliability.

Why this answer

Option C is correct because using a combination of standard and preemptible VMs for worker nodes reduces cost significantly while maintaining performance. Preemptible VMs are up to 80% cheaper than standard VMs, and since Spark is fault-tolerant and can recover from node preemptions by retrying failed tasks and recomputing lost partitions, the job can still complete without significant performance degradation. Standard VMs for master nodes ensure cluster stability, while preemptible workers handle the bulk of data processing.
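
The cost effect of mixing worker types can be worked through with simple arithmetic. The sketch below uses hypothetical inputs: a rate of 1.00 per worker-hour, a 2-hour run, and the 80% discount mentioned above as an upper bound. It ignores the master node and any reruns caused by preemption.

```python
def cluster_cost(standard_workers, preemptible_workers, hours,
                 standard_rate, preemptible_discount):
    """Worker cost for a mixed cluster; all rates are caller-supplied."""
    preemptible_rate = standard_rate * (1 - preemptible_discount)
    return hours * (standard_workers * standard_rate
                    + preemptible_workers * preemptible_rate)


# Hypothetical inputs: 10 workers for 2 hours at 1.00 per worker-hour,
# with preemptible capacity assumed to be 80% cheaper.
all_standard = cluster_cost(10, 0, 2, 1.00, 0.80)
mixed = cluster_cost(2, 8, 2, 1.00, 0.80)
```

Under these assumptions the mixed cluster costs a little over a third of the all-standard one while keeping two standard workers that cannot be preempted.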

Exam trap

Google Cloud often tests the misconception that preemptible VMs can be used for all nodes, but the trap here is that the master node must be a standard VM to avoid cluster instability, while workers can safely use preemptible VMs due to Spark's fault tolerance.

How to eliminate wrong answers

Option A is wrong because using a larger number of smaller machines increases overhead from inter-node communication and task scheduling, potentially degrading performance and not necessarily reducing cost. Option B is wrong because Cloud Dataflow is a different service for batch and stream processing, not a direct replacement for Spark on Dataproc; migrating would require rewriting jobs and may not preserve existing Spark-specific logic or performance characteristics. Option D is wrong because using preemptible VMs for all nodes, including the master node, risks cluster failure if the master is preempted, as Dataproc does not automatically recover the master; this sacrifices reliability and can cause job failures.

113
MCQeasy

A large retail company processes point-of-sale transactions from thousands of stores daily. The current batch pipeline runs on Cloud Dataproc using Spark and takes 3 hours to complete. The business wants to reduce processing time to under 30 minutes. The pipeline reads from Cloud Storage, joins with inventory data from BigQuery, performs aggregations, and writes to Cloud SQL for reporting. What is the most effective optimization?

A.Migrate the pipeline to Cloud Dataflow with Apache Beam for auto-scaling
B.Read inventory data from BigQuery and pre-join in BigQuery, then export to Cloud Storage as ORC files
C.Write intermediate results to Cloud SQL instead of BigQuery for faster access
D.Increase the number of worker nodes in the Dataproc cluster
AnswerB

Reduces data shuffle in Spark and speeds up processing.

Why this answer

Option B is correct because it offloads the join operation to BigQuery, which is optimized for large-scale analytics and can process the join much faster than Spark. By pre-joining and exporting the result as ORC files (a columnar format optimized for Spark), the pipeline avoids the expensive shuffle and data transfer between Cloud Storage and BigQuery, significantly reducing the overall processing time to meet the 30-minute target.

Exam trap

The trap here is that candidates often assume that simply scaling up the existing infrastructure (more workers or auto-scaling) is the most effective optimization, but the exam tests the understanding that architectural changes to reduce data movement and leverage service-specific strengths (like BigQuery for joins) are far more impactful than brute-force scaling.

How to eliminate wrong answers

Option A is wrong because migrating to Cloud Dataflow with Apache Beam introduces auto-scaling but does not address the fundamental bottleneck of joining large datasets across Cloud Storage and BigQuery; the join operation would still require significant data movement and processing, likely not achieving the required speedup. Option C is wrong because writing intermediate results to Cloud SQL instead of BigQuery would actually slow down the pipeline, as Cloud SQL is a transactional database not designed for high-throughput batch writes, and it would introduce additional latency and potential contention. Option D is wrong because simply increasing the number of worker nodes in the Dataproc cluster may improve parallelism but does not eliminate the costly shuffle and data transfer inherent in the join between Cloud Storage and BigQuery; it would also increase costs without guaranteeing the 6x performance improvement needed.

114
MCQeasy

An e-commerce company processes real-time clickstream data using Pub/Sub and Dataflow. They want to ensure that if a Dataflow worker fails, the pipeline can resume processing from the point of failure without data loss. Which feature should they enable?

A.At-least-once delivery mode
B.Exactly-once processing mode
C.Snapshot-based recovery
D.Streaming engine
AnswerC

Allows periodic saving of pipeline state and resumption from saved snapshots.

Why this answer

Snapshot-based recovery (Option C) is the correct feature because Dataflow snapshots capture the entire pipeline state, including the current position in each Pub/Sub subscription and the state of all transforms. If a worker fails, the pipeline can be resumed from the exact snapshot point, ensuring no data loss and exactly-once processing semantics for the recovered data.

Exam trap

Google Cloud often tests the misconception that exactly-once processing alone guarantees failure recovery, but it only prevents duplicates during normal operation, not resumption after a worker crash.

How to eliminate wrong answers

Option A is wrong because at-least-once delivery mode ensures messages are delivered at least once but does not provide a mechanism to resume from a specific point of failure; it may cause duplicate processing but not lossless recovery. Option B is wrong because exactly-once processing mode is a processing guarantee that prevents duplicates but does not inherently provide a recovery mechanism to resume from a failure point; it relies on other features like snapshots for stateful resumption. Option D is wrong because Streaming Engine is a Dataflow feature that moves state and shuffle data to a backend service to reduce worker resource usage, but it does not directly provide a point-of-failure recovery mechanism; snapshots are required for that.

115
MCQeasy

A company runs a nightly Dataproc batch job to process large log files. The job is idempotent and can tolerate node failures if restarted. Minimizing cost is critical. What is the most cost-effective cluster design?

A.Use preemptible instances for all nodes and enable automatic restart
B.Use standard instances with autoscaling based on YARN memory
C.Use all preemptible instances and configure the cluster to delete after the job completes
D.Use a single-node cluster with a high-memory machine type
AnswerA

Preemptible instances are 60-80% cheaper, and automatic restart allows the job to continue after a preemption.

Why this answer

Preemptible instances cost roughly 60-80% less than standard instances, making them the most cost-effective choice for fault-tolerant, idempotent batch jobs. Enabling automatic restart ensures that if a preemptible instance is terminated (which can happen at any time), Dataproc will automatically recreate it, maintaining cluster capacity without manual intervention. This design minimizes cost while preserving the job's ability to complete despite node failures.

Exam trap

Google Cloud often tests the misconception that deleting the cluster after the job completes is the primary cost-saving measure, but the trap here is that without automatic restart, preemptible instances alone can cause job failure due to node preemption, negating cost benefits.

How to eliminate wrong answers

Option B is wrong because standard instances are significantly more expensive than preemptible instances, and autoscaling based on YARN memory does not reduce cost as effectively as using preemptible instances for a fault-tolerant batch job. Option C is wrong because configuring the cluster to delete after the job completes is a good practice for cost savings, but using all preemptible instances without enabling automatic restart risks job failure if preemptible instances are reclaimed, as the cluster may lose nodes and become unable to complete the job. Option D is wrong because a single-node cluster with a high-memory machine type is not cost-effective for processing large log files; it lacks fault tolerance and scalability, and high-memory instances are expensive compared to using multiple preemptible instances.

116
MCQeasy

A company runs a batch ETL pipeline on Cloud Dataproc. During peak hours, the job takes longer than expected. The pipeline reads from Cloud Storage, transforms data, and writes to BigQuery. What is the most cost-effective way to improve performance without redesigning the pipeline?

A.Add a secondary worker group using preemptible VMs and increase the number of workers.
B.Enable local SSDs on all worker nodes.
C.Increase the master node's machine type to n1-highmem-32.
D.Use Cloud Composer to schedule the job with a higher priority.
AnswerA

Preemptible VMs are cost-effective and add parallelism.

Why this answer

Adding a secondary worker group with preemptible VMs is the most cost-effective way to improve performance because it allows you to scale out the cluster horizontally with compute instances that are significantly cheaper (up to 80% discount) than regular VMs. This directly addresses the bottleneck of processing capacity during peak hours without requiring any pipeline redesign, as Cloud Dataproc can automatically distribute work across additional workers.
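A minimal sketch of what the resize could look like from a script is shown below. It only builds the `gcloud dataproc clusters update` argument list and does not run it; the cluster name and region are placeholders. Whether the secondary workers are preemptible depends on the secondary worker type chosen when the cluster was created, which this sketch assumes is already set.

```python
def add_secondary_workers_command(cluster, region, num_secondary_workers):
    """Build (but do not run) a gcloud command that resizes the secondary
    worker group of an existing Dataproc cluster."""
    if num_secondary_workers < 0:
        raise ValueError("num_secondary_workers must be non-negative")
    return [
        "gcloud", "dataproc", "clusters", "update", cluster,
        f"--region={region}",
        f"--num-secondary-workers={num_secondary_workers}",
    ]


# "etl-cluster" and "us-central1" are placeholder values.
cmd = add_secondary_workers_command("etl-cluster", "us-central1", 8)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; keeping the command as a list avoids shell quoting problems.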

Exam trap

The trap here is that candidates assume scaling up the master node or improving local storage will help, but the exam tests understanding that horizontal scaling with cheap, ephemeral workers is the most cost-effective approach for batch processing workloads that are CPU-bound and fault-tolerant.

How to eliminate wrong answers

Option B is wrong because enabling local SSDs on all worker nodes improves I/O performance for intermediate data, but the pipeline reads from Cloud Storage and writes to BigQuery, which are network-based operations; the bottleneck is CPU/memory for transformation, not local disk speed, making this an expensive upgrade with minimal impact. Option C is wrong because increasing the master node's machine type to n1-highmem-32 only improves the coordination and management of the cluster, not the actual data processing capacity; the master node does not perform data transformation work, so this does not address the performance bottleneck. Option D is wrong because Cloud Composer is a workflow orchestration tool that schedules and monitors jobs, but it does not directly improve the runtime performance of the ETL pipeline; setting a higher priority only affects scheduling order, not execution speed.

117
MCQmedium

A company is designing a streaming pipeline using Dataflow to process real-time clickstream data. The pipeline reads from Pub/Sub, performs user sessionization using Apache Beam's Session window, and writes to BigQuery. The team notices that the pipeline's lag is growing and the worker utilization is low. What is the most likely cause and recommended fix?

A.Too many workers are created; reduce the number of workers.
B.The pipeline is not using autoscaling; enable autoscaling.
C.Insufficient disk space per worker; increase the boot disk size.
D.The session window gap duration is too large, causing excessive state per key; reduce the gap duration.
AnswerD

Large gap leads to long-lived state, causing lag and low utilization.

Why this answer

D is correct because a large session window gap duration causes Dataflow to maintain excessive state per key (user session), leading to high memory pressure and slow processing. This results in growing pipeline lag despite low worker utilization, as workers spend more time managing state than processing data. Reducing the gap duration limits the state size and improves throughput.
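To make the effect of the gap duration concrete, the sketch below sessionizes one user's events in plain Python. It is a simplified model of gap-based session windows, not Dataflow's implementation, and the click timestamps are invented for illustration.

```python
def sessionize(timestamps, gap):
    """Group event timestamps (in seconds) into sessions.

    A new session starts when the time since the previous event is at
    least `gap`, which mirrors how gap-based session windows merge.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)   # still inside the open session
        else:
            sessions.append([t])     # gap exceeded: start a new session
    return sessions


# Hypothetical click times for one user, in seconds.
clicks = [0, 50, 100, 400, 450, 2000]

short_gap = sessionize(clicks, gap=60)
long_gap = sessionize(clicks, gap=1800)

print(len(short_gap), max(len(s) for s in short_gap))
print(len(long_gap), max(len(s) for s in long_gap))
```

With the 60-second gap the events split into three small sessions that can be emitted and discarded early. With the 1800-second gap all six events stay buffered in a single open session, which is the per-key state growth described above.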

Exam trap

Google Cloud often tests the misconception that low worker utilization means too many workers, but the real cause is often state bloat from session windows, not resource overprovisioning.

How to eliminate wrong answers

Option A is wrong because low worker utilization indicates workers are underutilized, not overprovisioned; reducing workers would worsen lag. Option B is wrong because autoscaling is enabled by default in Dataflow streaming pipelines, and low utilization suggests the issue is not scaling but state management. Option C is wrong because insufficient disk space typically causes worker failures or out-of-disk errors, not low utilization with growing lag; the symptom here points to state size, not disk I/O.

118
MCQhard

A company runs a Cloud Dataflow streaming pipeline that reads from Cloud Pub/Sub, performs a fixed window of 10 seconds, joins with a slowly-changing dimension table stored in Cloud Bigtable, and writes results to BigQuery. The pipeline has been running for months but recently started exhibiting increasing latency and occasional data loss. The pipeline uses default settings with autoscaling enabled (min 2, max 20 workers). The Bigtable cluster has 3 nodes. The dimensions are updated infrequently. The latency has grown from seconds to minutes. Examining the Dataflow monitoring UI, you see that the 'System Lag' metric is increasing, and some windows are not being emitted. The CPU utilization on Bigtable nodes is below 50%. There are no errors in the logs. Which action is most likely to resolve the issue?

A.Set the pipeline option --maxNumWorkers to a value between 5 and 10.
B.Increase the window duration to 30 seconds to reduce the number of windows.
C.Redesign the pipeline to use a side input for the dimension table instead of a lookup.
D.Increase the number of Bigtable nodes to reduce lookup latency.
AnswerA

Prevents over-scaling and shuffle overhead.

Why this answer

The Bigtable cluster is under 50% CPU and the logs show no errors, so the lookup store is not the bottleneck; the rising system lag and unemitted windows point to the pipeline's own scaling behavior. With autoscaling allowed to range from 2 to 20 workers, the job can over-scale, and every resize redistributes work and shuffle data, which stalls processing and delays window emission. Capping maxNumWorkers at 5–10 limits that churn and shuffle overhead while still leaving enough parallelism for the pipeline to catch up and emit windows reliably.

Exam trap

Google Cloud often tests the misconception that Bigtable or the join strategy is the bottleneck when the real issue is how the pipeline autoscales, leading candidates to scale Bigtable or redesign the join instead of adjusting the autoscaling limits.

How to eliminate wrong answers

Option B is wrong because increasing the window duration to 30 seconds would only delay window emission, not resolve the root cause of increasing system lag or data loss; it could even worsen latency by accumulating more data per window. Option C is wrong because using a side input for a slowly-changing dimension table would require periodic re-reading of the entire table, increasing memory pressure and shuffle overhead, and would not fix a worker-scaling bottleneck. Option D is wrong because Bigtable CPU is below 50%, indicating the lookup latency is not the issue; adding nodes would be unnecessary and would not address the pipeline’s inability to keep up with the streaming throughput.

119
MCQhard

A company stores IoT sensor readings in BigQuery. The table is partitioned by day and clustered by sensor_id. Query performance has degraded as data grows; many queries filter by a date range and a single sensor_id. Which optimization should be applied first?

A.Remove clustering on sensor_id as it may cause overhead.
B.Add a WHERE clause to filter by partition date even if the query already filters by a date range.
C.Increase the number of BigQuery slots assigned to the project.
D.Recluster the table to ensure data is sorted by sensor_id within each partition.
AnswerD

Clustering improves filter performance by reducing scanned data.

Why this answer

Option D is correct because reclustering the table ensures that within each daily partition, the data is physically sorted by sensor_id. This optimizes the performance of queries that filter by a date range and a single sensor_id, as BigQuery can use the clustering metadata to prune blocks and read only the relevant data, reducing the amount of data scanned and improving query speed.
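The pruning effect can be sketched with a toy model in which each storage block records the minimum and maximum sensor_id it holds. This is a simplification for illustration only; it does not reproduce BigQuery's actual storage format, and the row counts and block size are arbitrary.

```python
def blocks_scanned(sensor_ids, block_size, wanted):
    """Count storage blocks whose [min, max] range could contain `wanted`."""
    blocks = [sensor_ids[i:i + block_size]
              for i in range(0, len(sensor_ids), block_size)]
    return sum(1 for block in blocks if min(block) <= wanted <= max(block))


# 1,000 rows from 100 hypothetical sensors, arriving interleaved.
arrival_order = [i % 100 for i in range(1000)]
clustered = sorted(arrival_order)

unclustered_scan = blocks_scanned(arrival_order, block_size=100, wanted=42)
clustered_scan = blocks_scanned(clustered, block_size=100, wanted=42)
print(unclustered_scan, clustered_scan)
```

In this toy partition, a filter on one sensor touches all 10 blocks when rows are stored in arrival order and only 1 block once rows are sorted by sensor_id.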

Exam trap

Google Cloud often tests the misconception that adding more compute resources (slots) or redundant WHERE clauses will fix performance issues caused by poor data layout, when the correct first step is to optimize data organization through clustering and partitioning.

How to eliminate wrong answers

Option A is wrong because removing clustering on sensor_id would eliminate the physical sorting that helps prune blocks for queries filtering by sensor_id, likely worsening performance. Option B is wrong because adding a WHERE clause to filter by partition date is redundant if the query already filters by a date range; BigQuery automatically performs partition pruning based on the date filter, so this would not improve performance. Option C is wrong because increasing the number of BigQuery slots addresses compute resource contention, not the underlying data layout issue; if the query is scanning too much data due to poor clustering, more slots will not reduce the bytes processed.

120
MCQhard

The query above runs slowly on the 10 TB table. Which optimization would most improve performance?

A.Use a subquery to filter item.category first
B.Cluster the table by customer_id
C.Create a materialized view that pre-aggregates by customer_id and item category
D.Partition the table by order_date
AnswerC

A materialized view pre-computes the COUNT for each (customer_id, category), so the query reads a small pre-aggregated table.

Why this answer

Option C is correct because a materialized view can pre-compute and store the aggregated results by customer_id and item category, eliminating the need to scan the full 10 TB table for each query. This dramatically reduces I/O and computation time, especially when the underlying aggregation is expensive and the query pattern is predictable.

Exam trap

Google Cloud often tests the misconception that partitioning or clustering alone can accelerate arbitrary aggregation queries, when in fact they only help with filter-based pruning or specific join patterns, not with reducing the full scan required for grouping without a WHERE clause.

How to eliminate wrong answers

Option A is wrong because using a subquery to filter item.category first does not reduce the scan size; the database still must read the entire 10 TB table to evaluate the subquery, and the optimizer may not push the filter down effectively. Option B is wrong because clustering by customer_id improves range scans and joins on that column, but it does not help with aggregation queries that group by customer_id and item category; the table still must be fully scanned to compute the aggregates. Option D is wrong because partitioning by order_date only prunes partitions when queries filter on order_date; the query in question does not filter by date, so all partitions would be scanned, providing no performance benefit.

121
MCQhard

The exhibit shows a Spark job submitted to Dataproc that fails with an out-of-memory error. Which change should be made to the submission command to resolve the issue?

A.Use a different Spark example class.
B.Increase the number of worker nodes in the cluster.
C.Add --properties spark.executor.memory=8g to the command.
D.Add --driver-memory 8g to the command.
AnswerC

Increases executor heap space.

Why this answer

The out-of-memory error indicates that the Spark executors do not have enough memory to process the data. Adding `--properties spark.executor.memory=8g` increases the memory allocated to each executor, directly addressing the root cause. This property overrides the default executor memory (typically 1g or 4g depending on the cluster configuration) and is the standard way to tune executor memory in Spark on Dataproc.
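Since the exhibit is not reproduced here, the following is only a sketch of how such a submission could be assembled, not the command from the question. The cluster name, region, example class, jar path, and job argument are all placeholder assumptions; the part that matters is the `--properties` value.

```python
def spark_submit_command(cluster, region, main_class, jar, properties, job_args):
    """Build (but do not run) a `gcloud dataproc jobs submit spark` command."""
    # gcloud expects properties as a comma-separated list of key=value pairs.
    props = ",".join(f"{key}={value}" for key, value in sorted(properties.items()))
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar}",
        f"--properties={props}",
        "--",          # everything after this is passed to the Spark job
        *job_args,
    ]


# All values below are placeholders for illustration.
cmd = spark_submit_command(
    cluster="example-cluster",
    region="us-central1",
    main_class="org.apache.spark.examples.SparkPi",
    jar="file:///usr/lib/spark/examples/jars/spark-examples.jar",
    properties={"spark.executor.memory": "8g"},
    job_args=["1000"],
)
print(" ".join(cmd))
```

Further Spark settings can be added to the same dictionary; they are joined into a single `--properties` flag.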

Exam trap

Google Cloud often tests the distinction between driver memory and executor memory, and candidates mistakenly choose `--driver-memory` because they confuse the driver's role with the executors' memory needs, or they assume that increasing cluster size (more nodes) automatically increases per-executor memory.

How to eliminate wrong answers

Option A is wrong because changing the Spark example class does not affect memory allocation; the error is a resource exhaustion issue, not a logic or classpath problem. Option B is wrong because increasing the number of worker nodes distributes the workload across more machines but does not increase the memory per executor; the existing executors would still run out of memory if the data partitions are too large. Option D is wrong because `--driver-memory` controls the memory of the Spark driver process, not the executors; the out-of-memory error occurs in the executors (task execution), not in the driver (which handles scheduling and results collection).

122
MCQmedium

A data engineer is responsible for a batch ETL pipeline that runs daily using Cloud Composer and Dataproc. The pipeline extracts data from Cloud SQL, transforms it with Spark, and loads it into BigQuery. Last night, the pipeline failed because the Spark job ran out of memory. The team needs a solution that prevents future failures without manual intervention. Which solution should they choose?

A.Enable Dataproc autoscaling and configure memory-based scaling
B.Use Cloud Functions to retry the job
C.Use a larger machine type for Dataproc
D.Split the Spark job into multiple stages
AnswerA

Autoscaling adjusts cluster size based on memory usage, preventing OOM.

Why this answer

Option A is correct because Dataproc autoscaling with memory-based scaling dynamically adjusts the cluster size based on the memory utilization of running jobs. This prevents out-of-memory failures by automatically adding worker nodes when memory pressure increases, without requiring manual intervention or pre-provisioning oversized clusters. It directly addresses the root cause—insufficient memory during peak processing—while maintaining cost efficiency.

Exam trap

Google Cloud often tests the misconception that retrying a failed job or manually resizing resources is a sufficient solution, when in fact dynamic, automated scaling is required to handle variable workloads without manual intervention.

How to eliminate wrong answers

Option B is wrong because retrying the failed job with Cloud Functions does not fix the underlying memory issue; the job will simply fail again on retry if the same memory constraints persist. Option C is wrong because using a larger machine type is a static, manual fix that may waste resources during normal operation and still fail if future data volumes exceed the chosen machine's capacity. Option D is wrong because splitting the Spark job into multiple stages does not inherently reduce memory usage per stage; it only reorganizes execution steps and may even increase overhead without addressing memory pressure.

123
MCQhard

A Dataflow streaming pipeline reads from Pub/Sub, applies a ParDo that uses a side input from a BigQuery table (refreshed hourly), and writes to BigQuery. The side input is large and causes increased latency and worker OOM errors. Which design change solves this?

A.Use a stateful ParDo and store the lookup data in an external cache like Cloud Bigtable, performing lookups per element.
B.Increase the side input broadcast frequency to update more often.
C.Split the pipeline into two: one to load the side input, the other to process main input.
D.Use smaller worker machine types to distribute memory across more workers.
AnswerA

External cache reduces per-worker memory footprint and scales well.

Why this answer

Option A is correct because moving the large lookup data to an external cache like Cloud Bigtable offloads memory pressure from workers, eliminating OOM errors. The side input broadcast approach keeps the entire dataset in each worker's memory, which causes OOM when the data is large. Using an external cache allows per-element lookups without storing the entire dataset in memory, reducing latency by avoiding broadcast overhead.
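The memory argument can be sketched as follows: instead of holding the whole lookup table on every worker, each worker fetches keys on demand and keeps only a small bounded cache. The `fetch` callable here is a hypothetical stand-in for a read against an external store such as Bigtable; no real client library is used, and the keys and values are invented.

```python
from collections import OrderedDict


class CachedLookup:
    """Per-element lookup with a bounded LRU cache in front of a remote store."""

    def __init__(self, fetch, max_entries):
        self._fetch = fetch            # hypothetical remote read, e.g. a row lookup
        self._max_entries = max_entries
        self._cache = OrderedDict()
        self.remote_calls = 0

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        self.remote_calls += 1
        value = self._fetch(key)
        self._cache[key] = value
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict the least recently used key
        return value

    def cached_entries(self):
        return len(self._cache)


# A dict stands in for the external store; it is never copied into the worker.
fake_store = {f"sku-{i}": i * 10 for i in range(1000)}
lookup = CachedLookup(fake_store.get, max_entries=2)
results = [lookup.get(k) for k in ["sku-1", "sku-2", "sku-1", "sku-3", "sku-2"]]
print(results, lookup.remote_calls, lookup.cached_entries())
```

Worker memory is bounded by `max_entries` regardless of how large the dimension data grows, at the cost of a remote read on each cache miss.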

Exam trap

Google Cloud often tests the misconception that increasing resources (like worker size or frequency) solves memory issues, when the real solution is to avoid storing large datasets in memory altogether by using an external lookup service.

How to eliminate wrong answers

Option B is wrong because increasing the broadcast frequency would make the OOM and latency problems worse, as it would reload the large dataset into memory more often without reducing memory footprint. Option C is wrong because splitting the pipeline into two pipelines does not solve the fundamental issue of storing the large side input in memory; the side input would still need to be broadcast or cached, and the two pipelines would require coordination, adding complexity without addressing memory pressure. Option D is wrong because using smaller worker machine types reduces available memory per worker, which would exacerbate OOM errors and increase latency due to more frequent garbage collection and slower processing.

124
MCQmedium

A team is designing an event-driven data pipeline. They need to process messages from Cloud Pub/Sub, transform them, and write to BigQuery. The messages have variable volume and spikes. What is the best serverless compute option for this workload?

A.Cloud Functions triggered by Pub/Sub
B.Compute Engine with a Pub/Sub client library
C.Cloud Run invoked via Eventarc
D.Cloud Dataflow with a streaming pipeline
AnswerD

Dataflow can handle variable volume, autoscale, and directly read from Pub/Sub and write to BigQuery.

Why this answer

Cloud Dataflow with a streaming pipeline is the best serverless compute option because it is purpose-built for unbounded, variable-volume data streams from Pub/Sub and provides exactly-once processing semantics, auto-scaling, and built-in BigQuery sink integration via the Beam SDK. Unlike simpler compute options, Dataflow handles backpressure, windowing, and state management natively, making it ideal for spikes and high-throughput transformations without manual scaling or idempotency concerns.

Exam trap

Google Cloud often tests the misconception that any serverless compute (like Cloud Functions or Cloud Run) can handle streaming data pipelines, but the trap here is that these services lack native support for unbounded data, stateful processing, and automatic scaling under variable volume, which only Dataflow provides as a fully managed stream processor.

How to eliminate wrong answers

Option A is wrong because Cloud Functions triggered by Pub/Sub is designed for lightweight, short-lived event processing (max 9 minutes timeout) and cannot handle sustained high-throughput streaming transformations or complex stateful operations like windowing and joins, leading to data loss or timeouts under spikes. Option B is wrong because Compute Engine with a Pub/Sub client library is not serverless—it requires manual provisioning, scaling, and management of VMs, and it lacks native integration with BigQuery for streaming writes, adding operational overhead. Option C is wrong because Cloud Run invoked via Eventarc is a request-response compute model with a 60-minute timeout and concurrency limits; it does not natively support unbounded streaming, checkpointing, or exactly-once processing for Pub/Sub messages, making it unsuitable for variable-volume data pipelines.

125
MCQmedium

A data pipeline uses Cloud Composer to orchestrate Dataflow and BigQuery jobs. The pipeline fails intermittently with dependency errors. Which design change can improve reliability?

A.Use retries with exponential backoff
B.Switch to Cloud Functions for orchestration
C.Increase worker count in Dataflow
D.Use a simpler DAG with fewer dependencies
AnswerA

Retries with backoff handle transient failures, improving reliability.

Why this answer

Cloud Composer (Apache Airflow) tasks can fail due to transient issues like API rate limits or resource contention. Implementing retries with exponential backoff allows the DAG to automatically re-attempt failed tasks with increasing delays, reducing the impact of intermittent failures without manual intervention. This is a standard Airflow pattern for improving reliability in orchestrated pipelines.
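As a sketch of that pattern, the dictionary below uses Airflow's standard task retry arguments; the specific values are assumptions chosen for illustration. The helper is a simplified model that doubles the delay on each attempt, whereas Airflow's real schedule also adds jitter, so actual waits will differ.

```python
from datetime import timedelta

# Values are illustrative; such a dict is typically passed as a DAG's default_args.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=10),
}


def approximate_delays(args):
    """Simplified backoff model: double the base delay per attempt, capped."""
    base = args["retry_delay"]
    cap = args["max_retry_delay"]
    return [min(base * (2 ** attempt), cap) for attempt in range(args["retries"])]


delays = approximate_delays(default_args)
print([int(d.total_seconds()) for d in delays])
```

Under this simplified model the three retries wait about 60, 120, and 240 seconds, giving a rate-limited API or a busy resource progressively more time to recover.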

Exam trap

Google Cloud often tests the distinction between scaling compute resources (Dataflow workers) and improving orchestration reliability (retries), leading candidates to mistakenly choose option C when the problem is transient task failures, not resource bottlenecks.

How to eliminate wrong answers

Option B is wrong because Cloud Functions is a serverless compute service, not a workflow orchestrator; it lacks built-in support for managing task dependencies, retries, and scheduling across multiple services like Dataflow and BigQuery. Option C is wrong because increasing the Dataflow worker count addresses throughput and latency, not dependency errors in the orchestration layer; dependency errors stem from task sequencing or transient failures in Airflow, not from Dataflow parallelism. Option D is wrong because simplifying the DAG reduces complexity but does not handle intermittent failures; the core issue is transient errors, not the number of dependencies, and removing dependencies may break business logic.

126
MCQhard

A Dataflow pipeline reads from Cloud Pub/Sub and writes to Cloud Storage. The pipeline needs to guarantee exactly-once processing despite worker failures. Which configuration ensures exactly-once semantics?

A.Use a side input from a deduplication dataset
B.Set the pipeline to use a global window with no early triggers
C.Insert a Reshuffle transform after reading
D.Enable exactly-once delivery on the Pub/Sub subscription and use an idempotent sink
AnswerD

Pub/Sub exactly-once delivery and an idempotent Storage write (e.g., using file naming) ensure no duplicates.

Why this answer

Option D is correct because Pub/Sub subscriptions can be configured with exactly-once delivery (using the `enableExactlyOnceDelivery` flag), which ensures that each message is delivered to the subscriber exactly once. Combining this with an idempotent sink (e.g., Cloud Storage with unique filenames or deduplication logic) guarantees that even if a worker fails and the pipeline retries, the output will not contain duplicates. This is the only option that directly addresses both the source and sink to achieve end-to-end exactly-once semantics.
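One way to picture an idempotent sink is deterministic object naming: a retried write for the same window and shard overwrites the same object instead of creating a second one. The sketch below uses a dict as a stand-in for an object store; the bucket prefix, naming scheme, and payload are hypothetical.

```python
def output_object_name(prefix, window_start, window_end, shard, num_shards):
    """Derive the object name only from the window and shard, never from the attempt."""
    return (f"{prefix}/{window_start}-{window_end}"
            f"-{shard:05d}-of-{num_shards:05d}.json")


def write_window(store, prefix, window_start, window_end, shard, num_shards, payload):
    """Write (or overwrite) the object for one window shard; safe to retry."""
    name = output_object_name(prefix, window_start, window_end, shard, num_shards)
    store[name] = payload
    return name


bucket = {}  # stands in for an object store
args = ("gs://example-bucket/clicks", "2024-01-01T00:00", "2024-01-01T00:10", 0, 1)

first = write_window(bucket, *args, payload='{"clicks": 3}')
retry = write_window(bucket, *args, payload='{"clicks": 3}')  # simulated retry after a failure
print(first == retry, len(bucket))
```

Because both attempts resolve to the same name, the store ends up with a single object, so a retry after a worker failure does not produce duplicate output.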

Exam trap

Google Cloud often tests the misconception that a single transform (like Reshuffle) or windowing strategy can guarantee exactly-once processing, when in reality it requires both source-level exactly-once delivery and an idempotent sink to handle retries from worker failures.

How to eliminate wrong answers

Option A is wrong because using a side input from a deduplication dataset does not prevent duplicate processing at the source; it only attempts to deduplicate after the fact, which is not a guarantee of exactly-once processing and adds complexity and latency. Option B is wrong because a global window with no early triggers controls when results are emitted, but it does not prevent duplicate messages from being processed due to worker failures or retries. Option C is wrong because a Reshuffle transform (which groups elements under random keys and then ungroups them) can help with fault tolerance by breaking fusion, but it does not provide exactly-once semantics; it only ensures that elements are redistributed, not that duplicates are eliminated.

127
MCQeasy

A data engineer needs to design a data processing system that ingests large volumes of sensor data from IoT devices. The data should be stored in a schema-less format and allow for real-time analytics. Which Google Cloud service is most appropriate?

A.Cloud Spanner
B.Firestore
C.Cloud Bigtable
D.Cloud SQL
AnswerC

Bigtable is schema-less, highly scalable, and ideal for time-series sensor data.

Why this answer

Cloud Bigtable is the most appropriate choice because it is a fully managed, scalable NoSQL database designed for large-scale analytical and operational workloads. It supports schema-less storage of time-series sensor data, exposes an HBase-compatible API, and integrates with analytics tools such as BigQuery and Dataflow, meeting the requirements for high-throughput ingestion and low-latency queries.

Exam trap

The trap here is that candidates often confuse Cloud Bigtable with Firestore or Cloud SQL because they all offer NoSQL or relational storage, but fail to recognize that Bigtable is purpose-built for high-throughput, schema-less time-series data and real-time analytics, while the others are optimized for transactional or mobile workloads.

How to eliminate wrong answers

Option A is wrong because Cloud Spanner is a globally distributed, strongly consistent relational database that enforces a fixed schema, making it unsuitable for schema-less IoT data and overkill for real-time analytics at scale. Option B is wrong because Firestore is a document-oriented NoSQL database optimized for mobile and web app real-time synchronization, not for high-throughput ingestion of large volumes of sensor data or analytical workloads. Option D is wrong because Cloud SQL is a managed relational database service (MySQL, PostgreSQL, SQL Server) that requires a predefined schema and cannot handle the petabyte-scale, high-write throughput demands of IoT sensor data without significant performance degradation.

128
MCQeasy

A data engineer is running a Dataproc cluster for a batch ETL job that needs to process 10 TB of data. The job is memory-intensive. The cluster currently uses n1-standard-4 workers. Performance is poor. What is the most cost-effective change to improve performance?

A.Use high-memory machine types (n1-highmem-4)
B.Use preemptible workers to reduce cost
C.Switch to n2-standard-4 machine types
D.Add more n1-standard-4 workers
AnswerA

High-memory machines provide more memory per core, better for memory-bound jobs.

Why this answer

The job is memory-intensive, and n1-standard-4 workers have 15 GB of RAM, which may be insufficient for the workload, causing excessive disk spill or OOM errors. Switching to n1-highmem-4 provides 26 GB of RAM per worker (a 73% increase) without increasing vCPU count, directly addressing the memory bottleneck at a lower cost than adding more workers. This is the most cost-effective change because it raises memory per worker without paying for additional vCPUs.
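
The memory figures quoted above can be checked with a quick calculation. The sketch uses only the numbers stated in this explanation (4 vCPUs, 15 GB and 26 GB); the variable names are illustrative.

```python
VCPUS = 4
STANDARD_GB = 15  # n1-standard-4, as stated above
HIGHMEM_GB = 26   # n1-highmem-4, as stated above

standard_per_vcpu = STANDARD_GB / VCPUS  # GB of RAM per vCPU
highmem_per_vcpu = HIGHMEM_GB / VCPUS
increase_pct = (HIGHMEM_GB - STANDARD_GB) / STANDARD_GB * 100

print(standard_per_vcpu)     # 3.75
print(highmem_per_vcpu)      # 6.5
print(round(increase_pct))   # 73
```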

Exam trap

The trap here is that candidates often assume adding more workers (scaling out) is always the best way to improve performance, but for memory-intensive jobs, scaling up (using high-memory instances) is more cost-effective because it addresses the root cause—per-worker memory pressure—without wasting resources on additional vCPUs.

How to eliminate wrong answers

Option B is wrong because preemptible workers reduce cost but do not improve performance for a memory-intensive job; they are suitable for fault-tolerant, stateless workloads, not for memory-bound ETL tasks that may fail if preempted. Option C is wrong because n2-standard-4 machine types offer similar memory (16 GB) to n1-standard-4 (15 GB) and only provide a modest CPU performance improvement via newer architecture, which does not address the memory bottleneck. Option D is wrong because adding more n1-standard-4 workers increases total vCPUs and cost but does not increase per-worker memory, so the memory-intensive job will still suffer from the same per-worker memory constraints, leading to inefficient resource utilization.

129
MCQhard

A company uses Cloud Dataflow to process financial transactions from Pub/Sub to BigQuery. The pipeline must ensure exactly-once semantics. Recently, they noticed duplicate rows in BigQuery. The source publishes with at-least-once. The Dataflow pipeline uses idempotent writes. What is the most likely cause?

A.The pipeline uses file loads as a sink
B.The pipeline's watermark is misconfigured
C.The pipeline uses GlobalWindows
D.The pipeline has autoscaling enabled
AnswerB

A misconfigured watermark can cause late data to be processed again, producing duplicates.

Why this answer

The most likely cause is a misconfigured watermark. In Dataflow, the watermark tracks event time progress and determines when to trigger window results. If the watermark is misconfigured (e.g., too aggressive or based on incorrect timestamps), late-arriving data may be processed in multiple windows, leading to duplicate rows even with idempotent writes.

Since the source uses at-least-once delivery, late data can be re-published, and a faulty watermark can cause it to be written again.

Exam trap

The trap here is that candidates assume idempotent writes alone guarantee exactly-once, but they overlook that watermark misconfiguration can cause the same event to be processed in multiple windows, leading to duplicates despite idempotent sinks.

How to eliminate wrong answers

Option A is wrong because file loads as a sink can cause duplicates if a load job is retried, but the question states that the pipeline uses idempotent writes, so a retried write should not add rows. Option C is wrong because GlobalWindows do not cause duplicates; they place all data in a single window, and duplicates would still be prevented by idempotent writes. Option D is wrong because autoscaling adjusts worker count but does not inherently cause duplicate writes; Dataflow handles state and checkpointing correctly during scaling.

130
MCQmedium

A data engineering team uses Cloud Data Fusion to build ETL pipelines. They have a pipeline that reads from Cloud SQL, transforms data using Wrangler, and writes to BigQuery. The pipeline fails intermittently with a 'connection timeout' error from Cloud SQL. What is the best way to handle this?

A.Use Cloud NAT to provide a static IP for Data Fusion to whitelist.
B.Configure the Cloud SQL connector in Data Fusion to use retry logic and increase the connection timeout.
C.Increase the number of Data Fusion nodes to distribute the load.
D.Migrate Cloud SQL to Cloud Spanner to handle higher concurrency.
AnswerB

Retries and longer timeouts handle transient failures.

Why this answer

Option B is correct because Cloud Data Fusion's Cloud SQL connector can be configured with retry logic and an increased connection timeout to handle transient network issues. This directly addresses the intermittent 'connection timeout' error without requiring architectural changes, as the error is likely due to brief network latency or resource contention, not a persistent connectivity problem.
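
The general retry-with-backoff pattern behind this answer can be sketched as follows. This is not the Data Fusion connector configuration; `with_retries` and `flaky_query` are hypothetical helpers that only illustrate how a transient timeout is absorbed.

```python
import time


def with_retries(operation, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run operation, retrying on TimeoutError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky_query():
    """Stand-in for a database read that times out twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("connection timeout")
    return "rows"


# The sleep function is swapped for a no-op so the example runs instantly.
result = with_retries(flaky_query, sleep=lambda seconds: None)
print(result, calls["n"])  # rows 3
```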

Exam trap

The trap here is that candidates often assume connectivity issues require network-level fixes (like static IPs or NAT) or scaling, rather than recognizing that transient timeouts are best handled by application-level retry and timeout configuration.

How to eliminate wrong answers

Option A is wrong because using Cloud NAT to provide a static IP for whitelisting addresses IP-based access control, but the error is a connection timeout, not an authorization failure; whitelisting does not resolve transient network delays. Option C is wrong because increasing the number of Data Fusion nodes distributes compute load but does not fix connection timeouts to Cloud SQL, which are caused by network or database-side issues, not pipeline parallelism. Option D is wrong because migrating to Cloud Spanner is an overengineered solution for a transient timeout; it introduces unnecessary complexity and cost, and does not address the root cause of intermittent connectivity.

131
MCQmedium

A media company ingests video files from partners via a REST API. Files are stored in Cloud Storage, and metadata is written to Firestore. A Cloud Function is triggered on object finalize to transcode video using Transcoder API. Sometimes, the function fails because the file is still being uploaded when triggered. How should this be fixed?

A.Implement a Cloud Composer workflow to poll for file existence.
B.Require partners to use resumable uploads.
C.Increase the Cloud Functions timeout to allow time for the upload to finish.
D.Use Cloud Pub/Sub notifications for Cloud Storage and trigger the function from the subscription.
AnswerD

Pub/Sub notifications are sent after object finalization.

Why this answer

Option D is correct because Cloud Storage object finalize notifications are sent only after the entire file has been written and committed. By using Pub/Sub notifications for Cloud Storage and triggering the Cloud Function from the subscription, you decouple the trigger from the upload process, ensuring the function only runs when the file is fully available. This eliminates the race condition where the function is triggered before the upload completes.

Exam trap

The trap here is that candidates assume 'object finalize' means the upload is complete, but in practice, the event can fire before the upload is fully committed, leading to the misconception that increasing timeouts or changing upload methods will fix the issue.

How to eliminate wrong answers

Option A is wrong because implementing a Cloud Composer workflow to poll for file existence adds unnecessary complexity, latency, and cost; polling is an inefficient solution compared to event-driven triggers. Option B is wrong because requiring partners to use resumable uploads does not change the fact that the Cloud Function is triggered on object finalize before the upload is fully committed; resumable uploads affect the upload mechanism, not the timing of the finalize event. Option C is wrong because increasing the Cloud Functions timeout does not address the root cause—the function is triggered prematurely; the function will still fail if the file is incomplete, regardless of how long it runs.

132
MCQhard

A multinational e-commerce company runs a real-time recommendation system. The architecture: user click events are sent via HTTP to a Cloud Run service, which publishes them to a Cloud Pub/Sub topic. A Dataflow streaming pipeline reads from the subscription, joins with user profile data from Firestore, computes recommendations using a TensorFlow model (loaded as a side input), and writes results to a Redis cache (Memorystore) for low-latency serving. The pipeline is deployed in us-central1. Recently, the team noticed that recommendation latency has increased from 50ms to 500ms, and the pipeline's backlog is growing. The Dataflow monitoring shows high CPU utilization on workers, and the SystemLag metric is 2 minutes and increasing. The Redis cluster shows no performance issues. The Firestore queries are within normal latency. The team suspects the TensorFlow model inference is the bottleneck. The model is a large neural network (500MB) loaded in each worker's memory. The pipeline uses 10 n1-standard-4 workers. The pipeline is using Dataflow's streaming engine. The team wants to reduce latency without increasing cost significantly. What should they do?

A.Increase the number of workers by adding a secondary worker group with preemptible VMs.
B.Switch to a batch pipeline that runs every minute to reduce frequency of inference.
C.Increase the machine type of workers to n1-highmem-8 to provide more memory for the model.
D.Remove the model side input and call Cloud Run for inference using a separate service.
AnswerA

More workers parallelize inference, preemptible VMs keep cost low.

Why this answer

Option A is correct because adding preemptible VMs as a secondary worker group allows horizontal scaling at lower cost, distributing the TensorFlow model inference load across more workers. This reduces CPU utilization per worker and decreases the SystemLag without significantly increasing cost, as preemptible VMs are much cheaper than regular instances. The bottleneck is CPU-bound model inference, not memory, so more workers directly address the high CPU utilization and growing backlog.

Exam trap

The trap here is that candidates assume a memory issue (option C) because the model is large, but the real bottleneck is CPU utilization from repeated inference, not memory exhaustion.

How to eliminate wrong answers

Option B is wrong because switching to a batch pipeline would increase latency (from seconds to minutes) and is unsuitable for real-time recommendations; the team needs low-latency streaming, not batch. Option C is wrong because the issue is high CPU utilization, not memory pressure; the model is 500MB and n1-standard-4 has 15GB RAM, which is sufficient, so increasing memory does not address the CPU bottleneck and increases cost unnecessarily. Option D is wrong because removing the model side input and calling Cloud Run for inference adds network latency and cost per request, likely worsening latency and increasing cost, and does not leverage Dataflow's in-memory model loading for efficiency.

133
Matchingmedium

Match each Google Cloud monitoring/logging service to its function.

Concepts
Matches

Metrics and alerting for cloud resources

Centralized log storage and analysis

Aggregates and analyzes application errors

Records administrative and data access activities

Why these pairings

Services for observability and compliance.

134
MCQmedium

A data pipeline uses Cloud Pub/Sub to ingest events and Cloud Functions to transform and write to BigQuery. The system is experiencing data loss during Pub/Sub subscription outages. Which design change improves reliability?

A.Use Dataflow with at-least-once delivery and checkpointing
B.Use a pull subscription with a custom app that polls frequently
C.Use long ack deadlines to keep messages in the subscription
D.Increase the timeout in Cloud Functions
AnswerA

Dataflow checkpoints its progress and replays unacknowledged messages, which prevents data loss.

Why this answer

Dataflow with at-least-once delivery and checkpointing ensures that messages are not lost during Pub/Sub subscription outages because Dataflow tracks processing progress via checkpoints and can replay unacknowledged messages from the last checkpoint. This decouples the processing from the subscription's transient failures, providing fault-tolerant, exactly-once or at-least-once semantics depending on the sink.

Exam trap

Google Cloud often tests the misconception that increasing timeouts or ack deadlines alone can prevent data loss, when in reality they only delay the inevitable loss without a replay mechanism like checkpointing or a persistent buffer.

How to eliminate wrong answers

Option B is wrong because a pull subscription with a custom app that polls frequently does not inherently provide reliability during subscription outages; the app would still lose messages if the subscription itself is unavailable or if the app fails to acknowledge before the ack deadline. Option C is wrong because long ack deadlines only keep messages in the subscription for a longer time, but they do not prevent data loss if the subscriber crashes or the subscription becomes unavailable; messages can still be dropped if the deadline expires without ack. Option D is wrong because increasing the timeout in Cloud Functions does not address data loss from subscription outages; it only allows the function to run longer before timing out, but does not provide replay or checkpointing mechanisms.

135
MCQmedium

A Dataflow pipeline reads events from Pub/Sub and transforms them. Some events contain invalid product IDs that should be filtered out. The list of valid product IDs is stored in a frequently updated BigQuery table. What is the best approach to filter out invalid events?

A.Read the BigQuery table as a side input and refresh it periodically using a global window with a periodic trigger
B.Use a Combine.PerKey to group by product ID and then filter
C.Use a custom pipeline option to read the valid IDs at startup and cache them
D.Use a ParDo with a side input that is a MapSideInput of valid IDs, and refresh it on each element
AnswerA

This approach allows the side input to be updated without restarting the pipeline, and the trigger ensures periodic refresh.

Why this answer

Option A is correct because reading the BigQuery table as a side input with a global window and periodic trigger allows the pipeline to refresh the list of valid product IDs at a configurable interval without reprocessing the entire stream. This pattern is idiomatic for Beam/Dataflow when the reference data changes frequently and must be kept reasonably current while maintaining low latency for streaming events.
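
The effect of a periodically refreshed lookup can be imitated in plain Python. This is a sketch of the behaviour only, not Beam's side input API: `RefreshingLookup`, `load_ids`, and the 60-second interval are assumptions made for illustration, and the list stands in for the BigQuery table.

```python
class RefreshingLookup:
    """Cache a set of valid IDs and reload it once the refresh interval has passed."""

    def __init__(self, loader, refresh_seconds, clock):
        self._loader = loader
        self._refresh = refresh_seconds
        self._clock = clock
        self._loaded_at = None
        self._values = set()

    def contains(self, key):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._refresh:
            self._values = set(self._loader())  # one read per interval, not per element
            self._loaded_at = now
        return key in self._values


table = ["p1", "p2"]  # stand-in for the BigQuery table of valid product IDs
now = [0]             # fake clock so the example is deterministic
loads = []


def load_ids():
    loads.append(now[0])
    return list(table)


lookup = RefreshingLookup(load_ids, refresh_seconds=60, clock=lambda: now[0])

first = lookup.contains("p1")   # True: initial load
table.append("p3")              # the table is updated upstream
now[0] = 10
stale = lookup.contains("p3")   # False: cache not yet refreshed
now[0] = 70
fresh = lookup.contains("p3")   # True: interval elapsed, table reloaded
```

The table is read twice for three lookups, which is the trade-off the answer describes: slightly stale data in exchange for far fewer reads than a per-element refresh.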

Exam trap

Google Cloud often tests the misconception that side inputs are static or that per-element refresh is feasible, leading candidates to choose Option D, but in reality side inputs are materialized once per window/trigger and cannot be efficiently updated per element.

How to eliminate wrong answers

Option B is wrong because Combine.PerKey is designed for aggregating values per key (e.g., summing counts), not for filtering based on an external lookup; it would not incorporate the BigQuery table at all. Option C is wrong because custom pipeline options are evaluated at pipeline construction time and cannot be updated during pipeline execution, so the cached list would become stale as soon as the BigQuery table is updated. Option D is wrong because refreshing the side input on each element would cause excessive BigQuery read operations, leading to high latency and cost; MapSideInput is read-only once materialized and does not support per-element refresh.

136
MCQhard

A company needs to process sensitive data in BigQuery with column-level security. They want to allow analysts to see aggregated data but not individual records. What approach?

A.Use table-level access controls
B.Use column-level access controls with masking
C.Use authorized views with aggregation functions
D.Use Cloud Data Loss Prevention to de-identify data
AnswerC

Authorized views can present aggregated data while hiding raw details.

Why this answer

Option C is correct because authorized views in BigQuery allow you to define SQL queries that aggregate data (e.g., using SUM, COUNT, AVG) and expose only the aggregated results to analysts, while hiding individual records. This approach enforces column-level security by granting access to the view rather than the underlying table, ensuring analysts cannot query the raw data directly. It meets the requirement of seeing aggregated data without seeing individual records, leveraging BigQuery's native authorization and SQL capabilities.

Exam trap

Google Cloud often tests the distinction between column-level masking (which still allows row-level access) and authorized views (which enforce aggregation at the query level), leading candidates to pick B because they confuse masking with aggregation-based security.

How to eliminate wrong answers

Option A is wrong because table-level access controls grant access to entire tables, which would allow analysts to see individual records, not just aggregated data, violating the requirement. Option B is wrong because column-level access controls with masking can hide specific column values (e.g., by replacing them with NULL or a mask), but they still allow analysts to query individual rows and see non-masked columns, potentially exposing record-level details; they do not inherently restrict access to only aggregated results. Option D is wrong because Cloud Data Loss Prevention (DLP) is used for de-identifying data at rest or in transit (e.g., via inspection and transformation jobs), but it does not provide real-time, query-level aggregation controls within BigQuery; analysts would still have access to the underlying de-identified table, which could contain individual records.

137
MCQeasy

A company needs to stream data from a fleet of IoT devices to BigQuery for near-real-time analytics. The data volume is unpredictable and can spike during certain events. Which Google Cloud service should be used as the ingestion point to handle variable throughput with minimal operational overhead?

A.Cloud Datastore
B.Cloud Functions
C.Cloud Storage
D.Cloud Pub/Sub
AnswerD

Cloud Pub/Sub ingests variable-volume data and decouples producers from consumers.

Why this answer

Cloud Pub/Sub is the correct choice because it is a fully managed, scalable messaging service designed to decouple data producers from consumers, handling unpredictable and spiky throughput without requiring manual scaling. It can ingest millions of messages per second and buffer them until BigQuery is ready to consume, ensuring near-real-time analytics with minimal operational overhead.

Exam trap

Google Cloud often tests the misconception that Cloud Functions can serve as a direct ingestion point for streaming data, but candidates overlook that Cloud Functions lacks durable buffering and automatic scaling for high-throughput spikes, making Pub/Sub the correct decoupling layer.

How to eliminate wrong answers

Option A is wrong because Cloud Datastore is a NoSQL document database for storing structured data, not a streaming ingestion service; it cannot handle variable-throughput message ingestion or buffer spikes. Option B is wrong because Cloud Functions is a serverless compute platform for event-driven code execution, not a durable ingestion buffer; it lacks built-in buffering and would require custom scaling logic to handle throughput spikes. Option C is wrong because Cloud Storage is an object storage service for batch data, not designed for near-real-time streaming ingestion; it introduces latency and requires additional components (e.g., Cloud Functions or Pub/Sub notifications) to trigger downstream processing.

138
MCQeasy

Based on the exhibit, what is the most likely cause of the out-of-memory error?

A.The BigQuery output table schema does not match the transformed data, causing write failures.
B.The Pub/Sub subscription is not acknowledging messages quickly enough, causing a backlog.
C.The worker machine type has insufficient memory for the message size and throughput.
D.The fixed window duration of 1 minute is too short, causing excessive state overhead.
AnswerC

Large messages (50 KB) and high throughput (1000/sec) require more memory; n1-standard-4 may be undersized.

Why this answer

The out-of-memory error in a Dataflow pipeline is most likely caused by the worker machine type having insufficient memory for the message size and throughput. When messages are large or the throughput is high, each worker must hold data in memory for processing, windowing, and shuffling. If the worker's memory is too small, the JVM heap runs out of memory, leading to an OOM error.
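
A rough estimate shows the data volume involved. The exhibit is not reproduced here, so the sketch takes the figures from the answer summary (50 KB messages, 1000 per second) and the 1-minute window from option D; it assumes 1 KB = 1000 bytes and ignores serialization overhead and how the load is split across workers.

```python
MESSAGE_BYTES = 50 * 1000      # 50 KB per message (from the answer summary)
MESSAGES_PER_SECOND = 1000     # throughput (from the answer summary)
WINDOW_SECONDS = 60            # 1-minute fixed window (from option D)

bytes_per_second = MESSAGE_BYTES * MESSAGES_PER_SECOND
bytes_per_window = bytes_per_second * WINDOW_SECONDS
gb_per_window = bytes_per_window / 1e9

print(bytes_per_second / 1e6)  # 50.0 (MB per second)
print(gb_per_window)           # 3.0 (GB of raw payload per window, pipeline-wide)
```

If a stage buffers a full window before emitting, this is the order of magnitude of data that must fit in the combined worker memory.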

Exam trap

Google Cloud often tests the misconception that OOM errors are caused by schema mismatches or Pub/Sub backlogs, but the real cause is almost always insufficient worker memory for the data volume.

How to eliminate wrong answers

Option A is wrong because a schema mismatch between the BigQuery output table and the transformed data would cause write failures or errors in the BigQuery IO connector, not an out-of-memory error on the worker. Option B is wrong because a Pub/Sub subscription not acknowledging messages quickly enough would cause a backlog and increase unacknowledged message count, but it would not directly cause an out-of-memory error on the Dataflow worker; the pipeline would still process messages at its own pace, and the backlog would be in Pub/Sub, not in worker memory. Option D is wrong because a fixed window duration of 1 minute being too short would increase state overhead only if the pipeline uses stateful processing or triggers that accumulate state across windows; for a simple streaming pipeline, shorter windows actually reduce the amount of data held in memory per window, not cause OOM.

139
MCQmedium

A company stores IoT sensor data in BigQuery. Queries that filter on a timestamp column and a device_id column are slow even though the table is partitioned by day. What should the data engineer do to improve query performance?

A.Increase the partition size to monthly
B.Switch to ingestion-time partitioning instead of column-based
C.Enable automatic query rewriting with BI Engine
D.Cluster the table on device_id
AnswerD

Clustering organizes data within partitions, improving filter performance.

Why this answer

Clustering on device_id organizes the data within each day partition by device_id, allowing BigQuery to prune blocks during queries that filter on that column. This reduces the amount of data scanned and improves query performance without changing the partitioning scheme. Partitioning alone only limits scans by time range; clustering adds intra-partition sorting for non-time-based filters.
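
Block pruning can be illustrated with a toy model. It assumes each storage block records the minimum and maximum `device_id` it contains, which is a simplification made for this example and not a description of BigQuery internals; `make_blocks` and `blocks_scanned` are hypothetical helpers.

```python
def make_blocks(rows, block_size):
    """Split rows into fixed-size blocks, in storage order."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def blocks_scanned(blocks, device_id):
    """Count blocks whose [min, max] range could contain the target device_id."""
    return sum(1 for block in blocks if min(block) <= device_id <= max(block))


# One day's partition: 8 devices with 4 readings each, arriving interleaved.
unclustered = [f"dev-{i % 8}" for i in range(32)]
clustered = sorted(unclustered)  # clustering keeps equal device_ids together

scan_unclustered = blocks_scanned(make_blocks(unclustered, 8), "dev-3")
scan_clustered = blocks_scanned(make_blocks(clustered, 8), "dev-3")

print(scan_unclustered)  # 4: every block might hold dev-3
print(scan_clustered)    # 1: only one block's range covers dev-3
```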

Exam trap

Google Cloud often tests the distinction between partitioning (which prunes by time) and clustering (which prunes by non-time columns), and the trap here is assuming that partitioning alone is sufficient for all filter columns, leading candidates to choose an option that changes the partition strategy rather than adding clustering.

How to eliminate wrong answers

Option A is wrong because increasing partition size to monthly would reduce the number of partitions, making each partition larger and actually increasing the data scanned for queries that filter on a specific day, worsening performance. Option B is wrong because ingestion-time partitioning is equivalent to partitioning on a pseudo-column (_PARTITIONTIME) and does not address the need to optimize filtering on device_id; it would not improve performance for queries filtering on device_id. Option C is wrong because BI Engine accelerates sub-second queries on small to medium datasets by caching results, but it does not reduce the amount of data scanned for large tables or optimize filtering on device_id; it is designed for interactive analytics, not for improving slow queries due to full table scans.

140
MCQeasy

The push endpoint is returning 500 errors. What is the most likely cause?

A.The push endpoint requires authentication but none is set
B.The topic has no messages
C.The push endpoint is not a valid HTTPS URL
D.The ack deadline is too short
AnswerA

If the endpoint expects an Authorization header, requests without it will fail with 500 or 401.

Why this answer

The push endpoint likely requires authentication, but none is configured, causing the 500 errors.

141
MCQhard

A company processes financial transactions using Cloud Dataflow. They need to ensure that late-arriving data is handled correctly for fraud detection. The pipeline uses event time processing. Which approach should they use to handle late data?

A.Sliding windows with early firing
B.Session windows with gap duration
C.Fixed windows with allowed lateness
D.Global windows with triggers
AnswerC

Allowed lateness includes late events in the correct window.

Why this answer

Option C is correct because fixed windows with allowed lateness are the standard approach in Cloud Dataflow (Apache Beam) for handling late-arriving data in event-time processing. By specifying an allowed lateness duration, the pipeline retains the window state for that period, allowing late events to be correctly assigned to their original window and triggering recomputation of results. This ensures fraud detection pipelines can account for delayed transactions without missing or misordering data.
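
The rule described above can be sketched without the Beam SDK. The functions below are a simplified model, assuming a window stops accepting data once the watermark passes the window end plus the allowed lateness; `window_start` and `accept` are illustrative names, and timestamps are plain seconds.

```python
def window_start(event_ts, size):
    """Start of the fixed window that contains event_ts."""
    return event_ts - (event_ts % size)


def accept(event_ts, watermark, size, allowed_lateness):
    """Return the window start if the event is still accepted, else None (dropped)."""
    start = window_start(event_ts, size)
    end = start + size
    if watermark > end + allowed_lateness:
        return None  # window state has been discarded
    return start


SIZE = 60       # 1-minute fixed windows
LATENESS = 120  # keep window state for 2 extra minutes

on_time = accept(45, watermark=50, size=SIZE, allowed_lateness=LATENESS)
late_ok = accept(45, watermark=150, size=SIZE, allowed_lateness=LATENESS)
too_late = accept(45, watermark=200, size=SIZE, allowed_lateness=LATENESS)

print(on_time, late_ok, too_late)  # 0 0 None
```

The late event at watermark 150 still lands in its original window (starting at 0), while the one arriving at watermark 200 is past the cutoff of 180 and is dropped.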

Exam trap

Google Cloud often tests the misconception that sliding or session windows inherently handle late data, when in fact only explicit allowed lateness (or a similar mechanism) provides the necessary state retention and watermark adjustment for late-arriving events.

How to eliminate wrong answers

Option A is wrong because sliding windows with early firing are designed to produce speculative results before the window closes, not to handle late-arriving data; early firing does not extend the window to accept late events. Option B is wrong because session windows with gap duration are used to group events into sessions based on inactivity gaps, not to manage late data; they do not provide a mechanism to accept events that arrive after the session has closed. Option D is wrong because global windows with triggers are typically used for unbounded aggregations where all data belongs to a single window, but they do not naturally handle late-arriving data within specific time boundaries required for fraud detection; they lack the per-window lateness cutoff that fixed windows offer.

142
MCQeasy

A data engineer needs to automatically delete objects from a Cloud Storage bucket after 30 days and archive them to nearline storage after 7 days. Which configuration should they use?

A.Set a lifecycle rule to SetStorageClass to nearline after 30 days only
B.Set a lifecycle rule to delete objects after 7 days only
C.Set a lifecycle rule to SetStorageClass to nearline after 7 days and delete after 30 days
D.Set a lifecycle rule to delete objects after 7 days and SetStorageClass to nearline after 30 days
AnswerC

Correct: archive after 7 days, delete after 30.

Why this answer

Option C is correct because it implements a lifecycle rule that first transitions objects to Nearline storage after 7 days (reducing costs for infrequently accessed data) and then deletes them after 30 days. This matches the requirement to archive after 7 days and delete after 30 days, using the `SetStorageClass` and `Delete` actions in the correct chronological order.
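
The two rules can be written out as a lifecycle configuration. The sketch below builds the JSON with Python; the `rule`, `action`, and `condition` layout follows the Cloud Storage lifecycle configuration format as I understand it, so verify it against the current documentation before applying it to a bucket.

```python
import json

lifecycle_config = {
    "rule": [
        {
            # Move objects to Nearline once they are 7 days old.
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 7},
        },
        {
            # Delete objects once they are 30 days old.
            "action": {"type": "Delete"},
            "condition": {"age": 30},
        },
    ]
}

print(json.dumps(lifecycle_config, indent=2))
```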

Exam trap

Google Cloud often tests the order of lifecycle actions: candidates mistakenly think deletion should come before archiving, but the correct sequence is to archive first (to reduce cost) and delete later, as objects cannot be archived after deletion.

How to eliminate wrong answers

Option A is wrong because it only sets the storage class to Nearline after 30 days, missing the deletion requirement entirely and incorrectly archiving after 30 days instead of 7. Option B is wrong because it only deletes objects after 7 days, ignoring the archive-to-Nearline step and deleting data too early. Option D is wrong because it reverses the order: it deletes objects after 7 days (before they can be archived) and then attempts to set storage class to Nearline after 30 days, which is impossible since the objects are already deleted.

143
MCQeasy

A company uses Dataflow to process streaming data from Pub/Sub. They notice increased processing latency. What is the most likely cause?

A.Insufficient workers
B.Pub/Sub subscription issue
C.Too many shards
D.Wrong machine type
AnswerA

Insufficient workers create backpressure and increased latency as the pipeline cannot keep up with throughput.

Why this answer

In Dataflow, processing latency increases most commonly due to insufficient workers, as the streaming pipeline cannot keep up with the incoming data rate when the number of Compute Engine instances is too low. This causes backpressure from Pub/Sub, leading to growing unacknowledged messages and higher end-to-end latency. Autoscaling may be delayed or limited by max worker count settings, making manual or configuration-based worker scaling the primary corrective action.

Exam trap

Google Cloud often tests the misconception that Pub/Sub subscription issues (like ack deadline) are the primary cause of latency, but the trap here is that latency in Dataflow is almost always a worker scaling problem, not a Pub/Sub configuration issue.

How to eliminate wrong answers

Option B is wrong because a Pub/Sub subscription issue (e.g., expired pull request or misconfigured ack deadline) would cause message delivery failures or duplicates, not a gradual increase in processing latency across the pipeline. Option C is wrong because too many shards (i.e., excessive parallelism) can cause overhead but typically leads to underutilization or increased cost, not increased latency; latency from too many shards is rare and usually secondary to worker count. Option D is wrong because the wrong machine type (e.g., low CPU or memory) could degrade per-worker performance, but the most likely and direct cause of increased latency in a streaming Dataflow job is insufficient worker count, not machine type, as Dataflow’s autoscaling primarily adjusts worker count rather than machine type.

144
MCQhard

A healthcare company processes patient data using a Dataflow pipeline that reads from Cloud Storage, transforms data, and writes to BigQuery. They need to ensure that the processing is idempotent to handle failures and retries without duplicating records. The data arrives in daily batches and may be re-delivered if earlier processing failed. What approach should they take to guarantee exactly-once processing in BigQuery?

A.Use BigQuery's streaming inserts with InsertId to deduplicate
B.Ingest data via Pub/Sub and use a Dataflow pipeline with exactly-once processing
C.Use Dataflow's built-in exactly-once semantics and write to BigQuery via load jobs
D.Write data to a staging BigQuery table, then use a MERGE statement to upsert into the final table
AnswerD

MERGE ensures idempotency by matching on unique keys.

Why this answer

Option D is correct because a BigQuery load job that appends data is not idempotent across re-deliveries: loading the same batch again appends the same rows again. By writing to a staging table first and then using a MERGE statement with a WHEN NOT MATCHED THEN INSERT clause (and, if needed, WHEN MATCHED THEN UPDATE) to upsert into the final table, you deduplicate on a unique key. This gives effectively exactly-once results even when the same batch is re-delivered, because MERGE only inserts rows whose key does not already exist in the target table.
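A minimal sketch of the staging-plus-upsert pattern follows. It uses Python's built-in sqlite3 as a local stand-in for BigQuery so that it runs anywhere; the table and column names (`staging`, `final_table`, `record_id`, `payload`) and the `clinic` dataset are hypothetical. The BigQuery MERGE text is included for reference only and is not executed. The sketch assumes keys are unique within a single batch.

```python
import sqlite3

# Roughly what the BigQuery statement could look like (hypothetical names, not executed here).
BIGQUERY_MERGE = """
MERGE `clinic.patients_final` AS t
USING `clinic.patients_staging` AS s
ON t.record_id = s.record_id
WHEN MATCHED THEN UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN INSERT (record_id, payload) VALUES (s.record_id, s.payload)
"""


def make_db():
    """Create an in-memory database with a staging table and a keyed final table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (record_id TEXT, payload TEXT)")
    conn.execute("CREATE TABLE final_table (record_id TEXT PRIMARY KEY, payload TEXT)")
    return conn


def load_batch(conn, rows):
    """Stage a batch, then upsert it into final_table keyed on record_id."""
    conn.execute("DELETE FROM staging")
    conn.executemany("INSERT INTO staging (record_id, payload) VALUES (?, ?)", rows)
    # SQLite stand-in for MERGE: update rows that match, then insert rows that do not.
    conn.execute(
        """
        UPDATE final_table
        SET payload = (SELECT s.payload FROM staging s
                       WHERE s.record_id = final_table.record_id)
        WHERE record_id IN (SELECT record_id FROM staging)
        """
    )
    conn.execute(
        """
        INSERT INTO final_table (record_id, payload)
        SELECT s.record_id, s.payload FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM final_table f WHERE f.record_id = s.record_id)
        """
    )
    conn.commit()


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM final_table").fetchone()[0]


conn = make_db()
batch = [("p1", "visit-a"), ("p2", "visit-b")]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated re-delivery of the same daily batch
print(row_count(conn))   # 2: the retry did not duplicate any rows
```

Because the upsert is keyed, running the same batch any number of times leaves the final table in the same state.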

Exam trap

The trap here is that candidates often assume Dataflow's exactly-once semantics automatically extend to the sink (BigQuery), but in reality, BigQuery load jobs are not idempotent, so you must implement a deduplication strategy like staging + MERGE to guarantee exactly-once processing.

How to eliminate wrong answers

Option A is wrong because BigQuery streaming inserts with InsertId provide best-effort deduplication within the streaming buffer, but duplicates can still occur if the InsertId is reused after the deduplication window (typically a few minutes) or if the insert fails and is retried with a different InsertId. Option B is wrong because Pub/Sub with Dataflow's exactly-once processing ensures that each message is processed exactly once within the pipeline, but it does not guarantee idempotent writes to BigQuery; if the pipeline fails after writing to BigQuery but before acknowledging the message, a retry could cause duplicate rows. Option C is wrong because Dataflow's built-in exactly-once semantics apply to the pipeline's internal state and shuffle operations, but BigQuery load jobs are not idempotent; if a load job is retried (e.g., due to a worker failure), the same data can be loaded multiple times, resulting in duplicates.

145
MCQmedium

What is the most likely cause of data duplication after this command?

A.The Pub/Sub source is not exactly-once.
B.The pipeline uses at-least-once semantics.
C.The snapshot was taken before scaling.
D.The BigQuery sink is not idempotent.
AnswerD

If the sink is not idempotent, duplicate data can be written when workers are re-added or when job state is replayed.

Why this answer

Option D is correct because a BigQuery sink is not idempotent unless the pipeline is built to make it so; if the pipeline retries writes (e.g., due to worker failures or checkpoint issues), duplicate rows can be inserted into the BigQuery table. Avoiding this requires deduplication logic such as a staging table with a MERGE operation, since the sink's built-in deduplication for streaming inserts is only best effort. The command likely triggered a retry scenario, and the non-idempotent sink caused the duplication.

Exam trap

Google Cloud often tests the misconception that at-least-once semantics alone cause duplication, but the real trap is that the sink's idempotency (or lack thereof) is the decisive factor when retries occur.

How to eliminate wrong answers

Option A is wrong because Dataflow's Pub/Sub source deduplicates redelivered messages when the pipeline runs in exactly-once streaming mode, and the question does not indicate that the source is the cause. Option B is wrong because at-least-once semantics are a pipeline processing mode, not a direct cause of data duplication; they can lead to duplicates if the sink is not idempotent, but the question asks for the 'most likely cause' and the sink's idempotency is the immediate factor. Option C is wrong because taking a snapshot before scaling does not inherently cause data duplication; snapshots preserve pipeline state for resumption, and scaling only affects parallelism, not data integrity.

146
MCQhard

A financial services company needs to process high-frequency trading data with strict ordering guarantees. They use Pub/Sub with ordering keys and Dataflow. The pipeline occasionally produces out-of-order results. What is the most likely cause?

A.Dataflow does not preserve order when using multiple workers
B.Dataflow uses at-least-once processing, which can reorder events
C.Pub/Sub does not guarantee message ordering
D.The window trigger allows late data to be included after the main output
AnswerD

Late data can be emitted in a different pane, causing apparent out-of-order results.

Why this answer

Option D is correct because a window configured with allowed lateness and a late-firing trigger emits late data after the main (on-time) pane. Even with Pub/Sub ordering keys, events that arrive after the watermark has passed the end of their window (e.g., due to network delays or retries) are assigned to the correct window but emitted in a separate pane, so the final output appears out of order relative to event time. This is expected behavior when combining event-time windows with late data handling.
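The pane behavior can be mimicked with a small pure-Python simulation. This is a deliberately simplified model, not the Apache Beam API: the function name, the fixed-window arithmetic, and the sample arrivals are all illustrative assumptions. Each arrival is a pair of the element's event time and the watermark at the moment it arrives.

```python
from collections import defaultdict


def assign_panes(arrivals, window_size, allowed_lateness):
    """Sort arrivals into on-time panes, late panes, or the dropped list.

    arrivals: iterable of (event_time, watermark_at_arrival) pairs.
    Returns (panes, dropped), where panes maps window start to
    {"on_time": [...], "late": [...]}.
    """
    panes = defaultdict(lambda: {"on_time": [], "late": []})
    dropped = []
    for event_time, watermark in arrivals:
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if watermark < window_end:
            # Window still open: element goes into the main pane.
            panes[window_start]["on_time"].append(event_time)
        elif watermark < window_end + allowed_lateness:
            # Window already fired: element is emitted later, in a late pane.
            panes[window_start]["late"].append(event_time)
        else:
            # Past allowed lateness: element is discarded.
            dropped.append(event_time)
    return dict(panes), dropped


arrivals = [(1, 0), (3, 2), (2, 12), (4, 20)]
panes, dropped = assign_panes(arrivals, window_size=10, allowed_lateness=5)
print(panes)    # {0: {'on_time': [1, 3], 'late': [2]}}
print(dropped)  # [4]
```

A downstream consumer sees events 1 and 3 in the main pane and event 2 only afterwards, which is the apparent reordering the question describes.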

Exam trap

Google Cloud often tests the misconception that Pub/Sub's lack of ordering guarantees is the primary cause of out-of-order results in Dataflow, when in fact the issue is typically the window trigger and late data handling within Dataflow itself.

How to eliminate wrong answers

Option A is wrong because the number of workers is not what produces the symptom described: all elements for a given key and window are grouped and processed together, and the out-of-order output here comes from late panes rather than from parallelism. Option B is wrong because at-least-once processing guarantees delivery but does not inherently reorder events; reordering is caused by late data or window triggers, not by the processing semantics alone. Option C is wrong because Pub/Sub does guarantee message ordering when messages are published with the same ordering key in the same region and message ordering is enabled on the subscription; the question states they use ordering keys, so Pub/Sub ordering is not the root cause.

147
Multi-Selecteasy

Which TWO roles are required to allow a service account to run a Dataflow job and write results to BigQuery? (Choose two.)

Select 2 answers
A.roles/pubsub.subscriber
B.roles/dataflow.worker
C.roles/bigquery.dataEditor
D.roles/storage.objectAdmin
E.roles/dataflow.admin
AnswersB, C

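roles/dataflow.worker is required for the worker service account to run the job; roles/bigquery.dataEditor is required for it to write results to BigQuery tables.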

Why this answer

Option B is correct because the roles/dataflow.worker role grants the service account the necessary permissions to execute Dataflow worker tasks, such as reading from sources and writing to sinks. Option C is correct because roles/bigquery.dataEditor allows the service account to insert rows into BigQuery tables, which is required for the Dataflow job to write results.
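As an illustration of the two grants, the sketch below builds bindings in the role/members shape used by IAM policies. It only constructs data and does not call any Google Cloud API; the service account email and the helper name are hypothetical.

```python
# The two roles named in the answer.
REQUIRED_ROLES = ["roles/dataflow.worker", "roles/bigquery.dataEditor"]


def bindings_for(service_account_email, roles):
    """Return one IAM-style binding per role for the given service account."""
    member = f"serviceAccount:{service_account_email}"
    return [{"role": role, "members": [member]} for role in roles]


# Hypothetical service account used only for illustration.
sa = "pipeline-runner@example-project.iam.gserviceaccount.com"
for binding in bindings_for(sa, REQUIRED_ROLES):
    print(binding["role"], "->", binding["members"][0])
```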

Exam trap

The trap here is that candidates often select roles/dataflow.admin thinking it is needed to run a job, but the exam tests that the worker role is sufficient for execution, while admin is for management tasks like creating or updating jobs.

148
MCQeasy

A company wants to ingest IoT sensor data from thousands of devices into BigQuery for near-real-time analytics. The data volume is approximately 10 GB per hour. Which combination of Google Cloud services should they use for a cost-effective and scalable solution?

A.Pub/Sub → Dataflow → BigQuery
B.Cloud IoT Core → Cloud Functions → BigQuery
C.Cloud IoT Core → Cloud Dataproc → BigQuery
D.Cloud IoT Core → Cloud Storage → BigQuery load jobs
AnswerA

Pub/Sub ingests events, Dataflow streams them to BigQuery, scaling automatically.

Why this answer

Pub/Sub provides a scalable, managed ingestion layer for high-volume IoT data, decoupling producers from consumers. Dataflow (Apache Beam) processes the streaming data in near-real-time with exactly-once semantics and auto-scaling, writing directly to BigQuery for analytics. This combination minimizes operational overhead and cost by avoiding intermediate storage and manual scaling.
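To put the stated volume in perspective, the arithmetic below converts 10 GB per hour into a sustained rate. The 1 KB average message size is an assumption made for illustration and does not come from the question; GB is taken as 10^9 bytes.

```python
def sustained_rate(gb_per_hour, avg_message_bytes):
    """Return (bytes_per_second, messages_per_second) for a steady hourly volume."""
    bytes_per_second = gb_per_hour * 1_000_000_000 / 3600
    messages_per_second = bytes_per_second / avg_message_bytes
    return bytes_per_second, messages_per_second


bytes_per_s, msgs_per_s = sustained_rate(10, avg_message_bytes=1000)
print(f"{bytes_per_s / 1e6:.2f} MB/s")      # 2.78 MB/s
print(f"{msgs_per_s:.0f} messages/second")  # 2778 messages/second
```

Under that assumption the pipeline must absorb a few thousand small messages every second around the clock, which is the kind of continuous load a streaming pipeline is designed for.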

Exam trap

Google Cloud often tests the misconception that Cloud Functions can handle streaming workloads, but its per-invocation execution model and timeout limit make it unsuitable for sustained high-throughput ingestion, whereas Pub/Sub + Dataflow is the standard pattern for near-real-time analytics.

How to eliminate wrong answers

Option B is wrong because Cloud Functions runs short-lived invocations with an execution timeout (9 minutes for 1st gen functions) and is not designed as a sustained high-throughput streaming pipeline (10 GB/hour); at this volume it adds cost and operational risk compared with Dataflow. Option C is wrong because Cloud Dataproc (managed Spark/Hadoop) is primarily a cluster-based batch platform; it adds latency and cluster management overhead compared to Dataflow's native streaming. Option D is wrong because BigQuery load jobs from Cloud Storage are batch-oriented, introducing minutes-to-hours latency and requiring separate orchestration, which fails the near-real-time requirement.

149
Multi-Selecthard

Which THREE considerations are important when designing a data lake on Google Cloud using Cloud Storage?

Select 3 answers
A.Use Cloud Storage's eventual consistency model for cost savings.
B.Define a schema when writing data to enforce data quality.
C.Choose the appropriate storage class based on access patterns.
D.Enable encryption at rest using CMEK or CSEK.
E.Use object lifecycle management to transition data to colder storage classes.
AnswersC, D, E

Storage class impacts storage cost, retrieval fees, and minimum storage duration.

Why this answer

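Option C is correct because selecting the appropriate storage class (e.g., Standard, Nearline, Coldline, Archive) based on data access patterns directly optimizes cost in Cloud Storage. For a data lake, where data may be accessed frequently at first and rarely later, matching the storage class to the access pattern avoids paying Standard rates for infrequently accessed data. Option D is correct because, although Cloud Storage always encrypts data at rest, CMEK or CSEK gives you control over the encryption keys, which matters for sensitive or regulated data. Option E is correct because object lifecycle management automates the transition to colder storage classes as data ages, so the cost savings do not depend on manual housekeeping.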
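A lifecycle policy of the kind described in options C and E might be assembled as follows. This is a sketch: the age thresholds and the helper name are hypothetical, and the dictionary mirrors the `lifecycle.rule` structure of the bucket resource, so check the exact file shape your CLI or client library expects before applying it.

```python
import json


def lifecycle_config(nearline_after_days, coldline_after_days, delete_after_days):
    """Build a lifecycle configuration that cools data down and eventually deletes it."""
    rules = [
        {
            # Move objects that are still Standard to Nearline once they age.
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": nearline_after_days, "matchesStorageClass": ["STANDARD"]},
        },
        {
            # Move Nearline objects to Coldline after a longer period.
            "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
            "condition": {"age": coldline_after_days, "matchesStorageClass": ["NEARLINE"]},
        },
        {
            # Delete objects once they exceed the retention period.
            "action": {"type": "Delete"},
            "condition": {"age": delete_after_days},
        },
    ]
    return {"lifecycle": {"rule": rules}}


# Hypothetical thresholds: 30 days to Nearline, 90 to Coldline, delete after 365.
config = lifecycle_config(30, 90, 365)
print(json.dumps(config, indent=2))
```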

Exam trap

Google Cloud often tests the misconception that Cloud Storage uses eventual consistency. Cloud Storage is strongly consistent for object reads after writes, updates, and deletes, and for object listing, so option A is a trap: there is no eventual consistency model to opt into for cost savings.

150
MCQeasy

A company wants to analyze server logs stored in Cloud Storage using SQL. They need to get results in seconds without setting up any clusters. Which service should they use?

A.Cloud Dataflow
B.Cloud Logging
C.BigQuery
D.Cloud Dataproc
AnswerC

BigQuery supports federated queries on Cloud Storage using SQL, providing fast results without clusters.

Why this answer

BigQuery is a serverless, highly scalable data warehouse. It lets you analyze petabytes of data using standard SQL without provisioning or managing clusters, which makes it a good fit for server logs stored in Cloud Storage: you can query them in place through external tables, or load them into BigQuery for faster interactive queries that typically return in seconds.

Exam trap

Google Cloud often tests the distinction between serverless SQL analytics (BigQuery) and managed compute frameworks (Dataflow, Dataproc), where candidates mistakenly choose Dataflow or Dataproc for SQL-like analysis without recognizing the need for cluster management or pipeline setup.

How to eliminate wrong answers

Option A is wrong because Cloud Dataflow is a unified stream and batch data processing service that requires setting up and managing pipelines (though serverless, it is not primarily for ad-hoc SQL queries on stored logs). Option B is wrong because Cloud Logging is a real-time log management and analysis service for monitoring and debugging, not designed for complex SQL analytics on large historical log datasets stored in Cloud Storage. Option D is wrong because Cloud Dataproc is a managed Spark and Hadoop service that requires provisioning clusters (even if ephemeral) and is not serverless SQL querying.
