CCNA Designing data processing systems Questions

75 of 159 questions · Page 1/3 · Designing data processing systems · Answers revealed

1
MCQhard

A company processes IoT sensor data in near real-time. They ingest data via Cloud Pub/Sub, then a Dataflow streaming pipeline writes to Bigtable for low-latency queries. Recently, they observed increased Pub/Sub message backlog during traffic spikes. What is the most effective scaling strategy?

A.Increase Pub/Sub subscription throughput by increasing the number of partitions
B.Increase Dataflow worker count and adjust autoscaling configuration
C.Use a Cloud Scheduler to throttle Pub/Sub publishing
D.Add a Cloud Function to pre-process messages before they are consumed by Dataflow
AnswerB

Dataflow autoscaling can handle backlogs if enough workers are provisioned; increasing the max number of workers allows the pipeline to catch up during spikes.

Why this answer

The correct answer is B because the increased Pub/Sub backlog during traffic spikes indicates that the Dataflow pipeline is unable to consume messages as fast as they are being published. Increasing the Dataflow worker count and adjusting autoscaling configuration allows the pipeline to scale horizontally, processing more messages per second and reducing the backlog. Pub/Sub itself is designed to handle high throughput, so the bottleneck is the consumer (Dataflow), not the ingestion layer.

Exam trap

The trap here is that candidates mistakenly think Pub/Sub's throughput is limited by partitions (like Kafka) or that throttling the publisher is a valid scaling strategy, when in fact the bottleneck is the streaming pipeline's processing capacity, which must be scaled horizontally.

How to eliminate wrong answers

Option A is wrong because Pub/Sub does not use partitions like Kafka; increasing partitions is not a valid concept for Pub/Sub subscriptions, and throughput is managed by the subscriber's ability to pull messages, not by partitioning. Option C is wrong because throttling Pub/Sub publishing with Cloud Scheduler would reduce the incoming data rate, but this is counterproductive for near real-time processing and does not address the root cause of insufficient consumer capacity. Option D is wrong because adding a Cloud Function to pre-process messages would introduce an additional processing step that could further increase latency and does not directly solve the Dataflow pipeline's inability to keep up with the message volume.

2
MCQeasy

A company is designing a streaming data pipeline to process real-time clickstream events. They need to aggregate events by session window with a 5-minute gap and enable exactly-once processing semantics. Which Google Cloud service should they use?

A.Cloud Pub/Sub with Cloud Functions
B.Cloud Dataflow with Apache Beam
C.Cloud Dataproc with Spark Streaming
D.Cloud Bigtable with Dataflow templates
AnswerB

Dataflow with Beam natively supports session windows and exactly-once processing via its processing guarantees.

Why this answer

Cloud Dataflow with Apache Beam is the correct choice because it provides native support for session windows with a 5-minute gap duration and exactly-once processing semantics via its sink and source integrations. Dataflow's Beam SDK allows you to define session windows using `Window.into(Sessions.withGapDuration(Duration.standardMinutes(5)))`, and its checkpointing and idempotent writes ensure exactly-once delivery even in failure scenarios.
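
To make the session-window behaviour concrete, here is a minimal plain-Python sketch of how events are grouped when the inactivity gap is 5 minutes. It is not Beam code; the `sessionize` helper and the sample click times are assumptions made up for illustration.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)  # session gap taken from the question


def sessionize(timestamps, gap=GAP):
    """Group event timestamps into sessions split by `gap` or more of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if this event arrives within the gap
        # of the previous event; otherwise start a new session.
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical clicks at minutes 0, 2, 4, 12, 13 and 30 after noon.
base = datetime(2024, 1, 1, 12, 0)
clicks = [base + timedelta(minutes=m) for m in (0, 2, 4, 12, 13, 30)]
sessions = sessionize(clicks)
print([len(s) for s in sessions])  # [3, 2, 1]
```

A stateless function sees one event at a time and cannot do this grouping without external state, which is why a stateful engine is needed.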

Exam trap

Google Cloud often tests the distinction between stateless serverless services (like Cloud Functions) and stateful stream processing engines (like Dataflow), leading candidates to incorrectly choose Cloud Pub/Sub with Cloud Functions because they overlook the need for session window state management and exactly-once semantics.

How to eliminate wrong answers

Option A is wrong because Cloud Pub/Sub with Cloud Functions does not support session windowing natively; Cloud Functions are stateless and cannot maintain session state across invocations, and Pub/Sub delivers at-least-once by default, so duplicates would have to be handled in the function code. Option C is wrong because Cloud Dataproc with Spark Streaming can implement session windows but requires manual state management and does not provide built-in exactly-once semantics; Spark Streaming's checkpointing can lead to duplicate outputs in failure recovery. Option D is wrong because Cloud Bigtable with Dataflow templates is a storage and template combination, not a processing service; Dataflow templates can be used for streaming but the question asks for the service to use, and Bigtable is a NoSQL database, not a stream processing engine.

3
MCQmedium

A data pipeline processes streaming data from Pub/Sub to BigQuery. The pipeline needs to handle late-arriving data that is up to 1 hour late. Which Dataflow feature should be used?

A.Global windows with watermark
B.Session windows
C.Sliding windows with allowed lateness
D.Fixed windows with allowed lateness
AnswerD

Fixed windows with allowed lateness (set to 1 hour) ensure late events are processed in the correct window.

Why this answer

Fixed windows with allowed lateness are the correct choice because the pipeline needs to handle late-arriving data up to 1 hour late while processing data in fixed time intervals (e.g., 1-hour windows). The `allowedLateness` parameter in Dataflow (Apache Beam) allows late data to be included in the appropriate fixed window for up to the specified duration after the watermark passes the window end. This ensures that late Pub/Sub messages are assigned to their original window before the results are written to BigQuery.
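
The rule described above can be sketched in plain Python: an element is still accepted as long as the watermark has not passed the end of its window plus the allowed lateness. This is only an illustration of the rule, not Beam code; the 1-hour window size and the helper names are assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)            # fixed window size (illustrative assumption)
ALLOWED_LATENESS = timedelta(hours=1)  # lateness bound from the question
EPOCH = datetime(1970, 1, 1)


def window_start(event_time, size=WINDOW):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time - EPOCH) % size


def is_accepted(event_time, watermark):
    """Keep the element until the watermark passes window end + allowed lateness."""
    window_end = window_start(event_time) + WINDOW
    return watermark < window_end + ALLOWED_LATENESS


event = datetime(2024, 1, 1, 10, 15)  # belongs to the 10:00-11:00 window
print(is_accepted(event, watermark=datetime(2024, 1, 1, 11, 40)))  # True
print(is_accepted(event, watermark=datetime(2024, 1, 1, 12, 5)))   # False
```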

Exam trap

Google Cloud often tests the distinction between window types and lateness handling, and the trap here is that candidates confuse 'allowed lateness' as a feature exclusive to sliding windows or global windows, when in fact it is a parameter that can be applied to fixed windows to handle late data within a bounded delay.

How to eliminate wrong answers

Option A is wrong because global windows with watermark process all data in a single unbounded window and rely on watermark to trigger output, but they cannot segment data into fixed time intervals for BigQuery loading, and late data handling is not as precise for per-window aggregation. Option B is wrong because session windows group events based on gaps of inactivity, which is not suitable for processing data in fixed time intervals as required by the pipeline. Option C is wrong because sliding windows produce overlapping windows that emit multiple outputs per element, which is unnecessary and inefficient for a simple fixed-interval pipeline. Allowed lateness can be configured on any windowing strategy, so it is the overlapping windows, not the lateness setting, that make this option a poor fit.

4
Drag & Dropmedium

Drag and drop the steps to configure a VPC network with private Google access for on-premises connectivity using Cloud VPN into the correct order.

Why this order

Private Google Access allows on-premises hosts to reach Google APIs via VPN without public IPs.

5
Multi-Selecteasy

A data pipeline uses Cloud Pub/Sub to ingest events, then a Cloud Dataflow job writes to BigQuery. The Dataflow job is failing with 'deadline exceeded' errors. Which TWO actions can resolve this? (Choose TWO.)

Select 2 answers
A.Increase the number of Dataflow workers.
B.Switch to BigQuery Storage Write API.
C.Decrease the batch size for writes to BigQuery.
D.Set the --maxStreamingRowsToBundle parameter to a higher value.
E.Change the windowing from fixed to global.
AnswersA, D

Reduces load per worker.

Why this answer

Increasing the number of Dataflow workers (Option A) is correct because 'deadline exceeded' errors typically indicate that the pipeline is falling behind on processing due to insufficient parallelism. By adding more workers, the workload is distributed across more virtual machines, reducing the per-worker load and allowing the pipeline to keep up with the incoming Pub/Sub stream, thereby avoiding timeouts when writing to BigQuery.

Exam trap

Google Cloud often tests the misconception that 'deadline exceeded' errors are always caused by slow writes to the sink, leading candidates to choose options like decreasing batch size or switching write APIs, when the real issue is insufficient parallelism in the streaming pipeline.

6
MCQhard

A financial services firm processes sensitive transactions using Cloud Dataflow. The pipeline reads from Pub/Sub, performs stateful processing (e.g., fraud detection), and writes to Cloud Spanner. Compliance requires exactly-once processing semantics. Which configuration ensures exactly-once processing?

A.Configure Pub/Sub to use exactly-once delivery mode.
B.Use Pub/Sub with at-least-once delivery and Dataflow with at-least-once processing mode.
C.Set Dataflow pipeline to exactly-once mode and design Spanner writes to be idempotent.
D.Enable Dataflow's streaming engine and use Spanner's built-in retry logic.
AnswerC

Exactly-once mode with idempotent sinks prevents duplicates.

Why this answer

Option C is correct because exactly-once processing in a Dataflow pipeline requires the pipeline itself to be set to exactly-once mode (which uses consistent snapshots and transactional sinks) and the output writes to Spanner to be idempotent. This combination ensures that even if a record is reprocessed due to failures, the final state in Spanner remains consistent, satisfying compliance requirements.

Exam trap

The trap here is that candidates often assume Pub/Sub's exactly-once delivery alone is sufficient, but they overlook that Dataflow's internal processing and output writes must also be idempotent or transactional to achieve end-to-end exactly-once semantics.

How to eliminate wrong answers

Option A is wrong because Pub/Sub's exactly-once delivery mode only guarantees that a message is delivered exactly once to the subscriber, but it does not prevent duplicate processing within the Dataflow pipeline due to retries or checkpoint recovery. Option B is wrong because using at-least-once delivery in Pub/Sub combined with at-least-once processing in Dataflow inherently allows duplicates, violating exactly-once semantics. Option D is wrong because enabling Dataflow's streaming engine improves scalability and latency but does not enforce exactly-once processing, and Spanner's built-in retry logic only handles transient failures, not duplicate writes from reprocessing.

7
MCQmedium

A company uses Cloud Dataproc to run nightly Spark ETL jobs that process about 500 GB of data each night. The jobs currently take 4 hours to complete. The company wants to reduce the runtime to under 2 hours to meet a new SLA. The cluster is configured with 10 worker nodes (n1-standard-4) and 1 master node (n1-standard-4). The jobs are CPU-bound and use only default settings. The cluster is deleted after each job and recreated. The data is stored in Cloud Storage. The company is open to increasing cost but wants the most cost-effective solution to meet the SLA. Which approach should they take?

A.Use a regional Cloud Storage bucket to improve read throughput.
B.Replace worker nodes with n1-highmem-16 instances to increase memory.
C.Increase the number of worker nodes to 20 and use preemptible VMs for half of them.
D.Change machine type to n2-standard-8 for all nodes.
AnswerC

Doubles processing power cost-effectively.

Why this answer

Option C is correct because adding more worker nodes (from 10 to 20) directly increases parallelism for CPU-bound Spark jobs, and using preemptible VMs for half of them reduces cost while still meeting the SLA. Since the job is CPU-bound and uses default settings, scaling horizontally with a mix of standard and preemptible VMs is the most cost-effective way to halve runtime, as Spark can efficiently distribute the workload across more cores.

Exam trap

The trap here is that candidates may assume CPU-bound jobs require faster CPUs (Option D) or more memory (Option B), but horizontal scaling with preemptible VMs is the most cost-effective way to increase parallelism in Cloud Dataproc.

How to eliminate wrong answers

Option A is wrong because using a regional Cloud Storage bucket improves data durability and availability but does not significantly increase read throughput for a single job; the bottleneck is CPU, not I/O. Option B is wrong because the job is CPU-bound, not memory-bound; increasing memory with n1-highmem-16 instances does not address the CPU bottleneck and adds unnecessary cost. Option D is wrong because changing to n2-standard-8 (8 vCPUs per node) raises worker capacity from 40 to 80 vCPUs, the same total as 20 n1-standard-4 workers, but every node is billed at the full on-demand rate, so it is less cost-effective than Option C, where half of the workers are discounted preemptible VMs.
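
The vCPU arithmetic behind this comparison can be checked with a few lines of Python. Only worker vCPU counts are compared; no prices are assumed, and the `total_vcpus` helper is purely illustrative.

```python
def total_vcpus(workers, vcpus_per_worker):
    """Total worker vCPUs available to the Spark job."""
    return workers * vcpus_per_worker


current = total_vcpus(10, 4)   # 10 x n1-standard-4 (today's cluster)
option_c = total_vcpus(20, 4)  # 20 x n1-standard-4, half of them preemptible
option_d = total_vcpus(10, 8)  # 10 x n2-standard-8

print(current, option_c, option_d)  # 40 80 80
```

Options C and D reach the same core count, so the deciding factor is price per vCPU, where preemptible workers have the advantage.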

8
MCQeasy

A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?

A.Cloud Pub/Sub, Cloud Dataproc, Cloud Storage
B.Cloud Pub/Sub, Cloud Dataflow, Cloud Spanner
C.Cloud Pub/Sub, Cloud Dataflow, BigQuery
D.Cloud Pub/Sub, Cloud Dataflow, Cloud Storage
AnswerC

Best for real-time SQL analytics.

Why this answer

Option C is correct because Cloud Pub/Sub ingests real-time clickstream data, Cloud Dataflow processes it with low latency, and BigQuery provides a serverless, SQL-based data warehouse that is cost-effective for moderate data volumes due to its pay-per-query pricing and automatic scaling. This combination avoids the overhead of managing clusters (Dataproc) or expensive storage (Cloud Spanner) while directly supporting SQL analytics.

Exam trap

Google Cloud often tests the misconception that Cloud Storage is a suitable destination for analytics-ready data, but it lacks native SQL querying, forcing candidates to overlook BigQuery's direct integration with Dataflow for real-time analytics.

How to eliminate wrong answers

Option A is wrong because Cloud Dataproc requires a running cluster (even with preemptible VMs) and is optimized for batch processing, not real-time streaming, and Cloud Storage is not a SQL-queryable data warehouse, forcing additional ETL steps. Option B is wrong because Cloud Spanner is a globally distributed, strongly consistent relational database designed for transactional workloads, not cost-effective for analytics at moderate data volumes; its per-node pricing makes it expensive compared to BigQuery's serverless model. Option D is wrong because Cloud Storage is an object store, not a data warehouse; storing processed data there would require additional services (e.g., BigQuery external tables or Dataproc) to run SQL analytics, increasing complexity and cost.

9
MCQmedium

A financial company processes transactions in real-time and requires exactly-once processing semantics. They also need to reprocess historical data for backtesting. Which Google Cloud service should they use?

A.Cloud Pub/Sub
B.Cloud Functions
C.Cloud Dataproc
D.Cloud Dataflow
AnswerD

Supports exactly-once and batch/streaming.

Why this answer

Cloud Dataflow (D) is correct because it provides exactly-once processing semantics through checkpointing and record deduplication in its streaming engine (which descends from Google's MillWheel system) and supports both real-time streaming and batch processing for historical backtesting under a unified programming model. This allows the company to reprocess historical data using the same pipeline code, ensuring consistency across real-time and batch modes.

Exam trap

Google Cloud often tests the misconception that Cloud Pub/Sub (A) provides exactly-once processing on its own; by default it offers at-least-once delivery, and candidates also overlook Dataflow's unified batch/streaming model for reprocessing historical data.

How to eliminate wrong answers

Option A is wrong because Cloud Pub/Sub is a messaging service that offers at-least-once delivery by default, not exactly-once processing, and it lacks built-in capabilities for reprocessing historical data in a unified batch/streaming manner. Option B is wrong because Cloud Functions is an event-driven serverless compute service that does not provide exactly-once processing guarantees or native support for reprocessing large historical datasets; it is designed for lightweight, stateless functions. Option C is wrong because Cloud Dataproc is a managed Hadoop/Spark service that does not natively guarantee exactly-once processing semantics and requires manual handling of state and reprocessing logic, unlike Dataflow's automatic checkpointing.

10
MCQmedium

What is the most likely cause of this error?

A.The BigQuery table is not partitioned
B.The Dataflow worker does not have the correct time zone
C.The pipeline is using a fixed window but the data is out of order
D.The schema of the BigQuery table expects a TIMESTAMP but the pipeline is sending a STRING
AnswerD

The error clearly shows an attempt to convert a string to a timestamp, indicating a schema mismatch.

Why this answer

Option D is correct because the error message indicates a type mismatch: BigQuery expects a TIMESTAMP column, but the pipeline is sending a STRING. Dataflow's BigQuery sink performs automatic schema validation, and if the source data type (STRING) does not match the target column type (TIMESTAMP), the write operation fails with a mismatch error. This is a common issue when pipeline code or source data formats timestamps as strings without explicit conversion.
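
As a rough illustration of the explicit conversion mentioned above, the sketch below parses a string field into a timezone-aware datetime before the row is handed to the sink. The field names, the input format, and the assumption that the source values are in UTC are all hypothetical.

```python
from datetime import datetime, timezone


def to_bq_timestamp(raw):
    """Parse a string like '2024-03-01 08:30:00' into a UTC-aware datetime."""
    parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    # Assumption: the source system emits UTC wall-clock times.
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical row as it arrives from the source, with the timestamp as a string.
row = {"event_id": "e1", "event_time": "2024-03-01 08:30:00"}
row["event_time"] = to_bq_timestamp(row["event_time"])
print(row["event_time"].isoformat())  # 2024-03-01T08:30:00+00:00
```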

Exam trap

Google Cloud often tests the distinction between schema type mismatches and data ordering or partitioning issues, so candidates may confuse a type error with a windowing or time zone problem.

How to eliminate wrong answers

Option A is wrong because a non-partitioned BigQuery table would not cause a type mismatch error; it would instead cause performance issues or quota errors on large writes. Option B is wrong because the Dataflow worker's time zone affects timestamp interpretation, not the data type of the field being written; the error is about schema type, not time zone conversion. Option C is wrong because out-of-order data with a fixed window causes late data handling or watermark issues, not a schema type mismatch; the error is specifically about the data type sent to BigQuery.

11
Multi-Selecthard

Which TWO statements about designing a data processing pipeline on Google Cloud are correct? (Choose 2.)

Select 2 answers
A.Pub/Sub guarantees message ordering across all subscribers globally.
B.Cloud Bigtable is ideal for data warehousing and SQL analytics.
C.Dataproc is the best choice for fully managed data warehousing and analytics.
D.Cloud Data Fusion allows you to build and manage data pipelines visually without writing code.
E.Dataflow supports both batch and streaming modes in a single pipeline model.
AnswersD, E

Cloud Data Fusion provides a visual UI for designing pipelines.

Why this answer

Cloud Data Fusion provides a visual, no-code interface for building and managing data pipelines, enabling users to design ETL/ELT workflows through a drag-and-drop UI. It abstracts the underlying complexity of Apache Spark and Cloud Dataproc, making it suitable for users who prefer a graphical approach over writing code.

Exam trap

Google Cloud often tests the distinction between fully managed services (like BigQuery for warehousing) and managed cluster services (like Dataproc), as well as the limitations of Pub/Sub ordering guarantees, to see if candidates confuse operational databases with analytical systems.

12
MCQeasy

A data engineer needs to design a batch pipeline that processes daily log files from Cloud Storage and writes aggregated results to BigQuery. Which service is most appropriate for this ETL job?

A.Cloud Pub/Sub with Cloud Functions
B.Cloud Composer
C.Cloud Data Fusion
D.Dataproc with PySpark
AnswerD

Dataproc handles large batch processing efficiently with Spark.

Why this answer

Dataproc with PySpark is the most appropriate choice because it provides a managed Spark/Hadoop environment that can efficiently process large daily log files stored in Cloud Storage using distributed computing. PySpark's native integration with BigQuery via the Spark BigQuery connector allows direct writing of aggregated results, making it ideal for batch ETL workloads that require complex transformations and high throughput.

Exam trap

The trap here is that candidates often confuse orchestration (Cloud Composer) with execution, or assume serverless options like Cloud Functions can handle heavy batch ETL, but the question specifically requires a service that performs the ETL processing, not just schedules or triggers it.

How to eliminate wrong answers

Option A is wrong because Cloud Pub/Sub with Cloud Functions is designed for event-driven, real-time streaming pipelines, not for batch processing of daily log files; Cloud Functions has a bounded execution timeout (9 minutes for 1st gen functions) and is not suited for heavy ETL jobs. Option B is wrong because Cloud Composer is a workflow orchestration tool (based on Apache Airflow) that schedules and monitors jobs, but it does not perform the actual data processing or transformation itself. Option C is wrong because Cloud Data Fusion is a visual data integration service for building pipelines, but it is more suited for low-code ETL and may lack the flexibility and performance of PySpark for large-scale batch log processing with custom transformations.

13
MCQeasy

A company needs to stream real-time user click events from a web application to BigQuery for analysis. Which Google Cloud architecture is most suitable?

A.App Engine -> Pub/Sub -> Dataflow -> BigQuery
B.Cloud Scheduler -> BigQuery
C.Compute Engine -> Cloud Storage -> BigQuery
D.Cloud Functions -> BigQuery
AnswerA

This architecture supports real-time streaming with decoupled components.

Why this answer

Option A is correct because it provides a fully managed, scalable, and decoupled architecture for ingesting real-time click events. Pub/Sub acts as a durable, asynchronous message buffer that can handle high-throughput streams, Dataflow (Apache Beam) processes the events in near real-time with exactly-once semantics, and BigQuery serves as the analytics warehouse. This pattern is the recommended Google Cloud approach for streaming analytics, as it decouples producers from consumers and supports auto-scaling.

Exam trap

The trap here is that candidates often choose Cloud Functions (Option D) thinking it is sufficient for real-time ingestion, but they overlook its execution timeout and lack of built-in streaming semantics, which makes it unsuitable for sustained high-throughput event pipelines.

How to eliminate wrong answers

Option B is wrong because Cloud Scheduler is a cron job service for triggering actions on a schedule, not a real-time event ingestion mechanism; it cannot stream continuous click events. Option C is wrong because Compute Engine and Cloud Storage are batch-oriented; writing events directly to Cloud Storage introduces latency and requires additional batch processing to load into BigQuery, making it unsuitable for real-time streaming. Option D is wrong because Cloud Functions has a bounded execution timeout (9 minutes for 1st gen functions) and is designed for short-lived, event-driven compute, not for continuous, high-throughput streaming; it would also require custom code to buffer and batch writes to BigQuery, losing the managed streaming capabilities of Dataflow.

14
MCQeasy

A data engineer runs this Dataflow template to load CSV files from Cloud Storage into BigQuery. The job fails with a 'File pattern not matching any files' error. What is the most likely cause?

A.The bucket name is incorrectly spelled
B.The CSV files are stored in a subdirectory that is not matched by the pattern
C.The template has a bug
D.The output table does not exist
AnswerB

The pattern `*.csv` in a prefix does not include files in nested subdirectories.

Why this answer

The error 'File pattern not matching any files' indicates that the file pattern specified in the Dataflow template does not resolve to any existing objects in Cloud Storage. If the CSV files are stored in a subdirectory (e.g., gs://bucket/subdir/*.csv) but the pattern only references the root (e.g., gs://bucket/*.csv), no files will be matched. This is the most likely cause because the pattern must explicitly include the subdirectory path.
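
The matching behaviour can be approximated in a few lines of Python. The sketch assumes glob semantics in which `*` stays within one path level and `**` crosses levels; the helper names and object paths are made up for illustration and this is not the template's actual matching code.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob where `*` stays in one path level and `**` crosses levels."""
    pieces = []
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            pieces.append(".*")       # any characters, including '/'
        elif part == "*":
            pieces.append("[^/]*")    # any characters except '/'
        else:
            pieces.append(re.escape(part))
    return re.compile("^" + "".join(pieces) + "$")


def matches(pattern, path):
    return glob_to_regex(pattern).match(path) is not None


# Hypothetical objects stored one level below the bucket root.
objects = ["gs://bucket/subdir/a.csv", "gs://bucket/subdir/b.csv"]
root_hits = [o for o in objects if matches("gs://bucket/*.csv", o)]
subdir_hits = [o for o in objects if matches("gs://bucket/subdir/*.csv", o)]
print(root_hits)         # []
print(len(subdir_hits))  # 2
```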

Exam trap

Google Cloud often tests the distinction between file pattern matching errors and bucket-level errors, trapping candidates who confuse a missing subdirectory in the pattern with a misspelled bucket name.

How to eliminate wrong answers

Option A is wrong because an incorrectly spelled bucket name would result in a 'bucket not found' or 'access denied' error, not a 'file pattern not matching any files' error. Option C is wrong because the template is a well-tested Google-provided template; a bug is unlikely and would typically cause different errors (e.g., runtime exceptions). Option D is wrong because the output table not existing would cause a BigQuery table creation or write error, not a file pattern matching error in Cloud Storage.

15
MCQhard

A team configured a garbage collection rule on a Cloud Bigtable column family with max_age of 100 seconds. After 2 minutes, they notice that data older than 100 seconds is still present. What is the most likely reason?

A.They need to apply the rule using a different API
B.Garbage collection runs only periodically (e.g., once per day)
C.The max_age must be at least 1 hour
D.The rule is applied only to new data, not existing data
AnswerB

Bigtable GC runs as a background process rather than at the moment data expires, so expired cells can remain visible for some time (potentially days) after a rule is set.

Why this answer

Cloud Bigtable garbage collection (GC) is not applied in real time; it runs as a background process, and Google's documentation notes that it can take up to a week for eligible data to be removed. Even though the max_age rule is set to 100 seconds, the actual deletion of expired data occurs only when that background process reaches it. Therefore, observing data older than 100 seconds after only 2 minutes is expected behavior, and readers that must never see expired cells should also filter by timestamp at read time.

Exam trap

The trap here is that candidates assume garbage collection is immediate or near-real-time, but Cloud Bigtable's GC is a periodic background process, so expired data persists until that process removes it.

How to eliminate wrong answers

Option A is wrong because Cloud Bigtable garbage collection rules are configured through the standard Bigtable admin tooling (e.g., the `cbt setgcpolicy` command or the Admin API's ModifyColumnFamilies method); no different API is required. Option C is wrong because Cloud Bigtable does not enforce a minimum max_age of 1 hour; a max_age of 100 seconds is a valid setting. Option D is wrong because garbage collection rules apply to both existing and new data; once the rule is set, it governs all data in the column family.

16
Multi-Selectmedium

A data engineer is designing a streaming pipeline with Cloud Pub/Sub and Cloud Dataflow. They need to guarantee at-least-once delivery and handle occasional duplicates. Which TWO configurations should they implement?

Select 2 answers
A.Use idempotent sinks
B.Use global windows with triggers
C.Use fixed windows
D.Use at-least-once Pub/Sub subscription
E.Enable Dataflow Streaming Engine
AnswersA, D

Idempotent sinks allow safe duplicate writes, ensuring exactly-once effect despite duplicates.

Why this answer

Option A is correct because idempotent sinks (e.g., BigQuery with insertId, Cloud Storage with object generation numbers) allow the pipeline to safely process duplicate records without causing data corruption or double-counting. In a streaming pipeline with at-least-once semantics, duplicates are inevitable, and idempotent sinks ensure that repeated writes produce the same result as a single write, maintaining data consistency.
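
A minimal in-memory sketch shows why idempotent writes make duplicates harmless. The `IdempotentSink` class, the record IDs, and the amounts are invented for illustration; a real sink would key on something like an insert ID or a primary key.

```python
class IdempotentSink:
    """In-memory stand-in for a sink that is keyed on a unique record ID."""

    def __init__(self):
        self.rows = {}

    def write(self, record_id, payload):
        # Re-writing the same ID stores the same data again, so a retry
        # or duplicate delivery does not change the final state.
        self.rows[record_id] = payload


sink = IdempotentSink()
# "evt-1" is delivered twice, as can happen with at-least-once delivery.
deliveries = [
    ("evt-1", {"amount": 10}),
    ("evt-2", {"amount": 5}),
    ("evt-1", {"amount": 10}),
]
for record_id, payload in deliveries:
    sink.write(record_id, payload)

total = sum(p["amount"] for p in sink.rows.values())
print(len(sink.rows), total)  # 2 15
```

An append-only sink would have stored three rows and reported a total of 25.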

Exam trap

Google Cloud often tests the misconception that windowing strategies (global or fixed) or execution engine features (Streaming Engine) can substitute for explicit delivery guarantees and idempotent sinks, when in fact they address entirely different concerns.

17
Multi-Selecteasy

A company is designing a data processing system that must handle both batch and streaming workloads with unified pipeline code. Which two Google Cloud services are most suitable for implementing a unified batch and streaming pipeline? (Choose TWO.)

Select 2 answers
A.Cloud Data Fusion
B.BigQuery
C.Apache Beam SDK
D.Cloud Dataflow
E.Cloud Dataproc
AnswersC, D

Beam is the unified model; Dataflow is one runner.

Why this answer

Apache Beam SDK (C) provides a unified programming model that allows developers to write a single pipeline that can execute in both batch and streaming modes without code changes. It abstracts the underlying execution engine, making it the correct choice for unified pipeline code.

Exam trap

Google Cloud often tests the misconception that Cloud Data Fusion or Cloud Dataproc can achieve unified batch and streaming with a single codebase, but only Apache Beam SDK combined with Cloud Dataflow provides the native programming model and execution engine for this requirement.

18
MCQhard

A financial services company must comply with GDPR "right to be forgotten". They store customer transactions in BigQuery partitioned by date. When a user requests deletion, all their data must be removed within 48 hours. The deletion requests are received via a Pub/Sub topic. What is the most scalable and cost-effective approach?

A.Use Cloud Functions to execute a BigQuery DELETE statement on each request
B.Use Cloud DLP to redact the user's data in Cloud Storage
C.Use a Dataflow pipeline that reads the deletion IDs from Pub/Sub, joins with the transactions table using a side input, writes the filtered data to a new table, and then swaps the new table in for the original
D.Use BigQuery table snapshots and restore after deletion
AnswerC

This scales well because many deletion requests are applied in one rewrite instead of one DELETE statement per request; the side input contains the IDs to delete.

Why this answer

Option C is correct because it uses Dataflow to process deletion requests from Pub/Sub, join them with the BigQuery transactions table via a side input, and write a filtered copy to a new table. This approach is scalable (handles high-throughput streaming deletions) and cost-effective (avoids expensive DELETE mutations on BigQuery, which consume slot resources and can be slow for large tables). Swapping the new table for the old one completes the deletion efficiently within the 48-hour SLA.

Exam trap

Google Cloud often tests the misconception that BigQuery DELETE statements are the simplest way to remove data, but the trap here is that DELETE operations on large partitioned tables are expensive and not scalable for streaming deletion requests, whereas a Dataflow-based rewrite is both cost-effective and meets the 48-hour SLA.

How to eliminate wrong answers

Option A is wrong because executing a BigQuery DELETE statement per request is not scalable for high-volume deletion requests; each DELETE incurs slot consumption and can be slow on large partitioned tables, potentially exceeding the 48-hour SLA. Option B is wrong because Cloud DLP is designed for data masking and redaction in Cloud Storage, not for deleting rows from BigQuery tables; it does not address the requirement to remove customer transactions from BigQuery. Option D is wrong because BigQuery table snapshots are read-only copies used for point-in-time recovery, not for deleting specific user data; restoring a snapshot would revert the table to a previous state, not selectively remove a user's records.

19
MCQhard

A gaming company uses Pub/Sub to ingest player events and Dataflow for real-time analytics. They notice that the Pub/Sub subscription backlog is growing despite the Dataflow pipeline running continuously. The pipeline has a 1-hour window for aggregations. What is the most effective way to reduce the backlog?

A.Increase the Dataflow pipeline's worker count via autoscaling.
B.Use a push subscription instead of pull.
C.Decrease the window duration to 10 minutes.
D.Enable Pub/Sub topic retention.
AnswerA

More workers increase parallelism and processing rate, reducing backlog.

Why this answer

Increasing the Dataflow pipeline's worker count via autoscaling directly addresses the backlog by adding more parallel processing capacity to consume messages from the Pub/Sub subscription faster. Since the pipeline is continuously running but the backlog grows, the bottleneck is processing throughput, not pipeline availability. Autoscaling allows Dataflow to dynamically allocate more workers based on the backlog size, matching consumption rate to the incoming message rate.

Exam trap

Google Cloud often tests the misconception that changing window duration or subscription type can fix a throughput bottleneck, when the real solution is scaling compute resources to match the consumption rate.

How to eliminate wrong answers

Option B is wrong because switching from pull to push subscription does not inherently increase throughput; push subscriptions have their own limitations (e.g., endpoint capacity, HTTP timeouts) and the backlog growth is a processing capacity issue, not a delivery mechanism issue. Option C is wrong because decreasing the window duration to 10 minutes does not reduce the backlog; it changes the aggregation granularity but does not affect the rate at which messages are consumed from the subscription. Option D is wrong because Pub/Sub topic retention controls how long the topic keeps published messages available for replay, not the rate of consumption; it would only extend the time messages remain available, not reduce the backlog.
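The throughput argument above can be checked with simple arithmetic. The following is a minimal sketch with hypothetical rates (50,000 messages/s published, 4,000 messages/s per worker); the function names are illustrative and not part of any Google Cloud API.

```python
import math


def backlog_after(seconds, publish_rate, per_worker_rate, workers, initial_backlog=0):
    """Backlog in messages after `seconds`; it can never drop below zero."""
    net_rate = publish_rate - per_worker_rate * workers
    return max(0, initial_backlog + net_rate * seconds)


def workers_to_keep_up(publish_rate, per_worker_rate):
    """Smallest worker count whose combined rate is at least the publish rate."""
    return math.ceil(publish_rate / per_worker_rate)


# Hypothetical numbers, chosen only for illustration.
print(backlog_after(600, 50_000, 4_000, workers=10))  # backlog grows over 10 minutes
print(workers_to_keep_up(50_000, 4_000))              # workers needed to stop the growth
```

Note that the window duration appears nowhere in the calculation: only the publish rate and the aggregate consumption rate determine whether the backlog grows.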

20
MCQhard

A company wants to implement a near-real-time lake architecture using Cloud Storage and BigQuery. They need to enable queries on data within 5 minutes of arrival. Which approach meets the requirement with minimal operational overhead?

A.Use BigQuery Omni with external tables pointing to Cloud Storage
B.Set up a Cloud Function to trigger BigQuery load jobs every 5 minutes
C.Use Cloud Storage FUSE to mount buckets and query with Spark on Dataproc
D.Stream data into a BigQuery table via streaming inserts, then use a scheduled query to merge into the main table
AnswerA

BigQuery Omni allows querying data directly from Cloud Storage with minimal latency.

Why this answer

Option A is correct because BigQuery Omni with external tables can query data directly from Cloud Storage without loading. Option C is wrong because Cloud Storage FUSE adds a filesystem layer that may not be fast enough. Option D is wrong because streaming inserts into a separate table and then merging adds complexity and latency.

Option B is wrong because scheduled batch loads have a minimum 10-minute interval, not meeting the 5-minute requirement.

21
MCQhard

A data engineer configures the above lifecycle rule on a Cloud Storage bucket that stores daily log files. After 60 days, they notice that files older than 30 days have been transitioned to Nearline, but files older than 90 days are still present. What is the most likely cause?

A.The delete rule is missing `isLive: true` condition, so it does not apply to live objects.
B.The `age` condition in the delete rule is calculated from the transition date, not creation date.
C.The bucket has object versioning enabled, and the delete rule only applies to non-current versions.
D.The delete rule's condition includes `matchesStorageClass`: `STANDARD`, which does not match the Nearline storage class of transitioned objects.
AnswerD

After the first rule transitions objects to Nearline, they no longer match the `STANDARD` storage class required by the delete rule, so they are not deleted.

Why this answer

Option D is correct because the lifecycle delete rule includes a `matchesStorageClass` condition set to `STANDARD`. Once objects are transitioned to Nearline (which is a different storage class), they no longer match the `STANDARD` condition, so the delete rule does not apply to them. As a result, files older than 90 days that were moved to Nearline remain in the bucket.

Exam trap

Google Cloud often tests the interaction between lifecycle rules and storage class transitions, specifically that a `matchesStorageClass` condition filters objects based on their current storage class, not the original class at creation.

How to eliminate wrong answers

Option A is wrong because the delete rule does not need an `isLive: true` condition to apply to live objects; when `isLive` is omitted, the rule applies to both live and noncurrent objects, so leaving it out cannot stop live objects from being deleted. Option B is wrong because the `age` condition in lifecycle rules is always calculated from the object's creation date, not from the transition date. Option C is wrong because enabling object versioning restricts a delete rule to non-current versions only if the rule explicitly targets them; the scenario describes current versions still present, and versioning does not inherently prevent deletion of current versions.
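The interaction between the two rules can be illustrated with a toy simulation. This is a sketch only: the rule dictionaries below are hypothetical stand-ins (they are not the exhibit from the question and not the Cloud Storage API), and the evaluator models just the `age` and `matchesStorageClass` conditions.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    age_days: int       # days since creation; never reset by a transition
    storage_class: str  # the object's CURRENT storage class


# Hypothetical rules for illustration only.
BROKEN_RULES = [
    {"action": "SetStorageClass", "to": "NEARLINE", "age": 30,
     "matches_storage_class": ["STANDARD"]},
    {"action": "Delete", "age": 90, "matches_storage_class": ["STANDARD"]},
]
# Same rules, but the delete rule also matches the transitioned class.
FIXED_RULES = [
    BROKEN_RULES[0],
    {"action": "Delete", "age": 90,
     "matches_storage_class": ["STANDARD", "NEARLINE"]},
]


def matches(obj, rule):
    return (obj.age_days >= rule["age"]
            and obj.storage_class in rule["matches_storage_class"])


def simulate(days, rules):
    """Apply the rules once per day; return the object, or None if deleted."""
    obj = Obj(age_days=0, storage_class="STANDARD")
    for day in range(1, days + 1):
        obj.age_days = day
        hits = [r for r in rules if matches(obj, r)]
        if any(r["action"] == "Delete" for r in hits):
            return None
        for r in hits:
            obj.storage_class = r["to"]
    return obj


print(simulate(120, BROKEN_RULES))  # object survives, now in NEARLINE
print(simulate(120, FIXED_RULES))   # None: deleted once it reaches 90 days
```

With the broken rules, the object stops matching `STANDARD` on day 30, so the delete rule never fires even though the age keeps counting from creation.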

22
Multi-Selecteasy

A company uses Dataproc for transient clusters. Which TWO actions can reduce costs?

Select 2 answers
A.Increase master node size
B.Set cluster autoscaling to minimize idle resources
C.Use standard VMs for all nodes
D.Use persistent clusters to avoid creation overhead
E.Use preemptible VMs for worker nodes
AnswersB, E

Autoscaling reduces resource waste, lowering cost.

Why this answer

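Option B is correct because Dataproc cluster autoscaling automatically adjusts the number of worker nodes based on YARN resource metrics such as pending and available memory. By scaling down during idle periods, you avoid paying for unused compute capacity, directly reducing costs for transient clusters that have variable workloads. Option E is correct because preemptible VMs cost far less than standard VMs, and fault-tolerant batch work on a transient cluster can tolerate workers being reclaimed.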

Exam trap

The trap here is that candidates often think 'persistent clusters' are cheaper because they avoid re-creation overhead, but they overlook the continuous compute cost of idle persistent clusters versus the pay-per-use model of transient clusters.

23
MCQeasy

A team has set up a push subscription to an HTTPS endpoint. They notice that messages are not being acknowledged and are resent every 10 seconds. What is the most likely issue?

A.The push endpoint is returning HTTP 200 but taking too long to process
B.The push endpoint is returning HTTP 500
C.The push endpoint is returning HTTP 200 with 'ack' in the body
D.The push endpoint is returning HTTP 400
AnswerB

Any non-200 response (e.g., 500) causes Pub/Sub to retry; 500 indicates a server error.

Why this answer

In Google Cloud Pub/Sub push subscriptions, the endpoint acknowledges a message by returning a success status code such as HTTP 200. If the endpoint returns HTTP 500, Pub/Sub treats the delivery as failed and redelivers the message. With an exponential backoff retry policy, the default minimum backoff is 10 seconds, which matches the observed behavior of messages being resent every 10 seconds without acknowledgment.

Exam trap

Google Cloud often tests the misconception that the response body or processing time affects acknowledgment, when in fact only the HTTP status code determines whether a message is acknowledged or retried.

How to eliminate wrong answers

Option A is wrong because an HTTP 200 returned within the acknowledgement deadline is treated as a successful acknowledgment even if processing is slow; Pub/Sub would resend the message only if the response arrived after the deadline. Option C is wrong because returning HTTP 200 with 'ack' in the body is still a valid acknowledgment (the body content is irrelevant; only the status code matters). Option D is a weaker fit than B: HTTP 400 is also a non-success status, but it indicates a client error in the request rather than a failing endpoint, whereas HTTP 500 is the typical response from an endpoint that cannot process messages.
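The status-code rule can be sketched as a small classifier. This is an illustrative model, not Pub/Sub client code; the set of success codes reflects the Pub/Sub push documentation as best recalled here and should be treated as an assumption to verify, and `push_outcome` is a hypothetical helper name.

```python
# Status codes assumed to count as an acknowledgement for push delivery.
ACK_STATUS_CODES = {102, 200, 201, 202, 204}


def push_outcome(status_code: int, body: str = "") -> str:
    """Return 'ack' or 'redeliver' for a push response.

    The body parameter is accepted but deliberately ignored:
    only the status code decides the outcome.
    """
    return "ack" if status_code in ACK_STATUS_CODES else "redeliver"


print(push_outcome(200, body="ack"))  # acknowledged
print(push_outcome(200, body=""))     # acknowledged, body makes no difference
print(push_outcome(500))              # redelivered
```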

24
MCQhard

A company runs a Dataproc cluster with 10 worker nodes for a Spark streaming job that processes data from Pub/Sub (via Pub/Sub Lite) and writes to Cloud Storage. They observe that the job is producing many small files in Cloud Storage, leading to high costs and performance issues in downstream batch pipelines. The team wants to consolidate output files while maintaining low latency. What is the best solution?

A.Run a separate compaction job that periodically merges small files into larger ones
B.Use windowed streaming with a longer window duration and Spark's file size configuration
C.Reduce the number of workers to force more data per task
D.Switch from Dataproc to Dataflow, which has built-in file sharding optimization
AnswerB

Allows batching data to create larger files with acceptable latency.

Why this answer

Option B is correct because a longer window (or trigger interval) in Spark Streaming lets more data accumulate before each write, so every batch produces fewer, larger files. Combining this with Spark's output settings (for example, reducing the number of output partitions with `coalesce` or `repartition` before the write) consolidates the output further; note that `spark.sql.files.maxRecordsPerFile` only caps how many records a file may hold, so it limits file size rather than increasing it. This reduces the number of small files in Cloud Storage inside the streaming job itself, without adding a separate compaction job or reducing cluster parallelism.

Exam trap

The trap here is that candidates often choose a separate compaction job (Option A) because it seems like a straightforward fix, but they overlook the latency penalty and the fact that Spark's native streaming configurations can achieve the same goal without extra overhead.

How to eliminate wrong answers

Option A is wrong because running a separate compaction job introduces additional latency and resource overhead, which contradicts the requirement to maintain low latency; it also adds complexity and potential data consistency issues. Option C is wrong because reducing the number of workers decreases parallelism, which can increase processing latency and may not guarantee larger files if the data volume per task remains small due to Spark's default partitioning. Option D is wrong because switching to Dataflow does not inherently solve the small files problem; Dataflow's built-in file sharding optimization (e.g., via `FileIO.write()` with `withNumShards`) still requires explicit configuration, and the question specifically asks for a solution within the existing Dataproc/Spark context.

25
MCQeasy

A company is designing a real-time clickstream analytics pipeline using Pub/Sub and Dataflow. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which Dataflow feature should be configured to handle late data correctly?

A.Configure the trigger with allowed lateness of 1 hour.
B.Use fixed windows with a 1-hour period and enable data discarding.
C.Use session windows with a gap duration of 1 hour.
D.Set the watermark estimate to 1 hour.
AnswerA

Allowed lateness specifies how long after the watermark the system waits for late data before considering the window complete.

Why this answer

Option A is correct because Dataflow's allowed lateness setting explicitly controls how long a window's state is kept after the watermark passes the end of the window. With allowed lateness set to 1 hour, data that arrives up to 1 hour behind the watermark is still assigned to its window and processed with exactly-once semantics instead of being dropped. This directly addresses the requirement for handling late data up to 1 hour while ensuring no duplicates or data loss.

Exam trap

Google Cloud often tests the distinction between allowed lateness (which extends window lifetime for late data) and watermark estimation (which is a system property, not a user-set parameter), leading candidates to incorrectly choose D.

How to eliminate wrong answers

Option B is wrong because fixed windows with a 1-hour period and data discarding would drop any data arriving after the window's end, failing the late-data requirement. Option C is wrong because session windows with a 1-hour gap duration merge events into sessions based on inactivity gaps, not fixed lateness, and do not guarantee handling of data arriving up to 1 hour late for a specific event time. Option D is wrong because the watermark estimate is a system-managed heuristic, not a configurable feature; setting it to 1 hour is not a valid Dataflow configuration and would not correctly handle late data.
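The effect of allowed lateness can be modelled in a few lines. This is a toy model under simplifying assumptions (a single window, times as plain seconds on the event-time axis, boundary handling simplified); it is not Apache Beam code, and `classify_arrival` is a hypothetical name.

```python
def classify_arrival(window_end, watermark, allowed_lateness):
    """Classify an element for a window, given the watermark at arrival.

    on-time:        the watermark has not yet passed the end of the window
    late-accepted:  the watermark is past the window end, but still within
                    the allowed lateness, so the window state still exists
    dropped:        the window has expired and its state was discarded
    """
    if watermark < window_end:
        return "on-time"
    if watermark < window_end + allowed_lateness:
        return "late-accepted"
    return "dropped"


HOUR = 3600
window_end = 1 * HOUR  # window covers [0, 3600)

print(classify_arrival(window_end, watermark=3000, allowed_lateness=HOUR))
print(classify_arrival(window_end, watermark=5000, allowed_lateness=HOUR))
print(classify_arrival(window_end, watermark=8000, allowed_lateness=HOUR))
print(classify_arrival(window_end, watermark=5000, allowed_lateness=0))
```

The second and fourth calls differ only in the allowed lateness: the same late element is accepted with a 1-hour allowance and dropped with none.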

26
MCQeasy

A company is building a data lake on Cloud Storage for log analysis. Log files (CSV) arrive every 5 minutes from multiple sources. The files should be ingested into BigQuery for reporting within 15 minutes. Which approach best meets the requirements with minimal operational overhead?

A.Set up a Cloud Storage notification to trigger a Cloud Function that loads each file into BigQuery using the BigQuery API.
B.Schedule a daily batch load from Cloud Storage to BigQuery using the BigQuery Data Transfer Service.
C.Use Dataflow to read from Pub/Sub (ingested from Cloud Storage) and write to BigQuery.
D.Use BigQuery federated queries to query the CSV files directly from Cloud Storage.
AnswerA

This approach provides near-real-time loading (within minutes) with minimal operational overhead, as Cloud Functions are serverless.

Why this answer

Option A is correct because Cloud Storage notifications trigger a Cloud Function on each file upload, which then loads the file into BigQuery via the BigQuery API. This provides near-real-time ingestion (within seconds of file arrival) with minimal operational overhead, as there are no servers to manage and no scheduling needed. The 5-minute file arrival and 15-minute SLA are easily met without complex infrastructure.

Exam trap

Google Cloud often tests the misconception that serverless options like Cloud Functions are only for simple tasks, but here they are the most efficient choice for near-real-time ingestion with minimal overhead, while Dataflow is overkill for this straightforward file-load pattern.

How to eliminate wrong answers

Option B is wrong because a daily batch load does not meet the 15-minute ingestion requirement; it would only load data once per day, causing up to 24 hours of latency. Option C is wrong because it introduces unnecessary complexity and operational overhead by adding Pub/Sub and Dataflow, which are not needed when files are already in Cloud Storage and can be loaded directly via a Cloud Function. Option D is wrong because BigQuery federated queries do not ingest data into BigQuery; they query the CSV files directly from Cloud Storage, which is slower and does not support the required reporting use case where data must be stored in BigQuery for efficient analysis.

27
Multi-Selectmedium

A company is designing a data lake on Google Cloud. They need to store raw data in multiple formats (CSV, Parquet, Avro) and allow various downstream processing frameworks. Which two storage solutions provide flexibility and scalability? (Choose two.)

Select 2 answers
A.Cloud Filestore
B.BigQuery
C.Cloud Storage
D.Cloud Spanner
E.Cloud Bigtable
AnswersB, C

BigQuery can store and query structured data, and with federated queries it can access external files.

Why this answer

BigQuery is correct because it can directly query raw data stored in Cloud Storage in formats like CSV, Parquet, and Avro using external tables or federated queries, without requiring data loading. This provides a flexible, serverless analytics layer that scales automatically and integrates with downstream processing frameworks like Apache Spark, Dataflow, and Dataproc. Cloud Storage is correct because it is object storage that holds files in any format at any scale and can be read by all of these frameworks.

Exam trap

Google Cloud often tests the misconception that any database or storage service can serve as a data lake, but the trap here is that only object storage (Cloud Storage) and a serverless query engine (BigQuery) provide the schema-on-read flexibility and scalability required for raw multi-format data, while transactional or operational databases (Spanner, Bigtable) impose schema-on-write constraints and are not designed for bulk analytical storage.

28
MCQhard

A healthcare company streams patient monitoring data to Cloud Pub/Sub. A Dataflow pipeline reads the stream, enriches with patient records from BigQuery, and writes to Bigtable for real-time queries. The BigQuery lookup is slow and causes pipeline lag. What is the best approach to improve performance?

A.Increase the number of Dataflow workers and use vertical scaling.
B.Use BigQuery's streaming read API in the pipeline.
C.Pre-join the data in a batch pipeline and load into Bigtable.
D.Use a side input from a BigQuery query with a global window and periodic refresh.
AnswerD

Side inputs cache data efficiently.

Why this answer

Option D is correct because using a side input from BigQuery with a global window and periodic refresh allows the Dataflow pipeline to cache the patient records in memory across all workers, avoiding per-element slow lookups. This pattern leverages Beam's side input semantics to broadcast a relatively static lookup table, significantly reducing latency compared to synchronous BigQuery queries for each incoming event.

Exam trap

The trap here is that candidates often assume that increasing parallelism (Option A) or using a faster read API (Option B) will solve the latency issue, when in fact the core problem is the synchronous per-element lookup pattern, which is best addressed by caching the reference data as a side input.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers and vertical scaling does not address the root cause: the per-element synchronous BigQuery lookup is the bottleneck, and simply adding more workers will not reduce the latency of each individual query. Option B is wrong because BigQuery's read APIs are built for high-throughput scans, not for low-latency point lookups; the pipeline would still issue a request per event and would not eliminate the network round-trip overhead. Option C is wrong because pre-joining in a batch pipeline and loading into Bigtable would work only if the patient records are static and the data is not truly streaming; it sacrifices the real-time nature of the pipeline and cannot handle late-arriving or updated patient data without reprocessing.

29
MCQhard

A Dataflow streaming pipeline uses stateful transformations with per-key state and timers. After a deployment, the team observes that the pipeline is reprocessing events from the last 30 minutes every time it restarts. The pipeline's checkpoint is configured to persist every 10 seconds. Which change should be made to prevent unnecessary reprocessing?

A.Use a non-volatile state backend like Cloud Bigtable for state storage.
B.Increase the checkpoint interval to 60 seconds to reduce frequency of checkpoints.
C.Enable idempotent writes to the sink by adding a unique identifier per event.
D.Decrease the checkpoint interval to 1 second to checkpoint more frequently.
AnswerC

Idempotent writes prevent duplicates from being written when reprocessing occurs.

Why this answer

Option C is correct because enabling idempotent writes ensures that even if events are reprocessed due to pipeline restarts, the sink will deduplicate them based on the unique identifier. This prevents duplicate data from being written, which is the core issue when stateful transformations cause reprocessing of events from the last 30 minutes. The checkpoint interval (10 seconds) is already frequent enough; the problem is not checkpoint frequency but the lack of deduplication at the sink.

Exam trap

Google Cloud often tests the misconception that increasing checkpoint frequency or changing state backends solves reprocessing issues, when the real solution is idempotent sinks to handle duplicates from replay.

How to eliminate wrong answers

Option A is wrong because using a non-volatile state backend like Cloud Bigtable does not prevent reprocessing; it only ensures state survives restarts, but the pipeline still replays uncommitted events from the last checkpoint. Option B is wrong because increasing the checkpoint interval to 60 seconds would actually increase the window of potential reprocessing, making the problem worse, not better. Option D is wrong because decreasing the checkpoint interval to 1 second would increase overhead and still not prevent reprocessing; the pipeline will always replay events from the last successful checkpoint, regardless of frequency.
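The idempotent-write idea can be sketched with an in-memory stand-in for the sink. This is an assumption-laden toy: `IdempotentSink` and the `event_id` field are hypothetical names, and a real sink would enforce uniqueness through its own mechanism (for example, a primary key or insert ID).

```python
class IdempotentSink:
    """Toy sink that stores at most one row per event_id."""

    def __init__(self):
        self.rows = {}

    def write(self, event):
        """Store the event; return True only if it was not already present."""
        event_id = event["event_id"]
        if event_id in self.rows:
            return False  # replayed event: ignored, no duplicate row
        self.rows[event_id] = event
        return True


events = [{"event_id": f"e{i}", "value": i} for i in range(5)]
sink = IdempotentSink()
for event in events:
    sink.write(event)

# Simulate a restart that replays the last three events.
replayed = [sink.write(event) for event in events[2:]]

print(len(sink.rows))  # row count is unchanged by the replay
print(replayed)        # every replayed write was rejected as a duplicate
```

The pipeline still does the replay work; the sink simply makes that replay harmless.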

30
Multi-Selectmedium

An organization is moving on-premises Hadoop workloads to Google Cloud. They need to minimize code changes and manage transient clusters for cost savings. Which two Google Cloud services should they consider? (Choose TWO.)

Select 2 answers
A.Compute Engine with self-managed Hadoop
B.BigQuery
C.Dataproc on GKE
D.Cloud Dataproc
E.Cloud Dataflow
AnswersC, D

Allows running Spark workloads on GKE, leveraging container orchestration.

Why this answer

Options C and D are correct: Dataproc is a managed Hadoop/Spark service that can run transient clusters, and Dataproc on GKE allows running Spark workloads on GKE for flexibility. Option E is wrong because Dataflow is not compatible with Hadoop. Option A is wrong because Compute Engine requires manual cluster setup.

Option B is wrong because BigQuery is not Hadoop-compatible.

31
Multi-Selecteasy

A company uses Pub/Sub to decouple services. They have a topic with two subscriptions: Subscription A is a push subscription that sends messages to a Cloud Function; Subscription B is a pull subscription used by a Dataflow job. They need to ensure that messages are processed in order for a specific device_id. Which TWO configurations should they apply?

Select 2 answers
A.Enable message ordering on the topic and set an ordering key for each message.
B.Disable duplicate filtering on the topic.
C.Configure the Cloud Function to retry on failure with exponential backoff.
D.Use exactly one subscription for both the Cloud Function and Dataflow job.
E.Use a single subscription with multiple concurrent consumers.
AnswersA, D

Ordering key is required for ordered delivery.

Why this answer

Option A is correct because enabling message ordering on the topic and setting an ordering key (e.g., device_id) ensures that messages with the same key are delivered to subscribers in the order they were published. This is a fundamental Pub/Sub feature that guarantees FIFO (first-in, first-out) delivery per ordering key, which directly addresses the requirement for processing messages in order for a specific device_id.

Exam trap

Google Cloud often tests the misconception that multiple subscriptions or multiple consumers can maintain ordering independently, but in Pub/Sub, ordering is per subscription and per ordering key, and only a single subscriber per subscription can guarantee FIFO delivery.

32
MCQhard

A company runs a production Dataflow streaming pipeline that reads from Pub/Sub, groups events by customer ID, and writes to BigQuery. The pipeline uses global windows with triggers. After a recent code change, the pipeline started generating duplicate events in BigQuery for the same customer ID. The previous version did not have duplicates. The team reviews the code and sees that the trigger was changed from 'afterProcessingTime' to 'afterWatermark'. What is the most likely reason for duplicates?

A.The afterProcessingTime trigger fired multiple times for the same window
B.Late-arriving events cause the afterWatermark trigger to fire additional panes for the same window
C.The pipeline is firing early and on-time panes for the same window
D.The pipeline uses accumulation mode which accumulates results across firings
AnswerB

Watermark triggers can fire again for late data, producing duplicates if not deduplicated.

Why this answer

The change from `afterProcessingTime` to `afterWatermark` introduces a dependency on the watermark, which estimates event time progress. When late-arriving events (those with timestamps before the watermark) arrive after the watermark has advanced, the `afterWatermark` trigger fires an additional pane for the same window, causing duplicate writes to BigQuery. The previous trigger (`afterProcessingTime`) fired based on processing time, which does not react to late data in the same way, hence no duplicates.

Exam trap

Google Cloud often tests the distinction between processing-time and event-time triggers, and the trap here is that candidates assume `afterWatermark` is simply a 'one-time' trigger, overlooking that late-arriving data can cause additional firings.

How to eliminate wrong answers

Option A is wrong because `afterProcessingTime` fires based on wall-clock time, not on data arrival, and it does not inherently cause multiple firings for the same window unless combined with other triggers or accumulation; the issue here is specifically the switch to watermark-based triggering. Option C is wrong because firing early and on-time panes is a feature of `afterWatermark` with early firings, but the question states the trigger was changed to `afterWatermark` alone (without early firings), so this does not explain the duplicates. Option D is wrong because accumulation mode (e.g., `accumulatingFiredPanes`) determines whether results are accumulated across firings, but the core cause of additional panes is the watermark reacting to late data, not the accumulation mode itself.

33
MCQmedium

The exhibit shows a Cloud Logging query result. A data engineer sees this log for a streaming Dataflow job. What is the most likely cause?

A.The job is experiencing network latency.
B.The job is using too much memory per worker.
C.The job has insufficient permissions to scale.
D.The job has reached the maximum number of workers allowed by the project quota.
AnswerD

Worker pool exhausted indicates quota limit.

Why this answer

The log shows that the Dataflow job is not scaling up despite pending work. This typically occurs when the job has reached the maximum number of workers allowed by the project quota. Dataflow workers are Compute Engine VMs, so they count against the project's Compute Engine quotas; if the job attempts to exceed that limit, it stops scaling and logs messages indicating that it cannot add more workers.

Exam trap

Google Cloud often tests the distinction between resource quotas and permissions, so the trap here is that candidates confuse a quota limit (which is a hard resource cap) with an IAM permissions issue (which would produce a different error).

How to eliminate wrong answers

Option A is wrong because network latency would cause delays in data processing but would not prevent the job from scaling up; the log would show slow progress or timeouts, not a scaling block. Option B is wrong because excessive memory usage per worker would cause worker crashes or OOM errors, not a failure to scale; the job would still attempt to add workers. Option C is wrong because insufficient permissions to scale would result in an authorization error when trying to create new worker instances, not a quota-related log message; the error would reference IAM roles or service account permissions.

34
Multi-Selecthard

A company uses Cloud Pub/Sub with pull subscriptions to process orders. The application requires at-least-once delivery and the ability to process orders in order per customer_id. Which THREE features should they configure? (Choose three.)

Select 3 answers
A.Configure a dead letter topic
B.Use a push subscription with a HTTPS endpoint
C.Enable ordering keys on the topic
D.Enable message ordering on the subscription
E.Set the subscription's ackDeadline to 600 seconds
AnswersA, C, D

Allows failed messages to be stored without blocking subsequent messages.

Why this answer

The correct answers are A, C, and D. Ordering keys (C) group messages per customer_id, and enabling message ordering on the subscription (D) makes Pub/Sub deliver messages with the same key in the order they were published. A dead letter topic (A) lets messages that repeatedly fail be set aside and reprocessed later without blocking the messages behind them.

At-least-once delivery is Pub/Sub's default behavior, so no extra feature is needed for it; exactly-once delivery is not among the options and would be stronger than the question requires.

E (an ackDeadline of 600 seconds) is too long. B (using a push subscription) does not inherently improve ordering.

35
MCQeasy

A company is running a Cloud Dataflow streaming pipeline that aggregates events in 1-minute windows. They notice that the watermark is lagging significantly behind real-time. What is the most likely cause?

A.A hot key is causing data skew.
B.The window duration is too short.
C.The pipeline was recently updated.
D.The allowed lateness is set too high.
AnswerA

Hot key causes processing delays.

Why this answer

A hot key causes data skew, which means a disproportionate amount of data is assigned to a single key. In Cloud Dataflow, this leads to a single worker processing the bulk of the events, creating a processing bottleneck. The watermark, which tracks the progress of event-time processing, cannot advance until all data for a given window is processed, so the skewed key delays watermark progression significantly behind real-time.

Exam trap

Google Cloud often tests the misconception that watermark lag is caused by configuration settings like window duration or allowed lateness, rather than by data-level issues like hot keys that create processing bottlenecks.

How to eliminate wrong answers

Option B is wrong because a short window duration does not inherently cause watermark lag; it may increase computational overhead but does not prevent the watermark from advancing based on data arrival. Option C is wrong because a pipeline update (e.g., via a new job version) does not cause persistent watermark lag; it may cause a brief reprocessing delay but not a sustained lag. Option D is wrong because setting allowed lateness too high only affects how long the pipeline waits for late data after the watermark passes; it does not cause the watermark itself to lag behind real-time.
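A quick way to see why one hot key becomes a bottleneck is to look at the share of elements per key, since all elements for a key are grouped on the same worker. The sketch below uses made-up keys (`player_42` is hypothetical) and plain Python rather than any Dataflow API.

```python
from collections import Counter


def key_shares(keys):
    """Return the fraction of all elements that belongs to each key."""
    counts = Counter(keys)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


# Hypothetical stream: one key carries 90 of 100 events.
keys = ["player_42"] * 90 + [f"player_{i}" for i in range(10)]

shares = key_shares(keys)
hottest = max(shares, key=shares.get)
print(hottest, shares[hottest])  # the hot key and its share of the traffic
```

However many workers the job has, the grouping step for the hot key runs on one of them, so that worker sets the pace at which windows complete and the watermark advances.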

36
MCQmedium

A financial services company uses Cloud Composer to orchestrate daily batch jobs. One job extracts data from MongoDB to Cloud Storage, then loads into BigQuery, and finally runs a Dataflow pipeline for aggregations. The Dataflow job fails intermittently. They want to automatically restart only the failed Dataflow job without re-running the earlier extraction and load. Which Airflow operator configuration should they use?

A.Implement a SlaMiss sensor
B.Use a DAG with depends_on_past=True
C.Set retries=2 on the Dataflow operator
D.Set trigger_rule='one_success' for downstream tasks
AnswerC

Retries automatically re-run the failed task without affecting upstream tasks.

Why this answer

Option C is correct because setting retries=2 on the Dataflow operator instructs Airflow to automatically restart only that specific task upon failure, without affecting upstream tasks (MongoDB extraction, BigQuery load). This isolates the retry to the Dataflow job, preserving the earlier completed work and avoiding redundant data movement.

Exam trap

Google Cloud often tests the distinction between task-level retry mechanisms and dependency/trigger rules, so the trap here is confusing `retries` (which restarts the failed task) with `trigger_rule` or `depends_on_past` (which only affect task scheduling or downstream execution).

How to eliminate wrong answers

Option A is wrong because SlaMiss sensors are used to detect when tasks have not completed within a defined SLA window, not to trigger automatic retries of failed tasks. Option B is wrong because depends_on_past=True enforces sequential execution order across DAG runs (e.g., today’s task waits for yesterday’s success), but does not provide automatic retry on failure within the same run. Option D is wrong because trigger_rule='one_success' controls downstream task execution based on upstream task outcomes (e.g., if one upstream succeeds, proceed), but does not restart a failed task; it only affects task dependencies.
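Task-level retry semantics can be illustrated with a toy runner. This is a simplified model, not the Airflow API: `run_dag` and `flaky_dataflow` are hypothetical names, and `retries` here means additional attempts after the first failure.

```python
def run_dag(tasks, retries):
    """Run tasks in order, retrying only the task that failed.

    `retries[name]` is the number of extra attempts allowed for that task.
    Returns the number of attempts each task needed.
    """
    attempts = {}
    for name, fn in tasks:
        attempts[name] = 0
        while True:
            attempts[name] += 1
            try:
                fn()
                break  # task succeeded, move on to the next one
            except RuntimeError:
                if attempts[name] > retries.get(name, 0):
                    raise  # retries exhausted: the run fails


    return attempts


calls = {"n": 0}


def flaky_dataflow():
    """Fails on the first two calls, succeeds on the third."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")


tasks = [
    ("extract", lambda: None),
    ("load", lambda: None),
    ("dataflow", flaky_dataflow),
]
attempts = run_dag(tasks, retries={"dataflow": 2})
print(attempts)  # upstream tasks ran once; only the flaky task was retried
```

The upstream extract and load steps are never re-entered, because the retry loop is scoped to the single failing task.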

37
Multi-Selectmedium

A data engineer is migrating on-premises Hadoop jobs to Dataproc. Which TWO considerations are important?

Select 2 answers
A.Use Preemptible VMs for worker nodes to reduce cost
B.Use Cloud Storage instead of HDFS for data storage
C.Avoid using Cloud Storage connector to prevent overhead
D.Keep HDFS for better performance
E.Use on-demand VMs for master node to ensure availability
AnswersA, B

Preemptible VMs are cost-effective for fault-tolerant jobs.

Why this answer

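Options A and B are correct: use Preemptible VMs for fault-tolerant worker tasks to cut cost, and use Cloud Storage instead of HDFS for scalability and cost. Option C is wrong because the Cloud Storage connector is necessary for Hadoop and Spark jobs to read and write Cloud Storage. Option D is wrong because keeping HDFS as the primary data store is not recommended on Dataproc.

Option E is not one of the two answers: master nodes should be on-demand for reliability, but Dataproc does not allow preemptible master nodes, so this is the default rather than a migration decision.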

38
MCQmedium

A data engineering team needs to build a data integration pipeline that involves connecting to multiple sources, performing data transformations with visual editing, and then running custom machine learning algorithms. The team has both data analysts and data scientists. Which approach is most suitable?

A.Use Cloud Composer to orchestrate both Data Fusion and Dataproc
B.Use only Cloud Dataproc for all steps
C.Use only Cloud Data Fusion for all steps
D.Use Cloud Data Fusion for the initial ingestion and transformations, then export the data to Cloud Dataproc for the ML algorithms
AnswerD

This leverages the strengths of both services: visual integration and custom ML.

Why this answer

Option D is correct because it leverages Cloud Data Fusion's visual, no-code interface for data ingestion and transformation, which is ideal for data analysts, and then exports the prepared data to Cloud Dataproc, which provides native support for custom machine learning algorithms using Spark or Hadoop, meeting the data scientists' needs. This separation of concerns optimizes the pipeline for both user groups and avoids forcing all tasks into a single tool that may not excel at both visual ETL and custom ML.

Exam trap

Google Cloud often tests the misconception that a single tool can handle both visual ETL and custom ML, leading candidates to choose Cloud Data Fusion alone (Option C) without realizing it lacks native support for running custom algorithms like Spark MLlib or TensorFlow.

How to eliminate wrong answers

Option A is wrong because Cloud Composer is an orchestration tool (based on Apache Airflow) that manages workflow dependencies and scheduling, but it does not perform data transformations or run ML algorithms itself; using it to orchestrate both Data Fusion and Dataproc adds unnecessary complexity and does not directly address the need for visual editing or custom ML execution. Option B is wrong because Cloud Dataproc is a managed Spark/Hadoop service that requires coding for data transformations, which does not provide the visual editing capabilities needed by data analysts, and it would force all team members to write code, reducing productivity. Option C is wrong because Cloud Data Fusion is designed for visual ETL and data integration but lacks native support for running custom machine learning algorithms; it can only trigger external services like Dataproc for such tasks, making it insufficient for the ML step.

39
MCQeasy

A company wants to stream data from Cloud Pub/Sub into BigQuery with minimal latency. They have a small team and limited operational resources. Which approach is best?

A.Write a custom application on Compute Engine that polls Pub/Sub and writes to BigQuery.
B.Create a Dataproc cluster running a Spark Streaming job.
C.Create a Cloud Function that writes to BigQuery.
D.Use a Dataflow pipeline with a BigQuery subscription.
AnswerD

Serverless and low maintenance.

Why this answer

Option D is correct because it gives a fully managed, serverless path from Pub/Sub to BigQuery with minimal latency. Dataflow handles autoscaling, checkpointing, and exactly-once processing, which suits the team's limited operational resources. Note that a BigQuery subscription is itself a Pub/Sub subscription type that writes messages directly to a BigQuery table, while the Pub/Sub to BigQuery template is the Dataflow route; the option pairs the two, and either one removes the need for custom code or cluster management.

Exam trap

Google Cloud often tests the misconception that a simple serverless function (Cloud Function) is sufficient for streaming workloads, but candidates overlook that Cloud Functions are designed for event-driven, short-lived tasks and lack the state management, exactly-once guarantees, and sustained throughput needed for continuous data ingestion into BigQuery.

How to eliminate wrong answers

Option A is wrong because writing a custom application on Compute Engine requires the team to manage polling logic, handle failures, and scale instances manually, which contradicts the requirement for minimal operational resources and introduces unnecessary latency and complexity. Option B is wrong because creating a Dataproc cluster running a Spark Streaming job introduces significant operational overhead for cluster provisioning, scaling, and maintenance, and Spark Streaming typically has higher latency (seconds) compared to Dataflow's millisecond-level streaming, making it suboptimal for minimal latency. Option C is wrong because a Cloud Function that writes to BigQuery is not designed for continuous streaming; Cloud Functions have a maximum timeout of 9 minutes (or 60 minutes with 2nd gen) and are triggered per event, which can lead to throttling, out-of-order writes, and lack of exactly-once semantics, making it unsuitable for sustained, low-latency streaming into BigQuery.

40
MCQeasy

An online retailer uses BigQuery for analytics. They have a time-series table with 5 billion rows and new data arrives every day. They want to optimize query performance and reduce costs by ensuring that queries scan only the partitions they need. Which table design should they use?

A.Use a table partitioned on the timestamp column.
B.Use a table clustered on the timestamp column.
C.Use a table with no partitioning but use LIMIT in queries.
D.Use a table partitioned by ingestion time with a partition expiration.
AnswerA

Allows queries to scan only relevant time-range partitions.

Why this answer

Partitioning on the timestamp column allows BigQuery to perform partition pruning, so queries with filters on that column only scan the relevant partitions. This directly reduces the amount of data read, lowering both query cost (pay-per-byte) and improving performance. For a 5-billion-row table with daily data arrival, time-unit partitioning is the standard design to meet the stated goals.

Exam trap

Google Cloud often tests the distinction between partitioning (which splits the table into segments that can be skipped entirely) and clustering (which only sorts data within a partition or table), leading candidates to mistakenly treat clustering as a substitute for partition pruning on time-range queries.

How to eliminate wrong answers

Option B is wrong because clustering only sorts data within partitions or within the table, but does not enable partition pruning; without partitioning, queries still scan the entire table unless a filter matches the clustering key, and clustering alone does not reduce the bytes billed to only the needed time range. Option C is wrong because using LIMIT does not reduce the amount of data scanned; BigQuery still reads all bytes from the entire table before applying the LIMIT, so costs remain high and performance is not improved. Option D is wrong because partitioning by ingestion time (using _PARTITIONTIME or _PARTITIONDATE) only works for append-only streaming or load jobs and does not allow querying on an arbitrary timestamp column; also, partition expiration would delete old data automatically, but the requirement is to scan only needed partitions, not to expire them.
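
The effect of partition pruning can be illustrated with a toy model in plain Python. The table layout, the row counts, and the `scan` helper are hypothetical; the sketch only shows why a filter on the partitioning column limits how much data is read, and it does not model BigQuery's storage engine.

```python
from datetime import date, timedelta

ROWS_PER_DAY = 1_000  # hypothetical partition size

# Hypothetical table: 30 daily partitions keyed by date.
start = date(2024, 1, 1)
table = {start + timedelta(days=d): [d] * ROWS_PER_DAY for d in range(30)}


def scan(table, first_day, last_day):
    """Read only partitions inside the date filter; return (rows, partitions)."""
    hit = [day for day in table if first_day <= day <= last_day]
    return sum(len(table[day]) for day in hit), len(hit)


def scan_unpartitioned(table):
    """Without partitions, every row is read regardless of the filter."""
    return sum(len(rows) for rows in table.values())


pruned_rows, pruned_parts = scan(table, date(2024, 1, 10), date(2024, 1, 12))
full_rows = scan_unpartitioned(table)
print(pruned_rows, pruned_parts, full_rows)
```

A three-day filter touches three partitions in this model, while the unpartitioned read touches all thirty days of rows.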

41
MCQhard

A manufacturing company wants to detect anomalies in sensor data from thousands of IoT devices in real time. The data is streaming into Pub/Sub. The best solution should use a machine learning model served from AI Platform that scores sensor readings aggregated over 5-minute windows. Which pipeline design meets these requirements?

A.Use Cloud Dataproc with Spark Streaming to aggregate data, and use a Spark ML model embedded in the pipeline
B.Use BigQuery streaming inserts and run scheduled queries that call the ML model
C.Use Cloud Dataflow with sliding windows to aggregate sensor readings every 5 minutes, then call a trained model hosted on AI Platform Prediction for each window
D.Use Cloud Functions triggered by Pub/Sub to process each sensor reading individually
AnswerC

Dataflow handles streaming and windowing natively, and AI Platform Prediction provides low-latency model serving.

Why this answer

Option C is correct because Cloud Dataflow's sliding windows natively handle the 5-minute aggregation requirement for streaming data, and its ability to call external services via a DoFn allows integration with AI Platform Prediction for real-time model scoring. This design aligns with the need for low-latency, scalable processing of Pub/Sub streams without managing infrastructure.

Exam trap

Google Cloud often tests the distinction between stream processing (Dataflow) and batch-oriented services (BigQuery scheduled queries), and the trap here is assuming that BigQuery's streaming inserts combined with scheduled queries can achieve real-time aggregation, when in fact scheduled queries introduce minutes of delay and are not window-aware for sliding time intervals.

How to eliminate wrong answers

Option A is wrong because Cloud Dataproc with Spark Streaming requires managing a cluster and embedding a Spark ML model in the pipeline, which adds operational overhead and does not leverage AI Platform's managed prediction service as specified. Option B is wrong because BigQuery streaming inserts and scheduled queries introduce latency (scheduled queries run at intervals, not in real time) and are not designed for per-window scoring of streaming data. Option D is wrong because Cloud Functions triggered by Pub/Sub process each sensor reading individually, which cannot aggregate data over 5-minute windows as required.
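
As a rough illustration of the windowing step, the sketch below assigns readings to 5-minute sliding windows and averages them per device, in plain Python rather than Apache Beam. The 60-second slide period, the `(device, timestamp, value)` tuple layout, and the helper names are assumptions; the question only fixes the 5-minute window length. In the real pipeline each window's aggregate would then be sent to the hosted model, which is not shown here.

```python
from collections import defaultdict


def sliding_windows(ts, size=300, period=60):
    """Start times (seconds) of every `size`-second window containing ts,
    where a new window starts every `period` seconds."""
    last_start = ts - ts % period
    return list(range(last_start, ts - size, -period))


def aggregate(readings, size=300, period=60):
    """Mean value per (device, window_start)."""
    buckets = defaultdict(list)
    for device, ts, value in readings:
        for window_start in sliding_windows(ts, size, period):
            buckets[(device, window_start)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Hypothetical readings: (device id, event time in seconds, sensor value).
readings = [("dev-1", 10, 20.0), ("dev-1", 130, 22.0), ("dev-1", 400, 30.0)]
means = aggregate(readings)
print(means[("dev-1", 0)], means[("dev-1", 120)])
```

Setting `period` equal to `size` reduces this to fixed (tumbling) 5-minute windows, where each reading lands in exactly one window.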

42
MCQmedium

A financial company needs to process batch trades data daily and ensure that if a transformation step fails, the entire daily run is retried from the beginning. Which design pattern is appropriate?

A.Use idempotent writes with checkpointing
B.Use an orchestrator like Cloud Composer with retry logic
C.Retry the failed step only
D.Use a transactional staging area
AnswerB

Cloud Composer (Airflow) lets the team define the daily run as a single DAG and re-run that whole DAG when a step fails.

Why this answer

Option B is correct because the requirement states that if any transformation step fails, the entire daily run must be retried from the beginning. An orchestrator like Cloud Composer (Apache Airflow) owns the whole workflow, so it can be configured to restart the full run on failure, for example by clearing and re-running the DAG run from a failure callback, since Airflow's built-in `retries` setting applies per task. This pattern is essential for maintaining data consistency when partial processing cannot be tolerated.

Exam trap

Google Cloud often tests the misconception that checkpointing or idempotent writes are sufficient for full-run retries, but the trap is that checkpointing enables partial resumption, not the complete restart from scratch that the question explicitly demands.

How to eliminate wrong answers

Option A is wrong because idempotent writes with checkpointing allow resumption from the last successful checkpoint, which contradicts the requirement to retry the entire run from the beginning; checkpointing is designed for partial retries, not full restarts. Option C is wrong because retrying only the failed step would leave the daily run in an inconsistent state, as earlier steps may have already committed partial results that cannot be rolled back without a full restart. Option D is wrong because a transactional staging area ensures atomic writes but does not provide orchestration or retry logic to restart the entire pipeline from the start upon failure.

43
MCQhard

A company runs a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. They experience a sudden spike in data volume causing BigQuery write throughput to be exceeded, resulting in errors. Which strategy should they implement to handle this gracefully?

A.Use a BigQuery sink with 'FAIL_FAST' error handling and set a dead-letter queue for failed writes.
B.Use a BigQuery sink with 'WRITE_APPEND' mode and set 'writeDisposition' to 'WRITE_APPEND'.
C.Use a BigQuery sink with 'WRITE_TRUNCATE' mode.
D.Use a BigQuery sink with 'CREATE_NEVER' write method.
AnswerA

Routes failed writes to a dead-letter queue for retry, avoiding pipeline stalls and data loss.

Why this answer

Using a BigQuery sink with 'FAIL_FAST' error handling and a dead-letter queue (A) allows the pipeline to route failed writes to a separate Pub/Sub topic for later retry, preventing data loss and backpressure. Options B and C change the write mode but don't handle errors. Option D prevents table creation but doesn't address throughput.
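
The dead-letter pattern can be sketched in plain Python. Everything here is hypothetical: `QuotaSink` stands in for a sink that starts rejecting writes under load, and a list stands in for the dead-letter topic. The sketch only shows the routing idea, not Dataflow's BigQuery connector.

```python
def write_with_dead_letter(rows, write_row):
    """Try each row once; collect failures instead of stopping the pipeline."""
    written, dead_letter = [], []
    for row in rows:
        try:
            write_row(row)
            written.append(row)
        except RuntimeError as err:
            dead_letter.append({"row": row, "error": str(err)})
    return written, dead_letter


class QuotaSink:
    """Hypothetical sink that rejects writes once its quota is used up."""

    def __init__(self, quota):
        self.quota = quota
        self.stored = []

    def write(self, row):
        if len(self.stored) >= self.quota:
            raise RuntimeError("quota exceeded")
        self.stored.append(row)


sink = QuotaSink(quota=3)
written, dead = write_with_dead_letter(range(5), sink.write)

# Later, once capacity is available again, replay the dead-letter rows.
sink.quota = 10
replayed, still_dead = write_with_dead_letter([d["row"] for d in dead], sink.write)
print(written, [d["row"] for d in dead], sink.stored)
```

The main loop keeps moving during the spike, and the rejected rows are preserved with their error message so they can be replayed afterwards.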

44
MCQeasy

A company uses Cloud Dataproc to run Spark ML jobs. The jobs are memory-intensive and often fail with OutOfMemory errors. Which action would most effectively reduce memory pressure without changing the Spark code?

A.Increase the number of worker nodes and reduce the number of cores per worker.
B.Increase the master node's memory.
C.Increase the number of Spark partitions.
D.Use preemptible VMs for workers.
AnswerA

More workers spread memory load.

Why this answer

Increasing the number of worker nodes while reducing the number of cores per worker reduces memory pressure by distributing the workload across more JVMs, each with a smaller heap. This lowers the per-executor memory requirement and reduces the risk of OutOfMemory errors without modifying Spark code. In Cloud Dataproc, this approach directly addresses memory contention by giving each executor fewer tasks to process concurrently.

Exam trap

Google Cloud often tests the misconception that adding more partitions or increasing master memory solves executor-level memory issues, when the real solution is to reduce per-executor task concurrency by adjusting the worker-to-core ratio.

How to eliminate wrong answers

Option B is wrong because increasing the master node's memory does not help with executor memory pressure; the master node handles cluster coordination and driver tasks, not the memory-intensive worker processing. Option C is wrong because increasing the number of Spark partitions can reduce the data size per task but does not directly reduce per-executor memory pressure and may increase scheduling overhead without addressing the root cause. Option D is wrong because using preemptible VMs for workers reduces cost but does not change memory allocation per worker; preemptible VMs can be reclaimed at any time, potentially causing job instability and not solving OutOfMemory errors.

45
MCQhard

A company is building a data lake on Cloud Storage with data from multiple sources. They need to apply schema-on-read and support ad-hoc SQL queries. Which architecture is most suitable?

A.Ingest to Cloud Spanner, query directly.
B.Ingest to Cloud SQL, then export to Cloud Storage for queries.
C.Ingest to Cloud Storage, create BigQuery external tables.
D.Ingest to Cloud Storage, load into Dataproc for queries.
AnswerC

Schema-on-read and SQL.

Why this answer

BigQuery external tables allow schema-on-read by defining the schema at query time over data stored in Cloud Storage, enabling ad-hoc SQL queries without loading data into a separate system. This architecture directly supports the requirement for schema-on-read and SQL-based analysis, as BigQuery provides a serverless, scalable SQL engine.

Exam trap

Google Cloud often tests the distinction between schema-on-read (BigQuery external tables) and schema-on-write (traditional databases like Cloud Spanner or Cloud SQL), where candidates mistakenly choose a transactional database for analytical workloads.

How to eliminate wrong answers

Option A is wrong because Cloud Spanner is a globally distributed, strongly consistent relational database designed for transactional workloads, not for schema-on-read or ad-hoc SQL queries over raw data in a data lake. Option B is wrong because Cloud SQL is a managed relational database for OLTP workloads, and exporting to Cloud Storage for queries adds unnecessary latency and complexity, failing to leverage schema-on-read directly. Option D is wrong because Dataproc is a managed Spark/Hadoop service that requires data loading and cluster management, which is not as efficient or serverless as BigQuery external tables for ad-hoc SQL queries on a data lake.

46
MCQhard

You are a data engineer at a global e-commerce company. Your team manages a real-time recommendation system that ingests user clickstream events from a Pub/Sub topic (topic-clickstream). The pipeline uses Dataflow to read events, join with user profile data from Cloud Bigtable, compute recommendations using a machine learning model hosted on Cloud Run, and write results to a BigQuery table for analytics. The pipeline has been running smoothly for months, but recently the Dataflow job started failing with the error: "Workflow failed. Causes: S01:ReadPubSub/Read+Transform/ParDo(ExtractUserID)+ ... (5a3b2c1d) The job failed because a worker encountered an out-of-memory error." The Dataflow job uses the Streaming Engine feature with a worker type of n2-standard-8 (8 vCPU, 32 GB memory) and autoscaling from 2 to 20 workers. The clickstream event rate has increased from 500 events/second to 5000 events/second over the past week. The user profile data in Bigtable has also grown, with average row size increasing from 1 KB to 10 KB due to additional fields. You need to resolve the out-of-memory errors without completely redesigning the pipeline. What should you do?

A.Increase the maximum number of workers in autoscaling from 20 to 50.
B.Change the worker machine type to n2-highmem-8 (8 vCPU, 64 GB memory) in the Dataflow job configuration.
C.Reduce the batch size in the Dataflow pipeline by setting the `max_batch_size` parameter to a lower value.
D.Increase the number of Bigtable nodes to improve read throughput.
AnswerB

Increasing memory per worker directly addresses OOM without major pipeline changes.

Why this answer

Option B is correct because the out-of-memory error is caused by the increased per-worker memory load from larger Bigtable rows (1 KB to 10 KB) and higher event throughput (500 to 5000 events/sec). Switching to n2-highmem-8 doubles the memory from 32 GB to 64 GB, giving each worker more headroom to cache user profiles and process larger batches without OOM. This directly addresses the root cause without redesigning the pipeline.

Exam trap

Google Cloud often tests the misconception that scaling out (more workers) solves memory issues, when in fact the per-worker memory limit is the bottleneck and must be increased via a higher-memory machine type.

How to eliminate wrong answers

Option A is wrong because increasing the maximum number of workers spreads the load across more machines but does not increase the memory per worker; each worker still has only 32 GB, so the same OOM condition persists on individual workers. Option C is wrong because reducing batch size lowers memory per batch but increases the number of batches and overhead, which can worsen performance and still not prevent OOM if the per-row memory footprint (10 KB) is the dominant factor. Option D is wrong because Bigtable node count affects read throughput and latency, not the memory consumed by the Dataflow worker when caching or processing rows; the OOM is on the Dataflow side, not Bigtable.

47
Multi-Selectmedium

A data engineer is designing a batch processing system using Cloud Dataproc. Which TWO practices improve performance and reduce costs? (Choose TWO.)

Select 2 answers
A.Always use persistent disks for all nodes.
B.Set autoscaling policies based on YARN memory.
C.Store intermediate data in HDFS.
D.Use preemptible VMs for worker nodes.
E.Use the largest machine types for master nodes.
AnswersB, D

Optimizes resource utilization.

Why this answer

Option B is correct because autoscaling policies based on YARN memory allow the cluster to dynamically add or remove worker nodes in response to actual resource demand from running jobs. This prevents over-provisioning (reducing costs) and ensures sufficient resources for job completion (improving performance), as Cloud Dataproc directly monitors YARN memory metrics to trigger scaling actions.

Exam trap

The trap here is that candidates often confuse HDFS with Cloud Storage, assuming intermediate data must be stored locally for performance, but Cloud Storage is actually faster and cheaper for transient data in Dataproc due to its native integration and lack of replication overhead.

48
MCQmedium

A data pipeline using Cloud Pub/Sub and Cloud Dataflow is experiencing duplicate messages. The source system publishes messages at least once. What Dataflow technique ensures exactly-once processing?

A.Use idempotent sinks
B.Use GlobalWindows
C.Set watermark threshold
D.Enable streaming engine
AnswerA

Idempotent sinks allow safe duplicate writes, achieving exactly-once.

Why this answer

Option A is correct because idempotent sinks ensure that even if Cloud Pub/Sub delivers the same message multiple times (due to its at-least-once delivery semantics), the Dataflow pipeline can safely reapply the same data without causing duplicates in the output. This is achieved by designing the sink to recognize and ignore repeated writes (e.g., BigQuery streaming inserts with an insertId for best-effort deduplication, or Cloud Storage objects with deterministic filenames), effectively providing exactly-once results downstream.

Exam trap

The trap here is that candidates confuse 'exactly-once processing' with 'exactly-once delivery' from the source, but Pub/Sub only guarantees at-least-once delivery, so the responsibility for deduplication falls on the Dataflow pipeline and its sink design, not on windowing or engine settings.

How to eliminate wrong answers

Option B is wrong because GlobalWindows groups all elements into a single window for batch-like processing, but it does not address message duplication; it only changes how data is windowed, not how duplicates are handled. Option C is wrong because setting a watermark threshold controls how long the pipeline waits for late data, which affects completeness and latency but does not prevent duplicate messages from being processed. Option D is wrong because enabling Streaming Engine improves scalability and reduces checkpoint latency in Dataflow, but it does not provide deduplication or exactly-once guarantees; duplicates can still occur from Pub/Sub's at-least-once delivery.
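
A minimal sketch of an idempotent sink in plain Python, assuming each message carries a stable ID. The `IdempotentSink` class and the sample deliveries are hypothetical; a real sink would key on something like a message ID or a deterministic object name.

```python
class IdempotentSink:
    """Stores each record under its message ID, so re-delivery is a no-op."""

    def __init__(self):
        self.rows = {}

    def write(self, message_id, payload):
        is_new = message_id not in self.rows
        # Writing the same key with the same value leaves the state unchanged.
        self.rows[message_id] = payload
        return is_new


# Hypothetical at-least-once delivery: message "m1" arrives twice.
deliveries = [("m1", {"v": 1}), ("m2", {"v": 2}), ("m1", {"v": 1})]
sink = IdempotentSink()
results = [sink.write(message_id, payload) for message_id, payload in deliveries]
print(results, len(sink.rows))
```

Three deliveries produce two stored rows, which is the exactly-once outcome the answer describes even though delivery itself was at-least-once.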

49
MCQhard

A large e-commerce company is migrating its on-premise Hadoop cluster to Google Cloud using Dataproc for batch processing. The cluster processes daily sales data from multiple sources, generates aggregated reports, and performs ad-hoc analysis. The migration is complete, but users report that jobs are running 30% slower than on-premise. The data is stored in Cloud Storage as Parquet files partitioned by date. The Dataproc cluster uses preemptible VMs for worker nodes, and the master node uses a standard VM. The jobs heavily rely on shuffling data between stages. The cluster's autoscaling is enabled with a minimum of 10 and a maximum of 50 workers. During job execution, CPU utilization on workers is low, but disk I/O is high, especially on local SSDs. The network utilization is moderate. The team suspects that the shuffle operation is causing the slowdown. Which action should the team take to improve job performance?

A.Attach additional local SSDs to each worker to increase local disk capacity and I/O throughput.
B.Enable Cloud Storage as a shuffle destination by setting the property `dataproc:dataproc.shuffle.direct` to `true` and ensure the cluster has appropriate IAM permissions.
C.Change all worker VMs from preemptible to standard VMs to avoid preemption and improve reliability.
D.Increase the maximum number of preemptible workers to 100 to provide more parallelism.
AnswerB

Cloud Storage shuffle can offload intermediate shuffle data to Cloud Storage, reducing local disk I/O and potentially improving overall shuffle performance, especially when local disks are saturated.

Why this answer

B is correct because the high disk I/O on local SSDs during shuffling indicates that the shuffle data is being written to local disk, which is a bottleneck. By enabling Cloud Storage as a shuffle destination via `dataproc:dataproc.shuffle.direct`, shuffle data is written directly to Cloud Storage, bypassing local disks and leveraging Google Cloud's high-throughput object storage. This reduces disk I/O contention and improves shuffle performance, especially when preemptible VMs are used, as shuffle data is not lost on VM preemption.

Exam trap

The trap here is that candidates often assume adding more local SSDs or increasing worker count will solve shuffle bottlenecks, but the real issue is the I/O bottleneck of local disks, and Cloud Storage shuffle is the specific Dataproc feature designed to offload shuffle data to a scalable, high-throughput object store.

How to eliminate wrong answers

Option A is wrong because attaching additional local SSDs increases capacity but does not address the root cause of high disk I/O during shuffling; the bottleneck is the local disk I/O itself, not capacity, and Cloud Storage shuffle provides better throughput. Option C is wrong because changing preemptible VMs to standard VMs improves reliability but does not directly address the shuffle I/O bottleneck; the performance issue is disk I/O, not VM preemption. Option D is wrong because increasing the maximum number of preemptible workers to 100 increases parallelism but does not reduce the disk I/O bottleneck during shuffling; more workers can actually increase shuffle traffic and exacerbate the problem.

50
MCQmedium

A BigQuery query fails with the error shown in the exhibit. What is the most likely cause?

A.The query scans too many partitions or data without efficient pruning
B.The table has too many partitions
C.The user does not have permission to query the table
D.Insufficient slot capacity in the project
AnswerA

SELECT * on a large table can exceed resource limits; partition pruning might help.

Why this answer

The error indicates that the query attempted to scan too many partitions or a large amount of data without effective partition pruning. BigQuery charges based on the amount of data processed, and queries that scan all partitions of a large table can hit limits or incur high costs. The most likely cause is that the query's WHERE clause does not filter on the partitioning column, forcing a full table scan across all partitions.

Exam trap

Google Cloud often tests the distinction between 'too many partitions' (a table design issue) and 'scanning too many partitions' (a query design issue), leading candidates to mistakenly choose the option about partition count rather than the lack of pruning.

How to eliminate wrong answers

Option B is wrong because having too many partitions does not directly cause a query failure; BigQuery supports up to 4,000 partitions per table, and the error is about scanning too many partitions, not the count itself. Option C is wrong because permission errors produce a distinct 'Access Denied' or 'Permission denied' message, not a partition scanning error. Option D is wrong because insufficient slot capacity results in 'Resources exceeded' or 'Query execution timed out' errors, not a partition scanning limit error.

51
Multi-Selecthard

A company uses Cloud Dataproc for large-scale Spark jobs. They notice that some jobs are failing due to insufficient memory on the worker nodes. They want to improve memory management without over-provisioning. Which three configurations should they apply? (Choose 3)

Select 3 answers
A.Set spark.executor.memory to a value that fits within the node memory
B.Enable Spark dynamic allocation
C.Use custom machine types with high memory ratios
D.Use local SSDs for temporary storage
E.Use preemptible worker nodes for volatile tasks
AnswersA, B, C

Prevents out-of-memory errors by ensuring executor memory fits worker capacity.

Why this answer

Option A is correct because setting spark.executor.memory to a value that fits within the node memory ensures that each executor does not exceed the available RAM on a worker node, preventing out-of-memory (OOM) errors. This configuration directly controls the heap size allocated to each executor, and when combined with spark.executor.cores and spark.executor.instances, it allows precise memory budgeting per node. Over-provisioning is avoided by calculating the maximum safe executor memory as (node memory - OS overhead - HDFS cache) / number of executors per node.
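
The budgeting formula above can be worked through with hypothetical numbers. The node size, overhead figures, and executor count below are assumptions chosen for illustration. The last step also assumes Spark's default off-heap overhead of about 10% of executor memory, which has to fit in the same per-executor budget.

```python
def max_executor_budget_gb(node_gb, os_overhead_gb, cache_gb, executors_per_node):
    """(node memory - OS overhead - HDFS cache) / number of executors per node."""
    return (node_gb - os_overhead_gb - cache_gb) / executors_per_node


# Hypothetical worker: 64 GB RAM, 4 GB for the OS, 4 GB cache, 4 executors.
budget_gb = max_executor_budget_gb(64, 4, 4, 4)

# The budget must hold heap plus memory overhead (assumed 10% of the heap),
# so the heap setting is the budget divided by 1.10.
heap_gb = budget_gb / 1.10
print(budget_gb, round(heap_gb, 2))
```

With these numbers each executor gets a 14 GB budget, so `spark.executor.memory` would be set to roughly 12 GB rather than the full 14 GB.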

Exam trap

Google Cloud often tests the distinction between memory management and storage optimization, so candidates mistakenly choose local SSDs (option D) thinking they help with memory, when in fact they only improve disk I/O for shuffle operations.

52
MCQhard

In Cloud Composer, a DAG has two tasks: task_A (runs an Apache Spark job on Dataproc) and task_B (loads data from Cloud Storage to BigQuery). task_B must start after task_A completes. The DAG is scheduled to run hourly. Sometimes task_B starts before task_A finishes because task_A's Dataproc job appears to complete in the Airflow metadata but the data is not yet available. What is the best way to ensure task_B only runs after the data is fully written?

A.Increase the number of retries for task_B
B.Use a sensor after task_A that checks for a specific file in Cloud Storage
C.Use DataprocJobOperator with a job_poll_interval and add a sensor to verify output
D.Change the DAG schedule to run every 30 minutes
AnswerC

DataprocJobOperator can poll the job status, and adding a sensor ensures data is written before proceeding.

Why this answer

Option C is correct because it addresses the root cause: the Dataproc job may report completion in Airflow metadata before the output data is fully written to Cloud Storage. By using DataprocJobOperator with a job_poll_interval, you ensure Airflow waits for the actual job completion on Dataproc, and adding a sensor to verify the output (e.g., checking for a success marker file or expected data in Cloud Storage) guarantees that task_B only starts after the data is fully available. This two-step approach prevents race conditions between job completion and data consistency.

Exam trap

The trap here is that candidates assume a job's completion status in Airflow metadata is sufficient to guarantee data availability, overlooking the fact that Dataproc job completion and finalization of the output files are not atomic.

How to eliminate wrong answers

Option A is wrong because increasing retries for task_B does not solve the data availability issue; it only retries a task that may fail repeatedly due to missing data, wasting resources and time. Option B is wrong because using a sensor after task_A that checks for a specific file in Cloud Storage is a partial solution—it does not address the fact that the Dataproc job may not have fully completed, and the file check could succeed before all data is written if the file is created early. Option D is wrong because changing the DAG schedule to run every 30 minutes does not fix the dependency timing; it only increases execution frequency, potentially causing more overlaps and still allowing task_B to start before data is ready.
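
The "sensor that verifies output" idea can be sketched as a polling loop in plain Python. A local temporary directory stands in for the Cloud Storage output path, and the `wait_for_marker` helper and its parameters are hypothetical. `_SUCCESS` is the marker file that Hadoop and Spark output committers conventionally write last.

```python
import os
import tempfile
import time


def wait_for_marker(path, poke_interval=0.01, timeout=1.0):
    """Poll until `path` exists. Return True if found, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poke_interval)
    return False


with tempfile.TemporaryDirectory() as out_dir:
    marker = os.path.join(out_dir, "_SUCCESS")

    # Output not finalized yet: the sensor times out, so task_B must not run.
    before = wait_for_marker(marker, timeout=0.05)

    # The job writes the marker only after all data files are in place.
    open(marker, "w").close()
    after = wait_for_marker(marker)

print(before, after)
```

Checking for a marker written at the very end of the job, rather than for the first data file, is what avoids the early-file problem noted for Option B.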

53
Matchingmedium

Match each Google Cloud service to its data processing capability.

Concepts and matches

Dataflow: Unified stream and batch processing (Apache Beam)

Dataproc: Managed Spark and Hadoop clusters

Cloud Composer: Workflow orchestration (Apache Airflow)

Cloud Data Fusion: Visual data integration and pipeline builder

Why these pairings

Services for data processing and orchestration.

54
MCQmedium

A data pipeline reading from Cloud Storage and writing to BigQuery using Dataflow is experiencing high cost. The data is CSV and needs schema inference. What change reduces cost?

A.Use Dataproc instead of Dataflow
B.Use Cloud Functions to transform data
C.Use BigQuery load jobs with schema auto-detection
D.Use BigQuery Data Transfer Service
AnswerC

Load jobs are free for data ingestion (only storage cost) and support auto-detection.

Why this answer

Option C is correct because BigQuery load jobs with schema auto-detection can directly ingest CSV files from Cloud Storage without the need for a Dataflow pipeline, eliminating the compute cost associated with Dataflow. Schema auto-detection infers column names and types from the CSV header and data, matching the requirement for schema inference while being a serverless, no-cost-for-compute operation (you only pay for storage and querying). This reduces cost by removing the Dataflow processing step entirely.

Exam trap

Google Cloud often tests the misconception that any data transformation or schema inference requires a processing framework like Dataflow or Dataproc, when in fact BigQuery's native load jobs with auto-detection can handle many CSV ingestion scenarios at zero compute cost.

How to eliminate wrong answers

Option A is wrong because Dataproc is a managed Spark/Hadoop service that incurs compute costs for cluster VMs, and using it instead of Dataflow would not reduce cost—it would likely increase cost due to cluster overhead and the need to manage schema inference manually. Option B is wrong because Cloud Functions are event-driven compute that would still require processing each CSV file, incurring invocation and execution costs, and they lack native schema inference for BigQuery, requiring custom code that adds complexity and potential cost. Option D is wrong because BigQuery Data Transfer Service is designed for scheduled transfers from sources like Google Ads, Amazon S3, or SaaS applications, not for ad-hoc CSV files in Cloud Storage; it does not support schema auto-detection for arbitrary CSV files and would not replace the need for a pipeline.
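
As a rough illustration of what replaces the Dataflow pipeline, the sketch below builds the request body for a BigQuery load job with schema auto-detection. The field names follow the load section of the BigQuery jobs REST resource as I understand it; the helper name, bucket, project, dataset, and table are hypothetical placeholders, and nothing here submits the job.

```python
def csv_autodetect_load_config(source_uri, project_id, dataset_id, table_id):
    """Build the body of a BigQuery load job that infers the schema from CSV.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "sourceFormat": "CSV",
                "autodetect": True,  # let BigQuery infer column names and types
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


# Hypothetical names, for illustration only.
job_body = csv_autodetect_load_config(
    "gs://example-bucket/exports/*.csv", "example-project", "analytics", "events"
)
```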

55
MCQmedium

A team runs a Dataflow streaming pipeline that reads from Pub/Sub, windows events by processing time, and writes to BigQuery. Some late-arriving events are being dropped. The requirement is to include all events that arrive within 10 minutes of the watermark. Which pipeline configuration should be used?

A.Use sliding windows with no allowed lateness
B.Use fixed windows with withAllowedLateness(Duration.standardMinutes(10))
C.Use fixed windows with withAllowedLateness(Duration.standardSeconds(10))
D.Switch from processing time to event time and use default triggers
AnswerB

Allows late data up to 10 minutes after watermark.

Why this answer

Option B is correct because `withAllowedLateness(Duration.standardMinutes(10))` on a fixed window allows late-arriving events to be included up to 10 minutes after the watermark passes the window's end. This directly meets the requirement to retain events arriving within 10 minutes of the watermark, while still using processing-time windows as specified.

Exam trap

Google Cloud often tests the distinction between processing time and event time, and the exact value of allowed lateness, tricking candidates into choosing a shorter duration or the wrong window type.

How to eliminate wrong answers

Option A is wrong because sliding windows with no allowed lateness will drop all late events, failing the requirement to include events within 10 minutes of the watermark. Option C is wrong because `withAllowedLateness(Duration.standardSeconds(10))` only allows 10 seconds of lateness, not the required 10 minutes. Option D is wrong because switching to event time would change the windowing basis from processing time, which is not requested, and default triggers alone do not provide the explicit 10-minute lateness allowance needed.
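
The difference between options B and C comes down to one comparison. The following is a toy model of that rule in plain Python, not Beam code: `is_accepted` is a hypothetical helper, and the exact boundary behaviour in a real runner may differ.

```python
from datetime import datetime, timedelta


def is_accepted(window_end, watermark, allowed_lateness):
    """Return True if an element for this window would still be accepted.

    Toy rule: data for a window is dropped once the watermark has moved
    past the end of the window plus the allowed lateness.
    """
    return watermark <= window_end + allowed_lateness


window_end = datetime(2024, 1, 1, 12, 0)
watermark = datetime(2024, 1, 1, 12, 5)  # watermark is 5 minutes past the window

with_ten_minutes = is_accepted(window_end, watermark, timedelta(minutes=10))
with_ten_seconds = is_accepted(window_end, watermark, timedelta(seconds=10))
```

With a 10-minute allowance the element is kept; with a 10-second allowance the same element is dropped.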

56
MCQhard

A company uses Cloud Storage to store IoT sensor data in JSON format. The data is ingested using a Cloud Function triggered by Cloud Storage events. They notice that when many files are uploaded simultaneously, some files are not processed and the Cloud Function logs show 'function execution timeout'. What is the most likely cause and solution?

A.The Cloud Function is not idempotent; implement idempotency.
B.The Cloud Storage event notification is unreliable; switch to Pub/Sub notifications.
C.The Cloud Function has too few instances; increase max instances.
D.The Cloud Function's timeout is too short; increase timeout beyond 540 seconds.
AnswerD

Increasing the timeout allows the function to complete its processing within the allocated time.

Why this answer

The Cloud Function logs explicitly show 'function execution timeout', which indicates the function is exceeding its configured maximum runtime. The default Cloud Functions timeout is 60 seconds, and the maximum is 540 seconds (9 minutes). When many files are uploaded simultaneously, each function invocation may take longer due to increased processing load, causing timeouts.

Increasing the timeout to the maximum of 540 seconds gives the function more time to complete processing, directly addressing the logged error.

Exam trap

Google Cloud often tests the distinction between scaling issues (max instances) and timeout issues, so the trap here is that candidates see 'many files uploaded simultaneously' and incorrectly assume a concurrency/scaling problem, when the logs explicitly point to a timeout.

How to eliminate wrong answers

Option A is wrong because idempotency ensures duplicate events don't cause duplicate processing, but the logs show timeouts, not duplicate processing errors. Option B is wrong because Cloud Storage event notifications are reliable for triggering Cloud Functions; switching to Pub/Sub adds a buffer but does not solve the timeout issue. Option C is wrong because increasing max instances would allow more concurrent invocations, but the problem is that individual invocations are timing out, not that there are too few instances to handle the load.

57
Multi-Selectmedium

A healthcare company stores patient records as JSON files in Cloud Storage for analysis. They want to design a data lake that enables querying the data with BigQuery while minimizing storage costs and maintaining data security. Which two actions should they take? (Choose two.)

Select 2 answers
A.Partition the data by date and store in separate directories for each partition.
B.Configure object lifecycle management to transition files older than 90 days to Nearline storage.
C.Convert all JSON files to CSV to reduce storage size.
D.Use BigLake to create external tables with row-level security and access delegation.
E.Enable Cloud KMS to encrypt the data with customer-managed encryption keys.
AnswersB, D

Lifecycle policies automatically move data to cheaper storage classes, reducing cost.

Why this answer

The correct answers are B and D. Option D is correct because BigLake allows BigQuery to query Cloud Storage data with fine-grained access control and supports various formats. Option B is correct because object lifecycle management can move old data to colder storage classes (e.g., Nearline, Coldline) to reduce costs.

Option E is incorrect because data is already encrypted by default; Cloud KMS provides additional control over keys but is not a cost-saving measure. Option C is incorrect because CSV is less efficient for nested data; JSON or Parquet is better. Option A is incorrect because a date-partitioned directory layout can help queries skip files, but it does not reduce storage cost or add access control, so it is not one of the two primary cost or security actions.

58
MCQmedium

A company processes real-time clickstream data from websites. They need to aggregate user sessions that may span multiple hours and handle events that arrive late due to network delays. The pipeline must avoid discarding late data. Which Dataflow feature should they configure?

A.Use fixed windows with a trigger that fires after every element
B.Use session windows with a gap duration and allow late data with a suitable allowed_lateness
C.Use the GlobalWindow with a watermark
D.Use sliding windows with no allowed lateness
AnswerB

Session windows group events within a gap, and allowed_lateness accommodates late arrivals.

Why this answer

Session windows are ideal for aggregating user sessions that span multiple hours, as they group events based on a gap duration of inactivity. By configuring `allowed_lateness`, the pipeline can handle late-arriving events without discarding them, ensuring completeness. This directly addresses the requirement to avoid discarding late data while aggregating sessions.

Exam trap

Google Cloud often tests the distinction between window types and late-data handling; the trap here is that candidates might choose fixed or sliding windows without realizing they lack the session-gap logic needed for variable-length user sessions, or they might overlook the `allowed_lateness` parameter as the key to preserving late data.

How to eliminate wrong answers

Option A is wrong because fixed windows with a trigger after every element would create a new window per event, failing to aggregate sessions that span hours and not handling late data properly. Option C is wrong because GlobalWindow with a watermark is used for global aggregations (e.g., counting all events) but does not naturally group events into sessions based on inactivity gaps; it would require complex triggers and does not inherently support sessionization. Option D is wrong because sliding windows with no allowed lateness would discard any late-arriving events, violating the requirement to avoid discarding late data.
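
To make the gap-duration idea concrete, here is a toy sessionizer in plain Python. It is not Beam code and `sessionize` is a hypothetical helper; it only shows how events closer together than the gap fall into one session, and how a late event still joins its session when the grouping is recomputed.

```python
def sessionize(timestamps, gap):
    """Group event timestamps into sessions.

    A new session starts when an event arrives `gap` or more time units
    after the previous event.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # within the gap: same session
        else:
            sessions.append([ts])  # gap exceeded: start a new session
    return sessions


on_time = sessionize([0, 5, 30, 32, 100], gap=10)
# The event at t=12 arrives late but still belongs to the first session.
with_late_event = sessionize([0, 5, 30, 32, 100, 12], gap=10)
```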

59
MCQeasy

A logistics company uses Cloud Functions to process incoming tracking events from IoT devices. Events are sent via HTTP triggers. During peak hours, some events fail with 500 errors. What is the best strategy to handle this reliably?

A.Implement client-side retry with exponential backoff.
B.Increase the Cloud Functions timeout to 9 minutes and memory to 2GB.
C.Switch to Cloud Tasks and configure retry parameters.
D.Use Cloud Pub/Sub as an intermediary: send events to Pub/Sub and trigger Cloud Functions via Pub/Sub subscription.
AnswerD

Pub/Sub provides buffering and retries.

Why this answer

Option D is correct because Cloud Pub/Sub decouples event ingestion from processing, providing at-least-once delivery and built-in retry with exponential backoff. This ensures that HTTP 500 errors from Cloud Functions are automatically retried without data loss, even during peak loads, and the Pub/Sub subscription can be configured with a dead-letter queue for persistent failures.

Exam trap

Google Cloud often tests the misconception that client-side retry (Option A) or increasing resource limits (Option B) is sufficient for reliability, when the core requirement is decoupling ingestion from processing to handle transient failures and scale independently.

How to eliminate wrong answers

Option A is wrong because client-side retry with exponential backoff shifts the burden to IoT devices, which may be resource-constrained or unreliable, and does not guarantee delivery if the client fails or disconnects. Option B is wrong because increasing timeout and memory does not address the root cause of 500 errors (e.g., transient backend failures or throttling) and can increase costs without improving reliability. Option C is wrong because Cloud Tasks is designed for HTTP target tasks with retries, but it still relies on the Cloud Functions HTTP endpoint, which can fail under load; Cloud Tasks does not provide the same buffering and decoupling as Pub/Sub for event-driven ingestion.
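
The retry behaviour described above can be pictured with a generic exponential backoff schedule. This is a plain Python sketch of the general technique, not Pub/Sub's internal algorithm; `backoff_schedule` is a hypothetical helper, and the 10 and 600 second bounds are illustrative values for a subscription's minimum and maximum backoff.

```python
def backoff_schedule(min_backoff, max_backoff, attempts):
    """Return the delay before each retry, doubling up to max_backoff."""
    return [min(max_backoff, min_backoff * 2 ** n) for n in range(attempts)]


delays = backoff_schedule(min_backoff=10, max_backoff=600, attempts=8)
```

Because the broker holds the message between attempts, a burst of 500 errors delays processing but does not lose the event.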

60
MCQhard

A Dataflow pipeline as described in the exhibit has increasing lag. Which optimization is most likely to reduce the lag?

A.Use FileLoads instead of StreamingInserts for BigQuery output
B.Increase the number of workers
C.Use global windows instead of fixed windows
D.Add additional ParDo transforms
AnswerA

FileLoads (batch loads) are more efficient for high throughput and reduce lag.

Why this answer

The exhibit shows increasing lag in a Dataflow pipeline writing to BigQuery. StreamingInserts (the default) use BigQuery's legacy streaming insert API (tabledata.insertAll), which can throttle under high throughput, causing backpressure and lag. Switching to FileLoads writes data to temporary files in Cloud Storage and then loads them into BigQuery via batch load jobs, which decouples the write path from the streaming insert quota and reduces lag by avoiding per-row insert limits.

Exam trap

Google Cloud often tests the misconception that scaling workers or changing windowing fixes all performance issues, but the trap here is that the lag is specifically caused by the BigQuery sink's streaming insert throttling, which requires a sink-level optimization like FileLoads.

How to eliminate wrong answers

Option B is wrong because increasing the number of workers can help with parallel processing but does not address the root cause of lag from BigQuery streaming insert quota exhaustion or throttling; it may even increase the rate of inserts and worsen the problem. Option C is wrong because using global windows instead of fixed windows does not affect the write path to BigQuery; windowing changes how data is grouped for aggregation but does not reduce lag caused by the sink's throughput limitations. Option D is wrong because adding additional ParDo transforms increases the processing steps and can introduce more latency, making the lag worse rather than reducing it.

61
MCQhard

A media company uses Cloud Dataflow to process video metadata from a Pub/Sub stream. The pipeline enriches metadata using a lookup table stored in Cloud Bigtable. Recently, they noticed increased latency and occasional 'Bigtable operation timeout' errors. The Bigtable instance has 3 nodes and the data is highly distributed. The Dataflow pipeline uses default settings. What is the most likely cause of the timeouts?

A.The Bigtable table uses a single column family with over 100 columns, leading to high read overhead
B.The Dataflow pipeline uses a large batch size for Bigtable reads, overwhelming the instance
C.The Bigtable cluster has too few nodes for the read throughput
D.The Dataflow pipeline does not cache Bigtable results, causing repeated lookups
AnswerA

Wide column families cause inefficient reads in Bigtable.

Why this answer

With a single column family holding over 100 columns, an unfiltered read returns every column qualifier for each row, even if only a few are needed. This increases read overhead and latency, and can trigger 'operation timeout' errors when the Dataflow pipeline's default settings (which do not limit column qualifiers) request the entire row. The highly distributed data and 3-node cluster exacerbate the issue, but the root cause is the excessive column count within one family.

Exam trap

Google Cloud often tests the misconception that Bigtable timeouts are always due to insufficient nodes or throughput, when in fact the root cause can be inefficient schema design like a single column family with too many columns.

How to eliminate wrong answers

Option B is wrong because the pipeline uses default settings, so nothing in the scenario points to an oversized read batch overwhelming the instance. Option C is wrong because 3 nodes for a highly distributed dataset is generally sufficient for moderate throughput; the timeouts are due to per-row read overhead, not node count. Option D is wrong because caching Bigtable results would not help with timeouts caused by reading too many columns per row; caching reduces repeated lookups but does not address the fundamental read amplification from a wide column family.

62
MCQmedium

A company is migrating their on-premises Apache Spark jobs to Dataproc. They want to minimize code changes and take advantage of serverless infrastructure. Which Dataproc feature should they use?

A.Dataproc clusters with preemptible VMs
B.Dataproc Workflow Templates
C.Dataproc Serverless Spark
D.Dataproc Jobs API with custom machine types
AnswerC

Serverless Spark runs jobs without cluster management and is compatible with existing Spark code.

Why this answer

Dataproc Serverless Spark is the correct choice because it allows the company to run Spark workloads without provisioning or managing clusters, minimizing code changes by using the same Spark APIs and libraries. This serverless infrastructure automatically scales resources and handles failures, aligning with the goal of reducing operational overhead while maintaining compatibility with existing Spark jobs.

Exam trap

Google Cloud often tests the distinction between 'serverless' and 'managed' services; the trap here is that candidates may confuse Dataproc Workflow Templates or Jobs API with serverless capabilities, but those still require cluster management, whereas Dataproc Serverless Spark truly abstracts the infrastructure.

How to eliminate wrong answers

Option A is wrong because preemptible VMs are cost-effective but still require managing a cluster and do not provide serverless infrastructure; they are prone to termination, which can disrupt jobs without proper checkpointing. Option B is wrong because Workflow Templates orchestrate job sequences on existing clusters but do not eliminate cluster management or provide serverless execution. Option D is wrong because the Dataproc Jobs API with custom machine types still requires a running cluster to submit jobs, thus not achieving serverless infrastructure or minimizing cluster management.

63
MCQmedium

A financial services company uses a Dataflow streaming pipeline to process real-time stock trades. The pipeline reads from Pub/Sub, enriches with reference data from Cloud Bigtable, and writes to BigQuery. Recently, they noticed an increase in processing latency during market open hours. Investigation shows that the pipeline is data-skewed: a few stock symbols generate 90% of the traffic. The team wants to reduce latency without changing the pipeline structure. What should they do?

A.Increase the Pub/Sub subscription flow control to buffer less data
B.Use event-time windows based on trade timestamp to spread data
C.Enable Dataflow Streaming Engine to dynamically repartition work
D.Increase the number of workers and use more CPU
AnswerC

Streaming Engine handles hot keys by splitting processing across workers.

Why this answer

Option C is correct because Streaming Engine separates compute from state storage, allowing better handling of hot keys. Option D is wrong because more workers may not help if the hot key bottleneck is within a single worker. Option B is wrong because reshuffling is already happening; using a different window doesn't fix skew.

Option A is wrong because changing Pub/Sub flow control only changes how much data is buffered; it does not address the skew.

64
MCQmedium

Your company is building a real-time fraud detection system using Google Cloud. Transactions are streamed into Pub/Sub, and you need to process them with low latency (under 100ms per event) and aggregate data over sliding windows. Which Google Cloud service is best suited for this processing logic?

A.Dataflow
B.BigQuery streaming inserts with scheduled queries
C.Dataproc with Spark Streaming
D.Cloud Functions
AnswerA

Dataflow provides exactly-once, low-latency stream processing with native sliding window support.

Why this answer

Dataflow is the best choice because it provides a unified stream and batch processing model with native support for Pub/Sub, exactly-once semantics, and low-latency sliding window aggregations. Its autoscaling and millisecond-level checkpointing enable sub-100ms per event processing, which is critical for real-time fraud detection.

Exam trap

Google Cloud often tests the misconception that BigQuery streaming inserts can handle real-time per-event processing, but candidates overlook that scheduled queries add latency and BigQuery is not designed for stateful per-event aggregations with sliding windows.

How to eliminate wrong answers

Option B is wrong because BigQuery streaming inserts with scheduled queries cannot achieve sub-100ms latency per event; scheduled queries run on a periodic basis (e.g., every minute), introducing significant delay, and BigQuery is optimized for analytical queries, not per-event low-latency processing. Option C is wrong because Dataproc with Spark Streaming introduces higher startup and shuffle overhead, typically achieving latencies in the seconds range, and requires manual cluster management, making it unsuitable for consistent sub-100ms per event. Option D is wrong because Cloud Functions has a maximum timeout of 9 minutes and is designed for stateless, short-lived tasks; it lacks built-in support for stateful sliding window aggregations and cannot maintain per-key state across events without external services.
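
For readers unfamiliar with sliding windows, the sketch below shows how a single event lands in several overlapping windows. It is a toy model in plain Python rather than Beam code; `sliding_windows` is a hypothetical helper and the window boundaries are assumed to be half-open, [start, end).

```python
def sliding_windows(timestamp, size, period):
    """Return every [start, end) window of length `size`, with a new
    window starting every `period` units, that contains `timestamp`."""
    start = timestamp - (timestamp % period)  # latest window start <= timestamp
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows


# A transaction at t=65s with 60s windows that slide every 30s.
assigned = sliding_windows(65, size=60, period=30)
```

Each event belongs to size / period windows (two here), which is why sliding-window aggregations cost more state than fixed windows of the same length.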

65
Multi-Selecthard

A company is migrating an on-premises Hadoop cluster to Google Cloud. They need to run existing Spark jobs with minimal modification. Which THREE strategies should they consider? (Choose THREE.)

Select 3 answers
A.Migrate to BigQuery for all analytics.
B.Use Cloud Dataproc with Spark and Hive components.
C.Store data in Cloud Storage instead of HDFS.
D.Rewrite Spark jobs as Dataflow pipelines.
E.Use Dataproc Jobs API to submit jobs.
AnswersB, C, E

Compatible with existing code.

Why this answer

Option B is correct because Cloud Dataproc is a managed Spark and Hadoop service that supports the same Spark and Hive components used on-premises, allowing existing Spark jobs to run with minimal modification. It provides native integration with Cloud Storage, which can replace HDFS without changing job logic, and the Dataproc Jobs API enables programmatic job submission, preserving existing workflows.

Exam trap

The trap here is that candidates may assume BigQuery or Dataflow are the only Google Cloud data processing options, overlooking that Dataproc is specifically designed for minimal-change migrations of existing Spark/Hadoop workloads.

66
Multi-Selecthard

A company is building a real-time analytics pipeline with Pub/Sub and Dataflow. Which THREE best practices should they follow?

Select 3 answers
A.Use event time processing with watermarks and allowed lateness
B.Design idempotent sinks to handle duplicate outputs
C.Use exactly-once processing for all transforms
D.Use at-least-once delivery with deduplication in the pipeline
E.Use event time processing only for batch pipelines
AnswersA, B, D

Event time processing supports out-of-order data and ensures accurate windowing.

Why this answer

Option A is correct because in streaming pipelines, event time processing with watermarks and allowed lateness is essential for handling out-of-order data. Watermarks track the progress of event time, and allowed lateness specifies how long to wait for late-arriving data before considering it as late, ensuring accurate windowed aggregations.

Exam trap

Google Cloud often tests the misconception that exactly-once processing must be applied uniformly across all pipeline transforms, when in practice it is only required at sinks and can be replaced by at-least-once with deduplication for better performance.
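
Options B and D both rely on the same idea: a redelivered message must not produce a second output. A minimal sketch in plain Python, assuming each message carries a stable ID; `IdempotentSink` is a hypothetical class, not part of any Google Cloud library.

```python
class IdempotentSink:
    """Toy sink keyed by message ID, so redelivered messages are no-ops."""

    def __init__(self):
        self.rows = {}

    def write(self, message_id, payload):
        """Store the payload once; return False for a duplicate delivery."""
        if message_id in self.rows:
            return False
        self.rows[message_id] = payload
        return True


sink = IdempotentSink()
first = sink.write("m1", {"clicks": 3})
redelivered = sink.write("m1", {"clicks": 3})  # at-least-once redelivery
```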

67
MCQeasy

A startup is building a real-time dashboard that shows aggregated metrics from social media feeds. They expect up to 10,000 events per second. The data must be near-real-time (< 30 seconds latency) and stored in BigQuery for historical analysis. They have limited experience managing infrastructure. The CTO suggests using Apache Kafka on Compute Engine for ingestion. However, the data engineer recommends a fully managed solution. Which approach should the team adopt?

A.Use Cloud Functions to ingest events directly into BigQuery
B.Use Apache Kafka on Compute Engine for ingestion, then use Dataflow to write to BigQuery
C.Use Cloud Pub/Sub for ingestion and Cloud Dataflow for streaming into BigQuery
D.Use App Engine to receive events and write to BigQuery
AnswerC

Fully managed, scales automatically, low operations overhead.

Why this answer

Option C is correct because Cloud Pub/Sub provides a fully managed, scalable ingestion service that can handle 10,000+ events per second without infrastructure management, and Cloud Dataflow offers exactly-once, auto-scaling streaming into BigQuery with sub-30-second latency. This combination meets the near-real-time requirement while eliminating operational overhead, aligning with the data engineer's recommendation for a fully managed solution.

Exam trap

The trap here is that candidates may choose Option B (Kafka on Compute Engine) because Kafka is a common streaming tool, but the question emphasizes limited infrastructure experience and a fully managed solution, making the self-managed Kafka approach a distraction that ignores operational overhead.

How to eliminate wrong answers

Option A is wrong because Cloud Functions has a maximum invocation timeout of 9 minutes and is designed for event-driven, short-lived tasks, not sustained high-throughput ingestion of 10,000 events per second; it would also lack buffering and retry mechanisms for streaming into BigQuery. Option B is wrong because managing Apache Kafka on Compute Engine requires significant operational expertise for cluster setup, partitioning, and monitoring, contradicting the team's limited experience and the goal of a fully managed solution. Option D is wrong because App Engine is a web application platform, not a streaming ingestion service; it would introduce HTTP overhead and scaling bottlenecks for high-velocity event streams, and writing directly to BigQuery from App Engine would risk data loss without a buffer.

68
MCQeasy

Given the query plan, what is the most likely reason this query is efficient despite processing 10 billion rows?

A.The query uses a wildcard function.
B.The table is partitioned by sale_date.
C.The table is materialized.
D.The table is clustered by product_id.
AnswerB

Partition pruning removes irrelevant partitions, reducing scanned data from billions of rows to only those in the date range.

Why this answer

Option B is correct because partitioning by sale_date enables partition pruning, which allows the query engine to scan only the relevant partitions instead of the entire 10-billion-row table. This drastically reduces the amount of data read and processed, making the query efficient even with a large total row count.

Exam trap

Google Cloud often tests the distinction between partitioning (which reduces scanned rows via pruning) and clustering (which sorts data within each partition and helps only queries that filter or aggregate on the clustered columns), leading candidates to mistakenly choose clustering as the primary efficiency driver.

How to eliminate wrong answers

Option A is wrong because using a wildcard function (e.g., SELECT *) typically increases I/O and processing overhead by reading all columns, which would not improve efficiency. Option C is wrong because a materialized table is a precomputed snapshot that can speed up queries, but it does not inherently reduce the number of rows scanned; the efficiency gain here comes from partition pruning, not materialization. Option D is wrong because clustering by product_id organizes data within partitions for better compression and filter performance, but without partition pruning, the query would still need to scan all 10 billion rows, so clustering alone does not explain the efficiency.
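
Partition pruning can be illustrated with a toy table split by date. The sketch below is plain Python, not BigQuery; `scan` is a hypothetical helper and the table contents are made up. It only shows that a date filter limits the work to the matching partitions.

```python
from datetime import date


def scan(partitions, start, end):
    """Return (rows_read, partitions_read) when only partitions whose
    date falls in [start, end] are scanned."""
    selected = {d: rows for d, rows in partitions.items() if start <= d <= end}
    rows_read = sum(len(rows) for rows in selected.values())
    return rows_read, len(selected)


# Hypothetical table: 30 daily partitions of 100 rows each.
table = {date(2024, 1, day): [0] * 100 for day in range(1, 31)}
pruned = scan(table, date(2024, 1, 10), date(2024, 1, 12))
full = scan(table, date.min, date.max)
```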

69
Multi-Selectmedium

You are designing a streaming Dataflow pipeline that processes high-throughput data. Which two features can help minimize cost? (Choose TWO.)

Select 2 answers
A.Enable autoscaling based on CPU utilization
B.Use batch loads to BigQuery for streaming inserts
C.Enable Streaming Engine to decouple compute and storage
D.Use preemptible VMs for all workers
E.Use a global window and batch output to BigQuery every hour
AnswersA, C

Autoscaling adjusts the number of workers to meet demand, avoiding over-provisioning and reducing cost.

Why this answer

Option A is correct because enabling autoscaling based on CPU utilization allows the Dataflow pipeline to dynamically adjust the number of worker instances in response to the actual processing load. This prevents over-provisioning during low-throughput periods, directly reducing compute cost while maintaining performance during spikes.

Exam trap

Google Cloud often tests the misconception that preemptible VMs are always cost-effective for streaming workloads, but the trap here is that preemptible VMs are unsuitable for stateful streaming pipelines due to frequent preemption causing data reprocessing and instability.

70
MCQhard

A data analyst frequently queries a BigQuery table that contains an array of structs representing product purchases. The query below runs slowly: SELECT customer_id, COUNT(purchase) as total_purchases FROM sales, UNNEST(purchases) as purchase GROUP BY customer_id What change would most improve query performance?

A.Create a materialized view that pre-aggregates by customer_id and purchase count
B.Partition the table by transaction date
C.Use a subquery to filter purchases first
D.Cluster the table by purchases.product_id
AnswerA

A materialized view pre-computes the aggregation, so queries read the view instead of scanning the full table.

Why this answer

The query runs slowly because it must unnest the `purchases` array for every row and then aggregate. A materialized view pre-aggregates the data by `customer_id` and purchase count, avoiding repeated full scans and unnesting. This is the most impactful optimization because it eliminates the compute cost of UNNEST and GROUP BY at query time.

Exam trap

Google Cloud often tests the misconception that any indexing or partitioning strategy (like clustering or partitioning) universally speeds up all queries, when in fact the fix must target the specific expensive operation — here, the UNNEST and GROUP BY — rather than adding a generic optimization.

How to eliminate wrong answers

Option B is wrong because partitioning by transaction date does not help this query — there is no WHERE clause filtering by date, so all partitions would still be scanned. Option C is wrong because a subquery to filter purchases first does not reduce the amount of data that must be unnested or aggregated; it adds a nested scan without addressing the core performance bottleneck. Option D is wrong because clustering by purchases.product_id would only improve queries that filter or group by that field, but this query groups by customer_id, not product_id.

71
MCQmedium

A data team uses Cloud Dataproc to run nightly Spark jobs. The job volume has increased, and the cluster is often underutilized during the day. They want to reduce costs while ensuring jobs can scale when needed. Which strategy should they adopt?

A.Use preemptible workers for both primary and secondary nodes to minimize cost.
B.Manually scale the cluster up before nightly jobs and down after.
C.Use a cluster with a small number of primary workers and a large pool of preemptible workers, and enable autoscaling.
D.Use custom machine types with local SSDs for primary workers to improve I/O.
AnswerC

Preemptible workers are cheap, and autoscaling adjusts to load.

Why this answer

Option C is correct because it combines a small number of primary (non-preemptible) workers for reliability with a large pool of preemptible workers for cost-effective scaling, and enables autoscaling to dynamically adjust the cluster size based on workload. This minimizes cost during idle periods (preemptible instances are ~80% cheaper) while ensuring jobs can scale up quickly when needed, as autoscaling adds preemptible workers automatically. Preemptible workers are ideal for fault-tolerant Spark jobs that can handle node preemptions.

Exam trap

Google Cloud often tests the misconception that preemptible instances can be used for all nodes, but the trap here is that primary nodes require non-preemptible instances for cluster stability, while preemptible workers are only suitable for secondary (task) nodes in a fault-tolerant framework.

How to eliminate wrong answers

Option A is wrong because Cloud Dataproc does not allow preemptible VMs as primary workers. Primary workers host HDFS storage and must be standard VMs, so that a preemption cannot take cluster data with it; only secondary workers can be preemptible. Option B is wrong because manual scaling is inefficient and error-prone for a nightly job pattern; it requires human intervention and cannot react to sudden workload spikes, leading to either underutilization or job delays. Option D is wrong because custom machine types with local SSDs improve I/O performance but do not address cost reduction or scaling needs; they increase cost without solving underutilization during the day.

72
Multi-Selectmedium

A data warehouse in BigQuery is experiencing performance issues. Which THREE techniques can improve performance without moving data to a different storage system?

Select 3 answers
A.Partition by date
B.Cluster by common filter columns
C.Use streaming buffer
D.Use BigQuery slots
E.Use materialized views
AnswersA, B, E

Partitioning limits scans to relevant partitions.

Why this answer

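Partitioning by date lets the BigQuery query engine prune partitions that do not match the query's date filter, which significantly reduces the amount of data scanned. Clustering by commonly filtered columns co-locates rows with similar values, so filters on those columns read fewer storage blocks. Materialized views precompute and store query results, so repeated aggregations do not rescan the base table. All three change how the data is organized or reused inside BigQuery, so nothing moves to a different storage system.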
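Partition pruning can be pictured with a toy model. The sketch below is not BigQuery code; it is a plain-Python illustration, with made-up rows and a hypothetical `scan` helper, of why a date filter reduces the rows read on a date-partitioned table.

```python
from datetime import date

# Toy table: rows grouped by partition key (transaction date). All data is made up.
partitions = {
    date(2024, 1, 1): [{"customer_id": 1, "amount": 10}, {"customer_id": 2, "amount": 5}],
    date(2024, 1, 2): [{"customer_id": 1, "amount": 7}],
    date(2024, 1, 3): [{"customer_id": 3, "amount": 12}, {"customer_id": 2, "amount": 4}],
}


def scan(table, start=None, end=None):
    """Return (rows, rows_read) for an optional inclusive date range.

    With a date filter, partitions outside [start, end] are skipped entirely;
    without one, every partition is read.
    """
    rows, rows_read = [], 0
    for day, part in table.items():
        if start is not None and day < start:
            continue  # pruned: this partition is never read
        if end is not None and day > end:
            continue  # pruned: this partition is never read
        rows_read += len(part)
        rows.extend(part)
    return rows, rows_read


_, full_scan = scan(partitions)
_, pruned_scan = scan(partitions, start=date(2024, 1, 2), end=date(2024, 1, 2))
print(full_scan, pruned_scan)  # 5 1
```

The same query without a date filter reads every partition, which is why partitioning only helps queries that actually filter on the partitioning column.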

Exam trap

Google Cloud often tests the misconception that streaming buffer (Option C) is a performance optimization, when in fact it is designed for near-real-time ingestion and can degrade query performance due to the small, unoptimized files it creates.

73
MCQmedium

A media company uses Cloud Data Loss Prevention (DLP) API to inspect and de-identify sensitive data before loading into BigQuery. They want to reduce costs by sampling the data during inspection. Which configuration should they use?

A.Use the 'ROWS' limit in the inspection job.
B.Set the sample method to 'RANDOM' with a percentage.
C.Use a hybrid inspection with a BigQuery sample table.
D.Use the 'BYTES_LIMIT' parameter.
AnswerB

DLP supports random sampling to inspect a subset of data, reducing cost.

Why this answer

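Option B is correct because the Cloud DLP API supports random sampling in storage inspection jobs. In the API itself the sample method value is `RANDOM_START`, and it is combined with a percentage limit (`rowsLimitPercent` for BigQuery tables, `bytesLimitPerFilePercent` for Cloud Storage files) so that only a subset of the data is inspected. This directly reduces the volume of data scanned, lowering costs while still drawing the sample from across the data rather than only from its first rows.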
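As a rough sketch of what such a configuration might look like, the helper below builds a JSON-style request body for a sampled inspection job. It assumes the data sits in a BigQuery staging table; the function name, the project, dataset and table IDs, the chosen infoType, and the 1 to 100 range check are all illustrative assumptions. The field names follow the DLP REST API (`sampleMethod`, `rowsLimitPercent`), where the random option is spelled `RANDOM_START`.

```python
import json


def build_sampled_inspect_job(project_id, dataset_id, table_id, percent):
    """Build a request body for a DLP inspection job that samples rows randomly.

    Hypothetical helper: it only assembles the dictionary and does not call the API.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "inspectJob": {
            "storageConfig": {
                "bigQueryOptions": {
                    "tableReference": {
                        "projectId": project_id,
                        "datasetId": dataset_id,
                        "tableId": table_id,
                    },
                    # Inspect only this percentage of rows...
                    "rowsLimitPercent": percent,
                    # ...chosen randomly instead of from the top of the table.
                    "sampleMethod": "RANDOM_START",
                }
            },
            "inspectConfig": {"infoTypes": [{"name": "EMAIL_ADDRESS"}]},
        }
    }


# Placeholder identifiers, for illustration only.
body = build_sampled_inspect_job("my-project", "staging", "events", percent=10)
print(json.dumps(body, indent=2))
```

Leaving `sampleMethod` at its default would apply the same percentage limit starting from the top of the table, which is the sequential behavior the wrong options describe.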

Exam trap

The trap here is that candidates confuse 'limiting rows/bytes' (which scans sequentially from the start) with 'random sampling' (which distributes inspection across the entire dataset), leading them to pick options A or D, which do not achieve representative cost reduction.

How to eliminate wrong answers

Option A is wrong because the 'ROWS' limit in an inspection job caps the total number of rows scanned but does not sample randomly; it stops after scanning that many rows from the start, which can miss sensitive data in later rows and does not provide representative sampling. Option C is wrong because hybrid inspection with a BigQuery sample table requires manually creating and maintaining a separate table, adding complexity and storage costs, whereas the DLP API's built-in sampling is simpler and directly integrated. Option D is wrong because 'BYTES_LIMIT' limits the total bytes scanned but, like 'ROWS', scans sequentially from the beginning and does not perform random sampling, leading to biased results and potential cost inefficiency.

74
MCQmedium

A company has a batch ETL job that runs daily using Cloud Dataflow. The job reads from Cloud Storage, transforms data, and writes to BigQuery. Recently, the job started failing with 'Resources have been exhausted' errors. What is the most likely cause?

A.The Cloud Storage bucket has been deleted.
B.The project has reached its Dataflow API quota.
C.The input data volume has increased significantly.
D.The BigQuery output table schema has changed.
AnswerB

A 'Resources have been exhausted' error indicates a quota issue.

Why this answer

The 'Resources have been exhausted' error in Cloud Dataflow typically indicates that the project has reached its Dataflow API quota, such as the maximum number of concurrent jobs or API requests per minute. This is a common issue when multiple jobs run simultaneously or when the quota is set low by default. The error is distinct from resource exhaustion in the underlying compute or storage layers.

Exam trap

Google Cloud often tests the distinction between API quota exhaustion and resource exhaustion in the underlying infrastructure (e.g., Compute Engine CPU/memory), leading candidates to incorrectly attribute the error to increased data volume or schema changes.

How to eliminate wrong answers

Option A is wrong because deleting the Cloud Storage bucket would cause a 'bucket not found' or 'object not found' error, not a 'Resources have been exhausted' error. Option C is wrong because a significant increase in input data volume would lead to autoscaling limits or worker resource exhaustion (e.g., out of memory), but the specific 'Resources have been exhausted' message is tied to API quota limits, not data volume. Option D is wrong because a schema change in BigQuery would result in a schema mismatch or insertion error, not an API quota exhaustion error.

75
Multi-Selecthard

A data pipeline reads thousands of JSON files from Cloud Storage, processes them with Cloud Dataflow, and writes to BigQuery. The pipeline sometimes fails because of malformed JSON records. Which three steps should the data engineering team take to improve pipeline reliability? (Choose THREE.)

Select 3 answers
A.Integrate Cloud Pub/Sub as an intermediary to buffer and allow message retry
B.Use a try-catch block in the pipeline to retry processing failed records
C.Create a Cloud Monitoring alert on pipeline failures
D.Add schema validation before processing to reject invalid JSON records
E.Implement a dead-letter queue in the Dataflow pipeline to store failed records for later analysis
AnswersA, D, E

Pub/Sub can retry delivery of messages, improving reliability.

Why this answer

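Option A is correct because integrating Cloud Pub/Sub as an intermediary decouples the ingestion of JSON files from the Dataflow pipeline. Pub/Sub provides at-least-once delivery and automatic redelivery of messages that are not acknowledged, which buffers against transient failures and lets the pipeline pull messages at its own pace without losing data. Option D is correct because schema validation rejects malformed records before they reach the main transforms. Option E is correct because a dead-letter queue stores the records that still fail, so the pipeline keeps running and the failures can be analyzed later.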

Exam trap

The trap here is that candidates often confuse reactive monitoring (Option C) with proactive reliability improvements, or they assume a simple try-catch block (Option B) is sufficient in a distributed processing framework like Dataflow, where fault tolerance requires persistent retry mechanisms and dead-letter queues.
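The validation and dead-letter ideas in Options D and E can be sketched in plain Python. This is not Dataflow or Apache Beam code; it is a minimal illustration of the routing logic, and the required field names and the `route_records` helper are hypothetical.

```python
import json

# Hypothetical schema: every record must be a JSON object with these keys.
REQUIRED_FIELDS = {"id", "timestamp"}


def route_records(lines):
    """Split raw lines into (valid_records, dead_letter_entries).

    Malformed or incomplete records are not retried and do not raise;
    they are kept with an error message for later analysis.
    """
    valid, dead_letter = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            dead_letter.append({"raw": line, "error": f"invalid JSON: {exc.msg}"})
            continue
        if not isinstance(record, dict):
            dead_letter.append({"raw": line, "error": "not a JSON object"})
            continue
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            dead_letter.append({"raw": line, "error": f"missing fields: {sorted(missing)}"})
            continue
        valid.append(record)
    return valid, dead_letter


sample = [
    '{"id": 1, "timestamp": "2024-01-01T00:00:00Z"}',
    '{"id": 2',    # truncated JSON
    '{"id": 3}',   # valid JSON, missing a required field
    "[1, 2]",      # valid JSON, but not an object
]
valid, dead_letter = route_records(sample)
print(len(valid), len(dead_letter))  # 1 3
```

Retrying a malformed record, as Option B suggests, would fail the same way every time, which is why such records are routed aside instead of retried.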
