PDE Designing Data Systems Questions

1
MCQhard

A financial services company streams trades into Pub/Sub and processes them with Dataflow. The pipeline must ensure exactly-once processing of each trade for regulatory compliance. However, Pub/Sub guarantees at-least-once delivery. Which combination of features should the Dataflow pipeline use to achieve exactly-once semantics?

A.Use Dataflow's exactly-once processing mode and implement idempotent writes in the sink
B.Enable Pub/Sub message deduplication and use at-most-once delivery
C.Use global windowing and discard late data
D.Use Pub/Sub Lite with exactly-once delivery guarantee
AnswerA

Dataflow's exactly-once mode with idempotent sinks ensures output exactly once.

Why this answer

Dataflow's exactly-once sink combined with idempotent writes ensures exactly-once output. Pub/Sub cannot guarantee exactly-once delivery, but Dataflow can deduplicate using unique IDs. Idempotent writes prevent duplicates even if Dataflow retries.
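To make the idempotent-write idea concrete, here is a minimal sketch using an in-memory stand-in for a sink. The `Trade` fields, the `trade_id` key, and the `IdempotentSink` class are hypothetical names chosen for illustration; a real pipeline would upsert into an actual store keyed the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trade:
    trade_id: str  # hypothetical unique key assigned by the publisher
    symbol: str
    quantity: int


class IdempotentSink:
    """In-memory stand-in for a sink that upserts by primary key."""

    def __init__(self):
        self._rows = {}  # trade_id -> Trade

    def write(self, trade):
        # Upsert: writing the same trade again stores an identical row,
        # so a retry or redelivery leaves the sink state unchanged.
        self._rows[trade.trade_id] = trade

    def total_quantity(self):
        return sum(t.quantity for t in self._rows.values())

    def __len__(self):
        return len(self._rows)


sink = IdempotentSink()
deliveries = [
    Trade("t-1", "ABC", 100),
    Trade("t-2", "XYZ", 50),
    Trade("t-1", "ABC", 100),  # duplicate from at-least-once delivery
]
for trade in deliveries:
    sink.write(trade)

print(len(sink), sink.total_quantity())
```

Because the write is keyed on the trade ID, the duplicate delivery does not inflate the row count or the total.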

2
MCQhard

A company uses Pub/Sub with push subscriptions to deliver events to a Cloud Run service. Recently, the service has been returning HTTP 429 (Too Many Requests), causing messages to be retried and eventually sent to the dead letter topic. What is the MOST likely cause?

A.The subscription ackDeadline is set too low, causing messages to be redelivered
B.The push endpoint is not acknowledging messages quickly enough, causing a backlog
C.The dead letter topic is misconfigured, causing messages to be sent to it prematurely
D.The Cloud Run service needs more instances to handle the incoming request rate
AnswerD

Cloud Run scales based on requests; if max instances reached, it returns 429. Increasing instances or adjusting concurrency resolves this.

Why this answer

Push subscriptions can be rate limited by the receiving service. Increasing ackDeadline gives more time but doesn't reduce rate. Using pull subscriptions shifts the rate control to the subscriber.

Adjusting max delivery attempts only affects how many retries before dead letter, not rate limiting.

3
MCQmedium

A company is designing a data pipeline that ingests real-time events from IoT devices and must handle late-arriving data (up to 1 hour late) while minimizing duplicate processing. They plan to use Dataflow with Pub/Sub. Which combination of windowing and trigger settings should they use?

A.Sliding windows of 10 minutes with allowed lateness of 30 minutes and accumulating panes
B.Global window with allowed lateness of 1 hour and accumulating trigger every 5 minutes
C.Fixed windows of 1 hour with allowed lateness of 1 hour and no accumulation
D.Session windows with a 10-minute gap duration and allowed lateness of 1 hour
AnswerD

Session windows group events that occur within a 10-minute gap. Allowed lateness of 1 hour ensures that late events up to an hour after the watermark advance are still included, minimizing duplicates by capturing them in the correct session.

Why this answer

Session windows naturally group events based on a gap duration, so late events within the gap extend the window. Setting the allowed lateness to 1 hour ensures that late events are still included in the correct session. Using withAllowedLateness(1 hour) allows the watermark to advance and the session to finalize after the gap, but late data within 1 hour will trigger a pane update.
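The way a late event can extend or merge sessions can be sketched in plain Python. This is a simplified model, not Beam code: the `sessionize` helper and the sample timestamps are assumptions, and two events are placed in the same session when they are strictly less than the gap apart.

```python
GAP_SECONDS = 10 * 60


def sessionize(timestamps, gap=GAP_SECONDS):
    """Group event timestamps (in seconds) into sessions.

    Consecutive events closer together than `gap` share a session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


on_time = [0, 300, 1300]
before_late = sessionize(on_time)

# A late event with event time 800 arrives within the allowed lateness.
# It sits within the gap of both neighbours, so the two sessions merge.
after_late = sessionize(on_time + [800])

print(before_late)
print(after_late)
```

The late event is grouped by its event time, not its arrival time, which is why it can bridge two sessions that had already been emitted separately.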

4
Multi-Selecthard

A data pipeline processes sensitive customer data. You need to ensure that only authorised users can query the data in BigQuery, and that the data is encrypted at rest and in transit. Which THREE steps should you take? (Choose three.)

Select 3 answers
A.Use a Cloud VPN to encrypt data in transit between on-premises and Google Cloud
B.Grant the bigquery.dataViewer role at the dataset level to all users
C.Create authorised views in BigQuery to restrict access to sensitive columns
D.Enable default encryption with Customer-Managed Encryption Keys (CMEK) for BigQuery
E.Use IAM conditions to restrict access based on the requester's IP address
AnswersC, D, E

Authorised views allow fine-grained access control at row/column level.

Why this answer

BigQuery encrypts data at rest by default with Google-managed keys; CMEK lets you control the keys yourself. In-transit encryption is also on by default for BigQuery. Authorised views provide fine-grained access control.

IAM roles control dataset access.

5
Multi-Selectmedium

An organization is using BigQuery for analytics. They have a table that is 500 GB and is frequently queried by 'date' and 'region'. They want to optimize query performance and reduce costs. Which TWO actions should they take?

Select 2 answers
A.Use an authorized view
B.Use a wildcard table
C.Use materialized views
D.Cluster the table by region
E.Partition the table by date
AnswersD, E

Clustering by region within each partition improves query performance for region filters.

Why this answer

Partitioning by date enables partition pruning to scan only relevant partitions. Clustering by region further reduces data scanned within partitions.
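As an illustration, the partition-plus-cluster layout could be expressed as BigQuery DDL. The small helper below only assembles the statement as a string; the table name `analytics.sales` and the column names are hypothetical, and `sale_date` stands in for the question's 'date' column.

```python
def build_partitioned_clustered_ddl(table, columns, partition_column, cluster_columns):
    """Assemble a CREATE TABLE statement with PARTITION BY and CLUSTER BY."""
    column_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE {table} (\n  {column_defs}\n)\n"
        f"PARTITION BY {partition_column}\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


ddl = build_partitioned_clustered_ddl(
    table="analytics.sales",  # hypothetical dataset.table
    columns=[("sale_date", "DATE"), ("region", "STRING"), ("amount", "NUMERIC")],
    partition_column="sale_date",
    cluster_columns=["region"],
)
print(ddl)
```

Queries that filter on `sale_date` can then prune partitions, and a filter on `region` reduces the blocks read inside each remaining partition.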

6
Multi-Selectmedium

Your organization is designing a data lake on Google Cloud using Cloud Storage. You need to choose a file format for storing raw data that supports schema evolution, is splittable for parallel processing, and is optimized for query performance in BigQuery. Which TWO formats meet these requirements? (Choose 2.)

Select 2 answers
A.Avro
B.CSV
C.Parquet
D.JSON (newline-delimited)
E.ORC
AnswersA, C

Avro supports schema evolution, is splittable, and BigQuery can read Avro files efficiently.

Why this answer

Both Avro and Parquet support schema evolution (through schemas) and are splittable. Parquet is columnar and highly optimized for BigQuery performance. Avro is row-oriented but also splittable and supports schema evolution.

CSV and JSON do not natively support schema evolution and are less performant for BigQuery. ORC is also a BigQuery-supported load format, but this question's answer key pairs Avro with Parquet.

7
MCQmedium

A company runs Apache Spark jobs on Dataproc. They want to reduce costs by using preemptible instances for worker nodes. The jobs are fault-tolerant and can handle occasional node loss. However, the cluster must remain available for interactive querying during business hours. Which Dataproc cluster configuration meets these requirements?

A.Use a single-node cluster that automatically scales with preemptible instances
B.Use a standard cluster with preemptible instances as secondary workers
C.Use standard cluster with master and worker nodes as preemptible instances
D.Use a high-availability cluster with preemptible instances for primary workers
AnswerB

Secondary workers (preemptible workers) are ideal for fault-tolerant batch jobs. They do not store HDFS data, so losing them does not affect data durability. The cluster remains available because primary workers and master nodes are regular instances.

Why this answer

Dataproc clusters support multiple node types. Primary workers run the NodeManager and DataNode daemons; using preemptible instances for them is risky because they are critical for HDFS and YARN. However, secondary workers (preemptible workers) are designed for stateless processing and can be lost without affecting cluster availability.

Preemptible instances cannot be used for master nodes.

8
Multi-Selectmedium

A company uses Pub/Sub to ingest events from multiple sources. They need to ensure that messages from a specific source are processed in order (per source partition). They also need to deduplicate messages. Which TWO features should they use?

Select 2 answers
A.Set a message schema to enforce ordering
B.Use a dead letter topic to handle out-of-order messages
C.Enable exactly-once delivery on the subscription
D.Use a pull subscription with a large ack deadline
E.Enable message ordering by setting an ordering key
AnswersC, E

Exactly-once delivery ensures that each message is delivered only once, providing deduplication.

Why this answer

Pub/Sub ordering keys ensure messages with the same ordering key are delivered in order. Message deduplication is achieved via exactly-once delivery (Pub/Sub now supports this). Dead letter topics are for undeliverable messages.

Schemas enforce structure, not ordering.

9
Multi-Selecthard

A company is migrating on-premises Hadoop Hive workloads to Google Cloud. They want to use Dataproc for Spark processing and require a managed Hive metastore that can be shared across multiple Dataproc clusters. Which TWO components should they use?

Select 2 answers
A.Cloud Bigtable
B.Dataproc on GKE
C.Dataproc Metastore
D.Dataproc Serverless
E.Cloud SQL for MySQL
AnswersC, E

Dataproc Metastore is a managed Hive metastore that can be shared across clusters.

Why this answer

Dataproc Metastore provides a managed Hive metastore service that can be used by multiple Dataproc clusters. Using Dataproc on GKE allows flexibility but is not required for metastore sharing. Cloud SQL can host a Hive metastore but is not managed for Hive.

Dataproc Serverless does not provide a shared metastore.

10
MCQhard

A Dataflow pipeline with multiple steps uses a side input from a slowly changing reference table stored in BigQuery. The side input is updated every hour. To avoid reprocessing the entire pipeline on each update, which approach should you use?

A.Use a side input with a custom 'AsIterable' and a 'Repeatable' trigger that refreshes the side input every hour
B.Use a side input with a periodic refresh via a DoFn that reads BigQuery on each element
C.Use the side input with a default trigger and 'withAllowedLateness'
D.Use a global window with a trigger that fires every hour
AnswerA

This pattern reads the side input periodically (e.g., using a global window with a trigger) and uses AsIterable for efficient lookup.

Why this answer

The side input should be read periodically using a side input pattern with a Repeatable trigger. This refreshes the side input without restarting the pipeline.

11
Multi-Selecthard

A media company processes video metadata using a Dataflow pipeline. They need to join two streaming sources: user activity (Pub/Sub) and video catalog updates (Pub/Sub). Which THREE transforms should be used in the pipeline?

Select 3 answers
A.Flatten to combine the two PCollections
B.ParDo to process each element individually
C.CoGroupByKey to join the two PCollections on a common key (e.g., video_id)
D.Window both PCollections into a common window (e.g., fixed 1-minute)
E.GroupByKey on each PCollection separately before joining
AnswersC, D, E

CoGroupByKey joins multiple PCollections by key.

Why this answer

CoGroupByKey joins multiple PCollections by key. Window into a common window ensures the joins are time-aligned. Flatten is not needed; ParDo is for element-wise processing, not joining.
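Why the common window matters can be shown with a plain-Python model of a keyed join. This is not Beam code: `co_group_by_key`, the 60-second window, and the sample records are assumptions made for the sketch.

```python
from collections import defaultdict

WINDOW_SECONDS = 60


def window_start(ts):
    """Start of the fixed window that contains event time `ts`."""
    return ts - ts % WINDOW_SECONDS


def co_group_by_key(activity, catalog):
    """Group two inputs of (key, value, event_ts) by (window, key)."""
    grouped = defaultdict(lambda: {"activity": [], "catalog": []})
    for tag, records in (("activity", activity), ("catalog", catalog)):
        for key, value, ts in records:
            grouped[(window_start(ts), key)][tag].append(value)
    return dict(grouped)


activity = [("v1", "play", 5), ("v1", "pause", 42), ("v2", "play", 70)]
catalog = [("v1", "title=Intro", 10), ("v2", "title=Outro", 15)]

joined = co_group_by_key(activity, catalog)
for group_key in sorted(joined):
    print(group_key, joined[group_key])
```

Here `v1` joins because both inputs have records in the same window, while the `v2` records land in different windows and never meet.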

12
MCQeasy

A data engineer needs to design a stream processing pipeline that reads events from Pub/Sub, enriches them with data from a Cloud Storage file, and writes aggregated results to BigQuery. The pipeline must handle late-arriving events up to 1 hour. Which Dataflow feature should be used to manage late data?

A.Triggers
B.Watermarks
C.Side inputs
D.Windowing
AnswerB

Watermarks track event-time progress; elements that arrive after the watermark has passed the end of their window plus the allowed lateness are dropped.

Why this answer

Watermarks track event time progress and allow specifying allowed lateness. Triggers control when results are emitted, but watermarks handle late data.

13
Multi-Selectmedium

You are designing a BigQuery data lake for a healthcare organization. The data includes patient records that must be access-controlled at the row level. Which TWO features should you use to meet this requirement?

Select 2 answers
A.Row-level security using row access policies
B.Authorised views with row filters
C.Dataset-level IAM roles
D.Clustering on patient_id
E.Materialised views
AnswersA, B

Row access policies filter rows based on user identity or group membership.

Why this answer

Row-level security in BigQuery allows filtering at the row level using access policies. Authorised views can also be used to expose only certain rows. Clustering and materialised views do not provide row-level access control.

14
MCQmedium

You have a BigQuery table that is partitioned by ingestion time and clustered on user_id. The table stores event logs and is queried frequently by user_id to analyze user behavior over the last 30 days. Queries are still scanning too many partitions. Which optimization should you apply first?

A.Create a materialized view that pre-aggregates data by user_id and date
B.Remove partitioning and rely solely on clustering
C.Change the partition column to a DATE column based on event_timestamp and keep clustering on user_id
D.Add clustering on a second column like event_type
AnswerC

If the ingestion time does not match the event timestamp, queries filtering on event time will not prune partitions effectively. Partitioning on the actual event date ensures partition pruning aligns with query filters.

Why this answer

The query filter on user_id already uses clustering, which prunes blocks within a partition. But the query also filters on a date range, which should leverage partition pruning. If queries still scan many partitions, the most likely cause is that the partition filter is not applied effectively.

Using a range filter on the partition column (_PARTITIONDATE) or a column used for partitioning will limit partitions scanned. But the question already says the table is partitioned by ingestion time. The best next step is to ensure the query uses a filter on the partition column.

However, among the options, changing the partition type to a specific date column (e.g., event_timestamp) with clustering on user_id could improve if the ingestion time doesn't align with the query time range.
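The mismatch between ingestion time and event time can be illustrated with a toy model. The rows and the `partitions_scanned` helper are hypothetical, and the model simply assumes that a filter prunes partitions only when it is on the partition column.

```python
from datetime import date

# Hypothetical rows: (event_date, ingestion_date, user_id).
ROWS = [
    (date(2024, 1, 1), date(2024, 1, 1), "u1"),
    (date(2024, 1, 1), date(2024, 1, 3), "u1"),  # arrived two days late
    (date(2024, 1, 2), date(2024, 1, 2), "u2"),
    (date(2024, 1, 2), date(2024, 1, 5), "u1"),  # backfilled later
]


def partitions_scanned(rows, partition_by, event_from, event_to):
    """Count partitions read for a filter on event_date in a date range.

    partition_by is "event" or "ingestion".
    """
    if partition_by == "event":
        # The filter is on the partition column, so non-matching
        # partitions are pruned before any data is read.
        return len({ev for ev, _, _ in rows if event_from <= ev <= event_to})
    # The filter is not on the partition column: any ingestion-date
    # partition may hold matching events, so all of them are read.
    return len({ing for _, ing, _ in rows})


day = date(2024, 1, 1)
by_event = partitions_scanned(ROWS, "event", day, day)
by_ingestion = partitions_scanned(ROWS, "ingestion", day, day)
print(by_event, by_ingestion)
```

In this toy data, a one-day event filter touches one partition when the table is partitioned by event date and every partition when it is partitioned by ingestion time.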

15
Multi-Selectmedium

You are designing a data pipeline for a financial services company that requires exactly-once processing semantics. Which TWO services or configurations provide exactly-once guarantees?

Select 2 answers
A.Pub/Sub with exactly-once delivery enabled
B.Dataproc with checkpointing
C.Dataflow with exactly-once processing mode
D.Pub/Sub Lite with at-least-once delivery
E.Cloud Storage with object versioning
AnswersA, C

Pub/Sub offers exactly-once delivery for pull subscriptions when enabled.

Why this answer

Pub/Sub with exactly-once delivery enabled provides exactly-once message delivery. Dataflow with exactly-once processing mode ensures each record is processed exactly once. Pub/Sub Lite and Dataproc do not provide exactly-once guarantees, and Cloud Storage is for storage.

16
MCQhard

A company uses Cloud Pub/Sub to ingest events from multiple sources. They need to guarantee that each event is processed exactly once by downstream consumers. However, Pub/Sub guarantees at-least-once delivery. Which additional steps should they implement to achieve exactly-once processing?

A.Set the subscription's acknowledgment deadline to 0.
B.Enable message deduplication on the subscription.
C.Store each message's unique ID in a database and ignore duplicates.
D.Use a dead letter topic to capture duplicates.
AnswerC

Consumer-side deduplication by tracking message IDs achieves exactly-once processing.

Why this answer

Pub/Sub only provides at-least-once; exactly-once requires consumer-side deduplication using a unique message ID.
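A minimal sketch of consumer-side deduplication, using an in-memory SQLite table as the stand-in database. The table name, the `handle_once` helper, and the message IDs are assumptions; a production system would use a durable store and keep the ID insert and the side effect in one transaction where possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")

handled = []  # stand-in for the real side effect


def handle_once(message_id, payload):
    """Process a message only the first time its ID is seen."""
    with conn:  # commits on success, rolls back on an exception
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed (message_id) VALUES (?)",
            (message_id,),
        )
        if cursor.rowcount == 0:
            return False  # duplicate: acknowledge and skip
        handled.append(payload)
        return True


results = [
    handle_once("m-1", "order created"),
    handle_once("m-2", "order paid"),
    handle_once("m-1", "order created"),  # redelivered duplicate
]
print(results, handled)
```

The primary key makes the duplicate insert a no-op, so the redelivered message is acknowledged without being processed again.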

17
Multi-Selecthard

A company is designing a data pipeline using the lambda architecture. They need to process both real-time streams and batch historical data. Which THREE components are essential for a lambda architecture on Google Cloud?

Select 3 answers
A.BigQuery as the serving layer
B.Cloud Pub/Sub for real-time ingestion
C.Cloud SQL as the serving database
D.Cloud Bigtable for serving
E.Dataflow for stream processing
F.Dataflow for batch processing
AnswersB, E

Part of speed layer.

Why this answer

Lambda architecture has batch, speed, and serving layers.

18
Multi-Selecthard

A data engineering team is designing a streaming pipeline using Cloud Dataflow. They need to join two unbounded PCollections based on a common key. The join must handle late data up to 10 minutes. Which THREE components should they use?

Select 3 answers
A.CoGroupByKey transform
B.Window into fixed windows
C.Use a global window with triggers
D.Flatten transform
E.Set allowed lateness and trigger to handle late data
AnswersA, B, E

CoGroupByKey performs a join of two PCollections by key.

Why this answer

CoGroupByKey joins two PCollections by key. Window into fixed windows of appropriate duration. Allowed lateness and triggers handle late data.

19
MCQmedium

A data engineer is designing a pipeline that reads from Cloud Pub/Sub, aggregates events into 5-minute windows, and writes the results to BigQuery. The engineer wants to ensure that late-arriving data (up to 2 minutes late) is included in the correct window. Which Dataflow feature should they configure?

A.Use a sliding window of 5 minutes with 2-minute slide
B.Set the window duration to 7 minutes to account for lateness
C.Set the allowed lateness to 2 minutes with a trigger that fires on late data
D.Use a global window and watermark
AnswerC

Allowed lateness specifies how long to wait; a trigger can emit updates for late data.

Why this answer

Option C is correct because Dataflow's allowed lateness feature (set to 2 minutes) ensures that late-arriving data within that threshold is still assigned to the correct 5-minute window. Combined with a trigger that fires on late data, the pipeline can emit updated results for the window after the watermark passes, which is exactly what the engineer needs to handle late-arriving events up to 2 minutes late.

Exam trap

The trap here is that candidates confuse window duration adjustments (Option B) or sliding windows (Option A) with the proper late-data handling mechanism, not realizing that allowed lateness and triggers are the correct Dataflow primitives for including late-arriving data in the correct event-time window.

How to eliminate wrong answers

Option A is wrong because a sliding window of 5 minutes with a 2-minute slide creates overlapping windows that emit results every 2 minutes, not a single 5-minute window with late data handling; it would double-count events and not solve the late-arrival problem. Option B is wrong because setting the window duration to 7 minutes does not account for lateness—it simply shifts the window boundaries, causing data to be assigned to a different time range, which is incorrect for the intended 5-minute aggregation. Option D is wrong because a global window and watermark would aggregate all data into a single unbounded window, losing the per-5-minute grouping required by the pipeline.
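The interaction between the watermark and allowed lateness can be sketched as a small classifier. This is a simplified model rather than Beam code: the `classify` helper is hypothetical, times are plain seconds, and window-boundary details are ignored.

```python
WINDOW_SECONDS = 5 * 60
ALLOWED_LATENESS_SECONDS = 2 * 60


def window_for(event_ts):
    """Fixed window [start, end) that contains the event timestamp."""
    start = event_ts - event_ts % WINDOW_SECONDS
    return (start, start + WINDOW_SECONDS)


def classify(event_ts, watermark):
    """Return the event's window and how the pipeline would treat it."""
    start, end = window_for(event_ts)
    if watermark < end:
        status = "on_time"
    elif watermark < end + ALLOWED_LATENESS_SECONDS:
        status = "late"  # still added to its original window
    else:
        status = "dropped"  # window state has been discarded
    return (start, end), status


for watermark in (280, 360, 430):
    print(watermark, classify(290, watermark))
```

In each case the event stays assigned to the window covering its own timestamp; only whether it is still accepted changes as the watermark advances.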

20
MCQmedium

A company wants to use Dataprep to clean and transform raw CSV files stored in Cloud Storage before loading into BigQuery. The data quality checks show missing values and inconsistent date formats. Which Dataprep feature should they use to handle these issues?

A.Data quality profiling
B.Scheduling
C.Wrangler
D.Recipe steps
AnswerD

Recipe steps define transformations such as impute missing values and parse dates.

Why this answer

Recipe steps allow chaining transformations like fill missing values and format dates. Data quality profiling identifies issues but doesn't fix them. Scheduling automates execution.

Wrangler is the UI, not a specific feature for transformations.

21
MCQhard

You are designing a data pipeline that processes streaming events with late-arriving data (up to 2 hours late). The pipeline must compute hourly aggregations and emit results as soon as possible, but must also accurately update results when late data arrives. You want to minimize overall processing cost. Which Dataflow windowing and trigger configuration should you use?

A.Fixed windows of 1 hour with allowed lateness of 2 hours and trigger every 5 minutes (early) and on watermark (late) with accumulating fired panes
B.Global window with triggers every 5 minutes
C.Sliding windows of 1 hour with 30-minute offset
D.Session windows with 10-minute gap duration
AnswerA

Fixed windows match the hourly aggregation requirement. Allowed lateness of 2 hours handles late data. Early triggers provide near-real-time results. Accumulating fired panes ensures updates are included.

Why this answer

Session windows are ideal for capturing bursts of user activity but not for fixed hourly aggregations. The best approach is to use fixed windows with allowed lateness of 2 hours and triggering early every N minutes (e.g., 5 minutes) and also on watermark advancement. This provides early results while allowing late data to update the window.

Whether to use accumulating or discarding panes depends on the use case; here, accumulating fired panes is the usual choice for correctness, because each late firing then emits the full updated result for the window.
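The difference between the two pane modes is easy to see with a toy sum aggregation. The `pane_outputs` helper and the sample firings are assumptions for illustration, not Beam APIs.

```python
def pane_outputs(firings, accumulating):
    """Sum emitted at each trigger firing for one window.

    `firings` lists the values that arrived since the previous firing.
    """
    outputs = []
    running = 0
    for new_values in firings:
        if accumulating:
            running += sum(new_values)
            outputs.append(running)  # full result so far
        else:
            outputs.append(sum(new_values))  # only the new delta
    return outputs


# Values seen at an early firing, the on-time firing, and a late firing.
firings = [[3, 4], [5], [2]]

accumulated = pane_outputs(firings, accumulating=True)
discarded = pane_outputs(firings, accumulating=False)
print(accumulated, discarded)
```

With accumulating panes the last output is the corrected hourly total, so a downstream sink can overwrite the earlier value; with discarding panes the consumer has to add the deltas itself.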

22
MCQeasy

Your company is building a real-time anomaly detection system for financial transactions. The system must process streams of transactions and flag anomalies within seconds. The volume is moderate (5000 transactions per second). You want a fully managed solution that integrates with BigQuery for historical analysis. Which service should you use for stream processing?

A.Cloud Dataflow
B.Cloud Pub/Sub with push subscriptions
C.Cloud Dataproc with Spark Streaming
D.Cloud Data Fusion
AnswerA

Dataflow is fully managed, handles streaming with sub-second latency, and integrates natively with Pub/Sub and BigQuery.

Why this answer

Cloud Dataflow is a fully managed service ideal for real-time stream processing with low latency. It can read from Pub/Sub, perform transformations (e.g., anomaly detection), and write to BigQuery for historical analysis. Dataproc requires cluster management; Data Fusion is batch-oriented; Pub/Sub alone does not process data.

23
MCQmedium

A team wants to use Cloud Pub/Sub Lite for a high-throughput, low-cost messaging system. They need exactly-once delivery to subscribers. What should they know about Pub/Sub Lite's delivery guarantees?

A.Pub/Sub Lite provides at-least-once delivery, same as standard Pub/Sub.
B.Pub/Sub Lite provides exactly-once delivery when using push subscriptions.
C.Pub/Sub Lite provides exactly-once delivery when using pull subscriptions.
D.Pub/Sub Lite supports exactly-once delivery by default.
AnswerA

Correct: Pub/Sub Lite guarantees at-least-once delivery.

Why this answer

Pub/Sub Lite offers at-least-once delivery like standard Pub/Sub; exactly-once is not guaranteed.

24
MCQmedium

A company uses Dataproc to run daily Spark ML jobs. The jobs run for 2 hours each day. The team wants to reduce costs without changing job characteristics. Which strategy is MOST cost-effective?

A.Use a single-node cluster to eliminate overhead
B.Enable high-availability mode to avoid restarts
C.Use preemptible instances for worker nodes
D.Increase the number of standard workers to finish faster
AnswerC

Preemptible instances are cheap and Spark handles preemptions via fault tolerance.

Why this answer

Preemptible VMs are up to 80% cheaper and can handle job interruptions as Spark is fault-tolerant. Single-node is for testing, not production. High-availability is for long-running clusters with HA requirements.

Standard nodes are more expensive.

25
MCQmedium

Your company uses Pub/Sub to ingest clickstream data. Messages must be processed in order for the same user_id. How should you configure the Pub/Sub subscription to guarantee ordering?

A.Use a pull subscription with enable_message_ordering=true
B.Use a pull subscription with exactly-once delivery enabled
C.Use a push subscription with acknowledgement deadline set to 600 seconds
D.Use a push subscription with a dead letter topic
AnswerA

Ordering keys with enable_message_ordering ensures messages with the same key are delivered in order.

Why this answer

Pub/Sub ordering keys allow messages with the same key to be delivered in order to subscribers. The subscription must be created with enable_message_ordering set to true.

26
MCQhard

A Dataflow pipeline using Apache Beam processes unbounded data from Pub/Sub. The pipeline uses fixed windows of 1 minute and a trigger that fires early every 30 seconds and at watermark. The team observes that the output pane for window [10:00:00, 10:01:00) contains events with timestamps from 10:00:15 and 10:00:45, but also an event with timestamp 10:02:00. What is the most likely cause?

A.The trigger is firing too early, causing the window to close prematurely
B.Allowed lateness is set to more than 1 minute, so late data is still included in its original window
C.The watermark is incorrectly estimated, allowing late data to be included
D.The window duration is actually 2 minutes due to a misconfiguration
AnswerB

When allowed lateness > 0, late data (with timestamp after window end but within allowed lateness) is still included in the correct window. The event with timestamp 10:02:00 is 1 minute late for window [10:00, 10:01), so allowed lateness must be at least 1 minute.

Why this answer

Late data can arrive after the watermark has passed, and with allowed lateness, it can be included in the original window. The event with timestamp 10:02:00 is late data that arrived after the watermark for window [10:00:00, 10:01:00), but within the allowed lateness period. It is not a trigger issue because the trigger fires correctly; the event is simply late.

27
MCQmedium

A company wants to use Pub/Sub Lite to reduce costs for a high-throughput, low-latency streaming pipeline. However, they have a requirement to retain messages for up to 7 days for reprocessing. Which Pub/Sub Lite configuration supports this retention?

A.Set the retention duration on the Pub/Sub Lite topic to 7 days
B.Set the retention duration on the Pub/Sub Lite subscription to 7 days
C.Enable exactly-once delivery on the Pub/Sub Lite topic to retain messages for 7 days
D.Use a Pub/Sub Lite reservation with 7-day retention
AnswerA

Pub/Sub Lite topics allow setting message retention duration up to 7 days. Messages are retained in the topic's storage and can be re-delivered to subscriptions within that period.

Why this answer

Pub/Sub Lite topics are the only entity where retention duration is configured; messages are retained in the topic's storage for the specified duration, allowing subscribers to replay messages within that window. Setting the retention duration to 7 days on the topic ensures messages are available for reprocessing for up to 7 days, meeting the requirement.

Exam trap

The exam often tests the distinction between topic-level and subscription-level retention in Pub/Sub Lite, where candidates mistakenly assume subscriptions control retention (as in standard Pub/Sub) rather than the topic itself.

How to eliminate wrong answers

Option B is wrong because Pub/Sub Lite subscriptions do not have a configurable retention duration; retention is set at the topic level, not the subscription. Option C is wrong because exactly-once delivery is a delivery semantics feature that prevents duplicate processing but does not control message retention duration. Option D is wrong because a Pub/Sub Lite reservation is used to provision and manage capacity (throughput) across topics, not to set retention policies.

28
MCQmedium

A company wants to design a data pipeline for real-time fraud detection. The system must process streaming financial transactions, enrich them with user profiles from a lookup table, and flag suspicious activities within seconds. Which architecture pattern would be MOST suitable?

A.Pub/Sub combined with Cloud Functions for stateless processing
B.Kappa architecture using a single stream processing framework like Apache Beam
C.Batch processing with hourly micro-batches using Dataflow
D.Lambda architecture with a batch layer for historical analysis and a speed layer for real-time processing
AnswerB

Kappa processes everything as a stream, suitable for real-time fraud detection with enrichment from a side input.

Why this answer

Kappa architecture uses a single stream processing engine to handle both real-time and batch reprocessing, simplifying the pipeline. Lambda architecture requires maintaining separate batch and streaming layers, increasing complexity. The scenario only requires real-time processing with enrichment, so Kappa is more appropriate.

29
MCQmedium

A company needs to process streaming sensor data from millions of devices with sub-second latency, apply transformations, and write results to BigQuery for real-time dashboards. The data volume varies, and they want to avoid managing servers. Which service should they use?

A.Cloud Data Fusion
B.Dataflow
C.Dataproc
D.Dataprep
AnswerB

Dataflow is serverless, supports streaming, and integrates with BigQuery.

Why this answer

Dataflow is a fully managed, serverless stream and batch processing service that can handle high-throughput streaming with sub-second latency.

30
MCQeasy

You need to run a one-time data transformation job on a small CSV file (100 MB) using a visual, code-free interface. Which Google Cloud service is designed for this?

A.Dataflow
B.Cloud Data Fusion
C.Dataprep
D.Dataproc
AnswerC

Dataprep provides visual wrangling for data exploration and transformation.

Why this answer

Dataprep (Trifacta) is a visual data wrangling tool for exploring and transforming data without code. It's ideal for ad-hoc, small to medium datasets.

31
MCQeasy

Your team wants to share a BigQuery dataset with another project while ensuring that users from that project can only query specific tables. Which BigQuery feature should you use?

A.Create an authorised view in your dataset and share the view with the other project
B.Use a materialised view and share the underlying table
C.Grant the BigQuery Data Viewer role to the other project's service account
D.Export the table to Cloud Storage and share the bucket
AnswerA

Authorised views allow fine-grained access control by sharing only the view's results.

Why this answer

Authorised views allow you to share query results with users in other projects without giving them direct access to the underlying tables.

32
Multi-Selectmedium

A company is migrating their on-premises Hadoop workloads to Google Cloud. They want to use Dataproc for data processing and need to minimize costs for non-critical batch jobs that can tolerate interruptions. Which TWO configurations should they use?

Select 2 answers
A.Use preemptible instances for worker nodes
B.Enable high-availability mode
C.Use standard (non-preemptible) instances for all nodes
D.Use single-node clusters for small jobs
E.Use Dataproc on GKE
AnswersA, D

Preemptible VMs are cheaper and suitable for fault-tolerant batch jobs.

Why this answer

Preemptible instances are cheaper and can be preempted, suitable for fault-tolerant batch jobs. Single-node clusters are cost-effective for small jobs.

33
MCQmedium

A company is using Pub/Sub to ingest clickstream events. They need to ensure that events are delivered to a subscriber at least once, but duplicates can be tolerated. They also need to filter events by type before processing. Which subscription configuration should be used?

A.Pull subscription with exactly-once delivery enabled
B.Push subscription with no filter
C.Pull subscription with a filter on event type attribute
D.Push subscription with a dead letter topic
AnswerC

Pull subscriptions allow the subscriber to control message flow. Filtering on attributes ensures only matching messages are delivered. At-least-once is default.

Why this answer

Pub/Sub provides at-least-once delivery for both pull and push subscriptions. Filtering by attributes is supported at subscription level. Pull subscriptions are typically used when the subscriber controls the pace.

Push subscriptions are also possible, but the question does not specify delivery method preference. The key is to enable message filtering on the subscription.

34
MCQeasy

A company has a BigQuery dataset containing sensitive customer data. They want to share a subset of this data with external partners, ensuring that partners can only see specific columns and rows. Which BigQuery feature should they use?

A.Materialized views
B.Authorized views
C.Clustered tables
D.Dataset-level access controls
AnswerB

Authorized views allow you to grant access to a view that selects specific columns and rows, without giving direct access to the base table.

Why this answer

Authorized views allow you to share a query (view) that filters columns and rows, while granting access to the view only, not the underlying tables.

35
MCQhard

An organization is implementing a data lake on Google Cloud using Cloud Storage. They need to process both batch and streaming data with a unified pipeline. The team has experience with Apache Beam. Which architecture should they use to minimize operational overhead?

A.Kappa architecture with Cloud Dataflow using the same pipeline for batch and streaming
B.Use Cloud Dataproc for batch and Cloud Dataflow for streaming
C.Lambda architecture with Cloud Dataflow for batch and Cloud Pub/Sub for streaming
D.Use Cloud Data Fusion for both batch and streaming
AnswerA

Kappa architecture uses a single streaming pipeline; Dataflow can handle both by replaying data.

Why this answer

Kappa architecture uses a single streaming pipeline for both batch and streaming, simplifying operations. Dataflow implements Beam and supports both modes.

36
MCQmedium

A data engineer needs to create a BigQuery table that is partitioned by ingestion time and clustered by customer_id and transaction_date. They also want to limit access so that only users from a specific domain can query the table. Which approach should they use?

A.Create the table with partitioning only, then use a materialized view to restrict access
B.Create the table without clustering, use row-level security to filter by domain, and grant access to the table
C.Create the table with partitioning and clustering, then create an authorized view on the table and grant the view access to the domain users
D.Create the table with partitioning and clustering, then grant bigquery.dataViewer to the domain via IAM at the dataset level
AnswerC

Authorized views allow controlled access without granting direct table access.

Why this answer

Authorized views allow sharing query results with specific users/groups without giving direct table access. Clustering and partitioning are defined at table creation. IAM roles at dataset level are too broad.

Row-level security filters rows but doesn't restrict domain.

37
MCQmedium

A company needs to process high-throughput streaming data with low latency. They are considering Cloud Pub/Sub for ingestion and Cloud Dataflow for processing. However, they are concerned about cost. Which alternative to Cloud Pub/Sub would reduce costs while still meeting the throughput requirements?

A.Cloud Pub/Sub with pull subscriptions
B.Cloud Tasks
C.Cloud Pub/Sub Lite
D.Cloud Pub/Sub with push subscriptions
AnswerC

Pub/Sub Lite offers lower cost for high-volume streaming with regional availability.

Why this answer

Pub/Sub Lite is a cost-effective alternative for high-throughput streaming when you don't need global availability or some advanced features of Pub/Sub.

38
MCQhard

A data pipeline uses Cloud Data Fusion to perform ETL jobs. The pipeline reads from BigQuery, transforms data using Wrangler, and writes to Cloud Storage. The team notices that the pipeline runs slower than expected. They suspect the Data Fusion instance is under-provisioned. Which action should be taken to improve performance?

A.Add more Dataproc Metastore instances
B.Change the Data Fusion instance type from Basic to Enterprise
C.Enable Data Fusion accelerator for BigQuery
D.Rewrite the pipeline using Cloud Dataprep instead
AnswerB

Enterprise edition provides a larger default Dataproc cluster and more powerful execution environment, improving performance for heavy ETL workloads.

Why this answer

Cloud Data Fusion uses Dataproc clusters for execution. The instance edition (Developer, Basic, Enterprise) determines the Dataproc cluster configuration. Upgrading to a higher edition or increasing the number of worker nodes directly improves throughput.

Wrangler transforms are executed on the Dataproc cluster, so more workers help.

39
MCQmedium

You are designing a batch data pipeline that runs daily to ingest data from an on-premises database into BigQuery. The ingestion volume is approximately 50 GB per day. The data must be available in BigQuery by 6 AM each day. The on-premises database supports change data capture (CDC) via logs. Which approach minimizes operational cost and complexity?

A.Use Cloud Dataproc with Spark Streaming to ingest CDC logs
B.Use Pub/Sub with a Dataflow streaming pipeline
C.Use Cloud Data Fusion with a batch pipeline
D.Use Cloud Dataflow with a JDBC source in batch mode to read CDC logs and write to BigQuery
AnswerD

Dataflow can read from JDBC in batch mode, handle CDC, and write to BigQuery. It is fully managed and cost-effective for this volume.

Why this answer

Using Dataflow with a JDBC source to read CDC logs in batch mode is straightforward and cost-effective for daily 50 GB loads. Dataproc could also work but requires cluster management. Pub/Sub with Dataflow would be more complex and costly for a daily batch.

Data Fusion adds a visual layer but is overkill for this simple batch ingestion.

40
Multi-Selecthard

A company uses Cloud Data Fusion for ETL pipelines. They need to transform sensitive data (PII) by masking certain columns before writing to BigQuery. They also need to ensure the pipeline can be monitored and restarted from failure points. Which THREE features should they use?

Select 3 answers
A.Use Cloud Composer to schedule and retry the pipeline
B.Create a Dataproc Metastore service to store pipeline metadata
C.Enable pipeline monitoring with alerts in Cloud Data Fusion
D.Configure pipeline checkpointing to allow restart from failure
E.Use Wrangler transformations to apply masking directives
AnswersC, D, E

Cloud Data Fusion provides monitoring dashboards and alerting for pipeline status, including failures.

Why this answer

Data Fusion Wrangler provides a step-by-step recipe for transformations, including masking. Pipeline monitoring and restart from failure are supported by the orchestration framework and checkpointing. Dataproc Metastore is for Hive metadata, not relevant.

Data Fusion Studio is the UI, not a feature for monitoring. Cloud Composer is for workflow orchestration, not needed if Data Fusion handles it.

41
Multi-Selectmedium

A data engineering team is designing a streaming pipeline using Dataflow to process real-time clickstream data from a website. They need to aggregate user session metrics (e.g., number of sessions, average duration) every 5 minutes. The pipeline must handle late-arriving events (up to 2 minutes late) and ensure exactly-once processing semantics. Which THREE of the following should they configure? (Choose three.)

Select 3 answers
A.Session windows with a 10-minute gap and allowed lateness of 2 minutes.
B.Fixed windows of 5 minutes with allowed lateness of 2 minutes.
C.Sliding windows of 5 minutes with a 1-minute period and allowed lateness of 2 minutes.
D.Dataflow pipeline with exactly-once processing mode.
E.Pub/Sub subscription with exactly-once delivery.
AnswersB, D, E

Fixed windows of 5 minutes match the requirement. Allowed lateness of 2 minutes ensures late events are included, while exactly-once handles duplicates.

Why this answer

To achieve exactly-once semantics in Dataflow, the pipeline must use exactly-once mode (the default) and the Pub/Sub subscription should be configured with exactly-once delivery. Fixed windows of 5 minutes are appropriate for the aggregation period. Sliding windows would cause overlapping windows, which is not needed.

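Session windows group events by gaps in activity rather than by fixed 5-minute intervals, so they do not match the aggregation period. When late data causes another firing in accumulating mode, the window's result is emitted again with updated values, so the sink should overwrite rather than append to avoid double counting.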

42
MCQmedium

You are designing a BigQuery data warehouse for a retail company. Queries frequently filter on order_date and customer_id. To optimize query performance and cost, which table design should you use?

A.Cluster by order_date and partition by customer_id
B.Partition by ingestion_time and cluster by order_date
C.Use a clustered table without partitioning
D.Partition by order_date and cluster by customer_id
AnswerD

This combination reduces scanned data and improves performance for filters on both columns.

Why this answer

Partitioning on order_date limits scans to relevant date ranges. Clustering on customer_id further organizes data within partitions, improving filter and aggregation queries on customer_id.
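
The effect of combining the two can be illustrated with a toy storage model. This is a simplified sketch in plain Python, not how BigQuery stores data internally: the block size, the `build_table` and `rows_scanned` helpers, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def build_table(rows, block_size=2):
    """Group rows by order_date (partition), sort by customer_id (cluster),
    and split each partition into blocks that record their min/max customer_id."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["order_date"]].append(row)
    table = {}
    for date, part_rows in partitions.items():
        part_rows.sort(key=lambda r: r["customer_id"])
        blocks = [part_rows[i:i + block_size]
                  for i in range(0, len(part_rows), block_size)]
        table[date] = [
            {"min": b[0]["customer_id"], "max": b[-1]["customer_id"], "rows": b}
            for b in blocks
        ]
    return table


def rows_scanned(table, order_date, customer_id):
    """Count rows read for: WHERE order_date = ... AND customer_id = ..."""
    scanned = 0
    for block in table.get(order_date, []):              # partition pruning
        if block["min"] <= customer_id <= block["max"]:  # block pruning
            scanned += len(block["rows"])
    return scanned


rows = [
    {"order_date": d, "customer_id": c}
    for d in ("2024-01-01", "2024-01-02")
    for c in range(1, 9)
]
table = build_table(rows)
scanned = rows_scanned(table, "2024-01-02", 5)
```

In this model the query touches 2 of the 16 rows: the date filter discards one partition entirely, and the customer_id range check skips three of the four remaining blocks.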

43
Multi-Selectmedium

A data engineer is designing a streaming pipeline using Dataflow with Apache Beam. The pipeline reads from Pub/Sub, performs a stateful transformation (e.g., session windowing), and writes to BigQuery. The pipeline must handle late data and ensure exactly-once semantics. Which THREE configurations are required?

Select 3 answers
A.Use the File Loads write method for BigQuery
B.Set allowed lateness on the window to accommodate late data
C.Configure an appropriate trigger to control output frequency
D.Use a custom watermark estimation function for Pub/Sub source
E.Enable exactly-once processing on the Dataflow pipeline
AnswersB, C, E

Allowed lateness specifies how long the window should wait for late data, ensuring completeness.

Why this answer

Exactly-once sink (BigQuery) ensures no duplicates. Idempotent writes are built into Dataflow's BigQuery sink when using exactly-once mode. Setting allowed lateness handles late data.

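Watermark estimation for the Pub/Sub source is automatic, so a custom function is not required. A trigger controls when results are emitted, including updated results after late data arrives. The File Loads write method is one option for BigQuery, not a requirement.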

44
MCQhard

You are designing a BigQuery data warehouse for a multi-tenant SaaS application. Each tenant's data must be isolated and queried only by that tenant. You need to minimise management overhead and allow tenants to be added dynamically. Which approach should you use?

A.Use Cloud IAM conditions on the dataset to filter by tenant_id
B.Use a single dataset with authorized views that filter by tenant_id, granting each tenant access to their view
C.Create a separate dataset for each tenant and grant the tenant access to their dataset
D.Use a single table with a tenant_id column and enable column-level security to restrict access
AnswerB

Authorized views provide row-level security without duplicating data. Easy to add new tenants.

Why this answer

Authorized views in a shared dataset allow you to create row-level security. By creating a view per tenant that filters by tenant_id, you can grant each tenant access only to its view. This avoids managing multiple datasets and is scalable.
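
The view-per-tenant pattern can be sketched locally. The example below uses SQLite from the Python standard library as a stand-in for BigQuery, so it shows only the filtering view, not the access control: authorizing the view and granting each tenant access to it happen in BigQuery and are not modeled here. The table, the `orders_<tenant>` naming scheme, and the `create_tenant_view` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (tenant_id TEXT, order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("acme", 1, 10.0), ("acme", 2, 25.0), ("globex", 3, 40.0)],
)


def create_tenant_view(conn, tenant_id):
    """Create a view exposing only one tenant's rows (hypothetical naming scheme)."""
    # View definitions cannot take bound parameters, so validate before formatting.
    if not tenant_id.isalnum():
        raise ValueError("tenant_id must be alphanumeric")
    view = f"orders_{tenant_id}"
    conn.execute(
        f"CREATE VIEW {view} AS "
        f"SELECT order_id, amount FROM orders WHERE tenant_id = '{tenant_id}'"
    )
    return view


acme_view = create_tenant_view(conn, "acme")
acme_rows = conn.execute(
    f"SELECT order_id, amount FROM {acme_view} ORDER BY order_id"
).fetchall()
```

Adding a tenant is then one more call to the helper, which is why this approach scales without creating a dataset per tenant.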

45
MCQhard

You are building a real-time fraud detection system using Dataflow. Events from Pub/Sub need to be grouped by user_id within a 5-minute window to detect suspicious patterns. Some events may be delayed by up to 2 minutes. How should you configure the window and trigger to balance accuracy and latency?

A.Sliding window of 5 minutes with a 1-minute period and no allowed lateness
B.Session window with a gap duration of 5 minutes
C.Fixed window of 5 minutes with no allowed lateness and default trigger
D.Fixed window of 5 minutes with allowed lateness of 2 minutes and early trigger every 1 minute
AnswerD

Early triggers provide low-latency results, and allowed lateness captures delayed events.

Why this answer

A fixed 5-minute window with allowed lateness of 2 minutes and a trigger that fires early every minute provides early results and captures late data within the allowed window.
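
How fixed windows and allowed lateness interact can be shown with a little arithmetic. The following is a simplified plain-Python model, not Apache Beam code; the function names are hypothetical and timestamps are plain seconds. It assumes an event is kept as long as the watermark has not yet passed the end of its window plus the allowed lateness.

```python
WINDOW_SIZE = 5 * 60        # 5-minute fixed windows, in seconds
ALLOWED_LATENESS = 2 * 60   # accept events up to 2 minutes late


def window_for(event_time: int) -> tuple:
    """Return the [start, end) fixed window containing the event timestamp."""
    start = event_time - (event_time % WINDOW_SIZE)
    return (start, start + WINDOW_SIZE)


def is_accepted(event_time: int, watermark: int) -> bool:
    """An event is kept while the watermark has not passed
    the end of its window plus the allowed lateness."""
    _, end = window_for(event_time)
    return watermark < end + ALLOWED_LATENESS


# All three events carry timestamp 290, so they belong to window [0, 300).
on_time = is_accepted(event_time=290, watermark=295)   # window still open
late_ok = is_accepted(event_time=290, watermark=400)   # late, within 2 minutes
too_late = is_accepted(event_time=290, watermark=430)  # window already expired
```

The early trigger in option D is independent of this check: it only decides how often partial results are emitted while the window is still collecting data.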

46
MCQmedium

A data pipeline ingests streaming events into Pub/Sub and needs to join them with a slowly updating reference table (few thousand rows) from a Cloud Storage CSV file. The pipeline runs on Dataflow with Apache Beam. Which approach is most cost-effective and operationally simple?

A.Read the CSV in a DoFn and perform a BigQuery query each time an event is processed
B.Use a side input that reads the CSV once and broadcasts it to all workers
C.Implement a custom sink that writes events to Cloud SQL and performs a SQL JOIN there
D.Use CoGroupByKey to join the stream and batch PCollections by a common key after reading the CSV into a batch PCollection each window
AnswerB

Side inputs are designed for such use cases. The CSV is read as a bounded PCollection and used as a side input (e.g., as a Map), enabling efficient, low-latency joins without external calls.

Why this answer

Side inputs allow you to read the reference data once and distribute it to all workers, avoiding repeated lookups to an external service. For a small reference table, this is efficient and simple. Option A issues an external query for every event, which adds latency and cost; Option C adds a Cloud SQL dependency and is overkill; Option D re-reads the CSV every window and is more complex than a side input.
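
The idea behind a side input can be sketched without Beam: parse the small reference file once, then look up each event in memory. This is a plain-Python model under assumed data; the CSV contents, column names, and the `load_reference` and `enrich` helpers are made up for illustration. In a Beam Python pipeline the lookup table would typically be passed to the transform with a wrapper such as `beam.pvalue.AsDict`.

```python
import csv
import io

# Stand-in for the CSV object in Cloud Storage (contents are made up).
REFERENCE_CSV = "product_id,category\np1,books\np2,games\n"


def load_reference(csv_text: str) -> dict:
    """Parse the reference file once into an in-memory lookup table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["product_id"]: row["category"] for row in reader}


def enrich(event: dict, lookup: dict) -> dict:
    """Join one streaming event against the broadcast lookup table."""
    return {**event, "category": lookup.get(event["product_id"], "unknown")}


lookup = load_reference(REFERENCE_CSV)  # read once, shared by every call
events = [{"product_id": "p2", "qty": 1}, {"product_id": "p9", "qty": 3}]
enriched = [enrich(e, lookup) for e in events]
```

No network call happens per event, which is the main reason this pattern is cheaper than querying an external store from inside a DoFn.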

47
Multi-Selecthard

A data team is migrating an on-premises Hadoop cluster to Dataproc. The cluster runs a mix of long-running services (Hive, HBase) and transient Spark jobs. They want to minimize cost while maintaining performance. Which TWO strategies should they implement?

Select 2 answers
A.Consolidate all workloads into a single high-availability cluster
B.Use local SSDs for all nodes to improve I/O performance
C.Use preemptible instances for worker nodes in the transient Spark cluster
D.Use Dataproc on GKE to run long-running services
E.Separate long-running services into a dedicated cluster with standard instances
AnswersC, E

Preemptible workers reduce costs significantly for fault-tolerant batch jobs. Spark can handle node preemption via checkpointing.

Why this answer

Preemptible workers are cost-effective for fault-tolerant Spark jobs. Separating long-running services into a separate cluster avoids interference and allows independent scaling. Using a single-node cluster for services is not practical.

Dataproc on GKE adds complexity. Standard persistent disks are fine for HDFS.

48
MCQhard

A company has a BigQuery table that is partitioned by ingestion time and clustered by the 'customer_id' column. They notice that queries filtering on 'customer_id' are not benefiting from clustering as expected. What is the most likely cause?

A.The query is using a wildcard function that prevents clustering pruning
B.The clustering column must be the same as the partition column
C.The table is too small for clustering to be effective
D.Clustering does not work with ingestion-time partitioning
AnswerC

Clustering is most effective on large tables (>1 GB). On small tables, the benefits are minimal.

Why this answer

Clustering reduces scanned data by letting BigQuery skip storage blocks whose 'customer_id' range cannot match the filter. That only helps when a table or partition is large enough to be split into many blocks. With a small table, or small ingestion-time partitions, there are few blocks to skip, so a filter on 'customer_id' shows little benefit. Clustering does work with ingestion-time partitioning, and the clustering column does not need to match the partition column.

49
Multi-Selectmedium

A company wants to build a real-time dashboard for monitoring application logs. The logs are ingested via Pub/Sub and must be processed with low latency (sub-second). You need to enrich the logs with user metadata from Cloud SQL and store the results in BigQuery for analysis. Which TWO services should be used for the stream processing? (Choose two.)

Select 2 answers
A.Dataproc
B.Cloud Data Fusion
C.Dataflow
D.Cloud Functions
E.Pub/Sub
AnswersC, E

Dataflow handles stream processing with sub-second latency and side inputs.

Why this answer

Dataflow can read from Pub/Sub, enrich with side inputs from Cloud SQL, and write to BigQuery. Pub/Sub is the ingestion point. The pipeline uses Dataflow for stream processing.

50
MCQeasy

A startup needs a fully managed, serverless Spark service to run occasional data processing jobs without managing clusters. They want to pay only for the resources used during job execution. Which Google Cloud service should they use?

A.Dataproc Serverless
B.Dataflow
C.Cloud Data Fusion
D.Dataproc
AnswerA

Dataproc Serverless automatically manages resources for Spark jobs and charges per job.

Why this answer

Dataproc Serverless provides a serverless Spark environment where you pay per job execution. Cloud Data Fusion is for visual ETL. Dataproc is managed but not serverless.

Dataflow is serverless for Beam, not Spark.

51
MCQhard

A Dataproc cluster uses preemptible worker nodes to reduce costs. The cluster runs a long-running Spark job that occasionally experiences worker failures. How should the job be configured to handle preemptible worker failures gracefully?

A.Set spark.task.maxFailures to a high number to allow retries.
B.Disable preemptible workers for the job.
C.Use persistent disks for preemptible workers.
D.Enable automatic restart of the Spark driver on failure.
AnswerA

Increasing maxFailures allows tasks to be retried on remaining workers.

Why this answer

Spark jobs should use checkpointing and handle task retries to survive preemption.

52
Multi-Selectmedium

You are designing a Dataflow pipeline for processing real-time clickstream data. The pipeline must group events into 30-second windows and handle late data up to 5 minutes. You want to output partial results every 10 seconds for low-latency monitoring. Which THREE configurations should you use? (Choose three.)

Select 3 answers
A.Use sliding windows of 30 seconds with a 10-second period
B.Use a trigger that fires after the end of the window
C.Use fixed windows of 30 seconds
D.Set allowed lateness to 5 minutes
E.Use a trigger with early firings every 10 seconds
AnswersC, D, E

Fixed windows give non-overlapping 30-second intervals.

Why this answer

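Fixed windows of 30 seconds produce non-overlapping intervals, and an allowed lateness of 5 minutes keeps each window open for late data. Early firings every 10 seconds produce partial results before the window closes. Sliding windows would overlap, and a trigger that fires only at the end of the window gives no partial results.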

53
MCQeasy

A data engineer wants to create a BigQuery table that is partitioned by day and clustered by user_id and product_id. Which SQL statement should they use?

A.CREATE TABLE mydataset.table (user_id INT64, product_id INT64, event_date DATE) PARTITION BY DATE(event_date) CLUSTER BY user_id, product_id;
B.CREATE TABLE mydataset.table (user_id INT64, product_id INT64, event_date DATE) PARTITION BY event_date CLUSTER BY user_id, product_id;
C.CREATE TABLE mydataset.table (user_id INT64, product_id INT64, event_date DATE) PARTITION BY event_date CLUSTER BY user_id, product_id;
D.CREATE TABLE mydataset.table (user_id INT64, product_id INT64, event_date DATE) CLUSTER BY user_id, product_id PARTITION BY event_date;
AnswerB, C

Options B and C are identical, and both use the valid form: PARTITION BY on the DATE column followed by CLUSTER BY.

Why this answer

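The correct syntax for creating a partitioned and clustered table in BigQuery places PARTITION BY before CLUSTER BY, which rules out option D. Because event_date is already a DATE column, it is used directly as the partition expression; the DATE() wrapper in option A is meant for TIMESTAMP or DATETIME columns.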

54
MCQeasy

A company needs a messaging service for event-driven applications that require low cost for high-throughput, but can tolerate occasional message loss. Which Pub/Sub product should they choose?

A.Pub/Sub with pull subscriptions
B.Pub/Sub with dead letter topics
C.Pub/Sub with push subscriptions
D.Pub/Sub Lite
AnswerD

Pub/Sub Lite offers lower cost with reduced durability guarantees, acceptable for tolerant workloads.

Why this answer

Pub/Sub Lite is designed for cost-sensitive workloads with relaxed durability. Standard Pub/Sub offers at-least-once delivery and high durability. Push vs pull is irrelevant to cost.

55
Multi-Selectmedium

A company wants to use Dataproc Metastore to manage metadata for their Spark jobs. Which TWO benefits does Dataproc Metastore provide?

Select 2 answers
A.Automatic scaling of compute resources
B.High availability with automatic failover
C.Fully managed Hive metastore service
D.Integration with BigQuery
E.Built-in data lineage tracking
AnswersB, C

Dataproc Metastore is a fully managed Hive metastore service with built-in high availability.

Why this answer

Dataproc Metastore offers a managed Hive metastore with high availability and compatibility.

56
MCQmedium

A data engineer needs to design a streaming pipeline that ingests events from multiple sources, enriches them with a lookup table stored in BigQuery (updated every hour), and writes the results to a BigQuery table for real-time dashboards. The pipeline must handle late-arriving data up to 1 hour. Which Dataflow feature should be configured to manage late data?

A.A custom watermark estimation function
B.Using side inputs with a periodic refresh
C.A trigger that fires on every late element
D.Allowed lateness on the window
AnswerD

Setting allowed lateness (e.g., 1 hour) tells the pipeline to keep the window open for late data within that duration. This is the correct way to handle late-arriving events.

Why this answer

Allowed lateness in Apache Beam specifies the maximum delay of late data to be included in the window. Watermark estimation does not control late data acceptance; triggers specify when output is emitted, not how late data is handled. Side inputs are for enrichment, not late data management.

57
MCQeasy

An engineer needs to create a Pub/Sub subscription that sends messages to an HTTPS endpoint. The endpoint must be able to acknowledge messages individually. Which type of subscription should they use?

A.Pull subscription
B.Push subscription
C.BigQuery subscription
D.Cloud Storage subscription
AnswerB

Push subscription sends messages to a webhook endpoint; ack is via HTTP 200.

Why this answer

Push subscriptions deliver messages to a configured HTTPS endpoint. The endpoint can acknowledge by returning a 200 status.

58
MCQmedium

A company uses Cloud Dataproc to run Spark ML training jobs. They want to persist the trained models and metadata in a Hive-compatible metastore. Which Dataproc feature should they use?

A.Cloud Hive Metastore (self-managed)
B.Cloud Bigtable
C.Dataproc Metastore
D.Cloud Data Catalog
AnswerC

Dataproc Metastore is a fully managed, Hive-compatible metastore service.

Why this answer

Dataproc Metastore provides a Hive-compatible metastore that can be used across clusters and services.

59
MCQmedium

A company is using Cloud Storage to store raw logs. They want to use Cloud Data Fusion to transform and load the data into BigQuery on a daily schedule. The transformations are complex and involve joining multiple datasets. What is the most efficient way to run these pipelines?

A.Use Cloud Composer to orchestrate Dataproc jobs that run the transformations
B.Use Cloud Functions to trigger a Dataflow job that does the transformations
C.Use Cloud Data Fusion to design the pipeline and schedule it to run on a Dataproc cluster
D.Use Cloud Dataprep to design the transformation and export to BigQuery
AnswerC

Cloud Data Fusion orchestrates the execution on Dataproc, which is the expected approach.

Why this answer

Cloud Data Fusion supports scheduling pipelines and runs them on Dataproc. This is the standard way to run batch pipelines.

60
MCQeasy

A data engineer needs to process data in a Dataflow pipeline that reads from a Pub/Sub topic. The pipeline must group events into 5-minute windows and compute the average value per key. Which Beam transform should they use after windowing?

A.Combine.perKey
B.ParDo
C.GroupByKey
D.CoGroupByKey
AnswerA

Combine.perKey applies a combining function (e.g., average) per key.

Why this answer

Combine.perKey with an averaging function computes per-key average. GroupByKey groups by key but requires manual combination. ParDo is for element-wise processing.

CoGroupByKey joins multiple PCollections.
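
Why a combiner suits a per-key average can be seen from its accumulator. The sketch below is a plain-Python model, not Beam code: the method names mirror those of Beam's `CombineFn`, while the `MeanCombiner` class, the `combine_per_key` driver, and the two-shard split are assumptions for illustration. Keeping a (sum, count) pair lets partial results from different workers be merged before the final division.

```python
from collections import defaultdict


class MeanCombiner:
    """Plain-Python model of a combiner; method names mirror Beam's CombineFn."""

    def create_accumulator(self):
        return (0.0, 0)  # (running sum, count)

    def add_input(self, acc, value):
        total, count = acc
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")


def combine_per_key(pairs, combiner, shards=2):
    """Combine (key, value) pairs, pre-aggregating on separate shards first."""
    partial = [defaultdict(combiner.create_accumulator) for _ in range(shards)]
    for i, (key, value) in enumerate(pairs):
        shard = partial[i % shards]
        shard[key] = combiner.add_input(shard[key], value)
    keys = {k for shard in partial for k in shard}
    return {
        k: combiner.extract_output(
            combiner.merge_accumulators([s[k] for s in partial if k in s])
        )
        for k in keys
    }


means = combine_per_key([("a", 2), ("a", 4), ("b", 10), ("a", 6)], MeanCombiner())
```

Averaging the per-shard averages would give the wrong answer for key "a" here, which is why the accumulator carries the count alongside the sum.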

61
MCQeasy

Which Google Cloud service provides a visual interface for building ETL pipelines using a drag-and-drop design and includes pre-built transforms from a marketplace?

A.Dataproc
B.Cloud Data Fusion
C.Dataprep
D.BigQuery
AnswerB

Cloud Data Fusion is a visual ETL tool with CDAP plugins and a Hub for marketplace transforms.

Why this answer

Cloud Data Fusion offers a visual, code-free ETL tool with a rich set of plugins from the Hub. Dataprep is for data wrangling, not full ETL.

62
MCQhard

You are designing a Dataflow pipeline that reads from Pub/Sub, aggregates events into 10-minute windows, and writes the results to BigQuery. The pipeline must reliably handle late-arriving data (up to 1 hour) and prevent duplicate aggregations. Which combination of pipeline options should you use?

A.Use exactly-once processing by setting the pipeline's streaming engine to exactly-once and using a BigQuery sink with exactly-once semantics
B.Use at-least-once processing and rely on BigQuery's automatic deduplication
C.Use exactly-once processing and write results to a staging table, then use a scheduled merge query to combine with the main table
D.Use at-most-once processing to guarantee no duplicates, and accept data loss
AnswerA

Dataflow's exactly-once sink for BigQuery ensures no duplicates even with late data, using a combination of idempotent writes and deduplication.

Why this answer

To prevent duplicate aggregations, you need exactly-once processing. Dataflow processes each record exactly once within the pipeline, and the BigQuery sink must then write each result once, for example with the FILE_LOADS write method or with idempotent writes based on a deduplication key. Relying on at-least-once processing with deduplication in BigQuery is not reliable.

Among the options, only A pairs exactly-once processing in Dataflow with a BigQuery sink that has exactly-once semantics.
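
The role of idempotent writes can be illustrated with two toy sinks. This is a plain-Python model rather than the BigQuery or Beam API; the class names, the (window, key) row identifier, and the sample values are hypothetical. It assumes the runner may retry a write that already succeeded.

```python
class IdempotentSink:
    """Model of a sink that upserts by a deterministic key, so retries are harmless."""

    def __init__(self):
        self.rows = {}

    def write(self, window_start: int, key: str, total: int) -> None:
        # The same (window, key) always maps to the same row.
        self.rows[(window_start, key)] = total


class AppendSink:
    """Model of a naive sink that appends every write, including retries."""

    def __init__(self):
        self.rows = []

    def write(self, window_start: int, key: str, total: int) -> None:
        self.rows.append((window_start, key, total))


# The first aggregate is written twice because the write was retried.
attempts = [(0, "user-1", 7), (0, "user-1", 7), (600, "user-1", 3)]
idempotent, naive = IdempotentSink(), AppendSink()
for attempt in attempts:
    idempotent.write(*attempt)
    naive.write(*attempt)

idempotent_total = sum(idempotent.rows.values())
naive_total = sum(total for _, _, total in naive.rows)
```

The keyed sink ends with a total of 10, while the append-only sink reports 17 because the retried row is counted twice.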

63
MCQeasy

You need to process large-scale log files (hundreds of terabytes) using Apache Spark on Google Cloud. The job runs nightly and you want to minimise costs. Which Dataproc cluster configuration is MOST cost-effective?

A.Single-node cluster
B.Standard cluster with preemptible workers for the primary worker nodes
C.Standard cluster with preemptible secondary workers
D.Standard cluster with all standard (non-preemptible) workers
AnswerC

Secondary workers can be preemptible, which reduces cost. Primary workers cannot be preemptible and must be standard VMs.

Why this answer

Preemptible VMs are significantly cheaper than standard VMs and suitable for fault-tolerant batch jobs like nightly Spark processing. Standard mode is fine but using preemptible workers reduces cost.

64
MCQeasy

You need to allow a data analyst to run queries on a BigQuery dataset but prevent them from modifying the data or deleting the dataset. Which IAM role should you grant?

A.roles/bigquery.dataOwner
B.roles/bigquery.dataViewer
C.roles/bigquery.jobUser
D.roles/bigquery.dataEditor
AnswerB

DataViewer grants read-only access to data and metadata, ideal for analysts.

Why this answer

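BigQuery Data Viewer (roles/bigquery.dataViewer) grants read-only access to datasets, tables, and their metadata, so the analyst cannot modify or delete data. Running a query job additionally requires the bigquery.jobs.create permission, which roles/bigquery.jobUser provides at the project level.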

65
MCQhard

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. The pipeline must handle late-arriving data (up to 1 hour) and group events into 10-minute windows. Which configuration is correct?

A.Use global windows with a trigger that fires every 10 minutes
B.Use sliding windows of 10 minutes with a 5-minute period and allowed lateness of 1 hour
C.Use fixed windows of 10 minutes with allowed lateness of 0 seconds
D.Use fixed windows of 10 minutes with allowed lateness of 1 hour and a trigger that fires after watermark plus early firings
AnswerD

This allows late data up to 1 hour and provides timely results.

Why this answer

To handle late data, you need to set the allowed lateness to 1 hour. The trigger with AfterWatermark with early firings ensures results are emitted on time and updated when late data arrives.
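
What "updated when late data arrives" means in practice depends on the accumulation mode. The sketch below is a plain-Python model of one window's firings, not Beam code; the `pane_outputs` helper and the sample values are assumptions. Each inner list holds the elements that arrived since the previous firing.

```python
def pane_outputs(panes, accumulating: bool):
    """Return the value emitted at each firing of one window.

    `panes` lists the new elements that arrived before each firing:
    early firing(s), the on-time firing at the watermark, then late firing(s).
    """
    outputs, running = [], 0
    for new_elements in panes:
        pane_sum = sum(new_elements)
        running += pane_sum
        outputs.append(running if accumulating else pane_sum)
    return outputs


# One 10-minute window: an early pane, the on-time pane, and one late pane.
panes = [[3, 4], [5], [2]]
accumulating = pane_outputs(panes, accumulating=True)
discarding = pane_outputs(panes, accumulating=False)
```

With accumulating panes the outputs are [7, 12, 14], so the destination should overwrite the previous value for the window. With discarding panes the outputs are [7, 5, 2], so the destination should add them up.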

66
MCQmedium

A company wants to use Cloud Data Fusion for ETL pipelines. They need to integrate with custom transformations not available in the marketplace. What should they do?

A.Switch to Dataproc and write a Spark job.
B.Use the Data Fusion Hub to download a custom plugin.
C.Use Dataprep to create the transformation.
D.Write a custom plugin using the CDAP SDK and deploy it.
AnswerD

The CDAP SDK allows building custom plugins.

Why this answer

Cloud Data Fusion supports custom plugins using the CDAP SDK.

67
MCQeasy

A data engineer needs to process streaming data from thousands of IoT devices and generate real-time dashboards. The data volume is low but requires exactly-once processing semantics. Which Google Cloud service combination should they use?

A.Cloud Pub/Sub + Cloud Data Fusion
B.Cloud Pub/Sub + Cloud Dataproc
C.Cloud Pub/Sub + Cloud Dataflow
D.Cloud Pub/Sub + Cloud Dataprep
AnswerC

Cloud Pub/Sub for ingestion with at-least-once delivery, combined with Cloud Dataflow which provides exactly-once processing via its streaming engine.

Why this answer

Dataflow supports exactly-once processing via its streaming engine and checkpointing. Pub/Sub is the ingest service. Together they provide the required semantics.

68
Multi-Selecthard

Your company runs a Dataflow streaming pipeline that processes user activity from Pub/Sub and writes aggregated results to BigQuery. Lately, the pipeline is experiencing high latency and backlog growth during peak hours. You need to troubleshoot and improve performance. Which THREE actions should you take? (Choose 3.)

Select 3 answers
A.Change the worker machine type to a higher CPU/memory configuration
B.Decrease the window duration to reduce data per window
C.Enable Dataflow Streaming Engine
D.Increase the number of workers in the pipeline
E.Add additional Pub/Sub subscriptions to the same topic
AnswersA, C, D

More CPU/memory per worker can speed up processing if the transform is compute-intensive.

Why this answer

Increasing the number of workers allows the pipeline to process more data in parallel. Using streaming engine can improve throughput and reduce latency by offloading state management. Adjusting the worker machine type to use more CPU/memory can help if the processing is compute-intensive.

Adding more subscriptions would not help because the pipeline reads from a single subscription. Changing the window size affects business logic but not necessarily performance. Combining these three optimizations addresses common bottlenecks.

69
MCQmedium

A company needs to process streaming sensor data and run both real-time analytics and batch reanalysis on historical data. They want to minimize infrastructure management. Which architecture and service combination is MOST suitable?

A.Kappa architecture with Pub/Sub and Dataflow for both real-time and batch processing
B.Lambda architecture with Pub/Sub for streaming and Cloud Storage for batch, processed by Dataflow
C.Batch processing only with Dataflow and Cloud Storage, ignoring real-time needs
D.Kappa architecture with Pub/Sub Lite and Dataflow Serverless
AnswerA

Kappa architecture uses a single streaming pipeline, and Dataflow can replay from Pub/Sub for batch reanalysis, minimizing management.

Why this answer

Kappa architecture processes all data as a stream, avoiding separate batch/speed layers. Pub/Sub ingests streaming data, and Dataflow with Apache Beam can handle both real-time and batch (replay) pipelines, minimizing infrastructure management.

70
MCQmedium

You need to create a BigQuery table that stores customer transaction data. The table will be queried frequently by a customer_id column to retrieve recent transactions (last 30 days). Which table design optimizes query performance and cost?

A.Partition by customer_id and cluster by transaction_date
B.Partition by ingestion_time and cluster by customer_id
C.Cluster by transaction_date and customer_id without partitioning
D.Partition by transaction_date and cluster by customer_id
AnswerD

This design minimizes scanned bytes by pruning partitions on date and cluster blocks on customer_id.

Why this answer

Partitioning by transaction_date allows queries to scan only relevant partitions. Clustering by customer_id sorts data within each partition by customer_id, further reducing the amount of data scanned for queries filtering on customer_id. This combination is best for time-range queries with frequent customer_id filters.
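
As a rough sketch of what option D looks like in practice, the Python helpers below build the BigQuery DDL and a matching query as strings. The table name `my_dataset.transactions`, the column list, and the `@customer_id` query parameter are assumptions made for illustration; only `transaction_date` and `customer_id` come from the question.

```python
def build_transactions_ddl(table: str = "my_dataset.transactions") -> str:
    """Return BigQuery DDL for a date-partitioned, customer-clustered table."""
    return (
        f"CREATE TABLE `{table}` (\n"
        "  transaction_id STRING,\n"
        "  customer_id STRING,\n"
        "  transaction_date DATE,\n"
        "  amount NUMERIC\n"
        ")\n"
        "PARTITION BY transaction_date\n"
        "CLUSTER BY customer_id"
    )


def build_recent_query(table: str = "my_dataset.transactions") -> str:
    """Return a query that filters on both the partition and cluster columns."""
    return (
        "SELECT transaction_id, amount\n"
        f"FROM `{table}`\n"
        # The date filter lets BigQuery skip partitions older than 30 days.
        "WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)\n"
        # The customer filter benefits from clustering within each partition.
        "  AND customer_id = @customer_id"
    )


ddl = build_transactions_ddl()
query = build_recent_query()
print(ddl)
print(query)
```

The query filters on the partitioning column first and the clustering column second, which matches the access pattern described in the question.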

71
MCQmedium

You are designing a streaming pipeline that needs to handle sudden spikes in traffic without losing data. The pipeline uses Pub/Sub and Dataflow. Which configuration ensures data is not lost if Dataflow falls behind?

A.Use Pub/Sub with a pull subscription and set the message retention duration to 7 days
B.Use Cloud Pub/Sub Lite with a smaller retention period
C.Use Pub/Sub with a push subscription and increase the acknowledgment deadline
D.Use Pub/Sub with exactly-once delivery and Dataflow with at-least-once processing
AnswerA

Pull subscriptions allow Dataflow to control the pace. 7-day retention lets Dataflow catch up after spikes.

Why this answer

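Pub/Sub retains unacknowledged messages for up to 7 days, so a backlog that builds up during a spike stays available until Dataflow catches up. Dataflow uses checkpointing to track progress and acknowledges messages only after they have been durably processed. Together, these ensure no data is lost while the pipeline is behind.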
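
A back-of-the-envelope check can show whether a given spike fits inside the retention window. The Python sketch below is a simplified model, not a Pub/Sub or Dataflow API: it assumes a constant processing rate, roughly in-order consumption, and hypothetical traffic figures.

```python
RETENTION_SECONDS = 7 * 24 * 60 * 60  # 7-day retention from option A


def max_message_age_seconds(
    spike_rate: float, spike_seconds: float, process_rate: float
) -> float:
    """Estimate the longest time a message waits before being processed.

    Assumes messages are consumed roughly in order at a constant rate,
    so the message published at the end of the spike waits the longest.
    """
    if process_rate <= 0:
        raise ValueError("process_rate must be positive")
    # Backlog grows only while publishing outpaces processing.
    backlog_peak = max(0.0, (spike_rate - process_rate) * spike_seconds)
    return backlog_peak / process_rate


# Hypothetical numbers: a one-hour spike of 5,000 msg/s against a
# pipeline that sustains 2,000 msg/s.
age = max_message_age_seconds(
    spike_rate=5000, spike_seconds=3600, process_rate=2000
)
is_safe = age < RETENTION_SECONDS
```

With these assumed figures the oldest message waits about 90 minutes, far below the 7-day limit. Data would only be at risk if the estimated wait approached the retention duration.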

72
MCQhard

A company uses BigQuery with partitioned tables by ingestion time. They notice that queries scanning recent partitions are fast but queries scanning older partitions are slow. What is the most likely cause?

A.Older partitions are stored on slower storage tiers
B.The older partitions lack clustering metadata because clustering was enabled after data was ingested
C.The table has too many partitions, causing high metadata overhead
D.Queries are using different SQL syntax for older partitions
AnswerB

Clustering is applied only to data ingested after it is enabled. Older partitions remain unclustered, slowing queries.

Why this answer

Clustering can improve query performance by sorting data within partitions. If older partitions were created before clustering was enabled, they are not clustered, leading to slower scans. Re-clustering only applies to new data unless you manually rewrite older partitions.

73
MCQeasy

Your data engineering team needs to process a continuous stream of clickstream events from a website and update a real-time dashboard showing user activity over the last hour. The pipeline should have minimal operational overhead and support exactly-once processing semantics. Which Google Cloud service should you use?

A.Cloud Dataproc with Apache Spark Streaming
B.Cloud Data Fusion with batch pipelines
C.Cloud Dataflow with Apache Beam
D.Cloud Pub/Sub Lite with push subscriptions
AnswerC

Dataflow is fully managed, supports streaming with exactly-once semantics, and integrates well with Pub/Sub and BigQuery for real-time dashboards.

Why this answer

Dataflow with Apache Beam is the only service among the options that natively supports exactly-once processing and can handle both streaming and batch pipelines with minimal operational overhead. Dataflow is fully managed, handles autoscaling, and provides exactly-once guarantees for streaming data.

74
MCQeasy

Which Google Cloud service provides a serverless Spark environment where you can run Spark jobs without provisioning or managing a cluster?

A.Dataflow
B.Dataproc Serverless
C.Dataprep
D.Cloud Data Fusion
AnswerB

Dataproc Serverless provides a fully managed, serverless Spark runtime. You submit jobs and Google Cloud manages the cluster.

Why this answer

Dataproc Serverless allows you to submit Spark jobs that run on auto-scaled infrastructure without cluster management. Dataflow is for Beam pipelines. Dataprep is for data wrangling. Data Fusion is for visual ETL.

75
Multi-Selectmedium

A company is evaluating BigQuery for a data warehouse migration. They have a mix of reporting queries and ad-hoc analytical queries. They want to control query costs and prevent runaway queries. Which THREE strategies should they implement?

Select 3 answers
A.Grant authorized view access to limit data visibility
B.Set a custom quota for concurrent queries
C.Partition and cluster tables to reduce bytes processed
D.Create materialized views for all reporting queries
E.Use BigQuery reservations (flex slots) for predictable workloads
AnswersB, C, E

Custom quotas limit the number of concurrent queries or bytes processed, preventing excessive resource usage.

Why this answer

Custom quotas cap query usage. Reservations (flex slots) provide dedicated capacity and predictable pricing. Partitioning and clustering reduce the data scanned, lowering cost.

Authorized views control access but not cost. Materialized views can reduce cost but are not a cost control mechanism per se.
