CCNA Pde Ingestion Processing Questions — Page 1 of 2

Multi-Select · medium

A company needs to stream data from a MySQL database to BigQuery with a latency under 10 seconds. They also need to handle schema changes automatically. Which TWO services should they combine?

Select 2 answers

A.BigQuery

B.Datastream

C.Dataflow

D.Pub/Sub

E.Cloud SQL

Answers: A, B

Target for the streamed data.

Why this answer

Datastream captures CDC and can write to BigQuery directly. Pub/Sub is not needed if Datastream writes directly.

MCQ · medium

A data engineer needs to transfer 5 PB of historical data from an on-premises Hadoop cluster to Cloud Storage. The network bandwidth is limited to 1 Gbps, and the transfer must complete within 30 days. Which transfer method should they use?

A.gsutil rsync over the internet

B.BigQuery Data Transfer Service

C.Storage Transfer Service for on-premises

D.Transfer Appliance

Answer: D

Why this answer

Option D is correct because the Transfer Appliance is a physical device designed for large-scale data transfers when network bandwidth is insufficient. With 5 PB of data and a 1 Gbps link, the best-case transfer time is roughly 460 to 520 days, depending on whether PB is read as decimal or binary (5 PB × 8 bits/byte / 1 Gbps / 86,400 seconds/day), far exceeding the 30-day window. The Transfer Appliance bypasses network constraints by shipping data physically to Google Cloud.
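The bandwidth arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes decimal units (1 PB = 10^15 bytes, 1 Gbps = 10^9 bits/s) and a fully utilized link with no protocol overhead. The function name `transfer_days` is made up for illustration.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Best-case number of days to push size_bytes over a saturated link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


PB = 10**15    # decimal petabyte (assumption: the question does not say PB vs PiB)
GBPS = 10**9   # 1 gigabit per second

days = transfer_days(5 * PB, 1 * GBPS)
deadline_days = 30

print(f"Best case: {days:.0f} days; deadline: {deadline_days} days")
print("Fits the window" if days <= deadline_days else "Does not fit the window")
```

With decimal units the result is about 463 days, and reading PB as PiB pushes it to about 521 days. Either way the transfer exceeds the 30-day window by more than an order of magnitude, which is why no online option can satisfy the question.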

Exam trap

The trap here is that candidates may overestimate network transfer speeds or assume that cloud-native services like Storage Transfer Service can handle any volume, ignoring the fundamental bandwidth math that makes physical shipping the only viable option for 5 PB within 30 days.

How to eliminate wrong answers

Option A is wrong because gsutil rsync over the internet at 1 Gbps would take approximately 500 days to transfer 5 PB, which exceeds the 30-day deadline; it also lacks reliability for such massive transfers over a public network. Option B is wrong because BigQuery Data Transfer Service is designed for scheduled imports into BigQuery from SaaS applications and other supported sources (e.g., Google Ads, Amazon S3) and does not support direct on-premises Hadoop transfers. Option C is wrong because Storage Transfer Service for on-premises still moves the data over the network and therefore relies on the same 1 Gbps bandwidth, making it impossible to meet the 30-day requirement.

MCQ · medium

You are building a Dataflow pipeline in Python that reads messages from Pub/Sub, enriches them with data from a BigQuery table, and writes the results to BigQuery. The enrichment lookup table is large and changes infrequently. Which approach minimizes cost and latency?

A.Use a CoGroupByKey transform to join the incoming stream with a stream from BigQuery.

B.Use BigQuery IO to query the table for every incoming message.

C.Use a side input that reads the BigQuery table periodically and caches it.

D.Use a stateful DoFn and store the lookup in state per key.

Answer: C

Side inputs are ideal for distributing a static lookup table to all workers. The data can be refreshed on a schedule.

Why this answer

Option C is correct because using a side input that periodically reads the BigQuery table and caches it avoids querying BigQuery for every incoming message, which would be prohibitively expensive and high-latency. The side input can be refreshed at an interval you choose (e.g., every 10 minutes) using Beam's slowly updating side input pattern, in which a periodic trigger re-reads the table; this is wired into the pipeline code rather than set through a pipeline option. The cached data is broadcast to all workers, enabling fast, in-memory lookups without per-element I/O. This approach minimizes cost by reducing BigQuery API calls and minimizes latency by avoiding synchronous queries for each message.
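The caching idea behind a periodically refreshed side input can be sketched in plain Python. This is not the Beam API: the class `RefreshingLookup`, the `load_table` function, and the sample rows are hypothetical stand-ins, and a real pipeline would let the runner distribute the table to workers. The sketch only shows why one bulk read per interval replaces one query per message.

```python
import time


class RefreshingLookup:
    """In-memory lookup table that reloads itself after a fixed interval."""

    def __init__(self, loader, refresh_seconds, clock=time.monotonic):
        self._loader = loader                  # callable returning a dict (one bulk read)
        self._refresh_seconds = refresh_seconds
        self._clock = clock                    # injectable so the demo needs no real waiting
        self._table = {}
        self._loaded_at = None

    def get(self, key, default=None):
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._refresh_seconds
        if stale:
            self._table = self._loader()       # the only place the source table is read
            self._loaded_at = now
        return self._table.get(key, default)


load_calls = []                                # records each bulk read of the table


def load_table():
    load_calls.append(1)
    return {"u1": "gold", "u2": "silver"}      # stand-in for the BigQuery lookup table


fake_now = [0.0]
lookup = RefreshingLookup(load_table, refresh_seconds=600, clock=lambda: fake_now[0])

# Four incoming "messages" are enriched from memory after a single table read.
enriched = [(uid, lookup.get(uid, "unknown")) for uid in ["u1", "u2", "u3", "u1"]]
calls_before_refresh = len(load_calls)

fake_now[0] = 601.0                            # ten minutes later the table is re-read once
lookup.get("u1")
calls_after_refresh = len(load_calls)
```

Four lookups cost one table read, and the second read happens only after the refresh interval has elapsed, which is the cost and latency profile the explanation describes.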

Exam trap

The exam often tests the misconception that querying a database per message is acceptable in streaming pipelines. The trap is that candidates overlook the cost and latency of per-element I/O, especially given BigQuery's pricing model and query latency.

How to eliminate wrong answers

Option A is wrong because CoGroupByKey requires both inputs to be bounded or both unbounded streams; here, the BigQuery table is a bounded dataset, and Pub/Sub is unbounded, so CoGroupByKey would not work without windowing and would introduce unnecessary complexity and latency. Option B is wrong because querying BigQuery for every incoming message would cause extremely high API costs (BigQuery charges per byte processed) and high latency (each query takes hundreds of milliseconds to seconds), making it impractical for a streaming pipeline. Option D is wrong because storing the lookup in state per key would require partitioning the lookup table across keys, which is inefficient for a large, infrequently changing table; state is per-key and not shared across keys, so each worker would need to load and maintain its own copy, leading to memory waste and complex state management.

MCQ · medium

You need to perform a one-time migration of historical data from an on-premises Teradata data warehouse to BigQuery. The data volume is 50 TB and you have a high-speed network connection (10 Gbps). What is the most efficient way to load the data?

A.Export data from Teradata to CSV files, upload to GCS using gsutil, then load into BigQuery.

B.Use Dataproc to run a Spark job that reads from Teradata and writes to BigQuery.

C.Use Transfer Appliance to ship the data offline.

D.Use BigQuery Data Transfer Service for Teradata

Answer: D

This service automates the transfer from Teradata to BigQuery, handling schema and data types.

Why this answer

BigQuery Data Transfer Service for Teradata is designed for this purpose; it can directly connect to Teradata and transfer data to BigQuery. Exporting to CSV then loading via gsutil is possible but less efficient. Transfer Appliance is for offline transfer but you have high-speed network.

Dataproc is not needed.

MCQ · medium

A company uses Workflows to orchestrate a series of Google Cloud services for data processing. They need to call an external HTTP API as part of the workflow and handle potential failures with retries. Which Workflows feature should they use?

A.Retry policy on the step

B.Subworkflows

C.Parallel steps

D.Conditional steps

Answer: A

Retry policy allows specifying retry conditions and limits for a step.

Why this answer

Option A is correct because Workflows provides a built-in retry policy that can be configured on individual steps to automatically retry an HTTP call upon transient failures (e.g., 5xx server errors or network timeouts). This allows the workflow to handle external API failures without custom code, using exponential backoff and a maximum retry count.
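What a retry policy does (a retry limit plus exponential backoff) can be illustrated in plain Python. This is not Workflows syntax, where the policy is declared on the step. The names `call_with_retry`, `TransientError`, and `flaky_api`, and the default values, are hypothetical.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 503 or a timeout."""


def call_with_retry(fn, max_retries=5, initial_delay=1.0, multiplier=2.0,
                    max_delay=60.0, sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise                          # retries exhausted: surface the error
            sleep(delay)
            delay = min(delay * multiplier, max_delay)


attempts = []


def flaky_api():
    """Fails twice with a transient error, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("HTTP 503")
    return "ok"


waits = []                                     # capture the delays instead of really sleeping
result = call_with_retry(flaky_api, sleep=waits.append)
```

The call succeeds on the third attempt after waiting 1 second and then 2 seconds. A managed retry policy gives the same behavior without this loop living in user code.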

Exam trap

The exam often tests the distinction between workflow orchestration features (retry, subworkflows, parallel, conditional). Candidates mistakenly choose parallel steps or subworkflows thinking they inherently provide fault tolerance, but only a retry policy directly addresses automatic retries on failure.

How to eliminate wrong answers

Option B is wrong because subworkflows are used to encapsulate reusable sequences of steps, not to handle retries on a single HTTP call. Option C is wrong because parallel steps execute multiple branches concurrently, which does not provide retry logic for a single failing step. Option D is wrong because conditional steps (e.g., switch/if-else) control the flow based on conditions but do not automatically retry a failed HTTP request.

MCQ · hard

Your team has a Dataflow pipeline that reads from BigQuery, transforms data, and writes to GCS. The pipeline is failing with 'Out of Memory' errors on the worker nodes. The input data is large but fits within the total cluster memory. Which configuration change is most likely to resolve the issue without increasing costs significantly?

A.Use a worker machine type with more memory, such as n2-highmem.

B.Shard the input into smaller reads using a BigQuery query.

C.Increase the disk size per worker.

D.Enable Dataflow Prime with vertical autoscaling.

Answer: A

High-memory machines provide more memory per core, addressing OOM.

Why this answer

The default Dataflow worker machine type may have insufficient memory per core for the pipeline's operations. Using a high-memory machine type (e.g., n2-highmem) increases memory per worker without necessarily increasing the number of workers, thus controlling costs.

Multi-Select · easy

A company uses Cloud Datastream to replicate data from a MySQL database to BigQuery in near real-time. Which TWO BigQuery features are automatically used by Datastream for optimal performance and consistency? (Choose TWO.)

Select 2 answers

A.BigQuery Data Transfer Service

B.BigQuery legacy streaming inserts

C.A Dataflow pipeline to transform the data

D.BigQuery Storage Write API

E.A materialized view that merges the change stream into the final table

Answers: D, E

Datastream uses the Storage Write API for streaming replication.

Why this answer

Datastream uses the BigQuery Storage Write API (option D) to stream change data capture (CDC) events into BigQuery with exactly-once semantics and high throughput. It also automatically creates a materialized view (option E) that merges the change stream into the final table, ensuring consistent, near real-time replication without manual merge logic.

Exam trap

The exam often tests the misconception that Datastream requires an intermediate Dataflow pipeline or legacy streaming inserts, when in fact it natively leverages the Storage Write API and materialized views for optimal performance and consistency.

MCQ · hard

A team uses dbt on BigQuery to transform data in their data warehouse. They have a large table with nested and repeated fields (arrays and structs). The transformation needs to normalize this data into a star schema. Which dbt feature and BigQuery SQL feature should they use together?

A.dbt hooks with BigQuery STRUCT access

B.dbt models with BigQuery UNNEST and CROSS JOIN

C.dbt snapshots with BigQuery JSON functions

D.dbt seeds with BigQuery ARRAY_AGG

Answer: B

UNNEST flattens arrays into rows, and CROSS JOIN with UNNEST is the standard way to normalize nested data in BigQuery.

Why this answer

To normalize nested and repeated fields (arrays and structs) into a star schema, you need to flatten the arrays into separate rows. BigQuery's UNNEST operator, when used with CROSS JOIN, expands each array element into its own row, effectively flattening the nested structure. dbt models (SQL SELECT statements) are the correct dbt feature to define these transformations as version-controlled, reusable SQL files. Together, they allow you to write a dbt model that uses CROSS JOIN UNNEST to produce dimension and fact tables from a single nested table.

Exam trap

The exam often tests the distinction between features that manipulate data structure (UNNEST) versus features for data lifecycle (snapshots, hooks) or data loading (seeds), leading candidates to confuse the purpose of dbt hooks or snapshots with transformation logic.

How to eliminate wrong answers

Option A is wrong because dbt hooks are SQL statements executed at specific points in the dbt run (e.g., before/after model builds) and are not designed for transforming nested data into a star schema; BigQuery STRUCT access alone cannot flatten arrays. Option C is wrong because dbt snapshots are used for slowly changing dimension (SCD) tracking over time, not for normalizing nested data; BigQuery JSON functions are for parsing JSON strings, not for unnesting native arrays and structs. Option D is wrong because dbt seeds are CSV files loaded into the warehouse as static lookup tables, not for transforming existing data; ARRAY_AGG is an aggregation function that creates arrays, the opposite of the flattening needed here.

MCQ · easy

Which Dataflow feature automatically scales the number of workers based on the pipeline's current workload, and also selects the optimal machine type for each worker based on the pipeline's resource requirements?

A.Dataflow Shuffle

B.Dataflow Prime

C.Dataflow Streaming Engine

D.Dataflow Flex Templates

Answer: B

Dataflow Prime offers vertical autoscaling and right-fitting of worker machine types.

Why this answer

Dataflow Prime is the correct answer because it is the only Dataflow feature that provides both automatic worker scaling (horizontal autoscaling) and intelligent machine type selection (vertical autoscaling). It dynamically adjusts the number of workers based on the pipeline's current workload and selects the optimal machine type (e.g., CPU, memory, or accelerator-optimized) for each worker based on the pipeline's resource requirements, such as CPU utilization, memory pressure, or shuffle throughput.

Exam trap

The exam often tests the distinction between horizontal autoscaling (adding/removing workers) and vertical autoscaling (changing machine type), and the trap here is that candidates assume Dataflow Shuffle or Streaming Engine handle scaling, when in fact they only optimize specific pipeline phases (shuffle or state management) without affecting worker count or machine type.

How to eliminate wrong answers

Option A is wrong because Dataflow Shuffle is a service that separates the shuffle operation from worker VMs, improving scalability and reliability, but it does not handle worker scaling or machine type selection. Option C is wrong because Dataflow Streaming Engine moves state storage and computation away from worker VMs for streaming pipelines, reducing resource overhead, but it does not automatically scale workers or select machine types. Option D is wrong because Dataflow Flex Templates allow you to package and reuse pipeline code with custom container images, but they do not provide any autoscaling or machine type optimization; scaling is handled separately by the Dataflow service.

MCQ · hard

Your team is processing a large dataset with Apache Beam on Dataflow. The pipeline sometimes fails due to transient errors when writing to a BigQuery sink. You need to ensure that failed records are not lost and can be reprocessed later without blocking the pipeline. What is the best approach?

A.Configure the pipeline to use at-least-once semantics and rely on Dataflow to retry the entire bundle.

B.Increase the number of workers to reduce the chance of transient errors.

C.Use a try-catch block in the DoFn and log the error; continue processing other elements.

D.Use a side output (e.g., via TupleTag) to write failed records to a dead letter sink (e.g., GCS or Pub/Sub) and continue processing the main output.

Answer: D

This pattern isolates bad records, allows the pipeline to continue, and stores the failed records for later reprocessing.

Why this answer

Using a dead letter pattern with a side output to write failed records to a GCS bucket (or Pub/Sub) allows the pipeline to continue processing healthy records while failed records are stored for later analysis and reprocessing.

Multi-Select · easy

A data engineer needs to load data from CSV files in Cloud Storage into BigQuery. The CSV files have a header row and some columns contain nested JSON strings. Which TWO methods can they use to load this data into BigQuery?

Select 2 answers

A.Use Datastream to load CSV files

B.Use the Storage Write API to write rows from a custom application

C.Create a federated query using an external table

D.Create a BigQuery load job with the CSV format

E.Use gsutil to copy files into BigQuery

Answers: B, D

The Storage Write API can be used to stream data from CSV after parsing.

Why this answer

Option B is correct because the Storage Write API allows a custom application to stream data row-by-row into BigQuery, which can handle CSV files with nested JSON strings by parsing them in the application code before writing. This method supports complex data transformations and is suitable for real-time or near-real-time ingestion.

Exam trap

The exam often tests the distinction between loading data into BigQuery (permanent storage) versus querying external data sources (federated queries), causing candidates to mistakenly choose Option C as a valid loading method.

MCQ · medium

An organization needs to continuously replicate change data from a MySQL database to BigQuery with sub-minute latency. The database is running on-premises. Which Google Cloud service should they use?

A.Cloud Pub/Sub with a custom connector to MySQL

B.BigQuery Data Transfer Service for MySQL

C.Cloud Dataflow with a JDBC source

D.Cloud Datastream

Answer: D

Datastream is purpose-built for CDC from common databases to BigQuery or GCS.

Why this answer

Datastream is a serverless change data capture (CDC) service that can replicate from MySQL, PostgreSQL, and Oracle to BigQuery or GCS with low latency. Pub/Sub is a messaging service and does not natively connect to MySQL. Dataflow can process streams but requires a CDC connector.

BigQuery Data Transfer Service does not support live CDC from on-prem MySQL.

MCQ · easy

An organization wants to ingest on-premises Oracle database changes into BigQuery for real-time analytics with minimal latency. The Oracle database is version 19c and has a high transactional volume. Which Google Cloud service should they use?

A.Datastream

B.Pub/Sub

C.Dataflow

D.Storage Transfer Service

Answer: A

Datastream supports real-time CDC from Oracle to BigQuery and GCS.

Why this answer

Datastream is the correct service because it is purpose-built for real-time, serverless change data capture (CDC) from Oracle databases (including 19c) to BigQuery. It uses LogMiner to read redo logs with minimal overhead, supporting high transactional volumes and low, near real-time latency without requiring custom code or manual schema management.

Exam trap

The trap here is that candidates often confuse Pub/Sub or Dataflow as generic streaming solutions, overlooking that Datastream is the only Google Cloud service that natively supports Oracle CDC without requiring additional infrastructure or custom connectors.

How to eliminate wrong answers

Option B (Pub/Sub) is wrong because it is a messaging service for asynchronous event ingestion, not a CDC tool; it cannot directly read Oracle redo logs or handle schema evolution from a source database. Option C (Dataflow) is wrong because, while it can process streaming data, it requires a separate CDC connector (e.g., Debezium) and manual pipeline setup, adding complexity and latency compared to Datastream's managed Oracle-to-BigQuery integration. Option D (Storage Transfer Service) is wrong because it is designed for bulk file transfers (e.g., from on-premises NAS or S3) and cannot capture real-time database changes or connect to Oracle redo logs.

MCQ · medium

A data engineer is building a Dataflow pipeline in Python that reads from BigQuery, transforms data, and writes to Cloud Storage. The pipeline will be deployed in production. Which approach should they use to ensure the pipeline is reusable across environments with different configuration parameters?

A.Create a separate pipeline for each environment with hardcoded values

B.Use a Dataflow Classic Template

C.Use a Dataflow Flex Template

D.Run the pipeline using the DirectRunner for each environment

Answer: C

Flex Templates support runtime parameters and are best for production deployment across environments.

Why this answer

Option C is correct because Dataflow Flex Templates allow you to package a Docker container with your pipeline code and dependencies, enabling parameterization at runtime via the Dataflow UI, CLI, or API. This makes the pipeline reusable across environments (e.g., dev, staging, prod) by passing different configuration parameters (like project IDs, table names, or output paths) without modifying the code. Flex Templates support custom container images and are the recommended approach for production pipelines that need environment-agnostic deployment.

Exam trap

The exam often tests the distinction between Classic Templates and Flex Templates, where candidates mistakenly choose Classic Templates because they are simpler, but Flex Templates are required for custom environments and parameterized production reuse.

How to eliminate wrong answers

Option A is wrong because creating separate pipelines with hardcoded values violates the principle of reusability and introduces maintenance overhead; any change requires updating multiple pipeline copies, increasing the risk of configuration drift. Option B is wrong because Classic Templates fix the job graph when the template is built and accept runtime parameters only through the ValueProvider interface, which not every transform supports, making them less flexible for production pipelines that may require custom code or third-party libraries. Option D is wrong because the DirectRunner is intended for local testing and development only; it runs the pipeline on a single local machine and cannot provide the scalability, distributed execution, or environment-specific configuration needed for production deployment.

Multi-Select · easy

A data engineer needs to schedule a recurring transfer of data from a partner's Amazon S3 bucket to a Cloud Storage bucket for further processing. Which THREE components or configurations are necessary? (Choose 3)

Select 3 answers

A.A VPC network configuration

B.Specification of the source S3 bucket and destination GCS bucket

C.A scheduled transfer job in Storage Transfer Service

D.A Pub/Sub topic to notify completion

E.Authentication credentials for AWS (e.g., access key and secret)

Answers: B, C, E

Source and destination are required.

Why this answer

Scheduling, source and destination locations, and authentication/authorization are essential for a Storage Transfer Service job.

MCQ · easy

You are loading 10 GB of daily CSV files from a GCS bucket into a BigQuery table. The files contain some malformed rows that you want to skip. Which BigQuery load configuration should you use?

A.Use the 'skip_leading_rows' option.

B.Use the 'ignore_unknown_values' option.

C.Use the 'max_bad_records' option set to a value like 10.

D.Use the 'allow_jagged_rows' option.

Answer: C

max_bad_records specifies the number of allowed bad records; if the number of bad records exceeds this, the load fails.

Why this answer

BigQuery allows setting max_bad_records in load jobs; if the number of bad records exceeds this value, the job fails. Setting max_bad_records to a value greater than 0 allows the load to succeed while skipping up to that many malformed rows.
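The threshold semantics can be mimicked with a toy loader in plain Python. This is only a model of the behavior described above, not the BigQuery client API: the function `load_csv`, its notion of a "bad" row (wrong column count), and the sample data are assumptions made for illustration.

```python
import csv
import io


def load_csv(text, expected_columns, max_bad_records=0):
    """Toy model: skip malformed rows, but fail once too many have been seen."""
    reader = csv.reader(io.StringIO(text))
    next(reader)                               # skip the header row
    good_rows, bad_count = [], 0
    for row in reader:
        if len(row) != expected_columns:       # simplistic definition of a bad record
            bad_count += 1
            if bad_count > max_bad_records:
                raise ValueError(f"too many bad records: {bad_count}")
            continue
        good_rows.append(row)
    return good_rows, bad_count


data = "id,name\n1,alice\n2\n3,carol\n"        # the second data row is malformed

rows, skipped = load_csv(data, expected_columns=2, max_bad_records=10)

try:
    load_csv(data, expected_columns=2, max_bad_records=0)
    strict_load_failed = False
except ValueError:
    strict_load_failed = True
```

With a tolerance of 10 the load keeps the two valid rows and skips one; with a tolerance of 0 the same file makes the load fail.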

MCQ · medium

A data pipeline processes JSON files from Cloud Storage, transforms them using Apache Beam, and writes the output to BigQuery. Some records are malformed and cause the pipeline to fail. How should the engineer handle these errors to ensure the pipeline continues processing while preserving the malformed records for analysis?

A.Set the pipeline to retry malformed records indefinitely until they succeed.

B.Use a side input to send malformed records to a dead letter queue in Pub/Sub for later reprocessing.

C.Log the malformed records to Stackdriver and skip them in the pipeline.

D.Catch exceptions in a DoFn and write the malformed records to a separate Cloud Storage bucket using a FileIO sink.

Answer: D

This follows the dead letter pattern: malformed records are written to a separate sink, allowing the pipeline to continue and enabling later analysis.

Why this answer

Option D is correct because it allows the pipeline to continue processing by catching exceptions within a DoFn and writing malformed records to a separate Cloud Storage bucket using a FileIO sink. This preserves the malformed records for later analysis without blocking the main data flow, which is a standard pattern in Apache Beam for handling dead-letter records. The approach ensures fault tolerance while maintaining data integrity for debugging.
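The dead-letter pattern itself is small enough to show without Beam. In this plain-Python sketch, the function `parse_with_dead_letter` and the sample records are hypothetical. In a real pipeline the two lists would be two outputs of a DoFn, with the dead-letter output written to a separate sink.

```python
import json


def parse_with_dead_letter(raw_records):
    """Route parseable records to the main output and the rest to a dead-letter output."""
    good, dead_letter = [], []
    for raw in raw_records:
        try:
            good.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            # Keep the original payload and the reason, so it can be analyzed or replayed.
            dead_letter.append({"raw": raw, "error": str(exc)})
    return good, dead_letter


records = ['{"id": 1}', '{"id": 2', "not json", '{"id": 3}']
good, dead_letter = parse_with_dead_letter(records)

print(f"{len(good)} parsed, {len(dead_letter)} sent to dead letter")
```

The two malformed inputs never raise out of the loop, so the valid records keep flowing, and the originals are preserved alongside the error message rather than only being logged.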

Exam trap

The exam often tests the misconception that error handling in data pipelines should either retry indefinitely or use logging as a storage mechanism, but the correct approach is to isolate and persist malformed records to a durable sink like Cloud Storage for later analysis.

How to eliminate wrong answers

Option A is wrong because retrying malformed records indefinitely would cause the pipeline to hang or exhaust resources, as malformed records will never succeed due to inherent data issues. Option B is wrong because using a side input to send malformed records to a Pub/Sub dead letter queue is not a direct pattern in Apache Beam; side inputs are for broadcasting data to all elements, not for error handling, and Pub/Sub would require additional setup and does not inherently preserve the records for analysis without a separate sink. Option C is wrong because logging malformed records to Stackdriver and skipping them loses the data permanently, as logs are not designed for structured storage or reprocessing of the original records.

MCQ · hard

You are migrating an on-premises Kafka cluster to Google Cloud. The cluster has 50 topics with a total throughput of 200 MB/s. You want to minimize operational overhead. Which approach is the most cost-effective?

A.Use Pub/Sub with Kafka-compatible client libraries

B.Use Dataproc to run Kafka on a managed cluster

C.Deploy Kafka on Compute Engine instances with a managed instance group

D.Use Cloud NAT to route traffic to an on-premises Kafka cluster

Answer: B

Why this answer

None of the listed options is a fully managed Kafka service, so the answer key favors running Kafka on Dataproc, which provides a managed Hadoop/Spark environment with autoscaling.

Multi-Select · medium

A data engineer needs to schedule a nightly transfer of data from an Amazon S3 bucket to Cloud Storage. Which two steps are required to achieve this? (Choose TWO.)

Select 2 answers

A.Grant appropriate permissions to the transfer service account

B.Configure a VPC peering between AWS and GCP

C.Use gsutil rsync in a cron job

D.Create a Storage Transfer Service job with a schedule

E.Set up a Cloud Function to copy files

Answers: A, D

Why this answer

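Option A is correct because Storage Transfer Service runs as a Google-managed service agent, and that account needs permission to write to the destination Cloud Storage bucket. On the AWS side, the credentials the job uses must allow reading the source bucket (e.g., `s3:GetObject` and `s3:ListBucket`), granted through an AWS IAM policy. Without these permissions, the transfer job cannot read the data from S3 or write it to Cloud Storage. Option D is correct because a Storage Transfer Service job with a schedule is the managed way to run the transfer every night.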

Exam trap

The exam often tests the misconception that you can use a simple command-line tool like `gsutil rsync` in a cron job for production-scale scheduled transfers, but the correct approach is to use the managed Storage Transfer Service which handles permissions, scheduling, and reliability natively.

Multi-Select · medium

You are building a BigQuery table that contains nested and repeated fields (e.g., order with line items). You need to write a query that counts the number of line items per order. Which TWO SQL functions/techniques can you use?

Select 2 answers

A.Window function ROW_NUMBER

B.UNNEST with COUNT

C.STRUCT with aggregation

D.ARRAY_LENGTH

E.SELECT * EXCEPT

Answers: B, D

UNNEST expands the array, then COUNT(*) gives the number of line items per order.

Why this answer

Option B is correct because UNNEST flattens the repeated line items array into individual rows, allowing COUNT to aggregate the number of line items per order. Option D is correct because ARRAY_LENGTH directly returns the number of elements in the repeated field array, which corresponds to the line item count.
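The two counting strategies have direct analogues in plain Python, which makes their one behavioral difference easy to see. The `orders` data below is made up. The list comprehension plays the role of a cross join with UNNEST, and `len` plays the role of ARRAY_LENGTH.

```python
from collections import Counter

orders = [
    {"order_id": "o1", "line_items": [{"sku": "a"}, {"sku": "b"}]},
    {"order_id": "o2", "line_items": [{"sku": "c"}]},
    {"order_id": "o3", "line_items": []},      # an order with no line items
]

# ARRAY_LENGTH analogue: measure the array in place, one result per order.
by_length = {o["order_id"]: len(o["line_items"]) for o in orders}

# UNNEST + COUNT analogue: flatten to one row per line item, then group and count.
flattened = [(o["order_id"], item) for o in orders for item in o["line_items"]]
by_unnest = Counter(order_id for order_id, _ in flattened)

print(by_length)
print(dict(by_unnest))
```

Both approaches agree for orders that have line items. The flatten-then-count route produces no row at all for the empty order, just as a cross join with UNNEST drops rows whose array is empty, whereas the length-based route reports 0.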

Exam trap

The exam often tests the distinction between functions that operate on arrays directly (like ARRAY_LENGTH) versus those that require row-level expansion (like UNNEST), and candidates may mistakenly choose window functions or STRUCT-based aggregation that do not directly count array elements.

MCQ · medium

A data engineer needs to migrate 200 TB of on-premises Oracle data to BigQuery. The network bandwidth is limited to 100 Mbps, and the data must be loaded within 2 weeks. Which Google Cloud service is most appropriate for the initial data transfer?

A.Transfer Appliance

B.BigQuery Data Transfer Service for Oracle

C.Datastream

D.Storage Transfer Service

Answer: A

Transfer Appliance is the right choice for petabyte-scale offline data transfer when network bandwidth is insufficient.

Why this answer

Transfer Appliance is correct because it is a physical device designed for large-scale offline data transfers when network bandwidth is insufficient. With 200 TB at 100 Mbps, the best-case transfer time is roughly 185 days (200 TB × 8 bits/byte / 100 Mbps / 86,400 seconds/day), far beyond the 2-week window. Transfer Appliance allows shipping the data directly to Google, bypassing network constraints entirely.

Exam trap

The trap here is that candidates may assume online services like Storage Transfer Service or BigQuery Data Transfer Service can handle large volumes if given enough time, ignoring the hard bandwidth calculation that proves 200 TB at 100 Mbps is impossible within 2 weeks.

How to eliminate wrong answers

Option B is wrong because BigQuery Data Transfer Service for Oracle is a scheduled, incremental transfer service that relies on network connectivity and cannot handle the initial bulk load of 200 TB within the bandwidth limit. Option C is wrong because Datastream is a real-time change data capture (CDC) service for streaming changes, not designed for initial bulk transfers of large datasets. Option D is wrong because Storage Transfer Service is an online transfer tool that moves data over the network, which would be bottlenecked by the 100 Mbps link and cannot complete 200 TB within 2 weeks.

MCQ · easy

A data engineer needs to transfer 500 TB of archival data from an on-premises NAS to Cloud Storage. The on-premises network has limited bandwidth (100 Mbps). Which transfer method should they recommend?

A.Storage Transfer Service for on-premises

B.gsutil rsync

C.Transfer Appliance

D.Dataflow pipeline reading from NAS

Answer: C

Transfer Appliance is the best choice for large offline data transfer when network bandwidth is limited.

Why this answer

Option C is correct because the Transfer Appliance is a physical device designed for large-scale data transfers (up to petabytes) when network bandwidth is insufficient. With 500 TB of data and only 100 Mbps bandwidth, the best-case transfer time would be roughly 460 to 510 days, depending on whether TB is read as decimal or binary, making any online transfer method impractical. The Transfer Appliance bypasses network constraints entirely by shipping the data physically to Google Cloud.

Exam trap

The exam often tests the misconception that any cloud-native tool (like Storage Transfer Service or gsutil) can handle large data volumes regardless of bandwidth, ignoring the physical reality of network transfer times for archival-scale data.

How to eliminate wrong answers

Option A is wrong because Storage Transfer Service for on-premises requires network connectivity and is designed for smaller, incremental transfers, not for 500 TB over a 100 Mbps link. Option B is wrong because gsutil rsync is a command-line tool that relies on network bandwidth and would take an impractical amount of time (well over a year) to transfer 500 TB at 100 Mbps. Option D is wrong because a Dataflow pipeline reading from NAS would still need to stream data over the limited 100 Mbps network, resulting in the same bandwidth bottleneck and excessive transfer time.

MCQ · hard

You are migrating an on-premises PostgreSQL database to Cloud SQL. You need to continuously replicate changes to BigQuery for real-time analytics with minimal latency. Which service should you use?

A.Dataflow with JDBC source

B.Pub/Sub with a Cloud Function that writes to BigQuery

C.Storage Transfer Service

D.Datastream

AnswerD

Datastream is the managed CDC service that can stream changes from PostgreSQL to BigQuery with minimal latency.

Why this answer

Datastream is designed for change data capture (CDC) from databases like PostgreSQL, MySQL, and Oracle to BigQuery or GCS. It provides low-latency replication. Pub/Sub and Dataflow are not directly for CDC.

Storage Transfer Service is for file transfers, not database replication.

MCQeasy

A data engineer needs to load 10 TB of CSV files from Amazon S3 into Google BigQuery on a daily basis. Which service should they use to automate this transfer?

A.Dataproc

B.Cloud Data Fusion

C.BigQuery Data Transfer Service

D.Storage Transfer Service

AnswerC

BigQuery Data Transfer Service supports scheduled transfers from Amazon S3 directly into BigQuery.

Why this answer

Storage Transfer Service can transfer data from Amazon S3 to Cloud Storage, but it does not load directly into BigQuery. BigQuery Data Transfer Service can import from Amazon S3 directly into BigQuery tables on a schedule. The other options are not suitable: Cloud Data Fusion is built for ETL pipelines rather than simple scheduled transfers, and Dataproc is for Spark/Hadoop jobs.

Multi-Selecthard

A company is designing a real-time analytics pipeline using Pub/Sub and Dataflow. They need to ensure exactly-once processing and handle late-arriving data. Which two configurations should they implement? (Choose TWO.)

Select 2 answers

A.Set up a global window with no triggers

B.Enable exactly-once delivery on Pub/Sub subscription

C.Use Dataflow's default at-least-once mode

D.Use a fixed window with allowed lateness and a trigger

E.Write all data to Cloud Storage and then batch load to BigQuery

AnswersB, D

Why this answer

To achieve exactly-once semantics, enable exactly-once delivery on the Pub/Sub subscription (option B). To handle late-arriving data, use a fixed window with allowed lateness and a trigger (option D), so the pipeline can emit early results and then refine them as late data arrives.
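
The windowing rule can be illustrated without any pipeline framework. The following is a plain-Python sketch of the idea, not Apache Beam code: the window size, the lateness value, and the helper names `window_start` and `accept` are all hypothetical. It assumes the usual rule that a window stops accepting data once the watermark passes the window end plus the allowed lateness.

```python
WINDOW_SECONDS = 60     # hypothetical fixed window size
ALLOWED_LATENESS = 30   # hypothetical allowed lateness, in seconds

def window_start(event_time: int) -> int:
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)

def accept(event_time: int, watermark: int) -> bool:
    """True if the event's window is still open at the given watermark.

    The window [start, start + WINDOW_SECONDS) keeps accepting data until
    the watermark reaches its end plus ALLOWED_LATENESS.
    """
    window_end = window_start(event_time) + WINDOW_SECONDS
    return watermark < window_end + ALLOWED_LATENESS

# An event stamped t=59 belongs to window [0, 60).
on_time = accept(event_time=59, watermark=50)            # window not yet closed
late_but_accepted = accept(event_time=59, watermark=80)  # late, within lateness
too_late = accept(event_time=59, watermark=95)           # past 60 + 30, dropped
```

In a real pipeline a trigger would decide when each pane is emitted; this sketch covers only the accept-or-drop decision for late data.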

MCQmedium

A company wants to use dbt (data build tool) to transform data in BigQuery. They have a Cloud Storage bucket containing raw CSV files that are loaded daily into BigQuery via an external table. Which dbt feature should they use to modularize the transformation logic and handle dependencies between models?

A.dbt tests

B.dbt snapshots

C.dbt models with ref()

D.dbt seeds

AnswerC

Models define transformations and dependencies; ref() handles lineage and ordering.

Why this answer

C is correct because dbt models with the `ref()` function allow you to modularize SQL transformation logic and automatically handle dependencies between models. When you use `ref('model_name')`, dbt builds a dependency graph, ensuring models are executed in the correct order based on their references. This is essential for transforming raw data from an external table into a structured, analytics-ready dataset in BigQuery.

Exam trap

The exam often tests the distinction between features that manage data transformation logic (models with ref()) versus features that handle data quality (tests), historical tracking (snapshots), or static data loading (seeds), leading candidates to confuse the purpose of each dbt component.

How to eliminate wrong answers

Option A is wrong because dbt tests are used for validating data quality (e.g., uniqueness, not null) and do not handle transformation logic or dependency management. Option B is wrong because dbt snapshots are designed to capture historical changes in slowly changing dimensions (Type 2 SCDs), not to modularize transformation logic or manage model dependencies. Option D is wrong because dbt seeds are used to load static CSV files directly into the warehouse as tables, not to transform data or manage dependencies between models.
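
The dependency ordering that `ref()` enables can be sketched in a few lines. This is an illustration of the concept only, not how dbt is implemented: the model names and SQL bodies are hypothetical, and the regular expression is a simplification that would not cover every valid way of writing a `ref()` call.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: name -> SQL body containing dbt-style ref() calls.
models = {
    "daily_revenue": "select order_date, sum(amount) from {{ ref('orders_enriched') }} group by 1",
    "orders_enriched": "select * from {{ ref('stg_orders') }} o "
                       "join {{ ref('stg_customers') }} c using (customer_id)",
    "stg_orders": "select * from raw.orders_external",
    "stg_customers": "select * from raw.customers_external",
}

REF_PATTERN = re.compile(r"ref\(\s*['\"]([^'\"]+)['\"]\s*\)")

def build_order(models: dict) -> list:
    """Return model names ordered so that every model runs after the models it refs."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())

order = build_order(models)
print(order)  # staging models first, daily_revenue last
```

Because the order is derived from the references themselves, adding a new model only requires writing its SQL; no separate run list has to be maintained.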

MCQmedium

A company wants to build an event-driven application that processes images uploaded to a Cloud Storage bucket. The processing takes up to 10 minutes per image and should be automatically triggered. Which compute option should they use?

A.Cloud Functions (2nd gen) with Eventarc trigger

B.App Engine

C.Cloud Functions (1st gen)

D.Cloud Run on Eventarc trigger

AnswerD

Cloud Run can handle long-running requests (up to 60 minutes) and is triggered by Eventarc for GCS events.

Why this answer

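Event-driven Cloud Functions time out after 9 minutes, which is too short for processing that can take up to 10 minutes per image. Cloud Run allows request timeouts of up to 60 minutes and can be triggered by Eventarc for GCS events.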

MCQhard

A data pipeline uses Pub/Sub to ingest events, a Dataflow streaming pipeline to process them, and writes results to BigQuery. The pipeline must handle occasional duplicate events without causing duplicate rows in BigQuery. What is the best approach?

A.Use BigQuery legacy streaming inserts with insertId for deduplication

B.Use the Storage Write API with the committed stream

C.Set a unique constraint on the BigQuery table

D.Enable exactly-once processing in Pub/Sub

AnswerA

Legacy streaming inserts use insertId to deduplicate within a short window.

Why this answer

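BigQuery does not enforce primary keys or uniqueness constraints, so option C is not available, and deduplication must be handled in the pipeline using idempotent writes or a dedup step. With legacy streaming inserts, supplying an insertId per row lets BigQuery drop retried rows that carry the same ID within a short window.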
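
The windowed-deduplication idea can be sketched with a toy in-memory sink. This is not the BigQuery client API: the `DedupSink` class, its `insert` method, and the 60-second window are hypothetical, chosen only to show that an ID-based dedup drops retries that arrive soon after the original but not ones that arrive after the window has passed.

```python
class DedupSink:
    """Toy sink that drops rows whose insert_id was seen within window_seconds."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._seen = {}   # insert_id -> time the ID was last accepted
        self.rows = []    # rows that were actually written

    def insert(self, insert_id: str, row: dict, now: float) -> bool:
        """Write row unless insert_id was accepted within the window. Returns True if written."""
        # Forget IDs that have aged out of the dedup window.
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < self.window_seconds}
        if insert_id in self._seen:
            return False
        self._seen[insert_id] = now
        self.rows.append(row)
        return True

sink = DedupSink(window_seconds=60)
first = sink.insert("evt-1", {"user": "a"}, now=0)
retry = sink.insert("evt-1", {"user": "a"}, now=10)    # inside window: dropped
stale = sink.insert("evt-1", {"user": "a"}, now=100)   # outside window: written again
```

The last call shows the limitation: a duplicate that arrives after the window is written a second time, which is why a pipeline-level dedup step may still be needed for stronger guarantees.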

Multi-Selecthard

A company is migrating on-premises Apache Kafka workloads to Google Cloud. They want to minimize changes to existing producer and consumer applications while leveraging managed services. Which TWO services should they consider? (Choose 2)

Select 2 answers

A.BigQuery

B.Cloud Pub/Sub

C.Dataproc with Apache Kafka

D.Confluent Cloud on Google Cloud

E.Cloud Dataflow

AnswersC, D

Managed Kafka cluster on Dataproc; compatible with existing applications.

Why this answer

Kafka on Dataproc provides a managed Kafka cluster that is fully compatible, minimizing application changes. Confluent Cloud on Google Cloud can be used but is not a Google-managed service; however, it is a viable partner solution. Pub/Sub is not Kafka API-compatible.

Dataflow is not a replacement for Kafka. BigQuery is a data warehouse, not a streaming broker.

MCQmedium

A company wants to move data from an on-premises MySQL database to BigQuery for analytics. They need to capture all changes (inserts, updates, deletes) in near real-time and also perform an initial historical load. Which approach meets these requirements with minimal operational overhead?

A.Use a Dataflow pipeline with a JDBC source to read the entire table periodically

B.Use Datastream to backfill historical data and then stream CDC changes to BigQuery

C.Use a one-time export to CSV and load into BigQuery, then set up a cron job to export incremental changes

D.Use Cloud SQL as an intermediary and enable binary logging, then stream to Pub/Sub via a custom connector

AnswerB

Datastream handles both backfill and CDC seamlessly.

Why this answer

Datastream can perform a backfill of historical data and then stream CDC changes from MySQL to BigQuery in near real-time, providing a single service for both tasks.

MCQhard

A data engineer is designing a Dataflow pipeline in Python that reads from Pub/Sub, applies complex transformations using external libraries, and writes to BigQuery. The pipeline must be deployed as a reusable, version-controlled template that can be easily updated without re-uploading the pipeline code each time. Which approach should they use?

A.Use a Flex Template with a custom Docker image that contains the pipeline code and dependencies

B.Use a Classic Template and store the pipeline code in a Cloud Storage bucket

C.Use Cloud Composer to trigger the pipeline each time with updated parameters

D.Use Dataflow Prime and deploy the pipeline directly from the Apache Beam SDK

AnswerA

Flex Templates allow packaging the pipeline in a Docker image, enabling version control and easy updates via image tags.

Why this answer

A Flex Template allows you to package both the pipeline code and its dependencies (including external libraries) into a custom Docker image. This image is stored in Artifact Registry and can be version-controlled, enabling updates by simply rebuilding and pushing a new image tag without re-uploading the pipeline code each time. This meets the requirement for a reusable, version-controlled template that can be easily updated.

Exam trap

The exam often tests the distinction between Classic Templates (which only support a limited set of built-in transforms and require re-uploading code) and Flex Templates (which support custom Docker images and version-controlled updates), leading candidates to mistakenly choose Classic Templates for custom dependency scenarios.

How to eliminate wrong answers

Option B is wrong because a Classic Template stores the pipeline code in a Cloud Storage bucket, but it does not support custom Docker images or external library dependencies; any change to the code requires re-uploading the pipeline specification file. Option C is wrong because Cloud Composer is an orchestration tool for scheduling and managing workflows, not a template mechanism; it still requires the pipeline code to be deployed separately and does not provide a reusable, version-controlled template. Option D is wrong because Dataflow Prime is a feature for optimizing resource utilization and autoscaling, not a template deployment method; deploying directly from the Apache Beam SDK does not create a reusable template and requires re-uploading code for each update.

Multi-Selecthard

A company uses Dataflow to process data with Apache Beam in Python. The pipeline reads from Pub/Sub, applies a ParDo that calls an external API for enrichment, and writes to BigQuery. The external API has rate limits and occasionally fails. To improve reliability, which THREE strategies should be implemented? (Choose 3)

Select 3 answers

A.Switch from Python to Java SDK for better performance

B.Increase the number of Dataflow workers to reduce load per worker

C.Batch multiple requests to the external API using a side input

D.Implement retry logic with exponential backoff in the external API call

E.Use a dead letter pattern to write failed records to a separate sink

AnswersC, D, E

Batching reduces the number of API calls, helping with rate limits.

Why this answer

Retry logic with exponential backoff, a dead letter queue for failed records, and batching requests reduce API pressure and handle failures gracefully.
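
Two of these strategies, retry with exponential backoff and a dead letter sink, can be sketched in plain Python. This is a framework-free illustration under stated assumptions: in a Beam pipeline the logic would sit inside a DoFn and the dead letter records would go to a separate output. The names `call_with_backoff`, `enrich_all`, and `flaky_api` are hypothetical, and jitter is omitted for brevity.

```python
import time

def call_with_backoff(fn, record, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(record), doubling the delay after each failure; re-raise on the last attempt."""
    for attempt in range(max_attempts):
        try:
            return fn(record)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

def enrich_all(records, fn, **retry_kwargs):
    """Enrich each record; records that exhaust their retries go to the dead letter list."""
    enriched, dead_letter = [], []
    for record in records:
        try:
            enriched.append(call_with_backoff(fn, record, **retry_kwargs))
        except Exception as exc:
            dead_letter.append({"record": record, "error": str(exc)})
    return enriched, dead_letter

# Hypothetical API stand-in: fails twice per record, and always fails for "bad".
calls = {}
def flaky_api(record):
    calls[record] = calls.get(record, 0) + 1
    if record == "bad":
        raise ValueError("permanent failure")
    if calls[record] < 3:
        raise TimeoutError("transient failure")
    return record.upper()

delays = []  # record requested delays instead of actually sleeping
enriched, dead_letter = enrich_all(["a", "bad"], flaky_api, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the backoff schedule observable. Batching requests, the third strategy, would reduce the number of calls that reach this retry path in the first place.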

Multi-Selectmedium

Which three of the following are valid BigQuery data loading methods? (Choose THREE.)

Select 3 answers

A.Data Transfer Service from Amazon S3

B.Using Cloud SQL to write to BigQuery

C.Direct file upload via Cloud Dataproc

D.Batch load from Cloud Storage

E.Streaming inserts using the legacy streaming API

AnswersA, D, E

Why this answer

The BigQuery Data Transfer Service supports automated ingestion from Amazon S3, allowing you to schedule and manage recurring transfers of data stored in S3 buckets directly into BigQuery. It is a fully managed service that handles extraction and loading without requiring custom code or infrastructure. Batch loads from Cloud Storage and streaming inserts through the legacy streaming API are likewise native BigQuery ingestion paths.

Exam trap

The exam often tests the distinction between data processing services (like Dataproc) and actual data loading methods, leading candidates to confuse a processing step with a direct ingestion path.

MCQhard

A company uses Eventarc to trigger a Cloud Run service when new objects appear in a GCS bucket. Recently, the Cloud Run service has been failing with 429 errors (too many requests) during high-velocity uploads. They need to handle the load without losing events. What should they do?

A.Use Cloud Functions instead of Cloud Run

B.Increase the maximum number of retries on the Eventarc trigger

C.Increase the Cloud Run service's request timeout

D.Configure the Eventarc trigger to send events to a Pub/Sub topic, and have the Cloud Run service pull from Pub/Sub

AnswerD

This decouples the event source from the consumer, allowing the Cloud Run service to process at its own pace and reducing 429 errors.

Why this answer

Option D is correct because sending events to a Pub/Sub topic decouples event production from consumption. Pub/Sub acts as a buffer that can absorb spikes in event volume, and the Cloud Run service can pull messages at its own pace, preventing 429 errors. This also ensures no events are lost, as Pub/Sub retains unacknowledged messages and retries delivery.

Exam trap

The exam often tests the concept of decoupling event sources from consumers using a message queue or buffer, and the trap here is that candidates may think increasing retries or switching to Cloud Functions solves the load issue, when the real need is to absorb bursts via Pub/Sub.

How to eliminate wrong answers

Option A is wrong because Cloud Functions has similar concurrency limits and would also suffer from 429 errors under high load; it does not provide buffering. Option B is wrong because increasing retries on the Eventarc trigger only re-delivers failed events but does not address the root cause of the Cloud Run service being overwhelmed by too many concurrent requests. Option C is wrong because increasing the request timeout does not reduce the number of concurrent requests; it only allows longer processing time per request, which does not prevent the service from being overloaded.

MCQmedium

A company runs a Dataflow pipeline that reads from Pub/Sub, transforms data, and writes to BigQuery. The pipeline uses classic templates and is deployed in streaming mode. They notice that the pipeline does not scale well under high load, causing a backlog in Pub/Sub. Which improvement would BEST address the scaling issue?

A.Change the pipeline to use batch mode instead of streaming

B.Switch to Dataflow Prime to enable vertical autoscaling and right-fitting

C.Use a larger machine type for all workers

D.Increase the number of workers manually in the pipeline configuration

AnswerB

Dataflow Prime automatically adjusts resources for optimal performance under varying loads.

Why this answer

Dataflow Prime provides vertical autoscaling and right-fitting, automatically adjusting resources to handle variable loads, which is ideal for streaming pipelines with bursty traffic.

MCQhard

You are designing a Dataflow pipeline that needs to exactly-once process events from Pub/Sub and write to BigQuery using the Storage Write API. The pipeline may restart and could reprocess some messages. What setting ensures exactly-once semantics for the output?

A.Use the legacy streaming inserts with insertId for deduplication

B.Use at-least-once delivery on Pub/Sub and idempotent writes to BigQuery

C.Use the Storage Write API in buffered mode with deduplication logic

D.Use the Storage Write API in committed mode and enable exactly-once semantic in Dataflow

AnswerD

Committed mode guarantees exactly-once writes, and Dataflow can coordinate with Pub/Sub to avoid duplicates.

Why this answer

The Storage Write API supports exactly-once semantics when used with the 'committed' mode, which ensures each record is written exactly once. The pipeline also needs to use Pub/Sub with message IDs and deduplication. The other options either do not provide exactly-once or are unreliable.

Multi-Selectmedium

Which TWO statements are true about BigQuery Data Transfer Service? (Choose 2)

Select 2 answers

A.It supports data transformation during transfer.

B.It can transfer data from Cloud SQL to BigQuery directly.

C.It is only available in the US and EU regions.

D.It supports scheduled transfers from Amazon S3 and Redshift.

E.It can transfer data from Google Ads, YouTube, and Google Ad Manager into BigQuery.

AnswersD, E

Why this answer

Option D is correct because BigQuery Data Transfer Service supports scheduled, fully managed transfers from Amazon S3 and Amazon Redshift, enabling automated data ingestion into BigQuery for cross-cloud analytics. Option E is correct because the service natively integrates with Google Ads, YouTube, and Google Ad Manager to pull advertising and performance data directly into BigQuery on a recurring schedule.

Exam trap

The exam often tests the misconception that BigQuery Data Transfer Service can perform ETL transformations during transfer, but it is strictly an EL (Extract and Load) service without built-in transformation capabilities.

MCQmedium

A team wants to transfer data from an on-premises Hadoop cluster to Cloud Storage for processing. The cluster is located in a remote area with limited bandwidth. They need to transfer 500 TB of data. Which service should they use?

A.Transfer Appliance

B.BigQuery Data Transfer Service

C.Storage Transfer Service

D.Dataproc with gsutil

AnswerA

Offline physical appliance for large data transfers; ideal for remote areas with low bandwidth.

Why this answer

Transfer Appliance is designed for petabyte-scale offline transfers when bandwidth is limited.

MCQmedium

A company wants to ingest data from an on-premises Oracle database into BigQuery in near real-time with minimal latency. The database has a high volume of inserts and updates. Which service should they use?

A.Datastream

B.BigQuery Data Transfer Service

C.Pub/Sub

D.Storage Transfer Service

AnswerA

Datastream streams change data from Oracle, MySQL, PostgreSQL to BigQuery or GCS in near real-time.

Why this answer

Datastream is designed for CDC from Oracle and other sources to BigQuery or GCS in near real-time.

Multi-Selecthard

A data team needs to transfer 200 TB of data from Amazon S3 to GCS. The transfer must be incremental, and they need to monitor the transfer progress. Which THREE components should they use?

Select 3 answers

A.Cloud Monitoring

B.IAM service account

C.Dataflow

D.Transfer Appliance

E.Storage Transfer Service

AnswersA, B, E

Provides dashboards and alerts for transfer progress.

Why this answer

Storage Transfer Service (STS) can transfer data from S3 to GCS with incremental sync. Cloud Monitoring tracks transfer progress, and an IAM service account provides the permissions the transfer needs.

MCQeasy

A data engineer needs to orchestrate a series of tasks that include calling external APIs, running BigQuery queries, and sending notifications. The workflow involves conditional branching and parallel steps. Which Google Cloud service should be used?

A.Workflows

B.Cloud Scheduler

C.Cloud Composer

D.Dataflow

AnswerA

Workflows is serverless, supports conditional branching and parallel steps, and integrates with many services.

Why this answer

Workflows is the correct choice because it is a fully managed orchestration service designed specifically for coordinating multi-step, event-driven workflows that involve conditional branching, parallel execution, and integration with external APIs, BigQuery, and notifications via HTTP calls or service integrations. It provides built-in error handling, retries, and a declarative YAML-based syntax that directly supports the described requirements without needing to manage infrastructure or schedule tasks.

Exam trap

The trap here is that candidates often confuse Cloud Scheduler as an orchestrator because it can trigger workflows, but it lacks the conditional branching and parallel execution capabilities required for this scenario, while Cloud Composer is mistakenly chosen due to its familiarity with Airflow, despite being heavier than necessary for a simple orchestration task.

How to eliminate wrong answers

Option B is wrong because Cloud Scheduler is a cron-based job scheduler that triggers tasks on a fixed schedule, not an orchestrator for complex workflows with conditional branching and parallel steps. Option C is wrong because Cloud Composer is a managed Apache Airflow service that can orchestrate workflows, but it is overkill for this use case, requires managing DAGs and infrastructure overhead, and is not the simplest or most cost-effective choice for a lightweight, event-driven workflow. Option D is wrong because Dataflow is a stream and batch data processing service based on Apache Beam, designed for transforming and analyzing data pipelines, not for orchestrating tasks like API calls, BigQuery queries, and notifications with conditional logic.

MCQeasy

Which BigQuery feature allows you to query data directly from Cloud Storage without loading it into BigQuery storage?

A.BigQuery Omni

B.BigQuery ML

C.Federated queries

D.External tables

AnswerD

External tables in BigQuery reference data in GCS and can be queried directly.

Why this answer

BigQuery Omni is for multi-cloud, not external tables. External tables allow querying data in GCS.

MCQmedium

A company wants to migrate their on-premises Teradata data warehouse to BigQuery. They need an automated, one-time transfer of historical data (10 TB) and ongoing incremental daily syncs. Which Google Cloud service should they use?

A.BigQuery Data Transfer Service for Teradata

B.Dataflow custom pipeline

C.Storage Transfer Service

D.Datastream

AnswerA

This service is designed to schedule and automate transfers from Teradata to BigQuery, both initial and incremental.

Why this answer

BigQuery Data Transfer Service supports Teradata as a source for both one-time and scheduled transfers. Storage Transfer Service is for file-based transfers. Datastream is for CDC, not Teradata.

Dataflow could be custom-built but Data Transfer Service is purpose-built for this scenario.

MCQeasy

A company wants to stream real-time user click events from their web application into BigQuery for immediate analysis. Which combination of services is the most scalable and cost-effective for this use case?

A.Pub/Sub to Dataflow to BigQuery

B.Cloud Dataproc Spark Streaming to BigQuery

C.App Engine pushing logs to BigQuery

D.Cloud Functions writing directly to BigQuery

AnswerA

Why this answer

Pub/Sub is the recommended service for ingesting streaming events, and Dataflow can read from Pub/Sub and write to BigQuery using the Storage Write API for high-throughput, exactly-once semantics.

MCQeasy

Which BigQuery feature allows you to write data with exactly-once semantics, high throughput, and the ability to buffer data before making it available for queries?

A.BigQuery load jobs

B.BigQuery Data Transfer Service

C.Legacy streaming inserts

D.Storage Write API with buffered mode

AnswerD

Why this answer

The Storage Write API with buffered mode (option D) is correct because it provides exactly-once semantics for data ingestion, high throughput via gRPC streaming, and the ability to buffer data in memory before making it available for queries. This mode allows you to commit rows in a stream, ensuring no duplicates, while the buffering stage gives you control over when data becomes visible in BigQuery.

Exam trap

The exam often tests the misconception that legacy streaming inserts (option C) provide exactly-once semantics, but they actually offer at-least-once delivery, making the Storage Write API with buffered mode the only correct choice for exactly-once, high-throughput, buffered writes.

How to eliminate wrong answers

Option A is wrong because BigQuery load jobs are a batch ingestion path: they write files into tables as a single job and offer neither a high-throughput streaming interface nor a buffered stage that defers query visibility. Option B is wrong because BigQuery Data Transfer Service is a scheduled, managed service for importing data from external sources (e.g., Google Ads, Amazon S3) and does not provide high-throughput streaming or buffered write semantics. Option C is wrong because legacy streaming inserts (tabledata.insertAll) provide at-least-once semantics (duplicates can occur) and lack the buffered mode that defers query visibility; they also have lower throughput and no exactly-once guarantee.

MCQmedium

Your company uses Kafka for event streaming. You want to run Kafka on Google Cloud with the ability to auto-scale clusters and use managed infrastructure. Which service should you choose?

A.Cloud Pub/Sub

B.Confluent Cloud on GCP

C.Cloud Dataflow

D.Dataproc

AnswerD

Dataproc supports running Kafka as an optional component on managed clusters, giving you control and scalability.

Why this answer

Dataproc is the correct choice because it is a managed Spark and Hadoop service on Google Cloud that supports running Kafka clusters. It allows auto-scaling of worker nodes and integrates with GCP storage and networking, providing the managed infrastructure required for Kafka event streaming.

Exam trap

The trap here is that candidates may confuse managed Kafka services (like Confluent Cloud) with GCP-native managed infrastructure, or assume Cloud Pub/Sub is equivalent to Kafka for event streaming, when Dataproc is the correct choice for running Kafka itself on GCP with auto-scaling.

How to eliminate wrong answers

Option A is wrong because Cloud Pub/Sub is a fully managed messaging service, not a Kafka-compatible platform; it does not run Kafka clusters or support auto-scaling of Kafka-specific infrastructure. Option B is wrong because Confluent Cloud on GCP is a third-party managed Kafka service, not a native GCP service, and while it offers auto-scaling, the question asks for a service you choose to run Kafka on Google Cloud with managed infrastructure, implying a GCP-native solution; Confluent Cloud is a separate platform, not a GCP service. Option C is wrong because Cloud Dataflow is a stream and batch processing service based on Apache Beam, not a Kafka cluster management service; it can consume from Kafka but does not host or auto-scale Kafka clusters.

MCQmedium

Your team uses dbt to transform data in BigQuery. You need to schedule dbt runs to refresh materialized tables and views every hour. The transformations include both full refreshes and incremental models. What is the most efficient way to orchestrate these dbt runs on Google Cloud?

A.Use Cloud Composer (Airflow) to schedule and run dbt commands.

B.Use Cloud Build with a trigger to run dbt every hour.

C.Use Cloud Scheduler to trigger a Cloud Function that runs dbt.

D.Set up a cron job on a Compute Engine instance to run dbt.

AnswerA

Cloud Composer is managed Airflow, ideal for scheduling and orchestrating dbt runs with dependencies.

Why this answer

Cloud Composer (Airflow) is the recommended orchestration tool for complex workflows like dbt runs, supporting dependencies, retries, and scheduling. Cloud Scheduler alone cannot run dbt directly; it can trigger a Cloud Function to run dbt, but that is less maintainable. Cloud Build is CI/CD, not scheduling.

Using a cron job on Compute Engine is possible but not managed.

MCQmedium

You are designing a near-real-time CDC pipeline to replicate changes from an on-premises PostgreSQL database to BigQuery for analytics. The source database has high transaction volume and you must ensure minimal impact on the source. Which Google Cloud service should you use to ingest the change data?

A.Pub/Sub with a custom connector that polls the database every minute.

B.Use BigQuery Data Transfer Service for PostgreSQL.

C.Use Dataflow with a JDBC IO connector to read from PostgreSQL.

D.Datastream to stream changes to GCS, then load into BigQuery.

AnswerD

Datastream captures changes from the database logs and streams them to GCS or BigQuery directly, with low impact.

Why this answer

Datastream is purpose-built for CDC from MySQL, PostgreSQL, and Oracle to BigQuery or GCS. It reads the database logs (e.g., WAL) to capture changes with low latency and minimal impact on the source.

MCQmedium

A data engineer needs to move 500 TB of archival data from an on-premises Hadoop cluster to Cloud Storage. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 30 days. Which method is most cost-effective and reliable?

A.Use Dataproc to copy data in parallel

B.Use a VPN and gsutil rsync

C.Use Storage Transfer Service over the internet

D.Use Transfer Appliance to ship the data offline

AnswerD

Transfer Appliance allows offline shipping of large data volumes, bypassing bandwidth limits.

Why this answer

At 100 Mbps, transferring 500 TB over the network would take roughly 460 days even at full line rate, far exceeding the 30-day window. Transfer Appliance is designed for petabyte-scale offline transfers, making it the only feasible option.

MCQmedium

An organization needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Google Cloud Storage. The network bandwidth is limited to 100 Mbps. Which transfer method is MOST cost-effective and time-efficient?

A.Transfer Appliance

B.Storage Transfer Service over the network

C.BigQuery Data Transfer Service for Hadoop

D.gsutil cp with parallel composite uploads

AnswerA

Transfer Appliance allows shipping data physically, bypassing network bandwidth limitations for large datasets.

Why this answer

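Transfer Appliance is designed for petabyte-scale offline transfers: data is copied to a physical device that is shipped to Google for upload. At 100 Mbps, 50 TB would need roughly 46 days of sustained full-rate transfer (50 TB × 8 bits/byte / 100 Mbps / 86,400 seconds/day), so shipping the appliance avoids tying up the limited link for weeks.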

MCQmedium

A company needs to load data from a MySQL database into BigQuery daily. The data volume is 10 GB per day and the schema changes occasionally. They want to minimize costs and operational overhead. What is the MOST appropriate approach?

A.Use Datastream to stream changes from MySQL to BigQuery

B.Use BigQuery Data Transfer Service for MySQL

C.Export MySQL data to CSV, upload to GCS, and use BigQuery load jobs

D.Use Cloud SQL federated query from BigQuery

AnswerA

Datastream handles schema changes automatically and provides low-latency streaming with minimal manual intervention.

Why this answer

Datastream is the most appropriate approach because it provides a serverless, change data capture (CDC) solution that continuously replicates changes from MySQL to BigQuery with minimal latency. It handles schema evolution automatically, reducing operational overhead, and its pay-per-GB pricing model minimizes costs for the 10 GB daily volume. This eliminates the need for manual exports or batch loads while ensuring data freshness.

Exam trap

The exam often tests the misconception that BigQuery Data Transfer Service supports any database source, but it only supports specific SaaS and cloud storage sources, not direct MySQL connections.

How to eliminate wrong answers

Option B is wrong because BigQuery Data Transfer Service for MySQL is not a supported service; BigQuery Data Transfer Service supports sources like Google Ads, Amazon S3, and Teradata, but not direct MySQL connections. Option C is wrong because exporting MySQL to CSV, uploading to GCS, and using load jobs incurs higher operational overhead (manual scripting, schema management) and does not handle schema changes gracefully, requiring manual intervention for each change. Option D is wrong because Cloud SQL federated queries from BigQuery are designed for ad-hoc querying of live Cloud SQL data, not for daily bulk ingestion, and they do not persist data in BigQuery, leading to repeated query costs and no historical retention.

MCQeasy

A data analyst needs to transform nested and repeated fields in BigQuery. They have a table with a column of type ARRAY<STRUCT<...>>. Which SQL function should they use to flatten the array into individual rows for analysis?

A.STRUCT

B.CAST

C.UNNEST

D.REPLACE

AnswerC

UNNEST converts array elements into rows, allowing analysis of nested data.

Why this answer

UNNEST is used to flatten arrays into rows. STRUCT is used to group fields. CAST is for type conversion.

REPLACE is for string replacement.
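To make the flattening concrete, here is a minimal Python emulation of what UNNEST does to an ARRAY<STRUCT<...>> column. It is only an illustration of the semantics, not BigQuery code; the `orders` table and its `order_id`, `items`, `sku`, and `qty` fields are hypothetical.

```python
from typing import Any


def unnest(rows: list[dict[str, Any]], array_column: str) -> list[dict[str, Any]]:
    """Emit one output row per array element, repeating the parent columns."""
    flattened = []
    for row in rows:
        parent = {k: v for k, v in row.items() if k != array_column}
        for element in row[array_column]:
            # Each array element is a STRUCT, modeled here as a dict of fields.
            flattened.append({**parent, **element})
    return flattened


# Hypothetical table: one row per order, with a repeated STRUCT column "items".
orders = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": [{"sku": "C", "qty": 5}]},
]
flat = unnest(orders, "items")
for row in flat:
    print(row)
```

The roughly equivalent query would be `SELECT order_id, item.sku, item.qty FROM orders, UNNEST(items) AS item`. As with that comma (cross) join, a parent row whose array is empty produces no output rows in this sketch.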


MCQhard

You need to process a large volume of event data from Cloud Storage, apply complex transformations using Apache Spark, and then load the results into BigQuery. The data arrives in batches every hour. You want to minimize costs by using preemptible VMs. Which service should you use?

A.Cloud Composer

B.BigQuery

C.Dataproc

D.Dataflow

AnswerC

Dataproc clusters can use preemptible VMs for cost-efficient batch processing with Spark.

Why this answer

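Dataproc runs Spark natively and supports preemptible (now called Spot) VMs as secondary workers for cost savings. Dataflow runs Apache Beam pipelines, not Spark; its FlexRS option can use preemptible VMs for batch jobs, but existing Spark transformations would have to be rewritten. Cloud Composer is orchestration only. BigQuery is not for running Spark.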


MCQmedium

A company uses dbt on BigQuery to transform data. They want to run dbt models on a schedule and manage environments (dev, prod). Which GCP service should they use to run dbt jobs?

A.Dataflow

B.Cloud Composer

C.Cloud Scheduler

D.Cloud Build

AnswerB

Managed Airflow with DAGs, scheduling, and environment separation.

Why this answer

Cloud Composer is an Apache Airflow managed service that can schedule dbt runs.


MCQeasy

A company wants to transfer 500 TB of data from an on-premises Hadoop cluster to Google Cloud Storage (GCS) for processing with Dataproc. The on-premises network has a 1 Gbps dedicated link to Google Cloud. The data must be transferred as quickly as possible, minimizing network usage. Which transfer method should they use?

A.Use Storage Transfer Service over the 1 Gbps link.

B.Use gsutil cp in parallel with multiple threads.

C.Use Transfer Appliance to physically ship the data.

D.Use BigQuery Data Transfer Service for Hadoop.

AnswerC

Transfer Appliance moves the 500 TB offline, so the transfer does not depend on the 1 Gbps link.

Why this answer

Transfer Appliance is the correct method because the dataset is 500 TB and the network link is only 1 Gbps. At 1 Gbps, the theoretical maximum transfer time is over 46 days, and real-world throughput (due to overhead, congestion, and Hadoop data characteristics) would be even longer. Transfer Appliance physically ships the data, bypassing the network bottleneck entirely and minimizing network usage, which is the stated requirement.
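The bandwidth arithmetic behind the "over 46 days" figure can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the 70% link efficiency used for the second estimate is an illustrative assumption, not a measured value.

```python
def transfer_days(data_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link at the given efficiency."""
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


TB = 10**12    # decimal terabyte, in bytes
GBPS = 10**9   # gigabit per second, in bits per second

ideal = transfer_days(500 * TB, 1 * GBPS)
# Assumed 70% usable throughput, purely for illustration.
realistic = transfer_days(500 * TB, 1 * GBPS, efficiency=0.7)
print(f"ideal: {ideal:.1f} days, at 70% efficiency: {realistic:.1f} days")
```

Even the ideal case saturates the link for more than six weeks, which is why an offline transfer fits both the speed and the network-usage requirements.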

Exam trap

The trap here is that candidates assume parallel transfers (gsutil cp) or managed services (Storage Transfer Service) can overcome bandwidth limitations, but they ignore the fundamental physics of a 1 Gbps link and the sheer size of 500 TB.

How to eliminate wrong answers

Option A is wrong because Storage Transfer Service still uses the 1 Gbps network link, which would take weeks to transfer 500 TB, failing the 'as quickly as possible' and 'minimizing network usage' requirements. Option B is wrong because gsutil cp with parallel threads still operates over the same 1 Gbps link and cannot exceed its bandwidth; it also does not minimize network usage. Option D is wrong because BigQuery Data Transfer Service loads data into BigQuery rather than into GCS for processing with Dataproc, and any such transfer would still use the network link.


MCQmedium

A data engineer is creating a Dataflow Flex Template for a batch pipeline that reads from BigQuery and writes to Cloud Storage. They need to pass a runtime parameter for the output bucket. How should they define this parameter?

A.Set an environment variable in Cloud Shell

B.Use the --parameters flag with the pipeline options

C.Hardcode the bucket name in the template

D.Define the parameter in the pipeline's code and use ValueProvider

AnswerD

Why this answer

Option D is the intended answer because a runtime parameter must first be declared in the pipeline code; its value is then supplied at job submission time via the `--parameters` flag, so the same template can be reused with different output buckets without rebuilding it. Note that `ValueProvider` is strictly required only for classic templates; Flex Templates can also read ordinary pipeline options that are declared in the template's metadata file.

Exam trap

The exam often tests the distinction between defining a parameter (using `ValueProvider` in code) and supplying its value (using `--parameters` at submission), leading candidates to mistakenly choose Option B as the complete solution.

How to eliminate wrong answers

Option A is wrong because environment variables in Cloud Shell are not accessible to the Dataflow service at runtime; they are only available in the shell session and cannot be passed into a Flex Template job. Option B is wrong because the `--parameters` flag is used to supply values to `ValueProvider` parameters at job submission, but it does not define the parameter itself — the parameter must first be declared as a `ValueProvider` in the pipeline code. Option C is wrong because hardcoding the bucket name defeats the purpose of using a Flex Template, which is designed to be parameterized and reusable across different environments and runs.


MCQhard

You are using Dataproc to run a Spark job that reads data from Cloud Storage, performs aggregations, and writes results back to Cloud Storage. The job is failing with out-of-memory errors on the shuffle. Which optimization should you apply?

A.Increase spark.sql.shuffle.partitions

B.Use RDDs instead of DataFrames

C.Increase spark.executor.memory

D.Decrease the number of executors

AnswerA

Why this answer

For shuffle-heavy operations, increasing the number of shuffle partitions (`spark.sql.shuffle.partitions`) reduces the size of each partition, so each task holds less data in memory. More compact serialization also helps: DataFrames use Spark's Tungsten binary format, and RDD-based jobs can enable Kryo.
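The effect of the partition count on per-task memory can be estimated with simple arithmetic. The sketch below assumes a hypothetical 200 GiB shuffle that is evenly distributed across partitions (no key skew); 200 is Spark's default value for `spark.sql.shuffle.partitions`, and 2000 is an arbitrary tuned value chosen for illustration.

```python
def avg_partition_mib(shuffle_bytes: int, num_partitions: int) -> float:
    """Average shuffle partition size in MiB, assuming an even distribution."""
    return shuffle_bytes / num_partitions / 1024**2


shuffle_bytes = 200 * 1024**3  # hypothetical 200 GiB of shuffle data

default_size = avg_partition_mib(shuffle_bytes, 200)   # Spark default
tuned_size = avg_partition_mib(shuffle_bytes, 2000)    # illustrative tuned value

print(f"200 partitions:  {default_size:.1f} MiB per partition")
print(f"2000 partitions: {tuned_size:.1f} MiB per partition")
```

With skewed keys the largest partition can be far bigger than the average, so this estimate is a lower bound on the worst-case task size.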


MCQmedium

You have a Dataflow pipeline that processes streaming data with high throughput. You notice that the pipeline is experiencing high latency and the workers are underutilized. Which Dataflow feature can automatically optimize resource allocation?

A.Flex Templates

B.Horizontal autoscaling

C.Streaming Engine

D.Dataflow Prime

AnswerD

Dataflow Prime offers vertical autoscaling and right-fitting to optimize worker resources.

Why this answer

Dataflow Prime provides vertical autoscaling and right fitting, which tune worker resources based on actual usage. Horizontal autoscaling is standard but only changes the number of workers, so it may not address underutilization. Streaming Engine moves streaming state and shuffle processing off the worker VMs into the Dataflow service; it does not tune worker resources. Flex Templates are for deployment, not runtime optimization.


MCQeasy

A company wants to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which event-driven solution should they use?

A.Eventarc with Cloud Storage trigger and Cloud Run destination

B.Cloud Scheduler to periodically poll the bucket

C.Cloud Functions triggered by Cloud Storage

D.Pub/Sub with a push subscription to Cloud Run

AnswerA

Eventarc natively supports Cloud Storage events and routes to Cloud Run.

Why this answer

Eventarc is the recommended service for routing events from Cloud Storage to Cloud Run because it provides a fully managed, event-driven architecture with built-in filtering and retry logic. When a new file is uploaded, Cloud Storage emits a notification that Eventarc captures and delivers directly to the Cloud Run service as an HTTP request, enabling serverless processing without polling or additional infrastructure.

Exam trap

The trap here is that candidates confuse Cloud Functions (option C) as the only serverless compute option for Cloud Storage events, overlooking that Eventarc is the modern, preferred service for routing events to Cloud Run, and that Pub/Sub (option D) requires manual setup not shown in the question.

How to eliminate wrong answers

Option B is wrong because Cloud Scheduler is a cron job service for scheduled, not event-driven, tasks; periodically polling a bucket introduces latency and inefficiency, and it cannot react instantly to uploads. Option C is wrong because Cloud Functions triggered by Cloud Storage is a valid event-driven approach, but the question specifically asks for a Cloud Run destination, and Cloud Functions cannot directly invoke Cloud Run without additional integration. Option D is wrong because Pub/Sub with a push subscription to Cloud Run requires manually configuring Cloud Storage to publish to Pub/Sub, which adds complexity and is not the native, recommended pattern for Cloud Storage events; Eventarc abstracts this by directly managing the event flow from Cloud Storage to Cloud Run.


MCQhard

A company uses BigQuery's Storage Write API in committed mode to stream data. They notice that some writes are failing with 'DEADLINE_EXCEEDED' errors during peak traffic. The pipeline is a Dataflow job using the Beam SDK. What is the MOST likely cause and solution?

A.The Dataflow workers lack sufficient memory; increase worker memory.

B.The default RPC timeout is too low for the write throughput; increase the timeout in the Storage Write API configuration.

C.The Pub/Sub subscription is not sending acknowledgments; check the subscription.

D.The row schema has changed; update the schema before writing.

AnswerB

High traffic can cause RPCs to exceed the default timeout; increasing the timeout allows more time for acknowledgment.

Why this answer

In committed mode, each append is acknowledged only after BigQuery has committed the rows, which makes them immediately readable. Under high traffic, that acknowledgment can take longer than the default RPC timeout. The solution is to increase the timeout; where at-least-once semantics are acceptable, writing to the default stream is another option that typically offers higher throughput.

The error is not due to schema mismatch or permissions; those would cause different errors. Pub/Sub is not involved in the write path.


MCQmedium

A company wants to transform data using dbt (data build tool) on BigQuery. They have a CI/CD pipeline and need to version-control their transformations. Which setup is recommended?

A.Create Dataflow pipelines for each transformation

B.Deploy dbt models in a Cloud Build pipeline that runs dbt run

C.Use Cloud Composer to orchestrate dbt jobs

D.Run dbt directly on BigQuery using scripting

AnswerB

Why this answer

Option B is correct because dbt is designed for version-controlled, SQL-based transformations, and integrating it with Cloud Build allows you to run `dbt run` as part of a CI/CD pipeline. This setup ensures that every change to dbt models is automatically tested and deployed, which aligns with the requirement for version control and automated deployment on BigQuery.

Exam trap

The exam often tests the distinction between orchestration (Cloud Composer) and CI/CD (Cloud Build). Candidates mistakenly choose Cloud Composer because they think scheduling equals version control, but the question explicitly requires version control and CI/CD, not just scheduling.

How to eliminate wrong answers

Option A is wrong because Dataflow pipelines are intended for stream or batch data processing using Apache Beam, not for version-controlled SQL transformations; they add unnecessary complexity and cost for simple transformation logic. Option C is wrong because Cloud Composer (Apache Airflow) is an orchestration tool for scheduling and monitoring workflows, not a CI/CD pipeline for version-controlled dbt models; while it can run dbt, it is not the recommended setup for version control and automated deployment. Option D is wrong because running dbt directly on BigQuery using scripting bypasses version control, CI/CD integration, and proper environment management, leading to manual, error-prone processes.


MCQeasy

A data engineer needs to load a 10 GB CSV file from GCS into BigQuery. The file contains some malformed rows that should be skipped. Which approach is most efficient?

A.Use Dataproc to run a Spark job that cleans the data and writes to BigQuery

B.Use the Storage Write API to stream each row, skipping bad ones in code

C.Use a Dataflow pipeline to read CSV, filter bad rows, and write to BigQuery

D.Use the bq command-line tool with the --max_bad_records flag

AnswerD

bq load with --max_bad_records skips malformed rows efficiently.

Why this answer

Option D is correct because the `bq` command-line tool's `--max_bad_records` flag allows BigQuery's native CSV loader to skip malformed rows up to a specified limit during a load job. This is the most efficient approach for a one-time batch load of a 10 GB file, as it avoids the overhead of spinning up separate processing clusters (Dataproc, Dataflow) or streaming each row individually, leveraging BigQuery's optimized ingestion pipeline.
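The threshold behavior of `--max_bad_records` can be illustrated with a small simulation. This is not the `bq` tool or the BigQuery loader; it is a sketch that treats any row with the wrong number of columns as malformed, which is a simplification of BigQuery's actual validation rules. The function name `load_csv` and the sample data are hypothetical.

```python
import csv
import io


def load_csv(text: str, num_columns: int, max_bad_records: int) -> list[list[str]]:
    """Keep well-formed rows; fail once bad rows exceed max_bad_records."""
    good: list[list[str]] = []
    bad = 0
    for row in csv.reader(io.StringIO(text)):
        if len(row) == num_columns:
            good.append(row)
        else:
            bad += 1
            if bad > max_bad_records:
                # Mirrors a load job failing when the limit is exceeded.
                raise ValueError(f"too many bad records: {bad} > {max_bad_records}")
    return good


# Two of the four rows have the wrong column count.
data = "1,alice\n2,bob,extra\n3\n4,dave\n"
rows = load_csv(data, num_columns=2, max_bad_records=2)
print(rows)
```

With `max_bad_records=1` the same input raises an error on the second bad row, which corresponds to the whole load job failing rather than loading partial data.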

Exam trap

The exam often tests the misconception that complex ETL pipelines (Spark, Dataflow) are always required for data cleaning, when in fact BigQuery's native load options like `--max_bad_records` can handle common malformed row scenarios directly and more efficiently.

How to eliminate wrong answers

Option A is wrong because using Dataproc to run a Spark job introduces unnecessary complexity and cost; for a simple load with malformed row skipping, a native BigQuery load job is far more efficient without needing a separate cluster. Option B is wrong because the Storage Write API is designed for real-time streaming, not batch loading a 10 GB file; streaming each row would be slower, more expensive, and less reliable than a single batch load with `--max_bad_records`. Option C is wrong because a Dataflow pipeline adds unnecessary processing overhead and cost; while it can filter bad rows, BigQuery's native load job with `--max_bad_records` achieves the same result more directly without requiring a separate data processing service.


Multi-Selectmedium

A company is migrating a Spark batch job from on-premises to Dataproc. The job uses RDDs for custom transformations and writes output to BigQuery. They want to optimize the job for performance and cost on Dataproc. Which THREE practices should they adopt? (Choose 3)

Select 3 answers

A.Write output to BigQuery using the BigQuery connector

B.Use DataFrames or Datasets instead of RDDs where possible

C.Use PySpark instead of Scala for simplicity

D.Increase the number of partitions to match the number of cores exactly

E.Use preemptible VM instances for worker nodes to reduce cost

AnswersA, B, E

The connector provides efficient bulk loading.

Why this answer

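The BigQuery connector for Spark (com.google.cloud.spark.bigquery) writes DataFrames to BigQuery efficiently. In its direct write mode it uses the BigQuery Storage Write API and needs no intermediate storage; in its indirect mode it stages files in GCS and runs a load job. Either is far more performant than writing row by row over JDBC. Using DataFrames or Datasets (option B) lets Spark's Catalyst optimizer plan the job, and preemptible workers (option E) lower the cost of fault-tolerant batch work.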

Exam trap

The exam often tests the misconception that switching to PySpark (option C) is itself an optimization. The performance gain comes from DataFrames and Datasets, which use the Catalyst optimizer and Tungsten execution in any language, not from the choice of language.


Multi-Selectmedium

A company wants to use Eventarc to trigger a Cloud Run service when new objects are created in a GCS bucket. They also need to filter events for a specific bucket and object prefix. Which THREE resources must exist or be created?

Select 3 answers

A.Cloud Storage bucket

B.Pub/Sub topic

C.Cloud Scheduler job

D.Eventarc trigger

E.Cloud Run service

AnswersA, D, E

The source of events.

Why this answer

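The three required resources are the Cloud Storage bucket (the event source), the Eventarc trigger (which references the bucket and prefix to filter events), and the Cloud Run service (the destination). Eventarc manages the underlying Pub/Sub transport itself, so a separate Pub/Sub topic does not need to be created.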


MCQeasy

Which Google Cloud service is designed to replicate data from MySQL, PostgreSQL, and Oracle databases to BigQuery or Cloud Storage in near real-time?

A.Cloud Data Fusion

B.Datastream

C.Dataflow

D.Pub/Sub

AnswerB

Why this answer

Datastream is a serverless CDC service that ingests change data from relational databases into GCS or BigQuery.


MCQmedium

Your organization uses dbt (data build tool) for transformations on BigQuery. You need to run dbt models on a schedule and manage versions. Which Google Cloud service can execute dbt jobs in a serverless manner?

A.Dataflow

B.Cloud Build

C.Cloud Run

D.Cloud Composer

AnswerD

Cloud Composer (Airflow) is designed for orchestrating data pipelines and can schedule dbt runs with dependencies and version control.

Why this answer

Cloud Composer (managed Apache Airflow) is the intended answer: it schedules dbt runs, manages their dependencies, and is the common choice for orchestrating dbt models. Cloud Build is designed for CI/CD rather than scheduling, and Dataflow does not run dbt.

Strictly speaking, Cloud Composer is not serverless, because it runs on GKE. The closest serverless option is a Cloud Run job that runs dbt in a container, triggered by Cloud Scheduler, but Cloud Run is not specifically built for dbt. Given the options, Cloud Composer remains the best fit for scheduling and versioning.


MCQeasy

Your company is migrating an on-premises Hadoop cluster to Google Cloud. You need to transform large datasets using Spark SQL. Which Google Cloud service should you use?

A.Dataflow

B.Dataproc

C.BigQuery

D.Cloud Dataprep

AnswerB

Dataproc provides managed Spark clusters where you can run Spark SQL, DataFrames, and RDDs.

Why this answer

Dataproc is the managed Spark and Hadoop service on Google Cloud, purpose-built for running existing Spark SQL workloads with minimal changes. It allows you to spin up a cluster, run your Spark SQL transformations on large datasets stored in Cloud Storage or BigQuery, and then tear it down, making it the direct equivalent of an on-premises Hadoop cluster in the cloud.

Exam trap

The exam often tests the distinction between managed Spark (Dataproc) and serverless SQL (BigQuery) or Beam-based processing (Dataflow), trapping candidates who see 'SQL' and immediately think of BigQuery without recognizing the Spark SQL execution context.

How to eliminate wrong answers

Option A is wrong because Dataflow is a unified stream and batch processing service based on Apache Beam, not Spark SQL; migrating Spark SQL code to Dataflow would require rewriting the entire pipeline in Beam. Option C is wrong because BigQuery is a serverless data warehouse for SQL analytics, not a platform for running Spark SQL transformations; it does not support Spark execution engines. Option D is wrong because Cloud Dataprep is a visual data preparation tool for cleaning and structuring data, not a service for running Spark SQL jobs; it cannot execute custom Spark code.


MCQmedium

A company wants to orchestrate a multi-step data processing workflow that includes calling a Cloud Run service, waiting for its completion, and then running a BigQuery query. The workflow should be serverless and integrate with Cloud Events. Which Google Cloud service should they use?

A.Eventarc

B.Cloud Workflows

C.Cloud Composer

D.Cloud Dataflow

AnswerB

Workflows is a serverless orchestration service that can call Cloud Run, BigQuery, and other APIs, and can be triggered by Eventarc.

Why this answer

Cloud Workflows is the correct choice because it is a serverless workflow orchestrator that can coordinate multi-step processes involving Cloud Run and BigQuery. Its `call` steps invoke HTTP endpoints and Google Cloud APIs in sequence, and its connectors can block until a long-running operation (such as a BigQuery job) completes before the next step runs. Additionally, Workflows can be triggered by Eventarc, making it fully integrated with the event-driven architecture described.

Exam trap

The trap here is that candidates confuse Eventarc (event routing) with workflow orchestration, assuming that routing events alone can handle sequencing and waiting, when in fact Eventarc lacks the state management and step coordination required for multi-step workflows.

How to eliminate wrong answers

Option A is wrong because Eventarc is a service for routing events from various sources to targets (like Cloud Run, Cloud Functions), but it does not provide workflow orchestration capabilities such as waiting for completion or sequencing steps. Option C is wrong because Cloud Composer is a managed Apache Airflow service that is not serverless (it requires provisioning and managing a cluster of workers) and is overkill for a simple multi-step workflow; it is designed for complex, scheduled pipelines, not lightweight event-driven orchestration. Option D is wrong because Cloud Dataflow is a stream and batch data processing service (based on Apache Beam) that focuses on transforming data pipelines, not on orchestrating heterogeneous services like Cloud Run and BigQuery queries; it lacks native workflow sequencing and event-driven triggers.


MCQhard

A data engineer is designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery using the Storage Write API in exactly-once mode. The pipeline must handle late-arriving data (up to 1 hour) and maintain correct aggregation results. Which trigger configuration should they use?

A.Default trigger without allowed lateness

B.After count trigger with count of 1000

C.Fixed time trigger every 5 minutes

D.After watermark trigger with allowed lateness of 1 hour

AnswerD

This configuration ensures late data within 1 hour is included in aggregations.

Why this answer

Option D is correct because the After watermark trigger with allowed lateness of 1 hour emits a pane when the watermark passes the end of the window and then keeps the window open, so data arriving up to 1 hour late is still added to the correct aggregation. The trigger governs when results are emitted; exactly-once delivery to BigQuery is provided separately by the Storage Write API sink.

Exam trap

The exam often tests the misconception that any trigger with a time-based interval (like fixed 5-minute triggers) can handle late data, but only watermark-based triggers with allowed lateness correctly align with event-time windows and late-arriving records.

How to eliminate wrong answers

Option A is wrong because the default trigger fires once when the watermark passes the end of the window and, with no allowed lateness, any element that arrives after that point is dropped, violating the requirement to handle up to 1 hour of late data. Option B is wrong because an After count trigger of 1000 fires after every 1000 elements regardless of window boundaries or watermark, which does not respect the 1-hour lateness requirement and can cause incomplete or overlapping aggregations. Option C is wrong because a Fixed time trigger every 5 minutes fires periodically without regard to watermark or lateness, leading to premature emissions and incorrect aggregation results when late data arrives after the trigger fires.
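The interaction between the watermark and allowed lateness can be modeled in a few lines. This is a simplified sketch of the rule, not Apache Beam code: the `Window` class and `classify` function are hypothetical, times are event-time seconds, and exact boundary handling in a real runner may differ.

```python
from dataclasses import dataclass


@dataclass
class Window:
    start: int  # event-time seconds, inclusive
    end: int    # event-time seconds, exclusive


def classify(event_time: int, watermark: int, window: Window,
             allowed_lateness: int) -> str:
    """Return how an element belonging to `window` is treated on arrival."""
    if not (window.start <= event_time < window.end):
        raise ValueError("element does not belong to this window")
    if watermark < window.end:
        return "on_time"            # window has not closed yet
    if watermark < window.end + allowed_lateness:
        return "late_but_accepted"  # fires a late pane, aggregate is updated
    return "dropped"                # window state has been discarded


five_min = Window(start=0, end=300)
one_hour = 3600

print(classify(120, watermark=200, window=five_min, allowed_lateness=one_hour))
print(classify(120, watermark=1800, window=five_min, allowed_lateness=one_hour))
print(classify(120, watermark=4000, window=five_min, allowed_lateness=one_hour))
print(classify(120, watermark=1800, window=five_min, allowed_lateness=0))
```

The last call shows the behavior of option A: with zero allowed lateness, an element arriving after the watermark has passed the window end is dropped.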


Multi-Selectmedium

A data engineering team needs to ingest streaming data from an existing Kafka cluster (on-premises) into Google Cloud for real-time analytics. They want to minimize changes to the existing Kafka setup and avoid long-term operational overhead. Which TWO approaches should they consider?

Select 2 answers

A.Use Storage Transfer Service to copy Kafka logs from on-premises to GCS

B.Deploy a Kafka Connect cluster on Google Cloud with the Pub/Sub sink connector

C.Use Datastream to capture changes from Kafka

D.Replace the on-premises Kafka cluster with Google Cloud Pub/Sub

E.Set up a Dataproc cluster with Kafka and use MirrorMaker to replicate data to the cloud Kafka cluster

AnswersB, E

This allows streaming from on-prem Kafka to Pub/Sub without modifying the existing Kafka setup.

Why this answer

Using Kafka Connect with the Pub/Sub connector or setting up a Dataproc cluster running Kafka with mirroring are two ways to bridge on-premises Kafka to GCP with minimal changes.


Multi-Selectmedium

A company is building a data pipeline that ingests streaming data from Pub/Sub, transforms it with Dataflow, and loads it into BigQuery. They want to handle malformed messages that cannot be parsed. Which TWO actions should they implement for error handling? (Choose 2)

Select 2 answers

A.Configure the pipeline to drop malformed messages silently

B.Use a side input to filter out malformed messages

C.Use a dead letter sink to write malformed messages to Cloud Storage or Pub/Sub for later analysis

D.Raise an exception in the DoFn to fail the pipeline immediately

E.Log the error and continue processing the next message

AnswersC, E

This allows reprocessing without blocking the main pipeline.

Why this answer

Option C is correct because a dead letter sink (e.g., writing malformed messages to Cloud Storage or a separate Pub/Sub topic) allows the pipeline to continue processing valid data while preserving the problematic records for offline inspection, retries, or debugging. This pattern is a standard best practice in streaming pipelines to avoid data loss and enable recovery without blocking the main data flow.
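The dead letter pattern combined with logging can be sketched in plain Python. This is not Dataflow or Beam code: the `process` function is hypothetical, and the two returned lists stand in for the main output and the dead letter sink (for example a Cloud Storage path or a separate Pub/Sub topic).

```python
import json
import logging

logger = logging.getLogger("pipeline")


def process(messages: list[bytes]) -> tuple[list[dict], list[bytes]]:
    """Parse messages; route unparseable ones to a dead letter list."""
    parsed: list[dict] = []
    dead_letters: list[bytes] = []
    for raw in messages:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            parsed.append(record)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            # Log and continue instead of failing the whole pipeline.
            logger.warning("malformed message sent to dead letter: %s", exc)
            dead_letters.append(raw)  # keep the raw payload for later analysis
    return parsed, dead_letters


good, bad = process([b'{"id": 1}', b"not json", b"[1, 2]", b'{"id": 2}'])
print(good)
print(bad)
```

Keeping the raw bytes, rather than a partially parsed value, means the dead letter records can be replayed through a fixed parser later.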

Exam trap

The exam often tests the misconception that raising an exception (Option D) is acceptable for error handling in streaming pipelines, but the correct approach is to isolate failures using a dead letter sink (Option C) while logging errors (Option E) to maintain pipeline continuity.


Multi-Selectmedium

A retail company wants to trigger a Cloud Run service whenever a new CSV file is uploaded to a specific Cloud Storage bucket. Which THREE components are needed to set up this event-driven architecture? (Choose 3)

Select 3 answers

A.Cloud Storage bucket with notifications enabled

B.Eventarc trigger

C.Cloud Dataflow pipeline

D.Cloud Run service

E.Cloud Pub/Sub topic

AnswersA, B, D

The source of events; bucket must be configured to send notifications.

Why this answer

Option A is correct because Cloud Storage buckets must have notifications enabled to publish events to Pub/Sub when objects are created. Without enabling notifications, the bucket cannot emit events that trigger downstream services. This is typically done by configuring the bucket to send notifications to a Pub/Sub topic for each new object upload.

Exam trap

The exam often tests the misconception that you must manually create a Pub/Sub topic as a separate component, when in fact Eventarc manages it automatically, making the Pub/Sub topic an implicit part of the Eventarc trigger rather than a distinct required component.


Multi-Selectmedium

You need to ingest streaming data from a custom application into BigQuery with exactly-once semantics and low latency. The data volume is up to 10 MB/s. Which TWO services should you combine?

Select 2 answers

A.Pub/Sub

B.Cloud Functions

C.BigQuery legacy streaming inserts

D.Dataflow with Storage Write API

E.Datastream

AnswersA, D

Pub/Sub is the recommended message ingestion service for streaming data.

Why this answer

Pub/Sub provides reliable, low-latency message ingestion, and Dataflow can read from Pub/Sub and write to BigQuery using the Storage Write API, which supports exactly-once semantics. The Storage Write API with committed mode ensures exactly-once delivery.
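One way to see how offsets make retries safe is a toy model of an append stream. This is a sketch of the idea only, not the Storage Write API client: the `CommittedStream` class is hypothetical, and the real service signals these cases with error codes rather than a boolean.

```python
class CommittedStream:
    """Toy append stream that ignores replays of already-written offsets."""

    def __init__(self) -> None:
        self.rows: list[str] = []

    def append(self, offset: int, batch: list[str]) -> bool:
        """Append `batch` at `offset`; return False if it was a duplicate."""
        if offset < len(self.rows):
            return False  # retry of a batch that was already committed
        if offset > len(self.rows):
            raise ValueError("offset gap: an earlier batch is missing")
        self.rows.extend(batch)
        return True


stream = CommittedStream()
first = stream.append(0, ["a", "b"])
retry = stream.append(0, ["a", "b"])   # e.g. the client timed out and resent
second = stream.append(2, ["c"])
print(stream.rows, first, retry, second)
```

Because the writer supplies the offset, a resend after a timeout cannot add the same rows twice, which is the property the explanation above relies on.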


MCQeasy

A data engineer needs to load 10 GB of CSV files from Amazon S3 into BigQuery on a daily basis. The files arrive in a specific S3 bucket at 3 AM UTC each day. Which service should be used to automate this transfer?

A.Cloud Storage Transfer Service

B.Dataflow with Pub/Sub

C.Transfer Appliance

D.BigQuery Data Transfer Service

AnswerD

BigQuery Data Transfer Service can schedule and automate data loads from Amazon S3 directly into BigQuery.

Why this answer

BigQuery Data Transfer Service supports scheduled transfers from Amazon S3 directly to BigQuery, making it the appropriate choice for this recurring batch load.


MCQmedium

You are building a streaming pipeline to ingest real-time clickstream data from a website into BigQuery for immediate analysis. The data must be available in BigQuery within seconds and you need to handle late-arriving data (e.g., browser offline events) that may arrive hours later. Which approach should you use?

A.Use Pub/Sub with Dataflow, writing to BigQuery using the Storage Write API in committed mode.

B.Use Cloud Logging to capture logs and export to BigQuery via a sink.

C.Use Pub/Sub with Cloud Functions, writing each event directly via BigQuery legacy streaming inserts.

D.Use Datastream to stream clickstream data from Cloud SQL to BigQuery.

AnswerA

This provides low-latency streaming, late data handling via Dataflow's triggers, and efficient writes.

Why this answer

Option A is correct because Pub/Sub provides a scalable, durable ingestion layer for real-time clickstream data, and Dataflow can handle late-arriving data via its built-in watermark and trigger mechanisms. The Storage Write API in committed mode ensures exactly-once semantics and low-latency writes to BigQuery, meeting the within-seconds availability requirement while preserving data consistency for delayed events.

Exam trap

The trap here is that candidates assume legacy streaming inserts (Option C) are sufficient for real-time needs, but they overlook that the legacy API offers only best-effort de-duplication and that Cloud Functions provide no late-data handling, both of which matter in the PDE exam's focus on streaming pipelines with out-of-order events.

How to eliminate wrong answers

Option B is wrong because Cloud Logging is designed for log ingestion and analysis, not for high-throughput real-time clickstream pipelines; exporting logs via a sink adds latency and offers no mechanism for handling late-arriving events. Option C is wrong because BigQuery legacy streaming inserts offer only best-effort de-duplication rather than exactly-once semantics, and Cloud Functions lack the stateful processing capabilities (e.g., windowing, triggers) needed to handle late-arriving data correctly. Option D is wrong because Datastream is built for continuous replication from databases like Cloud SQL to BigQuery, not for ingesting raw clickstream events from a website; it requires an intermediary database, which adds unnecessary complexity and latency.
