Knowledge + Practice

PDE Ingestion and Processing Questions

51 of 126 questions · Page 2/2 · Pde Ingestion Processing topic · Answers revealed

76

MCQ · easy

A team needs to orchestrate a multi-step workflow that involves calling external APIs, running BigQuery queries, and conditionally executing Cloud Functions. Which Google Cloud service is best suited for this?

A.Dataflow

B.Workflows

C.Cloud Composer

D.Cloud Scheduler

Answer B

Lightweight orchestration service with steps, conditions, and error handling.

Why this answer

Workflows is a serverless orchestration service that allows you to define multi-step workflows as a sequence of steps, including HTTP calls to external APIs, BigQuery queries, and conditional logic to invoke Cloud Functions. It integrates natively with other Google Cloud services via the Workflows API and supports error handling, retries, and parallel steps, making it ideal for this use case.

Exam trap

The exam often tests the distinction between orchestration services (Workflows), data processing services (Dataflow), and scheduling services (Cloud Scheduler). Candidates who confuse data processing with workflow orchestration tend to choose Dataflow.

How to eliminate wrong answers

Option A is wrong because Dataflow is a stream and batch data processing service based on Apache Beam, not an orchestration tool for coordinating API calls, BigQuery queries, and Cloud Functions. Option C is wrong because Cloud Composer is a managed Apache Airflow service that is designed for complex, scheduled workflows with dependencies, but it is heavier, requires more setup, and is overkill for a simple multi-step orchestration that Workflows handles more efficiently. Option D is wrong because Cloud Scheduler is a cron job service for triggering tasks on a schedule, but it cannot orchestrate conditional logic, API calls, or BigQuery queries within a single workflow.
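
As a rough illustration of the kind of orchestration described above, the sketch below builds a Workflows definition as a Python structure and serializes it to JSON (Workflows accepts JSON as well as YAML). The step names, both URLs, the query, the table name, and the `pending` response field are hypothetical placeholders, not part of the original question.

```python
import json

# Hypothetical endpoints: replace with real URLs before deploying.
EXTERNAL_API_URL = "https://api.example.com/orders/pending"
FUNCTION_URL = "https://REGION-PROJECT.cloudfunctions.net/notify"


def build_workflow():
    """Return a list of steps: API call, BigQuery query, conditional function call."""
    return [
        {"fetchPending": {
            "call": "http.get",
            "args": {"url": EXTERNAL_API_URL},
            "result": "apiResponse",
        }},
        {"runQuery": {
            # BigQuery connector call; the table name is an assumption.
            "call": "googleapis.bigquery.v2.jobs.query",
            "args": {
                "projectId": '${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}',
                "body": {
                    "query": "SELECT COUNT(*) AS n FROM `my_dataset.orders`",
                    "useLegacySql": False,
                },
            },
            "result": "queryResult",
        }},
        {"decide": {
            # Only invoke the function when the API reported pending work.
            "switch": [{
                "condition": "${apiResponse.body.pending > 0}",
                "next": "callFunction",
            }],
            "next": "end",
        }},
        {"callFunction": {
            "call": "http.post",
            "args": {
                "url": FUNCTION_URL,
                "auth": {"type": "OIDC"},
                "body": {"rows": "${queryResult.rows}"},
            },
            "result": "functionResult",
        }},
    ]


workflow_json = json.dumps(build_workflow(), indent=2)
print(workflow_json)
```

The `switch` step is what separates this from a plain schedule: the Cloud Function is called only when the condition holds, otherwise the workflow jumps to `end`.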

77

MCQ · medium

A data engineer needs to ingest daily Salesforce reports into BigQuery without writing custom code. The reports are exported to an Amazon S3 bucket on a schedule. Which service should they use to automate the transfer?

A.Cloud Dataproc

B.BigQuery Data Transfer Service

C.Cloud Composer

D.Cloud Storage Transfer Service

Answer B

Supports Amazon S3 as a source for scheduled transfers directly into BigQuery.

Why this answer

The BigQuery Data Transfer Service is the correct choice because it supports Amazon S3 as a source and loads files from an S3 bucket into BigQuery on a user-defined schedule, with no custom code. Because the Salesforce reports are already exported to S3, the transfer only needs the bucket location, AWS access credentials, and a destination table; the service then handles the recurring runs.

Exam trap

The trap here is that candidates often confuse Storage Transfer Service (which only moves objects between storage systems, such as S3 to Cloud Storage) with BigQuery Data Transfer Service (which loads data from sources such as Amazon S3 and SaaS applications directly into BigQuery), leading them to pick option D when the requirement is a no-code, direct-to-BigQuery solution.

How to eliminate wrong answers

Option A is wrong because Cloud Dataproc is a managed Spark/Hadoop service for running big data processing jobs, not a no-code data ingestion tool; it would require writing custom code to read the report files from S3 and load them into BigQuery. Option C is wrong because Cloud Composer is a managed Apache Airflow service for orchestrating workflows; while it could be used to build a custom pipeline, it requires writing DAGs and code, which contradicts the 'without writing custom code' requirement. Option D is wrong because Storage Transfer Service is designed for moving data between storage systems (e.g., S3 to GCS) and does not load data into BigQuery; a separate load step would still be needed after the copy.

78

MCQ · easy

You need to ingest Google Ads performance data into BigQuery on a daily basis for reporting. Which service should you use?

A.BigQuery Data Transfer Service for Google Ads

B.Cloud Scheduler to call Google Ads API and load to BigQuery

C.Pub/Sub with a Google Ads subscriber

D.Storage Transfer Service for Google Ads

Answer A

This service is specifically designed to import data from Google Ads into BigQuery on a scheduled basis.

Why this answer

The BigQuery Data Transfer Service for Google Ads is the correct choice because it provides a fully managed, scheduled connector that automatically ingests Google Ads performance data into BigQuery on a daily basis without requiring any custom code. It handles authentication, schema mapping, and incremental loads, making it the simplest and most reliable solution for this specific use case.

Exam trap

The exam often tests the distinction between fully managed services (like BigQuery Data Transfer Service) and generic infrastructure components (like Cloud Scheduler or Pub/Sub) that require custom development, leading candidates to overcomplicate the solution by choosing a more flexible but less appropriate option.

How to eliminate wrong answers

Option B is wrong because Cloud Scheduler is a cron job service that can trigger HTTP requests, but it does not natively integrate with the Google Ads API or handle the complex authentication, pagination, and schema mapping required to load data into BigQuery; you would still need to build and maintain a custom application. Option C is wrong because Pub/Sub is a messaging service for asynchronous event streaming, not a batch ingestion tool; while you could theoretically publish Google Ads data to Pub/Sub, there is no native Google Ads subscriber, and you would need to build a custom subscriber to write to BigQuery, which is far more complex than using the dedicated transfer service. Option D is wrong because Storage Transfer Service is designed for moving data from on-premises or cloud storage (like S3 or HTTP endpoints) into Google Cloud Storage, not for directly ingesting data from Google Ads into BigQuery.

79

MCQ · medium

A data team wants to load millions of small JSON files (each <1 MB) from GCS into BigQuery daily with the lowest cost and fastest performance. They need exactly-once semantics and the ability to detect new files automatically. Which approach is most suitable?

A.Use Dataflow to read files from GCS, combine them, and write to BigQuery using the Storage Write API in exactly-once mode

B.Use Cloud Functions to trigger on new files and stream each row via the legacy streaming inserts API

C.Use Storage Transfer Service to copy files to a staging bucket, then run a scheduled query to load them

D.Use BigQuery batch load jobs with a wildcard URI to load all files directly

Answer A

Dataflow can efficiently combine small files and write to BigQuery with exactly-once semantics using the Storage Write API.

Why this answer

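Millions of files under 1 MB are a poor fit for loading one by one: per-file overhead dominates, and a single load job is limited in how many files it can reference. Combining the files first is the better approach. A Dataflow pipeline can watch the bucket for new files, merge their records, and write to BigQuery through the Storage Write API, which supports exactly-once delivery.

Option B relies on legacy streaming inserts, which offer only best-effort de-duplication rather than exactly-once delivery. Option C is wrong because Storage Transfer Service only copies objects between storage systems; it does not load them into BigQuery. Option D can load many files with a wildcard URI, but a load job does not detect new files by itself, so every run would need extra logic to avoid reloading files that were already ingested.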

80

MCQ · easy

You need to react to changes in a GCS bucket (e.g., new object creation) and trigger a Cloud Run service to process the new file. Which Google Cloud service should you use to route the event?

A.Pub/Sub directly with a Cloud Run subscription

B.Cloud Tasks

C.Eventarc

D.Cloud Scheduler

Answer C

Eventarc handles events from GCS and other sources, routing them to Cloud Run.

Why this answer

Eventarc is the correct choice because it is purpose-built to route events from Google Cloud sources (like Cloud Storage) to Cloud Run. It directly supports Cloud Storage audit logs and Pub/Sub event triggers, allowing you to react to object creation events without custom middleware. Eventarc handles the event routing, filtering, and delivery to your Cloud Run service automatically.

Exam trap

The exam often tests the misconception that Pub/Sub is the direct answer for any event routing. The trap here is that Eventarc is the managed service that simplifies the integration between GCS and Cloud Run, making it the correct choice over raw Pub/Sub.

How to eliminate wrong answers

Option A is wrong because wiring Pub/Sub directly to Cloud Run requires you to manually configure bucket notifications, a Pub/Sub topic, and a push subscription that targets the service; Eventarc abstracts this complexity and provides native integration with Cloud Storage events. Option B is wrong because Cloud Tasks is a task queue for asynchronous execution of HTTP requests, not designed for event-driven routing from GCS; it would require you to manually create tasks in response to events, adding unnecessary overhead. Option D is wrong because Cloud Scheduler is a cron job scheduler for periodic tasks, not an event router; it cannot react to real-time object creation events in a GCS bucket.
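
On the receiving side, the Cloud Run service gets an HTTP request carrying a CloudEvent. The sketch below shows one way the handler logic might extract the bucket and object name. It assumes binary content mode (`ce-*` headers plus a JSON body) and the `bucket` and `name` payload fields; the function name is hypothetical and the HTTP server itself is omitted.

```python
import json

# Event type Eventarc uses for a newly created (finalized) object.
FINALIZED = "google.cloud.storage.object.v1.finalized"


def parse_storage_event(headers, body):
    """Return (bucket, object_name) for an object-finalized event, else None."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {key.lower(): value for key, value in headers.items()}
    if normalized.get("ce-type") != FINALIZED:
        return None  # Ignore event types this service does not handle.
    data = json.loads(body)
    return data["bucket"], data["name"]


example = parse_storage_event(
    {"Ce-Type": FINALIZED, "Ce-Subject": "objects/incoming/report.json"},
    b'{"bucket": "my-bucket", "name": "incoming/report.json"}',
)
print(example)
```

Filtering on the event type inside the handler is a defensive choice; the Eventarc trigger normally restricts which events are delivered in the first place.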

81

MCQ · medium

You are migrating an existing Kafka cluster to Google Cloud using Dataproc. The cluster handles high-throughput streaming data with strict ordering requirements per partition. Which choice of Dataproc configuration is most appropriate?

A.Use Dataflow with Kafka IO instead of Dataproc.

B.Use Dataproc with local SSDs for better performance, and enable autoscaling.

C.Use Dataproc with preemptible workers to reduce cost, and attach standard persistent disks.

D.Use Dataproc with non-preemptible workers and persistent SSD storage for brokers.

Answer D

Non-preemptible workers provide stability for Kafka brokers, and SSDs offer low latency for high-throughput streaming.

Why this answer

Option D is correct because Kafka brokers in a Dataproc cluster require persistent, non-preemptible workers to maintain data durability and strict ordering per partition. Preemptible workers can be terminated at any time, causing data loss or rebalancing that violates ordering guarantees. Persistent SSD storage provides the low-latency I/O needed for high-throughput Kafka workloads, while non-preemptible instances ensure broker stability and consistent replication.

Exam trap

The exam often tests the misconception that preemptible VMs or local SSDs are acceptable for stateful, ordered workloads like Kafka, when in fact they undermine durability and ordering guarantees because of ephemeral storage and abrupt termination.

How to eliminate wrong answers

Option A is wrong because Dataflow with Kafka IO is a serverless stream processing service, not a Kafka cluster migration target; the question asks about migrating an existing Kafka cluster to Dataproc, not replacing it with a different processing paradigm. Option B is wrong because local SSDs are ephemeral and lose data on instance termination, which is incompatible with Kafka's durability and ordering requirements; autoscaling can cause partition rebalancing that disrupts strict ordering. Option C is wrong because preemptible workers can be terminated at any time, leading to data loss and partition leader re-elections that break ordering guarantees; standard persistent disks have higher latency than SSDs, degrading Kafka's throughput.

82

MCQ · medium

A data engineer needs to load 2 TB of Avro files stored in Cloud Storage into BigQuery on a daily schedule. The schema is static and the data should overwrite the existing table each day. What is the most efficient way to accomplish this?

A.Create a BigQuery Data Transfer Service from Cloud Storage

B.Create a Dataflow pipeline to read Avro files and stream to BigQuery

C.Mount Cloud Storage as a filesystem and use SELECT INTO

D.Use bq load with --replace flag in a cron job

Answer D

Why this answer

The bq load command with the --replace flag is the most efficient method because it directly loads Avro files from Cloud Storage into BigQuery in a single, serverless operation without requiring any intermediate processing. Since the schema is static and the data overwrites the existing table daily, a simple cron job invoking bq load --replace is the simplest, fastest, and most cost-effective solution, avoiding the overhead of additional services like Dataflow or Data Transfer Service.

Exam trap

The exam often tests the misconception that a heavier managed service such as Dataflow is always the best choice for scheduled loads; for a simple batch load with a static schema, Dataflow only adds overhead. Note that the BigQuery Data Transfer Service for Cloud Storage can overwrite the destination table (its write preference can be set to mirror instead of append), so Option A is a workable managed alternative rather than an impossible one.

How to eliminate wrong answers

Option A is not the keyed answer because it adds a transfer configuration to manage for what is a single load command; it remains a valid approach, since a Cloud Storage transfer can be configured to overwrite the destination table on each run. Option B is wrong because a Dataflow pipeline introduces unnecessary complexity, cost, and latency for a straightforward batch load; streaming to BigQuery is not needed for daily overwrites of static Avro files, and batch loads are more efficient. Option C is wrong because mounting Cloud Storage as a filesystem (e.g., via gcsfuse) and using SELECT INTO is not a native BigQuery operation; BigQuery does not support SELECT INTO from a mounted filesystem, and this approach would require an external processing layer, defeating efficiency.
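
The daily overwrite described in this question could be scripted roughly as follows. This is a minimal sketch that only assembles the `bq load` command line; the table name and bucket path are hypothetical, and the cron schedule that would invoke the script is left out.

```python
import shlex


def bq_load_replace_command(table, gcs_uri):
    """Build a bq command that overwrites `table` with the Avro files at `gcs_uri`."""
    return [
        "bq", "load",
        "--replace",             # overwrite the table instead of appending
        "--source_format=AVRO",  # Avro files carry their own schema
        table,
        gcs_uri,
    ]


# Hypothetical table and bucket path.
cmd = bq_load_replace_command("my_dataset.daily_events", "gs://my-bucket/exports/*.avro")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the load and raise an error if `bq` exits with a non-zero status, which is useful when the script runs unattended.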

83

MCQ · medium

A company wants to use dbt to transform data in BigQuery. Their source data is loaded daily into staging tables. They need to run dbt transformations on a schedule and only process tables that have changed. Which dbt feature should they use?

A.dbt snapshots

B.dbt incremental models

C.dbt seeds

D.dbt sources

Answer B

Incremental models only process new/changed records, reducing cost and runtime.

Why this answer

dbt incremental models allow processing only new or changed records based on a configured timestamp or unique key. dbt snapshots capture historical changes. dbt seeds load CSV files. dbt sources are for configuration, not incremental processing.
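
dbt models are written in SQL, where an incremental model typically wraps a timestamp filter in `is_incremental()` and compares against `{{ this }}`. The Python below is only a language-neutral sketch of that logic; the function name and the `id` and `updated_at` columns are assumptions made for illustration.

```python
def incremental_run(target, source, unique_key="id", ts="updated_at"):
    """Merge only source rows newer than the target's high-water mark.

    Returns (merged_rows, number_of_rows_processed).
    """
    # High-water mark: the newest timestamp already present in the target.
    high_water = max((row[ts] for row in target), default=None)
    # First run (empty target) processes everything; later runs filter.
    new_rows = [r for r in source if high_water is None or r[ts] > high_water]
    # Upsert on the unique key so changed records replace old versions.
    merged = {row[unique_key]: row for row in target}
    for row in new_rows:
        merged[row[unique_key]] = row
    return list(merged.values()), len(new_rows)


target = [{"id": 1, "updated_at": 1, "status": "new"}]
source = [
    {"id": 1, "updated_at": 2, "status": "shipped"},  # changed record
    {"id": 2, "updated_at": 3, "status": "new"},      # new record
    {"id": 3, "updated_at": 1, "status": "stale"},    # not newer: skipped
]
rows, processed = incremental_run(target, source)
print(processed)  # 2
```

The skipped row shows the usual caveat of timestamp-based filtering: records that arrive late with an old timestamp are not picked up unless the filter allows a lookback.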

84

MCQ · easy

A company wants to stream real-time clickstream data from a website into BigQuery for near-real-time analytics. They expect peaks of 10,000 events per second. Which combination of services is most suitable for ingestion?

A.Cloud Storage → Cloud Functions → BigQuery

B.Direct Web → Dataflow → BigQuery

C.Pub/Sub → Dataflow → BigQuery (Storage Write API)

D.Pub/Sub → Dataflow → BigQuery (legacy streaming inserts)

Answer C

This is the modern recommended architecture: Pub/Sub for ingestion, Dataflow for processing, Storage Write API for high-throughput streaming ingestion into BigQuery.

Why this answer

Pub/Sub is designed for ingesting high-throughput event streams, Dataflow can process and transform the data in real time, and the BigQuery Storage Write API provides exactly-once semantics and higher throughput than legacy streaming inserts. Option C uses the correct pipeline. Option A is file-based: writing objects to Cloud Storage and loading them through Cloud Functions suits micro-batches, not stream processing at 10,000 events per second.

Option B sends events from the website straight to Dataflow, with no messaging layer to buffer peaks or decouple producers from the pipeline. Option D uses legacy streaming inserts, which have lower throughput than the Storage Write API and offer only best-effort de-duplication.

85

MCQ · medium

A company runs Apache Kafka on Dataproc for real-time event streaming. They want to archive the Kafka topics to Cloud Storage for long-term retention and later analysis in BigQuery. Which approach is the most cost-effective and operationally simple?

A.Use Apache Spark streaming on Dataproc to read from Kafka and write to GCS

B.Use Kafka MirrorMaker to replicate topics to a second cluster that writes to GCS

C.Use the Pub/Sub connector to publish Kafka messages to Pub/Sub, then a Dataflow job to write to GCS

D.Use Kafka Connect with the GCS Sink Connector to write directly to Cloud Storage

Answer D

Kafka Connect GCS Sink Connector is purpose-built, simple to configure, and runs on the same Dataproc cluster.

Why this answer

Option D is correct because Kafka Connect with the GCS Sink Connector is purpose-built for exactly this use case: it streams Kafka topics directly to Cloud Storage in Avro, Parquet, or JSON format without requiring intermediate processing clusters or services. This approach minimizes operational overhead (no Spark or Dataflow jobs to manage) and is cost-effective because it runs as a lightweight connector within the existing Kafka ecosystem, reusing the Kafka deployment already running on Dataproc.

Exam trap

The exam often tests the misconception that streaming data to Cloud Storage requires a full streaming pipeline (Spark, Dataflow) or an intermediary service like Pub/Sub, when in fact Kafka Connect provides a native, lightweight, and cost-effective sink directly to GCS.

How to eliminate wrong answers

Option A is wrong because using Apache Spark streaming on Dataproc to read from Kafka and write to GCS introduces unnecessary compute overhead, latency, and operational complexity (managing Spark jobs, checkpointing, and resource scaling) compared to a direct connector. Option B is wrong because Kafka MirrorMaker is designed for cross-cluster replication, not for writing to GCS; it would require an additional sink to write to GCS, adding complexity and cost without any benefit. Option C is wrong because routing Kafka messages through Pub/Sub adds latency, extra cost (Pub/Sub egress and Dataflow processing), and operational complexity (managing a Pub/Sub topic, subscription, and Dataflow pipeline) when a direct connector to GCS exists.

86

MCQ · hard

Your Dataflow pipeline reads from Pub/Sub, performs transformations, and writes to BigQuery. You notice that the pipeline's autoscaling is not keeping up with sudden spikes in traffic, causing increased lag. The pipeline uses Classic Templates. Which change would most effectively improve autoscaling responsiveness?

A.Enable Dataflow Streaming Engine on the pipeline.

B.Switch to Dataflow Prime with Vertical Autoscaling enabled.

C.Increase the initial number of workers to handle the spike.

D.Use Flex Templates instead of Classic Templates.

Answer A

Streaming Engine improves autoscaling by decoupling compute from state, allowing workers to scale more quickly.

Why this answer

Enabling Dataflow Streaming Engine reduces the overhead of checkpointing and state management by offloading them to the service side, which allows the pipeline to scale more quickly in response to sudden traffic spikes. This directly addresses the autoscaling lag because Streaming Engine decouples compute from state, enabling faster worker adjustments without the bottleneck of persistent disk-based shuffle.

Exam trap

The exam often tests the misconception that Flex Templates improve runtime performance or autoscaling, when in fact they only affect deployment flexibility, not the underlying execution engine's scaling behavior.

How to eliminate wrong answers

Option B is wrong because Dataflow Prime with Vertical Autoscaling adjusts the CPU/memory of existing workers, not the number of workers, so it does not improve horizontal autoscaling responsiveness to sudden traffic spikes. Option C is wrong because increasing the initial number of workers only sets a starting point; it does not improve the pipeline's ability to scale up dynamically during a spike, and it may waste resources during low traffic. Option D is wrong because Flex Templates only affect how the pipeline is deployed and parameterized, not the runtime autoscaling behavior; Classic Templates and Flex Templates share the same autoscaling mechanisms.

87

MCQ · medium

A company uses Pub/Sub to ingest clickstream data. Each message contains a JSON payload with a nested array of user actions. The data must be written to BigQuery, with each action in the array becoming a separate row. Which BigQuery feature or approach should be used to achieve this transformation?

A.Use a Dataflow pipeline with a ParDo that explodes the array

B.Load the JSON as-is into BigQuery and use UNNEST in a query

C.Use a BigQuery scripting loop to iterate over the array

D.Preprocess the data with a Dataflow pipeline and write to BigQuery

Answer B

UNNEST can be used in a query to flatten the array into rows, which is a cost-effective approach.

Why this answer

BigQuery's UNNEST function is designed to flatten arrays into separate rows, which is exactly what is needed here.
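
In SQL this would be a query along the lines of `SELECT user_id, a.type, a.ts FROM clicks, UNNEST(actions) AS a`. The Python sketch below mirrors that flattening so the row-per-element result is easy to see. The `user_id`, `actions`, `type`, and `ts` field names and the `clicks` table are assumptions, since the question does not give the payload schema.

```python
def explode_actions(message):
    """Yield one flat row per element of the nested 'actions' array."""
    for action in message.get("actions", []):
        yield {
            "user_id": message["user_id"],
            "action": action["type"],
            "ts": action["ts"],
        }


message = {
    "user_id": "u42",
    "actions": [
        {"type": "view", "ts": 100},
        {"type": "click", "ts": 105},
    ],
}
rows = list(explode_actions(message))
print(len(rows))  # 2
```

As with the comma join against `UNNEST`, a message whose array is empty produces no rows at all.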

88

MCQ · hard

A company needs to continuously synchronize customer data changes from an on-premises Oracle database to BigQuery for near-real-time analytics. The Oracle database has Change Data Capture (CDC) enabled. Which Google Cloud service should be used to stream these changes with minimal latency and schema evolution support?

A.Deploy a Dataflow pipeline with a JDBC source and Pub/Sub

B.Use Cloud SQL with a read replica and enable binary logging

C.Use Transfer Appliance to copy Oracle data periodically

D.Use Datastream to stream CDC changes from Oracle to BigQuery

Answer D

Datastream directly supports Oracle CDC and streams to BigQuery with schema evolution.

Why this answer

Datastream is designed to stream CDC from Oracle (and MySQL/PostgreSQL) directly to BigQuery or GCS, supporting schema evolution and low-latency replication.

89

MCQ · easy

A data engineer needs to ingest on-premises Oracle CDC data into BigQuery in near real-time with minimal operational overhead. Which service should they use?

A.Pub/Sub + Dataflow

B.Storage Transfer Service

C.Transfer Appliance

D.Datastream

Answer D

Datastream is purpose-built for serverless CDC from databases to Google Cloud destinations like BigQuery and GCS.

Why this answer

Datastream is purpose-built for streaming change data capture (CDC) from Oracle and other sources into BigQuery with near-real-time latency and minimal operational overhead. It handles schema propagation, checkpointing, and automatic retries, eliminating the need to manage custom ingestion pipelines.

Exam trap

The exam often tests the distinction between batch migration tools (Storage Transfer Service, Transfer Appliance) and streaming CDC services (Datastream), leading candidates to choose a batch option when the question explicitly requires near-real-time ingestion.

How to eliminate wrong answers

Option A is wrong because Pub/Sub + Dataflow requires building and maintaining a custom pipeline to handle Oracle CDC, including log mining and transformation logic, which increases operational overhead compared to a managed service. Option B is wrong because Storage Transfer Service is designed for bulk batch transfers of files from cloud or on-premises storage to Google Cloud, not for streaming CDC from a live database. Option C is wrong because Transfer Appliance is a physical device for offline, high-volume data migration, which cannot provide near-real-time streaming and introduces significant latency.

90

MCQ · medium

An organization needs to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which service should they use to set up this event-driven architecture?

A.Eventarc with a trigger for Cloud Storage events

B.Pub/Sub notifications on the bucket with a push subscription to Cloud Run

C.Cloud Scheduler calling Cloud Run on a schedule

D.Cloud Functions with a GCS trigger

Answer A

Why this answer

Eventarc can capture Cloud Storage events (for example, object finalization, event type `google.cloud.storage.object.v1.finalized`) and route them to Cloud Run, Cloud Functions, or Workflows. It delivers events in the CloudEvents format.

91

Multi-Select · medium

A company is building a real-time anomaly detection pipeline using Dataflow. Events are ingested from Pub/Sub, and the pipeline must compute a sliding window average every minute over a 1-hour window. Which TWO configurations are required for this pipeline? (Choose 2)

Select 2 answers

A.Set the pipeline to use event time for watermarking.

B.Use a Sliding window of 1 hour with a 1-minute slide.

C.Use a Fixed window of 1 minute.

D.Use stateful processing with a custom timer.

E.Set the pipeline to use processing time for watermarking.

Answers A, B

Event time ensures windows based on actual event occurrence time, necessary for correct sliding window semantics.

Why this answer

A sliding window of 1-hour length with a 1-minute slide period fits the requirement (every minute, compute over last hour). Fixed window of 1 minute would compute only per-minute, not sliding. Using stateful processing with timers is an alternative but not standard for sliding windows.

Dataflow's default watermark is based on event time; processing time would cause incorrect results. The window type and period are the key.
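
To make the window arithmetic concrete, the sketch below computes which sliding windows a single event falls into, using event-time seconds. It is a plain-Python illustration of the assignment rule, not Beam code, and the function name is hypothetical.

```python
def assign_sliding_windows(event_ts, size=3600, period=60):
    """Return the [start, end) windows, in seconds, that contain event_ts."""
    # Latest window start at or before the event, aligned to the slide period.
    start = event_ts - (event_ts % period)
    windows = []
    # Walk back one slide at a time while the window still covers the event.
    while start > event_ts - size:
        windows.append((start, start + size))
        start -= period
    return windows


windows = assign_sliding_windows(7205)
print(len(windows))  # 60
```

Each event lands in size / period = 60 overlapping windows, which is why a 1-hour window sliding every minute costs noticeably more to compute than a 1-minute fixed window.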

92

Multi-Select · medium

A data engineer needs to build a Dataflow pipeline that reads JSON messages from Pub/Sub, transforms them (including filtering, grouping, and enrichment), and writes the results to BigQuery. The pipeline must handle schema evolution in the input messages and minimize data loss. Which THREE settings or features should the engineer use? (Choose THREE.)

Select 3 answers

A.Use side inputs to enrich the data with reference data from BigQuery

B.Set the `withAllowedLateness` to 0 for windowing to minimize latency

C.Set up a dead letter queue (DLQ) for messages that fail to parse or validate

D.Enable autoscaling to handle spikes in message volume

E.Enable Streaming Engine to reduce checkpoint size

Answers A, C, D

Side inputs allow joining with slowly changing reference data.

Why this answer

To handle schema evolution, a dead letter queue (option C) is essential to capture messages that do not conform to the current schema. Side inputs (option A) are a good practice for enrichment with reference data. Enabling autoscaling (option D) ensures the pipeline can handle varying throughput.

Option B is wrong because setting `withAllowedLateness` to 0 drops late-arriving data, which increases data loss instead of minimizing it. Option E is incorrect: Streaming Engine is a separate feature that manages state, but it is not directly related to schema evolution or data loss minimization.

93

MCQ · medium

A company uses Google Ads and wants to automatically load their advertising data into BigQuery daily. They also need to transform the data with SQL and schedule a recurring query. Which combination of services meets these requirements with minimal operational overhead?

A.Cloud Functions triggered by Cloud Scheduler to call Google Ads API and load into BigQuery

B.Cloud Composer to extract Google Ads API and Dataflow to transform

C.Storage Transfer Service to move CSV files to GCS, then load into BigQuery

D.BigQuery Data Transfer Service for Google Ads and scheduled queries

Answer D

Direct integration with scheduled queries for transformation.

Why this answer

BigQuery Data Transfer Service can automatically load Google Ads data; scheduled queries handle transformation.

94

MCQ · medium

A financial services company receives real-time stock trade data via Pub/Sub. They need to enrich each trade with reference data from a Cloud SQL table and write the results to BigQuery for real-time analytics. The enrichment must handle late-arriving data and ensure exactly-once processing. Which Dataflow streaming pipeline configuration should be used?

A.Use a Dataflow Flex Template that reads from Pub/Sub, joins in memory, and writes to BigQuery using legacy streaming inserts

B.Use Pub/Sub to BigQuery template with streaming inserts and a side input from Cloud SQL

C.Build a custom Dataflow pipeline using the Storage Write API with exactly-once semantics and a side input from Cloud SQL

D.Deploy a Dataproc Spark Streaming job that reads from Pub/Sub, enriches via JDBC, and writes to BigQuery

Answer C

Storage Write API with exactly-once ensures no duplicates, and side input allows enrichment from Cloud SQL.

Why this answer

Using the Storage Write API with exactly-once semantics and side inputs to join with reference data provides the required enrichment and exactly-once guarantees.
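
The exactly-once guarantee rests on stream offsets: a write that names an offset already present in the stream is rejected instead of being applied twice. The toy model below illustrates that idea only. The class is hypothetical and is not the client library, and it ignores partially overlapping batches.

```python
class OffsetStream:
    """Toy model of an append-only stream that de-duplicates by offset."""

    def __init__(self):
        self.rows = []

    def append(self, batch, offset):
        """Append `batch` at `offset`; reject replays and gaps."""
        if offset < len(self.rows):
            return "ALREADY_EXISTS"  # a retry of a write that already landed
        if offset > len(self.rows):
            return "OUT_OF_RANGE"    # a gap: an earlier write is missing
        self.rows.extend(batch)
        return "OK"


stream = OffsetStream()
first = stream.append(["trade-1", "trade-2"], offset=0)
retry = stream.append(["trade-1", "trade-2"], offset=0)  # network retry
print(first, retry, len(stream.rows))  # OK ALREADY_EXISTS 2
```

Because the retry is rejected, a pipeline can safely resend after a timeout without creating duplicate rows.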

95

MCQ · hard

A financial services company needs to ingest real-time trade data from multiple sources into BigQuery for immediate fraud detection. The data volume is high (1 million messages per second) and each message must be available for queries within seconds. They are considering the Storage Write API. Which stream mode should they choose to balance data availability and cost?

A.Legacy streaming inserts

B.Pending mode

C.Buffered mode

D.Committed mode

Answer D

Committed mode makes each row available for queries as soon as it is written, with no separate flush or commit step.

Why this answer

Committed mode (option D) is correct because records written to a committed-type stream become readable as soon as the write is acknowledged, which is what immediate fraud detection requires. The Storage Write API is billed on the volume of data ingested, not on the stream type, so choosing a mode that delays visibility does not reduce cost.

Exam trap

The exam often tests whether candidates can tell the stream types apart. The names are misleading: buffered and pending modes both hold rows back until the client takes an extra step, so they add latency instead of removing it.

How to eliminate wrong answers

Option A is wrong because legacy streaming inserts are not part of the Storage Write API; they use the older tabledata.insertAll method, which offers lower throughput and only best-effort de-duplication. Option B is wrong because pending mode buffers records until the stream is explicitly committed, which suits batch-style atomic loads, not immediate fraud detection. Option C is wrong because buffered mode holds rows until the client flushes the stream; it is an advanced mode that adds an extra step before data becomes visible.

96

Multi-Select · medium

A data engineer needs to schedule a recurring batch load of CSV files from an on-premises SFTP server into BigQuery. The files are generated daily and need to be loaded into a partitioned table by date. Which THREE steps should the engineer take? (Choose THREE.)

Select 3 answers

A.Create a Cloud Function triggered by Cloud Scheduler to load files directly from SFTP to BigQuery

B.Use Storage Transfer Service to copy files from the SFTP server to Cloud Storage every day

C.Use BigQuery Data Transfer Service with SFTP as a source

D.Set up a scheduled BigQuery load job using the Cloud Console or `bq` command to load from Cloud Storage

E.Configure the load job to write to a specific partition using `--time_partitioning_field` or `--range_partitioning`

Answers B, D, E

Storage Transfer Service supports scheduled transfers from SFTP to Cloud Storage.

Why this answer

Option B is correct because the Storage Transfer Service is designed to move data from on-premises sources (including SFTP servers) into Cloud Storage on a scheduled basis. This is the recommended first step for ingesting files from an external SFTP server into Google Cloud, as it handles the network transfer, retries, and scheduling natively without requiring custom code.

Exam trap

The exam often tests the misconception that BigQuery Data Transfer Service can ingest directly from SFTP; it supports only a limited set of SaaS and cloud sources, not on-premises SFTP servers.
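
As a rough sketch of the load step (options D and E), the snippet below only builds the `bq load` argument list for one day's files; it does not run it. The bucket, table, date-based folder layout, and `event_date` column are hypothetical names, and the destination table is assumed to exist already with a matching schema.

```python
from datetime import date

def daily_load_command(run_date, bucket="example-landing-bucket",
                       table="analytics.daily_events"):
    """Build a bq load invocation for one day's CSV files (not executed here)."""
    return [
        "bq", "load",
        "--source_format=CSV",
        "--skip_leading_rows=1",
        # Hypothetical DATE column used to partition the destination table.
        "--time_partitioning_field=event_date",
        table,
        # Assumed layout: Storage Transfer Service lands files in a per-day folder.
        f"gs://{bucket}/{run_date.isoformat()}/*.csv",
    ]

cmd = daily_load_command(date(2024, 1, 15))
print(" ".join(cmd))
```

A scheduler would call this once per day after the transfer to Cloud Storage has finished.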

97

MCQmedium

A media company streams real-time viewer data from Pub/Sub to BigQuery using a Dataflow pipeline. They need to handle occasional malformed messages without losing valid data. Which pattern should they implement?

A.Raise an exception in the pipeline and stop processing

B.Use retry logic in the pipeline to reprocess malformed messages indefinitely

C.Implement a dead letter sink to store malformed messages for later analysis

D.Discard malformed messages and log an error

AnswerC

Dead letter sinks store problematic records without blocking the pipeline, enabling later inspection and reprocessing.

Why this answer

Option C is correct because a dead letter sink (e.g., a separate Pub/Sub topic or a BigQuery error table) allows the Dataflow pipeline to route malformed messages out of the main processing path while continuing to process valid data. This pattern ensures no valid data is lost and provides a durable location for later analysis or reprocessing of the malformed records, which is essential for streaming pipelines where data quality issues are intermittent.

Exam trap

The exam often tests the dead letter pattern to check that candidates understand that streaming pipelines must handle bad data gracefully without stopping or losing valid records. The trap is that many candidates choose retry logic (Option B) because they confuse transient errors with permanent data quality issues.

How to eliminate wrong answers

Option A is wrong because raising an exception and stopping the pipeline would cause all processing to halt, leading to data loss for valid messages and violating the requirement to handle malformed messages without losing valid data. Option B is wrong because retrying malformed messages indefinitely would cause the pipeline to stall on bad records, potentially blocking the processing of subsequent valid messages and increasing latency; Dataflow's retry mechanisms are intended for transient errors, not for permanently malformed data. Option D is wrong because discarding malformed messages and logging an error results in permanent data loss, which contradicts the requirement to preserve data for later analysis and violates best practices for data integrity in streaming pipelines.
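
The routing idea can be sketched in plain Python, independent of any Beam API. This is a minimal illustration, not a Dataflow implementation: the `viewer_id` validation rule and the two in-memory lists (standing in for the main output and the dead letter sink) are assumptions made for the example.

```python
import json

def route_messages(raw_messages):
    """Split raw payloads into parsed events and dead-lettered records."""
    valid, dead_letter = [], []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            # Hypothetical validation rule: an event must be an object with a viewer_id.
            if not isinstance(event, dict) or "viewer_id" not in event:
                raise ValueError("missing viewer_id")
            valid.append(event)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            # Keep the original payload and the reason so it can be replayed later.
            dead_letter.append({"payload": raw, "error": str(err)})
    return valid, dead_letter

messages = ['{"viewer_id": 1, "title": "news"}', "not-json", '{"title": "no id"}']
valid, dead_letter = route_messages(messages)
```

Valid events keep flowing while each bad record is preserved with its error, which is the property the question asks for.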

98

MCQmedium

You are designing a Dataflow pipeline to process streaming data. The pipeline may encounter malformed records. You need to handle these errors without failing the entire pipeline and store the bad records for later analysis. What is the best practice?

A.Use a dead letter sink to write malformed records to a separate Pub/Sub topic or GCS location.

B.Catch the exception and log it, then continue processing.

C.Write all records to BigQuery using the Storage Write API and handle errors in the write operation.

D.Raise an exception in the DoFn to stop the pipeline for manual intervention.

AnswerA

This is the recommended pattern: isolate bad records for later reprocessing while allowing the pipeline to continue.

Why this answer

Dead letter sinks are a common pattern: route erroneous records to a separate output (e.g., Pub/Sub topic or GCS) for later investigation. Writing to BigQuery using Storage Write API with error handling is good, but for malformed records you want to isolate them. Raising exceptions would fail the pipeline.

Logging only loses the data.

99

Multi-Selecthard

A large enterprise is migrating its data warehouse from Teradata to BigQuery. They need to transfer historical data (100 TB) and set up ongoing daily incremental loads. They also need to transform the data using dbt. Which THREE Google Cloud services should they use?

Select 3 answers

A.Datastream

B.BigQuery Data Transfer Service for Teradata

C.Transfer Appliance

D.dbt (data build tool)

E.Cloud Composer

AnswersB, D, E

Supports both backfill and incremental transfers from Teradata to BigQuery.

Why this answer

BigQuery Data Transfer Service for Teradata handles both historical and incremental loads, dbt runs on BigQuery for transformations, and Cloud Composer can orchestrate the dbt runs on a schedule.

100

MCQeasy

A company wants to migrate 500 TB of on-premises archival data to Cloud Storage. The data is stored on a SAN and the network link is limited to 1 Gbps. The migration must complete within 10 days. What is the MOST cost-effective approach?

A.Set up a Cloud VPN and use rsync over the encrypted connection.

B.Use BigQuery Data Transfer Service to load the data directly into BigQuery.

C.Order a Transfer Appliance, copy data locally, and ship it to Google for ingestion.

D.Use Storage Transfer Service to copy data from on-premises to GCS over the existing network.

AnswerC

Transfer Appliance is designed for large offline transfers when network speed is a constraint.

Why this answer

Option C is correct because the Transfer Appliance is designed for large-scale data migrations where network bandwidth is insufficient. With 500 TB at 1 Gbps, the theoretical transfer time is over 46 days, far exceeding the 10-day window. The appliance allows you to physically ship the data, bypassing network constraints entirely, making it the most cost-effective and timely solution.

Exam trap

The trap here is that candidates underestimate the time required for network transfer at 1 Gbps and overestimate the practicality of compression or incremental sync, failing to recognize that physical shipping is the only viable option for hundreds of terabytes within a tight deadline.

How to eliminate wrong answers

Option A is wrong because rsync over a 1 Gbps Cloud VPN would take approximately 46 days for 500 TB (assuming full utilization, which is unrealistic due to overhead and encryption), far exceeding the 10-day deadline. Option B is wrong because BigQuery Data Transfer Service loads data into BigQuery from SaaS applications (e.g., Google Ads) or other cloud sources (e.g., Amazon S3); it does not ingest on-premises archival data into Cloud Storage. Option D is wrong because Storage Transfer Service relies on the existing 1 Gbps network link, which would require over 46 days for 500 TB, violating the 10-day requirement.
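
The 46-day figure can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the 80% utilization value is a hypothetical illustration, not a measured number.

```python
def transfer_days(size_tb, link_gbps, utilization=1.0):
    """Days needed to move size_tb decimal terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400

full_rate = transfer_days(500, 1)        # ideal, fully used 1 Gbps link
realistic = transfer_days(500, 1, 0.8)   # hypothetical 80% utilization
fits_deadline = full_rate <= 10
print(f"{full_rate:.1f} days at full rate, {realistic:.1f} days at 80%")
```

Even the ideal case misses the 10-day window by a wide margin, which is why the network-based options are eliminated.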

101

MCQhard

A data engineer is using Spark on Dataproc to process a large dataset. They notice the job is slow due to excessive shuffling. They want to optimize the job by using a more efficient data structure that reduces serialization overhead and provides better memory management. Which Spark API should they use?

A.Spark SQL

B.Spark Streaming

C.RDDs

D.DataFrames or Datasets

AnswerD

DataFrames and Datasets use the Catalyst optimizer and Tungsten execution engine, improving performance and memory efficiency.

Why this answer

Spark DataFrames and Datasets use the Tungsten execution engine, which provides optimized serialization and memory management. RDDs lack these optimizations. Spark SQL is a module, not a data structure, and Spark Streaming targets stream processing rather than the serialization overhead described here.

102

MCQhard

A company is using Pub/Sub to ingest clickstream events and Dataflow to write to BigQuery. They observe that some events are malformed and cause the pipeline to fail. They need a solution that captures malformed events without blocking the pipeline and allows reprocessing later. Which Dataflow pattern should they implement?

A.Use a side input to filter malformed events before the main pipeline

B.Use the Reshuffle transform to reattempt failures

C.Write malformed events to a dead letter sink (e.g., another Pub/Sub topic or GCS bucket) and continue processing healthy events

D.Use logging alerts to notify the team and stop the pipeline on error

AnswerC

Dead letter sink is the correct pattern: isolate bad records and let the pipeline proceed.

Why this answer

Dead letter sinks (also called dead letter queues, DLQs) are the standard pattern for handling bad records in Dataflow. The pipeline writes malformed records to a separate sink (e.g., Pub/Sub topic or GCS) for later analysis. Side inputs are for enriching data, not error handling.

Reshuffle redistributes elements across workers; it does not retry failures. In Beam, the dead letter pattern is typically implemented with tagged (additional) outputs that route failed elements to the dead letter sink.

103

MCQhard

A streaming pipeline ingests events from Pub/Sub, enriches them via a slow REST API call, and writes the result to BigQuery. The API has a limit of 10 requests per second per client. The pipeline processes 1000 messages per second. Which approach minimizes latency while respecting API limits?

A.Use a global window with a trigger that fires every second, and inside the DoFn limit concurrent API calls to 10.

B.Fan out the stream to multiple REST API instances using Pub/Sub topic splitting.

C.Use a Dataflow Flex Template to run multiple pipelines, each processing a subset of messages.

D.Assign each message a random key and use a sliding window of 10 seconds; the API call will be distributed across workers.

AnswerA

Groups messages into batches per second, then controls concurrency to stay within the 10 req/s limit.

Why this answer

With the global window and a trigger that fires every second, each pane carries roughly 1,000 messages, and the DoFn caps API calls at 10 in flight (e.g., via a fixed-size thread pool). This respects the limit while batching work. Beam does not throttle automatically, so the cap must be enforced in the DoFn, and grouping every message under a single key would create a bottleneck.

Fanning out to multiple API endpoints does not help if the limit is per client. Dataflow Flex Templates are irrelevant to throttling.

104

Multi-Selecthard

Your company has a Dataproc cluster that runs Spark jobs. You need to choose between RDDs, DataFrames, and Datasets for a new job that performs complex aggregations on structured data. Which TWO statements are correct regarding performance and ease of use?

Select 2 answers

A.DataFrames and Datasets are both available in PySpark.

B.DataFrames store data in a columnar format, allowing better compression.

C.RDDs are easier to use than DataFrames for complex aggregations.

D.DataFrames are optimized by Spark's Catalyst optimizer, leading to faster execution.

E.Datasets provide compile-time type safety and are always faster than DataFrames.

AnswersB, D

DataFrames use Spark's internal binary format (Tungsten), and cached DataFrames are held in a columnar layout, enabling efficient compression and serialization.

Why this answer

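DataFrames are optimized with the Catalyst optimizer and Tungsten execution, providing better performance than RDDs for structured data. Datasets combine type safety with optimized execution, but for most analytics workloads, DataFrames are sufficient and simpler. Option A is wrong because the typed Dataset API exists only in Scala and Java; PySpark exposes DataFrames. Option C is wrong because RDDs require hand-written aggregation logic. Option E is wrong because Datasets are not always faster: typed lambdas can be opaque to Catalyst.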

105

MCQeasy

You need to stream real-time user click events from your application into BigQuery for immediate analysis. The events must be available for query within seconds. Which approach is recommended?

A.Use Pub/Sub to Dataflow to BigQuery with the Storage Write API for high-throughput streaming.

B.Use Cloud Data Fusion to ingest streaming data from Pub/Sub into BigQuery.

C.Use Cloud Functions to receive events from Pub/Sub and insert them into BigQuery using the legacy streaming API.

D.Use Pub/Sub with a BigQuery subscription to directly write events into BigQuery.

AnswerA

This is the recommended architecture: Pub/Sub for ingestion, Dataflow for stream processing, and Storage Write API for low-latency streaming writes.

Why this answer

Pub/Sub to Dataflow to BigQuery using the Storage Write API provides the highest throughput and reliability with near-real-time latency. Legacy streaming inserts are limited and have higher latency. A Pub/Sub BigQuery subscription can write messages directly into a table, but it offers no processing or validation step between ingestion and storage.

Cloud Functions is not suitable for high-throughput streaming.

106

MCQeasy

A data engineer needs to query a BigQuery table that contains an array of structs. They want to expand the array into separate rows for each element. Which SQL function should they use?

A.STRUCT

B.UNNEST

C.ARRAY_AGG

D.SPLIT

AnswerB

UNNEST expands an array into rows; it is typically used with CROSS JOIN.

Why this answer

UNNEST is used to flatten arrays into a set of rows. CROSS JOIN UNNEST is standard. STRUCT is for creating structs, ARRAY_AGG is for aggregation, and SPLIT is for strings.

The question asks to expand an array, which is exactly UNNEST.
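
To make the row expansion concrete, here is a small Python emulation of what `CROSS JOIN UNNEST` does to an array of structs. The `orders` table, its columns, and the helper function are hypothetical; the SQL in the comment shows the query the emulation is meant to mirror.

```python
# BigQuery SQL being emulated (table and column names are hypothetical):
#   SELECT o.order_id, item.sku, item.qty
#   FROM orders AS o CROSS JOIN UNNEST(o.items) AS item

def cross_join_unnest(rows, array_field):
    """Yield one flat row per array element, repeating the parent columns."""
    for row in rows:
        parent = {k: v for k, v in row.items() if k != array_field}
        for element in row[array_field]:
            yield {**parent, **element}

orders = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": []},
]
flat = list(cross_join_unnest(orders, "items"))
```

Order 2 produces no rows because its array is empty, which matches cross join behavior; a `LEFT JOIN UNNEST` would keep such rows.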

107

MCQhard

A company uses BigQuery to store event data. They need to load data from multiple sources with different schemas and expect frequent schema changes. Which approach provides the most flexibility for schema evolution while minimizing load failures and performance impact?

A.Load data as JSON files in Cloud Storage and use external tables

B.Use the Storage Write API in buffered mode with schema auto-detection

C.Use legacy streaming inserts with schema auto-detect enabled

D.Use Dataflow to preprocess and write to BigQuery using Storage Write API in committed mode

AnswerB

Buffered mode allows schema updates and auto-detection, reducing failures and handling schema evolution well.

Why this answer

Using the Storage Write API with buffered mode allows schema auto-detection and flexible schema updates without failing loads, and provides better performance than legacy streaming inserts.

108

Multi-Selectmedium

A data engineer is designing a batch processing pipeline that runs daily. The pipeline reads CSV files from GCS, transforms them using Python, and writes the results to BigQuery. They need to parameterize the pipeline for different environments and run it on a schedule. Which TWO components should they use? (Choose 2)

Select 2 answers

A.Cloud Composer

B.Dataproc

C.Dataflow Flex Template

D.Storage Transfer Service

E.Cloud Functions

AnswersA, C

Cloud Composer schedules and orchestrates the pipeline.

Why this answer

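Cloud Composer (A) is correct because it is a managed Apache Airflow service that natively supports scheduling, parameterization, and orchestration of batch pipelines. It allows you to define DAGs that run daily, pass environment-specific parameters via Airflow variables or environment configurations, and trigger Python transforms or Dataflow jobs on a schedule. Dataflow Flex Template (C) is correct because it packages the Python pipeline as a reusable template that accepts environment-specific parameters at launch, so Cloud Composer can trigger the same template for each environment.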

Exam trap

The exam often tests the distinction between orchestration/scheduling services (Cloud Composer) and compute/processing services (Dataproc, Cloud Functions), leading candidates to mistakenly choose Dataproc for scheduling or Cloud Functions for batch processing.

109

MCQhard

A Dataflow pipeline reads from Pub/Sub, applies a keyed stateful ParDo that uses state variables to deduplicate events based on event ID, and writes to BigQuery. During a pipeline update, some events are duplicated in BigQuery. The state is not preserved across updates. Which configuration ensures exactly-once semantics during updates?

A.Drain the pipeline and start the updated pipeline; all in-flight data will be processed.

B.Cancel the pipeline and restart it; Pub/Sub subscriptions will be rewound.

C.Use the Storage Write API's exactly-once delivery mode.

D.Take a snapshot of the pipeline before updating, then start the new pipeline from the snapshot.

AnswerD

Snapshots preserve the state of the pipeline, including deduplication state, allowing the new pipeline to resume without reprocessing duplicates.

Why this answer

Draining stops the pipeline after it finishes processing in-flight data, and the updated pipeline then starts fresh. Because the deduplication state is lost, duplicates can still occur if the new pipeline receives events that were already written. To preserve state, take a snapshot before the update and start the new pipeline from that snapshot.

The Storage Write API's exactly-once semantics help at the sink but do not prevent duplicate processing if state is lost. As long as the deduplication state is recovered from the snapshot, duplicates are avoided.
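
The effect of losing deduplication state can be shown with a toy model. This sketch is plain Python, not Beam state APIs: the set `seen_ids` stands in for the keyed state, and passing it into the second run stands in for restoring from a snapshot.

```python
def run_pipeline(events, seen_ids=None):
    """Emit each event ID once; seen_ids plays the role of the keyed state."""
    seen_ids = set() if seen_ids is None else set(seen_ids)
    emitted = []
    for event_id in events:
        if event_id not in seen_ids:
            seen_ids.add(event_id)
            emitted.append(event_id)
    return emitted, seen_ids

# Before the update: e1 and e2 are written to the sink.
first_out, state = run_pipeline(["e1", "e2"])

# After the update, e2 is redelivered alongside a new event e3.
lost_state_out, _ = run_pipeline(["e2", "e3"])       # state discarded
restored_out, _ = run_pipeline(["e2", "e3"], state)  # state restored
```

With the state discarded, `e2` is emitted a second time; with the state restored, only `e3` is emitted.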

110

Multi-Selecthard

You are building a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline must handle late-arriving data and ensure that the windowing and triggering are correct. Which THREE configurations should you consider? (Choose 3)

Select 3 answers

A.Enable Dataflow Streaming Engine for exactly-once processing.

B.Use side inputs to enrich streaming data with static data.

C.Use the BigQuery Storage Write API with committed mode to ensure exactly-once writes.

D.Set an allowed lateness duration to handle late-arriving data.

E.Configure a triggering frequency to control how often results are emitted.

AnswersC, D, E

Why this answer

Option C is correct because the BigQuery Storage Write API with committed mode provides exactly-once write semantics, which is essential for ensuring that late-arriving data processed by the pipeline does not result in duplicate rows in BigQuery. This mode uses stream offsets to track writes, guaranteeing that each record is written exactly once even if the pipeline retries.

Exam trap

The exam often tests the misconception that Dataflow Streaming Engine alone provides exactly-once processing, but in reality it is the combination of source/sink semantics (like the Storage Write API) that ensures exactly-once, not the engine itself.
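
Allowed lateness (option D) is easier to reason about with numbers. The following is a simplified model in plain Python, not Beam code: timestamps are integer seconds, windows are fixed-size, and an element is treated as droppable once the watermark passes the end of its window plus the allowed lateness.

```python
def window_start(event_ts, size_s):
    """Start of the fixed window that contains event_ts (seconds)."""
    return event_ts - (event_ts % size_s)

def is_droppable(event_ts, watermark, size_s, allowed_lateness_s):
    """True when the event's window has expired past the allowed lateness."""
    window_end = window_start(event_ts, size_s) + size_s
    return watermark > window_end + allowed_lateness_s

# An event stamped t=65 belongs to the [60, 120) window.
start = window_start(65, 60)
dropped_at_150 = is_droppable(65, watermark=150, size_s=60, allowed_lateness_s=60)
dropped_at_200 = is_droppable(65, watermark=200, size_s=60, allowed_lateness_s=60)
```

With 60 seconds of allowed lateness the window stays open until the watermark passes 180, so the event is still accepted at watermark 150 and dropped at 200.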

111

MCQmedium

You are designing a streaming pipeline that ingests events from Pub/Sub, enriches them with a machine learning model, and writes the results to BigQuery. The ML model is deployed on Cloud Run and has a high latency (500ms per request). You need to minimize the impact of slow ML inference on the overall pipeline throughput. Which approach should you take?

A.Use Dataflow to write events to Pub/Sub, then use a separate Dataflow pipeline that batches calls to Cloud Run.

B.Increase the number of Dataflow workers to compensate for the latency.

C.Use Cloud Functions to call Cloud Run and write directly to BigQuery.

D.Use Dataflow's ParDo with synchronous calls to Cloud Run for each element.

AnswerA

Decoupling via Pub/Sub allows batching and async processing, improving throughput.

Why this answer

Option A is correct because it uses Dataflow to batch events before sending them to Cloud Run, which amortizes the 500ms per-request latency over multiple events, significantly increasing throughput. By writing events to Pub/Sub and then processing them in a separate Dataflow pipeline with batched calls, you decouple the ingestion from the inference and avoid blocking on each individual request.

Exam trap

The trap here is that candidates assume parallelism (more workers) or faster invocation methods (Cloud Functions) can overcome high per-request latency, when the real solution is to batch requests to reduce the number of round trips.

How to eliminate wrong answers

Option B is wrong because increasing the number of Dataflow workers does not reduce the per-element latency of synchronous calls; it only adds parallelism, which can lead to excessive concurrent calls to Cloud Run and potential throttling or cost spikes. Option C is wrong because Cloud Functions are not designed for high-throughput streaming pipelines and would still make synchronous calls to Cloud Run for each event, suffering the same latency bottleneck. Option D is wrong because using ParDo with synchronous calls per element means each element waits 500ms before the next element is processed, severely limiting throughput and not leveraging batching.
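
The throughput gain from batching can be estimated with a short sketch. It assumes, hypothetically, that the Cloud Run endpoint accepts a list of events per request and that the 500 ms latency does not grow with batch size; neither is stated in the question.

```python
def chunk(items, batch_size):
    """Split items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def elements_per_second(batch_size, latency_s=0.5):
    """Throughput of one sequential caller if latency is the same per request."""
    return batch_size / latency_s

events = list(range(120))
batches = chunk(events, 50)              # three requests instead of 120
per_element = elements_per_second(1)     # one event per request
batched = elements_per_second(50)        # fifty events per request
```

Under these assumptions a single caller goes from 2 to 100 events per second, without adding workers.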

112

MCQmedium

A data engineer needs to create a Dataflow pipeline that reads from Pub/Sub, applies a Python transformation, and writes to BigQuery. The pipeline should be reusable across environments with different parameters. Which deployment method is most appropriate?

A.Classic Template

B.Flex Template

C.Direct pipeline submission with gcloud dataflow jobs run

D.Cloud Composer to trigger Dataflow jobs

AnswerB

Flex Templates support any SDK (including Python) and allow runtime parameters.

Why this answer

Flex Templates (Option B) are the most appropriate deployment method because they package your Python transformation code and dependencies in a custom Docker image, making the pipeline reusable across environments with different runtime parameters. Unlike Classic Templates, whose job graph is fixed when the template is staged, Flex Templates build the pipeline at launch and can be parameterized at runtime via the Dataflow UI or API, which is essential for a multi-environment deployment strategy.

Exam trap

The trap here is that candidates often confuse Classic Templates with Flex Templates and assume both offer the same runtime flexibility. Classic Templates fix the job graph at staging time and accept runtime parameters only through `ValueProvider`-aware options, so Flex Templates are the better fit for custom, reusable pipelines.

How to eliminate wrong answers

Option A is wrong because Classic Templates serialize the job graph when the template is staged and expose runtime parameters only through `ValueProvider`, which makes them harder to parameterize for different environments. Option C is wrong because direct pipeline submission with gcloud dataflow jobs run does not provide a reusable, parameterized template mechanism; each submission requires the full pipeline code and configuration, making it unsuitable for repeated deployment across environments. Option D is wrong because Cloud Composer is an orchestration tool for scheduling and monitoring workflows, not a deployment method for creating reusable, parameterized Dataflow templates; it can trigger Dataflow jobs but does not solve the need for a template that can be reused with different parameters.

113

MCQhard

A company uses Kafka on Dataproc to ingest streaming data. They want to process the data with Spark Structured Streaming and write results to BigQuery. The team is using Dataproc clusters. Which approach minimizes cost while maintaining performance?

A.Use a Dataproc cluster with all preemptible VMs

B.Use a single-node Dataproc cluster

C.Use a Dataproc cluster with standard master nodes and preemptible worker nodes

D.Use a Dataproc cluster with standard nodes and enable autoscaling

AnswerC

Workers can be preemptible; master should be standard for stability.

Why this answer

Preemptible VMs are cost-effective for worker nodes; master nodes should be standard for reliability.

114

Multi-Selectmedium

A company needs to stream real-time user activity data from their application into BigQuery for immediate dashboarding. They want to minimize latency (under 5 seconds) and ensure exactly-once delivery. Which TWO options should they consider? (Choose 2)

Select 2 answers

A.Use Cloud Functions to receive events and call the BigQuery REST API

B.Use BigQuery Storage Write API in committed mode

C.Use BigQuery legacy streaming inserts directly from the application

D.Use Apache Kafka on Dataproc and write to BigQuery via the BigQuery Kafka connector

E.Stream data to Pub/Sub, then use Dataflow to write to BigQuery with exactly-once guarantees

AnswersB, E

Committed mode provides exactly-once semantics and low latency (sub-second).

Why this answer

Option B is correct because the BigQuery Storage Write API in committed mode provides exactly-once delivery semantics and low-latency streaming (typically under 5 seconds) directly into BigQuery. It is designed for real-time ingestion with strong consistency guarantees, making it ideal for immediate dashboarding.

Exam trap

The exam often tests the distinction between legacy streaming inserts (at-least-once) and the Storage Write API (exactly-once), and candidates mistakenly choose legacy inserts because they are simpler to implement, ignoring the exactly-once requirement.

115

MCQhard

A Dataflow streaming pipeline is experiencing high latency and frequent OOM errors when processing variable-sized JSON messages from Pub/Sub. The team suspects that the autoscaling is not effective. Which feature should they enable to improve resource utilization?

A.Horizontal autoscaling

B.Dataflow Prime

C.FlexRS

D.Streaming Engine

AnswerB

Dataflow Prime offers vertical scaling and right-fitting, which helps with variable-sized messages and OOM errors.

Why this answer

Dataflow Prime is the correct choice because it provides intelligent resource management that automatically adjusts worker resources (CPU, memory) based on the pipeline's processing demands, which is critical for variable-sized JSON messages. It addresses both high latency and OOM errors by optimizing resource utilization beyond horizontal autoscaling, using vertical autoscaling and right fitting to handle spikes in message size without manual tuning.

Exam trap

The exam often tests the misconception that Streaming Engine solves all streaming performance issues, but it specifically addresses shuffle and state persistence, not worker memory management for variable payloads.

How to eliminate wrong answers

Option A is wrong because Horizontal autoscaling is a basic feature already enabled by default in Dataflow; it only scales the number of workers horizontally and does not address memory inefficiencies or OOM errors caused by variable-sized messages. Option C is wrong because FlexRS is designed for batch pipelines with flexible scheduling to reduce costs, not for streaming pipelines requiring low latency and real-time processing. Option D is wrong because Streaming Engine offloads shuffle and state storage to backend services to reduce disk I/O and checkpoint latency, but it does not directly manage per-worker memory allocation or prevent OOM errors from variable-sized payloads.

116

MCQeasy

A data engineer needs to transfer 500 TB of on-premises data to Google Cloud Storage. The data is stored on NAS devices and the network bandwidth is limited to 100 Mbps. What is the most cost-effective and timely transfer method?

A.Use Storage Transfer Service over the internet

B.Use a VPN connection and rsync

C.Use gsutil cp in parallel

D.Use Transfer Appliance

AnswerD

Transfer Appliance is designed for offline petabyte-scale transfers, avoiding bandwidth limitations.

Why this answer

At 100 Mbps, transferring 500 TB over the network would take roughly 460 days even at full link utilization (500 TB × 8 bits per byte ÷ 100 Mbps ≈ 40 million seconds). Transfer Appliance is designed for petabyte-scale offline transfer, shipping a physical appliance to your data center. Other options are not feasible due to bandwidth constraints.

117

MCQmedium

A data engineer is using Apache Spark on Dataproc to process a large dataset. They need to perform complex aggregation and transformation with high performance. The dataset has a known schema and they want to take advantage of Catalyst optimizer. Which Spark API should they use?

A.Spark SQL only

B.DataFrames

C.Datasets

D.RDDs

AnswerB

DataFrames have Catalyst optimizer, which improves performance for complex transformations.

Why this answer

DataFrames provide high-level API with Catalyst optimizer for performance, making them ideal for complex aggregations and transformations on structured data.

118

MCQmedium

A data engineer needs to create a Dataflow pipeline template that can be reused across multiple environments (dev, staging, prod) with different parameters (e.g., input Pub/Sub topic, output BigQuery table). Which template type should they use?

A.Dataflow Prime

B.Flex Template

C.Classic Template

D.Cloud Composer workflow template

AnswerB

Flex Templates support custom Docker images and runtime parameters, making them suitable for multi-environment reuse.

Why this answer

Flex Templates (B) are the correct choice because they package a Dataflow pipeline as a Docker image, allowing environment-specific parameters (e.g., Pub/Sub topic, BigQuery table) to be passed at runtime via the --parameters flag. This enables true reusability across dev, staging, and prod without modifying the template code, unlike Classic Templates which require compile-time parameterization.

Exam trap

The exam often tests the distinction between Classic Templates (compile-time parameterization) and Flex Templates (runtime parameterization), trapping candidates who assume all templates support the same level of parameter flexibility.

How to eliminate wrong answers

Option A is wrong because Dataflow Prime is a managed service for optimizing resource utilization and autoscaling, not a template type for parameterized reuse. Option C is wrong because Classic Templates require parameters to be baked in at staging time, making them less flexible for multi-environment reuse without rebuilding the template. Option D is wrong because Cloud Composer is an Apache Airflow orchestration service used to schedule and monitor workflows, not a Dataflow template type for parameterized pipeline reuse.

119

MCQeasy

A data engineer is building a Dataflow pipeline that reads from BigQuery, transforms data using Apache Beam, and writes results to Cloud Storage in Avro format. They need to ensure the pipeline can be easily redeployed with different parameters without modifying code. Which deployment method should they use?

A.Dataflow Flex Templates

B.Direct deployment using the gcloud command with parameters

C.Dataflow Classic Templates

D.Deploy as a Cloud Function triggered by Cloud Scheduler

AnswerA

Flex Templates use Docker images and support arbitrary pipeline options, including custom parameters.

Why this answer

Dataflow Flex Templates allow you to package a pipeline as a Docker image and pass runtime parameters, enabling parameterized deployments without code changes.

120

Multi-Selecthard

A company uses Pub/Sub to ingest IoT sensor data and wants to process it with a Dataflow pipeline that uses fixed windows of 1 minute to compute average temperature. The pipeline also needs to handle malformed messages by routing them to a dead letter queue. Which TWO configurations should the engineer implement? (Choose TWO.)

Select 2 answers

A.Configure the pipeline to ignore malformed messages using `withCoder()`

B.Add a `GroupByKey` transform to deduplicate messages based on a unique ID

C.Enable exactly-once processing for the Pub/Sub subscription used by the pipeline

D.Set up a Cloud Function to reprocess messages from the dead letter topic

E.Use a dead letter sink (e.g., another Pub/Sub topic) for messages that exceed the retry limit

AnswersC, E

Exactly-once processing ensures each message is processed once, preventing duplicates and gaps.

Why this answer

To handle malformed messages, a dead letter sink (e.g., Pub/Sub topic or BigQuery table) should be used to capture messages that fail processing after retries. The Dataflow pipeline should use the Apache Beam `PubsubIO` source and apply windowing. Enabling exactly-once processing for the Pub/Sub subscription ensures that messages are not lost or duplicated during failures.

Option C is correct because exactly-once processing ensures data integrity for streaming pipelines. Option E is correct because dead letter sinks capture failed messages. Option A is not recommended as it discards data.

Option B is irrelevant as deduplication is not needed with exactly-once. Option D is not necessary for this scenario.
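
The 1-minute fixed-window average described in the question can be sketched without Beam. This toy version assumes readings arrive as hypothetical `(timestamp_seconds, temperature)` pairs that have already passed validation, and it keys each result by the start of its window.

```python
from collections import defaultdict

def average_per_minute(readings):
    """Average temperature per 1-minute fixed window, keyed by window start."""
    buckets = defaultdict(list)
    for ts, temp in readings:
        buckets[ts - ts % 60].append(temp)
    return {start: sum(temps) / len(temps) for start, temps in buckets.items()}

readings = [(0, 20.0), (30, 22.0), (61, 25.0)]
averages = average_per_minute(readings)
```

The first two readings share the window starting at 0, and the third falls into the window starting at 60.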

121

MCQhard

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. Some incoming messages are malformed and fail to parse. How should you handle these messages to ensure the pipeline continues processing without data loss?

A. Configure Pub/Sub to retry indefinitely until the message is processed

B. Use a try-catch block in the pipeline and ignore malformed messages

C. Write malformed messages to a dead-letter sink (e.g., Pub/Sub topic or GCS) and continue processing

D. Set the pipeline to fail and alert the team via Cloud Monitoring

Answer: C

Why this answer

The recommended pattern is to use a dead-letter queue (e.g., a separate Pub/Sub topic or a GCS bucket) to store failed messages after a retry threshold is reached. This preserves messages for later analysis without blocking the main pipeline.
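
The dead-letter pattern can be sketched in plain Python as follows. This is an illustration of the routing logic only, not Beam or Pub/Sub code; `parse_message`, `route_messages`, and the requirement that a valid message is a JSON object are assumptions for the example. In a real pipeline the two output lists would correspond to the main output and the dead-letter sink.

```python
import json


def parse_message(raw: bytes) -> dict:
    """Parse one message; raise ValueError if it is malformed.

    Assumes (hypothetically) that a valid message is a UTF-8 JSON object.
    """
    record = json.loads(raw.decode("utf-8"))
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    return record


def route_messages(raw_messages):
    """Split messages into parsed records and dead-letter entries."""
    parsed, dead_letters = [], []
    for raw in raw_messages:
        try:
            parsed.append(parse_message(raw))
        except ValueError as err:
            # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
            # Keep the original payload and the reason so it can be analysed later.
            dead_letters.append({"payload": raw, "error": str(err)})
    return parsed, dead_letters
```

Because the failure is caught per message, one bad payload does not stop the remaining messages from being processed, and nothing is dropped.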

122

MCQ · easy

A data engineer needs to schedule recurring nightly loads from Amazon S3 to Google Cloud Storage. The data is in CSV format and the volume is approximately 500 GB per night. Which Google Cloud service should they use?

A. Transfer Appliance

B. Storage Transfer Service

C. BigQuery Data Transfer Service

D. Datastream

Answer: B

Storage Transfer Service is designed for online transfers between storage systems including S3 to GCS.

Why this answer

The Storage Transfer Service is designed for online data transfers from external cloud providers like Amazon S3 to Google Cloud Storage. It supports scheduling recurring nightly transfers, handles large volumes (500 GB/night), and automatically retries failed transfers, making it the correct choice for this use case.

Exam trap

The exam often tests the distinction between services that transfer data to Cloud Storage (Storage Transfer Service) and services that load data directly into BigQuery (BigQuery Data Transfer Service), causing candidates to confuse the destination.

How to eliminate wrong answers

Option A is wrong because Transfer Appliance is a physical device for offline data transfer, used for very large datasets (hundreds of TB to PB) where network transfer is impractical, not for recurring nightly loads. Option C is wrong because BigQuery Data Transfer Service loads data into BigQuery tables from sources such as Google Ads or Amazon S3; it does not transfer files to Cloud Storage. Option D is wrong because Datastream is for real-time change data capture (CDC) from databases like MySQL or PostgreSQL to BigQuery or Cloud Storage, not for batch CSV file transfers from S3.

123

MCQ · hard

A company is running a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. They notice that the number of workers is not scaling up to handle increased throughput, causing latency spikes. The pipeline uses a GlobalWindow with default triggering. What is the most likely cause of the under-scaling?

A. The pipeline includes a GroupByKey that creates a hot key, limiting parallelism

B. The Pub/Sub subscription has a large backlog, but Dataflow automatically scales to handle it

C. The pipeline uses the default worker machine type, which is too small

D. The pipeline is using legacy streaming inserts instead of the Storage Write API

Answer: A

Hot keys prevent splitting the work across workers, causing underutilization and scaling issues.

Why this answer

Dataflow's autoscaling is based on CPU utilization and throughput. If the pipeline uses a GroupByKey with hot keys, all values for a hot key are processed by a single worker, so parallelism is limited and adding workers does not relieve the bottleneck.
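
The effect of a hot key can be illustrated with a simplified model in which each key is hashed to one worker. This is only a sketch of key-based partitioning, not Dataflow's actual shuffle implementation; the function names and the sample key names are hypothetical.

```python
from collections import Counter
import zlib


def records_per_worker(keys, num_workers):
    """Count how many records land on each worker when partitioning by key hash."""
    load = Counter()
    for key in keys:
        # crc32 gives a hash that is stable across runs (unlike built-in hash for str).
        worker = zlib.crc32(key.encode("utf-8")) % num_workers
        load[worker] += 1
    return load


def max_load_share(keys, num_workers):
    """Fraction of all records handled by the busiest worker."""
    load = records_per_worker(keys, num_workers)
    return max(load.values()) / len(keys)


# 90% of the records share one key, so one worker receives at least 90% of the
# work regardless of how many workers exist.
skewed_keys = ["sensor-hot"] * 900 + [f"sensor-{i}" for i in range(100)]
```

With `skewed_keys`, `max_load_share` stays at or above 0.9 whether there are 4 workers or 64, which is why scaling out does not reduce latency for a hot key.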

124

Multi-Select · medium

A data engineer needs to perform a one-time migration of 10 TB of data from on-premises Hadoop HDFS to Cloud Storage. The network link is 1 Gbps. Which TWO services or tools should they consider? (Choose 2)

Select 2 answers

A. Dataproc with DistCp

B. Cloud Storage Transfer Service

C. BigQuery Data Transfer Service

D. gsutil rsync with parallel composite uploads

E. Transfer Appliance

Answers: B, E

Can transfer from HDFS via a Hadoop URL, suitable for this volume over a 1 Gbps link.

Why this answer

Storage Transfer Service can transfer data from HDFS (via an intermediary), and Transfer Appliance is also feasible for large volumes, especially if bandwidth is limited.
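
A quick back-of-the-envelope calculation helps judge whether the 1 Gbps link is a constraint. The sketch below assumes decimal units (1 TB = 10^12 bytes) and introduces a hypothetical `utilization` parameter for the share of the link actually available to the transfer.

```python
def transfer_hours(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Estimate transfer time in hours for data_tb terabytes over a link_gbps link.

    utilization is the assumed fraction of the link available (1.0 = full line rate).
    """
    bits = data_tb * 1e12 * 8                     # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600


full_rate = transfer_hours(10, 1)        # about 22.2 hours
half_rate = transfer_hours(10, 1, 0.5)   # about 44.4 hours
```

At full line rate, 10 TB takes roughly 22 hours, and roughly 44 hours if only half the link is available. An online transfer is therefore workable for this volume, and the appliance becomes more attractive mainly when the available bandwidth is much lower.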

125

Multi-Select · hard

A company uses Workflows to orchestrate a multi-step data pipeline. One step calls an HTTP endpoint that may take up to 10 minutes, but the default Workflows timeout is too short. They also need to handle transient errors with retries. Which TWO configurations should they apply? (Choose 2)

Select 2 answers

A. Set a step timeout of 600 seconds for the HTTP call step

B. Configure a dead letter queue for failed steps

C. Use the default retry policy on the step

D. Set the workflow execution timeout to 600 seconds

E. Add a retry policy on the step with appropriate conditions for transient errors

Answers: A, E

Setting a 600-second timeout on the HTTP call step extends the limit for that specific step to 10 minutes.

Why this answer

To extend the timeout, set a step timeout of 600 seconds (10 minutes). To handle transient errors, use a retry policy with appropriate conditions. Setting the entire workflow timeout to 10 minutes is not necessary if individual step timeouts are set.

The default retry policy does not cover all transient errors. Adding a dead letter queue is for event-driven patterns, not Workflows.
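
Workflows expresses retries declaratively in its own workflow definition syntax. The Python below is only a sketch of the logic such a policy describes: a predicate for which errors count as transient, a retry limit, and exponential backoff. `TransientError`, `call_with_retry`, and all parameter names and default values are hypothetical.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. HTTP 429 or 503)."""


def call_with_retry(call, max_retries=3, initial_delay_s=1.0,
                    multiplier=2.0, max_delay_s=30.0, sleep=time.sleep):
    """Run call(), retrying only on TransientError with exponential backoff."""
    delay = initial_delay_s
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(delay)
            delay = min(delay * multiplier, max_delay_s)
```

Errors that are not `TransientError` propagate immediately, which mirrors the idea of retrying only under conditions that indicate a transient failure.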

126

MCQ · medium

A company needs to run a Spark ML training job on a Dataproc cluster with high memory per node, but the cluster should automatically scale down when idle to save costs. Which configuration should they use?

A. Use a single-node cluster with preemptible VMs

B. Enable Dataproc's default autoscaling with primary workers as preemptible

C. Create a cluster with custom machine types and no autoscaling

D. Use a Dataproc cluster with preemptible secondary workers and cluster autoscaling

Answer: D

Why this answer

Option D is correct because it combines preemptible secondary workers for cost-effective high-memory compute with cluster autoscaling, which automatically scales down the cluster when idle. Preemptible VMs are ideal for stateless Spark ML training tasks, and autoscaling ensures the cluster shrinks to save costs during inactivity. This configuration meets the requirement of high memory per node (via primary workers) while minimizing costs through idle scaling.

Exam trap

The exam often tests the misconception that preemptible VMs can be used as primary workers or that autoscaling works with preemptible primary workers, but in Dataproc, preemptible VMs are restricted to secondary workers to maintain cluster stability.

How to eliminate wrong answers

Option A is wrong because a single-node cluster cannot provide high memory per node in a distributed sense and preemptible VMs on a single node risk job failure if the VM is reclaimed; also, there is no autoscaling. Option B is wrong because Dataproc's default autoscaling does not support primary workers as preemptible—preemptible VMs are only allowed as secondary workers, and using them as primary would cause instability. Option C is wrong because custom machine types without autoscaling do not automatically scale down when idle, leading to unnecessary costs.
