PDE Ingesting and Processing the Data — All Questions With Answers

Question 1 (easy, multiple choice)

A data engineer needs to load 10 TB of CSV files from Amazon S3 into Google BigQuery on a daily basis. Which service should they use to automate this transfer?

Question 2 (easy, multiple choice)

You need to stream real-time user click events from your application into BigQuery for immediate analysis. The events must be available for query within seconds. Which approach is recommended?

Question 3 (easy, multiple choice)

Your company is migrating an on-premises Hadoop cluster to Google Cloud. You need to transform large datasets using Spark SQL. Which Google Cloud service should you use?

Question 4 (easy, multiple choice)

A data engineer needs to transfer 500 TB of on-premises data to Google Cloud Storage. The data is stored on NAS devices and the network bandwidth is limited to 100 Mbps. What is the most cost-effective and timely transfer method?
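
The scale of this scenario is easier to judge with a quick estimate. The sketch below is plain arithmetic, assuming decimal units (1 TB = 10^12 bytes) and a link that runs at full line rate with no protocol overhead; `transfer_days` and its `efficiency` parameter are illustrative names, not part of any Google Cloud tool.

```python
TB = 10**12    # decimal terabyte, in bytes (assumption: decimal units)
MBPS = 10**6   # one megabit per second, in bits per second

def transfer_days(size_bytes, link_bps, efficiency=1.0):
    """Days needed to move size_bytes over a link sustained at link_bps * efficiency."""
    seconds = size_bytes * 8 / (link_bps * efficiency)
    return seconds / 86_400  # seconds per day

days = transfer_days(500 * TB, 100 * MBPS)
print(f"{days:.0f} days at full line rate")
```

Under these assumptions the online transfer takes well over a year, which is the constraint the question asks you to weigh.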

Question 5 (medium, multiple choice)

You are building a Dataflow pipeline in Python that reads messages from Pub/Sub, enriches them with data from a BigQuery table, and writes the results to BigQuery. The enrichment lookup table is large and changes infrequently. Which approach minimizes cost and latency?

Question 6 (medium, multiple choice)

You are designing a Dataflow pipeline to process streaming data. The pipeline may encounter malformed records. You need to handle these errors without failing the entire pipeline and store the bad records for later analysis. What is the best practice?
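
The requirement described here (keep processing valid records, set the bad ones aside with enough context to inspect later) can be illustrated without any pipeline framework. This is a minimal plain-Python sketch, not Dataflow or Apache Beam code; `split_records` and the shape of the stored error record are hypothetical.

```python
import json

def split_records(raw_messages):
    """Parse each message; route failures to a separate list instead of raising."""
    good, dead_letter = [], []
    for raw in raw_messages:
        try:
            good.append(json.loads(raw))
        except json.JSONDecodeError as err:
            # Keep the original payload and the reason so it can be analyzed later.
            dead_letter.append({"raw": raw, "error": str(err)})
    return good, dead_letter

good, dead_letter = split_records(['{"id": 1}', 'not json', '{"id": 2}'])
print(len(good), "parsed,", len(dead_letter), "set aside")
```

In a real pipeline the second list would be a separate output written to durable storage rather than an in-memory list.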

Question 7 (medium, multiple choice)

Your company uses Kafka for event streaming. You want to run Kafka on Google Cloud with the ability to auto-scale clusters and use managed infrastructure. Which service should you choose?

Question 8 (medium, multiple choice)

You need to perform a one-time migration of historical data from an on-premises Teradata data warehouse to BigQuery. The data volume is 50 TB and you have a high-speed network connection (10 Gbps). What is the most efficient way to load the data?

Question 9 (medium, multiple choice)

You have a Dataflow pipeline that processes streaming data with high throughput. You notice that the pipeline is experiencing high latency and the workers are underutilized. Which Dataflow feature can automatically optimize resource allocation?

Question 10 (medium, multiple choice)

Your organization uses dbt (data build tool) for transformations on BigQuery. You need to run dbt models on a schedule and manage versions. Which Google Cloud service can execute dbt jobs in a serverless manner?

Question 11 (hard, multiple choice)

You are migrating an on-premises PostgreSQL database to Cloud SQL. You need to continuously replicate changes to BigQuery for real-time analytics with minimal latency. Which service should you use?

Question 12 (hard, multiple choice)

You are designing a Dataflow pipeline that must process events from Pub/Sub exactly once and write them to BigQuery using the Storage Write API. The pipeline may restart and could reprocess some messages. What setting ensures exactly-once semantics for the output?

Question 13 (hard, multiple choice)

You need to process a large volume of event data from Cloud Storage, apply complex transformations using Apache Spark, and then load the results into BigQuery. The data arrives in batches every hour. You want to minimize costs by using preemptible VMs. Which service should you use?

Question 14 (medium, multi-select)

Which TWO statements are true about BigQuery Data Transfer Service? (Choose 2)

Question 15 (hard, multi-select)

You are building a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline must handle late-arriving data and ensure that the windowing and triggering are correct. Which THREE configurations should you consider? (Choose 3)

Question 16 (easy, multiple choice)

An organization wants to ingest on-premises Oracle database changes into BigQuery for real-time analytics with minimal latency. The Oracle database is version 19c and has a high transactional volume. Which Google Cloud service should they use?

Question 17 (easy, multiple choice)

A data engineer needs to schedule recurring nightly loads from Amazon S3 to Google Cloud Storage. The data is in CSV format and the volume is approximately 500 GB per night. Which Google Cloud service should they use?

Question 18 (medium, multiple choice)

A company runs a Dataflow pipeline that reads from Pub/Sub, transforms data, and writes to BigQuery. The pipeline uses classic templates and is deployed in batch mode. They notice that the pipeline does not scale well under high load, causing a backlog in Pub/Sub. Which improvement would BEST address the scaling issue?

Question 19 (medium, multiple choice)

A company needs to load data from a MySQL database into BigQuery daily. The data volume is 10 GB per day and the schema changes occasionally. They want to minimize costs and operational overhead. What is the MOST appropriate approach?

Question 20 (medium, multiple choice)

A media company streams real-time viewer data from Pub/Sub to BigQuery using a Dataflow pipeline. They need to handle occasional malformed messages without losing valid data. Which pattern should they implement?

Question 21 (medium, multiple choice)

An organization needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Google Cloud Storage. The network bandwidth is limited to 100 Mbps. Which transfer method is MOST cost-effective and time-efficient?

Question 22 (hard, multiple choice)

A data engineer is designing a Dataflow pipeline in Python that reads from Pub/Sub, applies complex transformations using external libraries, and writes to BigQuery. The pipeline must be deployed as a reusable, version-controlled template that can be easily updated without re-uploading the pipeline code each time. Which approach should they use?

Question 23 (hard, multiple choice)

A financial services company needs to ingest real-time trade data from multiple sources into BigQuery for immediate fraud detection. The data volume is high (1 million messages per second) and each message must be available for queries within seconds. They are considering the Storage Write API. Which stream mode should they choose to balance data availability and cost?

Question 24 (hard, multiple choice)

A team uses dbt on BigQuery to transform data in their data warehouse. They have a large table with nested and repeated fields (arrays and structs). The transformation needs to normalize this data into a star schema. Which dbt feature and BigQuery SQL feature should they use together?

Question 25 (easy, multiple choice)

A company wants to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which event-driven solution should they use?

Question 26 (medium, multiple choice)

A data engineer is using Apache Spark on Dataproc to process a large dataset. They need to perform complex aggregations and transformations with high performance. The dataset has a known schema, and they want to take advantage of the Catalyst optimizer. Which Spark API should they use?

Question 27 (medium, multiple choice)

A company uses Workflows to orchestrate a series of Google Cloud services for data processing. They need to call an external HTTP API as part of the workflow and handle potential failures with retries. Which Workflows feature should they use?
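
"Handle potential failures with retries" usually means retrying a failed call a bounded number of times with growing delays. The sketch below shows that behaviour in plain Python; it is not Workflows syntax, and `call_with_retries`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions made for illustration.

```python
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on ConnectionError wait base_delay * 2**(attempt - 1) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))

# Demo: a hypothetical endpoint that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_call, sleep=delays.append)
print(result, delays)
```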

Question 28 (medium, multi-select)

A data engineering team needs to ingest streaming data from an existing Kafka cluster (on-premises) into Google Cloud for real-time analytics. They want to minimize changes to the existing Kafka setup and avoid long-term operational overhead. Which TWO approaches should they consider?

Question 29 (hard, multi-select)

A large enterprise is migrating its data warehouse from Teradata to BigQuery. They need to transfer historical data (100 TB) and set up ongoing daily incremental loads. They also need to transform the data using dbt. Which THREE Google Cloud services should they use?

Question 30 (easy, multi-select)

A data engineer needs to load data from CSV files in Cloud Storage into BigQuery. The CSV files have a header row and some columns contain nested JSON strings. Which TWO methods can they use to load this data into BigQuery?

Question 31 (medium, multiple choice)

You are building a streaming pipeline to ingest real-time clickstream data from a website into BigQuery for immediate analysis. The data must be available in BigQuery within seconds and you need to handle late-arriving data (e.g., browser offline events) that may arrive hours later. Which approach should you use?

Question 32 (easy, multiple choice)

A company wants to transfer 500 TB of data from an on-premises Hadoop cluster to Google Cloud Storage (GCS) for processing with Dataproc. The on-premises network has a 1 Gbps dedicated link to Google Cloud. The data must be transferred as quickly as possible, minimizing network usage. Which transfer method should they use?

Question 33 (hard, multiple choice)

Your team is processing a large dataset with Apache Beam on Dataflow. The pipeline sometimes fails due to transient errors when writing to a BigQuery sink. You need to ensure that failed records are not lost and can be reprocessed later without blocking the pipeline. What is the best approach?

Question 34 (medium, multiple choice)

You are designing a near-real-time CDC pipeline to replicate changes from an on-premises PostgreSQL database to BigQuery for analytics. The source database has high transaction volume and you must ensure minimal impact on the source. Which Google Cloud service should you use to ingest the change data?

Question 35 (easy, multiple choice)

You are loading 10 GB of daily CSV files from a GCS bucket into a BigQuery table. The files contain some malformed rows that you want to skip. Which BigQuery load configuration should you use?

Question 36 (hard, multiple choice)

Your Dataflow pipeline reads from Pub/Sub, performs transformations, and writes to BigQuery. You notice that the pipeline's autoscaling is not keeping up with sudden spikes in traffic, causing increased lag. The pipeline uses Classic Templates. Which change would most effectively improve autoscaling responsiveness?

Question 37 (medium, multiple choice)

You are migrating an existing Kafka cluster to Google Cloud using Dataproc. The cluster handles high-throughput streaming data with strict ordering requirements per partition. Which choice of Dataproc configuration is most appropriate?

Question 38 (medium, multiple choice)

Your team uses dbt to transform data in BigQuery. You need to schedule dbt runs to refresh materialized tables and views every hour. The transformations include both full refreshes and incremental models. What is the most efficient way to orchestrate these dbt runs on Google Cloud?

Question 39 (easy, multiple choice)

You need to ingest Google Ads performance data into BigQuery on a daily basis for reporting. Which service should you use?

Question 40 (medium, multiple choice)

You are designing a streaming pipeline that ingests events from Pub/Sub, enriches them with a machine learning model, and writes the results to BigQuery. The ML model is deployed on Cloud Run and has a high latency (500ms per request). You need to minimize the impact of slow ML inference on the overall pipeline throughput. Which approach should you take?

Question 41 (hard, multiple choice)

Your team has a Dataflow pipeline that reads from BigQuery, transforms data, and writes to GCS. The pipeline is failing with 'Out of Memory' errors on the worker nodes. The input data is large but fits within the total cluster memory. Which configuration change is most likely to resolve the issue without increasing costs significantly?

Question 42 (easy, multiple choice)

You need to react to changes in a GCS bucket (e.g., new object creation) and trigger a Cloud Run service to process the new file. Which Google Cloud service should you use to route the event?

Question 43 (medium, multi-select)

You need to ingest streaming data from a custom application into BigQuery with exactly-once semantics and low latency. The data volume is up to 10 MB/s. Which TWO services should you combine?

Question 44 (hard, multi-select)

Your company has a Dataproc cluster that runs Spark jobs. You need to choose between RDDs, DataFrames, and Datasets for a new job that performs complex aggregations on structured data. Which TWO statements are correct regarding performance and ease of use?

Question 45 (medium, multi-select)

You are building a BigQuery table that contains nested and repeated fields (e.g., order with line items). You need to write a query that counts the number of line items per order. Which TWO SQL functions/techniques can you use?

Question 46 (medium, multiple choice)

A data engineer needs to transfer 5 PB of historical data from an on-premises Hadoop cluster to Cloud Storage. The network bandwidth is limited to 1 Gbps, and the transfer must complete within 30 days. Which transfer method should they use?
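
A deadline question like this can be turned around: how much sustained bandwidth would the deadline require? The sketch below is plain arithmetic, assuming decimal units (1 PB = 10^15 bytes) and no protocol overhead; `required_gbps` is an illustrative helper name.

```python
PB = 10**15    # decimal petabyte, in bytes (assumption: decimal units)
GBPS = 10**9   # one gigabit per second, in bits per second

def required_gbps(size_bytes, deadline_days):
    """Sustained link rate, in Gbps, needed to move size_bytes within deadline_days."""
    return size_bytes * 8 / (deadline_days * 86_400) / GBPS

needed = required_gbps(5 * PB, 30)
print(f"{needed:.1f} Gbps needed, 1 Gbps available")
```

Under these assumptions the deadline needs roughly fifteen times the available link, so the network alone cannot meet it.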

Question 47 (easy, multiple choice)

A company wants to stream real-time user click events from their web application into BigQuery for immediate analysis. Which combination of services is the most scalable and cost-effective for this use case?

Question 48 (hard, multiple choice)

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. Some incoming messages are malformed and fail to parse. How should you handle these messages to ensure the pipeline continues processing without data loss?

Question 49 (medium, multiple choice)

A data engineer needs to load 2 TB of Avro files stored in Cloud Storage into BigQuery on a daily schedule. The schema is static and the data should overwrite the existing table each day. What is the most efficient way to accomplish this?

Question 50 (easy, multiple choice)

Which Google Cloud service is designed to replicate data from MySQL, PostgreSQL, and Oracle databases to BigQuery or Cloud Storage in near real-time?

Question 51 (medium, multiple choice)

A company needs to run a Spark ML training job on a Dataproc cluster with high memory per node, but the cluster should automatically scale down when idle to save costs. Which configuration should they use?

Question 52 (hard, multiple choice)

You are migrating an on-premises Kafka cluster to Google Cloud. The cluster has 50 topics with a total throughput of 200 MB/s. You want to minimize operational overhead. Which approach is the most cost-effective?

Question 53 (medium, multiple choice)

A data engineer is creating a Dataflow Flex Template for a batch pipeline that reads from BigQuery and writes to Cloud Storage. They need to pass a runtime parameter for the output bucket. How should they define this parameter?

Question 54 (easy, multiple choice)

Which BigQuery feature allows you to write data with exactly-once semantics, high throughput, and the ability to buffer data before making it available for queries?

Question 55 (medium, multiple choice)

A company wants to transform data using dbt (data build tool) on BigQuery. They have a CI/CD pipeline and need to version-control their transformations. Which setup is recommended?

Question 56 (hard, multiple choice)

You are using Dataproc to run a Spark job that reads data from Cloud Storage, performs aggregations, and writes results back to Cloud Storage. The job is failing with out-of-memory errors on the shuffle. Which optimization should you apply?

Question 57 (medium, multiple choice)

An organization needs to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which service should they use to set up this event-driven architecture?

Question 58 (medium, multi-select)

A data engineer needs to schedule a nightly transfer of data from an Amazon S3 bucket to Cloud Storage. Which two steps are required to achieve this? (Choose TWO.)

Question 59 (medium, multi-select)

Which three of the following are valid BigQuery data loading methods? (Choose THREE.)

Question 60 (hard, multi-select)

A company is designing a real-time analytics pipeline using Pub/Sub and Dataflow. They need to ensure exactly-once processing and handle late-arriving data. Which two configurations should they implement? (Choose TWO.)

Question 61 (medium, multiple choice)

A data engineer needs to ingest daily Salesforce reports into BigQuery without writing custom code. The reports are exported to an Amazon S3 bucket on a schedule. Which service should they use to automate the transfer?

Question 62 (easy, multiple choice)

A company wants to migrate 500 TB of on-premises archival data to Cloud Storage. The data is stored on a SAN and the network link is limited to 1 Gbps. The migration must complete within 10 days. What is the MOST cost-effective approach?

Question 63 (hard, multiple choice)

A streaming pipeline ingests events from Pub/Sub, enriches them via a slow REST API call, and writes the result to BigQuery. The API has a limit of 10 requests per second per client. The pipeline processes 1000 messages per second. Which approach minimizes latency while respecting API limits?
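
The gap between the two rates in this scenario is worth quantifying. The sketch below only computes the ratio; `min_messages_per_request` is an illustrative name, and whether the API accepts more than one message per request is an assumption the question does not state.

```python
import math

def min_messages_per_request(messages_per_sec, requests_per_sec):
    """Messages each request must cover for the request rate to keep up with input."""
    return math.ceil(messages_per_sec / requests_per_sec)

ratio = min_messages_per_request(1000, 10)
print(ratio)
```

With one message per request, a single client falls behind by a factor of 100, so any workable design has to close that gap somehow.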

Question 64 (medium, multiple choice)

An organization needs to continuously replicate change data from a MySQL database to BigQuery with sub-minute latency. The database is running on-premises. Which Google Cloud service should they use?

Question 65 (medium, multiple choice)

A data engineer is building a Dataflow pipeline in Python that reads from BigQuery, transforms data, and writes to Cloud Storage. The pipeline will be deployed in production. Which approach should they use to ensure the pipeline is reusable across environments with different configuration parameters?

Question 66 (hard, multiple choice)

A company uses BigQuery's Storage Write API in committed mode to stream data. They notice that some writes are failing with 'DEADLINE_EXCEEDED' errors during peak traffic. The pipeline is a Dataflow job using the Beam SDK. What is the MOST likely cause and solution?

Question 67 (easy, multiple choice)

Which Dataflow feature automatically scales the number of workers based on the pipeline's current workload, and also selects the optimal machine type for each worker based on the pipeline's resource requirements?

Question 68 (medium, multiple choice)

A data pipeline processes JSON files from Cloud Storage, transforms them using Apache Beam, and writes the output to BigQuery. Some records are malformed and cause the pipeline to fail. How should the engineer handle these errors to ensure the pipeline continues processing while preserving the malformed records for analysis?

Question 69 (medium, multiple choice)

A company wants to orchestrate a multi-step data processing workflow that includes calling a Cloud Run service, waiting for its completion, and then running a BigQuery query. The workflow should be serverless and integrate with Cloud Events. Which Google Cloud service should they use?

Question 70 (easy, multiple choice)

A data analyst needs to transform nested and repeated fields in BigQuery. They have a table with a column of type ARRAY<STRUCT<...>>. Which SQL function should they use to flatten the array into individual rows for analysis?
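
"Flatten the array into individual rows" means producing one output row per array element, with the parent columns repeated on each. The sketch below shows that result shape in plain Python as a stand-in; it is not BigQuery SQL, and `flatten`, `orders`, and the field names are hypothetical.

```python
def flatten(rows, array_field):
    """Emit one row per element of row[array_field], repeating the parent columns."""
    out = []
    for row in rows:
        parent = {k: v for k, v in row.items() if k != array_field}
        for element in row[array_field]:
            out.append({**parent, **element})
    return out

orders = [
    {"order_id": 1, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]},
    {"order_id": 2, "items": []},  # an empty array contributes no rows here
]
flat = flatten(orders, "items")
print(flat)
```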

Question 71 (hard, multiple choice)

A Dataflow pipeline reads from Pub/Sub, applies a keyed stateful ParDo that uses state variables to deduplicate events based on event ID, and writes to BigQuery. During a pipeline update, some events are duplicated in BigQuery. The state is not preserved across updates. Which configuration ensures exactly-once semantics during updates?

Question 72 (medium, multiple choice)

A company wants to use dbt (data build tool) to transform data in BigQuery. They have a Cloud Storage bucket containing raw CSV files that are loaded daily into BigQuery via an external table. Which dbt feature should they use to modularize the transformation logic and handle dependencies between models?

Question 73 (medium, multi-select)

A company is building a real-time anomaly detection pipeline using Dataflow. Events are ingested from Pub/Sub, and the pipeline must compute a sliding window average every minute over a 1-hour window. Which TWO configurations are required for this pipeline? (Choose 2)
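
With a 1-hour window that advances every minute, each event belongs to many overlapping windows. The sketch below computes the window start times for one event in plain Python, assuming half-open windows `[start, start + size)` aligned to multiples of the period; `sliding_window_starts` is an illustrative helper, not a Dataflow or Beam API.

```python
def sliding_window_starts(ts, size, period):
    """Start times (seconds) of every window [start, start + size) that contains ts."""
    last_start = ts - (ts % period)  # most recent window start at or before ts
    return list(range(last_start, ts - size, -period))

# An event at t = 3725 s, with a 3600 s window sliding every 60 s.
starts = sliding_window_starts(3725, size=3600, period=60)
print(len(starts), starts[0], starts[-1])
```

Each event contributes to `size / period` windows (60 here), which is why sliding windows cost more to compute than fixed windows of the same length.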

Question 74 (medium, multi-select)

A retail company wants to trigger a Cloud Run service whenever a new CSV file is uploaded to a specific Cloud Storage bucket. Which THREE components are needed to set up this event-driven architecture? (Choose 3)

Question 75 (hard, multi-select)

A company is migrating on-premises Apache Kafka workloads to Google Cloud. They want to minimize changes to existing producer and consumer applications while leveraging managed services. Which TWO services should they consider? (Choose 2)

Question 76 (easy, multiple choice)

A data engineer needs to ingest on-premises Oracle CDC data into BigQuery in near real-time with minimal operational overhead. Which service should they use?

Question 77 (medium, multiple choice)

A data team wants to load millions of small JSON files (each <1 MB) from GCS into BigQuery daily with the lowest cost and fastest performance. They need exactly-once semantics and the ability to detect new files automatically. Which approach is most suitable?

Question 78 (hard, multiple choice)

A company is using Pub/Sub to ingest clickstream events and Dataflow to write to BigQuery. They observe that some events are malformed and cause the pipeline to fail. They need a solution that captures malformed events without blocking the pipeline and allows reprocessing later. Which Dataflow pattern should they implement?

Question 79 (easy, multiple choice)

A data engineer needs to transfer 500 TB of archival data from an on-premises NAS to Cloud Storage. The on-premises network has limited bandwidth (100 Mbps). Which transfer method should they recommend?

Question 80 (medium, multiple choice)

A company runs Apache Kafka on Dataproc for real-time event streaming. They want to archive the Kafka topics to Cloud Storage for long-term retention and later analysis in BigQuery. Which approach is the most cost-effective and operationally simple?

Question 81 (hard, multiple choice)

A data engineer is designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery using the Storage Write API in exactly-once mode. The pipeline must handle late-arriving data (up to 1 hour) and maintain correct aggregation results. Which trigger configuration should they use?

Question 82 (medium, multiple choice)

A company wants to use dbt to transform data in BigQuery. Their source data is loaded daily into staging tables. They need to run dbt transformations on a schedule and only process tables that have changed. Which dbt feature should they use?

Question 83 (medium, multiple choice)

A data engineer needs to create a Dataflow pipeline template that can be reused across multiple environments (dev, staging, prod) with different parameters (e.g., input Pub/Sub topic, output BigQuery table). Which template type should they use?

Question 84 (hard, multiple choice)

A company uses Eventarc to trigger a Cloud Run service when new objects appear in a GCS bucket. Recently, the Cloud Run service has been failing with 429 errors (too many requests) during high-velocity uploads. They need to handle the load without losing events. What should they do?

Question 85 (easy, multiple choice)

A data engineer needs to query a BigQuery table that contains an array of structs. They want to expand the array into separate rows for each element. Which SQL function should they use?

Question 86 (medium, multiple choice)

A company wants to migrate their on-premises Teradata data warehouse to BigQuery. They need an automated, one-time transfer of historical data (10 TB) and ongoing incremental daily syncs. Which Google Cloud service should they use?

Question 87 (hard, multiple choice)

A data engineer is using Spark on Dataproc to process a large dataset. They notice the job is slow due to excessive shuffling. They want to optimize the job by using a more efficient data structure that reduces serialization overhead and provides better memory management. Which Spark API should they use?

Question 88 (medium, multi-select)

A company needs to stream real-time user activity data from their application into BigQuery for immediate dashboarding. They want to minimize latency (under 5 seconds) and ensure exactly-once delivery. Which TWO options should they consider? (Choose 2)

Question 89 (medium, multi-select)

A data engineer is designing a batch processing pipeline that runs daily. The pipeline reads CSV files from GCS, transforms them using Python, and writes the results to BigQuery. They need to parameterize the pipeline for different environments and run it on a schedule. Which THREE components should they use? (Choose 3)

Question 90 (hard, multi-select)

A company uses Workflows to orchestrate a multi-step data pipeline. One step calls an HTTP endpoint that may take up to 10 minutes, but the default Workflows timeout is too short. They also need to handle transient errors with retries. Which TWO configurations should they apply? (Choose 2)

Question 91 (medium, multiple choice)

A company wants to ingest data from an on-premises Oracle database into BigQuery in near real-time with minimal latency. The database has a high volume of inserts and updates. Which service should they use?

Question 92 (easy, multiple choice)

A data engineer needs to load a 10 GB CSV file from GCS into BigQuery. The file contains some malformed rows that should be skipped. Which approach is most efficient?

Question 93 (medium, multiple choice)

A team wants to transfer data from an on-premises Hadoop cluster to Cloud Storage for processing. The cluster is located in a remote area with limited bandwidth. They need to transfer 500 TB of data. Which service should they use?

Question 94 (hard, multiple choice)

A data pipeline uses Pub/Sub to ingest events, a Dataflow streaming pipeline to process them, and writes results to BigQuery. The pipeline must handle occasional duplicate events without causing duplicate rows in BigQuery. What is the best approach?

Question 95 (medium, multiple choice)

A company uses Google Ads and wants to automatically load their advertising data into BigQuery daily. They also need to transform the data with SQL and schedule a recurring query. Which combination of services meets these requirements with minimal operational overhead?

Question 96 (medium, multiple choice)

A data engineer needs to create a Dataflow pipeline that reads from Pub/Sub, applies a Python transformation, and writes to BigQuery. The pipeline should be reusable across environments with different parameters. Which deployment method is most appropriate?

Question 97 (hard, multiple choice)

A company uses Kafka on Dataproc to ingest streaming data. They want to process the data with Spark Structured Streaming and write results to BigQuery. The team is using Dataproc clusters. Which approach minimizes cost while maintaining performance?

Question 98 (easy, multiple choice)

Which BigQuery feature allows you to query data directly from Cloud Storage without loading it into BigQuery storage?

Question 99 (medium, multiple choice)

A company wants to build an event-driven application that processes images uploaded to a Cloud Storage bucket. The processing takes up to 10 minutes per image and should be automatically triggered. Which compute option should they use?

Question 100 (hard, multiple choice)

A Dataflow streaming pipeline is experiencing high latency and frequent OOM errors when processing variable-sized JSON messages from Pub/Sub. The team suspects that the autoscaling is not effective. Which feature should they enable to improve resource utilization?

Question 101 (easy, multiple choice)

A team needs to orchestrate a multi-step workflow that involves calling external APIs, running BigQuery queries, and conditionally executing Cloud Functions. Which Google Cloud service is best suited for this?

Question 102 (medium, multiple choice)

A company uses dbt on BigQuery to transform data. They want to run dbt models on a schedule and manage environments (dev, prod). Which GCP service should they use to run dbt jobs?

Question 103 (medium, multi select)

A company needs to stream data from a MySQL database to BigQuery with a latency under 10 seconds. They also need to handle schema changes automatically. Which TWO services should they combine?

Question 104 (hard, multi select)

A data team needs to transfer 200 TB of data from Amazon S3 to GCS. The transfer must be incremental, and they need to monitor the transfer progress. Which THREE components should they use?

Question 105 (medium, multi select)

A company wants to use Eventarc to trigger a Cloud Run service when new objects are created in a GCS bucket. They also need to filter events for a specific bucket and object prefix. Which THREE resources must exist or be created?

Question 106 (easy, multiple choice)

A data engineer needs to load 10 GB of CSV files from Amazon S3 into BigQuery on a daily basis. The files arrive in a specific S3 bucket at 3 AM UTC each day. Which service should be used to automate this transfer?

Question 107 (medium, multiple choice)

A financial services company receives real-time stock trade data via Pub/Sub. They need to enrich each trade with reference data from a Cloud SQL table and write the results to BigQuery for real-time analytics. The enrichment must handle late-arriving data and ensure exactly-once processing. Which Dataflow streaming pipeline configuration should be used?

Question 108 (hard, multiple choice)

A company needs to continuously synchronize customer data changes from an on-premises Oracle database to BigQuery for near-real-time analytics. The Oracle database has Change Data Capture (CDC) enabled. Which Google Cloud service should be used to stream these changes with minimal latency and schema evolution support?

Question 109 (medium, multiple choice)

A data engineer needs to move 500 TB of archival data from an on-premises Hadoop cluster to Cloud Storage. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 30 days. Which method is most cost-effective and reliable?
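The constraints in this scenario can be checked with simple arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilized link with no protocol overhead, so real transfers would take longer. The function names are hypothetical helpers, not part of any Google Cloud tool.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(terabytes: float, link_mbps: float) -> float:
    """Days needed to send `terabytes` over a link of `link_mbps`.

    Assumes decimal units and 100% link utilization (an idealization).
    """
    total_bits = terabytes * 1e12 * 8
    seconds = total_bits / (link_mbps * 1e6)
    return seconds / SECONDS_PER_DAY


def required_mbps(terabytes: float, deadline_days: float) -> float:
    """Sustained bandwidth (Mbps) needed to finish within `deadline_days`."""
    total_bits = terabytes * 1e12 * 8
    return total_bits / (deadline_days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    # Values taken from the question: 500 TB, 100 Mbps, 30-day deadline.
    print(f"Days at 100 Mbps: {transfer_days(500, 100):.0f}")
    print(f"Mbps needed for 30 days: {required_mbps(500, 30):.0f}")
```

Under these assumptions the transfer would need well over a year at 100 Mbps, and meeting the 30-day deadline would require roughly 1.5 Gbps sustained. The same helpers can be applied to any other volume, bandwidth, and deadline combination.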

Question 110 (medium, multiple choice)

A company uses Pub/Sub to ingest clickstream data. Each message contains a JSON payload with a nested array of user actions. The data must be written to BigQuery, with each action in the array becoming a separate row. Which BigQuery feature or approach should be used to achieve this transformation?

Question 111 (easy, multiple choice)

A data engineer needs to orchestrate a series of tasks that include calling external APIs, running BigQuery queries, and sending notifications. The workflow involves conditional branching and parallel steps. Which Google Cloud service should be used?

Question 112 (hard, multiple choice)

A company is running a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. They notice that the number of workers is not scaling up to handle increased throughput, causing latency spikes. The pipeline uses a GlobalWindow with default triggering. What is the most likely cause of the under-scaling?

Question 113 (medium, multiple choice)

A company wants to move data from an on-premises MySQL database to BigQuery for analytics. They need to capture all changes (inserts, updates, deletes) in near real-time and also perform an initial historical load. Which approach meets these requirements with minimal operational overhead?

Question 114 (easy, multiple choice)

A data engineer is building a Dataflow pipeline that reads from BigQuery, transforms data using Apache Beam, and writes results to Cloud Storage in Avro format. They need to ensure the pipeline can be easily redeployed with different parameters without modifying code. Which deployment method should they use?

Question 115 (hard, multiple choice)

A company uses BigQuery to store event data. They need to load data from multiple sources with different schemas and expect frequent schema changes. Which approach provides the most flexibility for schema evolution while minimizing load failures and performance impact?

Question 116 (medium, multi select)

A company is building a data pipeline that ingests streaming data from Pub/Sub, transforms it with Dataflow, and loads it into BigQuery. They want to handle malformed messages that cannot be parsed. Which TWO actions should they implement for error handling? (Choose 2)

Question 117 (medium, multi select)

A data engineer needs to perform a one-time migration of 10 TB of data from on-premises Hadoop HDFS to Cloud Storage. The network link is 1 Gbps. Which TWO services or tools should they consider? (Choose 2)

Question 118 (hard, multi select)

A company uses Dataflow to process data with Apache Beam in Python. The pipeline reads from Pub/Sub, applies a ParDo that calls an external API for enrichment, and writes to BigQuery. The external API has rate limits and occasionally fails. To improve reliability, which THREE strategies should be implemented? (Choose 3)

Question 119 (easy, multi select)

A data engineer needs to schedule a recurring transfer of data from a partner's Amazon S3 bucket to a Cloud Storage bucket for further processing. Which THREE components or configurations are necessary? (Choose 3)

Question 120 (medium, multi select)

A company is migrating a Spark batch job from on-premises to Dataproc. The job uses RDDs for custom transformations and writes output to BigQuery. They want to optimize the job for performance and cost on Dataproc. Which THREE practices should they adopt? (Choose 3)

Question 121 (medium, multiple choice)

A data engineer needs to migrate 200 TB of on-premises Oracle data to BigQuery. The network bandwidth is limited to 100 Mbps, and the data must be loaded within 2 weeks. Which Google Cloud service is most appropriate for the initial data transfer?

Question 122 (easy, multiple choice)

A company wants to stream real-time clickstream data from a website into BigQuery for near-real-time analytics. They expect peaks of 10,000 events per second. Which combination of services is most suitable for ingestion?

Question 123 (hard, multi select)

A company uses Pub/Sub to ingest IoT sensor data and wants to process it with a Dataflow pipeline that uses fixed windows of 1 minute to compute average temperature. The pipeline also needs to handle malformed messages by routing them to a dead letter queue. Which TWO configurations should the engineer implement? (Choose TWO.)

Question 124 (medium, multi select)

A data engineer needs to schedule a recurring batch load of CSV files from an on-premises SFTP server into BigQuery. The files are generated daily and need to be loaded into a partitioned table by date. Which THREE steps should the engineer take? (Choose THREE.)

Question 125 (easy, multi select)

A company uses Datastream to replicate data from a MySQL database to BigQuery in near real-time. Which TWO BigQuery features are automatically used by Datastream for optimal performance and consistency? (Choose TWO.)

Question 126 (medium, multi select)

A data engineer needs to build a Dataflow pipeline that reads JSON messages from Pub/Sub, transforms them (including filtering, grouping, and enrichment), and writes the results to BigQuery. The pipeline must handle schema evolution in the input messages and minimize data loss. Which THREE settings or features should the engineer use? (Choose THREE.)

Question 1easymultiple choice

Read the full Ingesting and Processing the Data explanation →

A data engineer needs to load 10 TB of CSV files from Amazon S3 into Google BigQuery on a daily basis. Which service should they use to automate this transfer?

Question 2easymultiple choice

Read the full Ingesting and Processing the Data explanation →

You need to stream real-time user click events from your application into BigQuery for immediate analysis. The events must be available for query within seconds. Which approach is recommended?

Question 3easymultiple choice

Read the full Ingesting and Processing the Data explanation →

Your company is migrating an on-premises Hadoop cluster to Google Cloud. You need to transform large datasets using Spark SQL. Which Google Cloud service should you use?

Question 4easymultiple choice

Read the full Ingesting and Processing the Data explanation →

A data engineer needs to transfer 500 TB of on-premises data to Google Cloud Storage. The data is stored on NAS devices and the network bandwidth is limited to 100 Mbps. What is the most cost-effective and timely transfer method?

Question 5mediummultiple choice

Study the full Python automation breakdown →

You are building a Dataflow pipeline in Python that reads messages from Pub/Sub, enriches them with data from a BigQuery table, and writes the results to BigQuery. The enrichment lookup table is large and changes infrequently. Which approach minimizes cost and latency?

Question 6mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

You are designing a Dataflow pipeline to process streaming data. The pipeline may encounter malformed records. You need to handle these errors without failing the entire pipeline and store the bad records for later analysis. What is the best practice?

Question 7mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

Your company uses Kafka for event streaming. You want to run Kafka on Google Cloud with the ability to auto-scale clusters and use managed infrastructure. Which service should you choose?

Question 8mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

You need to perform a one-time migration of historical data from an on-premises Teradata data warehouse to BigQuery. The data volume is 50 TB and you have a high-speed network connection (10 Gbps). What is the most efficient way to load the data?

Question 9mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

You have a Dataflow pipeline that processes streaming data with high throughput. You notice that the pipeline is experiencing high latency and the workers are underutilized. Which Dataflow feature can automatically optimize resource allocation?

Question 10mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

Your organization uses dbt (data build tool) for transformations on BigQuery. You need to run dbt models on a schedule and manage versions. Which Google Cloud service can execute dbt jobs in a serverless manner?

Question 11hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

You are migrating an on-premises PostgreSQL database to Cloud SQL. You need to continuously replicate changes to BigQuery for real-time analytics with minimal latency. Which service should you use?

Question 12hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

You are designing a Dataflow pipeline that needs to exactly-once process events from Pub/Sub and write to BigQuery using the Storage Write API. The pipeline may restart and could reprocess some messages. What setting ensures exactly-once semantics for the output?

Question 13hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

You need to process a large volume of event data from Cloud Storage, apply complex transformations using Apache Spark, and then load the results into BigQuery. The data arrives in batches every hour. You want to minimize costs by using preemptible VMs. Which service should you use?

Question 14mediummulti select

Read the full Ingesting and Processing the Data explanation →

Which TWO statements are true about BigQuery Data Transfer Service? (Choose 2)

Question 15hardmulti select

Read the full Ingesting and Processing the Data explanation →

You are building a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline must handle late-arriving data and ensure that the windowing and triggering are correct. Which THREE configurations should you consider? (Choose 3)

Question 16easymultiple choice

Read the full Ingesting and Processing the Data explanation →

An organization wants to ingest on-premises Oracle database changes into BigQuery for real-time analytics with minimal latency. The Oracle database is version 19c and has a high transactional volume. Which Google Cloud service should they use?

Question 17easymultiple choice

Read the full Ingesting and Processing the Data explanation →

A data engineer needs to schedule recurring nightly loads from Amazon S3 to Google Cloud Storage. The data is in CSV format and the volume is approximately 500 GB per night. Which Google Cloud service should they use?

Question 18mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A company runs a Dataflow pipeline that reads from Pub/Sub, transforms data, and writes to BigQuery. The pipeline uses classic templates and is deployed in batch mode. They notice that the pipeline does not scale well under high load, causing a backlog in Pub/Sub. Which improvement would BEST address the scaling issue?

Question 19mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A company needs to load data from a MySQL database into BigQuery daily. The data volume is 10 GB per day and the schema changes occasionally. They want to minimize costs and operational overhead. What is the MOST appropriate approach?

Question 20mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A media company streams real-time viewer data from Pub/Sub to BigQuery using a Dataflow pipeline. They need to handle occasional malformed messages without losing valid data. Which pattern should they implement?

Question 21mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

An organization needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Google Cloud Storage. The network bandwidth is limited to 100 Mbps. Which transfer method is MOST cost-effective and time-efficient?

Question 22hardmultiple choice

Study the full Python automation breakdown →

A data engineer is designing a Dataflow pipeline in Python that reads from Pub/Sub, applies complex transformations using external libraries, and writes to BigQuery. The pipeline must be deployed as a reusable, version-controlled template that can be easily updated without re-uploading the pipeline code each time. Which approach should they use?

Question 23hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

A financial services company needs to ingest real-time trade data from multiple sources into BigQuery for immediate fraud detection. The data volume is high (1 million messages per second) and each message must be available for queries within seconds. They are considering the Storage Write API. Which stream mode should they choose to balance data availability and cost?

Question 24hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

A team uses dbt on BigQuery to transform data in their data warehouse. They have a large table with nested and repeated fields (arrays and structs). The transformation needs to normalize this data into a star schema. Which dbt feature and BigQuery SQL feature should they use together?

Question 25easymultiple choice

Read the full Ingesting and Processing the Data explanation →

A company wants to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which event-driven solution should they use?

Question 26mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A data engineer is using Apache Spark on Dataproc to process a large dataset. They need to perform complex aggregation and transformation with high performance. The dataset has a known schema and they want to take advantage of Catalyst optimizer. Which Spark API should they use?

Question 27mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A company uses Workflows to orchestrate a series of Google Cloud services for data processing. They need to call an external HTTP API as part of the workflow and handle potential failures with retries. Which Workflows feature should they use?

Question 28mediummulti select

Read the full Ingesting and Processing the Data explanation →

A data engineering team needs to ingest streaming data from an existing Kafka cluster (on-premises) into Google Cloud for real-time analytics. They want to minimize changes to the existing Kafka setup and avoid long-term operational overhead. Which TWO approaches should they consider?

Question 29hardmulti select

Read the full Ingesting and Processing the Data explanation →

A large enterprise is migrating its data warehouse from Teradata to BigQuery. They need to transfer historical data (100 TB) and set up ongoing daily incremental loads. They also need to transform the data using dbt. Which THREE Google Cloud services should they use?

Question 30easymulti select

Read the full Ingesting and Processing the Data explanation →

A data engineer needs to load data from CSV files in Cloud Storage into BigQuery. The CSV files have a header row and some columns contain nested JSON strings. Which TWO methods can they use to load this data into BigQuery?

Question 31mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

You are building a streaming pipeline to ingest real-time clickstream data from a website into BigQuery for immediate analysis. The data must be available in BigQuery within seconds and you need to handle late-arriving data (e.g., browser offline events) that may arrive hours later. Which approach should you use?

Question 32easymultiple choice

Read the full Ingesting and Processing the Data explanation →

A company wants to transfer 500 TB of data from an on-premises Hadoop cluster to Google Cloud Storage (GCS) for processing with Dataproc. The on-premises network has a 1 Gbps dedicated link to Google Cloud. The data must be transferred as quickly as possible, minimizing network usage. Which transfer method should they use?

Question 33hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

Your team is processing a large dataset with Apache Beam on Dataflow. The pipeline sometimes fails due to transient errors when writing to a BigQuery sink. You need to ensure that failed records are not lost and can be reprocessed later without blocking the pipeline. What is the best approach?

Question 34mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

You are designing a near-real-time CDC pipeline to replicate changes from an on-premises PostgreSQL database to BigQuery for analytics. The source database has high transaction volume and you must ensure minimal impact on the source. Which Google Cloud service should you use to ingest the change data?

Question 35easymultiple choice

Read the full Ingesting and Processing the Data explanation →

You are loading 10 GB of daily CSV files from a GCS bucket into a BigQuery table. The files contain some malformed rows that you want to skip. Which BigQuery load configuration should you use?

Question 36hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

Your Dataflow pipeline reads from Pub/Sub, performs transformations, and writes to BigQuery. You notice that the pipeline's autoscaling is not keeping up with sudden spikes in traffic, causing increased lag. The pipeline uses Classic Templates. Which change would most effectively improve autoscaling responsiveness?

Question 37mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

You are migrating an existing Kafka cluster to Google Cloud using Dataproc. The cluster handles high-throughput streaming data with strict ordering requirements per partition. Which choice of Dataproc configuration is most appropriate?

Question 38mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

Your team uses dbt to transform data in BigQuery. You need to schedule dbt runs to refresh materialized tables and views every hour. The transformations include both full refreshes and incremental models. What is the most efficient way to orchestrate these dbt runs on Google Cloud?

Question 39easymultiple choice

Read the full Ingesting and Processing the Data explanation →

You need to ingest Google Ads performance data into BigQuery on a daily basis for reporting. Which service should you use?

Question 40mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

You are designing a streaming pipeline that ingests events from Pub/Sub, enriches them with a machine learning model, and writes the results to BigQuery. The ML model is deployed on Cloud Run and has a high latency (500ms per request). You need to minimize the impact of slow ML inference on the overall pipeline throughput. Which approach should you take?

Question 41hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

Your team has a Dataflow pipeline that reads from BigQuery, transforms data, and writes to GCS. The pipeline is failing with 'Out of Memory' errors on the worker nodes. The input data is large but fits within the total cluster memory. Which configuration change is most likely to resolve the issue without increasing costs significantly?

Question 42easymultiple choice

Review the full routing breakdown →

You need to react to changes in a GCS bucket (e.g., new object creation) and trigger a Cloud Run service to process the new file. Which Google Cloud service should you use to route the event?

Question 43mediummulti select

Read the full Ingesting and Processing the Data explanation →

You need to ingest streaming data from a custom application into BigQuery with exactly-once semantics and low latency. The data volume is up to 10 MB/s. Which TWO services should you combine?

Question 44hardmulti select

Read the full Ingesting and Processing the Data explanation →

Your company has a Dataproc cluster that runs Spark jobs. You need to choose between RDDs, DataFrames, and Datasets for a new job that performs complex aggregations on structured data. Which TWO statements are correct regarding performance and ease of use?

Question 45mediummulti select

Read the full Ingesting and Processing the Data explanation →

You are building a BigQuery table that contains nested and repeated fields (e.g., order with line items). You need to write a query that counts the number of line items per order. Which TWO SQL functions/techniques can you use?

Question 46mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A data engineer needs to transfer 5 PB of historical data from an on-premises Hadoop cluster to Cloud Storage. The network bandwidth is limited to 1 Gbps, and the transfer must complete within 30 days. Which transfer method should they use?

Question 47easymultiple choice

Read the full Ingesting and Processing the Data explanation →

A company wants to stream real-time user click events from their web application into BigQuery for immediate analysis. Which combination of services is the most scalable and cost-effective for this use case?

Question 48hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. Some incoming messages are malformed and fail to parse. How should you handle these messages to ensure the pipeline continues processing without data loss?

Question 49mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A data engineer needs to load 2 TB of Avro files stored in Cloud Storage into BigQuery on a daily schedule. The schema is static and the data should overwrite the existing table each day. What is the most efficient way to accomplish this?

Question 50easymultiple choice

Read the full Ingesting and Processing the Data explanation →

Which Google Cloud service is designed to replicate data from MySQL, PostgreSQL, and Oracle databases to BigQuery or Cloud Storage in near real-time?

Question 51mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A company needs to run a Spark ML training job on a Dataproc cluster with high memory per node, but the cluster should automatically scale down when idle to save costs. Which configuration should they use?

Question 52hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

You are migrating an on-premises Kafka cluster to Google Cloud. The cluster has 50 topics with a total throughput of 200 MB/s. You want to minimize operational overhead. Which approach is the most cost-effective?

Question 53mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A data engineer is creating a Dataflow Flex Template for a batch pipeline that reads from BigQuery and writes to Cloud Storage. They need to pass a runtime parameter for the output bucket. How should they define this parameter?

Question 54easymultiple choice

Read the full Ingesting and Processing the Data explanation →

Which BigQuery feature allows you to write data with exactly-once semantics, high throughput, and the ability to buffer data before making it available for queries?

Question 55mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A company wants to transform data using dbt (data build tool) on BigQuery. They have a CI/CD pipeline and need to version-control their transformations. Which setup is recommended?

Question 56hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

You are using Dataproc to run a Spark job that reads data from Cloud Storage, performs aggregations, and writes results back to Cloud Storage. The job is failing with out-of-memory errors on the shuffle. Which optimization should you apply?

Question 57mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

An organization needs to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which service should they use to set up this event-driven architecture?

Question 58mediummulti select

Read the full Ingesting and Processing the Data explanation →

A data engineer needs to schedule a nightly transfer of data from an Amazon S3 bucket to Cloud Storage. Which two steps are required to achieve this? (Choose TWO.)

Question 59mediummulti select

Read the full Ingesting and Processing the Data explanation →

Which three of the following are valid BigQuery data loading methods? (Choose THREE.)

Question 60hardmulti select

Read the full Ingesting and Processing the Data explanation →

A company is designing a real-time analytics pipeline using Pub/Sub and Dataflow. They need to ensure exactly-once processing and handle late-arriving data. Which two configurations should they implement? (Choose TWO.)

Question 61mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

A data engineer needs to ingest daily Salesforce reports into BigQuery without writing custom code. The reports are exported to an Amazon S3 bucket on a schedule. Which service should they use to automate the transfer?

Question 62easymultiple choice

Read the full Ingesting and Processing the Data explanation →

A company wants to migrate 500 TB of on-premises archival data to Cloud Storage. The data is stored on a SAN and the network link is limited to 1 Gbps. The migration must complete within 10 days. What is the MOST cost-effective approach?

Question 63hardmultiple choice

Read the full Ingesting and Processing the Data explanation →

A streaming pipeline ingests events from Pub/Sub, enriches them via a slow REST API call, and writes the result to BigQuery. The API has a limit of 10 requests per second per client. The pipeline processes 1000 messages per second. Which approach minimizes latency while respecting API limits?

Question 64mediummultiple choice

Read the full Ingesting and Processing the Data explanation →

An organization needs to continuously replicate change data from a MySQL database to BigQuery with sub-minute latency. The database is running on-premises. Which Google Cloud service should they use?

Question 65 · medium · multiple choice

A data engineer is building a Dataflow pipeline in Python that reads from BigQuery, transforms data, and writes to Cloud Storage. The pipeline will be deployed in production. Which approach should they use to ensure the pipeline is reusable across environments with different configuration parameters?

Question 66 · hard · multiple choice

A company uses BigQuery's Storage Write API in committed mode to stream data. They notice that some writes are failing with 'DEADLINE_EXCEEDED' errors during peak traffic. The pipeline is a Dataflow job using the Beam SDK. What is the MOST likely cause and solution?

Question 67 · easy · multiple choice

Which Dataflow feature automatically scales the number of workers based on the pipeline's current workload, and also selects the optimal machine type for each worker based on the pipeline's resource requirements?

Question 68 · medium · multiple choice

A data pipeline processes JSON files from Cloud Storage, transforms them using Apache Beam, and writes the output to BigQuery. Some records are malformed and cause the pipeline to fail. How should the engineer handle these errors to ensure the pipeline continues processing while preserving the malformed records for analysis?
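
The idea of continuing past bad records while keeping them for later analysis can be illustrated without any pipeline framework. The sketch below is plain Python, not Apache Beam code; `split_valid_and_dead_letter` is a hypothetical name, and treating non-object JSON as malformed is an assumption made for the example. In Beam, the same split is commonly expressed with tagged outputs on a `ParDo`.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def split_valid_and_dead_letter(
    raw_records: Iterable[str],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Parse each record; route failures to a dead-letter list instead of raising."""
    valid: List[Dict[str, Any]] = []
    dead_letter: List[Dict[str, str]] = []
    for raw in raw_records:
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            valid.append(parsed)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Keep the original payload and the reason so it can be reprocessed.
            dead_letter.append({"raw": raw, "error": str(exc)})
    return valid, dead_letter


records = ['{"id": 1}', "not json", "[1, 2]", '{"id": 2}']
valid, dead_letter = split_valid_and_dead_letter(records)
```

Here `valid` holds the two well-formed objects and `dead_letter` holds the two rejected payloads with their error messages.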

Question 69 · medium · multiple choice

A company wants to orchestrate a multi-step data processing workflow that includes calling a Cloud Run service, waiting for its completion, and then running a BigQuery query. The workflow should be serverless and integrate with Cloud Events. Which Google Cloud service should they use?

Question 70 · easy · multiple choice

A data analyst needs to transform nested and repeated fields in BigQuery. They have a table with a column of type ARRAY<STRUCT<...>>. Which SQL function should they use to flatten the array into individual rows for analysis?

Question 71 · hard · multiple choice

A Dataflow pipeline reads from Pub/Sub, applies a keyed stateful ParDo that uses state variables to deduplicate events based on event ID, and writes to BigQuery. During a pipeline update, some events are duplicated in BigQuery. The state is not preserved across updates. Which configuration ensures exactly-once semantics during updates?

Question 72 · medium · multiple choice

A company wants to use dbt (data build tool) to transform data in BigQuery. They have a Cloud Storage bucket containing raw CSV files that are loaded daily into BigQuery via an external table. Which dbt feature should they use to modularize the transformation logic and handle dependencies between models?

Question 73 · medium · multi select

A company is building a real-time anomaly detection pipeline using Dataflow. Events are ingested from Pub/Sub, and the pipeline must compute a sliding window average every minute over a 1-hour window. Which TWO configurations are required for this pipeline? (Choose 2)
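
A 1-hour window that advances every minute means each event contributes to 60 overlapping windows. The sketch below works that out in plain Python (it is not the Beam API); `sliding_windows` is a hypothetical helper, timestamps are in seconds, and windows are assumed to be aligned to multiples of the period.

```python
from typing import List, Tuple


def sliding_windows(timestamp: int, size: int, period: int) -> List[Tuple[int, int]]:
    """Return every [start, end) window of length `size`, starting each `period`
    seconds, that contains `timestamp`."""
    # Latest window start at or before the timestamp.
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows


# An event at t = 7230 s, with a 3600 s window sliding every 60 s.
windows = sliding_windows(timestamp=7_230, size=3_600, period=60)
print(len(windows))  # 60 windows contain this single event
```

The overlap factor (window size divided by period) is worth keeping in mind when sizing such a pipeline, since each element is processed once per window it falls into.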

Question 74 · medium · multi select

A retail company wants to trigger a Cloud Run service whenever a new CSV file is uploaded to a specific Cloud Storage bucket. Which THREE components are needed to set up this event-driven architecture? (Choose 3)

Question 75 · hard · multi select

A company is migrating on-premises Apache Kafka workloads to Google Cloud. They want to minimize changes to existing producer and consumer applications while leveraging managed services. Which TWO services should they consider? (Choose 2)

Question 76 · easy · multiple choice

A data engineer needs to ingest on-premises Oracle CDC data into BigQuery in near real-time with minimal operational overhead. Which service should they use?

Question 77 · medium · multiple choice

A data team wants to load millions of small JSON files (each <1 MB) from GCS into BigQuery daily with the lowest cost and fastest performance. They need exactly-once semantics and the ability to detect new files automatically. Which approach is most suitable?

Question 78 · hard · multiple choice

A company is using Pub/Sub to ingest clickstream events and Dataflow to write to BigQuery. They observe that some events are malformed and cause the pipeline to fail. They need a solution that captures malformed events without blocking the pipeline and allows reprocessing later. Which Dataflow pattern should they implement?

Question 79 · easy · multiple choice

A data engineer needs to transfer 500 TB of archival data from an on-premises NAS to Cloud Storage. The on-premises network has limited bandwidth (100 Mbps). Which transfer method should they recommend?

Question 80 · medium · multiple choice

A company runs Apache Kafka on Dataproc for real-time event streaming. They want to archive the Kafka topics to Cloud Storage for long-term retention and later analysis in BigQuery. Which approach is the most cost-effective and operationally simple?

Question 81 · hard · multiple choice

A data engineer is designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery using the Storage Write API in exactly-once mode. The pipeline must handle late-arriving data (up to 1 hour) and maintain correct aggregation results. Which trigger configuration should they use?
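
Tolerating data that arrives up to 1 hour late comes down to how long a window stays open after the watermark passes its end. The sketch below is a deliberately simplified model in plain Python, not the Beam API; `window_accepts_late_data` is a hypothetical name, all values are in seconds, and real runners may differ on the exact boundary condition.

```python
def window_accepts_late_data(window_end: int, watermark: int, allowed_lateness: int) -> bool:
    """Simplified rule: a window keeps accepting elements until the watermark
    has moved `allowed_lateness` seconds past the end of the window."""
    return watermark < window_end + allowed_lateness


ONE_HOUR = 3_600

# Window ending at t = 600 s with one hour of allowed lateness.
still_open = window_accepts_late_data(window_end=600, watermark=4_000, allowed_lateness=ONE_HOUR)
expired = window_accepts_late_data(window_end=600, watermark=4_300, allowed_lateness=ONE_HOUR)
```

With these inputs `still_open` is `True` (the watermark has not yet reached 4200 s) and `expired` is `False`, so an element for that window arriving after the second watermark would be dropped under this model.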

Question 82 · medium · multiple choice

A company wants to use dbt to transform data in BigQuery. Their source data is loaded daily into staging tables. They need to run dbt transformations on a schedule and only process tables that have changed. Which dbt feature should they use?

Question 83 · medium · multiple choice

A data engineer needs to create a Dataflow pipeline template that can be reused across multiple environments (dev, staging, prod) with different parameters (e.g., input Pub/Sub topic, output BigQuery table). Which template type should they use?

Question 84 · hard · multiple choice

A company uses Eventarc to trigger a Cloud Run service when new objects appear in a GCS bucket. Recently, the Cloud Run service has been failing with 429 errors (too many requests) during high-velocity uploads. They need to handle the load without losing events. What should they do?

Question 85 · easy · multiple choice

A data engineer needs to query a BigQuery table that contains an array of structs. They want to expand the array into separate rows for each element. Which SQL function should they use?

Question 86 · medium · multiple choice

A company wants to migrate their on-premises Teradata data warehouse to BigQuery. They need an automated, one-time transfer of historical data (10 TB) and ongoing incremental daily syncs. Which Google Cloud service should they use?

Question 87 · hard · multiple choice

A data engineer is using Spark on Dataproc to process a large dataset. They notice the job is slow due to excessive shuffling. They want to optimize the job by using a more efficient data structure that reduces serialization overhead and provides better memory management. Which Spark API should they use?

Question 88 · medium · multi select

A company needs to stream real-time user activity data from their application into BigQuery for immediate dashboarding. They want to minimize latency (under 5 seconds) and ensure exactly-once delivery. Which TWO options should they consider? (Choose 2)

Question 89 · medium · multi select

A data engineer is designing a batch processing pipeline that runs daily. The pipeline reads CSV files from GCS, transforms them using Python, and writes the results to BigQuery. They need to parameterize the pipeline for different environments and run it on a schedule. Which THREE components should they use? (Choose 3)

Question 90 · hard · multi select

A company uses Workflows to orchestrate a multi-step data pipeline. One step calls an HTTP endpoint that may take up to 10 minutes, but the default Workflows timeout is too short. They also need to handle transient errors with retries. Which TWO configurations should they apply? (Choose 2)

Question 91 · medium · multiple choice

A company wants to ingest data from an on-premises Oracle database into BigQuery in near real-time with minimal latency. The database has a high volume of inserts and updates. Which service should they use?

Question 92 · easy · multiple choice

A data engineer needs to load a 10 GB CSV file from GCS into BigQuery. The file contains some malformed rows that should be skipped. Which approach is most efficient?

Question 93 · medium · multiple choice

A team wants to transfer data from an on-premises Hadoop cluster to Cloud Storage for processing. The cluster is located in a remote area with limited bandwidth. They need to transfer 500 TB of data. Which service should they use?

Question 94 · hard · multiple choice

A data pipeline uses Pub/Sub to ingest events, a Dataflow streaming pipeline to process them, and writes results to BigQuery. The pipeline must handle occasional duplicate events without causing duplicate rows in BigQuery. What is the best approach?

Question 95 · medium · multiple choice

A company uses Google Ads and wants to automatically load their advertising data into BigQuery daily. They also need to transform the data with SQL and schedule a recurring query. Which combination of services meets these requirements with minimal operational overhead?

Question 96 · medium · multiple choice

A data engineer needs to create a Dataflow pipeline that reads from Pub/Sub, applies a Python transformation, and writes to BigQuery. The pipeline should be reusable across environments with different parameters. Which deployment method is most appropriate?

Question 97 · hard · multiple choice

A company uses Kafka on Dataproc to ingest streaming data. They want to process the data with Spark Structured Streaming on Dataproc clusters and write the results to BigQuery. Which approach minimizes cost while maintaining performance?

Question 98 · easy · multiple choice

Which BigQuery feature allows you to query data directly from Cloud Storage without loading it into BigQuery storage?

Question 99 · medium · multiple choice

A company wants to build an event-driven application that processes images uploaded to a Cloud Storage bucket. The processing takes up to 10 minutes per image and should be automatically triggered. Which compute option should they use?

Question 100 · hard · multiple choice

A Dataflow streaming pipeline is experiencing high latency and frequent OOM errors when processing variable-sized JSON messages from Pub/Sub. The team suspects that the autoscaling is not effective. Which feature should they enable to improve resource utilization?

Question 101 · easy · multiple choice

A team needs to orchestrate a multi-step workflow that involves calling external APIs, running BigQuery queries, and conditionally executing Cloud Functions. Which Google Cloud service is best suited for this?

Question 102 · medium · multiple choice

A company uses dbt on BigQuery to transform data. They want to run dbt models on a schedule and manage environments (dev, prod). Which Google Cloud service should they use to run dbt jobs?

Question 103 · medium · multi select

A company needs to stream data from a MySQL database to BigQuery with a latency under 10 seconds. They also need to handle schema changes automatically. Which TWO services should they combine?

Question 104 · hard · multi select

A data team needs to transfer 200 TB of data from Amazon S3 to GCS. The transfer must be incremental, and they need to monitor the transfer progress. Which THREE components should they use?

Question 105 · medium · multi select

A company wants to use Eventarc to trigger a Cloud Run service when new objects are created in a GCS bucket. They also need to filter events for a specific bucket and object prefix. Which THREE resources must exist or be created?

Question 106 · easy · multiple choice

A data engineer needs to load 10 GB of CSV files from Amazon S3 into BigQuery on a daily basis. The files arrive in a specific S3 bucket at 3 AM UTC each day. Which service should be used to automate this transfer?

Question 107 · medium · multiple choice

A financial services company receives real-time stock trade data via Pub/Sub. They need to enrich each trade with reference data from a Cloud SQL table and write the results to BigQuery for real-time analytics. The enrichment must handle late-arriving data and ensure exactly-once processing. Which Dataflow streaming pipeline configuration should be used?

Question 108 · hard · multiple choice

A company needs to continuously synchronize customer data changes from an on-premises Oracle database to BigQuery for near-real-time analytics. The Oracle database has Change Data Capture (CDC) enabled. Which Google Cloud service should be used to stream these changes with minimal latency and schema evolution support?

Question 109 · medium · multiple choice

A data engineer needs to move 500 TB of archival data from an on-premises Hadoop cluster to Cloud Storage. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 30 days. Which method is most cost-effective and reliable?

Question 110 · medium · multiple choice

A company uses Pub/Sub to ingest clickstream data. Each message contains a JSON payload with a nested array of user actions. The data must be written to BigQuery, with each action in the array becoming a separate row. Which BigQuery feature or approach should be used to achieve this transformation?

Question 111 · easy · multiple choice

A data engineer needs to orchestrate a series of tasks that include calling external APIs, running BigQuery queries, and sending notifications. The workflow involves conditional branching and parallel steps. Which Google Cloud service should be used?

Question 112 · hard · multiple choice

A company is running a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. They notice that the number of workers is not scaling up to handle increased throughput, causing latency spikes. The pipeline uses a GlobalWindow with default triggering. What is the most likely cause of the under-scaling?

Question 113 · medium · multiple choice

A company wants to move data from an on-premises MySQL database to BigQuery for analytics. They need to capture all changes (inserts, updates, deletes) in near real-time and also perform an initial historical load. Which approach meets these requirements with minimal operational overhead?

Question 114 · easy · multiple choice

A data engineer is building a Dataflow pipeline that reads from BigQuery, transforms data using Apache Beam, and writes results to Cloud Storage in Avro format. They need to ensure the pipeline can be easily redeployed with different parameters without modifying code. Which deployment method should they use?

Question 115 · hard · multiple choice

A company uses BigQuery to store event data. They need to load data from multiple sources with different schemas and expect frequent schema changes. Which approach provides the most flexibility for schema evolution while minimizing load failures and performance impact?

Question 116 · medium · multi select

A company is building a data pipeline that ingests streaming data from Pub/Sub, transforms it with Dataflow, and loads it into BigQuery. They want to handle malformed messages that cannot be parsed. Which TWO actions should they implement for error handling? (Choose 2)

Question 117 · medium · multi select

A data engineer needs to perform a one-time migration of 10 TB of data from on-premises Hadoop HDFS to Cloud Storage. The network link is 1 Gbps. Which TWO services or tools should they consider? (Choose 2)

Question 118 · hard · multi select

A company uses Dataflow to process data with Apache Beam in Python. The pipeline reads from Pub/Sub, applies a ParDo that calls an external API for enrichment, and writes to BigQuery. The external API has rate limits and occasionally fails. To improve reliability, which THREE strategies should be implemented? (Choose 3)

Question 119 · easy · multi select

A data engineer needs to schedule a recurring transfer of data from a partner's Amazon S3 bucket to a Cloud Storage bucket for further processing. Which THREE components or configurations are necessary? (Choose 3)

Question 120 · medium · multi select

A company is migrating a Spark batch job from on-premises to Dataproc. The job uses RDDs for custom transformations and writes output to BigQuery. They want to optimize the job for performance and cost on Dataproc. Which THREE practices should they adopt? (Choose 3)

Question 121 · medium · multiple choice

A data engineer needs to migrate 200 TB of on-premises Oracle data to BigQuery. The network bandwidth is limited to 100 Mbps, and the data must be loaded within 2 weeks. Which Google Cloud service is most appropriate for the initial data transfer?

Question 122 · easy · multiple choice

A company wants to stream real-time clickstream data from a website into BigQuery for near-real-time analytics. They expect peaks of 10,000 events per second. Which combination of services is most suitable for ingestion?

Question 123 · hard · multi select

A company uses Pub/Sub to ingest IoT sensor data and wants to process it with a Dataflow pipeline that uses fixed windows of 1 minute to compute average temperature. The pipeline also needs to handle malformed messages by routing them to a dead letter queue. Which TWO configurations should the engineer implement? (Choose TWO.)

Question 124 · medium · multi select

A data engineer needs to schedule a recurring batch load of CSV files from an on-premises SFTP server into BigQuery. The files are generated daily and need to be loaded into a partitioned table by date. Which THREE steps should the engineer take? (Choose THREE.)

Question 125 · easy · multi select

A company uses Datastream to replicate data from a MySQL database to BigQuery in near real-time. Which TWO BigQuery features are automatically used by Datastream for optimal performance and consistency? (Choose TWO.)

Question 126 · medium · multi select

A data engineer needs to build a Dataflow pipeline that reads JSON messages from Pub/Sub, transforms them (including filtering, grouping, and enrichment), and writes the results to BigQuery. The pipeline must handle schema evolution in the input messages and minimize data loss. Which THREE settings or features should the engineer use? (Choose THREE.)