How many Ingesting and Processing the Data questions are on the PDE exam?

The Ingesting and Processing the Data domain is one of the weighted domains on the PDE exam. The Courseiva question bank has 126 practice questions for this domain.

Free PDE Ingesting and Processing the Data Practice Questions (2026)

Q: What does the Ingesting and Processing the Data domain cover on the PDE exam?

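The Ingesting and Processing the Data domain is part of the PDE exam blueprint published by Google Cloud. Judging by the questions listed on this page, it spans batch and streaming ingestion into BigQuery and Cloud Storage, change data capture, Dataflow and Dataproc pipelines, dbt transformations, and event-driven orchestration with services such as Workflows and Eventarc.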

Q: How can I practice Ingesting and Processing the Data questions for PDE?

Click any of the 126 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Ingesting and Processing the Data domain.

All PDE Ingesting and Processing the Data questions (126)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A data engineer needs to load 10 TB of CSV files from Amazon S3 into Google BigQuery on a daily basis. Which service should they use to automate this transfer?

You need to stream real-time user click events from your application into BigQuery for immediate analysis. The events must be available for query within seconds. Which approach is recommended?

Your company is migrating an on-premises Hadoop cluster to Google Cloud. You need to transform large datasets using Spark SQL. Which Google Cloud service should you use?

A data engineer needs to transfer 500 TB of on-premises data to Google Cloud Storage. The data is stored on NAS devices and the network bandwidth is limited to 100 Mbps. What is the most cost-effective and timely transfer method?

You are building a Dataflow pipeline in Python that reads messages from Pub/Sub, enriches them with data from a BigQuery table, and writes the results to BigQuery. The enrichment lookup table is large and changes infrequently. Which approach minimizes cost and latency?

You are designing a Dataflow pipeline to process streaming data. The pipeline may encounter malformed records. You need to handle these errors without failing the entire pipeline and store the bad records for later analysis. What is the best practice?

Your company uses Kafka for event streaming. You want to run Kafka on Google Cloud with the ability to auto-scale clusters and use managed infrastructure. Which service should you choose?

You need to perform a one-time migration of historical data from an on-premises Teradata data warehouse to BigQuery. The data volume is 50 TB and you have a high-speed network connection (10 Gbps). What is the most efficient way to load the data?

You have a Dataflow pipeline that processes streaming data with high throughput. You notice that the pipeline is experiencing high latency and the workers are underutilized. Which Dataflow feature can automatically optimize resource allocation?

Your organization uses dbt (data build tool) for transformations on BigQuery. You need to run dbt models on a schedule and manage versions. Which Google Cloud service can execute dbt jobs in a serverless manner?

You are migrating an on-premises PostgreSQL database to Cloud SQL. You need to continuously replicate changes to BigQuery for real-time analytics with minimal latency. Which service should you use?

You are designing a Dataflow pipeline that must process events from Pub/Sub exactly once and write them to BigQuery using the Storage Write API. The pipeline may restart and could reprocess some messages. What setting ensures exactly-once semantics for the output?

You need to process a large volume of event data from Cloud Storage, apply complex transformations using Apache Spark, and then load the results into BigQuery. The data arrives in batches every hour. You want to minimize costs by using preemptible VMs. Which service should you use?

Which TWO statements are true about BigQuery Data Transfer Service? (Choose 2)

You are building a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline must handle late-arriving data and ensure that the windowing and triggering are correct. Which THREE configurations should you consider? (Choose 3)

An organization wants to ingest on-premises Oracle database changes into BigQuery for real-time analytics with minimal latency. The Oracle database is version 19c and has a high transactional volume. Which Google Cloud service should they use?

A data engineer needs to schedule recurring nightly loads from Amazon S3 to Google Cloud Storage. The data is in CSV format and the volume is approximately 500 GB per night. Which Google Cloud service should they use?

A company runs a Dataflow pipeline that reads from Pub/Sub, transforms data, and writes to BigQuery. The pipeline uses classic templates and is deployed in batch mode. They notice that the pipeline does not scale well under high load, causing a backlog in Pub/Sub. Which improvement would BEST address the scaling issue?

A company needs to load data from a MySQL database into BigQuery daily. The data volume is 10 GB per day and the schema changes occasionally. They want to minimize costs and operational overhead. What is the MOST appropriate approach?

A media company streams real-time viewer data from Pub/Sub to BigQuery using a Dataflow pipeline. They need to handle occasional malformed messages without losing valid data. Which pattern should they implement?

An organization needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Google Cloud Storage. The network bandwidth is limited to 100 Mbps. Which transfer method is MOST cost-effective and time-efficient?

A data engineer is designing a Dataflow pipeline in Python that reads from Pub/Sub, applies complex transformations using external libraries, and writes to BigQuery. The pipeline must be deployed as a reusable, version-controlled template that can be easily updated without re-uploading the pipeline code each time. Which approach should they use?

A financial services company needs to ingest real-time trade data from multiple sources into BigQuery for immediate fraud detection. The data volume is high (1 million messages per second) and each message must be available for queries within seconds. They are considering the Storage Write API. Which stream mode should they choose to balance data availability and cost?

A team uses dbt on BigQuery to transform data in their data warehouse. They have a large table with nested and repeated fields (arrays and structs). The transformation needs to normalize this data into a star schema. Which dbt feature and BigQuery SQL feature should they use together?

A company wants to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which event-driven solution should they use?

A data engineer is using Apache Spark on Dataproc to process a large dataset. They need to perform complex aggregation and transformation with high performance. The dataset has a known schema and they want to take advantage of Catalyst optimizer. Which Spark API should they use?

A company uses Workflows to orchestrate a series of Google Cloud services for data processing. They need to call an external HTTP API as part of the workflow and handle potential failures with retries. Which Workflows feature should they use?

A data engineering team needs to ingest streaming data from an existing Kafka cluster (on-premises) into Google Cloud for real-time analytics. They want to minimize changes to the existing Kafka setup and avoid long-term operational overhead. Which TWO approaches should they consider?

A large enterprise is migrating its data warehouse from Teradata to BigQuery. They need to transfer historical data (100 TB) and set up ongoing daily incremental loads. They also need to transform the data using dbt. Which THREE Google Cloud services should they use?

A data engineer needs to load data from CSV files in Cloud Storage into BigQuery. The CSV files have a header row and some columns contain nested JSON strings. Which TWO methods can they use to load this data into BigQuery?

You are building a streaming pipeline to ingest real-time clickstream data from a website into BigQuery for immediate analysis. The data must be available in BigQuery within seconds and you need to handle late-arriving data (e.g., browser offline events) that may arrive hours later. Which approach should you use?

A company wants to transfer 500 TB of data from an on-premises Hadoop cluster to Google Cloud Storage (GCS) for processing with Dataproc. The on-premises network has a 1 Gbps dedicated link to Google Cloud. The data must be transferred as quickly as possible, minimizing network usage. Which transfer method should they use?

Your team is processing a large dataset with Apache Beam on Dataflow. The pipeline sometimes fails due to transient errors when writing to a BigQuery sink. You need to ensure that failed records are not lost and can be reprocessed later without blocking the pipeline. What is the best approach?

You are designing a near-real-time CDC pipeline to replicate changes from an on-premises PostgreSQL database to BigQuery for analytics. The source database has high transaction volume and you must ensure minimal impact on the source. Which Google Cloud service should you use to ingest the change data?

You are loading 10 GB of daily CSV files from a GCS bucket into a BigQuery table. The files contain some malformed rows that you want to skip. Which BigQuery load configuration should you use?

Your Dataflow pipeline reads from Pub/Sub, performs transformations, and writes to BigQuery. You notice that the pipeline's autoscaling is not keeping up with sudden spikes in traffic, causing increased lag. The pipeline uses Classic Templates. Which change would most effectively improve autoscaling responsiveness?

You are migrating an existing Kafka cluster to Google Cloud using Dataproc. The cluster handles high-throughput streaming data with strict ordering requirements per partition. Which choice of Dataproc configuration is most appropriate?

Your team uses dbt to transform data in BigQuery. You need to schedule dbt runs to refresh materialized tables and views every hour. The transformations include both full refreshes and incremental models. What is the most efficient way to orchestrate these dbt runs on Google Cloud?

You need to ingest Google Ads performance data into BigQuery on a daily basis for reporting. Which service should you use?

You are designing a streaming pipeline that ingests events from Pub/Sub, enriches them with a machine learning model, and writes the results to BigQuery. The ML model is deployed on Cloud Run and has a high latency (500ms per request). You need to minimize the impact of slow ML inference on the overall pipeline throughput. Which approach should you take?

Your team has a Dataflow pipeline that reads from BigQuery, transforms data, and writes to GCS. The pipeline is failing with 'Out of Memory' errors on the worker nodes. The input data is large but fits within the total cluster memory. Which configuration change is most likely to resolve the issue without increasing costs significantly?

You need to react to changes in a GCS bucket (e.g., new object creation) and trigger a Cloud Run service to process the new file. Which Google Cloud service should you use to route the event?

You need to ingest streaming data from a custom application into BigQuery with exactly-once semantics and low latency. The data volume is up to 10 MB/s. Which TWO services should you combine?

Your company has a Dataproc cluster that runs Spark jobs. You need to choose between RDDs, DataFrames, and Datasets for a new job that performs complex aggregations on structured data. Which TWO statements are correct regarding performance and ease of use?

You are building a BigQuery table that contains nested and repeated fields (e.g., order with line items). You need to write a query that counts the number of line items per order. Which TWO SQL functions/techniques can you use?

A data engineer needs to transfer 5 PB of historical data from an on-premises Hadoop cluster to Cloud Storage. The network bandwidth is limited to 1 Gbps, and the transfer must complete within 30 days. Which transfer method should they use?

A company wants to stream real-time user click events from their web application into BigQuery for immediate analysis. Which combination of services is the most scalable and cost-effective for this use case?

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. Some incoming messages are malformed and fail to parse. How should you handle these messages to ensure the pipeline continues processing without data loss?

A data engineer needs to load 2 TB of Avro files stored in Cloud Storage into BigQuery on a daily schedule. The schema is static and the data should overwrite the existing table each day. What is the most efficient way to accomplish this?

Which Google Cloud service is designed to replicate data from MySQL, PostgreSQL, and Oracle databases to BigQuery or Cloud Storage in near real-time?

A company needs to run a Spark ML training job on a Dataproc cluster with high memory per node, but the cluster should automatically scale down when idle to save costs. Which configuration should they use?

You are migrating an on-premises Kafka cluster to Google Cloud. The cluster has 50 topics with a total throughput of 200 MB/s. You want to minimize operational overhead. Which approach is the most cost-effective?

A data engineer is creating a Dataflow Flex Template for a batch pipeline that reads from BigQuery and writes to Cloud Storage. They need to pass a runtime parameter for the output bucket. How should they define this parameter?

Which BigQuery feature allows you to write data with exactly-once semantics, high throughput, and the ability to buffer data before making it available for queries?

A company wants to transform data using dbt (data build tool) on BigQuery. They have a CI/CD pipeline and need to version-control their transformations. Which setup is recommended?

You are using Dataproc to run a Spark job that reads data from Cloud Storage, performs aggregations, and writes results back to Cloud Storage. The job is failing with out-of-memory errors on the shuffle. Which optimization should you apply?

An organization needs to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which service should they use to set up this event-driven architecture?

A data engineer needs to schedule a nightly transfer of data from an Amazon S3 bucket to Cloud Storage. Which two steps are required to achieve this? (Choose TWO.)

Which three of the following are valid BigQuery data loading methods? (Choose THREE.)

A company is designing a real-time analytics pipeline using Pub/Sub and Dataflow. They need to ensure exactly-once processing and handle late-arriving data. Which two configurations should they implement? (Choose TWO.)

A data engineer needs to ingest daily Salesforce reports into BigQuery without writing custom code. The reports are exported to an Amazon S3 bucket on a schedule. Which service should they use to automate the transfer?

A company wants to migrate 500 TB of on-premises archival data to Cloud Storage. The data is stored on a SAN and the network link is limited to 1 Gbps. The migration must complete within 10 days. What is the MOST cost-effective approach?

A streaming pipeline ingests events from Pub/Sub, enriches them via a slow REST API call, and writes the result to BigQuery. The API has a limit of 10 requests per second per client. The pipeline processes 1000 messages per second. Which approach minimizes latency while respecting API limits?

An organization needs to continuously replicate change data from a MySQL database to BigQuery with sub-minute latency. The database is running on-premises. Which Google Cloud service should they use?

A data engineer is building a Dataflow pipeline in Python that reads from BigQuery, transforms data, and writes to Cloud Storage. The pipeline will be deployed in production. Which approach should they use to ensure the pipeline is reusable across environments with different configuration parameters?

A company uses BigQuery's Storage Write API in committed mode to stream data. They notice that some writes are failing with 'DEADLINE_EXCEEDED' errors during peak traffic. The pipeline is a Dataflow job using the Beam SDK. What is the MOST likely cause and solution?

Which Dataflow feature automatically scales the number of workers based on the pipeline's current workload, and also selects the optimal machine type for each worker based on the pipeline's resource requirements?

A data pipeline processes JSON files from Cloud Storage, transforms them using Apache Beam, and writes the output to BigQuery. Some records are malformed and cause the pipeline to fail. How should the engineer handle these errors to ensure the pipeline continues processing while preserving the malformed records for analysis?

A company wants to orchestrate a multi-step data processing workflow that includes calling a Cloud Run service, waiting for its completion, and then running a BigQuery query. The workflow should be serverless and integrate with Cloud Events. Which Google Cloud service should they use?

A data analyst needs to transform nested and repeated fields in BigQuery. They have a table with a column of type ARRAY<STRUCT<...>>. Which SQL function should they use to flatten the array into individual rows for analysis?

A Dataflow pipeline reads from Pub/Sub, applies a keyed stateful ParDo that uses state variables to deduplicate events based on event ID, and writes to BigQuery. During a pipeline update, some events are duplicated in BigQuery. The state is not preserved across updates. Which configuration ensures exactly-once semantics during updates?

A company wants to use dbt (data build tool) to transform data in BigQuery. They have a Cloud Storage bucket containing raw CSV files that are loaded daily into BigQuery via an external table. Which dbt feature should they use to modularize the transformation logic and handle dependencies between models?

A company is building a real-time anomaly detection pipeline using Dataflow. Events are ingested from Pub/Sub, and the pipeline must compute a sliding window average every minute over a 1-hour window. Which TWO configurations are required for this pipeline? (Choose 2)

A retail company wants to trigger a Cloud Run service whenever a new CSV file is uploaded to a specific Cloud Storage bucket. Which THREE components are needed to set up this event-driven architecture? (Choose 3)

A company is migrating on-premises Apache Kafka workloads to Google Cloud. They want to minimize changes to existing producer and consumer applications while leveraging managed services. Which TWO services should they consider? (Choose 2)

A data engineer needs to ingest on-premises Oracle CDC data into BigQuery in near real-time with minimal operational overhead. Which service should they use?

A data team wants to load millions of small JSON files (each <1 MB) from GCS into BigQuery daily with the lowest cost and fastest performance. They need exactly-once semantics and the ability to detect new files automatically. Which approach is most suitable?

A company is using Pub/Sub to ingest clickstream events and Dataflow to write to BigQuery. They observe that some events are malformed and cause the pipeline to fail. They need a solution that captures malformed events without blocking the pipeline and allows reprocessing later. Which Dataflow pattern should they implement?

A data engineer needs to transfer 500 TB of archival data from an on-premises NAS to Cloud Storage. The on-premises network has limited bandwidth (100 Mbps). Which transfer method should they recommend?

A company runs Apache Kafka on Dataproc for real-time event streaming. They want to archive the Kafka topics to Cloud Storage for long-term retention and later analysis in BigQuery. Which approach is the most cost-effective and operationally simple?

A data engineer is designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery using the Storage Write API in exactly-once mode. The pipeline must handle late-arriving data (up to 1 hour) and maintain correct aggregation results. Which trigger configuration should they use?

A company wants to use dbt to transform data in BigQuery. Their source data is loaded daily into staging tables. They need to run dbt transformations on a schedule and only process tables that have changed. Which dbt feature should they use?

A data engineer needs to create a Dataflow pipeline template that can be reused across multiple environments (dev, staging, prod) with different parameters (e.g., input Pub/Sub topic, output BigQuery table). Which template type should they use?

A company uses Eventarc to trigger a Cloud Run service when new objects appear in a GCS bucket. Recently, the Cloud Run service has been failing with 429 errors (too many requests) during high-velocity uploads. They need to handle the load without losing events. What should they do?

A data engineer needs to query a BigQuery table that contains an array of structs. They want to expand the array into separate rows for each element. Which SQL function should they use?

A company wants to migrate their on-premises Teradata data warehouse to BigQuery. They need an automated, one-time transfer of historical data (10 TB) and ongoing incremental daily syncs. Which Google Cloud service should they use?

A data engineer is using Spark on Dataproc to process a large dataset. They notice the job is slow due to excessive shuffling. They want to optimize the job by using a more efficient data structure that reduces serialization overhead and provides better memory management. Which Spark API should they use?

A company needs to stream real-time user activity data from their application into BigQuery for immediate dashboarding. They want to minimize latency (under 5 seconds) and ensure exactly-once delivery. Which TWO options should they consider? (Choose 2)

A data engineer is designing a batch processing pipeline that runs daily. The pipeline reads CSV files from GCS, transforms them using Python, and writes the results to BigQuery. They need to parameterize the pipeline for different environments and run it on a schedule. Which THREE components should they use? (Choose 3)

A company uses Workflows to orchestrate a multi-step data pipeline. One step calls an HTTP endpoint that may take up to 10 minutes, but the default Workflows timeout is too short. They also need to handle transient errors with retries. Which TWO configurations should they apply? (Choose 2)

A company wants to ingest data from an on-premises Oracle database into BigQuery in near real-time with minimal latency. The database has a high volume of inserts and updates. Which service should they use?

A data engineer needs to load a 10 GB CSV file from GCS into BigQuery. The file contains some malformed rows that should be skipped. Which approach is most efficient?

A team wants to transfer data from an on-premises Hadoop cluster to Cloud Storage for processing. The cluster is located in a remote area with limited bandwidth. They need to transfer 500 TB of data. Which service should they use?

A data pipeline uses Pub/Sub to ingest events, a Dataflow streaming pipeline to process them, and writes results to BigQuery. The pipeline must handle occasional duplicate events without causing duplicate rows in BigQuery. What is the best approach?

A company uses Google Ads and wants to automatically load their advertising data into BigQuery daily. They also need to transform the data with SQL and schedule a recurring query. Which combination of services meets these requirements with minimal operational overhead?

A data engineer needs to create a Dataflow pipeline that reads from Pub/Sub, applies a Python transformation, and writes to BigQuery. The pipeline should be reusable across environments with different parameters. Which deployment method is most appropriate?

A company uses Kafka on Dataproc to ingest streaming data. They want to process the data with Spark Structured Streaming and write results to BigQuery. The team is using Dataproc clusters. Which approach minimizes cost while maintaining performance?

Which BigQuery feature allows you to query data directly from Cloud Storage without loading it into BigQuery storage?

A company wants to build an event-driven application that processes images uploaded to a Cloud Storage bucket. The processing takes up to 10 minutes per image and should be automatically triggered. Which compute option should they use?

A Dataflow streaming pipeline is experiencing high latency and frequent OOM errors when processing variable-sized JSON messages from Pub/Sub. The team suspects that the autoscaling is not effective. Which feature should they enable to improve resource utilization?

A team needs to orchestrate a multi-step workflow that involves calling external APIs, running BigQuery queries, and conditionally executing Cloud Functions. Which Google Cloud service is best suited for this?

A company uses dbt on BigQuery to transform data. They want to run dbt models on a schedule and manage environments (dev, prod). Which GCP service should they use to run dbt jobs?

A company needs to stream data from a MySQL database to BigQuery with a latency under 10 seconds. They also need to handle schema changes automatically. Which TWO services should they combine?

A data team needs to transfer 200 TB of data from Amazon S3 to GCS. The transfer must be incremental, and they need to monitor the transfer progress. Which THREE components should they use?

A company wants to use Eventarc to trigger a Cloud Run service when new objects are created in a GCS bucket. They also need to filter events for a specific bucket and object prefix. Which THREE resources must exist or be created?

A data engineer needs to load 10 GB of CSV files from Amazon S3 into BigQuery on a daily basis. The files arrive in a specific S3 bucket at 3 AM UTC each day. Which service should be used to automate this transfer?

A financial services company receives real-time stock trade data via Pub/Sub. They need to enrich each trade with reference data from a Cloud SQL table and write the results to BigQuery for real-time analytics. The enrichment must handle late-arriving data and ensure exactly-once processing. Which Dataflow streaming pipeline configuration should be used?

A company needs to continuously synchronize customer data changes from an on-premises Oracle database to BigQuery for near-real-time analytics. The Oracle database has Change Data Capture (CDC) enabled. Which Google Cloud service should be used to stream these changes with minimal latency and schema evolution support?

A data engineer needs to move 500 TB of archival data from an on-premises Hadoop cluster to Cloud Storage. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 30 days. Which method is most cost-effective and reliable?

A company uses Pub/Sub to ingest clickstream data. Each message contains a JSON payload with a nested array of user actions. The data must be written to BigQuery, with each action in the array becoming a separate row. Which BigQuery feature or approach should be used to achieve this transformation?

A data engineer needs to orchestrate a series of tasks that include calling external APIs, running BigQuery queries, and sending notifications. The workflow involves conditional branching and parallel steps. Which Google Cloud service should be used?

A company is running a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. They notice that the number of workers is not scaling up to handle increased throughput, causing latency spikes. The pipeline uses a GlobalWindow with default triggering. What is the most likely cause of the under-scaling?

A company wants to move data from an on-premises MySQL database to BigQuery for analytics. They need to capture all changes (inserts, updates, deletes) in near real-time and also perform an initial historical load. Which approach meets these requirements with minimal operational overhead?

A data engineer is building a Dataflow pipeline that reads from BigQuery, transforms data using Apache Beam, and writes results to Cloud Storage in Avro format. They need to ensure the pipeline can be easily redeployed with different parameters without modifying code. Which deployment method should they use?

A company uses BigQuery to store event data. They need to load data from multiple sources with different schemas and expect frequent schema changes. Which approach provides the most flexibility for schema evolution while minimizing load failures and performance impact?

A company is building a data pipeline that ingests streaming data from Pub/Sub, transforms it with Dataflow, and loads it into BigQuery. They want to handle malformed messages that cannot be parsed. Which TWO actions should they implement for error handling? (Choose 2)

A data engineer needs to perform a one-time migration of 10 TB of data from on-premises Hadoop HDFS to Cloud Storage. The network link is 1 Gbps. Which TWO services or tools should they consider? (Choose 2)

A company uses Dataflow to process data with Apache Beam in Python. The pipeline reads from Pub/Sub, applies a ParDo that calls an external API for enrichment, and writes to BigQuery. The external API has rate limits and occasionally fails. To improve reliability, which THREE strategies should be implemented? (Choose 3)

A data engineer needs to schedule a recurring transfer of data from a partner's Amazon S3 bucket to a Cloud Storage bucket for further processing. Which THREE components or configurations are necessary? (Choose 3)

A company is migrating a Spark batch job from on-premises to Dataproc. The job uses RDDs for custom transformations and writes output to BigQuery. They want to optimize the job for performance and cost on Dataproc. Which THREE practices should they adopt? (Choose 3)

A data engineer needs to migrate 200 TB of on-premises Oracle data to BigQuery. The network bandwidth is limited to 100 Mbps, and the data must be loaded within 2 weeks. Which Google Cloud service is most appropriate for the initial data transfer?

A company wants to stream real-time clickstream data from a website into BigQuery for near-real-time analytics. They expect peaks of 10,000 events per second. Which combination of services is most suitable for ingestion?

A company uses Pub/Sub to ingest IoT sensor data and wants to process it with a Dataflow pipeline that uses fixed windows of 1 minute to compute average temperature. The pipeline also needs to handle malformed messages by routing them to a dead letter queue. Which TWO configurations should the engineer implement? (Choose TWO.)

A data engineer needs to schedule a recurring batch load of CSV files from an on-premises SFTP server into BigQuery. The files are generated daily and need to be loaded into a partitioned table by date. Which THREE steps should the engineer take? (Choose THREE.)

A company uses Datastream to replicate data from a MySQL database to BigQuery in near real-time. Which TWO BigQuery features are automatically used by Datastream for optimal performance and consistency? (Choose TWO.)

A data engineer needs to build a Dataflow pipeline that reads JSON messages from Pub/Sub, transforms them (including filtering, grouping, and enrichment), and writes the results to BigQuery. The pipeline must handle schema evolution in the input messages and minimize data loss. Which THREE settings or features should the engineer use? (Choose THREE.)
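
Several of the transfer questions above turn on simple bandwidth arithmetic: how long a given volume takes to cross a given link. The sketch below estimates that time. It assumes decimal units (1 TB = 10^12 bytes), a link that is fully available to the transfer, and no protocol overhead, so real transfers will take longer. The function name transfer_days and the efficiency parameter are illustrative and do not come from any Google Cloud tool.

```python
def transfer_days(size_tb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate the days needed to move size_tb terabytes over a link_mbps link.

    efficiency is a hypothetical factor (0 to 1) for the share of the link
    that the transfer can actually use; 1.0 means the whole link.
    """
    bits = size_tb * 1e12 * 8                       # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)  # bits / (bits per second)
    return seconds / 86_400                          # seconds -> days


# Scenarios taken from the question prompts on this page.
scenarios = [
    ("500 TB over 100 Mbps", 500, 100),
    ("500 TB over 1 Gbps", 500, 1_000),
    ("5 PB over 1 Gbps", 5_000, 1_000),
    ("50 TB over 10 Gbps", 50, 10_000),
]

for label, size_tb, link_mbps in scenarios:
    print(f"{label}: {transfer_days(size_tb, link_mbps):.1f} days")
```

Under these assumptions, 500 TB needs roughly 463 days at 100 Mbps and about 46 days at 1 Gbps, while 50 TB at 10 Gbps finishes in under a day. Comparing such an estimate with the deadline stated in a question is usually the first step in choosing between a network transfer and an offline option.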

Other PDE exam domains

Designing Data Processing Systems · Storing the Data · Preparing and Using Data for Analysis · Maintaining and Automating Data Workloads · Building and operationalizing data processing systems · Operationalizing machine learning models · Ensuring solution quality

Frequently asked questions

What does the Ingesting and Processing the Data domain cover on the PDE exam?

The Ingesting and Processing the Data domain covers the key concepts tested in this area of the PDE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PDE domains — no account required.

How many Ingesting and Processing the Data questions are in the PDE question bank?

The Courseiva PDE question bank contains 126 questions in the Ingesting and Processing the Data domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Ingesting and Processing the Data for PDE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Ingesting and Processing the Data questions for PDE?

Yes — the session launcher on this page draws questions exclusively from the Ingesting and Processing the Data domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
