Free PDE Maintaining and Automating Data Workloads Practice Questions (2026)

Q: How many Maintaining and Automating Data Workloads questions are on the PDE exam?

The Maintaining and Automating Data Workloads domain is one of the weighted domains on the PDE exam. The Courseiva question bank has 75 practice questions for this domain.

Q: How can I practice Maintaining and Automating Data Workloads questions for PDE?

Click any of the 75 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Maintaining and Automating Data Workloads domain.

All PDE Maintaining and Automating Data Workloads questions (75)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A data engineer uses Cloud Composer to orchestrate a daily batch pipeline. A downstream task should only start after an upstream BigQuery load job finishes successfully and a specific file appears in Cloud Storage. Which combination of operators should the engineer use in the Airflow DAG?

A company uses Dataflow streaming pipelines to process real-time events. They notice increasing system lag over time. Which two Cloud Monitoring metrics should be examined to diagnose the cause?

A data team needs to share a BigQuery dataset with another business unit. They want to provide a point-in-time snapshot of the data without incurring additional storage costs for the copy. Which BigQuery feature should they use?

An engineer needs to create a reusable Dataflow pipeline that can be executed with different parameters without modifying code. Which Dataflow feature should they use?

A company runs a Dataproc cluster for ETL jobs that process data nightly. They want to reduce costs while maintaining performance. Which strategy is MOST effective?

A data engineer needs to alert when a Pub/Sub subscription has messages older than 1 hour. Which Cloud Monitoring metric and filter should they use?

A team wants to enforce data quality rules on BigQuery tables using Dataplex. They need to run column-level checks for null values and row-level checks for value ranges on a schedule. Which Dataplex feature should they use?

An organization uses BigQuery on-demand pricing. To control costs, they want to estimate the bytes processed by a query before running it. Which command or method should they use?

A company uses Cloud Composer for pipeline orchestration. They need to define task dependencies where Task B and Task C can run in parallel after Task A, and Task D must run after both B and C complete. How should they define the DAG?

A streaming Dataflow pipeline needs to be updated without draining the existing pipeline. Which update strategy should be used?

A company wants to use Cloud DLP to inspect data in BigQuery for sensitive information and de-identify it by masking credit card numbers. They want to perform this on a schedule. Which approach should they take?

A data engineer notices that BigQuery queries are slower than expected. They want to identify the most expensive stages in the query execution. Which tool or command should they use?

A data engineer needs to migrate a schema from BigQuery where a column is currently REQUIRED and needs to become NULLABLE. Which TWO statements are correct? (Choose 2)

A company runs BigQuery workloads with varying demand. They want to use flat-rate pricing with baseline slots and the ability to burst during peak times. Which TWO actions should they take? (Choose 2)

A company uses Cloud Composer (Airflow) to orchestrate pipelines. They want to implement a pattern where a task polls for a file arrival in Cloud Storage and then triggers subsequent tasks. Which THREE Airflow concepts are essential? (Choose 3)

A data engineer is building a batch pipeline that runs daily using Cloud Composer. The pipeline has three tasks: extract data from Cloud Storage, transform data using Dataflow, and load the transformed data into BigQuery. The engineer wants to ensure that the Dataflow job only starts after the extraction task completes successfully, and the load task only starts after the Dataflow job finishes. How should the engineer define the task dependencies in the Airflow DAG?

You need to schedule a simple workflow that fetches data from an API every hour, transforms it using Cloud Functions, and writes the result to Cloud Storage. The workflow has no complex branching or retry logic beyond basic retries. Which orchestration service is the MOST cost-effective and simplest to implement?

A company runs a streaming Dataflow pipeline that reads from Pub/Sub, enriches data with a side input from BigQuery, and writes to BigQuery. After updating the pipeline code (adding a new field to the output), the engineer notices that the new pipeline version is not picking up the updated code because the job was started from a template. The engineer wants to update the streaming pipeline without draining it. What should the engineer do?

You are monitoring a streaming Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. In Cloud Monitoring, you notice that the 'system_lag' metric is increasing over time and now exceeds 10 minutes. The 'data_watermark' metric shows a steady lag. What is the most likely cause of the increasing system lag?

A company wants to share a large BigQuery dataset with a partner for analysis. The partner needs read-only access to a specific snapshot of the data as of a certain point in time, and the company wants to avoid additional storage costs for the partner. What is the most cost-effective approach?

Your organization has a BigQuery flat-rate reservation with 2000 slots. During peak hours, query performance degrades because concurrent queries exceed the available slots. You want to handle these bursts without changing the base reservation. What should you do?

A data engineer is designing a Dataflow pipeline that reads from a Kafka topic (using Pub/Sub for Kafka) and writes to BigQuery. The data schema may change over time, with new fields appearing. The engineer wants to handle schema drift automatically without failing the pipeline. Which approach should the engineer use?

You are troubleshooting a Dataproc cluster that runs nightly Spark jobs. The jobs are failing with out-of-memory errors. You want to reduce costs while fixing the issue. Which combination of actions should you take? (Select the BEST answer.)

A data engineer needs to inspect a BigQuery table for sensitive data such as credit card numbers and email addresses before sharing it with a third party. The engineer also wants to de-identify the data by masking the sensitive columns. Which Google Cloud service should be used?

A company runs a critical batch pipeline using Cloud Dataflow. The pipeline processes financial transactions and runs every hour. Recently, some runs have failed due to transient errors (e.g., network timeouts). The engineer wants to automatically retry failed runs without manual intervention. The pipeline is launched from a Cloud Composer DAG using DataflowPythonOperator. What is the BEST way to handle retries?

You are optimizing a BigQuery query that scans 1 TB of data every day. The query joins a large fact table (partitioned by date) with a small dimension table. You notice that the query always scans the entire fact table, even though you only need the last 7 days of data. Which optimization will MOST reduce the bytes scanned?

A company uses Cloud Pub/Sub for a real-time data pipeline. The subscription has a backlog of millions of messages that are not being processed quickly enough. In Cloud Monitoring, you observe that the 'subscription/num_undelivered_messages' metric is high and growing, while 'subscription/oldest_unacked_message_age' is also increasing. Which action is MOST likely to reduce the backlog?

A data engineer needs to set up a Dataplex data quality scan to run weekly on a BigQuery table. The scan should check that: (1) the 'email' column is not null, (2) the 'age' column is between 0 and 120, and (3) the 'country_code' column matches a list of valid ISO codes. Which TWO Dataplex features should the engineer use?

A company is migrating their on-premises data warehouse to BigQuery. They have a mix of batch and streaming ingestion. The data team wants to optimize query costs. Which THREE practices should they adopt?

A data engineer is building a Cloud Workflows workflow that orchestrates multiple Cloud Functions and API calls. The workflow should handle transient failures with retries and send a notification to a Pub/Sub topic if the workflow ultimately fails. Which THREE steps should the engineer include in the workflow definition?

A data engineer needs to orchestrate a complex data pipeline that involves multiple steps including data extraction from Cloud Storage, transformation using Dataflow, and loading into BigQuery. The pipeline has dependencies between tasks and requires monitoring and retries. Which Google Cloud service should be used for orchestration?

You are running a streaming pipeline with Dataflow that reads from Pub/Sub and writes to BigQuery. You notice that the system lag metric is increasing over time, indicating that messages are taking longer to process. What is the most likely cause and how should you address it?

A company uses BigQuery flat-rate pricing with 500 slots purchased as a committed use discount. During peak hours, they need additional capacity but do not want to buy more committed slots. They have a secondary project used for ad-hoc queries by analysts. How can they provide burst capacity to the primary project during peak times without increasing committed spend?

Your team uses Cloud Composer to run Apache Airflow DAGs. One DAG uses a BigQueryInsertJobOperator to run a query and then uses BigQueryCheckOperator to verify the results. The DAG is failing intermittently because the query result is not ready when the check operator runs. How should you modify the DAG to ensure the check operator runs only after the query completes successfully?

A data engineer needs to share a large BigQuery table with a different team, but wants to minimize storage costs. The table is 1 TB in size and is updated daily. The other team only needs read access to the data as of a specific point in time (e.g., end of each day). Which BigQuery feature should be used to provide a read-only copy without duplicating the entire table?

You are designing a Dataflow pipeline that reads from Pub/Sub, performs transformations, and writes to BigQuery. The pipeline must handle schema changes in the incoming data (e.g., new fields appearing). The BigQuery schema should evolve automatically to accept new fields without failing. Which approach should you use?

Which Dataflow feature allows you to package a pipeline into a reusable template that can be deployed with different parameters at runtime?

A company runs a Dataflow streaming pipeline that processes financial transactions. They need to apply a new transformation that enriches the data with a lookup from Cloud Bigtable without stopping the pipeline. The pipeline must be updated in a way that minimizes data loss and preserves exactly-once semantics. What is the recommended approach?

You are monitoring a Dataproc cluster and notice that the cluster utilization is high, but jobs are running slowly. The cluster uses preemptible workers for cost savings. What is the most likely cause of the performance degradation?

Which BigQuery feature allows you to estimate the cost of a query before running it, by returning the number of bytes that would be processed?

Your company stores sensitive customer data in Cloud Storage. You need to inspect the data for personally identifiable information (PII) and de-identify it before sharing with a third party. Which Google Cloud service should you use?

You need to set up a BigQuery reservation that provides a baseline of 500 slots for daily workloads and can automatically scale up to 1000 slots during peak times. You want to pay only for the slots used beyond the baseline. Which reservation configuration should you choose?

A data engineer needs to monitor a Pub/Sub-based streaming pipeline. Which two Cloud Monitoring metrics should be used to detect a backlog of unprocessed messages? (Choose two.)

You are configuring Dataplex data quality rules for a BigQuery table. Which three types of rules can be defined using Dataplex's SQL-based rule engine? (Choose three.)

A company runs a Dataflow pipeline that processes a high-volume data stream. They notice that the pipeline's worker CPU utilization is near 100% and the system lag is increasing. Which three actions can improve performance? (Choose three.)

You are building a data pipeline that runs daily batch jobs on Dataproc, then loads results into BigQuery. You want to orchestrate the entire workflow, including dependencies between steps, retries, and monitoring. Which Google Cloud service is most appropriate?

Your streaming Dataflow pipeline reads from Pub/Sub, enriches data with a side input, and writes to BigQuery. You need to update the enrichment logic without draining the pipeline, to minimize data loss and maintain exactly-once semantics. What should you do?

You manage a BigQuery reservation with 500 baseline slots and autoscaling up to 2000 slots. Your team runs a mix of interactive queries and batch load jobs. During peak hours, you notice that interactive queries are throttled when autoscaling slots are consumed by long-running batch loads. How can you ensure interactive queries get priority access to slots?

You want to monitor the latency of messages in a Pub/Sub subscription. Which Cloud Monitoring metric should you use to see the age of the oldest unacknowledged message?

You need to create a reusable Dataflow pipeline for transforming CSV files in Cloud Storage into Avro files in another bucket. The pipeline should be configurable via runtime parameters (e.g., input and output paths). Which approach should you use?

Your team uses Cloud Dataproc for Spark ML training jobs. You want to reduce costs for non-critical, fault-tolerant training jobs. Which Dataproc feature should you use for worker nodes?

A BigQuery table has a REQUIRED column 'user_id' that now needs to accept NULL values due to upstream data changes. You want to alter the schema with minimal downtime and no data loss. What should you do?

You need to estimate the cost of a BigQuery query before running it. Which command or feature should you use?

You are designing a data quality pipeline that must inspect PII in BigQuery tables and de-identify sensitive columns before sharing with analysts. Which GCP service should you use?

Your Dataflow streaming job is experiencing high system lag. You want to identify the root cause. Which Cloud Monitoring metrics should you examine first? (Choose the best option.)

You need to schedule a Dataproc Spark job to run at 2 AM every day, and upon completion, trigger a BigQuery load job. Which Cloud Composer operator should you use to run the Spark job?

You want to create a cost-efficient snapshot of a large BigQuery table that can be used by other teams for read-only analytics without incurring additional storage costs for the base table data. What should you use?

Your company uses Cloud Composer to orchestrate a data pipeline that includes Dataproc Spark jobs and BigQuery load operations. You need to pass the output file path from the Spark job to the next BigQuery task in the DAG. Which two mechanisms can you use to share data between tasks? (Choose TWO.)

You are designing a data pipeline that ingests streaming data from Pub/Sub, processes it with Dataflow, and writes to BigQuery. You need to ensure that schema changes in the incoming data (new fields) are handled without pipeline failure. Which THREE steps should you take? (Choose THREE.)

You are using Cloud Workflows to orchestrate a series of API calls. You need to handle errors and retries. Which THREE features of Cloud Workflows can you use? (Choose THREE.)

You are designing a Cloud Composer workflow that loads data from Cloud Storage into BigQuery, runs a Dataflow job to transform the data, and then triggers a Dataproc Spark job. After each step, you need to conditionally branch based on success or failure. Which Airflow feature allows you to pass messages between tasks to enable dynamic branching?

Your Dataflow streaming pipeline is experiencing increasing system lag over time. You have enabled autoscaling and the pipeline is using the default streaming engine. Which metric should you monitor in Cloud Monitoring to determine if the pipeline is falling behind due to slow processing or due to a bottleneck in the output sink?

You have a BigQuery table that is used by multiple teams. To save costs, you want to provide a consistent view of the data as of a specific point in time without creating full copies. Which BigQuery feature should you use?

You need to orchestrate a simple, linear workflow that calls several Cloud Functions and API endpoints sequentially with conditional logic. The workflow should be defined as code and have minimal overhead. Which GCP service should you use?

Your organization has a BigQuery flat-rate reservation with 500 slots. During peak hours, queries are queued and you need additional capacity temporarily. You want to add slots for a burst of activity without committing to a long-term purchase. What should you do?

You are running a Dataproc cluster for batch processing. The job is not latency-sensitive and you want to minimize cost. You notice that the cluster is underutilized during the job. Which configuration change would reduce costs most effectively?

A data engineer wants to quickly estimate the cost of running a BigQuery query before executing it. Which command-line tool or command should they use?

You need to inspect a BigQuery table for sensitive data such as credit card numbers and apply masking. Which GCP service should you use to identify and de-identify the data?

Your streaming Dataflow pipeline reads from Pub/Sub and writes to BigQuery. You need to update the pipeline to add a new transformation step without losing any messages or causing duplicate processing. Which TWO actions should you take? (Choose 2)

You are setting up Dataplex data quality rules for a BigQuery table. You want to define rules that check for non-null values in key columns and also validate that a column's values fall within a certain range. Which TWO rule types must you use? (Choose 2)

You need to monitor the health of a Pub/Sub subscription that feeds into a Dataflow pipeline. Which TWO Cloud Monitoring metrics are most relevant to detect if messages are not being acknowledged promptly? (Choose 2)

You want to optimize BigQuery costs for a large dataset that is frequently queried by time range. You also need to ensure that predictable workloads have dedicated slot capacity. Which TWO strategies should you combine? (Choose 2)

You have a BigQuery table with a REQUIRED column that now needs to accept NULL values. You also need to add two new nullable columns. Which THREE steps are required to achieve this schema evolution? (Choose 3)

You need to deploy a reusable Dataflow pipeline that can be executed with different parameters from Cloud Composer. Which TWO components should you use? (Choose 2)

You are building a data pipeline that ingests data from on-premises into Cloud Storage, then processes it with Dataproc, and finally loads into BigQuery. You need to schedule the pipeline to run daily. The pipeline must handle occasional failures gracefully. Which THREE Google Cloud services should you use together to achieve this? (Choose 3)

Other PDE exam domains

Designing Data Processing Systems

Ingesting and Processing the Data

Storing the Data

Preparing and Using Data for Analysis

Building and operationalizing data processing systems

Operationalizing machine learning models

Ensuring solution quality

Frequently asked questions

What does the Maintaining and Automating Data Workloads domain cover on the PDE exam?

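Judging by the questions listed on this page, the Maintaining and Automating Data Workloads domain covers pipeline orchestration with Cloud Composer and Cloud Workflows, monitoring and updating Dataflow and Pub/Sub pipelines, BigQuery cost and slot management, Dataproc cost optimization, Dataplex data quality checks, and inspecting and de-identifying sensitive data. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PDE domains, with no account required.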

How many Maintaining and Automating Data Workloads questions are in the PDE question bank?

The Courseiva PDE question bank contains 75 questions in the Maintaining and Automating Data Workloads domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Maintaining and Automating Data Workloads for PDE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Maintaining and Automating Data Workloads questions for PDE?

Yes — the session launcher on this page draws questions exclusively from the Maintaining and Automating Data Workloads domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
