CCNA Pde Maintaining Automating Questions

75 questions · Pde Maintaining Automating topic · All types, answers revealed

1
Multi-Select · hard

You are setting up Dataplex data quality rules for a BigQuery table. You want to define rules that check for non-null values in key columns and also validate that a column's values fall within a certain range. Which TWO rule types must you use? (Choose 2)

Select 2 answers
A. Table rule (e.g., row count)
B. Row rule (e.g., not null)
C. Partition rule
D. Column rule (e.g., value range)
E. Custom SQL rule
Answers: B, D

Row rules can enforce null checks on specific columns.

Why this answer

Dataplex data quality rules include row rules (for null checks) and column rules (for range or value checks). Table rules apply to the entire table; custom SQL can be used but row and column rules are the standard.

2
MCQ · hard

A data engineer needs to share a large BigQuery table with a different team, but wants to minimize storage costs. The table is 1 TB in size and is updated daily. The other team only needs read access to the data as of a specific point in time (e.g., end of each day). Which BigQuery feature should be used to provide a read-only copy without duplicating the entire table?

A. Table clone
B. Authorized views
C. Time travel using FOR SYSTEM_TIME AS OF
D. Table snapshot
Answer: D

Snapshots provide point-in-time read-only copies and share storage with the base table, minimizing cost.

Why this answer

A BigQuery table clone is a lightweight, writable copy that initially shares storage with the base table and incurs storage charges only for data that later differs from the base table. A table snapshot is read-only and also shares storage: it is billed only for data it retains that the base table has since changed or deleted. For read-only point-in-time access, a snapshot is the better fit because it is immutable and cost-effective.
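As a rough illustration of the end-of-day snapshot idea, the sketch below builds a `CREATE SNAPSHOT TABLE ... CLONE ...` statement in Python. The dataset names, the `<table>_YYYYMMDD` naming scheme, and the seven-day expiration are all assumptions made for the example, not details from the question.

```python
from datetime import date, timedelta


def daily_snapshot_ddl(base_table: str, snapshot_dataset: str,
                       day: date, keep_days: int = 7) -> str:
    """Build a CREATE SNAPSHOT TABLE statement for one end-of-day snapshot."""
    table_name = base_table.split(".")[-1]
    # Hypothetical naming scheme: <table>_YYYYMMDD in a separate dataset.
    snapshot = f"{snapshot_dataset}.{table_name}_{day:%Y%m%d}"
    # Assumed retention: the snapshot expires keep_days after it is taken.
    expires = day + timedelta(days=keep_days)
    return (
        f"CREATE SNAPSHOT TABLE `{snapshot}`\n"
        f"CLONE `{base_table}`\n"
        f"OPTIONS (expiration_timestamp = "
        f"TIMESTAMP '{expires:%Y-%m-%d} 00:00:00 UTC')"
    )


ddl = daily_snapshot_ddl("sales.orders", "sales_snapshots", date(2024, 1, 31))
print(ddl)
```

The other team would then be granted read access to the snapshot dataset only, leaving the base table free to keep changing.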

3
MCQ · easy

You need to estimate the cost of a BigQuery query before running it. Which command or feature should you use?

A. Check the BigQuery jobs list for similar queries.
B. Use the BigQuery cache to estimate if the query is cached.
C. Run EXPLAIN on the query to see the query plan.
D. Use the bq command with the --dry_run flag.
Answer: D

Dry run estimates the bytes processed, allowing cost estimation without running the query.

Why this answer

Option D is correct because the `bq` command with the `--dry_run` flag allows you to estimate the amount of data a BigQuery query will process before actually executing it. This dry run does not read any data or incur charges; it simply returns the estimated bytes to be processed, which you can use to calculate the cost based on BigQuery's pricing model.
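To show how the byte estimate turns into a cost figure, here is a minimal sketch of the arithmetic. The 250 GiB figure and the $6.25 per TiB rate are assumptions for illustration; the real on-demand rate depends on region and current pricing.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def estimate_query_cost(bytes_processed: int, usd_per_tib: float) -> float:
    """Convert a dry-run byte estimate into an approximate on-demand cost."""
    return bytes_processed / TIB * usd_per_tib


# Hypothetical inputs: a dry run reporting 250 GiB and an assumed $6.25/TiB rate.
dry_run_bytes = 250 * 2 ** 30
cost = estimate_query_cost(dry_run_bytes, usd_per_tib=6.25)
print(f"~${cost:.2f}")
```

The function ignores details such as minimum billing amounts and free-tier allowances, so treat the result as an upper-level estimate rather than an invoice figure.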

Exam trap

The exam often tests the distinction between tools that estimate cost and tools that analyze query execution, so the trap here is confusing the EXPLAIN command (which shows the query plan) with the `--dry_run` flag (which estimates bytes processed and cost).

How to eliminate wrong answers

Option A is wrong because checking the BigQuery jobs list for similar queries only gives you historical cost data, not an estimate for the specific query you are about to run, and it assumes a similar query exists. Option B is wrong because the BigQuery cache stores results of previously run queries, but it does not provide an estimate of cost or data processed; it only indicates whether results might be served from cache. Option C is wrong because running EXPLAIN on the query shows the query plan and execution steps, but it does not provide a cost estimate or the amount of data that will be scanned.

4
Multi-Select · hard

You are designing a data pipeline that ingests streaming data from Pub/Sub, processes it with Dataflow, and writes to BigQuery. You need to ensure that schema changes in the incoming data (new fields) are handled without pipeline failure. Which THREE steps should you take? (Choose THREE.)

Select 3 answers
A. Set the BigQuery table schema to allow automatic addition of new fields by using 'ignore_unknown_values'.
B. Configure the BigQuery table to require all fields, causing the pipeline to fail on unknown fields.
C. Use ALTER TABLE ADD COLUMN DDL statements to add nullable columns for new fields.
D. Configure the Dataflow pipeline to update the BigQuery table schema when new fields are detected.
E. Set all columns in the BigQuery table as NULLABLE from the start.
Answers: A, C, D

This option tells BigQuery to ignore fields that are not in the table schema, so writes that contain new fields do not fail.

Why this answer

To handle schema drift, BigQuery allows nullable columns to be added to an existing table via DDL. Dataflow can update the destination table schema through the BigQuery APIs when it detects new fields. Setting 'ignore_unknown_values' keeps writes from failing on fields the table does not yet have; values for those fields are dropped until the column has been added.

Setting all fields to NULLABLE in advance is not practical. Setting the table to require all fields causes failures on unknown fields.
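One way to picture the "detect new fields, then add nullable columns" step is the sketch below. It compares an incoming record with the set of known column names and emits `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements. The table name, the sample record, and the small type mapping are hypothetical; a real pipeline would need a fuller mapping and would run the statements through a BigQuery client.

```python
def new_column_ddl(table: str, known_columns: set, record: dict) -> list:
    """Return ALTER TABLE statements for record fields the table lacks."""
    # Hypothetical, deliberately small Python-to-BigQuery type mapping.
    type_map = {bool: "BOOL", int: "INT64", float: "FLOAT64", str: "STRING"}
    statements = []
    for name, value in record.items():
        if name in known_columns:
            continue
        # Unmapped Python types fall back to STRING in this sketch.
        bq_type = type_map.get(type(value), "STRING")
        # Columns added this way are NULLABLE, so existing rows stay valid.
        statements.append(
            f"ALTER TABLE `{table}` ADD COLUMN IF NOT EXISTS {name} {bq_type}"
        )
    return statements


known = {"event_id", "user_id"}
incoming = {"event_id": "e1", "user_id": 42, "country": "DE", "score": 0.9}
statements = new_column_ddl("analytics.events", known, incoming)
for stmt in statements:
    print(stmt)
```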

5
MCQ · hard

A company runs a streaming Dataflow pipeline that reads from Pub/Sub, enriches data with a side input from BigQuery, and writes to BigQuery. After updating the pipeline code (adding a new field to the output), the engineer notices that the new pipeline version is not picking up the updated code because the job was started from a template. The engineer wants to update the streaming pipeline without draining it. What should the engineer do?

A. Use the gcloud dataflow jobs update command with the new Flex Template.
B. Stop the pipeline, update the template, and restart with the same job name.
C. Modify the original template and redeploy it as a new job with the same pipeline name.
D. Use the gcloud dataflow jobs drain command, then restart with the new template.
Answer: A

Dataflow supports updating a running streaming job from a Flex Template by specifying --update and the job ID. This allows code changes without draining.

Why this answer

Option A is correct because the `gcloud dataflow jobs update` command allows you to update a running streaming Dataflow pipeline with a new Flex Template without draining or stopping the job. This command performs an in-place update, preserving the job's state and checkpointing, so the pipeline continues processing with the new code. Since the original job was started from a template, using this command with the new Flex Template ensures the updated code is picked up seamlessly.

Exam trap

The exam often tests the misconception that you must drain or stop a streaming pipeline to update it, but the `gcloud dataflow jobs update` command is specifically designed for in-place updates of streaming jobs started from templates.

How to eliminate wrong answers

Option B is wrong because stopping the pipeline and restarting with the same job name would cause data loss or duplication due to the loss of checkpointing state, and it violates the requirement to update without draining. Option C is wrong because modifying the original template and redeploying as a new job with the same pipeline name does not update the running job; it creates a separate job, and Dataflow does not allow two jobs with the same name to run concurrently. Option D is wrong because draining the job (using `gcloud dataflow jobs drain`) gracefully stops the pipeline, which contradicts the requirement to update without draining; after draining, you would need to restart, which is not an in-place update.

6
MCQ · hard

You are running a Dataproc cluster for batch processing. The job is not latency-sensitive and you want to minimize cost. You notice that the cluster is underutilized during the job. Which configuration change would reduce costs most effectively?

A. Resize the cluster to use larger machines
B. Switch to single-node cluster
C. Use preemptible workers for the worker nodes
D. Enable autoscaling
Answer: C

Preemptible workers significantly reduce costs and are ideal for batch jobs that can tolerate interruptions.

Why this answer

Using preemptible workers in Dataproc reduces cost by about 60-80% compared to standard VMs. Because they can be terminated at any time, they suit fault-tolerant batch jobs that can absorb interruptions, such as this non-latency-sensitive workload.

7
MCQ · medium

You are designing a data quality pipeline that must inspect PII in BigQuery tables and de-identify sensitive columns before sharing with analysts. Which GCP service should you use?

A. Dataplex
B. Cloud Data Catalog
C. Cloud DLP
D. Dataflow
Answer: C

Cloud DLP inspects and de-identifies sensitive data in BigQuery, Cloud Storage, and other sources.

Why this answer

Cloud DLP (Data Loss Prevention) is the correct choice because it is purpose-built for inspecting, classifying, and de-identifying sensitive data such as PII. It integrates natively with BigQuery via inspection jobs and de-identification templates, allowing you to scan tables for over 150 built-in infoTypes (e.g., email, SSN) and apply transformations like masking, tokenization, or encryption before sharing data with analysts.

Exam trap

The trap here is that candidates often confuse Dataplex's data governance features (like policy tags and metadata) with actual de-identification, but Dataplex cannot transform data—it only applies access controls, whereas Cloud DLP performs the actual masking or tokenization of sensitive values.

How to eliminate wrong answers

Option A is wrong because Dataplex is a data fabric service for managing, governing, and cataloging data across lakes and warehouses, but it does not perform de-identification or PII inspection itself; it can integrate with Cloud DLP for such tasks but is not the primary tool. Option B is wrong because Cloud Data Catalog is a metadata management service for discovering and tagging assets, but it lacks native de-identification capabilities and cannot transform sensitive data. Option D is wrong because Dataflow is a stream/batch processing service that can be used to build custom de-identification pipelines, but it requires manual implementation of DLP logic and is not the out-of-the-box service for inspecting and de-identifying PII in BigQuery tables.

8
MCQ · easy

You need to schedule a simple workflow that fetches data from an API every hour, transforms it using Cloud Functions, and writes the result to Cloud Storage. The workflow has no complex branching or retry logic beyond basic retries. Which orchestration service is the MOST cost-effective and simplest to implement?

A. Cloud Scheduler
B. Workflows
C. Cloud Composer
D. Dataflow
Answer: B

Workflows is serverless, cost-effective (pay per execution), and sufficient for simple linear workflows with basic retries.

Why this answer

Workflows is serverless, pay-per-execution, and defined in YAML/JSON. It integrates natively with Cloud Functions and Cloud Storage. Cloud Composer is overkill for simple linear workflows and incurs cluster costs.

Cloud Scheduler alone cannot orchestrate multiple steps. Dataflow is for data processing, not orchestration.

9
MCQ · hard

You need to set up a BigQuery reservation that provides a baseline of 500 slots for daily workloads and can automatically scale up to 1000 slots during peak times. You want to pay only for the slots used beyond the baseline. Which reservation configuration should you choose?

A. Create a reservation with 500 committed slots and another with 500 flex slots.
B. Use on-demand pricing with a maximum query cost limit.
C. Purchase 1000 committed use slots to ensure consistent capacity.
D. Create a reservation with 500 baseline slots and enable autoscaling up to 1000 slots.
Answer: D

This configuration provides baseline committed slots and on-demand autoscaling.

Why this answer

Option D is correct because BigQuery reservations support baseline slots with autoscaling: you set a baseline of 500 slots and let the reservation scale up to 1000 slots during peak demand. The baseline is always available, and the additional autoscaled slots are billed only while they are in use, which matches the requirement of paying only for slots used beyond the baseline.
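A simplified model of this billing behaviour can be sketched as follows. It assumes the baseline is charged for every hour and that autoscaled slots are charged only for the demand above the baseline, capped at the reservation maximum. The hourly demand figures are invented, and the model ignores real-world details such as scaling increments and minimum billing periods.

```python
def billed_slot_hours(hourly_demand, baseline=500, max_slots=1000):
    """Split billed slot-hours into baseline and autoscaled parts."""
    # Assumption: the baseline is paid for in every hour, used or not.
    baseline_hours = baseline * len(hourly_demand)
    # Autoscaled slots cover demand above the baseline, up to the cap.
    autoscale_hours = sum(
        min(max(demand - baseline, 0), max_slots - baseline)
        for demand in hourly_demand
    )
    return baseline_hours, autoscale_hours


demand = [300, 450, 900, 1200, 600, 200]  # hypothetical slots needed per hour
base, burst = billed_slot_hours(demand)
print(base, burst)
```

In this made-up day, only three of the six hours trigger any autoscaling charge, which is the contrast with buying 1000 slots outright.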

Exam trap

The exam often tests the distinction between flex slots (pre-purchased, always billed) and autoscaling (pay-per-use beyond baseline), leading candidates to incorrectly choose flex slots for variable workloads.

How to eliminate wrong answers

Option A is wrong because creating two separate reservations (500 committed + 500 flex) does not provide automatic scaling; flex slots are pre-purchased capacity, not on-demand, so you would pay for the full 500 flex slots regardless of usage. Option B is wrong because on-demand pricing does not use slots or reservations; it charges per byte processed and has no concept of baseline or autoscaling, and a maximum query cost limit only caps spending, not capacity. Option C is wrong because purchasing 1000 committed slots forces you to pay for all 1000 slots at all times, even when only 500 are needed, which does not meet the requirement of paying only for slots used beyond the baseline.

10
MCQ · medium

Your Dataflow streaming job is experiencing high system lag. You want to identify the root cause. Which Cloud Monitoring metrics should you examine first? (Choose the best option.)

A. Data freshness and worker CPU utilization
B. System lag and worker CPU utilization
C. Element count and data freshness
D. Backlog bytes and system lag
Answer: B

High system lag combined with high CPU suggests workers are bottlenecked; low CPU may indicate other issues.

Why this answer

For streaming Dataflow jobs, system lag is the maximum time an item of data has been waiting to be processed. High system lag typically indicates that the pipeline cannot keep up with the input rate. Worker CPU utilization is the key metric for checking whether workers are overloaded.

Data freshness measures how far the output trails real time rather than how long items wait for processing, so it is not the first metric to pair with a system-lag symptom. Element count alone doesn't show lag. Backlog bytes can be useful, but worker CPU is a more direct indicator of processing capacity issues.
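The reasoning above (lag plus high CPU points at overloaded workers, lag with low CPU points elsewhere) can be written as a small triage helper. The thresholds of 60 seconds and 80% CPU are hypothetical values chosen for the sketch, not recommended settings.

```python
def triage_streaming_job(system_lag_s: float, cpu_utilization: float,
                         lag_threshold_s: float = 60.0,
                         cpu_threshold: float = 0.8) -> str:
    """Classify a streaming job from system lag and mean worker CPU."""
    if system_lag_s < lag_threshold_s:
        return "healthy"
    if cpu_utilization >= cpu_threshold:
        return "workers CPU-bound: add workers or use larger machines"
    return "lag without CPU pressure: look at sources and sinks"


print(triage_streaming_job(system_lag_s=300, cpu_utilization=0.95))
print(triage_streaming_job(system_lag_s=300, cpu_utilization=0.20))
```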

11
Multi-Select · hard

A company runs a Dataflow pipeline that processes a high-volume data stream. They notice that the pipeline's worker CPU utilisation is near 100% and the system lag is increasing. Which three actions can improve performance? (Choose three.)

Select 3 answers
A. Increase the worker disk size.
B. Increase the number of workers.
C. Use batch processing instead of streaming.
D. Enable Dataflow Streaming Engine.
E. Use higher-CPU machine types (e.g., n2-highcpu).
Answers: B, D, E

More workers distribute the load and reduce CPU per worker.

Why this answer

Option B is correct because increasing the number of workers distributes the processing load across more parallel workers, reducing CPU utilization per worker and allowing the pipeline to keep up with the incoming data stream. This directly addresses both high CPU usage and increasing system lag by scaling out horizontally.

Exam trap

The exam often tests the misconception that increasing disk size (Option A) can improve processing performance, when in fact it only addresses storage bottlenecks, not CPU or lag issues.

12
Multi-Select · medium

Your streaming Dataflow pipeline reads from Pub/Sub and writes to BigQuery. You need to update the pipeline to add a new transformation step without losing any messages or causing duplicate processing. Which TWO actions should you take? (Choose 2)

Select 2 answers
A. Use the Dataflow update command with the same pipeline name and the new template.
B. Take a snapshot of the pipeline before updating.
C. Stop the pipeline, modify it, then restart with a new name.
D. Drain the pipeline before making changes.
E. Cancel the pipeline and create a new one.
Answers: A, B

Updating a running pipeline with the same name is the recommended way to apply changes without draining.

Why this answer

To update a streaming pipeline without draining, you can use the `--update` flag with the new pipeline. The snapshot feature allows you to restore state if needed. Draining would stop the pipeline; canceling would lose messages.

13
MCQ · easy

You need to schedule a Dataproc Spark job to run at 2 AM every day, and upon completion, trigger a BigQuery load job. Which Cloud Composer operator should you use to run the Spark job?

A. DataflowPythonOperator
B. BigQueryOperator
C. DataprocClusterCreateOperator
D. DataprocSubmitJobOperator
Answer: D

This operator submits a job (e.g., Spark, PySpark) to an existing Dataproc cluster.

Why this answer

Option D is correct because the DataprocSubmitJobOperator is specifically designed to submit a job (e.g., a Spark job) to an existing Dataproc cluster. In this scenario, you need to run a Spark job on a scheduled basis, and Cloud Composer (Airflow) provides this operator to submit the job to Dataproc. After the Spark job completes, you can chain a BigQuery load operator to trigger the load, matching the requirement exactly.
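For orientation, the sketch below builds the kind of job dictionary that a Dataproc submit-job task takes, plus the cron expression for 2 AM daily. It is plain Python so it runs without Airflow installed. The project, cluster, class, and jar names are hypothetical, and the dictionary layout follows the Dataproc job resource as commonly shown in Airflow examples, so check it against the provider version in use.

```python
def spark_job_payload(project_id: str, cluster_name: str,
                      main_class: str, jar_uri: str) -> dict:
    """Build the job dict that a Dataproc submit-job task receives."""
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "spark_job": {"main_class": main_class, "jar_file_uris": [jar_uri]},
    }


DAILY_2AM = "0 2 * * *"  # cron: minute 0, hour 2, every day

job = spark_job_payload(
    "my-project", "etl-cluster",            # hypothetical project and cluster
    "com.example.NightlyEtl",               # hypothetical Spark main class
    "gs://my-bucket/jars/nightly-etl.jar",  # hypothetical jar location
)
# In a DAG this dict would be passed as DataprocSubmitJobOperator(job=job, ...),
# with the DAG schedule set to DAILY_2AM and the BigQuery load task downstream.
print(job["spark_job"]["main_class"])
```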

Exam trap

The trap here is that candidates confuse operators that manage cluster lifecycle (like DataprocClusterCreateOperator) with operators that submit jobs, or they mistakenly think DataflowPythonOperator can run Spark jobs because both are data processing frameworks.

How to eliminate wrong answers

Option A is wrong because DataflowPythonOperator is used to run Apache Beam pipelines on Dataflow, not Spark jobs on Dataproc. Option B is wrong because BigQueryOperator is used to execute BigQuery SQL queries or load jobs, not to run Spark jobs. Option C is wrong because DataprocClusterCreateOperator is used to create a new Dataproc cluster, not to submit a job to an existing cluster; the question assumes the cluster already exists or is managed separately, and the focus is on submitting the Spark job.

14
Multi-Select · hard

A company runs BigQuery workloads with varying demand. They want to use flat-rate pricing with baseline slots and the ability to burst during peak times. Which TWO actions should they take? (Choose 2)

Select 2 answers
A. Use on-demand pricing for bursting
B. Use flex slots for short-term bursts
C. Set a maximum number of slots per query
D. Purchase committed use reservations for baseline slots
E. Create a reservation with baseline + autoscaling slots
Answers: B, D

Flex slots are ideal for handling peak demand without commitment.

Why this answer

Committed use reservations provide baseline slots at a discount. Flex slots allow bursting for short periods without long-term commitment.

15
MCQ · hard

You manage a BigQuery reservation with 500 baseline slots and autoscaling up to 2000 slots. Your team runs a mix of interactive queries and batch load jobs. During peak hours, you notice that interactive queries are throttled when autoscaling slots are consumed by long-running batch loads. How can you ensure interactive queries get priority access to slots?

A. Create a separate reservation for interactive queries with a higher priority assignment.
B. Reduce the baseline slots to 200 and rely solely on autoscaling.
C. Switch to on-demand pricing to eliminate slot contention.
D. Set the autoscaling max to 1000 slots for batch jobs.
Answer: A

Creating separate reservations for interactive and batch workloads allows you to control slot allocation and prioritize interactive queries.

Why this answer

Option A is correct because BigQuery reservations allow you to create separate reservations for different workloads (e.g., interactive queries vs. batch loads) and assign them different priority levels. By creating a dedicated reservation for interactive queries with a higher priority, you ensure that interactive queries get preferential access to slots, even when autoscaling slots are consumed by long-running batch jobs. This directly addresses the contention issue without reducing overall capacity.

Exam trap

The exam often tests the misconception that autoscaling alone or reducing baseline slots can solve priority issues, but the key is that without separate reservations and explicit priority assignments, all jobs compete equally for the same pool of slots.

How to eliminate wrong answers

Option B is wrong because reducing baseline slots to 200 and relying solely on autoscaling does not solve the priority issue; autoscaling slots are shared and batch jobs could still consume them, leading to the same throttling of interactive queries. Option C is wrong because switching to on-demand pricing eliminates slot reservations entirely, meaning you lose the ability to guarantee capacity or prioritize workloads, and you may face unpredictable performance and higher costs. Option D is wrong because setting the autoscaling max to 1000 slots for batch jobs does not prevent batch jobs from consuming all available slots; it only limits the maximum they can use, but without priority assignment, interactive queries can still be throttled if batch jobs fill the reservation.

16
MCQ · hard

Your Dataflow streaming pipeline is experiencing increasing system lag over time. You have enabled autoscaling and the pipeline is using the default streaming engine. Which metric should you monitor in Cloud Monitoring to determine if the pipeline is falling behind due to slow processing or due to a bottleneck in the output sink?

A. Worker CPU utilization
B. System lag
C. Element count
D. Data freshness
Answer: B

System lag measures the maximum time data is waiting to be processed. High system lag indicates the pipeline is falling behind due to processing bottlenecks.

Why this answer

In Dataflow streaming pipelines, 'System lag' represents the maximum time that an item has been waiting to be processed. If system lag is high, it indicates that the pipeline cannot keep up with the input rate. However, to pinpoint if the bottleneck is the sink, you should also monitor 'Data freshness' (time since last output) or write to a staging area temporarily.

The question, however, asks for one metric that shows the pipeline is falling behind. That metric is 'System lag': it directly reflects processing delay, and if it keeps increasing, the pipeline is not processing fast enough regardless of the sink.

17
MCQ · medium

Your streaming Dataflow pipeline reads from Pub/Sub, enriches data with a side input, and writes to BigQuery. You need to update the enrichment logic without draining the pipeline, to minimize data loss and maintain exactly-once semantics. What should you do?

A. Cancel the pipeline and create a new one with the updated code.
B. Stop the pipeline, update the code, and restart from the latest snapshot.
C. Use the Dataflow job update mechanism to replace the pipeline with a new version.
D. Drain the pipeline, update the code, and restart with the same job ID.
Answer: C

Dataflow allows updating a streaming pipeline with a new job graph, preserving state and exactly-once processing.

Why this answer

Option C is correct because the Dataflow job update mechanism allows you to replace a running pipeline's code with a new version without draining or stopping it, preserving the existing state and minimizing data loss. This mechanism supports exactly-once semantics by ensuring that all in-flight elements are processed exactly once, even after the update, by maintaining the pipeline's checkpoint and watermark state.

Exam trap

The trap here is that candidates often confuse the Dataflow job update mechanism with draining or snapshot-based restarts, not realizing that Dataflow's update feature is specifically designed to allow in-place code changes without data loss or reprocessing.

How to eliminate wrong answers

Option A is wrong because canceling the pipeline would discard all in-flight data and state, leading to data loss and violating exactly-once semantics. Option B is wrong because stopping the pipeline and restarting from a snapshot is not a supported operation in Dataflow; snapshots are used for draining or saving state, but restarting from a snapshot does not guarantee exactly-once processing and can cause data duplication or loss. Option D is wrong because draining the pipeline would allow it to finish processing all existing data before stopping, but then you must create a new pipeline with a new job ID; restarting with the same job ID is not possible after draining, and the drain process itself can cause data loss if not handled correctly.

18
MCQ · hard

A company runs a Dataflow streaming pipeline that processes financial transactions. They need to apply a new transformation that enriches the data with a lookup from Cloud Bigtable without stopping the pipeline. The pipeline must be updated in a way that minimises data loss and preserves exactly-once semantics. What is the recommended approach?

A. Use the Dataflow update option with the same pipeline name and new version, ensuring the transform is backward compatible.
B. Drain the pipeline first, then start a new pipeline with the updated code.
C. Create a new pipeline in parallel and switch the Pub/Sub subscription to the new pipeline.
D. Stop the pipeline, update the code, and restart with a new pipeline name.
Answer: A

Updating preserves state and exactly-once semantics.

Why this answer

Dataflow supports updating a streaming pipeline without draining by replacing the running job with a new version. Launching the new version with the --update flag and the same pipeline name upgrades the pipeline while preserving its state and exactly-once semantics.
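As a sketch of what "the --update flag with the same pipeline name" looks like when launching from code, the helper below assembles the launch arguments as a list of strings. The flag names follow the Apache Beam Python SDK's Dataflow options as I understand them (`--update`, `--job_name`, `--transform_name_mapping`); treat them as assumptions to verify against the SDK version in use. The job name, region, and transform names are hypothetical.

```python
import json


def update_job_args(job_name: str, region: str, renamed_transforms=None) -> list:
    """Build launch arguments that replace a running job of the same name."""
    args = [
        "--runner=DataflowRunner",
        f"--region={region}",
        f"--job_name={job_name}",  # must match the running job's name
        "--update",
    ]
    if renamed_transforms:
        # Maps old transform names to new ones so their state can carry over.
        args.append(f"--transform_name_mapping={json.dumps(renamed_transforms)}")
    return args


args = update_job_args(
    "txn-enrich", "us-central1",
    renamed_transforms={"Enrich": "EnrichFromBigtable"},
)
print(args)
```

A list like this could be handed to the pipeline's options parser; the mapping argument only matters when a stateful transform was renamed between versions.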

19
MCQ · medium

You are troubleshooting a Dataproc cluster that runs nightly Spark jobs. The jobs are failing with out-of-memory errors. You want to reduce costs while fixing the issue. Which combination of actions should you take? (Select the BEST answer.)

A. Use SSD persistent disks for all nodes to improve I/O performance.
B. Decrease the number of worker nodes and increase the size of each worker.
C. Switch to a high-memory machine type for the master node only.
D. Increase the number of preemptible worker nodes and use standard machine types for the master.
Answer: D

More workers increase parallelism and memory capacity; preemptible workers are cost-effective. The master node remains standard to ensure stability.

Why this answer

Preemptible workers are cheaper and can be used for worker nodes, but not for master nodes. Adding more preemptible workers can increase parallelism and reduce memory pressure per worker, but may cause more preemptions. Using high-memory master nodes is not necessary for worker memory issues.

Using SSDs for scratch storage can improve performance but does not directly address OOM. Reducing worker count would exacerbate the problem.

20
MCQ · medium

Your organization has a BigQuery flat-rate reservation with 500 slots. During peak hours, queries are queued and you need additional capacity temporarily. You want to add slots for a burst of activity without committing to a long-term purchase. What should you do?

A. Switch to on-demand pricing
B. Use flex slots
C. Create a secondary reservation with autoscaling
D. Purchase additional committed use reservations
Answer: B

Flex slots provide temporary capacity on an hourly basis, perfect for bursting.

Why this answer

Flex slots are short-term, hourly commitments that can be added to an existing flat-rate reservation to handle bursts. They are ideal for temporary capacity needs.

21
MCQ · medium

A team wants to enforce data quality rules on BigQuery tables using Dataplex. They need to run column-level checks for null values and row-level checks for value ranges on a schedule. Which Dataplex feature should they use?

A. Dataplex Data Profiling
B. BigQuery stored procedures with scheduled queries
C. Dataplex Data Quality Tasks
D. Cloud DLP inspection jobs
Answer: C

Data Quality Tasks accept custom SQL rules and can be scheduled.

Why this answer

Dataplex Data Quality Tasks allow defining SQL-based rules (row and column) and scheduling scans.

22
Multi-Select · medium

A data engineer needs to monitor a Pub/Sub-based streaming pipeline. Which two Cloud Monitoring metrics should be used to detect a backlog of unprocessed messages? (Choose two.)

Select 2 answers
A. subscription/oldest_unacked_message_age
B. topic/byte_cost
C. subscription/num_undelivered_messages
D. topic/send_request_count
E. subscription/ack_message_count
Answers: A, C

This metric shows the age of the oldest unacknowledged message, indicating backlog depth.

Why this answer

The 'subscription/num_undelivered_messages' metric shows the number of messages not yet acknowledged, and 'subscription/oldest_unacked_message_age' indicates how long messages have been waiting. Both help detect backlog.
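The two metrics can be combined into a simple backlog check, sketched below. The filter builder uses the full `pubsub.googleapis.com/subscription/...` metric types; the subscription ID and both alert thresholds are hypothetical and would need tuning for a real pipeline.

```python
PUBSUB = "pubsub.googleapis.com/subscription"


def backlog_filter(metric: str, subscription_id: str) -> str:
    """Build a Cloud Monitoring filter for one subscription metric."""
    return (
        f'metric.type = "{PUBSUB}/{metric}" AND '
        f'resource.labels.subscription_id = "{subscription_id}"'
    )


def has_backlog(num_undelivered: int, oldest_unacked_age_s: float,
                max_messages: int = 10_000, max_age_s: float = 300.0) -> bool:
    """Flag a backlog when either metric crosses its (hypothetical) limit."""
    return num_undelivered > max_messages or oldest_unacked_age_s > max_age_s


print(backlog_filter("num_undelivered_messages", "orders-sub"))
print(has_backlog(num_undelivered=50, oldest_unacked_age_s=900.0))
```

Checking both values matters because a small number of very old messages is a different symptom from a large number of fresh ones.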

23
MCQ · hard

A data engineer is designing a Dataflow pipeline that reads from a Kafka topic (using Pub/Sub for Kafka) and writes to BigQuery. The data schema may change over time, with new fields appearing. The engineer wants to handle schema drift automatically without failing the pipeline. Which approach should the engineer use?

A. Use a Dataflow side input that reads the latest schema from a file and updates the BigQuery schema accordingly.
B. Define a UDF in Dataflow that dynamically adjusts the output schema.
C. Store the entire record as a single JSON string column in BigQuery and parse it later.
D. Configure the Dataflow pipeline to use BigQuery's schema autodetect option for each insert.
Answer: C

This is a common pattern: store raw data in a JSON column (with a flexible schema), and handle schema evolution by adding new fields as nested columns or using SQL to extract them later.

Why this answer

Option C is correct because storing the entire record as a single JSON string column in BigQuery allows the pipeline to accept any schema changes without requiring schema modifications at write time. This approach decouples the ingestion from schema evolution, enabling the data to be parsed later using BigQuery's JSON functions (e.g., JSON_EXTRACT) or by loading into a separate schema-on-read layer. It avoids pipeline failures caused by mismatched fields or type changes.
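The decoupling described above can be illustrated with a small sketch: every record, whatever its fields, is wrapped into the same two-column row, so the table schema never has to change at write time. The column names `ingested_at` and `payload` are hypothetical choices for the example.

```python
import json
from datetime import datetime, timezone


def to_raw_row(record: dict) -> dict:
    """Wrap any record in a fixed two-column row, whatever its fields."""
    return {
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # The whole record is stored as one JSON string.
        "payload": json.dumps(record, sort_keys=True),
    }


old = to_raw_row({"id": 1, "amount": 9.5})
new = to_raw_row({"id": 2, "amount": 3.0, "currency": "EUR"})  # new field appears
print(old.keys() == new.keys())
```

The cost of this design is that typing and validation move to query time, where the JSON functions extract the fields that analysts need.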

Exam trap

The exam often tests the misconception that BigQuery's schema autodetect works with streaming inserts, but it is only available for batch load jobs, leading candidates to incorrectly choose option D.

How to eliminate wrong answers

Option A is wrong because using a side input to read a schema file introduces external dependency and latency; the schema update would not be atomic with the incoming data, and the pipeline would still need to handle schema mismatches during the window when the file is being updated. Option B is wrong because a UDF in Dataflow operates on individual elements but cannot alter the BigQuery output table schema dynamically; the output schema must be fixed at pipeline construction time, so a UDF cannot add new columns to BigQuery on the fly. Option D is wrong because BigQuery's schema autodetect option is only available for load jobs (e.g., from GCS) and is not supported for streaming inserts via the Storage Write API or tabledata.insertAll; even if it were, autodetect would fail on the first record with a new field if the table schema is not updated first.

24
Multi-Select · medium

You need to monitor the health of a Pub/Sub subscription that feeds into a Dataflow pipeline. Which TWO Cloud Monitoring metrics are most relevant to detect if messages are not being acknowledged promptly? (Choose 2)

Select 2 answers
A. subscription/num_outstanding_messages
B. subscription/oldest_unacked_message_age
C. topic/send_request_count
D. subscription/ack_message_count
E. subscription/num_undelivered_messages
Answers: B, E

This metric indicates how long messages have been waiting for acknowledgment.

Why this answer

The metric `subscription/num_undelivered_messages` indicates the number of messages not yet delivered/acknowledged. `subscription/oldest_unacked_message_age` shows the age of the oldest unacknowledged message, indicating backlogs.

25
Multi-Select · easy

You are using Cloud Workflows to orchestrate a series of API calls. You need to handle errors and retries. Which THREE features of Cloud Workflows can you use? (Choose THREE.)

Select 3 answers
A. Use try/except blocks to catch and handle errors.
B. Integrate with Cloud Load Balancing for high availability.
C. Use conditional branches (if-else) based on step results.
D. Define a retry policy on a step.
E. Enable automatic logging for each step.
Answers: A, C, D

Workflows supports try/except blocks for catching errors, with an optional retry policy on the try block.

Why this answer

Cloud Workflows supports steps with retry policies, try/except blocks for error handling, and conditional (if-else) logic for branching. It does not have built-in step-level logging (logging is done via Cloud Logging), and there is no built-in load balancer integration.

26
Multi-Selecthard

A data engineer is building a Cloud Workflows workflow that orchestrates multiple Cloud Functions and API calls. The workflow should handle transient failures with retries and send a notification to a Pub/Sub topic if the workflow ultimately fails. Which THREE steps should the engineer include in the workflow definition?

Select 3 answers
A.Use a 'for' loop to iterate over retries.
B.Use the 'googleapis.pubsub.v1.projects.topics.publish' connector to send a failure notification.
C.Use a 'try' / 'catch' block to handle exceptions and route to a failure step.
D.Use a 'switch' step to check the status of previous steps and conditionally execute next steps.
E.Use a 'retry' block with a max retries and backoff configuration on each API call step.
AnswersB, D, E

Workflows can call Pub/Sub via the connector to publish messages, e.g., a failure alert.

Why this answer

Workflows supports retry policies via 'retry' blocks, conditional steps using 'switch', and Pub/Sub publishing via the 'googleapis.pubsub.v1.projects.topics.publish' connector. Option C is excluded because Workflows has no 'catch' keyword: errors are caught with 'try' / 'except', and a 'retry' policy is declared on the 'try' block. 'for' loops are for iteration, not error handling.

27
Multi-Selectmedium

A company uses Cloud Composer (Airflow) to orchestrate pipelines. They want to implement a pattern where a task polls for a file arrival in Cloud Storage and then triggers subsequent tasks. Which THREE Airflow concepts are essential? (Choose 3)

Select 3 answers
A.Sensors (e.g., GCSObjectExistenceSensor)
B.XComs to pass file path between tasks
C.Operators (e.g., PythonOperator)
D.SubDAGs for grouping tasks
E.Task dependencies (bitshift operators)
AnswersA, B, E

Sensors poll for conditions like file existence.

Why this answer

A is correct because Sensors are a specialized type of operator designed to wait for a specific condition, such as file arrival in Cloud Storage. The `GCSObjectExistenceSensor` in Cloud Composer (Airflow) polls Google Cloud Storage at a configurable interval until the target file exists, making it the precise tool for this file-arrival polling pattern.

Exam trap

The exam often tests whether candidates confuse general-purpose Operators (like PythonOperator) with Sensors, leading them to pick Operators for polling tasks when Sensors are the correct, purpose-built solution.

28
Multi-Selectmedium

A data engineer needs to migrate a schema from BigQuery where a column is currently REQUIRED and needs to become NULLABLE. Which TWO statements are correct? (Choose 2)

Select 2 answers
A.Use ALTER TABLE RENAME COLUMN and then add new column
B.Drop the column and add it again as NULLABLE
C.Use bq update --schema to change the mode
D.Create a new table with the desired schema and use a query to populate it
E.Use ALTER TABLE ALTER COLUMN SET DATA TYPE to change to NULLABLE
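AnswersC, D

Relaxing a column from REQUIRED to NULLABLE is a supported schema change, so bq update with the modified schema works.

Why this answer

BigQuery allows a column's mode to be relaxed from REQUIRED to NULLABLE in place, either by running bq update with a schema in which the column is NULLABLE or with the DDL statement ALTER TABLE ... ALTER COLUMN ... DROP NOT NULL. Creating a new table with the desired schema and populating it with a query also works, at the cost of rewriting the data. Dropping and re-adding the column would discard its existing values, and SET DATA TYPE changes a column's type, not its mode.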

29
MCQeasy

An engineer needs to create a reusable Dataflow pipeline that can be executed with different parameters without modifying code. Which Dataflow feature should they use?

A.Dataflow Shuffle
B.Dataflow Flex Templates
C.Dataflow SQL
D.Dataflow Classic Templates
AnswerB

Flex Templates use Docker containers and support any pipeline dependency, allowing parameterization.

Why this answer

Flex Templates allow packaging a pipeline into a Docker image with parameterization, enabling reuse across different environments.

30
MCQeasy

A streaming Dataflow pipeline needs to be updated without draining the existing pipeline. Which update strategy should be used?

A.Drain the pipeline first, then start a new one
B.Replace the job with a new job using the same pipeline name
C.Use a different pipeline name and cancel the old one
D.Stop the job, update the code, and restart
AnswerB

Dataflow supports updating an existing streaming job if the pipeline name matches and the graph is compatible.

Why this answer

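Dataflow can replace a running streaming job in place: launch the updated pipeline with the --update option and the same job name as the running job. Dataflow checks that the new graph is compatible, carries over in-flight state, and stops the old job, so no drain is required.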

31
MCQmedium

You are monitoring a streaming Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. In Cloud Monitoring, you notice that the 'system_lag' metric is increasing over time and now exceeds 10 minutes. The 'data_watermark' metric shows a steady lag. What is the most likely cause of the increasing system lag?

A.BigQuery write throughput is throttling the pipeline.
B.The Pub/Sub subscription has too many unacknowledged messages.
C.The pipeline is using a global window with late data handling.
D.The Dataflow pipeline is underprovisioned with workers, causing processing backlog.
AnswerD

Insufficient workers lead to a backlog, increasing system lag. Autoscaling may be delayed or maxed out.

Why this answer

Option D is correct because an underprovisioned pipeline lacks sufficient worker resources to process incoming messages at the rate they arrive. This causes a growing backlog in the pipeline's internal buffers, which directly increases the 'system_lag' metric (the time between data ingestion and processing). The 'data_watermark' lag remaining steady indicates that the pipeline is still making progress on event-time processing, but the overall processing capacity is insufficient to keep up with the input rate.

Exam trap

The trap here is that candidates confuse 'system_lag' (processing delay) with 'data_watermark' (event-time completeness), leading them to incorrectly attribute the issue to late data handling or Pub/Sub acknowledgment problems instead of a simple resource underprovisioning.

How to eliminate wrong answers

Option A is wrong because BigQuery write throughput throttling would manifest as a steady or increasing 'data_watermark' lag (due to backpressure on event-time processing) and would typically cause write failures or retries, not a steadily increasing system lag while watermark lag stays constant. Option B is wrong because too many unacknowledged messages in Pub/Sub would indicate a subscriber issue, but Dataflow manages its own acknowledgments; if the pipeline were failing to acknowledge, the 'system_lag' would not necessarily increase—instead, the subscription backlog would grow, and the pipeline might stall. Option C is wrong because using a global window with late data handling would affect the 'data_watermark' metric (it would lag as late data arrives), not the 'system_lag' metric, which measures processing delay independent of windowing strategy.

32
MCQeasy

Which BigQuery feature allows you to estimate the cost of a query before running it, by returning the number of bytes that would be processed?

A.EXPLAIN statement
B.INFORMATION_SCHEMA.JOBS
C.--dry_run flag
D.Slot estimator
AnswerC

dry_run returns bytes processed without running the query.

Why this answer

The --dry_run flag in the BigQuery CLI or the dryRun parameter in the API simulates the query and returns the bytes processed without executing it, allowing cost estimation.

33
MCQmedium

A data engineer is building a batch pipeline that runs daily using Cloud Composer. The pipeline has three tasks: extract data from Cloud Storage, transform data using Dataflow, and load the transformed data into BigQuery. The engineer wants to ensure that the Dataflow job only starts after the extraction task completes successfully, and the load task only starts after the Dataflow job finishes. How should the engineer define the task dependencies in the Airflow DAG?

A.extract >> [transform, load]
B.transform >> extract >> load
C.extract >> transform >> load
D.extract >> load >> transform
AnswerC

Correct: This defines sequential dependencies: extract before transform, transform before load.

Why this answer

Option C is correct because Airflow uses the bitshift operator (>>) to define task dependencies in a linear sequence. The DAG must ensure that the extract task completes before the transform task starts, and the transform task completes before the load task starts. This is achieved by chaining the tasks in order: extract >> transform >> load, which enforces the required sequential execution.

Exam trap

The exam often tests the misconception that multiple tasks can be chained in parallel with a single bitshift operator, leading candidates to choose Option A, which incorrectly allows the load task to start before the Dataflow job completes.

How to eliminate wrong answers

Option A is wrong because it sets transform and load as parallel downstream tasks of extract, meaning load could start before transform finishes, violating the requirement that load waits for Dataflow. Option B is wrong because it places transform before extract, which would attempt to run the Dataflow job before the extraction completes, breaking the dependency chain. Option D is wrong because it places load before transform, which would attempt to load data into BigQuery before the Dataflow transformation is done, leading to incorrect or missing data.
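
The difference between Option C and Option A can be seen in a toy model of the `>>` operator. This is a minimal sketch, not Airflow itself: the `Task` class is a hypothetical stand-in that only records upstream dependencies, assuming (as in Airflow) that `a >> b` returns the right-hand side so that chains can continue.

```python
class Task:
    """Toy stand-in for an Airflow task; it only models the >> operator."""

    def __init__(self, name):
        self.name = name
        self.upstream = set()  # names of tasks that must finish first

    def __rshift__(self, other):
        # task >> other  or  task >> [a, b]: self becomes upstream of each target
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            target.upstream.add(self.name)
        return other  # returning the right-hand side is what allows chaining


# Option C: strictly sequential
extract, transform, load = Task("extract"), Task("transform"), Task("load")
extract >> transform >> load
sequential_load_upstream = load.upstream

# Option A: fan-out, so load no longer waits for transform
e2, t2, l2 = Task("extract"), Task("transform"), Task("load")
e2 >> [t2, l2]
fanout_load_upstream = l2.upstream
```

In the sequential chain, load depends on transform; in the fan-out, load depends only on extract, which is exactly why Option A violates the requirement.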

34
MCQhard

You want to create a cost-efficient snapshot of a large BigQuery table that can be used by other teams for read-only analytics without incurring additional storage costs for the base table data. What should you use?

A.Create a BigQuery table snapshot of the original table.
B.Export the table to Cloud Storage as Avro files and load into a new table.
C.Create a view over the original table.
D.Create a BigQuery table clone of the original table.
AnswerA

Table snapshots are read-only and share storage with the base table; charges apply only for data that later changes or is deleted in the base table.

Why this answer

Option A is correct because a BigQuery table snapshot is a read-only, point-in-time copy that references the storage of the base table, so no additional storage cost is incurred for unchanged base data. Storage is billed only for snapshot data that has since been changed or deleted in the base table. This makes snapshots a good fit for sharing a static copy for read-only analytics without duplicating storage.

Exam trap

The exam often tests the distinction between table snapshots (read-only) and table clones (writable). Both share storage with the base table at creation, so the read-only requirement is what decides between them.

How to eliminate wrong answers

Option B is wrong because exporting to Cloud Storage as Avro files and loading into a new table duplicates the data, incurring storage costs for both the exported files and the new table, which is not cost-efficient. Option C is wrong because a view does not create a snapshot; it runs a query against the original table each time it is accessed, which can incur query costs and does not provide a static, read-only copy for other teams. Option D is wrong because a table clone is a writable copy; it is equally storage-efficient at creation, but it is meant for cases where the copy must be modified, not for handing out an immutable point-in-time copy.

35
MCQeasy

A data engineer wants to quickly estimate the cost of running a BigQuery query before executing it. Which command-line tool or command should they use?

A.gcloud logging read
B.bq query --use_cache=false
C.gcloud bigtable queries run
D.bq query --dry_run
AnswerD

The --dry_run flag parses and validates the query, then outputs the bytes processed without executing.

Why this answer

The `bq query --dry_run` command parses the query and reports the number of bytes processed without executing it, allowing cost estimation based on on-demand pricing.
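
Turning the dry-run byte count into a dollar figure is simple arithmetic. The sketch below assumes on-demand billing per TiB processed; the rate of 6.25 USD per TiB and the 1.5 TiB byte count are illustrative assumptions, so check the current price list for your region.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def estimate_on_demand_cost(bytes_processed, usd_per_tib):
    """Convert a dry-run byte count into an approximate on-demand charge."""
    return bytes_processed / TIB * usd_per_tib


# Hypothetical dry-run result of 1.5 TiB at an assumed rate of 6.25 USD per TiB
cost = estimate_on_demand_cost(int(1.5 * TIB), usd_per_tib=6.25)
```

With these assumed inputs the estimate comes to 9.375 USD.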

36
MCQeasy

You want to monitor the latency of messages in a Pub/Sub subscription. Which Cloud Monitoring metric should you use to see the age of the oldest unacknowledged message?

A.pubsub.googleapis.com/subscription/oldest_unacked_message_age
B.pubsub.googleapis.com/subscription/num_undelivered_messages
C.pubsub.googleapis.com/topic/send_request_count
D.pubsub.googleapis.com/topic/publish_latency
AnswerA

This metric directly shows the age of the oldest unacknowledged message, indicating processing latency.

Why this answer

The metric 'subscription/oldest_unacked_message_age' measures the age (in seconds) of the oldest unacknowledged message in a subscription. This helps track processing lag. The other metrics measure different aspects: num_undelivered_messages is count, not age; topic metrics are irrelevant for subscription lag; publish_latency is about publishing, not consumption.

37
MCQmedium

You are building a data pipeline that runs daily batch jobs on Dataproc, then loads results into BigQuery. You want to orchestrate the entire workflow, including dependencies between steps, retries, and monitoring. Which Google Cloud service is most appropriate?

A.Cloud Scheduler
B.Cloud Composer
C.Cloud Workflows
D.Dataflow
AnswerB

Cloud Composer (Airflow) is the right choice for complex workflows with dependencies, retries, and scheduling across Dataproc and BigQuery.

Why this answer

Cloud Composer is a managed Apache Airflow service that provides DAG-based orchestration with rich operators for Dataproc, BigQuery, and other GCP services. It handles dependencies, retries, and monitoring out of the box. Workflows is simpler and serverless but lacks the extensive operator library and scheduling flexibility of Airflow.

38
MCQmedium

Your organization has a BigQuery flat-rate reservation with 2000 slots. During peak hours, query performance degrades because concurrent queries exceed the available slots. You want to handle these bursts without changing the base reservation. What should you do?

A.Enable autoscaling on the reservation to automatically add slots up to a maximum.
B.Purchase committed use discounts to increase the base reservation to 3000 slots.
C.Change the pricing model to on-demand to allow unlimited slots.
D.Purchase flex slots during peak hours to add capacity temporarily.
AnswerD

Flex slots are short-term, pay-as-you-go slots that can be added to a reservation for burst capacity, then released.

Why this answer

Option D is correct because flex slots allow you to temporarily add capacity to a BigQuery flat-rate reservation without committing to a permanent increase. This handles burst workloads during peak hours by adding slots on demand, and you only pay for the time they are used. The base reservation of 2000 slots remains unchanged, meeting the requirement.

Exam trap

The exam often tests the distinction between permanent capacity changes (committed use discounts) and temporary capacity additions (flex slots), trapping candidates who confuse autoscaling (which modifies the reservation's behavior) with the requirement to keep the base reservation unchanged.

How to eliminate wrong answers

Option A is wrong because autoscaling automatically adds slots up to a maximum, but it changes the base reservation by enabling a dynamic scaling mechanism that can incur costs even when not needed, and it does not preserve the original 2000-slot base reservation as a fixed baseline. Option B is wrong because purchasing committed use discounts increases the base reservation permanently to 3000 slots, which contradicts the requirement to not change the base reservation. Option C is wrong because switching to on-demand pricing removes the reservation entirely and uses unlimited slots, but it changes the pricing model and does not preserve the flat-rate reservation structure.

39
MCQeasy

Which Dataflow feature allows you to package a pipeline into a reusable template that can be deployed with different parameters at runtime?

A.Cloud Dataproc
B.Classic Templates
C.Dataflow SQL
D.Flex Templates
AnswerD

Flex Templates use Docker containers and support complex runtime parameters.

Why this answer

Dataflow Flex Templates allow you to containerize a pipeline and provide runtime parameters, enabling reusability across different environments or jobs.

40
Multi-Selectmedium

You are configuring Dataplex data quality rules for a BigQuery table. Which three types of rules can be defined using Dataplex's SQL-based rule engine? (Choose three.)

Select 3 answers
A.Row-level rules (e.g., condition must be true for every row)
B.Set-level rules (e.g., intersection or difference between two tables)
C.Pattern matching rules (e.g., regex on column values)
D.Table-level rules (e.g., row count threshold)
E.Column-level rules (e.g., uniqueness, nullness, range)
AnswersA, D, E

Row-level rules validate each row against a condition.

Why this answer

Option A is correct because Dataplex's SQL-based rule engine supports row-level rules that enforce a condition that must be true for every row in a BigQuery table. These rules are defined using a SQL expression that is evaluated per row, and if any row fails the condition, the rule is violated. This allows you to validate data integrity at the most granular level, such as ensuring a column value is always positive.

Exam trap

The exam often tests the distinction between rule categories that are natively supported versus those that require custom SQL workarounds, leading candidates to mistakenly select set-level or pattern matching rules as separate types when they are actually implemented within row-level rules.

41
MCQmedium

You need to inspect a BigQuery table for sensitive data such as credit card numbers and apply masking. Which GCP service should you use to identify and de-identify the data?

A.IAM Recommender
B.Cloud KMS
C.Dataplex
D.Cloud Data Loss Prevention (DLP)
AnswerD

Cloud DLP is designed for inspection and de-identification of sensitive data.

Why this answer

Cloud DLP can automatically inspect BigQuery tables for sensitive data types (credit card numbers, etc.) and apply de-identification transforms like masking, tokenization, etc.

42
MCQmedium

A company uses Dataflow streaming pipelines to process real-time events. They notice increasing system lag over time. Which two Cloud Monitoring metrics should be examined to diagnose the cause?

A.Pub/Sub subscription/num_undelivered_messages and Dataflow job/watermark_lag
B.Dataproc cluster/yarn_allocated_memory_percentage and Dataflow job/worker_cpu
C.Dataflow job/system_lag and Dataflow job/data_freshness
D.BigQuery query/execution_times and Dataflow job/elapsed_time
AnswerC

system_lag indicates processing delay; data_freshness shows watermark progress. Both are key for streaming lag.

Why this answer

System lag measures the time between event ingestion and processing. Data freshness shows the watermark. Worker CPU indicates compute resource issues.

43
MCQmedium

A data engineer notices that BigQuery queries are slower than expected. They want to identify the most expensive stages in the query execution. Which tool or command should they use?

A.Use bq show to view job statistics
B.Use bq query --format=prettyjson and look at statistics
C.Use Cloud Monitoring to view query execution graphs
D.Use EXPLAIN statement in BigQuery
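AnswerA

Job statistics include the query plan, broken down by stage with timing and row counts.

Why this answer

BigQuery has no EXPLAIN statement. The query plan is recorded in the job's statistics, with per-stage timing and records read and written. It can be retrieved with bq show --format=prettyjson -j for the job ID, or viewed in the console's execution details.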

44
Multi-Selecthard

You have a BigQuery table with a REQUIRED column that you now need to allow NULL values. You also need to add two new nullable columns. Which THREE steps are required to achieve this schema evolution? (Choose 3)

Select 3 answers
A.Add the new columns using ALTER TABLE ADD COLUMN.
B.Use the ALTER TABLE SET OPTIONS statement to change the column mode to NULLABLE.
C.Update the IAM permissions on the table.
D.Use CREATE OR REPLACE TABLE with the new schema and import data.
E.Export the table to Cloud Storage in Avro format.
AnswersB, D, E

ALTER TABLE ... ALTER COLUMN ... DROP NOT NULL changes a column's mode from REQUIRED to NULLABLE.

Why this answer

The schema can be updated via `bq update` or ALTER TABLE, or the table can be recreated by exporting the data, modifying the schema, and importing the data again. Adding nullable columns is possible with ALTER TABLE ADD COLUMN, and relaxing REQUIRED to NULLABLE can be done in place with ALTER TABLE ... ALTER COLUMN ... DROP NOT NULL or by recreating the table.

45
Multi-Selectmedium

A company is migrating their on-premises data warehouse to BigQuery. They have a mix of batch and streaming ingestion. The data team wants to optimize query costs. Which THREE practices should they adopt?

Select 3 answers
A.Switch to flat-rate pricing to cap slot usage.
B.Use materialized views for frequently executed aggregations.
C.Partition tables by a date or timestamp column.
D.Limit the number of concurrent queries by setting a maximum slot capacity.
E.Cluster tables on columns that are frequently used in filters and joins.
AnswersB, C, E

Materialized views automatically refresh and are used by the query optimizer to speed up queries and reduce scanned bytes.

Why this answer

Partitioning by date reduces bytes scanned. Using clustered tables improves performance for filter/join queries. Using materialized views can precompute aggregations and reduce scans.

Flat-rate pricing is about reservation management, not cost optimization per query. Limiting slots is not a cost optimization; it may cause throttling.

46
Multi-Selectmedium

You want to optimize BigQuery costs for a large dataset that is frequently queried by time range. You also need to ensure that predictable workloads have dedicated slot capacity. Which TWO strategies should you combine? (Choose 2)

Select 2 answers
A.Use query caching
B.Partition the table by date
C.Purchase committed use reservations for baseline capacity
D.Create a materialized view for the entire table
E.Enable autoscaling slots
AnswersB, C

Partitioning by date limits the bytes scanned per query, reducing cost.

Why this answer

Partitioned tables reduce bytes scanned by time-range queries. Committed use reservations provide predictable slot capacity at a discount. Autoscaling does not provide dedicated capacity; caching is automatic.

47
Multi-Selectmedium

Your company uses Cloud Composer to orchestrate a data pipeline that includes Dataproc Spark jobs and BigQuery load operations. You need to pass the output file path from the Spark job to the next BigQuery task in the DAG. Which two mechanisms can you use to share data between tasks? (Choose TWO.)

Select 2 answers
A.Store the output path as a Cloud Composer variable.
B.Publish the output path to a Pub/Sub topic and subscribe in the next task.
C.Write the output path to a Cloud Storage object and read it in the next task.
D.Use BigQuery as an intermediary to store the output path.
E.Use Airflow XComs to push the output path from the Spark task and pull it in the BigQuery task.
AnswersC, E

Cloud Storage is a durable store that both tasks can access.

Why this answer

Airflow XComs allow tasks to exchange small amounts of data (e.g., file paths) by pushing and pulling values. Cloud Storage can be used as an intermediate store: the Spark job writes output to GCS, and the BigQuery task reads from that location. BigQuery does not directly communicate with Dataproc.

Cloud Composer variables are for global configuration, not task-to-task. Pub/Sub is not needed for simple file path sharing.

48
MCQhard

A data engineer needs to alert when Pub/Sub subscription has messages older than 1 hour. Which Cloud Monitoring metric and filter should they use?

A.Metric: topic/send_message_operation_count; filter: topic_id
B.Metric: subscription/ack_message_count; filter: subscription_id
C.Metric: subscription/num_undelivered_messages; filter: subscription_id
D.Metric: subscription/oldest_unacked_message_age; filter: subscription_id
AnswerD

Correct metric and filter for alerting on message age.

Why this answer

The metric subscription/oldest_unacked_message_age gives the age of the oldest unacknowledged message. Filtering by subscription ID targets the specific subscription.
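
A minimal sketch of the filter and threshold such an alert would use. The metric type and the `subscription_id` resource label are the ones named above; the helper name and the subscription `orders-sub` are hypothetical.

```python
def oldest_unacked_age_filter(subscription_id):
    """Build a Cloud Monitoring filter for one subscription's oldest unacked message age."""
    return (
        'metric.type="pubsub.googleapis.com/subscription/oldest_unacked_message_age" '
        'AND resource.type="pubsub_subscription" '
        f'AND resource.labels.subscription_id="{subscription_id}"'
    )


# "Older than 1 hour": the metric is reported in seconds
ALERT_THRESHOLD_SECONDS = 60 * 60

alert_filter = oldest_unacked_age_filter("orders-sub")  # hypothetical subscription
```

The alerting policy would then fire when the filtered time series stays above `ALERT_THRESHOLD_SECONDS`.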

49
MCQmedium

You are monitoring a Dataproc cluster and notice that the cluster utilisation is high, but jobs are running slowly. The cluster uses preemptible workers for cost savings. What is the most likely cause of the performance degradation?

A.The primary workers are using standard disks instead of SSDs.
B.The preemptible workers are being preempted frequently, causing task retries and slowdowns.
C.The cluster is under-provisioned; increase the number of preemptible workers.
D.The cluster is using an older image version; upgrade to the latest.
AnswerB

Preemptible workers have a high chance of termination, which affects job performance.

Why this answer

Preemptible workers can be terminated at any time, and the tasks they were running must be rescheduled and recomputed. High utilisation combined with slow jobs is consistent with the remaining workers spending their capacity re-running work lost to preemption.

50
MCQmedium

A company runs a critical batch pipeline using Cloud Dataflow. The pipeline processes financial transactions and runs every hour. Recently, some runs have failed due to transient errors (e.g., network timeouts). The engineer wants to automatically retry failed runs without manual intervention. The pipeline is launched from a Cloud Composer DAG using DataflowPythonOperator. What is the BEST way to handle retries?

A.Add a DataflowJobStatusSensor in the DAG that waits for job completion and retries if failed.
B.Set the 'retries' parameter in the DAG's default_args to a positive integer.
C.Configure the Dataflow pipeline to automatically retry on failure using the --numberOfWorkerHarnessThreads option.
D.Use a Cloud Function triggered by Cloud Scheduler to re-launch the pipeline if the Dataflow job fails.
AnswerB

This allows Airflow to retry the entire task (which launches the Dataflow job) if it fails due to transient errors.

Why this answer

Option B is correct because Cloud Composer (Apache Airflow) natively supports task-level retries via the 'retries' parameter in default_args. When a DataflowPythonOperator fails due to a transient error, Airflow automatically re-executes the task up to the specified number of retries, without requiring custom sensors or external triggers. This is the simplest and most reliable mechanism for handling transient failures in a DAG-driven pipeline.

Exam trap

The trap here is that candidates confuse Dataflow's own internal retrying of failed work items with Airflow task-level retries, or assume that a sensor or external trigger is required to detect and retry failures, when in fact Airflow's native retry parameter is the simplest and most appropriate solution for transient errors in a DAG-managed pipeline.

How to eliminate wrong answers

Option A is wrong because a DataflowJobStatusSensor only monitors job status and does not automatically retry the pipeline; it would require additional branching logic to relaunch the job, adding unnecessary complexity. Option C is wrong because --numberOfWorkerHarnessThreads controls parallelism within the Dataflow worker, not retry behavior on pipeline failure; Dataflow retries failed work items internally but does not relaunch a job that has failed. Option D is wrong because using a Cloud Function and Cloud Scheduler introduces an external dependency and latency, whereas Airflow's built-in retry mechanism is more direct and integrated with the DAG's execution context.
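
The retry semantics behind Option B can be sketched in plain Python. `retries` and `retry_delay` are standard Airflow task arguments, but the values here are illustrative, and `run_with_retries` and `launch_job` are hypothetical helpers that only model the behavior (one initial attempt plus up to `retries` re-runs); they are not Airflow code.

```python
from datetime import timedelta

# Keys are standard Airflow task arguments; the values are illustrative.
default_args = {
    "retries": 3,                         # re-run a failed task up to 3 more times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}


def run_with_retries(task, retries):
    """Toy model of the semantics: 1 initial attempt plus `retries` re-runs."""
    for attempt in range(1, retries + 2):
        try:
            return task(), attempt
        except Exception:
            if attempt == retries + 1:
                raise  # retries exhausted: the task is marked failed


# Hypothetical flaky launcher: fails twice with a timeout, then succeeds.
calls = {"n": 0}


def launch_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient network timeout")
    return "JOB_STATE_DONE"


result, attempts = run_with_retries(launch_job, default_args["retries"])
```

Here the job succeeds on the third attempt, well within the four attempts that `retries: 3` allows.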

51
MCQeasy

An organization uses BigQuery on-demand pricing. To control costs, they want to estimate the bytes processed by a query before running it. Which command or method should they use?

A.Use the bq query --dry_run command
B.Use bq ls to list table sizes
C.Use BigQuery reservations to get cost estimate
D.Use INFORMATION_SCHEMA.JOBS_BY_PROJECT to view past costs
AnswerA

Dry run provides byte estimate without running the query.

Why this answer

BigQuery dry run estimates bytes processed without executing the query. It can be done via CLI with --dry_run flag or in the console.

52
MCQmedium

You have a BigQuery table that is used by multiple teams. To save costs, you want to provide a consistent view of the data as of a specific point in time without creating full copies. Which BigQuery feature should you use?

A.Authorized views
B.Materialized views
C.Table snapshot
D.Table clone
AnswerC

Table snapshots are read-only, point-in-time copies that cost only storage and are ideal for sharing consistent views.

Why this answer

BigQuery table snapshots provide a point-in-time copy of a table that incurs only storage costs for the snapshot (no additional slot usage). They are read-only and can be used to share data without duplicating the base table.

53
Multi-Selecteasy

You need to deploy a reusable Dataflow pipeline that can be executed with different parameters from Cloud Composer. Which TWO components should you use? (Choose 2)

Select 2 answers
A.Direct runner
B.Dataflow Flex Template
C.Cloud Composer with DataflowStartFlexTemplateOperator
D.Dataflow Classic Template
E.Cloud Scheduler
AnswersB, C

Flex Templates are reusable and parameterizable.

Why this answer

Dataflow Flex Templates allow you to create custom templates that can accept runtime parameters. Cloud Composer can trigger these templates using the DataflowStartFlexTemplateOperator. Direct runner options are not needed.

54
MCQeasy

A data engineer needs to inspect a BigQuery table for sensitive data such as credit card numbers and email addresses before sharing it with a third party. The engineer also wants to de-identify the data by masking the sensitive columns. Which Google Cloud service should be used?

A.Dataplex
B.BigQuery column-level security
C.Data Catalog
D.Cloud DLP
AnswerD

Cloud DLP can inspect BigQuery tables for sensitive info and apply de-identification transformations like masking, tokenization, etc.

Why this answer

Cloud DLP (Data Loss Prevention) is the correct service because it is specifically designed to inspect, classify, and de-identify sensitive data such as credit card numbers and email addresses. It provides built-in infoType detectors for over 150 types of sensitive data and supports masking, tokenization, and other de-identification techniques. The engineer can use Cloud DLP to scan BigQuery tables and then apply transformations to mask the sensitive columns before sharing.

Exam trap

The trap here is that candidates confuse BigQuery column-level security (access control) with data masking, but BigQuery column-level security only hides data from unauthorized users and does not inspect or transform the data itself, whereas Cloud DLP performs actual de-identification.

How to eliminate wrong answers

Option A is wrong because Dataplex is a data fabric service for managing, governing, and cataloging data across lakes, warehouses, and marts, but it does not natively inspect or de-identify sensitive data; it relies on integration with Cloud DLP for such tasks. Option B is wrong because BigQuery column-level security (using policy tags) only controls access at the column level by granting or denying read permissions, but it does not inspect content for sensitive data or perform masking/de-identification. Option C is wrong because Data Catalog is a metadata management service for discovering and tagging data assets, but it cannot scan for sensitive data patterns or apply de-identification transformations.
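
To make the idea of character masking concrete, here is a plain-Python illustration. It is not the Cloud DLP API: the regular expression is a deliberately simplistic, hypothetical detector for 13 to 16 digit card-like numbers, whereas Cloud DLP uses its own infoType detectors.

```python
import re

# Simplistic pattern: 13-16 digits with optional space or dash separators.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def mask_card_numbers(text, mask_char="#", keep_last=4):
    """Replace all but the last `keep_last` digits of each card-like number."""

    def _mask(match):
        raw = match.group(0)
        to_mask = sum(ch.isdigit() for ch in raw) - keep_last
        out = []
        for ch in raw:
            if ch.isdigit() and to_mask > 0:
                out.append(mask_char)
                to_mask -= 1
            else:
                out.append(ch)  # keep separators and the trailing digits
        return "".join(out)

    return CARD_PATTERN.sub(_mask, text)


masked = mask_card_numbers("card 4111 1111 1111 1111 on file")
```

The masked string keeps the separators and the last four digits, which is the usual shape of a masking transform applied before sharing data.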

55
MCQmedium

Your team uses Cloud Composer to run Apache Airflow DAGs. One DAG uses a BigQueryInsertJobOperator to run a query and then uses BigQueryCheckOperator to verify the results. The DAG is failing intermittently because the query result is not ready when the check operator runs. How should you modify the DAG to ensure the check operator runs only after the query completes successfully?

A.Add a BigQueryTableExistenceSensor before the BigQueryCheckOperator to wait for the table to be created.
B.Change the BigQueryInsertJobOperator to use deferrable mode to make it async.
C.Use a PythonOperator to call the BigQuery API directly and wait for the job to finish.
D.Increase the timeout of the BigQueryCheckOperator.
AnswerA

Sensors wait for a condition; this ensures the table exists before the check runs.

Why this answer

The BigQueryInsertJobOperator is synchronous by default, so the DAG should already wait for completion. However, if the check operator fails due to data not being written, adding a sensor operator (e.g., BigQueryTableExistenceSensor) between the two tasks ensures the table exists before checking.

56
MCQmedium

A company uses Cloud Pub/Sub for a real-time data pipeline. The subscription has a backlog of millions of messages that are not being processed quickly enough. In Cloud Monitoring, you observe that the 'subscription/num_undelivered_messages' metric is high and growing, while 'subscription/oldest_unacked_message_age' is also increasing. Which action is MOST likely to reduce the backlog?

A.Delete the subscription and recreate it with a larger message retention duration.
B.Reduce the acknowledgment deadline to force faster processing.
C.Change the subscription type from push to pull.
D.Increase the number of subscribers or the throughput capacity of the existing subscribers.
AnswerD

Adding more subscribers (e.g., scaling out Dataflow workers) increases the rate of message processing, reducing the backlog.

Why this answer

Option D is correct because the backlog indicates that subscribers cannot keep up with the message flow. Increasing the number of subscribers or scaling their throughput capacity directly addresses the processing bottleneck, allowing messages to be pulled and acknowledged faster. Cloud Pub/Sub scales horizontally, so adding more pull subscribers or increasing the resources of existing ones (e.g., more worker threads, higher CPU/memory) reduces the backlog.
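The reasoning can be checked with a toy fixed-rate model. This is only a sketch under stated assumptions: the rates and starting backlog are hypothetical numbers, and real subscriber throughput is not constant.

```python
def backlog_over_time(publish_rate, per_subscriber_rate, subscribers,
                      start_backlog, seconds):
    """Return the backlog after each second for a fixed-rate toy model."""
    backlog = start_backlog
    history = []
    for _ in range(seconds):
        drained = per_subscriber_rate * subscribers
        # The backlog cannot go below zero.
        backlog = max(0, backlog + publish_rate - drained)
        history.append(backlog)
    return history


# Hypothetical workload: 1000 msg/s published, 200 msg/s per subscriber.
under_provisioned = backlog_over_time(1000, 200, 4, 5000, 5)
scaled_out = backlog_over_time(1000, 200, 8, 5000, 5)

print(under_provisioned)  # backlog keeps growing
print(scaled_out)         # backlog shrinks
```

With four subscribers the total drain rate (800 msg/s) is below the publish rate, so the backlog grows every second; doubling the subscribers lifts it above the publish rate and the backlog falls.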

Exam trap

The exam often tests the misconception that reducing the acknowledgment deadline or changing subscription type will speed up processing, when in reality these actions can increase redeliveries or do not address the root cause of insufficient subscriber capacity.

How to eliminate wrong answers

Option A is wrong because deleting and recreating the subscription with a larger message retention duration does not increase processing speed; it only keeps messages longer, which does not reduce the existing backlog. Option B is wrong because reducing the acknowledgment deadline forces subscribers to acknowledge messages faster, but if they cannot process them in time, it leads to more redeliveries and can worsen the backlog. Option C is wrong because changing from push to pull does not inherently increase throughput; both modes can be scaled, and the bottleneck is subscriber capacity, not the delivery mechanism.

57
MCQeasy

A company wants to share a large BigQuery dataset with a partner for analysis. The partner needs read-only access to a specific snapshot of the data as of a certain point in time, and the company wants to avoid additional storage costs for the partner. What is the most cost-effective approach?

A.Create a BigQuery table snapshot at the desired point in time and share it.
B.Create a BigQuery table clone at the desired point in time and share it with the partner.
C.Export the table to Cloud Storage as Avro and share a signed URL.
D.Grant the partner access to the original table with an authorized view.
AnswerA

A table snapshot is a read-only, point-in-time copy that shares storage with the base table; storage is billed only for snapshot data that the base table no longer holds.

Why this answer

BigQuery table snapshots are immutable, point-in-time copies that initially share storage with the base table, so extra storage is billed only for data that is later changed or deleted in the base table. Table clones share storage in the same way but are writable, which the partner does not need.

Authorized views or datasets require the partner to query the base table, which may incur analysis costs but no extra storage; however, the partner may see changes to the base table. Exporting to Cloud Storage duplicates the data. The question specifies a read-only point-in-time copy, so a snapshot is best.

58
Multi-Selectmedium

A data engineer needs to set up a Dataplex data quality scan to run weekly on a BigQuery table. The scan should check that: (1) the 'email' column is not null, (2) the 'age' column is between 0 and 120, and (3) the 'country_code' column matches a list of valid ISO codes. Which THREE Dataplex features should the engineer use? (Choose 3)

Select 3 answers
A.Schedule the data quality scan using Dataplex scheduling options
B.Use Cloud DLP to inspect the table for data quality issues
C.Create a column rule for the 'age' column using SQL condition 'age BETWEEN 0 AND 120'
D.Create a Dataplex asset and add a tag template to enforce constraints
E.Create a row rule for the 'email' column using SQL condition 'email IS NOT NULL'
AnswersA, C, E

Dataplex allows scheduling data quality tasks (e.g., weekly) directly from the UI or API.

Why this answer

Dataplex data quality tasks use SQL-based rules. The not-null check on 'email' (option E) and the range check on 'age' (option C) are each expressed as a rule with a SQL condition. A set-membership rule would cover 'country_code', but none of the options offers one.

Schedule is set via the scheduling feature. The other options are not Dataplex data quality features: DLP is for sensitive data, Data Catalog is for metadata.

59
MCQmedium

You are designing a Dataflow pipeline that reads from Pub/Sub, performs transformations, and writes to BigQuery. The pipeline must handle schema changes in the incoming data (e.g., new fields appearing). The BigQuery schema should evolve automatically to accept new fields without failing. Which approach should you use?

A.Use Dataprep to clean and standardise the data before loading into BigQuery.
B.Predefine the BigQuery table schema with all possible fields and use UPDATE to add missing fields.
C.Use a JavaScript UDF to parse incoming data and map to a fixed schema, ignoring new fields.
D.Set the table schema to allow unknown fields and use BigQuery's schema auto-update feature in the pipeline.
AnswerD

BigQuery can automatically add nullable columns when new fields appear if configured correctly.

Why this answer

Using BigQuery's schema auto-detection combined with specifying the write disposition as WRITE_APPEND and allowing unknown fields can handle schema drift. However, the most robust approach is to use a flexible schema in the pipeline and set BigQuery's schema update options to allow automatic addition of nullable fields.
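The idea of additive schema evolution can be sketched as a toy model in plain Python. This is not BigQuery or Beam code: the dictionary layout, the function name, and the choice to type every new field as STRING are all assumptions made for illustration. In BigQuery itself, the comparable job-level setting is the `ALLOW_FIELD_ADDITION` schema update option.

```python
def evolve_schema(schema, record):
    """Add any field missing from `schema` as a NULLABLE column.

    `schema` maps column name -> {"type": ..., "mode": ...} and is updated
    in place. Returns the names of the fields that were added.
    Assumption: new fields are typed STRING; real type inference is omitted.
    """
    added = []
    for field in record:
        if field not in schema:
            schema[field] = {"type": "STRING", "mode": "NULLABLE"}
            added.append(field)
    return added


schema = {"user_id": {"type": "STRING", "mode": "REQUIRED"}}
new_fields = evolve_schema(schema, {"user_id": "u1", "plan": "pro"})
print(new_fields)
print(schema["plan"])
```

New columns must be NULLABLE because rows written before the change have no value for them; existing columns are left untouched.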

60
MCQmedium

You need to create a reusable Dataflow pipeline for transforming CSV files in Cloud Storage into Avro files in another bucket. The pipeline should be configurable via runtime parameters (e.g., input and output paths). Which approach should you use?

A.Use Cloud Functions triggered by Cloud Storage events.
B.Create a Dataflow Classic Template with the pipeline code and parameters.
C.Use Cloud Run Jobs to run the transformation as a container.
D.Create a Dataflow Flex Template with a Docker image and parameterized metadata.
AnswerD

Flex Templates allow full customization and runtime parameters, making the pipeline reusable.

Why this answer

Option D is correct because Dataflow Flex Templates allow you to package your pipeline code and dependencies into a Docker image, and define parameterized metadata (e.g., input and output paths) that are exposed as runtime parameters. This approach provides full customization of the execution environment and supports reusable, configurable pipelines for transforming CSV to Avro across different Cloud Storage buckets.

Exam trap

The exam often tests the distinction between Classic Templates (limited parameterization, no custom Docker) and Flex Templates (full parameterization, custom Docker), and the trap here is that candidates may choose Classic Templates because they are simpler, overlooking the requirement for configurable runtime parameters and reusable custom transformations.

How to eliminate wrong answers

Option A is wrong because Cloud Functions triggered by Cloud Storage events are stateless, have a limited execution timeout (for example, 9 minutes for 1st gen functions), and are not designed for long-running or complex data transformations like converting CSV to Avro; they also lack Dataflow's parallel processing. Option B is wrong in this question's framing because a Dataflow Classic Template stages a fixed job graph, exposes runtime parameters only through the ValueProvider interface, and is not packaged as a Docker image, making it less flexible than a Flex Template. Option C is wrong because Cloud Run Jobs run general-purpose containers to completion and are not a distributed data processing service; they lack Dataflow's autoscaling of parallel workers and its built-in file I/O connectors.

61
MCQhard

You are optimizing a BigQuery query that scans 1 TB of data every day. The query joins a large fact table (partitioned by date) with a small dimension table. You notice that the query always scans the entire fact table, even though you only need the last 7 days of data. Which optimization will MOST reduce the bytes scanned?

A.Create a materialized view that pre-aggregates the data by day.
B.Add a WHERE clause that filters on the date column used for partitioning.
C.Cluster the fact table on the join key used in the query.
D.Change the table to use time-unit column partitioning with a 1-day partition interval.
AnswerB

This enables partition pruning, so only the last 7 days' partitions are scanned, reducing bytes from 1 TB to roughly 19 GB (about 1/52), assuming the table holds about a year of evenly sized daily partitions.

Why this answer

Using a WHERE clause on the partitioning column (e.g., date) allows BigQuery to perform partition pruning, drastically reducing scanned bytes. Clustering on the join key can improve performance but does not reduce bytes scanned as much as partition pruning. Materialized views precompute aggregations but may not reduce scans if the base query still scans full table.

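Changing the partitioning scheme (option D) does not help on its own: the table is already partitioned by date, and without a filter on the partitioning column BigQuery still scans every partition.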
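The size of the saving is simple arithmetic. The sketch below assumes the 1 TB table (taken as 1000 GB) holds 365 daily partitions of equal size; neither figure is stated in the question, so treat the result as an order-of-magnitude estimate.

```python
def scanned_gb(table_gb, total_partitions, partitions_read):
    """Estimate GB scanned when only some equally sized partitions are read."""
    return table_gb * partitions_read / total_partitions


full_scan = scanned_gb(1000, 365, 365)  # no filter on the partition column
pruned = scanned_gb(1000, 365, 7)       # WHERE clause keeps the last 7 days

print(full_scan)
print(round(pruned, 1))  # 19.2
```

Seven of 365 partitions is about 1/52 of the table, which is where the roughly 19 GB figure comes from.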

62
MCQmedium

You are running a streaming pipeline with Dataflow that reads from Pub/Sub and writes to BigQuery. You notice that the system lag metric is increasing over time, indicating that messages are taking longer to process. What is the most likely cause and how should you address it?

A.The source Pub/Sub topic has insufficient throughput; increase the number of partitions.
B.The Dataflow workers are CPU-bound; increase the number of workers or adjust autoscaling settings.
C.The BigQuery destination table has too many columns; reduce the number of columns.
D.The pipeline uses a batch transform that should be replaced with a streaming transform.
AnswerB

High system lag suggests worker resources are insufficient; adding workers reduces lag.

Why this answer

Increasing system lag often means the pipeline is CPU-bound, failing to keep up with the incoming data rate. Updating the pipeline with a higher number of workers (or enabling autoscaling) can resolve this.

63
MCQmedium

Your team uses Cloud Dataproc for Spark ML training jobs. You want to reduce costs for non-critical, fault-tolerant training jobs. Which Dataproc feature should you use for worker nodes?

A.Use preemptible instances for worker nodes.
B.Use custom machine types with more memory.
C.Use SSDs instead of HDDs for persistent disks.
D.Use committed use discounts for 1-year or 3-year terms.
AnswerA

Preemptible instances cost ~60-80% less and are suitable for fault-tolerant batch jobs.

Why this answer

Preemptible instances are short-lived, lower-cost VMs that Cloud Dataproc can use for worker nodes. Because the training jobs are non-critical and fault-tolerant (e.g., they can handle node failures via Spark's built-in resilience), preemptible instances significantly reduce costs while still completing the workload. This directly addresses the requirement to reduce costs for fault-tolerant jobs.

Exam trap

The exam often tests the distinction between cost-saving features that require commitment (committed use discounts) versus those that exploit workload characteristics (preemptible instances), and candidates mistakenly choose committed use discounts because they think 'discount' always means lower cost, ignoring the fault-tolerance requirement.

How to eliminate wrong answers

Option B is wrong because custom machine types with more memory increase cost per node, which contradicts the goal of reducing costs. Option C is wrong because SSDs are more expensive than HDDs, and while they improve I/O performance, the question focuses on cost reduction, not performance. Option D is wrong because committed use discounts require a 1-year or 3-year commitment and are typically applied to all instances in a project, not specifically to worker nodes in a Dataproc cluster; they also do not leverage the fault-tolerant nature of the jobs to achieve the lowest possible cost.

64
MCQeasy

A data engineer needs to orchestrate a complex data pipeline that involves multiple steps including data extraction from Cloud Storage, transformation using Dataflow, and loading into BigQuery. The pipeline has dependencies between tasks and requires monitoring and retries. Which Google Cloud service should be used for orchestration?

A.Workflows
B.Cloud Scheduler
C.Cloud Composer
D.Cloud Tasks
AnswerC

Cloud Composer (Airflow) is designed for orchestrating complex pipelines with dependencies.

Why this answer

Cloud Composer is a managed Apache Airflow service that provides a robust platform for orchestrating complex workflows with task dependencies, retries, and monitoring.

65
MCQhard

A company wants to use Cloud DLP to inspect data in BigQuery for sensitive information and de-identify it by masking credit card numbers. They want to perform this on a schedule. Which approach should they take?

A.Use Dataplex data quality rules with a custom SQL regex
B.Use Cloud Data Loss Prevention API with Cloud Composer
C.Use BigQuery column-level security with classification
D.Use Cloud DLP inspect and de-identify jobs triggered by Cloud Scheduler
AnswerD

DLP supports scheduled inspection and de-identification via Cloud Scheduler.

Why this answer

Cloud DLP can inspect BigQuery tables and de-identify using transforms like masking. Scheduling can be done via Cloud Scheduler.

66
Multi-Selectmedium

You are building a data pipeline that ingests data from on-premises into Cloud Storage, then processes it with Dataproc, and finally loads into BigQuery. You need to schedule the pipeline to run daily. The pipeline must handle occasional failures gracefully. Which THREE Google Cloud services should you use together to achieve this? (Choose 3)

Select 3 answers
A.Cloud Storage
B.Dataproc
C.Cloud Composer
D.Dataflow
E.Pub/Sub
AnswersA, B, C

Storage is the landing zone for raw data.

Why this answer

Cloud Composer orchestrates the whole pipeline. Cloud Storage is the staging area. Dataproc processes data.

BigQuery is the destination. Dataflow is not needed. Pub/Sub is for messaging, not scheduling.

67
MCQmedium

A company runs a Dataproc cluster for ETL jobs that process data nightly. They want to reduce costs while maintaining performance. Which strategy is MOST effective?

A.Use committed use discounts for all VMs
B.Enable Dataproc auto-scaling
C.Use preemptible VMs for all nodes including master
D.Use preemptible VMs for worker nodes only
AnswerD

Workers can be preemptible because batch jobs can tolerate interruptions; master remains on-demand for reliability.

Why this answer

Preemptible VMs are cheaper and suitable for fault-tolerant batch jobs. They can be used for worker nodes in Dataproc.

68
MCQmedium

A company uses Cloud Composer for pipeline orchestration. They need to define task dependencies where Task B and Task C can run in parallel after Task A, and Task D must run after both B and C complete. How should they define the DAG?

A.A >> B; B >> D; A >> C; C >> D
B.A >> B >> C >> D
C.A >> [B, C] >> D
D.A.set_downstream(B); B.set_upstream(C); C.set_downstream(D)
AnswerC

Correct: A executes, then B and C in parallel, then D after both.

Why this answer

Using bitshift operators: A >> [B, C] >> D sets B and C after A, and D after both B and C complete.
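To see why the list form fans out and then fans back in, here is a toy model of the bitshift syntax in plain Python. The `Task` class is a hypothetical stand-in, not Airflow's operator class; it only records downstream edges.

```python
class Task:
    """Minimal stand-in for an Airflow task; only tracks downstream edges."""

    def __init__(self, name):
        self.name = name
        self.downstream = set()

    def __rshift__(self, other):
        # Handles `task >> task` and `task >> [task, task]`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.downstream.add(target.name)
        return other  # returning the right-hand side allows chaining

    def __rrshift__(self, other):
        # Handles `[task, task] >> task`: a list has no `>>` of its own,
        # so Python falls back to the right operand's reflected method.
        for source in other:
            source.downstream.add(self.name)
        return self


A, B, C, D = (Task(n) for n in "ABCD")
A >> [B, C] >> D

edges = sorted((t.name, d) for t in (A, B, C, D) for d in t.downstream)
print(edges)
```

The result is the same four edges that option A spells out one by one; the list form is simply the more compact way to write them.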

69
MCQmedium

You are designing a Cloud Composer workflow that loads data from Cloud Storage into BigQuery, runs a Dataflow job to transform the data, and then triggers a Dataproc Spark job. After each step, you need to conditionally branch based on success or failure. Which Airflow feature allows you to pass messages between tasks to enable dynamic branching?

A.Sensors
B.XComs
C.TaskFlow API
D.DAG dependencies
AnswerB

XComs are the standard mechanism for passing messages between Airflow tasks, enabling branching based on results.

Why this answer

XComs (cross-communications) in Airflow allow tasks to exchange small amounts of data, such as status or metadata. This data can be used by BranchPythonOperator to conditionally choose downstream tasks.
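The pattern can be sketched without Airflow. In this toy model a plain dictionary stands in for XCom storage, and the task IDs (`run_dataflow_transform`, `notify_empty_load`) are hypothetical names. The one part taken from Airflow is the contract that a branch callable returns the ID of the task to run next.

```python
# A plain dict stands in for Airflow's XCom storage in this toy model.
xcom_store = {}


def load_to_bigquery(rows_loaded):
    """Upstream task: record a small status value for later tasks."""
    xcom_store["load_to_bigquery"] = {"rows_loaded": rows_loaded}


def choose_branch():
    """Branch callable: return the ID of the task that should run next."""
    result = xcom_store.get("load_to_bigquery", {})
    if result.get("rows_loaded", 0) > 0:
        return "run_dataflow_transform"
    return "notify_empty_load"


load_to_bigquery(42)
print(choose_branch())
```

Only small values such as counts, flags, or paths belong in this channel; the data itself stays in Cloud Storage or BigQuery.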

70
MCQeasy

You need to orchestrate a simple, linear workflow that calls several Cloud Functions and API endpoints sequentially with conditional logic. The workflow should be defined as code and have minimal overhead. Which GCP service should you use?

A.Cloud Tasks
B.Workflows
C.Dataflow
D.Cloud Composer
AnswerB

Workflows is serverless, YAML/JSON-based, and perfect for simple orchestrations.

Why this answer

Workflows is a serverless orchestration service that uses YAML/JSON to define workflows. It is ideal for simpler, linear or conditional orchestrations without the need for full Airflow infrastructure.

71
MCQmedium

A data engineer uses Cloud Composer to orchestrate a daily batch pipeline. A downstream task should only start after an upstream BigQuery load job finishes successfully and a specific file appears in Cloud Storage. Which combination of operators should the engineer use in the Airflow DAG?

A.BigQueryInsertJobOperator with wait_for_downstream=True
B.BigQueryInsertJobOperator and GCSObjectExistenceSensor with upstream dependency
C.DataflowPythonOperator and GCSObjectExistenceSensor
D.BigQueryOperator and FileSensor with downstream dependency
AnswerB

Correct: BigQueryInsertJobOperator performs the load, GCSObjectExistenceSensor polls for the file, and upstream dependency ensures order.

Why this answer

The BigQueryInsertJobOperator (or BigQueryOperator) handles the load job, and the GoogleCloudStorageObjectExistenceSensor (or GCSObjectExistenceSensor) waits for the file. Task dependencies link them.

72
MCQhard

A BigQuery table has a REQUIRED column 'user_id' that now needs to accept NULL values due to upstream data changes. You want to alter the schema with minimal downtime and no data loss. What should you do?

A.Run `ALTER TABLE dataset.table ALTER COLUMN user_id DROP NOT NULL;`
B.Use the bq command: `bq update --set_nullable_fields user_id dataset.table`
C.Create a view that casts user_id to NULLABLE and use the view instead.
D.Drop the table and recreate it with the column as NULLABLE.
AnswerA

This BigQuery DDL statement changes the column to nullable without downtime or data loss.

Why this answer

BigQuery allows changing a column from REQUIRED to NULLABLE using the ALTER TABLE ... ALTER COLUMN ... DROP NOT NULL statement. This operation is a metadata change and does not require table recreation or data copy. Dropping and recreating the table would cause downtime and data loss.

Using a view is a workaround but doesn't change the underlying schema. Exporting and reloading is disruptive.

73
MCQhard

A data team needs to share a BigQuery dataset with another business unit. They want to provide a point-in-time snapshot of the data without incurring additional storage costs for the copy. Which BigQuery feature should they use?

A.BigQuery table snapshots
B.BigQuery table clones
C.BigQuery authorized views
D.BigQuery export to Cloud Storage
AnswerB

Clones are writable and share storage with the base table, so no extra cost for the initial copy. They can be updated independently.

Why this answer

Clones use the same underlying storage as the source table; snapshots also share storage but are immutable. Both are cost-effective. For regular updates, clones are more flexible.

74
MCQmedium

Your company stores sensitive customer data in Cloud Storage. You need to inspect the data for personally identifiable information (PII) and de-identify it before sharing with a third party. Which Google Cloud service should you use?

A.Security Command Center
B.Dataplex
C.Cloud Data Loss Prevention (DLP)
D.Cloud KMS
AnswerC

DLP is designed for inspecting and de-identifying sensitive data.

Why this answer

Cloud Data Loss Prevention (DLP) is the correct service because it is specifically designed to inspect, classify, and de-identify sensitive data such as PII in Cloud Storage. It provides built-in infoType detectors for over 150 types of PII and supports de-identification techniques like masking, tokenization, and encryption. This directly matches the requirement to inspect and de-identify data before sharing with a third party.

Exam trap

The exam often tests the distinction between data security services (like Cloud KMS for encryption) and data inspection/de-identification services (like Cloud DLP), leading candidates to mistakenly choose Cloud KMS because they associate 'de-identify' with encryption, but Cloud KMS only manages keys, not the inspection or transformation of data content.

How to eliminate wrong answers

Option A is wrong because Security Command Center is a security and risk management platform that provides threat detection, vulnerability scanning, and compliance monitoring, but it does not have native capabilities to inspect or de-identify PII in data objects. Option B is wrong because Dataplex is a data governance and management service that helps organize, catalog, and manage data across lakes and warehouses, but it lacks built-in PII inspection and de-identification features. Option D is wrong because Cloud KMS is a key management service for creating, storing, and managing encryption keys, but it does not inspect data for PII or perform de-identification; it only provides encryption/decryption operations.

75
MCQhard

A company uses BigQuery flat-rate pricing with 500 slots purchased as a committed use discount. During peak hours, they need additional capacity but do not want to buy more committed slots. They have a secondary project used for ad-hoc queries by analysts. How can they provide burst capacity to the primary project during peak times without increasing committed spend?

A.Create flex slots in the secondary project, create a reservation in the secondary project, and assign the reservation to the primary project.
B.Enable autoscaling slot management in the primary project's reservation, allowing slots to scale up based on demand.
C.Upgrade the primary project's edition to Enterprise Plus to allow bursting.
D.Purchase additional committed use slots in the primary project and apply them to the reservation.
AnswerA

Flex slots provide temporary slots; they can be assigned to the primary project via a reservation in the secondary project.

Why this answer

Flex slots allow temporary capacity in a separate project, and can be assigned to the primary project via reservations. This provides burst capacity without committing to long-term purchases.
