Knowledge + Practice

CCNA Ensuring solution quality Questions

45 questions · Ensuring solution quality · All types, answers revealed

1

MCQhard

A financial services company uses Dataflow pipelines with late data handling. They need to ensure that all late-arriving data is processed correctly but also want to control costs. What is the best configuration?

A.Use a global window with a very long allowed lateness (e.g., 7 days).

B.Use session windows with a gap duration of 1 hour and allowed lateness of 2 days.

C.Use sliding windows with a short allowed lateness (e.g., 10 minutes) and a side input containing historical data.

D.Use fixed windows with allowed lateness set to the maximum expected delay (e.g., 2 days) and a trivial watermark.

AnswerD

Fixed windows with a realistic allowed lateness capture late data without excessive state cost, and a trivial watermark ensures no data is dropped.

Why this answer

Option D is correct because using fixed windows with allowed lateness set to the maximum expected delay and a trivial watermark balances completeness and cost. Option A (global window with long allowed lateness) can cause high state cost. Option B (session windows) may merge late data incorrectly. Option C (sliding windows with short allowed lateness and side input) is complex and may miss data.
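The interaction between fixed windows and allowed lateness can be illustrated with a small model. This is a plain-Python sketch, not Dataflow or Beam code: the one-hour window size and the helper names `window_end` and `is_accepted` are assumptions made for illustration, and only the 2-day lateness comes from the question.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=1)            # hypothetical fixed window size
ALLOWED_LATENESS = timedelta(days=2)   # maximum expected delay from the question
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_end(event_time: datetime) -> datetime:
    """Return the end of the fixed window that contains event_time."""
    elapsed = event_time - EPOCH
    start = EPOCH + (elapsed // WINDOW) * WINDOW
    return start + WINDOW


def is_accepted(event_time: datetime, watermark: datetime) -> bool:
    """Keep an element while the watermark is before window end + allowed lateness."""
    return watermark < window_end(event_time) + ALLOWED_LATENESS
```

In this model, window state has to be retained until the watermark passes the window end plus the allowed lateness, which is why a larger lateness value costs more state.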

2

MCQhard

A company uses Cloud Spanner for a global transactional application. During peak hours, commit latency increases by over 50%. Which configuration issue is the most likely root cause?

A.Insufficient compute capacity (nodes) allocated to the instance.

B.Hotspotting due to monotonically increasing primary keys.

C.Incorrect indexing of secondary indexes.

D.Network bandwidth constraints between regions.

AnswerB

This is a common cause of latency spikes in Spanner; use hash-prefixed keys to distribute writes.

Why this answer

Monotonically increasing primary keys in Spanner create hot spots: consecutive writes land at the end of the key space, so they all hit a single split (and the server that owns it), causing contention and increased commit latency.
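The hash-prefix idea mentioned in the answer can be sketched in plain Python. The shard count and the helper name `sharded_key` are assumptions for illustration. This is not Spanner client code; it only shows how a stable hash of a sequential ID yields a prefix that spreads neighbouring IDs across the key space.

```python
import hashlib
from typing import Tuple

NUM_SHARDS = 16  # hypothetical shard count


def sharded_key(order_id: int) -> Tuple[int, int]:
    """Prefix a sequential ID with a stable, hash-derived shard number."""
    digest = hashlib.sha256(str(order_id).encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % NUM_SHARDS
    return (shard, order_id)


# Consecutive IDs no longer sort next to each other once the shard comes first.
keys = [sharded_key(i) for i in range(1000, 1032)]
shards_used = {shard for shard, _ in keys}
```

The trade-off is that range scans over the original ID now need to fan out across all shard prefixes.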

3

Multi-Selectmedium

A data engineer needs to monitor the performance of BigQuery queries to identify opportunities for optimization. Which TWO metrics should they focus on? (Choose two.)

Select 2 answers

A.Slot usage

B.Data scanned per query

C.Query execution time

D.Number of tables joined

E.Number of users

AnswersA, B

Monitoring slot usage helps identify query resource consumption and opportunities to optimize.

Why this answer

Options A and B are correct: slot usage indicates resource consumption, and data scanned per query directly correlates with cost and performance. Option C (query execution time) is important but can be affected by many factors. Option D (number of tables joined) is not a direct metric. Option E (number of users) is irrelevant.

4

Multi-Selecteasy

A company uses Cloud Logging to monitor application errors. They want to set up real-time notifications for critical errors. Which two actions are essential? (Choose two.)

Select 2 answers

A.Create a log-based metric for critical errors.

B.Export logs to BigQuery for later analysis.

C.Create a Cloud Pub/Sub notification directly on the log sink.

D.Enable VPC Flow Logs to capture network traffic.

E.Set up a Cloud Monitoring alert policy based on the log-based metric.

AnswersA, E

A log-based metric extracts error counts from logs, enabling quantitative alerting.

Why this answer

First, create a log-based metric to count critical error events. Then, set up an alert policy in Cloud Monitoring that triggers when that metric crosses a threshold.

5

MCQeasy

A data pipeline ingests streaming data from Pub/Sub into BigQuery via Dataflow. Recently, the pipeline has been failing with 'deadline exceeded' errors. What is the most likely cause?

A.The BigQuery streaming quota is exceeded.

B.Dataflow workers are underutilized due to batch size settings.

C.Dataflow autoscaling is disabled.

D.The Pub/Sub subscription's acknowledgement deadline is too short for the processing time.

AnswerD

A short acknowledgment deadline causes messages to be redelivered, leading to repeated processing attempts and eventual deadline exceeded errors.

Why this answer

Option D is correct because 'deadline exceeded' errors in a Dataflow pipeline reading from Pub/Sub indicate that the subscriber is taking longer to process messages than the acknowledgement deadline allows. When the deadline expires, Pub/Sub redelivers the message, causing duplicate processing and eventual pipeline failure. This is a common issue when processing time exceeds the default 10-second acknowledgement deadline.

Exam trap

Google Cloud often tests the distinction between resource quota errors (like BigQuery streaming quota) and Pub/Sub-specific timeout errors, trapping candidates who confuse 'deadline exceeded' with general quota exhaustion.

How to eliminate wrong answers

Option A is wrong because BigQuery streaming quota exceeded would produce 'quota exceeded' or 'rate limit exceeded' errors, not 'deadline exceeded' errors. Option B is wrong because underutilized workers due to batch size settings would cause poor performance or backpressure, not 'deadline exceeded' errors; the error is about processing time vs. acknowledgement deadline, not worker utilization. Option C is wrong because disabled autoscaling would lead to resource exhaustion or latency, but the specific 'deadline exceeded' error is tied to Pub/Sub's acknowledgement mechanism, not Dataflow's scaling behavior.

6

MCQhard

Refer to the exhibit. A team received this error when running a query. Which optimization should they apply first?

A.Use clustering on the date column.

B.Reduce the number of rows by aggregating in a subquery.

C.Run the query as a batch job.

D.Add a WHERE clause on a partitioning column.

AnswerD

If the table is partitioned by date, this prunes partitions and reduces data scanned, directly addressing the resource limit.

Why this answer

Option D is correct because adding a WHERE clause on a partitioning column (if the table is partitioned by date) allows BigQuery to prune partitions, significantly reducing data scanned. Option A (clustering) is helpful but less effective than partitioning for date-range queries. Option B (reducing rows early by aggregating in a subquery) is a good practice but not as impactful as partition pruning. Option C (batch job) does not reduce resource usage.
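The effect of partition pruning on data scanned can be shown with a toy calculation. The table layout below (30 daily partitions of 10 GiB each) and the helper `bytes_scanned` are made up for illustration. This is a model of the idea, not BigQuery code.

```python
from datetime import date

# Hypothetical table: bytes stored per daily partition.
partitions = {date(2024, 1, d): 10 * 1024**3 for d in range(1, 31)}


def bytes_scanned(start=None, end=None) -> int:
    """Sum the sizes of partitions that survive a filter on the partition column."""
    return sum(
        size
        for day, size in partitions.items()
        if (start is None or day >= start) and (end is None or day <= end)
    )


full_scan = bytes_scanned()                                        # no filter
pruned_scan = bytes_scanned(date(2024, 1, 29), date(2024, 1, 30))  # two days only
```

With these assumed sizes, the filtered query reads 2 of the 30 partitions, so the scan shrinks by the same factor.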

7

MCQmedium

A data platform uses Cloud Spanner for transactional data. They are experiencing high latency during write-heavy periods. To maintain solution quality, what configuration change is most effective?

A.Use interleaved tables to reduce the number of split operations.

B.Enable online schema changes.

C.Increase the number of nodes in the Cloud Spanner instance.

D.Manually split the table using ALTER TABLE statements.

AnswerA

Interleaved tables store related rows in the same split, minimizing distributed transaction overhead.

Why this answer

Option A is correct because interleaved tables improve data locality, reducing the number of splits and distributed commits, thus reducing write latency. Option B (online schema changes) does not directly affect write performance. Option C (increasing nodes) can increase throughput but may increase latency due to more distributed transactions. Option D (manual splitting) is not recommended and may worsen performance.

8

MCQhard

A data analyst runs a complex SQL query in BigQuery that joins multiple large tables and receives the above error. Which action is most likely to resolve the issue?

A.Use a larger number of workers in the query execution.

B.Use smaller tables by sampling data.

C.Add clustering on join columns.

D.Increase the number of slots allocated to the project.

AnswerD

More slots provide more memory and CPU, reducing resource exceeded errors.

Why this answer

The error indicates that the query exceeded the available slot resources in the BigQuery project. Increasing the number of slots allocated to the project (option D) directly addresses this by providing more compute capacity for parallel query execution, which is the correct action to resolve resource exhaustion in BigQuery's serverless architecture.

Exam trap

Google Cloud often tests the misconception that performance tuning (e.g., clustering or sampling) can resolve resource exhaustion errors, when in fact the root cause is insufficient compute capacity that must be addressed by increasing slot allocation.

How to eliminate wrong answers

Option A is wrong because BigQuery automatically manages parallelism; manually specifying a larger number of workers is not supported and would not increase slot capacity. Option B is wrong because sampling data reduces accuracy and may not reflect the full dataset, which is not a valid solution for resource exhaustion—it changes the query result rather than fixing the resource issue. Option C is wrong because clustering on join columns improves query performance and reduces data scanned, but it does not increase the number of slots available; the error is about insufficient compute resources, not about inefficient data access patterns.

9

MCQmedium

A company uses GKE to run microservices. They want to ensure the application restarts automatically if it becomes unresponsive. Which probes should they configure in their pod spec?

A.Startup probes only.

B.Liveness probes only.

C.Readiness probes only.

D.Both readiness and liveness probes.

AnswerD

Using both ensures traffic is only sent to ready pods and unresponsive pods are restarted, providing complete health management.

Why this answer

Liveness probes determine when to restart a container; readiness probes determine when a pod is ready to serve traffic. Both are needed for full resilience.
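On the application side, the two probes are usually backed by two separate endpoints. The sketch below uses only the Python standard library. The paths `/healthz` and `/readyz`, the `AppState` class, and the status codes chosen for failures are assumptions for illustration, not something the question specifies.

```python
from http.server import BaseHTTPRequestHandler


class AppState:
    alive = True    # set to False when the process is stuck and should be restarted
    ready = False   # set to True once the service can accept traffic


def probe_status(path: str, state: AppState) -> int:
    """Map a probe path to the HTTP status the kubelet would see."""
    if path == "/healthz":   # liveness: failure leads to a container restart
        return 200 if state.alive else 500
    if path == "/readyz":    # readiness: failure removes the pod from load balancing
        return 200 if state.ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    state = AppState()

    def do_GET(self):
        self.send_response(probe_status(self.path, self.state))
        self.end_headers()
```

`ProbeHandler` could be served with `http.server.HTTPServer` on whatever port the pod spec's probes point at. Keeping the two checks separate lets a pod that is still warming up report "alive but not ready" instead of being restarted.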

10

MCQeasy

A data pipeline using Dataflow processes streaming data. Late-arriving events are currently being dropped. How should the team modify the pipeline to ensure late data is processed correctly?

A.Use side inputs to join late data with the main stream.

B.Use streaming inserts into BigQuery and ignore late data.

C.Configure triggers with allowed lateness and accumulation of late firings.

D.Increase the window duration to cover late data.

AnswerC

This is the standard pattern for handling late data in Dataflow: set allowed lateness and use triggers to emit on late arrival.

Why this answer

Dataflow allows setting allowed lateness on windows; trigger configuration can handle late data by emitting updates.
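What "allowed lateness with accumulating late firings" means can be modelled in a few lines of plain Python. The class `AccumulatingWindowSum` and the integer-second timestamps are assumptions made for illustration. In an actual Beam pipeline this behaviour would be configured on the windowing transform rather than hand-written.

```python
from collections import defaultdict


class AccumulatingWindowSum:
    """Toy model of per-window sums with allowed lateness and accumulating panes."""

    def __init__(self, window_size: int, allowed_lateness: int):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.sums = defaultdict(int)

    def add(self, event_time: int, value: int, watermark: int):
        """Return (window_start, updated_sum), or None if the element is too late."""
        start = event_time - event_time % self.window_size
        if watermark >= start + self.window_size + self.allowed_lateness:
            return None  # window state has expired, so the element is dropped
        self.sums[start] += value  # accumulating: each firing includes earlier data
        return (start, self.sums[start])
```

Each late element that is still within the allowed lateness produces an updated total for its window, so downstream consumers see a corrected result instead of losing the element.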

11

MCQmedium

Refer to the exhibit. A Dataflow pipeline is failing intermittently with the shown error. Which step should the team take to ensure data quality and prevent such errors?

A.Increase the number of workers to process the data faster.

B.Add a monitoring alert on the 'system_lag' metric.

C.Use a strongly typed schema for the PCollection and let Beam automatically reject malformed data.

D.Modify the pipeline to handle parsing failures by sending invalid records to a dead letter queue.

AnswerD

A dead letter queue isolates bad data for later inspection without failing the pipeline.

Why this answer

Option D is correct because the error indicates that the pipeline is failing due to malformed or unparseable data. By sending invalid records to a dead letter queue (DLQ), the pipeline can continue processing valid data while capturing and isolating bad records for later analysis or reprocessing. This pattern is a standard data quality practice in Apache Beam and Dataflow, ensuring that transient or corrupt data does not cause pipeline failures.
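The dead letter pattern described above can be sketched without any pipeline framework. The function name `route_records` and the required field `user_id` are hypothetical. In a Beam pipeline the same split is commonly expressed with tagged outputs, with the second output written to a separate sink.

```python
import json


def route_records(raw_records):
    """Split raw JSON strings into parsed records and dead-letter entries."""
    valid, dead_letter = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict) or "user_id" not in record:
                raise ValueError("missing required field 'user_id'")
            valid.append(record)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Keep the original payload and the reason so it can be replayed later.
            dead_letter.append({"raw": raw, "error": str(exc)})
    return valid, dead_letter
```

Storing the raw payload alongside the error message is what makes later inspection and reprocessing possible.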

Exam trap

Google Cloud often tests the distinction between scaling solutions (like increasing workers) and data quality patterns (like dead letter queues), trapping candidates who confuse performance optimization with error handling.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers addresses throughput and latency, not data quality or malformed data errors; it does not prevent parsing failures. Option B is wrong because monitoring the 'system_lag' metric tracks pipeline latency, not data quality issues; it would not prevent or handle malformed records. Option C is wrong because while strongly typed schemas can help catch type mismatches at compile time, they do not automatically reject malformed data at runtime in Beam; the pipeline would still fail if a record cannot be parsed into the schema, and Beam does not have built-in automatic rejection to a dead letter queue without explicit handling.

12

Multi-Selecthard

A company uses Cloud Build to deploy containerized applications. They want to ensure build and deployment quality. Which THREE steps should they include in their CI/CD pipeline? (Choose three.)

Select 3 answers

A.Scan container images for vulnerabilities using Container Analysis.

B.Run unit tests after deployment.

C.Deploy directly to production on every commit.

D.Use canary deployments with gradual traffic shifting.

E.Pin base image digests in Dockerfile.

AnswersA, D, E

Vulnerability scanning ensures images are secure before deployment.

Why this answer

Options A, D, and E are correct: container scanning catches vulnerabilities, canary deployments reduce risk, and pinning base image digests ensures reproducibility. Option B (unit tests after deployment) is too late. Option C (direct deployment to production) bypasses safety checks.

13

MCQeasy

Refer to the exhibit. A subscriber is unable to pull messages from the topic. What is the most likely cause?

A.The service account has the subscriber role but the topic is not configured correctly.

B.The service account needs roles/pubsub.viewer to list subscriptions.

C.No subscription has been created for the topic.

D.The service account lacks roles/pubsub.publisher.

AnswerC

A subscription is required to pull messages; the topic only provides the ability to publish.

Why this answer

Option C is correct because a subscription must exist for pulling messages; the topic alone is not enough. Option A is wrong because holding the subscriber role does not help if no subscription exists to pull from. Option B (viewer role) is irrelevant. Option D (publisher role) is not needed for subscribers.

14

MCQeasy

A data pipeline processes streaming data with Dataflow. The team notices occasional data duplication in BigQuery. What is the best approach to ensure exactly-once processing?

A.Use Pub/Sub with at-least-once delivery and deduplicate in BigQuery using a unique identifier.

B.Configure Dataflow with exactly-once sinks using file staging and deduplication.

C.Use Cloud Functions to deduplicate messages before they enter the pipeline.

D.Enable idempotent writes in BigQuery.

AnswerB

Dataflow's exactly-once sink mechanism ensures each record is written exactly once, preventing duplicates.

Why this answer

Option B is correct because Dataflow's exactly-once sink mechanism, which uses file staging and deduplication, ensures no duplicates. Option A (at-least-once delivery) can cause duplicates unless dedup is applied, but that's not automatic. Option C adds unnecessary complexity. Option D is incorrect because BigQuery does not natively support idempotent writes.

15

MCQmedium

A company runs a real-time anomaly detection system on Google Cloud. Streaming data from IoT devices is ingested via Pub/Sub, processed by Dataflow (Apache Beam), and results are written to Bigtable for low-latency serving. Recently, the system has been experiencing increased latency and occasional data loss. The Dataflow pipeline shows high system lag and backlog in Pub/Sub. The Bigtable cluster has 3 nodes and is reporting high CPU utilization (over 90%). The team suspects the issue is with the pipeline configuration. They have already verified that there are no errors in the pipeline code and no network issues. Which action should they take to resolve the issue?

A.Increase the number of Bigtable nodes to handle the write throughput.

B.Change the Dataflow worker machine type to n2-standard-8.

C.Decrease the batch size in the Dataflow pipeline to reduce latency.

D.Increase the number of Dataflow workers to process messages faster.

AnswerA

High CPU utilization suggests Bigtable is overwhelmed; adding nodes increases capacity.

Why this answer

The high CPU utilization on Bigtable (over 90%) indicates that the cluster is saturated and cannot keep up with the write throughput from Dataflow. This causes backpressure in the pipeline, leading to increased system lag and backlog in Pub/Sub, and eventually data loss when Pub/Sub messages expire. Increasing the number of Bigtable nodes directly addresses the bottleneck by distributing the write load and reducing CPU pressure, which allows the pipeline to drain the backlog and reduce latency.

Exam trap

Google Cloud often tests the misconception that scaling Dataflow workers or changing machine types always resolves pipeline latency, but the trap here is that the bottleneck is at the sink (Bigtable), so you must scale the sink first to relieve backpressure.

How to eliminate wrong answers

Option B is wrong because changing the Dataflow worker machine type to n2-standard-8 would increase compute capacity for processing, but the bottleneck is at the Bigtable sink, not the Dataflow workers; the pipeline is already experiencing backpressure from Bigtable, so more worker CPU would not resolve the write throughput limitation. Option C is wrong because decreasing the batch size in Dataflow would increase the number of smaller writes to Bigtable, which actually increases overhead and CPU usage on Bigtable, worsening the latency and backlog issue. Option D is wrong because increasing the number of Dataflow workers would increase the parallelism of writes to Bigtable, further amplifying the write pressure on the already saturated Bigtable cluster, making the high CPU utilization and backlog worse.

16

MCQhard

A company runs a batch processing job on Dataproc that uses Apache Spark to process 500 GB of data daily. The job completes successfully but takes 4 hours. The team wants to reduce the runtime to under 2 hours without increasing cost. What should they do?

A.Use preemptible VMs for worker nodes and increase the number of workers.

B.Increase the master node's machine type to n2-standard-8.

C.Increase the machine type of worker nodes to n2-highmem-8.

D.Migrate the job to Dataflow with autoscaling enabled.

AnswerA

Preemptible VMs are cheaper, allowing more workers for the same cost, reducing runtime.

Why this answer

Preemptible VMs cost significantly less than standard VMs (about 60-80% discount). By using preemptible VMs for worker nodes, you can increase the number of workers (and thus parallelism) without increasing cost. This directly reduces runtime by distributing the 500 GB workload across more executors, while the cost savings from preemptible VMs offset the additional nodes.
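The cost argument is simple arithmetic, worked through below. All numbers except the 4-hour baseline, the 2-hour target, and the 60-80% discount range are hypothetical, and the sketch ignores the master node and any workers that must stay on standard VMs. It also assumes runtime scales roughly linearly with worker count, which real Spark jobs only approximate.

```python
STANDARD_PRICE = 0.40   # hypothetical price per worker-hour, in dollars
DISCOUNT = 0.70         # assumed value inside the 60-80% range quoted above
preemptible_price = STANDARD_PRICE * (1 - DISCOUNT)

baseline_workers = 10   # hypothetical current cluster size
baseline_hours = 4      # current runtime from the question
budget = baseline_workers * baseline_hours * STANDARD_PRICE

target_hours = 2        # target runtime from the question
# Largest preemptible worker count that fits the same budget for a 2-hour run.
affordable_workers = int(budget / (preemptible_price * target_hours))
```

Under these assumptions the same budget pays for about 66 preemptible workers for two hours, well above the roughly 20 that perfect linear scaling would need to halve the runtime.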

Exam trap

Google Cloud often tests the trade-off between cost and performance by making candidates think that upgrading machine types (more CPU/memory) is the only way to speed up a job, ignoring that preemptible VMs allow scaling out (more nodes) without increasing cost.

How to eliminate wrong answers

Option B is wrong because increasing the master node's machine type (e.g., to n2-standard-8) improves driver capacity but does not accelerate data processing; Spark's bottleneck is typically worker parallelism and memory, not the driver. Option C is wrong because increasing worker node machine type (e.g., to n2-highmem-8) increases cost per node, and without adding more workers, the parallelism remains the same, so runtime may not drop below 2 hours while cost increases. Option D is wrong because migrating to Dataflow does not inherently reduce cost; Dataflow uses different pricing (per second of vCPU/memory) and autoscaling may increase cost if the job requires more resources to meet the 2-hour target, and the question explicitly requires no cost increase.

17

MCQeasy

A team deploys a new version of a Cloud Function. After deployment, error rates increase significantly. What is the most efficient way to diagnose the cause?

A.Deploy a debug version with additional logging.

B.Check Cloud Logging for error stacks and exceptions.

C.Increase the function timeout and retry settings.

D.Immediately rollback to the previous version.

AnswerB

Logs provide immediate insight into the error, allowing targeted debugging.

Why this answer

Cloud Logging captures function execution logs and error stacks; reviewing them is the fastest way to understand the failure.

18

Multi-Selecthard

A company uses Cloud Dataproc for ephemeral clusters to run batch jobs. They want to ensure job reliability and data quality. Which two configuration options should they use? (Choose two.)

Select 2 answers

A.Enable preemptible VMs for cost savings.

B.Use initialization actions for cluster setup.

C.Enable idle timeout to automatically delete clusters.

D.Use custom machine types for better performance.

E.Use graceful decommissioning of workers.

AnswersB, E

Initialization actions guarantee required software and configurations are present, improving job consistency.

Why this answer

Initialization actions enable consistent cluster setup (install libraries, configs), and graceful decommissioning ensures in-progress tasks complete before scaling down, preventing data loss.

19

MCQeasy

A media company runs a batch data pipeline on Cloud Dataflow that ingests log files from Cloud Storage, transforms them, and writes results to BigQuery for analytics. The pipeline runs daily and has been stable for months. Recently, the source log format changed: a new optional field was added to some records. The pipeline started failing with ParseErrors for rows that contain the new field. The error logs show that the Dataflow job uses a hardcoded JSON schema that does not include the new field. The Dataflow pipeline logs are written to Stackdriver Logging, but no alerts are configured. The team wants to ensure that future schema changes do not break the pipeline and that failures are detected promptly. The team has limited experience with streaming and wants to keep the batch approach. Which course of action should the team take to improve solution quality?

A.Create a Cloud Monitoring alert on any PipelineError log entries from the Dataflow job, and set up a runbook to manually fix schema mismatches within one hour.

B.Schedule a Cloud Function to run every hour that checks the latest log file headers and compares them to the pipeline schema, sending an alert if differences are found.

C.Use BigQuery dry run queries to validate the schema before loading data, and if a mismatch is detected, block the pipeline run and notify the team via email.

D.Implement schema validation and evolution using a schema registry (e.g., AVRO) in the Dataflow pipeline, and configure Stackdriver alerts on pipeline failure or error logs.

AnswerD

A schema registry allows the pipeline to handle new fields gracefully by using a flexible schema, and monitoring alerts ensure timely detection of any remaining issues.

Why this answer

Option D is correct because validating records against a schema registry (e.g., an AVRO schema registry) and supporting schema evolution lets the pipeline handle new optional fields without failing. Additionally, configuring Stackdriver alerts on pipeline failure logs ensures prompt detection of issues. Option A is incorrect because it only addresses detection after the fact, not prevention. Option B is incorrect because scheduling a job to check every hour is reactive and inefficient. Option C is incorrect because a BigQuery dry run does not prevent pipeline failures.
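The difference between a hardcoded schema and one that tolerates evolution can be sketched in plain Python. The field names, the `SCHEMA` layout, and the `parse_row` helper are invented for illustration. A real pipeline would typically delegate this to a schema format such as Avro rather than hand-rolling it.

```python
import json

# Hypothetical schema: optional fields carry a default instead of failing the row.
SCHEMA = {
    "timestamp": {"required": True},
    "url": {"required": True},
    "referrer": {"required": False, "default": None},
}


def parse_row(line: str) -> dict:
    """Parse one JSON log line, tolerating missing optional and unknown fields."""
    raw = json.loads(line)
    row = {}
    for name, spec in SCHEMA.items():
        if name in raw:
            row[name] = raw[name]
        elif spec["required"]:
            raise ValueError(f"missing required field {name!r}")
        else:
            row[name] = spec["default"]
    # Record unknown fields instead of raising, so new fields can be alerted on.
    row["_unknown_fields"] = sorted(set(raw) - set(SCHEMA))
    return row
```

A non-empty `_unknown_fields` list is a natural signal to log and alert on, which covers the "detect promptly" half of the requirement without stopping the batch run.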

20

MCQeasy

A team developed a microservice that writes logs to stdout. They want to centralize logs for analysis. Which GCP service should they use to automatically collect and store logs?

A.Install the Cloud Logging agent on the VM running the microservice.

B.Publish logs to a Pub/Sub topic and later store them.

C.Write logs directly to Cloud Storage.

D.Use the Cloud Logging client library (google-cloud-logging) for the microservice's language.

AnswerD

The client library automatically sends structured logs to Cloud Logging, enabling centralized analysis.

Why this answer

Option D is correct because the Cloud Logging client library sends the microservice's logs to Cloud Logging for centralized storage and analysis. Option A (Cloud Logging agent) is for VMs, not containers. Option B (Pub/Sub) is for messaging, not log collection. Option C (Cloud Storage) is for object storage.

21

MCQmedium

A team is designing a data lake on Google Cloud using Cloud Storage and BigQuery. They need to ensure that sensitive data (e.g., PII) is encrypted at rest and have the ability to audit access. Which approach meets these requirements?

A.Use Customer-Managed Encryption Keys (CMEK) and enable VPC Service Controls.

B.Use Customer-Managed Encryption Keys (CMEK) and enable Cloud Audit Logs.

C.Use Default Encryption and enable Data Loss Prevention (DLP) API.

D.Use Customer-Supplied Encryption Keys (CSEK) and enable VPC Service Controls.

AnswerB

CMEK provides control over encryption keys, and Cloud Audit Logs record access to data.

Why this answer

Option B is correct because Customer-Managed Encryption Keys (CMEK) allow the team to control and manage the encryption keys used to protect data at rest in Cloud Storage and BigQuery, while enabling Cloud Audit Logs provides the necessary audit trail for access to both the data and the keys. This combination directly satisfies the requirements for encryption at rest and auditability.

Exam trap

Google Cloud often tests the distinction between encryption key management (CMEK vs. CSEK vs. Default) and security controls (VPC Service Controls vs. Audit Logs), leading candidates to conflate network perimeter controls with audit capabilities.

How to eliminate wrong answers

Option A is wrong because VPC Service Controls provide network-based security boundaries to prevent data exfiltration, but they do not provide audit logging of access to data or keys, which is a separate requirement. Option C is wrong because Default Encryption uses Google-managed keys, which do not give the team control over encryption keys, and the DLP API is for inspecting and classifying sensitive data, not for encryption at rest or audit logging. Option D is wrong because Customer-Supplied Encryption Keys (CSEK) require the customer to manage their own keys outside Google Cloud, which adds operational complexity and does not integrate with Cloud Audit Logs for key access auditing; VPC Service Controls again do not provide audit logging.

22

Drag & Dropmedium

Drag and drop the steps to migrate an on-premises MySQL database to Cloud SQL using Database Migration Service into the correct order.

Steps

Order

Why this order

Database Migration Service enables minimal-downtime migrations using replication.

23

MCQeasy

A company uses Cloud Monitoring to track application latency. They notice a spike in latency every 30 minutes. What is the best initial step to diagnose the issue?

A.Increase the number of instances to handle the load.

B.Enable Cloud Trace for all requests.

C.Check if scheduled jobs or cron tasks overlap.

D.Change the alert threshold to ignore the spikes.

AnswerC

Regularly recurring spikes suggest a scheduled job causing contention; investigating this is the most direct diagnostic step.

Why this answer

Recurring spikes at regular intervals often indicate a scheduled process (e.g., cron job, batch job) that runs every 30 minutes. Checking for overlapping scheduled jobs is the most efficient first step before scaling or other actions.

24

MCQeasy

A company uses Cloud Functions to process events from Cloud Storage. They notice that occasionally functions are not triggered. What should they check first to ensure solution quality?

A.Verify that the Cloud Storage bucket has notifications configured for the correct event type.

B.Check the logs for function execution.

C.Increase the function memory allocation.

D.Increase the function timeout.

AnswerA

A misconfigured notification will prevent the function from being triggered at all.

Why this answer

Option A is correct because the first step is to verify the Cloud Storage bucket notification configuration, as a misconfigured trigger will cause missed events. Option B (logs) is helpful, but only after verifying the trigger configuration. Option C (memory) is unrelated. Option D (function timeout) does not cause missing triggers.

25

MCQhard

A company runs a data pipeline that ingests clickstream events from multiple websites into Cloud Pub/Sub, then processed by Dataflow to generate user sessions, and written to BigQuery for analytics. The pipeline runs 24/7. Recently, the team noticed that some sessions are incomplete due to missing events, and data quality checks reveal that about 2% of sessions have gaps of more than 30 minutes. The pipeline uses fixed 30-minute windows for sessionization, with allowed lateness set to 10 minutes. They have Cloud Monitoring dashboards tracking system throughput and pipeline lag but do not have custom metrics tracking per-element delays or watermark progress. The team suspects two possible causes: (a) the Pub/Sub subscription accumulates backlog and some messages are delivered after the window end; (b) the Dataflow job has insufficient workers causing checkpoint failures. The team needs to determine the root cause and improve data quality. What is the best first course of action?

A.Change the Pub/Sub subscription to pull mode with more aggressive flow control settings.

B.Increase the number of Dataflow workers and set autoscaling to the maximum allowed.

C.Modify the Dataflow pipeline to use session windows instead of fixed windows, and increase allowed lateness to 60 minutes.

D.Set up a Dataflow monitoring dashboard that tracks the watermark delay and create an alert when it exceeds the allowed lateness.

AnswerD

This directly monitors the pipeline's ability to process events within the window, confirming if late data is the root cause.

Why this answer

To determine whether late-arriving messages are the issue, the team should monitor the Dataflow watermark delay, which indicates how far behind the pipeline is compared to the event time. Setting up a metric and alert on watermark delay > allowed lateness will confirm if late data is being dropped.
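The alert condition proposed above reduces to a comparison between the watermark delay and the allowed lateness. A minimal sketch follows, with hypothetical helper names. How the watermark is obtained from the Dataflow job's metrics is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(minutes=10)  # value taken from the scenario


def watermark_delay(now: datetime, watermark: datetime) -> timedelta:
    """How far the pipeline's event-time watermark trails wall-clock time."""
    return now - watermark


def should_alert(now: datetime, watermark: datetime,
                 threshold: timedelta = ALLOWED_LATENESS) -> bool:
    """Alert when the watermark delay exceeds the allowed lateness."""
    return watermark_delay(now, watermark) > threshold
```

If this alert fires around the times the incomplete sessions occur, that supports the backlog hypothesis (a). If it never fires, the team can turn to the worker-capacity hypothesis (b).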

26

Drag & Dropmedium

Drag and drop the steps to create a Cloud Composer environment for Apache Airflow into the correct order.

Steps

Order

Why this order

Cloud Composer provides a managed Airflow environment for orchestrating workflows.

27

Matchingmedium

Match each machine learning term to its description.

Concepts

Matches

Model trained on labeled data

Model trained on unlabeled data

Agent learns by interacting with environment

Model performs well on training data but poorly on new data

Why these pairings

Key ML concepts commonly tested in the PDE exam.

28

MCQmedium

Refer to the exhibit. A team configured a Cloud Monitoring alerting policy as shown. They recently started receiving false positive alerts. What is the most likely cause?

A.The duration of 60 seconds is too short, making the alert sensitive to brief spikes.

B.The alignment period of 60 seconds is too short, causing noise.

C.The threshold of 10 is too low.

D.The aggregator should be ALIGN_SUM instead of ALIGN_RATE.

AnswerA

A short duration means a spike lasting just over 60 seconds will trigger an alert; a longer duration (e.g., 300s) would reduce sensitivity.

Why this answer

Option A is correct because the duration is set to 60 seconds, so the alert fires as soon as the rate stays above 10 for a single 60-second period. If the error count is bursty, brief spikes cause false positives. Increasing the duration (for example, to 300 seconds) requires the condition to hold for longer, which filters out transient spikes.

Option B (alignment period) affects granularity but is not the main cause of the false positives. Option C (threshold) might be low, but the primary issue is the short duration. Option D is wrong because ALIGN_RATE is the appropriate aligner for a rate-based condition.
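The effect of the duration setting can be illustrated with a small simulation. This is a hedged sketch in plain Python, not the Cloud Monitoring API: `alert_fires` is a hypothetical helper, and it assumes one aligned rate sample per 60-second alignment period.

```python
def alert_fires(rates, threshold, duration_periods):
    """Return True if rates stay above threshold for duration_periods consecutive samples."""
    consecutive = 0
    for rate in rates:
        # Reset the streak whenever a sample falls back to or below the threshold.
        consecutive = consecutive + 1 if rate > threshold else 0
        if consecutive >= duration_periods:
            return True
    return False


# One sample per 60-second alignment period (illustrative values only).
bursty = [2, 3, 14, 4, 2, 3]         # a single brief spike
sustained = [2, 12, 13, 15, 14, 16]  # a real, lasting problem

fires_on_spike_60s = alert_fires(bursty, threshold=10, duration_periods=1)
fires_on_spike_300s = alert_fires(bursty, threshold=10, duration_periods=5)
fires_on_sustained_300s = alert_fires(sustained, threshold=10, duration_periods=5)
```

Under these assumptions, a duration of one period fires on the single spike, while a duration of five periods ignores the spike and still catches the sustained elevation.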

29

Multi-Selectmedium

Which TWO actions are recommended to improve the reliability of a Cloud Dataflow streaming pipeline that processes event data from Pub/Sub?

Select 2 answers

A.Use a pull subscription with a 10-second acknowledgment deadline.

B.Enable Dataflow Streaming Engine.

C.Enable exactly-once processing sinks (e.g., BigQuery with guaranteed row-level insertion).

D.Disable autoscaling to prevent worker churn.

E.Use micro-batch processing with a small batch size.

AnswersB, C

Streaming Engine offloads state management to the backend, improving reliability.

Why this answer

Option B is correct because enabling Dataflow Streaming Engine moves state and computation from worker VMs to the backend service, reducing the impact of worker scaling and preemption. This improves reliability by providing consistent performance and fault tolerance for streaming pipelines, especially those with high throughput or stateful processing.

Exam trap

The trap here is that candidates often confuse reliability with throughput or latency, and may incorrectly choose micro-batching or disabling autoscaling as reliability improvements, when in fact Dataflow's reliability comes from its managed backend services like Streaming Engine.

30

MCQhard

Refer to the exhibit. A BigQuery dataset is shared with the group 'analysts@example.com' using the IAM policy shown. A user who is a member of this group reports that they cannot run queries on the dataset, though they can see the tables. What is the most likely reason?

A.The group needs the 'roles/bigquery.jobUser' role at the project level.

B.The user is using an incorrect client library version.

C.The user's account is not activated in the group membership.

D.The dataset has an organization policy that denies query access.

AnswerA

DataViewer provides read access but not job submission; jobUser must be granted at the project level to run queries.

Why this answer

The role 'roles/bigquery.dataViewer' allows viewing table metadata and data but does not allow running queries; users also need 'roles/bigquery.jobUser' at the project level to submit query jobs.

31

Multi-Selectmedium

A team runs a production application on Compute Engine. They want to ensure high availability and quality. Which three best practices should they implement? (Choose three.)

Select 3 answers

A.Use health checks and load balancing.

B.Use Cloud SQL read replicas for database load.

C.Enable OS Login for SSH access.

D.Use regional persistent disks for stateful data.

E.Use managed instance groups (MIGs) with autoscaling.

AnswersA, D, E

Health checks ensure only healthy instances receive traffic; load balancing provides fault tolerance.

Why this answer

Use managed instance groups for autoscaling and autohealing, regional persistent disks for durable high-availability storage, and health checks with load balancing to distribute traffic to healthy instances.

32

MCQmedium

After migrating a production Cloud SQL for PostgreSQL database to a larger machine type, the team notices slower queries. What is the best step to identify the cause?

A.Reindex all tables to improve index efficiency.

B.Enable query caching through the database flags.

C.Enable pg_stat_statements and review query execution times.

D.Increase max_connections to handle more concurrent queries.

AnswerC

This extension captures per-query statistics, allowing identification of regressed queries.

Why this answer

Option C is correct because pg_stat_statements is a PostgreSQL extension that provides detailed query execution statistics, including total execution time, number of calls, and I/O metrics. After migrating to a larger machine type, slower queries often stem from plan changes due to different hardware characteristics or configuration settings; reviewing pg_stat_statements output helps pinpoint which queries are underperforming and why.

Exam trap

Google Cloud often tests the misconception that performance issues after a migration are always due to indexing or connection limits, when in fact the most effective first step is to gather query-level metrics using built-in tools like pg_stat_statements.

How to eliminate wrong answers

Option A is wrong because reindexing all tables is a maintenance task that can improve index bloat but does not address the root cause of slower queries after a migration; it is a reactive measure without diagnostic value. Option B is wrong because Cloud SQL for PostgreSQL does not support a generic 'query caching' database flag; PostgreSQL relies on shared buffers and the buffer cache, and enabling any such flag would not provide diagnostic insight into query performance. Option D is wrong because increasing max_connections can actually degrade performance by increasing context switching and memory contention; it does not help identify why queries are slower and may worsen the issue.

33

MCQhard

A data science team uses AI Platform Training with hyperparameter tuning. They observe that some trials fail due to transient errors. To improve solution quality and reduce costs, what should they do?

A.Enable early stopping using a Bayesian optimization algorithm.

B.Set the maxFailedTrials parameter to a high value (e.g., 10).

C.Use larger machine types for each trial.

D.Increase the number of parallel trials.

AnswerB

This allows the tuning job to tolerate transient failures and continue searching without aborting, improving completion rate and model quality.

Why this answer

Option B is correct because setting maxFailedTrials to a high value allows more trials to complete despite transient failures, improving the chance of finding a good model without wasting resources on re-running failed trials. Option D increases parallelism but still pays for failed trials. Option A (early stopping) prunes unpromising trials but does not address transient errors.

Option C increases cost per trial without solving the failure issue.

34

MCQhard

A data engineering team uses Cloud Composer (Airflow) for workflow orchestration. They notice DAG runs frequently fail, and the error indicates insufficient Airflow workers. The team wants to ensure reliable execution. Which approach best addresses the issue?

A.Switch from Cloud Composer to Cloud Scheduler for simpler workloads.

B.Reduce the concurrency of all DAGs to fit within available workers.

C.Use the GKE-based Composer environment, which provides autoscaling of Airflow workers.

D.Increase the parallelism setting in the Airflow configuration.

AnswerC

GKE-based Composer auto-scales worker pods, handling variable loads effectively.

Why this answer

Running Airflow on GKE allows worker autoscaling based on load, ensuring sufficient capacity during peak DAG concurrency.

35

MCQhard

A financial services company operates a real-time fraud detection pipeline using Apache Beam running on Google Cloud Dataflow. The pipeline reads transactions from Pub/Sub, enriches them with customer data from Bigtable, runs a machine learning model with side inputs from a Redis cluster, and writes results to BigQuery for downstream reporting. The data must be processed with exactly-once semantics to avoid duplicate fraud alerts or missing transactions. The pipeline currently uses a global window with 5-minute accumulation, but the team is experiencing high latency and occasional duplicates when the model side input is updated (triggered every 15 minutes via a WatchTransform). Additionally, the pipeline has a dead letter queue that outputs failed records to a separate Pub/Sub topic, but these records are never reprocessed. The team needs to ensure high reliability and data quality. Which course of action should the team take to improve solution quality?

A.Use fixed windows with a 10-minute duration and session gap of 2 minutes, disable side input caching, and log all dead letter records to Cloud Storage for manual inspection.

B.Switch to a batch processing approach that runs every minute using Cloud Composer, with data loaded from Pub/Sub into BigQuery and then processed with Dataproc to run the model.

C.Implement sliding windows of 5 minutes with a 2-minute allowed lateness, use side inputs with periodic refreshes using the .withUpdateFrequency transformation, and set up a Cloud Function to automatically replay dead letter records back to the main Pub/Sub topic after fixing the issue.

D.Keep the global window but use a custom trigger with early firings every 30 seconds and a late-firing threshold of 1 minute, and configure the side input to be broadcast every 5 minutes using a Read transform.

AnswerC

Sliding windows with allowed lateness handle late data without blocking, periodic side input refreshes reduce latency, and automatic replay of dead letters ensures data quality.

Why this answer

Option C is correct because switching to a sliding window with allowed lateness ensures that late-arriving transactions are captured without blocking the window, and using side inputs with periodic refreshes (e.g., .withUpdateFrequency) reduces latency from model updates. Adding a system to reprocess dead letter records (e.g., via a Cloud Function that replays to the main topic) ensures data completeness. Option A is incorrect because fixed windows with session gaps do not help with side input latency and may cause data loss.

Option D is incorrect because a global window with custom triggers can cause duplicates if not configured carefully; defaults may not achieve exactly-once. Option B is incorrect because it focuses on batching, which is not suitable for real-time detection and introduces latency.

36

MCQeasy

A company runs batch jobs on Dataproc. They need to ensure that if a job fails, it automatically retries with exponential backoff. What is the recommended approach?

A.Schedule a cron job to check job status and restart manually.

B.Use a Cloud Function triggered by Stackdriver alerts to restart the job.

C.Use Dataproc Workflow Templates with the maxAttempts parameter set to 3.

D.Create a Cloud Composer DAG that monitors job status and retries on failure.

AnswerC

Workflow Templates natively support retries with configurable backoff, making it the simplest and most robust solution.

Why this answer

Option C is correct because Dataproc Workflow Templates support configuring maxAttempts and retry policy in the template, enabling automatic retries with exponential backoff. Option D (Composer) is overkill for simple retry. Option A (cron job) would need custom logic.

Option B (Cloud Functions) also requires custom implementation.
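For readers unfamiliar with the retry pattern itself, exponential backoff can be sketched generically as follows. This illustrates the concept only and is not Dataproc's implementation or API: `run_with_backoff` and all of its parameters are hypothetical names chosen for this example.

```python
import time


def run_with_backoff(job, max_attempts=3, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call job() until it succeeds, waiting longer after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise          # out of attempts: surface the last error
            sleep(delay)       # wait before the next attempt
            delay *= factor    # grow the wait exponentially
```

The `sleep` parameter is injectable so that the waiting behavior can be observed or replaced in tests without actually pausing.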

37

MCQhard

A team deploys a Cloud Run service that processes user-uploaded files. Some requests time out after 60 minutes. They need to handle large files reliably without losing tasks. What is the best solution?

A.Containerize the processing logic and trigger it via Cloud Tasks.

B.Increase the request timeout to 3600 seconds.

C.Use Cloud Functions instead of Cloud Run.

D.Split the file into chunks and process them concurrently.

AnswerA

Cloud Tasks decouples the request, provides retry, and can handle long-running operations without timeout limits.

Why this answer

Cloud Run has a maximum request timeout of 60 minutes. Offloading processing to a Cloud Task decouples the request from the processing, allowing async handling with retries.

38

Multi-Selectmedium

A company wants to ensure high availability for their Cloud SQL instance. Which TWO actions are most appropriate? (Choose two.)

Select 2 answers

A.Create a read replica in a different region.

B.Configure a failover replica in the same region.

C.Enable automatic backups with a retention period of 7 days.

D.Increase the instance's memory and storage size.

E.Set up horizontal scaling with multiple read replicas.

AnswersA, C

A cross-region read replica can be promoted to a standalone instance in a disaster, providing DR.

Why this answer

Options A and C are correct: A read replica in a different region provides disaster recovery, and automatic backups allow point-in-time recovery. Option B (failover replica in same region) provides HA but not DR. Option D (increased memory) does not improve availability.

Option E (horizontal scaling with read replicas) does not provide failover for writes.

39

Matchingmedium

Match each data encryption concept to its description.

Concepts

Matches

Customer-supplied encryption key

Customer-managed encryption key via Cloud KMS

CSEK: keys provided by customer; CMEK: keys managed in Cloud KMS

Data encrypted while moving across networks

Why these pairings

Encryption options in Google Cloud.

40

MCQeasy

A team is deploying a model on AI Platform Prediction. They want to monitor for data drift to maintain model quality. Which service should they use?

A.Cloud DLP

B.AI Platform Continuous Evaluation

C.Cloud Monitoring

D.Cloud Audit Logs

AnswerB

This service provides monitoring for model predictions and drift analysis.

Why this answer

Option B is correct because AI Platform Continuous Evaluation is designed to monitor model performance and detect data drift. Option C (Cloud Monitoring) is for infrastructure metrics. Option D (Cloud Audit Logs) is for API activity.

Option A (Cloud DLP) is for data loss prevention.

41

Multi-Selecteasy

A company is developing a streaming Dataflow pipeline to process real-time sensor data. To ensure data quality, the team wants to detect malformed records and late data. Which two practices should they implement? (Choose two.)

Select 2 answers

A.Use Beam’s PAssert to validate each element in the pipeline.

B.Enable Dataflow’s built-in schema validation on the PCollection.

C.Configure a dead letter queue for unprocessable records.

D.Use Cloud Monitoring alerting on Dataflow system lag metric.

E.Run a separate batch pipeline to re-process data for validation.

AnswersC, D

A dead letter queue stores malformed records for later analysis, ensuring no data is silently lost.

Why this answer

Option C is correct because a dead letter queue (DLQ) is a standard pattern in streaming pipelines for isolating malformed or unprocessable records without blocking the main data flow. In Dataflow, this is typically implemented by writing bad records to a separate output (e.g., a Pub/Sub topic or Cloud Storage bucket) for later analysis or reprocessing. Option D is correct because the Dataflow system lag metric in Cloud Monitoring measures the time between when data enters the pipeline and when it is processed, making it an effective way to detect late data and trigger alerts for SLA violations.

Exam trap

Google Cloud often tests the misconception that PAssert can be used in production pipelines, but it is strictly a testing utility, and candidates may also confuse schema validation with Dataflow's built-in type checking, which does not exist for arbitrary record validation.

42

MCQmedium

A company is deploying a large-scale streaming application on Google Kubernetes Engine. They need to ensure the application can handle sudden traffic spikes without dropping data. Which architectural pattern is most appropriate?

A.Implement custom retry logic with exponential backoff in the application.

B.Use Cloud SQL as a temporary buffer and process from there.

C.Pre-provision 3x the expected peak capacity to handle spikes.

D.Use a Pub/Sub topic as a buffer and autoscale consumer pods based on Pub/Sub subscription backlog.

AnswerD

Pub/Sub provides a highly scalable buffer; autoscaling consumers based on backlog ensures capacity matches demand.

Why this answer

Option D is correct because using a Pub/Sub buffer decouples producers from consumers, allowing autoscaling of consumers to handle spikes. Option C (pre-provisioning) is wasteful and not dynamically scalable. Option B uses Cloud SQL, which is not designed for high-throughput buffering.

Option A only addresses retries, not overall throughput capacity.
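The idea of scaling consumers on backlog can be sketched as a simple calculation. This is an illustrative model only, not the Kubernetes autoscaler's actual algorithm: `desired_replicas`, the per-pod target, and the replica bounds are all hypothetical values chosen for the example.

```python
import math


def desired_replicas(backlog_messages, target_per_pod, min_replicas=1, max_replicas=50):
    """Return a pod count that keeps roughly target_per_pod undelivered messages per pod."""
    needed = math.ceil(backlog_messages / target_per_pod)
    # Clamp to the configured bounds so the workload never scales to zero or without limit.
    return max(min_replicas, min(max_replicas, needed))
```

Because Pub/Sub retains unacknowledged messages while consumers catch up, the consumer count can lag the spike briefly without data being dropped.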

43

MCQhard

A company runs a large Dataflow pipeline that aggregates user activity data from Pub/Sub into BigQuery every 10 minutes using fixed windows. Recently, the daily summary reports have shown 5-10% lower user engagement for certain segments compared to historical trends. The pipeline is completing successfully with no errors in Cloud Monitoring, and the Dataflow job dashboard shows all steps in green. There are no alarms. The team suspects data is being dropped or missed. They have verified that the Pub/Sub topic is receiving data correctly. After reviewing the pipeline code, they find that the pipeline uses a global window with a default 10-minute trigger, and writes results to a single BigQuery table partitioned by date. They also use exactly-once processing mode. Which of the following is the most likely cause and the best course of action to diagnose and fix the data quality issue?

A.Implement a retry mechanism in the Pub/Sub subscription to ensure no messages are lost.

B.Enable Cloud Logging for all pipeline steps and analyze the logs for dropped elements.

C.Add a global window with a late-data trigger to capture any data arriving after the window ends.

D.Use Dataflow’s built-in metrics to compare the number of elements read from Pub/Sub and written to BigQuery for each window.

AnswerD

This identifies exactly where data is lost, enabling targeted debugging without overhead.

Why this answer

Option D is correct because the pipeline uses a global window with a default 10-minute trigger, which means data is processed in micro-batches but the global window never closes, so late-arriving data is included. However, the team suspects data is being dropped, and the most direct way to diagnose this is to compare the number of elements read from Pub/Sub (using the Pub/Sub subscription's 'pubsub_subscription' metric) with the number of elements written to BigQuery (using the BigQuery sink's 'bigquery_rows_written' metric) for each window. This comparison will reveal if any data is lost between reading and writing, which is a common issue when using exactly-once processing mode with streaming inserts that may silently fail due to schema mismatches or quota limits.
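The per-window comparison described above can be sketched as follows. This assumes the read and written counts have already been exported into dictionaries keyed by window; `find_lossy_windows` and the sample keys are hypothetical and do not correspond to any real metric names.

```python
def find_lossy_windows(read_counts, written_counts):
    """Return {window: missing_count} for windows where fewer rows were written than read."""
    gaps = {}
    for window, read in read_counts.items():
        written = written_counts.get(window, 0)  # a missing window counts as zero written
        if written < read:
            gaps[window] = read - written
    return gaps


# Illustrative counts only, keyed by window start time.
read = {"10:00": 1000, "10:10": 1200, "10:20": 900}
written = {"10:00": 1000, "10:10": 1130}
gaps = find_lossy_windows(read, written)
```

Windows that appear in the result point to where elements are being lost between the source and the sink, which narrows down which step to inspect next.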

Exam trap

The trap here is that candidates assume 'exactly-once processing' guarantees no data loss, but in reality, exactly-once only ensures no duplicates, not that all data is successfully written to the sink; silent failures in streaming inserts to BigQuery can cause data to be dropped without triggering pipeline errors.

How to eliminate wrong answers

Option A is wrong because Pub/Sub subscriptions already have built-in retry mechanisms (e.g., at-least-once delivery) and the issue is not about message loss from Pub/Sub; the team verified the topic is receiving data correctly. Option B is wrong because enabling Cloud Logging for all pipeline steps would generate excessive logs and is not the most efficient diagnostic approach; Dataflow already provides built-in metrics (e.g., 'pubsub_subscription' and 'bigquery_rows_written') that can directly compare element counts without needing to parse logs. Option C is wrong because the pipeline already uses a global window with a default 10-minute trigger, which inherently captures late data (since the global window never closes); adding a late-data trigger is redundant and does not address the potential data loss between Pub/Sub and BigQuery.

44

MCQmedium

A company uses BigQuery for analytics. They need to ensure data quality by preventing duplicate records from being inserted. Which approach is most effective?

A.Use BigQuery ML to train a model that identifies anomalies.

B.Use a DML MERGE statement that filters out duplicates based on a unique key.

C.Use Cloud Data Loss Prevention API to scan for duplicates.

D.Use COUNT DISTINCT in queries to ignore duplicates.

AnswerB

MERGE with deduplication logic ensures only one copy of each record is inserted, maintaining data quality.

Why this answer

Using MERGE with ROW_NUMBER() to identify and skip duplicates in a staging table before inserting into the final table is a common pattern for deduplication.
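The logic of that pattern can be modeled in plain Python to make the two steps explicit. This is a sketch of the idea, not BigQuery SQL: the function names and the `id` and `updated_at` fields are hypothetical, and it assumes "latest row per key wins" as the deduplication rule.

```python
def dedupe_staging(rows, key="id", order_by="updated_at"):
    """Keep one row per key: the one with the highest order_by value (like ROW_NUMBER() = 1)."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


def merge_new_rows(target, staging, key="id"):
    """Insert only deduplicated staging rows whose key is not already in target."""
    existing = {row[key] for row in target}
    new_rows = [row for row in dedupe_staging(staging, key) if row[key] not in existing]
    return target + new_rows
```

The first function corresponds to ranking rows within each key in the staging table, and the second to the insert-when-not-matched branch of the merge.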

45

MCQmedium

Refer to the exhibit. A team uses this Cloud Build configuration to deploy a service to Cloud Run. The deployment step fails with a 'Permission denied' error. What is the most likely cause?

A.The Dockerfile is missing from the repository.

B.The Docker image tag is missing or malformed.

C.The region 'us-central1' is incorrect for Cloud Run.

D.The Cloud Build service account does not have the Cloud Run Admin role.

AnswerD

The deploy step requires IAM permissions to create/update Cloud Run services; typically the Cloud Build service account needs roles/run.admin.

Why this answer

Cloud Build uses its default service account (or custom service account) which needs the Cloud Run Admin role (roles/run.admin) to deploy services. The error indicates the service account lacks permission to create or update the Cloud Run service.
