Google Professional Data Engineer (PDE) — Questions 526–600

990 questions total · 14 pages · All types, answers revealed

526

MCQ · medium

A company has a batch ETL job that runs daily using Cloud Dataflow. The job reads from Cloud Storage, transforms data, and writes to BigQuery. Recently, the job started failing with 'Resources have been exhausted' errors. What is the most likely cause?

A. The Cloud Storage bucket has been deleted.

B. The project has reached its Dataflow API quota.

C. The input data volume has increased significantly.

D. The BigQuery output table schema has changed.

Answer: B

Resource exhausted error indicates quota issue.

Why this answer

The 'Resources have been exhausted' error in Cloud Dataflow typically indicates that the project has reached its Dataflow API quota, such as the maximum number of concurrent jobs or API requests per minute. This is a common issue when multiple jobs run simultaneously or when the quota is set low by default. The error is distinct from resource exhaustion in the underlying compute or storage layers.

Exam trap

Google Cloud often tests the distinction between API quota exhaustion and resource exhaustion in the underlying infrastructure (e.g., Compute Engine CPU/memory), leading candidates to incorrectly attribute the error to increased data volume or schema changes.

How to eliminate wrong answers

Option A is wrong because deleting the Cloud Storage bucket would cause a 'bucket not found' or 'object not found' error, not a 'Resources have been exhausted' error. Option C is wrong because a significant increase in input data volume would lead to autoscaling limits or worker resource exhaustion (e.g., out of memory), but the specific 'Resources have been exhausted' message is tied to API quota limits, not data volume. Option D is wrong because a schema change in BigQuery would result in a schema mismatch or insertion error, not an API quota exhaustion error.

527

MCQ · easy

Your team uses Vertex AI Feature Store to serve features for online predictions. A feature value changes frequently (e.g., user session clicks). Which type of feature should you use to ensure low-latency writes and reads?

A. Streaming feature

B. Batch feature

C. Feature view

D. Bigtable-backed feature

Answer: A

Streaming features are designed for low-latency, high-frequency updates and reads.

Why this answer

A is correct because streaming features in Vertex AI Feature Store are designed for low-latency writes and reads, making them ideal for frequently changing values like user session clicks. They use an online serving infrastructure (typically backed by Bigtable) that supports real-time updates and low-latency retrieval, ensuring predictions are based on the latest data without batch delays.

Exam trap

The trap here is that candidates confuse 'streaming feature' with 'Bigtable-backed feature' as if they are separate options, when in fact Bigtable is the underlying technology for streaming features, not a feature type itself.

How to eliminate wrong answers

Option B is wrong because batch features are optimized for bulk ingestion and offline serving, not for low-latency online writes and reads; they introduce latency due to periodic batch jobs. Option C is wrong because a feature view is a logical grouping of features for serving, not a type of feature with specific latency guarantees; it can reference either streaming or batch features. Option D is wrong because Bigtable-backed feature is not a distinct feature type in Vertex AI Feature Store; Bigtable is the underlying storage for streaming features, but the question asks for the feature type itself, not the storage backend.

528

MCQ · medium

Refer to the exhibit. A BigQuery dataset has the IAM policy shown above. An analyst is trying to run a SELECT query on a table in this dataset but receives an 'Access Denied' error. What is the most likely reason?

A. The analyst does not have permission to list datasets in the project.

B. The analyst only has the roles/bigquery.metadataViewer role, which does not allow reading table data.

C. The table is in a different region than the dataset, and the analyst's query is not cross-region compatible.

D. The analyst has not been granted the 'bigquery.jobs.create' permission to run queries.

Answer: B

B is correct because metadataViewer only allows viewing metadata, not querying data.

Why this answer

The roles/bigquery.metadataViewer role grants permissions to view table and dataset metadata (e.g., table names, schemas) but does not include the bigquery.tables.getData permission required to read table rows. Therefore, when the analyst runs a SELECT query, BigQuery denies access because the role lacks the data-reading privilege. This is the most likely reason for the 'Access Denied' error.
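The metadata-versus-data distinction can be sketched as a plain permission lookup. This is a minimal illustration, not the IAM API: the permission sets below are abbreviated subsets chosen for the example, and `can_read_rows` is a hypothetical helper.

```python
# Illustrative subsets of two predefined BigQuery roles (not the full permission lists).
ROLE_PERMISSIONS = {
    "roles/bigquery.metadataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.list",
    },
    "roles/bigquery.dataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.list",
        "bigquery.tables.getData",
    },
}


def can_read_rows(roles):
    """Return True if any granted role carries bigquery.tables.getData."""
    granted = set()
    for role in roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    return "bigquery.tables.getData" in granted


analyst_roles = ["roles/bigquery.metadataViewer"]
print(can_read_rows(analyst_roles))                                  # False
print(can_read_rows(analyst_roles + ["roles/bigquery.dataViewer"]))  # True
```

Under these assumptions, adding a data-reading role is what flips the check, which matches the reasoning behind answer B.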

Exam trap

Google Cloud often tests the distinction between metadata-viewing roles and data-reading roles, trapping candidates who assume that being able to see table names and schemas implies permission to query the data.

How to eliminate wrong answers

Option A is wrong because listing datasets is not required to run a SELECT query; the error is about reading table data, not dataset enumeration. Option C is wrong because a table always resides in the same location as its dataset, so the table cannot be in a different region than the dataset, and a location problem would not surface as an 'Access Denied' error. Option D is wrong because the 'bigquery.jobs.create' permission is needed to submit a query job, but the error specifically indicates a data access issue, not a job creation failure; the analyst likely has this permission if they can attempt a query.

529

Multi-Select · hard

A data pipeline reads thousands of JSON files from Cloud Storage, processes them with Cloud Dataflow, and writes to BigQuery. The pipeline sometimes fails because of malformed JSON records. Which three steps should the data engineering team take to improve pipeline reliability? (Choose THREE.)

Select 3 answers

A. Integrate Cloud Pub/Sub as an intermediary to buffer and allow message retry

B. Use a try-catch block in the pipeline to retry processing failed records

C. Create a Cloud Monitoring alert on pipeline failures

D. Add schema validation before processing to reject invalid JSON records

E. Implement a dead-letter queue in the Dataflow pipeline to store failed records for later analysis

Answers: A, D, E

Pub/Sub can retry delivery of messages, improving reliability.

Why this answer

Option A is correct because integrating Cloud Pub/Sub as an intermediary decouples the ingestion of JSON files from the Dataflow pipeline. Pub/Sub provides at-least-once delivery and automatic retries for messages that are not acknowledged, which buffers against transient failures and malformed records. This allows the pipeline to pull messages at its own pace and retry processing without losing data.

Exam trap

The trap here is that candidates often confuse reactive monitoring (Option C) with proactive reliability improvements, or they assume a simple try-catch block (Option B) is sufficient in a distributed processing framework like Dataflow, where fault tolerance requires persistent retry mechanisms and dead-letter queues.
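The validation and dead-letter steps (Options D and E) can be sketched in plain Python. This is a minimal illustration under stated assumptions: `parse_with_dead_letter` is a hypothetical helper, and in a real Beam pipeline the same routing would typically live in a DoFn that emits failures to a separate output.

```python
import json


def parse_with_dead_letter(raw_records):
    """Split raw strings into parsed rows and dead-letter entries."""
    parsed, dead_letter = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            # Minimal validation step: the pipeline expects JSON objects.
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            parsed.append(record)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            # Keep the raw payload and the reason so it can be analysed later.
            dead_letter.append({"raw": raw, "error": str(err)})
    return parsed, dead_letter


good, bad = parse_with_dead_letter(['{"id": 1}', '{"id": ', '[1, 2]'])
print(len(good), len(bad))  # 1 2
```

The point of the pattern is that a malformed record is set aside with its error message instead of failing the whole run.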

530

MCQ · easy

A team needs to migrate an existing on-premises Hadoop Hive workload to Google Cloud. They want to minimize code changes and use a managed service for transient clusters. Which service should they choose?

A. Cloud Dataflow

B. Cloud Dataprep

C. Cloud Dataproc

D. BigQuery

Answer: C

Dataproc is fully compatible with Hadoop/Hive and offers ephemeral clusters with minimal code changes.

Why this answer

Cloud Dataproc is the correct choice because it is a managed Spark and Hadoop service that supports Hive workloads natively, allowing you to run existing Hive scripts with minimal changes. It also supports transient (ephemeral) clusters that are created for a job and deleted when the job finishes, which matches the requirement.

Exam trap

The trap here is that candidates often confuse Cloud Dataflow's ability to process batch data with Hadoop compatibility, but Dataflow does not support Hive or transient Hadoop clusters, making Dataproc the only correct option for minimizing code changes.

How to eliminate wrong answers

Option A is wrong because Cloud Dataflow is a unified stream and batch data processing service based on Apache Beam, not designed for Hive workloads or transient Hadoop clusters. Option B is wrong because Cloud Dataprep is a data preparation and cleaning service (based on Trifacta) that does not run Hive or provide transient clusters. Option D is wrong because BigQuery is a serverless data warehouse that does not support Hive execution engines or transient clusters; migrating Hive to BigQuery would require significant code changes.

531

Multi-Select · medium

A data engineer is monitoring a Dataflow streaming pipeline and notices that the 'System Lag' metric is increasing. Which TWO actions should be taken to diagnose the issue?

Select 2 answers

A. Check the Dataflow monitoring UI for each stage's throughput and backlog.

B. Cancel the pipeline and restart with a larger initial worker count.

C. Increase the maximum number of workers to handle backlog.

D. Examine the worker logs for error messages or stack traces.

E. Increase the BigQuery quota for streaming inserts.

Answers: A, D

Identifies bottleneck stages.

Why this answer

Option A is correct because the Dataflow monitoring UI provides per-stage metrics such as throughput and backlog, which directly indicate where data is accumulating. By examining these metrics, you can identify the specific stage causing the increasing system lag, enabling targeted troubleshooting without unnecessary pipeline changes.

Exam trap

Google Cloud often tests the distinction between diagnostic actions and remedial actions; the trap here is that candidates confuse scaling up workers (a fix) with diagnosing the root cause of the lag.

532

MCQ · hard

A Dataflow streaming job is processing high-volume sensor data from thousands of IoT devices. The job uses global windows with a 10-minute processing time trigger. Recently, the job's CPU utilization is nearly 100% and it is falling behind. Which action is most likely to reduce CPU load while maintaining data freshness?

A. Increase the number of workers to distribute the load.

B. Change the trigger to event time with a 10-minute allowed lateness.

C. Replace GroupByKey with Combine.globally and use a fanout.

D. Use side inputs to broadcast a static lookup table to all workers.

Answer: C

Combine.globally with fanout reduces the number of unique keys tracked per worker, lowering CPU usage from grouping large numbers of keys.

Why this answer

Option C is correct because using `Combine.globally` with a fanout reduces the amount of data shuffled and merged in a single worker, lowering CPU load. In Dataflow, `GroupByKey` triggers a full shuffle and per-key aggregation, which is expensive for high-volume sensor data; `Combine.globally` with a fanout performs partial aggregation on each worker before a final merge, reducing network I/O and CPU cycles. This maintains data freshness because the 10-minute processing time trigger still fires on time, but with less per-element overhead.
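The partial-aggregation idea behind a fanout can be sketched without Beam. This is a simplified model, assuming a sum as the combining function; `combine_with_fanout` is a hypothetical helper, and the round-robin sharding stands in for however the runner actually distributes elements.

```python
def combine_with_fanout(values, fanout):
    """Sum values in two stages: per-shard partial sums, then a final merge."""
    partials = [0] * fanout
    for index, value in enumerate(values):
        # Stage 1: each shard pre-aggregates its own slice of the input.
        partials[index % fanout] += value
    # Stage 2: the final merge only has to touch `fanout` partial results.
    return sum(partials), partials


readings = list(range(1, 101))
total, partials = combine_with_fanout(readings, fanout=4)
print(total, len(partials))  # 5050 4
```

The final step merges four numbers instead of one hundred elements, which is the reduction in per-worker work that the explanation relies on.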

Exam trap

Google Cloud often tests the misconception that scaling out workers (Option A) is the universal fix for performance issues, but the trap here is that the real bottleneck is the shuffle-heavy `GroupByKey` operation, not worker count.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers distributes load but does not address the root cause—the high CPU cost of per-key grouping and shuffling in `GroupByKey`; it may temporarily reduce backlog but adds cost and can still hit scaling limits. Option B is wrong because changing to event time with allowed lateness does not reduce CPU utilization; it only changes watermark semantics and may increase state size, worsening CPU pressure. Option D is wrong because using side inputs to broadcast a static lookup table does not reduce the CPU cost of the aggregation step; it adds memory overhead and does not address the shuffle bottleneck.

533

MCQ · medium

You deployed a model on Vertex AI Endpoints using a custom container. The model serves predictions but the latency is higher than expected. You suspect the container is not making full use of the CPU resources. What should you do to reduce latency?

A. Modify the container to use multi-threading or increase the number of workers in the prediction server (e.g., Gunicorn workers).

B. Enable response caching on the endpoint.

C. Change the machine type to a GPU-accelerated machine.

D. Increase the number of nodes by adjusting autoscaling limits.

Answer: A

Properly configuring concurrency allows each node to process multiple requests in parallel, reducing latency under load.

Why this answer

Option A is correct because high latency in a CPU-based custom container often stems from underutilizing available CPU cores. By increasing the number of workers (e.g., Gunicorn workers) or enabling multi-threading, you allow the prediction server to handle multiple requests concurrently, reducing queue time and improving throughput. This directly addresses the symptom of the container not making full use of CPU resources.
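As a rough sketch of what raising concurrency might look like, a Gunicorn configuration file (which is ordinary Python) can size the worker pool from the CPU count. The file name and the thread count are assumptions for illustration, and the right values depend on the model's memory footprint, since each worker process loads its own copy.

```python
# gunicorn.conf.py (hypothetical file inside the custom container)
import multiprocessing
import os

# Vertex AI passes the serving port in AIP_HTTP_PORT; 8080 is the fallback assumed here.
bind = "0.0.0.0:" + os.environ.get("AIP_HTTP_PORT", "8080")

# Rule of thumb from the Gunicorn docs: (2 x cores) + 1 worker processes.
workers = multiprocessing.cpu_count() * 2 + 1

# A few threads per worker help when requests spend time waiting on I/O.
threads = 2
```

The server would then be started with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a placeholder for the container's actual WSGI entry point.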

Exam trap

Google Cloud often tests the misconception that scaling out (adding more nodes) or upgrading hardware (GPU) is the default fix for latency, when the real issue is often software-level concurrency configuration within the container.

How to eliminate wrong answers

Option B is wrong because response caching reduces latency only for repeated identical requests, not for the general case of underutilized CPU resources; it does not improve concurrent request handling. Option C is wrong because switching to a GPU-accelerated machine would only help if the model benefits from GPU parallelism (e.g., deep learning models), but the question states the container is not making full use of CPU resources, implying the bottleneck is software configuration, not hardware type. Option D is wrong because increasing the number of nodes via autoscaling adds more instances but does not fix the per-instance CPU underutilization; it may even increase cost without addressing the root cause of inefficient request handling within each container.

534

MCQ · medium

A company uses Cloud Dataproc to run Spark ML training jobs. They want to persist the trained models and metadata in a Hive-compatible metastore. Which Dataproc feature should they use?

A. Cloud Hive Metastore (self-managed)

B. Cloud Bigtable

C. Dataproc Metastore

D. Cloud Data Catalog

Answer: C

Dataproc Metastore is a fully managed, Hive-compatible metastore service.

Why this answer

Dataproc Metastore provides a Hive-compatible metastore that can be used across clusters and services.

535

MCQ · medium

A company is using Cloud Storage to store raw logs. They want to use Cloud Data Fusion to transform and load the data into BigQuery on a daily schedule. The transformations are complex and involve joining multiple datasets. What is the most efficient way to run these pipelines?

A. Use Cloud Composer to orchestrate Dataproc jobs that run the transformations

B. Use Cloud Functions to trigger a Dataflow job that does the transformations

C. Use Cloud Data Fusion to design the pipeline and schedule it to run on a Dataproc cluster

D. Use Cloud Dataprep to design the transformation and export to BigQuery

Answer: C

Cloud Data Fusion orchestrates the execution on Dataproc, which is the expected approach.

Why this answer

Cloud Data Fusion supports scheduling pipelines and runs them on Dataproc. This is the standard way to run batch pipelines.

536

MCQ · hard

You are a data engineer at a financial services company. You have deployed a credit risk model on Vertex AI Endpoints using a custom container with a TensorFlow SavedModel. The model expects input features as a JSON object. Recently, the model has been returning high prediction latency and occasional 503 errors. You have enabled autoscaling with minNodes=2 and maxNodes=10. The model is CPU-only and uses n1-standard-4 machines. Monitoring shows that during peak hours, CPU utilization reaches 90% and memory is at 80%. The number of prediction requests per second peaks at 100. You suspect that the model is not scaling fast enough. Which action will most effectively reduce latency and eliminate 503 errors?

A. Increase maxNodes to 20 to allow more replicas during peak

B. Change the machine type to n1-standard-4 with a GPU (e.g., NVIDIA T4) and update the custom container to use GPU

C. Set minNodes to 5 to keep more replicas warm

D. Switch to n1-highmem-4 machines to provide more memory per node

Answer: B

GPU acceleration reduces per-request latency and can handle more requests per node.

Why this answer

Option B is correct because the high CPU utilization (90%) indicates that the model's inference is compute-bound. Offloading the computation to a GPU (NVIDIA T4) significantly accelerates TensorFlow model inference, reducing per-request latency and allowing each replica to handle more requests per second. This directly addresses the root cause of the 503 errors (requests timing out due to slow inference) and reduces the need for rapid scaling.

Exam trap

Google Cloud often tests the misconception that scaling out (increasing replicas) is always the solution to latency and 503 errors, when in fact the root cause may be per-replica performance (CPU vs. GPU) that scaling cannot fix.

How to eliminate wrong answers

Option A is wrong because increasing maxNodes to 20 does not address the fundamental bottleneck: each replica is CPU-bound at 90% utilization. More replicas would still be slow and may not scale quickly enough to handle sudden spikes, and they would increase cost without fixing latency. Option C is wrong because setting minNodes to 5 keeps more replicas warm but does not reduce the latency of each individual prediction; the replicas would still be CPU-bound, so 503 errors from slow inference would persist. Option D is wrong because memory is only at 80%, not a bottleneck; switching to n1-highmem-4 provides more memory but does not accelerate the CPU-bound computation, so latency and 503 errors would remain.

537

MCQ · easy

Your Cloud Dataflow pipeline is failing due to a 'Permission denied' error when writing to a BigQuery table. The error persists even though the service account has bigquery.dataEditor role. What is the most likely missing permission?

A. pubsub.topics.publish on a notification topic

B. storage.objects.create on the staging bucket

C. bigquery.tables.get on the table

D. bigquery.tables.create on the dataset

Answer: D

Dataflow requires create permission if table is created automatically.

Why this answer

The bigquery.dataEditor role lets a principal create tables only when it is granted at the dataset level or above. If the role is granted on an individual table, the service account can read and modify that table but cannot create tables in the dataset, so a Dataflow write that has to create its destination table fails with 'Permission denied'. The missing piece is therefore bigquery.tables.create on the dataset.

Exam trap

Google Cloud often tests the distinction between editing existing resources and creating new ones, trapping candidates who assume a dataEditor grant covers every write operation regardless of the scope at which it was granted.

How to eliminate wrong answers

Option A is wrong because pubsub.topics.publish is unrelated to BigQuery write permissions; it is needed only if the pipeline uses Pub/Sub notifications. Option B is wrong because storage.objects.create on the staging bucket is required for temporary file staging but is not the missing permission for writing to BigQuery; the error is specific to BigQuery table creation. Option C is wrong because bigquery.tables.get is a read permission that allows viewing table metadata, not creating tables; the Dataflow pipeline already has this via bigquery.dataEditor, but the error is about creation, not reading.

538

Multi-Select · hard

A payment processing company needs to detect fraudulent transactions in real time. The system must have sub-second latency for high-value transactions and use a machine learning model. Which two components should be part of the architecture? (Choose TWO.)

Select 2 answers

A. Cloud Storage for transaction logs

B. Bigtable to store user profiles and transaction history for fast lookups

C. Dataflow for stream processing with sliding windows

D. Cloud SQL to store reference data

E. Cloud Functions for long-running batch model training

Answers: B, C

Bigtable offers single-digit millisecond latency for point lookups, essential for real-time fraud scoring.

Why this answer

Bigtable is a fully managed, scalable NoSQL database that provides consistent sub-10ms latency for high-throughput read/write operations, making it ideal for real-time lookups of user profiles and transaction history in fraud detection. Its ability to handle large volumes of data with low latency supports the sub-second requirement for high-value transactions.

Exam trap

Google Cloud often tests the distinction between storage services optimized for real-time access (Bigtable) versus batch/archive (Cloud Storage) and between stream processing (Dataflow) versus batch processing or short-lived compute (Cloud Functions).

539

MCQ · easy

A startup wants to build a data lake on Google Cloud using Cloud Storage. They need to store raw data in its original format for future analysis. Which storage class should they use to optimize for cost given that data will be accessed occasionally after the first month?

A. Nearline storage class

B. Coldline storage class

C. Standard storage class

D. Archive storage class

Answer: A

Optimized for data accessed less than once a month, cost-effective.

Why this answer

Nearline storage class is the optimal choice because it offers low-cost storage for data accessed less than once a month, with a 30-day minimum storage duration. Since the data is accessed occasionally after the first month, Nearline provides significant cost savings over Standard while still offering low-latency access (milliseconds) suitable for analytics. Coldline and Archive have lower storage costs but impose higher retrieval fees and minimum storage durations (90 and 365 days respectively), making them more expensive for data that is accessed occasionally within the first year.
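The minimum-storage-duration part of this comparison can be made concrete with a small calculation. The sketch below models only that one rule (an object removed early is billed as if it had been stored for the class minimum); retrieval fees and per-GB prices are deliberately left out, and `billable_days` is a hypothetical helper.

```python
# Minimum storage durations in days, as cited in the explanation above.
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def billable_days(storage_class, days_stored):
    """Days billed for an object deleted or rewritten after `days_stored` days."""
    return max(days_stored, MIN_STORAGE_DAYS[storage_class])


for storage_class in MIN_STORAGE_DAYS:
    print(storage_class, billable_days(storage_class, days_stored=45))
# STANDARD 45
# NEARLINE 45
# COLDLINE 90
# ARCHIVE 365
```

For an object that lives 45 days, Nearline bills the actual 45 days while Coldline and Archive bill their full minimums, which is one reason the colder classes can cost more than their storage price suggests.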

Exam trap

Google Cloud often tests the misconception that lower storage cost always means lower total cost, ignoring the impact of retrieval fees and minimum storage duration penalties, which can make Coldline or Archive more expensive for data accessed occasionally within the first year.

How to eliminate wrong answers

Option B (Coldline) is wrong because it is designed for data accessed less than once a quarter (90-day minimum storage duration) and has higher retrieval costs, making it more expensive than Nearline for data accessed occasionally after the first month. Option C (Standard) is wrong because it is optimized for frequently accessed data (no minimum storage duration) and has the highest storage cost, which is not cost-effective for data that is only accessed occasionally. Option D (Archive) is wrong because it is intended for long-term archival data accessed less than once a year (365-day minimum storage duration) and has the highest retrieval costs, making it unsuitable for occasional access within a year.

540

MCQ · medium

A data engineer is designing a batch data pipeline that reads Avro files from Cloud Storage, transforms data using Apache Beam, and writes to BigQuery. The pipeline must handle daily runs and backfills. Which runner should they use?

A. FlinkRunner

B. DataflowRunner

C. SparkRunner

D. DirectRunner

Answer: B

DataflowRunner is a fully managed service that supports batch pipelines, backfills, and direct integration with GCS and BigQuery.

Why this answer

DataflowRunner is the correct choice because it is the fully managed service runner for Apache Beam on Google Cloud, optimized for batch and streaming pipelines. It automatically handles scaling, resource management, and exactly-once processing semantics, which are essential for reliable daily runs and backfills with Avro files from Cloud Storage and BigQuery sinks.

Exam trap

The trap here is that candidates may confuse the runner with the execution engine, assuming that any distributed runner (Flink, Spark) is suitable for production, when the question specifically tests knowledge of Google Cloud-native services and the need for managed infrastructure for batch pipelines with backfills.

How to eliminate wrong answers

Option A is wrong because FlinkRunner runs Beam pipelines on Apache Flink clusters, which the team would have to provision and manage, rather than on a fully managed Google Cloud service. Option C is wrong because SparkRunner runs Beam pipelines on Apache Spark, which on Google Cloud means operating a Spark cluster (for example on Dataproc) instead of using the serverless execution that DataflowRunner provides. Option D is wrong because DirectRunner is intended for local testing and development only, not for production workloads or handling large-scale daily runs and backfills.

541

Multi-Select · medium

Which TWO are best practices for managing a Cloud Dataflow pipeline in production?

Select 2 answers

A. Always use batch mode for streaming data to reduce cost

B. Disable autoscaling to keep compute costs predictable

C. Set up Cloud Monitoring alerts based on Dataflow job metrics

D. Use pipeline updates (update) to modify running streaming pipelines

E. Restart the pipeline when code changes are needed

Answers: C, D

Alerts help detect issues proactively.

Why this answer

Option C is correct because Cloud Monitoring alerts on Dataflow job metrics (e.g., system lag, watermark delay, or element count) enable proactive detection of pipeline health issues such as backpressure or stuck workers. This is a best practice for production pipelines to ensure reliability and timely intervention.

Exam trap

Google Cloud often tests the misconception that disabling autoscaling or restarting pipelines is acceptable for cost control or simplicity, when in fact these actions violate production best practices for reliability and data integrity.

542

MCQ · hard

A financial services company uses Cloud Pub/Sub with ordering keys to process transactions in order. Some messages are failing processing and getting stuck. The team wants to ensure that if a message fails, it can be reprocessed later without blocking subsequent messages. What should they implement?

A. Create multiple subscriptions for the same topic

B. Use a pull subscription with flow control settings

C. Configure a dead letter topic and handle the failed message separately

D. Increase the acknowledgment deadline to 600 seconds

Answer: C

Dead letter topics isolate failures, allowing forwarding of messages for later reprocessing.

Why this answer

Option C is correct because a dead letter topic (DLT) allows failed messages to be moved aside after exhausting retry attempts, so they do not block the processing of subsequent ordered messages. In Cloud Pub/Sub, ordering keys require messages with the same key to be delivered in order; if a message fails and is not acknowledged, it blocks all later messages with the same key. By configuring a dead letter topic, the failed message is automatically forwarded to the DLT once it reaches the configured maximum number of delivery attempts (5 by default), and the original subscription can continue processing the next messages in order.

The team can then reprocess the failed message from the DLT separately, without affecting the order of other messages.
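The blocking behaviour described above can be simulated in a few lines. This is a toy model of a single ordering key, not the Pub/Sub client API: `drain_ordered` and the always-failing `txn-2` handler are hypothetical, and the attempt limit simply mirrors the default cited above.

```python
MAX_DELIVERY_ATTEMPTS = 5  # mirrors the default cited above


def drain_ordered(messages, handler, use_dead_letter):
    """Process one ordering key's messages in order.

    Returns (processed, dead_lettered, blocked).
    """
    processed, dead_lettered = [], []
    for position, message in enumerate(messages):
        for _ in range(MAX_DELIVERY_ATTEMPTS):
            if handler(message):
                processed.append(message)
                break
        else:  # every delivery attempt failed
            if not use_dead_letter:
                # Without a dead letter topic the message keeps being redelivered,
                # so everything behind it on this key stays blocked.
                return processed, dead_lettered, messages[position:]
            dead_lettered.append(message)
    return processed, dead_lettered, []


def handler(message):
    """Hypothetical subscriber callback: txn-2 always fails."""
    return message != "txn-2"


queue = ["txn-1", "txn-2", "txn-3"]
print(drain_ordered(queue, handler, use_dead_letter=False))
# (['txn-1'], [], ['txn-2', 'txn-3'])
print(drain_ordered(queue, handler, use_dead_letter=True))
# (['txn-1', 'txn-3'], ['txn-2'], [])
```

With the dead letter path enabled, `txn-3` is processed and `txn-2` is parked for separate handling, which is the outcome the question asks for.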

Exam trap

Google Cloud often tests the misconception that increasing the acknowledgment deadline or adding flow control can resolve stuck messages with ordering keys, but the real solution is to use a dead letter topic to offload the failing message and unblock the ordered stream.

How to eliminate wrong answers

Option A is wrong because creating multiple subscriptions for the same topic does not solve the blocking issue; each subscription independently receives all messages, but within a single subscription, ordering keys still cause a failed message to block subsequent messages with the same key. Option B is wrong because pull subscriptions with flow control settings only limit the rate of message delivery and do not handle failed messages that are stuck; they do not provide a mechanism to move failed messages out of the way to unblock ordering. Option D is wrong because increasing the acknowledgment deadline to 600 seconds only gives the subscriber more time to process a message before it is redelivered, but it does not prevent a persistently failing message from blocking subsequent ordered messages indefinitely.

543

MCQ · easy

A data engineer needs to process data in a Dataflow pipeline that reads from a Pub/Sub topic. The pipeline must group events into 5-minute windows and compute the average value per key. Which Beam transform should they use after windowing?

A. Combine.perKey

B. ParDo

C. GroupByKey

D. CoGroupByKey

Answer: A

Combine.perKey applies a combining function (e.g., average) per key.

Why this answer

Combine.perKey with an averaging function computes the per-key average. GroupByKey groups by key but requires manual combination. ParDo is for element-wise processing. CoGroupByKey joins multiple PCollections.
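A per-key average is usually expressed as a combiner that carries a running (sum, count) pair. The sketch below mirrors the four-method shape of a Beam `CombineFn` in plain Python so it runs without Beam installed; `MeanCombiner` and `combine_per_key` are hypothetical names, and windowing is omitted.

```python
from collections import defaultdict


class MeanCombiner:
    """Mirrors the four-method shape of a Beam CombineFn, without importing Beam."""

    def create_accumulator(self):
        return (0.0, 0)  # (running sum, count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        # Partial results from different workers are merged pairwise-safe.
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine_per_key(pairs, combiner):
    """Apply the combiner to (key, value) pairs, one accumulator per key."""
    accumulators = defaultdict(combiner.create_accumulator)
    for key, value in pairs:
        accumulators[key] = combiner.add_input(accumulators[key], value)
    return {key: combiner.extract_output(acc) for key, acc in accumulators.items()}


events = [("sensor-a", 10), ("sensor-b", 4), ("sensor-a", 20)]
print(combine_per_key(events, MeanCombiner()))  # {'sensor-a': 15.0, 'sensor-b': 4.0}
```

Because the accumulator is small and mergeable, partial averages can be computed before the shuffle, which is the advantage over grouping all values with GroupByKey first. Beam's Python SDK also ships a ready-made `beam.combiners.MeanCombineFn` for this case.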

544

MCQ · hard

An organization uses Cloud Dataproc to run Spark jobs that process sensitive data. They need to ensure data is encrypted at rest and that only specific service accounts can access the data on cluster disks. What should they do?

A. Rely on the default encryption at rest and use VPC Service Controls to limit data exfiltration.

B. Use customer-supplied encryption keys (CSEK) and write a startup script to mount encrypted disks.

C. Enable encryption at rest using Google-managed encryption keys and grant all users the Dataproc Editor role.

D. Use customer-managed encryption keys (CMEK) for the cluster's persistent disks and assign a dedicated service account to the cluster with minimal IAM roles.

Answer: D

CMEK provides control over keys, and a dedicated service account restricts data access.

Why this answer

Option D is correct because using customer-managed encryption keys (CMEK) allows the organization to control and manage the encryption keys for persistent disks attached to the Dataproc cluster, ensuring data at rest is encrypted. Assigning a dedicated service account with minimal IAM roles ensures that only that service account can access the data on the cluster disks, following the principle of least privilege.

Exam trap

The trap here is that candidates often confuse CSEK (raw keys the customer supplies with each request) with CMEK (keys the customer manages in Cloud KMS, the option Dataproc uses for cluster disks), or assume that default encryption combined with VPC Service Controls is sufficient for granular access control to disk data.

How to eliminate wrong answers

Option A is wrong because default encryption at rest uses Google-managed keys, which does not allow the organization to control key access or restrict which service accounts can access data on cluster disks; VPC Service Controls prevent data exfiltration but do not enforce service-account-level access to disk data. Option B is wrong because Dataproc configures cluster disk encryption through customer-managed keys in Cloud KMS (CMEK), not customer-supplied keys (CSEK); mounting encrypted disks via a startup script is not a supported or recommended method for Dataproc clusters. Option C is wrong because granting all users the Dataproc Editor role would allow any user to access and modify cluster resources, violating the requirement that only specific service accounts can access data on cluster disks.


545

MCQmedium

You need to store and query a large dataset of customer profiles. The data is semi-structured and frequently updated. The application requires offline support for mobile users. Which database is MOST appropriate?

A.Firestore

B.BigQuery

C.Cloud Bigtable

D.Cloud SQL

AnswerA

Correct: Firestore provides offline support and works with semi-structured documents.

Why this answer

Firestore is the most appropriate choice because it is a NoSQL document database designed for semi-structured data, real-time synchronization, and offline support. It provides built-in offline persistence for mobile clients, allowing users to read and write data even without network connectivity, and automatically syncs changes when the connection is restored. This directly meets the requirements of semi-structured data, frequent updates, and offline mobile support.

Exam trap

Google Cloud often tests the distinction between databases designed for transactional/operational workloads (like Firestore) versus analytical/warehouse databases (like BigQuery), and the trap here is assuming that any NoSQL database (like Bigtable) supports offline mobile sync, when in fact only Firestore provides native offline persistence and real-time synchronization for mobile clients.

How to eliminate wrong answers

Option B (BigQuery) is wrong because it is a serverless data warehouse optimized for analytical queries on large datasets, not for transactional or real-time updates, and it lacks native offline support for mobile applications. Option C (Cloud Bigtable) is wrong because it is a wide-column NoSQL database designed for high-throughput, low-latency workloads like time-series or IoT data, but it does not support offline mobile synchronization or semi-structured document models. Option D (Cloud SQL) is wrong because it is a relational database (MySQL, PostgreSQL, SQL Server) requiring a fixed schema, which is unsuitable for semi-structured data, and it does not provide built-in offline support for mobile clients.


546

Multi-Selecteasy

A company is deploying a machine learning model for fraud detection. The model is trained using TensorFlow and will be served on Vertex AI Prediction. The team wants to implement model monitoring to detect prediction drift. Which TWO actions should they take? (Choose 2)

Select 2 answers

A.Configure Vertex AI Model Monitoring to compare online prediction inputs against training data statistics.

B.Collect ground truth labels for all predictions to measure accuracy drift.

C.Set up a separate Cloud Monitoring alerting policy to watch for prediction errors.

D.Enable automatic model retraining in Vertex AI Model Monitoring when drift is detected.

E.Enable prediction drift monitoring to detect changes in model output distribution.

AnswersA, E

This detects feature drift, which is a common monitoring need.

Why this answer

Option A is correct because Vertex AI Model Monitoring can be configured to compare online prediction inputs against training data statistics to detect skew, which is a form of drift. This is a standard capability of Vertex AI Model Monitoring, where you specify a baseline dataset (typically training data) and the service automatically computes statistics on incoming prediction requests to identify distribution shifts.

Exam trap

Google Cloud often tests the distinction between monitoring for drift (which focuses on input/output distributions) versus monitoring for model accuracy (which requires ground truth labels), and candidates mistakenly think collecting ground truth is a prerequisite for drift detection.
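The idea of comparing serving inputs against a training baseline without any ground truth labels can be sketched in a few lines. This is a minimal illustration, not how Vertex AI Model Monitoring is implemented: the function names, the sample payment-method values, the 0.3 threshold, and the choice of an L-infinity distance over category frequencies are all assumptions made for the example.

```python
from collections import Counter


def category_distribution(values):
    """Return the relative frequency of each category in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(baseline, current):
    """Largest absolute frequency difference across all categories."""
    categories = set(baseline) | set(current)
    return max(abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in categories)


def drift_detected(training_values, serving_values, threshold=0.3):
    """Flag drift when the distributions differ by more than `threshold`.

    The threshold is a hypothetical value chosen for this example.
    """
    baseline = category_distribution(training_values)
    current = category_distribution(serving_values)
    return l_infinity_distance(baseline, current) > threshold


# Hypothetical feature values: payment method seen in training vs. serving.
training = ["card", "card", "card", "wire"]
serving = ["wire", "wire", "wire", "card"]
print(drift_detected(training, serving))
```

Note that nothing here needs the true fraud label for any prediction, which is why drift monitoring can run before ground truth is available.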


547

MCQmedium

A company uses Dataflow streaming pipelines to process real-time events. They notice increasing system lag over time. Which two Cloud Monitoring metrics should be examined to diagnose the cause?

A.Pub/Sub subscription/num_undelivered_messages and Dataflow job/watermark_lag

B.Dataproc cluster/yarn_allocated_memory_percentage and Dataflow job/worker_cpu

C.Dataflow job/system_lag and Dataflow job/data_freshness

D.BigQuery query/execution_times and Dataflow job/elapsed_time

AnswerC

system_lag indicates processing delay; data_freshness shows watermark progress. Both are key for streaming lag.

Why this answer

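System lag is the maximum time an element has been processing or waiting to be processed, so a rising value points to a backlog inside the pipeline. Data freshness is the gap between the current time and the output watermark, so it shows how far behind real time the results are. Examining the two together helps distinguish a processing bottleneck from delayed input.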


548

Multi-Selecthard

A global fintech company needs a database that can serve transactional (OLTP) and analytical (OLAP) workloads with strong consistency. They require high availability and PostgreSQL compatibility. Which TWO Google Cloud databases meet these requirements? (Choose 2 correct options)

Select 2 answers

A.Cloud SQL

B.Cloud Bigtable

C.Cloud Spanner

D.BigQuery

E.AlloyDB

AnswersC, E

Strong consistency, globally distributed, OLTP+analytics via SQL, PostgreSQL interface available.

Why this answer

Cloud Spanner is correct because it provides a globally distributed, strongly consistent relational database service that supports both OLTP and OLAP workloads via PostgreSQL-compatible interfaces (including the PostgreSQL dialect). It offers high availability through synchronous replication across zones and regions, and its TrueTime-based atomic clocks ensure external consistency for transactions, meeting the fintech company's requirements for strong consistency and PostgreSQL compatibility.

Exam trap

Google Cloud often tests the misconception that Cloud SQL or BigQuery can serve both OLTP and OLAP workloads with strong consistency, but Cloud SQL lacks global scaling and OLAP performance, and BigQuery is purely analytical without transactional support.


549

MCQeasy

Which Google Cloud service provides a visual interface for building ETL pipelines using a drag-and-drop design and includes pre-built transforms from a marketplace?

A.Dataproc

B.Cloud Data Fusion

C.Dataprep

D.BigQuery

AnswerB

Cloud Data Fusion is a visual ETL tool with CDAP plugins and a Hub for marketplace transforms.

Why this answer

Cloud Data Fusion offers a visual, code-free ETL tool with a rich set of plugins from the Hub. Dataprep is for data wrangling, not full ETL.


550

MCQmedium

A retail company processes real-time clickstream data using Cloud Pub/Sub and Dataflow. The pipeline aggregates events by user session and writes to Bigtable for low-latency queries. However, users report that session data is sometimes missing or duplicated. What is the most likely cause?

A.Session windowing is configured with too short a gap duration.

B.Bigtable schema design causes row key collisions.

C.Dataflow's default behavior discards late-arriving data.

D.Pub/Sub provides at-least-once delivery, and Dataflow does not deduplicate by default.

AnswerD

At-least-once delivery leads to duplicates without dedup in pipeline.

Why this answer

D is correct because Pub/Sub offers at-least-once delivery, meaning the same message may be delivered multiple times. Dataflow does not automatically deduplicate messages unless explicitly configured (e.g., using idempotent sinks or custom deduplication logic). Without deduplication, the same session event can be processed more than once, leading to duplicate session data in Bigtable.

Exam trap

Google Cloud often tests the misconception that Pub/Sub provides exactly-once delivery or that Dataflow automatically deduplicates messages from Pub/Sub, when in fact Pub/Sub is at-least-once and Dataflow requires explicit deduplication for idempotent processing.

How to eliminate wrong answers

Option A is wrong because a short gap duration would cause sessions to be split prematurely, which fragments sessions (events not grouped into the same session) but does not create duplicates. Option B is wrong because writes that share a Bigtable row key land in the same row, so collisions would overwrite or version existing cells rather than produce duplicate sessions. Option C is wrong because dropping late data explains only missing records: with the default allowed lateness of 0, data that arrives after the watermark passes the end of its window is dropped unless allowed lateness is configured, and that never produces duplicates.
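The explicit deduplication mentioned above can be sketched as a filter keyed on a unique message identifier. This is a simplified, in-memory illustration rather than Dataflow or Apache Beam code: the `deduplicate` helper and the `id`, `user`, and `event` fields are hypothetical, and a real streaming pipeline would need to bound how long seen identifiers are remembered.

```python
def deduplicate(messages, seen_ids=None):
    """Return messages in order, keeping only the first copy of each 'id'."""
    seen = set() if seen_ids is None else seen_ids
    unique = []
    for message in messages:
        if message["id"] in seen:
            continue  # redelivered copy of a message already processed
        seen.add(message["id"])
        unique.append(message)
    return unique


# Hypothetical at-least-once delivery: message m1 arrives twice.
delivered = [
    {"id": "m1", "user": "u1", "event": "click"},
    {"id": "m2", "user": "u1", "event": "view"},
    {"id": "m1", "user": "u1", "event": "click"},
]
unique_events = deduplicate(delivered)
print([m["id"] for m in unique_events])
```

Passing the same `seen_ids` set across calls carries the deduplication state from one batch of messages to the next.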


551

MCQeasy

A data engineer wants to store archived log files in Cloud Storage with a retention policy that prevents deletion for 5 years. Which feature should they use?

A.Object Lifecycle rule with Delete action after 5 years

B.Retention Policy on the bucket set to 5 years

C.Object Hold (temporal)

D.Versioning enabled

AnswerB

Retention policies ensure objects cannot be deleted or replaced until the retention period expires.

Why this answer

A retention policy on a Cloud Storage bucket enforces a minimum retention period for all objects in the bucket, preventing deletion or overwrite until the policy duration has elapsed. Setting it to 5 years ensures that archived log files cannot be deleted before that time, meeting the data engineer's requirement exactly. This is a bucket-level setting that applies to all objects and can be locked to make it permanent, unlike object-level holds or lifecycle rules.

Exam trap

Google Cloud often tests the distinction between lifecycle rules that delete objects and retention policies that prevent deletion, so the trap here is assuming that a lifecycle rule with a Delete action can enforce a retention period, when in fact it does the opposite.

How to eliminate wrong answers

Option A is wrong because an Object Lifecycle rule with a Delete action after 5 years would automatically delete objects after 5 years, which is the opposite of preventing deletion; it does not enforce a retention period. Option C is wrong because an object hold is placed on individual objects and blocks deletion only until someone releases it; it has no fixed duration, so it cannot enforce a bucket-wide 5-year minimum, and it is typically used for legal or compliance holds. Option D is wrong because Versioning enabled preserves previous versions of objects but does not prevent deletion of the current version; it allows recovery after deletion but does not block deletion itself, so it does not enforce a retention policy.
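The difference between a minimum retention period and a lifecycle Delete rule comes down to which direction the age check runs. The sketch below models only the retention side and is not Cloud Storage client code: the helper names are hypothetical, and 5 years is approximated as 5 × 365 days for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption for the example: 5 years approximated as 5 * 365 days.
RETENTION_PERIOD = timedelta(days=5 * 365)


def earliest_deletion_time(created_at, retention=RETENTION_PERIOD):
    """An object may not be deleted or replaced before this moment."""
    return created_at + retention


def deletion_allowed(created_at, now, retention=RETENTION_PERIOD):
    """True only once the object's age has reached the retention period."""
    return now >= earliest_deletion_time(created_at, retention)


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(deletion_allowed(created, datetime(2026, 1, 1, tzinfo=timezone.utc)))
print(deletion_allowed(created, datetime(2029, 1, 1, tzinfo=timezone.utc)))
```

A lifecycle Delete rule would act on the same age comparison but would remove the object once the age is reached, instead of blocking deletion before it.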


552

MCQmedium

A team needs to run analytics on data stored in Cloud Storage (Parquet format) without moving it into BigQuery storage. They want to use SQL queries and BigQuery features like caching and partitioning. Which approach should they use?

A.Create a BigQuery external table pointing to the Cloud Storage location.

B.Use Cloud Dataproc to run Spark SQL on the data.

C.Load the data into BigQuery tables using a batch load job.

D.Use BigLake tables with Cloud Storage.

AnswerA

External tables allow querying data in GCS without loading. Note that native BigQuery partitioning and clustering are not available for external tables; partition pruning relies on a Hive-partitioned file layout instead.

Why this answer

Option A is correct because an external table in BigQuery allows you to query data stored in Cloud Storage (Parquet format) using standard SQL without moving the data into BigQuery's managed storage. This approach supports BigQuery features like caching (results caching) and partitioning (by defining a Hive-partitioned external table), meeting all requirements.

Exam trap

Google Cloud often tests the distinction between external tables (which query data in place) and BigLake tables (which add security and governance layers but are not required for basic SQL queries with caching and partitioning), leading candidates to overcomplicate the solution by choosing BigLake when a simple external table suffices.

How to eliminate wrong answers

Option B is wrong because Cloud Dataproc with Spark SQL requires spinning up a cluster and does not leverage BigQuery's native SQL engine, caching, or partitioning features; it also moves compute to the data but not the query engine. Option C is wrong because loading data into BigQuery tables moves the data into BigQuery storage, which violates the requirement to not move the data. Option D is wrong because BigLake tables are a separate concept that unifies data lakes and warehouses but still requires creating a BigLake connection and does not directly provide BigQuery's caching and partitioning features for external data without additional configuration; the simpler external table approach is the correct fit.


553

Multi-Selecteasy

Which TWO are benefits of using Vertex AI Endpoints for model serving?

Select 2 answers

A.Batch prediction support out of the box.

B.Integrated monitoring for prediction latency and error rates.

C.Automatic scaling based on traffic.

D.Automatic model retraining when drift is detected.

E.Built-in support for A/B testing without any additional configuration.

AnswersB, C

Vertex AI endpoints integrate with Cloud Monitoring for operational metrics.

Why this answer

Vertex AI Endpoints provide integrated monitoring for prediction latency and error rates out of the box, enabling you to track model performance and detect anomalies without additional instrumentation. This is a core operational feature that helps maintain service-level objectives (SLOs) and quickly identify degradation in production.

Exam trap

Google Cloud often tests the distinction between features that are 'built-in' versus those that require separate services or additional configuration, so candidates mistakenly assume batch prediction or automatic retraining are part of Endpoints when they are actually separate Vertex AI components.


554

Multi-Selecthard

A team is deploying a complex model with multiple preprocessing steps. They want to ensure consistent preprocessing during training and serving. Which three approaches can achieve this? (Select 3)

Select 3 answers

A.Store preprocessing logic in a shared Python module

B.Use a separate preprocessing service called from the model

C.Use two separate pipelines for training and serving

D.Use Vertex AI Feature Transform Engine

E.Embed preprocessing logic in the model graph

AnswersA, D, E

A shared module ensures the same code is used in training and serving if properly versioned.

Why this answer

Option A is correct because storing preprocessing logic in a shared Python module ensures that the same code is used during both training and serving, eliminating drift between environments. This approach leverages version control and dependency management to guarantee consistency, which is critical for reproducibility in production ML pipelines.

Exam trap

Google Cloud often tests the misconception that a separate preprocessing service (Option B) is a good architectural pattern for consistency, when in fact it introduces a single point of failure and versioning complexity that undermines the goal of identical preprocessing.
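The shared-module approach from Option A can be sketched as one transformation function that both the training job and the serving path import. This is an illustrative outline only: the feature names (`amount`, `country`), the mean-imputation rule, and the helper names are assumptions, not details from the question.

```python
import math


def fit_statistics(records):
    """Compute the mean and standard deviation of 'amount' from training data."""
    amounts = [r["amount"] for r in records if r.get("amount") is not None]
    mean = sum(amounts) / len(amounts)
    variance = sum((a - mean) ** 2 for a in amounts) / len(amounts)
    return mean, math.sqrt(variance) or 1.0  # avoid dividing by zero later


def preprocess(record, amount_mean, amount_std):
    """Single source of truth for feature transformation.

    Both the training pipeline and the prediction handler call this function,
    so imputation, scaling, and normalization cannot diverge between them.
    """
    amount = record.get("amount")
    if amount is None:
        amount = amount_mean  # impute a missing value with the training mean
    return {
        "amount_scaled": (amount - amount_mean) / amount_std,
        "country": (record.get("country") or "unknown").strip().lower(),
    }


# Training side: fit statistics once, then transform every record.
training_records = [
    {"amount": 10.0, "country": "US"},
    {"amount": 30.0, "country": " de "},
]
mean, std = fit_statistics(training_records)
training_features = [preprocess(r, mean, std) for r in training_records]

# Serving side: reuse the same function and the same fitted statistics.
serving_features = preprocess({"amount": None, "country": "US"}, mean, std)
print(training_features, serving_features)
```

The fitted statistics have to be versioned and shipped together with the model, otherwise the shared code alone does not guarantee identical outputs.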


555

Multi-Selecthard

A company uses Cloud Storage to store sensitive customer data. They need to restrict access to the data so that only requests from within a specific VPC network are allowed, and block all access from the public internet. Which TWO configurations should they implement? (Choose 2.)

Select 2 answers

A.Enable Private Google Access on the VPC subnet.

B.Use Cloud Storage signed URLs for all access.

C.Disable public internet access by turning off the default internet gateway.

D.Use IAM conditions to restrict access based on VPC network.

E.Use VPC Service Controls to create a service perimeter around Cloud Storage.

AnswersD, E

IAM conditions can restrict access to requests originating from a specific VPC network.

Why this answer

Option D is correct because IAM conditions can be used to restrict access to Cloud Storage based on the requester's VPC network, ensuring only requests originating from the specified VPC are allowed. Option E is correct because VPC Service Controls create a service perimeter that prevents data exfiltration and blocks access from outside the perimeter, effectively denying public internet requests.

Exam trap

Google Cloud often tests the misconception that disabling the internet gateway or using Private Google Access alone can block public access to Cloud Storage, when in fact these controls affect connectivity from within the VPC, not inbound requests to Google-managed services.


556

Multi-Selectmedium

You need to ingest streaming data from a custom application into BigQuery with exactly-once semantics and low latency. The data volume is up to 10 MB/s. Which TWO services should you combine?

Select 2 answers

A.Pub/Sub

B.Cloud Functions

C.BigQuery legacy streaming inserts

D.Dataflow with Storage Write API

E.Datastream

AnswersA, D

Pub/Sub is the recommended message ingestion service for streaming data.

Why this answer

Pub/Sub provides reliable, low-latency message ingestion, and Dataflow can read from Pub/Sub and write to BigQuery using the Storage Write API, which supports exactly-once semantics. The Storage Write API with committed mode ensures exactly-once delivery.


557

Multi-Selectmedium

A data engineer is designing a BigQuery table for time-series data that will be queried frequently by time range and also by a customer_id. Which TWO design decisions will improve query performance and manage costs? (Choose two.)

Select 2 answers

A.Partition the table by day on the timestamp column

B.Cluster the table on customer_id

C.Disable automatic reclustering to save costs

D.Set partition expiration to 1 year

E.Use nested repeated fields for customer data

AnswersA, B

Enables partition pruning for time-range queries.

Why this answer

Partitioning the table by day on the timestamp column allows BigQuery to prune partitions when queries filter by a time range, scanning only the relevant partitions instead of the entire table. This directly reduces the amount of data read, improving query performance and lowering costs.

Exam trap

Google Cloud often tests the misconception that disabling automatic reclustering saves costs, but in reality it is free and essential for maintaining clustering benefits, while partition expiration is a lifecycle management feature, not a performance optimization.
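Partition pruning can be illustrated with a toy model that counts how many rows a time-range query has to touch. This is a conceptual sketch and not BigQuery code: the row layout, the `event_date` and `customer_id` field names, and the helpers are assumptions, and clustering (which would further narrow the scan inside each partition) is not modeled.

```python
from collections import defaultdict
from datetime import date


def build_daily_partitions(rows):
    """Group rows by day, mimicking a table partitioned on a date column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_date"]].append(row)
    return partitions


def query(partitions, start, end, customer_id):
    """Return matching rows and the number of rows actually scanned."""
    scanned = 0
    matches = []
    for day, rows in partitions.items():
        if not (start <= day <= end):
            continue  # partition pruned: its rows are never read
        scanned += len(rows)
        matches.extend(r for r in rows if r["customer_id"] == customer_id)
    return matches, scanned


# Ten hypothetical rows, one per day, spread over three customers.
rows = [
    {"event_date": date(2024, 1, d), "customer_id": f"c{d % 3}", "value": d}
    for d in range(1, 11)
]
partitions = build_daily_partitions(rows)
matches, scanned = query(partitions, date(2024, 1, 2), date(2024, 1, 4), "c0")
print(len(rows), scanned, [m["value"] for m in matches])
```

Without the date filter every partition would be read, which is the full-table scan that partitioning is meant to avoid.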


558

MCQhard

You are designing a Dataflow pipeline that reads from Pub/Sub, aggregates events into 10-minute windows, and writes the results to BigQuery. The pipeline must reliably handle late-arriving data (up to 1 hour) and prevent duplicate aggregations. Which combination of pipeline options should you use?

A.Use exactly-once processing by setting the pipeline's streaming engine to exactly-once and using a BigQuery sink with exactly-once semantics

B.Use at-least-once processing and rely on BigQuery's automatic deduplication

C.Use exactly-once processing and write results to a staging table, then use a scheduled merge query to combine with the main table

D.Use at-most-once processing to guarantee no duplicates, and accept data loss

AnswerA

Dataflow's exactly-once sink for BigQuery ensures no duplicates even with late data, using a combination of idempotent writes and deduplication.

Why this answer

To prevent duplicate aggregations, you need exactly-once processing end to end. Dataflow streaming pipelines process each record exactly once by default, and the BigQuery sink can preserve that guarantee, for example with the Storage Write API write method or with FILE_LOADS run at a set triggering frequency. Relying on at-least-once processing with deduplication in BigQuery is not reliable, because legacy streaming inserts deduplicate only on a best-effort basis.

Late data is a separate concern, handled by configuring allowed lateness (1 hour here) on the 10-minute windows. Among the options, only A pairs exactly-once processing with an exactly-once BigQuery sink without adding an extra staging-and-merge step.
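The interaction between 10-minute windows, a 1-hour allowed lateness, and duplicate suppression can be sketched without any pipeline framework. This is a simplified model and not Apache Beam code: timestamps are plain seconds, a single watermark value is applied to the whole batch, and the `aggregate` helper and event identifiers are hypothetical.

```python
WINDOW_SECONDS = 10 * 60            # 10-minute fixed windows
ALLOWED_LATENESS_SECONDS = 60 * 60  # accept data up to 1 hour late


def window_start(event_time):
    """Start of the fixed window that contains `event_time` (in seconds)."""
    return event_time - (event_time % WINDOW_SECONDS)


def accept(event_time, watermark):
    """Keep an event while its window end plus allowed lateness is still ahead
    of the watermark."""
    window_end = window_start(event_time) + WINDOW_SECONDS
    return watermark < window_end + ALLOWED_LATENESS_SECONDS


def aggregate(events, watermark):
    """Count events per window, skipping duplicates and expired windows."""
    counts = {}
    seen = set()
    for event_id, event_time in events:
        if event_id in seen or not accept(event_time, watermark):
            continue
        seen.add(event_id)
        start = window_start(event_time)
        counts[start] = counts.get(start, 0) + 1
    return counts


# Hypothetical events as (id, event time in seconds); "a" is delivered twice.
events = [("a", 30), ("b", 590), ("a", 30), ("c", 650)]
print(aggregate(events, watermark=4000))
print(aggregate(events, watermark=4300))
```

With the watermark at 4000 the first window (0 to 600) is still within its lateness bound, so it counts two distinct events; at 4300 that window has expired and only the second window remains.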


559

MCQeasy

Which Google Cloud database offers global distribution, strong consistency, and a 99.999% SLA?

A.Cloud Spanner

B.Cloud Bigtable

C.Firestore

D.Cloud SQL

AnswerA

Correct: Spanner offers global distribution, strong consistency, and 99.999% SLA.

Why this answer

Cloud Spanner is the only Google Cloud database that provides global distribution (horizontally scaling across regions), strong consistency (external consistency with TrueTime), and a 99.999% SLA. It combines the benefits of relational database structure with non-relational horizontal scale, making it ideal for globally distributed, strongly consistent workloads.

Exam trap

Google Cloud often tests the distinction between strong consistency and eventual consistency in globally distributed databases. The trap here is that candidates may assume Cloud Bigtable's high throughput and multi-cluster replication imply strong consistency across regions, or that Firestore's multi-region locations amount to global distribution.

How to eliminate wrong answers

Option B (Cloud Bigtable) is wrong because replication between clusters is eventually consistent by default, so it does not offer strong consistency across regions. Option C (Firestore) is wrong because, although it is strongly consistent and its multi-region locations carry a 99.999% SLA, it is a document database whose multi-region locations replicate within one geographic area (such as the United States or Europe) rather than distributing data globally. Option D (Cloud SQL) is wrong because it is a regional relational database with no global distribution capability and an availability SLA below 99.999%.


560

MCQeasy

A data engineer needs to load 10 GB of CSV files from Amazon S3 into BigQuery on a daily basis. The files arrive in a specific S3 bucket at 3 AM UTC each day. Which service should be used to automate this transfer?

A.Cloud Storage Transfer Service

B.Dataflow with Pub/Sub

C.Transfer Appliance

D.BigQuery Data Transfer Service

AnswerD

BigQuery Data Transfer Service can schedule and automate data loads from Amazon S3 directly into BigQuery.

Why this answer

BigQuery Data Transfer Service supports scheduled transfers from Amazon S3 directly to BigQuery, making it the appropriate choice for this recurring batch load.


561

Multi-Selecthard

You are building a time-series forecasting model with BigQuery ML. Which three steps should you perform to properly split the data and evaluate the model? (Choose THREE)

Select 3 answers

A.Evaluate on a holdout set that is later in time than the training set.

B.Use time-series cross-validation with expanding windows.

C.Split the data randomly into training and testing sets.

D.Use a chronological split based on a cutoff date.

E.Use k-fold cross-validation with random folds.

AnswersA, B, D

Testing on future data simulates real-world forecasting.

Why this answer

For time-series, you must maintain temporal order: split chronologically (not randomly), use a cutoff date for training/validation, and evaluate on unseen future data. Cross-validation should be time-series aware (e.g., expanding window). Random split is invalid for time-series.

Using a single train/test split may be insufficient; multiple windows are better.
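Expanding-window cross-validation can be sketched as a generator of index ranges in which every test fold lies strictly after its training data. The function below is a plain-Python illustration, not BigQuery ML syntax: `expanding_window_splits` and its parameters are hypothetical, and the rows are assumed to be sorted chronologically already.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Return (train_indices, test_indices) pairs with a growing training set.

    Each test fold covers the `test_size` rows immediately after its training
    rows, so the model is always evaluated on data later than what it saw.
    """
    splits = []
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train = list(range(0, test_start))
        test = list(range(test_start, test_end))
        splits.append((train, test))
    return splits


splits = expanding_window_splits(n_samples=10, n_splits=3, test_size=2)
for train, test in splits:
    print(f"train rows 0..{train[-1]}, test rows {test[0]}..{test[-1]}")
```

The final pair is equivalent to a single chronological split on a cutoff date, with the last rows acting as the holdout set.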


562

MCQeasy

You need to preprocess tabular data for training a classification model using Vertex AI. The dataset has missing values in numerical columns and categorical columns with high cardinality. Which Vertex AI service provides automated feature engineering and preprocessing as part of the pipeline?

A.Vertex AI Pipelines

B.AutoML Tables

C.Vertex AI Feature Store

D.Vertex AI Workbench

AnswerA

Vertex AI Pipelines orchestrates preprocessing steps such as imputation and encoding as part of an ML pipeline.

Why this answer

Vertex AI Pipelines allows you to build ML pipelines with components for feature engineering, including handling missing values and encoding. Vertex AI Feature Store is for serving features, not preprocessing.


563

MCQhard

A company has a batch prediction job that runs daily using AI Platform Batch Prediction. The job uses a TensorFlow model and processes 10 GB of data. Recently, the job started failing with the error 'The replica worker 0 exited with a non-zero exit code: Out of memory'. Which action should the team take to resolve this without rewriting the model?

A.Increase the number of workers (parallelism) to distribute the data across more machines.

B.Use a machine type with more memory, such as n1-highmem-8.

C.Reduce the batch size parameter in the prediction job configuration.

D.Optimize the model to use less memory by pruning or quantization.

AnswerB

Directly addresses the out-of-memory error by providing more RAM per worker.

Why this answer

The error 'Out of memory' on replica worker 0 indicates that the machine type assigned to the prediction job does not have enough RAM to load the model and process the 10 GB batch. Increasing the machine type to one with more memory (e.g., n1-highmem-8) directly addresses the memory constraint without requiring any code changes. This is the most straightforward fix because AI Platform Batch Prediction allows you to specify machine types in the job configuration, and the error is purely a resource allocation issue.

Exam trap

Google Cloud often tests the distinction between scaling horizontally (adding workers) and scaling vertically (increasing machine resources), where candidates mistakenly assume parallelism solves memory issues, but the error is per-worker memory exhaustion, not throughput.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers (parallelism) distributes the data across more machines but does not increase the memory per worker; each replica still has the same limited memory, so the out-of-memory error would persist on each worker. Option C is wrong because reducing the batch size parameter controls how many predictions are processed per step, which can reduce peak memory usage per request, but the error occurs during model loading or initial data processing, not during per-step prediction; the 10 GB dataset and model size still require sufficient base memory. Option D is wrong because while pruning or quantization could reduce model memory footprint, the question explicitly states 'without rewriting the model,' and these techniques require modifying the model architecture or retraining, which is a form of rewriting.


564

MCQeasy

A company needs to process large files (100GB each) from Cloud Storage using Dataproc. They want to minimize job execution time. Which configuration is most appropriate?

A.Use a single-node cluster

B.Use a cluster with preemptible worker nodes and high-CPU machine types

C.Use HDFS for input data to avoid network latency

D.Use a cluster with many standard worker nodes

AnswerB

Preemptible VMs reduce cost, high-CPU machines improve speed.

Why this answer

Option B is correct because preemptible worker nodes are significantly cheaper than standard nodes, allowing you to scale out the cluster with many more workers for the same cost, which directly reduces job execution time for embarrassingly parallel data processing tasks. High-CPU machine types provide more vCPUs relative to memory, which suits compute-intensive Dataproc jobs such as data transformation or machine learning. This combination maximizes parallelism and minimizes wall-clock time for large-scale batch jobs.

Exam trap

The trap here is that candidates often assume standard worker nodes are always better for performance, ignoring the cost-benefit of preemptible nodes that allow scaling to many more workers for the same budget, which directly reduces execution time for parallelizable jobs.

How to eliminate wrong answers

Option A is wrong because a single-node cluster lacks parallelism, so processing 100GB files would be severely bottlenecked by a single machine's CPU and memory, leading to long execution times. Option C is wrong because HDFS is not used for input data from Cloud Storage; Dataproc reads directly from Cloud Storage via the gs:// connector, and using HDFS would require copying data first, adding network latency and storage overhead. Option D is wrong because using many standard worker nodes is less cost-effective than using preemptible nodes; standard nodes are more expensive, so for the same budget you can provision fewer workers, resulting in longer job execution times compared to a larger cluster of preemptible nodes.


565

MCQmedium

A team is designing a data lake on Google Cloud using Cloud Storage and BigQuery. They need to ensure that sensitive data (e.g., PII) is encrypted at rest and have the ability to audit access. Which approach meets these requirements?

A.Use Customer-Managed Encryption Keys (CMEK) and enable VPC Service Controls.

B.Use Customer-Managed Encryption Keys (CMEK) and enable Cloud Audit Logs.

C.Use Default Encryption and enable Data Loss Prevention (DLP) API.

D.Use Customer-Supplied Encryption Keys (CSEK) and enable VPC Service Controls.

AnswerB

CMEK provides control over encryption keys, and Cloud Audit Logs record access to data.

Why this answer

Option B is correct because Customer-Managed Encryption Keys (CMEK) allow the team to control and manage the encryption keys used to protect data at rest in Cloud Storage and BigQuery, while enabling Cloud Audit Logs provides the necessary audit trail for access to both the data and the keys. This combination directly satisfies the requirements for encryption at rest and auditability.

Exam trap

Google Cloud often tests the distinction between encryption key management (CMEK vs. CSEK vs. Default) and security controls (VPC Service Controls vs. Audit Logs), leading candidates to conflate network perimeter controls with audit capabilities.

How to eliminate wrong answers

Option A is wrong because VPC Service Controls provide network-based security boundaries to prevent data exfiltration, but they do not provide audit logging of access to data or keys, which is a separate requirement. Option C is wrong because Default Encryption uses Google-managed keys, which do not give the team control over encryption keys, and the DLP API is for inspecting and classifying sensitive data, not for encryption at rest or audit logging. Option D is wrong because Customer-Supplied Encryption Keys (CSEK) require the customer to manage their own keys outside Google Cloud, which adds operational complexity and does not integrate with Cloud Audit Logs for key access auditing; VPC Service Controls again do not provide audit logging.


566

MCQmedium

You are building a streaming pipeline to ingest real-time clickstream data from a website into BigQuery for immediate analysis. The data must be available in BigQuery within seconds and you need to handle late-arriving data (e.g., browser offline events) that may arrive hours later. Which approach should you use?

A.Use Pub/Sub with Dataflow, writing to BigQuery using the Storage Write API in committed mode.

B.Use Cloud Logging to capture logs and export to BigQuery via a sink.

C.Use Pub/Sub with Cloud Functions, writing each event directly via BigQuery legacy streaming inserts.

D.Use Datastream to stream clickstream data from Cloud SQL to BigQuery.

AnswerA

This provides low-latency streaming, late data handling via Dataflow's triggers, and efficient writes.

Why this answer

Option A is correct because Pub/Sub provides a scalable, durable ingestion layer for real-time clickstream data, and Dataflow can handle late-arriving data via its built-in watermark and trigger mechanisms. The Storage Write API in committed mode ensures exactly-once semantics and low-latency writes to BigQuery, meeting the requirement that data be available within seconds while preserving data consistency for delayed events.

Exam trap

The trap here is that candidates assume legacy streaming inserts (Option C) are sufficient for real-time needs, but they overlook that legacy inserts deduplicate only on a best-effort basis and that Cloud Functions provides no late-data handling, both of which matter in the PDE exam's focus on streaming pipelines with out-of-order events.

How to eliminate wrong answers

Option B is wrong because Cloud Logging is designed for log ingestion and analysis, not for high-throughput real-time clickstream pipelines; a log sink to BigQuery offers no windowing or trigger logic for events that arrive hours late. Option C is wrong because BigQuery legacy streaming inserts provide only best-effort deduplication rather than exactly-once semantics, and Cloud Functions lack the stateful processing capabilities (e.g., windowing, triggers) needed to handle late-arriving data correctly. Option D is wrong because Datastream is built for continuous replication from databases like Cloud SQL to BigQuery, not for ingesting raw clickstream events from a website; it requires an intermediary database, which adds unnecessary complexity and latency.

Full explanation →

567

MCQeasy

A team needs to orchestrate a multi-step workflow that involves calling external APIs, running BigQuery queries, and conditionally executing Cloud Functions. Which Google Cloud service is best suited for this?

A.Dataflow

B.Workflows

C.Cloud Composer

D.Cloud Scheduler

AnswerB

Lightweight orchestration service with steps, conditions, and error handling.

Why this answer

Workflows is a serverless orchestration service that allows you to define multi-step workflows as a sequence of steps, including HTTP calls to external APIs, BigQuery queries, and conditional logic to invoke Cloud Functions. It integrates with other Google Cloud services through built-in connectors and authenticated HTTP calls, and it supports error handling, retries, and parallel steps, making it ideal for this use case.

Exam trap

Google Cloud often tests the distinction between orchestration services (Workflows) and data processing services (Dataflow) or scheduling services (Cloud Scheduler), leading candidates to choose Dataflow because they confuse data processing with workflow orchestration.

How to eliminate wrong answers

Option A is wrong because Dataflow is a stream and batch data processing service based on Apache Beam, not an orchestration tool for coordinating API calls, BigQuery queries, and Cloud Functions. Option C is wrong because Cloud Composer is a managed Apache Airflow service that is designed for complex, scheduled workflows with dependencies, but it is heavier, requires more setup, and is overkill for a simple multi-step orchestration that Workflows handles more efficiently. Option D is wrong because Cloud Scheduler is a cron job service for triggering tasks on a schedule, but it cannot orchestrate conditional logic, API calls, or BigQuery queries within a single workflow.

Full explanation →

568

MCQhard

You are migrating a large on-premises data warehouse to BigQuery. The data includes sensitive PII columns that must be masked for certain users. Which BigQuery feature can automatically redact PII in query results based on user roles?

A.IAM conditions on tables

B.Authorized views

C.Column-level security with data masking

D.Cloud DLP API

AnswerC

Data masking policy tags can automatically redact PII based on user roles.

Why this answer

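BigQuery column-level security uses policy tags, and dynamic data masking builds on it to mask sensitive columns automatically according to the querying user's IAM roles. Authorized views can hide or transform columns, but each view must be created and maintained manually. IAM conditions control whether access is granted; they do not mask data. The Cloud DLP API inspects and de-identifies data, but it does not apply role-based masking to query results.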

Full explanation →

569

MCQeasy

You need to process large-scale log files (hundreds of terabytes) using Apache Spark on Google Cloud. The job runs nightly and you want to minimise costs. Which Dataproc cluster configuration is MOST cost-effective?

A.Single-node cluster

B.Standard cluster with preemptible workers for the primary worker nodes

C.Standard cluster with preemptible secondary workers

D.Standard cluster with all standard (non-preemptible) workers

AnswerC

Secondary workers can be preemptible, reducing cost. Primary workers handle coordination and must be standard.

Why this answer

Preemptible VMs are significantly cheaper than standard VMs and suit fault-tolerant batch jobs such as nightly Spark processing. Dataproc allows preemptible VMs only as secondary workers, so the cheapest valid configuration keeps standard primary workers and adds preemptible secondary workers.

Full explanation →

570

MCQmedium

You configured a model deployment monitor on your Vertex AI endpoint as shown. What will happen when the feature 'age' has a skew of 0.4?

A.An alert will be sent to admin@example.com

B.The endpoint will automatically roll back to a previous model version

C.No alert will be sent because the skew threshold is 0.2 for income

D.An alert will be sent only if both features exceed their thresholds

AnswerA

Skew 0.4 exceeds threshold 0.3 for age.

Why this answer

Option A is correct because the monitoring configuration sets an alert threshold of 0.3 for the feature 'age', and a skew of 0.4 exceeds that threshold. Vertex AI Model Monitoring will trigger the configured alert action, which in this case is sending an email to admin@example.com. The alert is based on the specific feature's threshold, not on any other feature's threshold.

Exam trap

Google Cloud often tests the misconception that alerts require multiple features to exceed thresholds or that the system can automatically roll back models, when in reality each feature is evaluated independently and only notifications are sent.

How to eliminate wrong answers

Option B is wrong because Vertex AI Model Monitoring does not automatically roll back model deployments; it only sends alerts based on configured actions, and auto-rollback is not a supported feature in this context. Option C is wrong because the 0.2 threshold applies to 'income', not to 'age'; the skew for 'age' is 0.4, which exceeds its own threshold, so an alert will be sent regardless of the 'income' feature's threshold. Option D is wrong because the alert is triggered per feature when its individual threshold is exceeded; there is no requirement for both features to exceed their thresholds simultaneously.
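
The per-feature evaluation described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the Vertex AI API: the function name is hypothetical, and the thresholds are assumptions taken from the answer summary (0.3 for 'age') and option C (0.2 for 'income'), since the actual configuration is not reproduced here.

```python
# Hypothetical thresholds: 0.3 for "age" follows the answer summary,
# 0.2 for "income" follows option C. The real config is not shown here.
SKEW_THRESHOLDS = {"age": 0.3, "income": 0.2}


def features_to_alert(observed_skew, thresholds=SKEW_THRESHOLDS):
    """Return the features whose observed skew exceeds their own threshold."""
    return sorted(
        name
        for name, skew in observed_skew.items()
        if name in thresholds and skew > thresholds[name]
    )


# Only "age" breaches its threshold, and that alone is enough to alert.
print(features_to_alert({"age": 0.4, "income": 0.1}))
```

Each feature is compared only against its own threshold, which is why options C and D can be eliminated.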

Full explanation →

571

Drag & Dropmedium

Drag and drop the steps to migrate an on-premises MySQL database to Cloud SQL using Database Migration Service into the correct order.

Why this order

Database Migration Service enables minimal-downtime migrations using replication.

Full explanation →

572

Multi-Selecthard

A company uses Cloud Composer to orchestrate Dataproc and BigQuery jobs. They need to implement retry logic for transient failures. Which THREE features can help?

Select 3 answers

A.Dataflow pipeline retries

B.DAG retry_delay

C.BigQuery job retries

D.Cloud Composer high availability

E.Task retries and retry_delay

AnswersB, C, E

`retry_delay`, typically set in the DAG's `default_args`, controls how long Airflow waits between task retries.

Why this answer

Option B is correct because Cloud Composer (Apache Airflow) allows setting `retry_delay` at the DAG level to define the time delay between task retries. This is a native Airflow feature that helps handle transient failures by automatically retrying failed tasks after a specified delay, reducing manual intervention.

Exam trap

The trap here is confusing infrastructure-level high availability (Option D) with application-level retry logic, leading candidates to select HA as a retry mechanism when it only ensures environment uptime, not task-level failure recovery.
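
To make the task-level retry behaviour concrete, here is a minimal plain-Python sketch. It does not import Airflow: `default_args` only mirrors the shape of Airflow's `retries` and `retry_delay` settings, while `run_with_retries` and `flaky_job` are hypothetical helpers that imitate what the scheduler does for a failed task.

```python
import time
from datetime import timedelta

# Shaped like Airflow's default_args; the values are illustrative.
default_args = {"retries": 3, "retry_delay": timedelta(minutes=5)}


def run_with_retries(task, retries, retry_delay, sleep=time.sleep):
    """Call task(); on failure wait retry_delay and retry, up to `retries` extra attempts."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the failure
            attempt += 1
            sleep(retry_delay.total_seconds())


calls = {"n": 0}


def flaky_job():
    """Hypothetical task that fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"


waits = []  # record the delays instead of actually sleeping
result = run_with_retries(flaky_job, sleep=waits.append, **default_args)
print(result, waits)
```

High availability of the Composer environment plays no part in this loop, which is the distinction the exam trap points at.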

Full explanation →

573

MCQmedium

An e-commerce company runs a daily batch pipeline that processes clickstream data from Cloud Storage using Cloud Dataproc with Spark. The pipeline includes a join between a large fact table and a small dimension table. The dimension table is stored in Cloud Storage as a CSV file. The join is slow due to shuffling. The data engineer considers broadcasting the dimension table. However, the dimension table is updated daily and the pipeline reads the latest version. What is the best approach to implement this optimization?

A.Use DataFrame.join with broadcast hint on the dimension DataFrame

B.Read the fact table and dimension table into separate DataFrames and use standard join

C.Read the dimension table as an RDD and collect as a map, then use map-side join

D.Increase the spark.sql.autoBroadcastJoinThreshold to a large value

AnswerA

Forces broadcast join regardless of table size.

Why this answer

Option A is correct because broadcasting the small dimension table using the broadcast hint (e.g., `broadcast(dimensionDF)`) forces Spark to replicate the dimension data to all executor nodes, eliminating the need for a shuffle during the join. This is ideal when the dimension table is small enough to fit in executor memory, and since the pipeline reads the latest CSV daily, the broadcast will automatically use the updated data without additional code changes.

Exam trap

The trap here is that candidates may think increasing `spark.sql.autoBroadcastJoinThreshold` is a safe global fix, but it can cause memory pressure and does not guarantee a broadcast join if the table size fluctuates, whereas the explicit broadcast hint provides deterministic behavior.

How to eliminate wrong answers

Option B is wrong because a standard join without any hint or optimization will trigger a full shuffle of both datasets, which is exactly the performance problem described. Option C is wrong because manually collecting the dimension table as an RDD and using a map-side join is an outdated, error-prone approach that bypasses Spark SQL's Catalyst optimizer and broadcast join optimizations; it also requires manual handling of updates and memory management. Option D is wrong because increasing `spark.sql.autoBroadcastJoinThreshold` globally may cause the dimension table to be broadcast automatically, but it does not guarantee the join uses a broadcast if the table size exceeds the threshold, and it can lead to out-of-memory errors if the threshold is set too high without considering executor memory limits.
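
What a broadcast join buys can be illustrated with a small plain-Python sketch. This is not Spark code and not a recommendation to hand-roll the join as in Option C; it only shows the mechanism that the broadcast hint asks Spark to apply. The function name and sample rows are hypothetical.

```python
def broadcast_hash_join(fact_rows, dim_rows, key):
    """Join each fact row to its dimension row without regrouping the fact rows."""
    # Build the lookup once from the small table; Spark would ship a copy
    # of this to every executor instead of shuffling the large table.
    dim_by_key = {row[key]: row for row in dim_rows}
    for fact in fact_rows:
        dim = dim_by_key.get(fact[key])
        if dim is not None:  # inner-join semantics: drop unmatched fact rows
            yield {**dim, **fact}


facts = [
    {"product_id": 1, "clicks": 5},
    {"product_id": 2, "clicks": 7},
    {"product_id": 9, "clicks": 1},  # no matching dimension row
]
dims = [{"product_id": 1, "name": "mug"}, {"product_id": 2, "name": "pen"}]

joined = list(broadcast_hash_join(facts, dims, "product_id"))
print(joined)
```

The large table is streamed once and never regrouped by key, which is the shuffle the question describes as the bottleneck.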

Full explanation →

574

MCQeasy

A data engineer tries to grant a service account read access to a Cloud Storage bucket using the IAM policy above. The service account still cannot read objects. What is the most likely reason?

A.The role does not include the necessary permission

B.The condition prevents access because the request time is after 2023

C.The service account is misspelled

D.The role should be roles/storage.admin

AnswerB

The condition expression requires request.time before 2023, which is likely no longer true.

Why this answer

Option B is correct because the IAM condition explicitly restricts access to requests made before January 1, 2023. Since the current time is after that date, the condition evaluates to false, denying the service account's read access regardless of the role binding. IAM conditions are evaluated at request time, and if the condition is not met, the permission is not granted.

Exam trap

Google Cloud often tests the subtlety that IAM conditions are evaluated at request time and can override a valid role binding, leading candidates to mistakenly focus on the role's permissions rather than the condition's effect.

How to eliminate wrong answers

Option A is wrong because roles/storage.objectViewer includes the storage.objects.get permission required to read objects, so the role does include the necessary permission. Option C is wrong because a misspelled service account would result in the role not being bound at all, but the question states the policy was applied, implying the service account name is correct. Option D is wrong because roles/storage.admin is an overly permissive role that includes many additional permissions beyond read access; the issue is not the role's permissions but the condition blocking access.
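
The effect of a time-based condition can be sketched as follows. The policy itself is not reproduced here, so the cutoff is an assumption based on the explanation above (a condition along the lines of `request.time < timestamp("2023-01-01T00:00:00Z")`); `binding_applies` is a hypothetical helper, not an IAM API.

```python
from datetime import datetime, timezone

# Assumed cutoff, mirroring a condition such as
# request.time < timestamp("2023-01-01T00:00:00Z"). The actual policy is not shown.
CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)


def binding_applies(request_time, cutoff=CUTOFF):
    """A conditional role binding grants access only while its condition is true."""
    return request_time < cutoff


before = binding_applies(datetime(2022, 6, 1, tzinfo=timezone.utc))
now = binding_applies(datetime.now(timezone.utc))
print(before, now)
```

The role and the member are unchanged between the two calls; only the request time differs, which is why inspecting the role's permissions does not reveal the problem.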

Full explanation →

575

MCQeasy

A company uses Cloud Monitoring to track application latency. They notice a spike in latency every 30 minutes. What is the best initial step to diagnose the issue?

A.Increase the number of instances to handle the load.

B.Enable Cloud Trace for all requests.

C.Check if scheduled jobs or cron tasks overlap.

D.Change the alert threshold to ignore the spikes.

AnswerC

Regularly recurring spikes suggest a scheduled job causing contention; investigating this is the most direct diagnostic step.

Why this answer

Recurring spikes at regular intervals often indicate a scheduled process (e.g., cron job, batch job) that runs every 30 minutes. Checking for overlapping scheduled jobs is the most efficient first step before scaling or other actions.

Full explanation →

576

MCQmedium

A data engineer needs to store raw sensor data in Cloud Storage and automatically transition it to a lower-cost storage class after 30 days, then delete it after 365 days. What should they configure?

A.Use Cloud Pub/Sub notifications to trigger a Cloud Function that moves objects.

B.Use gsutil rewrite command in a cron job.

C.Configure a lifecycle rule with SetStorageClass to Nearline after 30 days and Delete after 365 days.

D.Set a bucket retention policy with a retention period of 365 days.

AnswerC

Lifecycle rules can automatically transition objects to a different storage class and then delete them based on age.

Why this answer

Option C is correct because Cloud Storage lifecycle management rules allow you to automatically transition objects to a lower-cost storage class (such as Nearline) after a specified number of days and then delete them after another period. This is the native, serverless way to manage object lifecycle without external scripts or compute resources.

Exam trap

Google Cloud often tests the distinction between lifecycle management (which automates transitions and deletions) and retention policies (which only prevent deletion/overwrites), leading candidates to confuse the two.

How to eliminate wrong answers

Option A is wrong because Cloud Pub/Sub notifications and Cloud Functions introduce unnecessary complexity and cost; lifecycle rules handle this natively without custom code. Option B is wrong because using gsutil rewrite in a cron job is a manual, error-prone approach that does not scale and incurs additional egress/operation costs; lifecycle rules are the intended automated solution. Option D is wrong because a bucket retention policy prevents deletion before the retention period ends, but it does not automatically transition objects to a lower-cost storage class; it only enforces immutability.
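
As a rough sketch, the two rules from option C can be written as a lifecycle configuration. The snippet below builds the JSON document in Python; the ages come from the question, and Nearline is the storage class named in option C.

```python
import json

# Same shape as the JSON lifecycle configuration accepted by `gsutil lifecycle set`.
lifecycle_config = {
    "rule": [
        {
            # Move objects to Nearline once they are 30 days old.
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 30},
        },
        {
            # Delete objects once they are 365 days old.
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}

print(json.dumps(lifecycle_config, indent=2))
```

Saving the output to a file and applying it with `gsutil lifecycle set` against the bucket is one way to install the rules; no scheduler or function is involved afterwards.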

Full explanation →

577

Multi-Selectmedium

Which TWO statements are correct about designing a data pipeline using Cloud Dataflow for processing unbounded data?

Select 2 answers

A.Watermarks are used to measure the progress of event time.

B.Triggers can only emit results at the end of a window.

C.Dataflow guarantees exactly-once processing for streaming pipelines.

D.Cloud Pub/Sub is the recommended source for streaming pipelines.

E.Fixed windows are always based on processing time.

AnswersA, D

Watermarks track event time progress.

Why this answer

Watermarks in Cloud Dataflow measure the progress of event time, indicating when all data up to a certain timestamp is expected to have arrived. This allows the pipeline to handle late-arriving data and determine when to close windows for unbounded data streams.

Exam trap

Google Cloud often tests the misconception that triggers only fire at window boundaries, when in fact Dataflow supports early, on-time, and late firings for flexible result emission.

Full explanation →

578

MCQmedium

A Cloud Build pipeline is set up to train a model on Vertex AI. The build fails with the error: 'ERROR: (gcloud.ai-platform.jobs.submit.training) NOT_FOUND: The parent project does not exist.' The project ID and the service account are correctly configured. What is the most likely cause?

A.The region specified for the training job does not exist.

B.The training job requires a GPU, which is not available in the specified region.

C.The Cloud Build service account does not have the aiplatform.jobs.create permission on the project.

D.The training package is not uploaded to Cloud Storage before the pipeline runs.

AnswerC

Insufficient permissions can cause the project to appear as not found to the service account.

Why this answer

The error 'NOT_FOUND: The parent project does not exist' indicates that the Cloud Build service account lacks the necessary IAM permission to submit a training job to Vertex AI. Even though the project ID and service account are correctly configured, the Cloud Build service account must have the 'aiplatform.jobs.create' permission (or the 'Vertex AI User' role) on the project. Without this, the API call fails because the service account is not authorized to access the project resource.

Exam trap

Google Cloud often tests the misconception that a 'NOT_FOUND' error always means a missing resource (like a project ID or region), when in fact it can indicate an IAM permission issue where the service account is not authorized to see or use the project.

How to eliminate wrong answers

Option A is wrong because an invalid region would produce a different error, such as 'INVALID_ARGUMENT: Region not found' or 'PERMISSION_DENIED', not 'NOT_FOUND: The parent project does not exist'. Option B is wrong because GPU availability issues would result in a 'RESOURCE_EXHAUSTED' or 'ZONE_RESOURCE_POOL_EXHAUSTED' error, not a project-level not found error. Option D is wrong because a missing training package in Cloud Storage would cause a 'FILE_NOT_FOUND' or 'INVALID_ARGUMENT' error during job submission, not a project not found error.

Full explanation →

579

MCQeasy

You need to allow a data analyst to run queries on a BigQuery dataset but prevent them from modifying the data or deleting the dataset. Which IAM role should you grant?

A.roles/bigquery.dataOwner

B.roles/bigquery.dataViewer

C.roles/bigquery.jobUser

D.roles/bigquery.dataEditor

AnswerB

DataViewer grants read-only access to data and metadata, ideal for analysts.

Why this answer

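BigQuery Data Viewer (roles/bigquery.dataViewer) grants read-only access to datasets, tables, and their metadata, so the analyst cannot modify data or delete the dataset. To actually run query jobs, the analyst also needs the bigquery.jobs.create permission, typically granted through roles/bigquery.jobUser at the project level; on its own, jobUser (Option C) gives no access to the data. Of the roles listed, dataViewer is the one that provides read access without any write or delete permissions.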

Full explanation →

580

MCQmedium

A company has deployed a machine learning model to AI Platform Prediction. The model uses a custom container with a TensorFlow SavedModel. After deployment, the prediction latency is higher than expected. Which action is most likely to reduce latency without significantly impacting model accuracy?

A.Convert the model to TensorFlow Lite and use a smaller model.

B.Increase the number of prediction nodes in the AI Platform Prediction cluster.

C.Enable XLA (Accelerated Linear Algebra) compilation on model loading.

D.Apply quantization to the model weights to reduce size.

AnswerC

XLA compiles and optimizes the TensorFlow graph, often improving latency without affecting accuracy.

Why this answer

Option C is correct because enabling XLA (Accelerated Linear Algebra) compilation on model loading optimizes the TensorFlow computation graph by fusing operations and reducing runtime overhead, which directly lowers prediction latency without altering model weights or architecture. XLA works by compiling the graph into efficient machine code at load time, improving execution speed while preserving the original model accuracy.

Exam trap

Google Cloud often tests the distinction between latency reduction and throughput scaling, so candidates mistakenly choose increasing nodes (Option B) thinking it reduces latency, when it actually only improves concurrent request handling.

How to eliminate wrong answers

Option A is wrong because converting to TensorFlow Lite and using a smaller model typically reduces model size and latency but often at the cost of accuracy due to pruning or reduced precision, and it is not a latency optimization that preserves accuracy. Option B is wrong because increasing the number of prediction nodes scales horizontally to handle more concurrent requests but does not reduce the per-request latency; it addresses throughput, not the latency of a single prediction. Option D is wrong because applying quantization to model weights reduces model size and can improve latency but usually introduces a trade-off with accuracy, especially with post-training quantization, and the question specifies 'without significantly impacting model accuracy,' making XLA a safer choice.

Full explanation →

581

MCQmedium

A company needs to store petabytes of time-series IoT sensor data and query it with single-digit millisecond latency at millions of reads per second. The data has a simple key-value structure with timestamps. Which Google Cloud database is MOST appropriate?

A.Cloud Bigtable

B.BigQuery

C.Cloud Spanner

D.Firestore

AnswerA

Bigtable is the correct choice: wide-column NoSQL, designed for time-series and IoT workloads, single-digit ms latency, and scales to millions of QPS with additional nodes.

Why this answer

Cloud Bigtable is a fully managed, scalable NoSQL database designed for large analytical and operational workloads, handling petabytes of data with consistent sub-10ms latency at millions of reads per second. Its key-value storage model and automatic sharding make it ideal for time-series IoT sensor data with simple timestamp-based keys, supporting high-throughput, low-latency access without the overhead of relational features.

Exam trap

Google Cloud often tests the distinction between operational databases (Bigtable) and analytical warehouses (BigQuery), so the trap here is assuming that 'petabytes of data' automatically means BigQuery, ignoring the critical requirement for single-digit millisecond latency at millions of reads per second.

How to eliminate wrong answers

Option B (BigQuery) is wrong because it is a serverless data warehouse optimized for analytical SQL queries on large datasets, not for single-digit millisecond point reads at millions of operations per second; its latency is typically in the seconds range for interactive queries. Option C (Cloud Spanner) is wrong because it is a globally distributed relational database with strong consistency and ACID transactions, which introduces overhead unsuitable for the simple key-value time-series pattern and cannot match Bigtable's throughput for millions of reads per second. Option D (Firestore) is wrong because it is a document database aimed at mobile and web applications; it is not designed for petabyte-scale time-series data or for single-digit millisecond latency at millions of reads per second.
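
Because Bigtable stores rows sorted lexicographically by row key, the key layout largely decides how well a time-series workload performs. The sketch below shows one commonly used pattern, stated here as an assumption rather than something the question prescribes: lead with the sensor ID so writes spread across devices, then append a zero-padded timestamp so each sensor's rows sort chronologically. The function name and sample values are hypothetical.

```python
def row_key(sensor_id, epoch_millis):
    """Build 'sensor#timestamp' with a zero-padded timestamp so keys sort by time."""
    return f"{sensor_id}#{epoch_millis:013d}"


# Keys created out of order still sort chronologically per sensor.
keys = [row_key("sensor-42", t) for t in (1700000060000, 1700000000000, 999)]
print(sorted(keys))
```

With this layout, a range scan over one sensor's key prefix returns its readings in time order, which matches the simple key-value access pattern in the question.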

Full explanation →

582

Multi-Selectmedium

Which THREE Google Cloud services are typically used together in a production ML pipeline?

Select 3 answers

A.Cloud Storage

B.Cloud Functions

C.Vertex AI Training

D.Vertex AI Prediction

E.BigQuery

AnswersA, C, D

For storing training data, model artifacts, etc.

Why this answer

Cloud Storage is correct because it serves as the central artifact repository in a production ML pipeline on Google Cloud. It stores training data, model artifacts, and prediction inputs/outputs, enabling seamless integration with Vertex AI Training for model training and Vertex AI Prediction for serving. Without Cloud Storage, there is no durable, scalable, and cost-effective way to manage the large datasets and model binaries required for production ML workflows.

Exam trap

The trap here is that candidates confuse 'services used in an ML pipeline' with 'services that can be used somewhere in ML' — Cloud Functions and BigQuery are often used in ML workflows (e.g., triggering retraining or storing features), but they are not the three core services that are typically used together in a production ML pipeline for training, storing artifacts, and serving predictions.

Full explanation →

583

MCQmedium

A data engineer needs to ingest daily Salesforce reports into BigQuery without writing custom code. The reports are exported to an Amazon S3 bucket on a schedule. Which service should they use to automate the transfer?

A.Cloud Dataproc

B.BigQuery Data Transfer Service

C.Cloud Composer

D.Cloud Storage Transfer Service

AnswerB

Supports Amazon S3 as a source for scheduled transfers directly into BigQuery.

Why this answer

The BigQuery Data Transfer Service (BQDTS) is the correct choice because it includes an Amazon S3 connector that loads files from an S3 bucket into BigQuery on a user-defined schedule, without requiring any custom code. Because the Salesforce reports already land in S3, a scheduled S3 transfer is all that is needed to bring them into BigQuery tables.

Exam trap

The trap here is that candidates often confuse Storage Transfer Service (which only moves files between storage systems, such as S3 to Cloud Storage) with BigQuery Data Transfer Service (which loads data from sources such as Amazon S3 directly into BigQuery), leading them to pick option D when the requirement is for a no-code, direct-to-BigQuery solution.

How to eliminate wrong answers

Option A is wrong because Cloud Dataproc is a managed Spark/Hadoop service for running big data processing jobs, not a no-code data ingestion tool; it would require writing custom code to read the exported files from S3 and load them into BigQuery. Option C is wrong because Cloud Composer is a managed Apache Airflow service for orchestrating workflows; while it could be used to build a custom pipeline, it requires writing DAGs and code, which contradicts the 'without writing custom code' requirement. Option D is wrong because Storage Transfer Service is designed for moving data between storage systems (e.g., S3 to Cloud Storage) and does not load data into BigQuery; a separate load step would still be needed.

Full explanation →

584

MCQmedium

A Dataflow pipeline reads log files from Cloud Storage, parses them into LogEvent objects, and writes to BigQuery. The pipeline fails with the above errors. What is the most likely cause?

A.The LogEvent class does not have a no-argument constructor.

B.The pipeline is missing required import statements for LogEvent.

C.The BigQuery table schema does not match the LogEvent fields.

D.The log files are not in the expected format, causing parsing failures.

AnswerA

Beam coders that instantiate objects reflectively, such as AvroCoder, require a no-arg constructor.

Why this answer

Custom types used as PCollection elements (like LogEvent) must be encoded and decoded as they move between workers. Coders that instantiate objects reflectively, such as AvroCoder, need a no-argument constructor to rebuild each element; without one, the pipeline fails at runtime with a coder or serialization error, especially when using the Dataflow runner.

Exam trap

The trap here is that candidates confuse runtime serialization errors with compile-time import issues or schema mismatches, overlooking the fundamental requirement for a no-argument constructor in Beam's default coders.

How to eliminate wrong answers

Option B is wrong because missing import statements would cause a compile-time error, not a runtime pipeline failure with the described errors. Option C is wrong because a BigQuery table schema mismatch would produce a write-time error (e.g., schema mismatch), not a serialization failure during parsing. Option D is wrong because parsing failures from malformed log files would result in exceptions during the parse step, not a serialization error related to the LogEvent class itself.

Full explanation →

585

MCQhard

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. The pipeline must handle late-arriving data (up to 1 hour) and group events into 10-minute windows. Which configuration is correct?

A.Use global windows with a trigger that fires every 10 minutes

B.Use sliding windows of 10 minutes with a 5-minute period and allowed lateness of 1 hour

C.Use fixed windows of 10 minutes with allowed lateness of 0 seconds

D.Use fixed windows of 10 minutes with allowed lateness of 1 hour and a trigger that fires after watermark plus early firings

AnswerD

This allows late data up to 1 hour and provides timely results.

Why this answer

To handle late data, you need to set the allowed lateness to 1 hour. The trigger with AfterWatermark with early firings ensures results are emitted on time and updated when late data arrives.
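
The interplay of fixed windows and allowed lateness can be sketched without the Beam SDK. The helpers below are hypothetical and simplified (timestamps are plain seconds, and the boundary comparison is an approximation of Beam's exact rule); they only show how an event is assigned to a 10-minute window and when a late element would still be accepted.

```python
WINDOW_SIZE = 10 * 60        # 10-minute fixed windows, in seconds
ALLOWED_LATENESS = 60 * 60   # accept data up to 1 hour late


def assign_window(event_ts):
    """Return the [start, end) fixed window containing an event timestamp."""
    start = event_ts - (event_ts % WINDOW_SIZE)
    return start, start + WINDOW_SIZE


def is_accepted(event_ts, watermark):
    """Keep an element while the watermark has not passed window end + allowed lateness."""
    _, end = assign_window(event_ts)
    return watermark <= end + ALLOWED_LATENESS


# An event at t=1250s belongs to the window [1200, 1800).
print(assign_window(1250))
# It is still accepted 50 minutes after the window closed, but not 70 minutes after.
print(is_accepted(1250, watermark=1800 + 50 * 60))
print(is_accepted(1250, watermark=1800 + 70 * 60))
```

With allowed lateness set to 0 seconds, as in option C, the second call would already return False, so events arriving after the watermark would be dropped.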

Full explanation →

586

MCQmedium

A data team runs regular analytical queries on a BigQuery table that stores 2 years of sales data (approximately 10 TB). Queries frequently filter on a `sale_date` column and also group by `product_id`. To optimize cost and performance, which design approach is most effective?

A.Do not partition; only cluster by `sale_date`.

B.Partition by `sale_date` and set a table expiration of 90 days.

C.Partition the table by `sale_date` and cluster by `product_id`.

D.Partition by `product_id` and cluster by `sale_date`.

AnswerC

Partitioning by date enables partition elimination on date filters; clustering by product_id co-locates rows with the same product_id within each partition, improving GROUP BY performance.

Why this answer

Option C is correct because partitioning by `sale_date` allows BigQuery to perform partition pruning, eliminating scans of irrelevant date ranges, while clustering by `product_id` physically co-locates rows with the same product ID within each partition. This combination minimizes the data scanned for queries that filter on `sale_date` and group by `product_id`, directly reducing both cost (bytes billed) and query latency.

Exam trap

Google Cloud often tests the misconception that partitioning by a high-cardinality column like `product_id` is acceptable, but the trap here is that BigQuery enforces a hard limit of 4,000 partitions per table, making such a design infeasible and forcing candidates to recognize that clustering is the correct mechanism for high-cardinality grouping columns.

How to eliminate wrong answers

Option A is wrong because without partitioning, BigQuery must scan the entire 10 TB table even for queries filtering on a narrow date range, leading to unnecessarily high costs and slower performance. Option B is wrong because setting a table expiration of 90 days would delete historical data needed for 2-year analysis, and partitioning alone without clustering does not optimize the GROUP BY on `product_id` within each partition. Option D is wrong because partitioning by `product_id` (a high-cardinality column) would create millions of tiny partitions, exceeding BigQuery's partition limit (4,000 partitions per table) and causing poor performance and management overhead.
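
A back-of-the-envelope estimate shows why partition pruning matters at this scale. The sketch assumes the 10 TB is spread evenly over two years of daily partitions and ignores column pruning and clustering, so the numbers are illustrative only; `bytes_scanned` is a hypothetical helper.

```python
TABLE_BYTES = 10 * 10**12   # ~10 TB, from the question
DAYS_STORED = 2 * 365       # 2 years of daily partitions


def bytes_scanned(days_in_filter, partitioned):
    """Estimate bytes read for a date-range filter, assuming evenly sized days."""
    if not partitioned:
        return TABLE_BYTES  # no partitioning or clustering: the whole table is read
    return TABLE_BYTES * days_in_filter // DAYS_STORED


week = bytes_scanned(7, partitioned=True)
print(f"7-day filter: {week / 10**9:.1f} GB vs {TABLE_BYTES / 10**12:.0f} TB unpartitioned")
```

Under these assumptions a one-week query reads roughly 96 GB instead of 10 TB, and clustering by `product_id` can reduce the work further within each partition.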

Full explanation →

587

MCQmedium

A company wants to use Cloud Data Fusion for ETL pipelines. They need to integrate with custom transformations not available in the marketplace. What should they do?

A.Switch to Dataproc and write a Spark job.

B.Use the Data Fusion Hub to download a custom plugin.

C.Use Dataprep to create the transformation.

D.Write a custom plugin using the CDAP SDK and deploy it.

AnswerD

The CDAP SDK allows building custom plugins.

Why this answer

Cloud Data Fusion supports custom plugins using the CDAP SDK.

Full explanation →

588

MCQeasy

You need to ingest Google Ads performance data into BigQuery on a daily basis for reporting. Which service should you use?

A.BigQuery Data Transfer Service for Google Ads

B.Cloud Scheduler to call Google Ads API and load to BigQuery

C.Pub/Sub with a Google Ads subscriber

D.Storage Transfer Service for Google Ads

AnswerA

This service is specifically designed to import data from Google Ads into BigQuery on a scheduled basis.

Why this answer

The BigQuery Data Transfer Service for Google Ads is the correct choice because it provides a fully managed, scheduled connector that automatically ingests Google Ads performance data into BigQuery on a daily basis without requiring any custom code. It handles authentication, schema mapping, and incremental loads, making it the simplest and most reliable solution for this specific use case.

Exam trap

Google Cloud often tests the distinction between fully managed services (like BigQuery Data Transfer Service) and generic infrastructure components (like Cloud Scheduler or Pub/Sub) that require custom development, leading candidates to overcomplicate the solution by choosing a more flexible but less appropriate option.

How to eliminate wrong answers

Option B is wrong because Cloud Scheduler is a cron job service that can trigger HTTP requests, but it does not natively integrate with the Google Ads API or handle the complex authentication, pagination, and schema mapping required to load data into BigQuery; you would still need to build and maintain a custom application. Option C is wrong because Pub/Sub is a messaging service for asynchronous event streaming, not a batch ingestion tool; while you could theoretically publish Google Ads data to Pub/Sub, there is no native Google Ads subscriber, and you would need to build a custom subscriber to write to BigQuery, which is far more complex than using the dedicated transfer service. Option D is wrong because Storage Transfer Service is designed for moving data from on-premises or cloud storage (like S3 or HTTP endpoints) into Google Cloud Storage, not for directly ingesting data from Google Ads into BigQuery.

589

Multi-Selecthard

A company is designing a data lake on Cloud Storage for analytics. They need to store data in various formats (Avro, Parquet, CSV) and enable efficient querying with BigQuery and Dataproc. Which THREE practices should they follow?

Select 3 answers

A.Use BigLake to create BigQuery tables that reference Cloud Storage data.

B.Store data in columnar formats like Parquet for analytics workloads.

C.Disable encryption on the bucket to improve read performance.

D.Partition data by date in a logical folder structure (e.g., /data/yyyy/mm/dd).

E.Store all data in CSV format for simplicity.

AnswersA, B, D

Enables querying data without loading.

Why this answer

BigLake allows you to create BigQuery tables that reference data stored in Cloud Storage, enabling unified governance and fine-grained access control without moving data. This is essential for a data lake architecture where BigQuery and Dataproc need to query the same underlying data in various formats like Avro, Parquet, and CSV.

Exam trap

Google Cloud often tests the misconception that disabling encryption improves performance, but Cloud Storage encryption is transparent and has no measurable impact on read throughput, so candidates should recognize that security controls are non-negotiable in cloud data lakes.
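The date-based layout in option D can be sketched as a small helper that builds object prefixes. The `data` root, the function name, and the file name are illustrative assumptions, not anything prescribed by the question.

```python
from datetime import date


def partition_prefix(day: date, root: str = "data") -> str:
    """Return a yyyy/mm/dd object prefix for the given day.

    Zero-padded components keep prefixes lexicographically sortable,
    so listing by prefix returns days in chronological order.
    """
    return f"{root}/{day:%Y/%m/%d}/"


# Hypothetical usage: where a Parquet file for 5 March 2024 would land.
print(partition_prefix(date(2024, 3, 5)) + "events.parquet")
```

A consistent prefix like this lets a query engine or a Dataproc job read only the days it needs by listing a single prefix.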

590

MCQeasy

A company uses Cloud Functions to process events from Cloud Storage. They notice that occasionally functions are not triggered. What should they check first to ensure solution quality?

A.Verify that the Cloud Storage bucket has notifications configured for the correct event type.

B.Check the logs for function execution.

C.Increase the function memory allocation.

D.Increase the function timeout.

AnswerA

A misconfigured notification will prevent the function from being triggered at all.

Why this answer

The most common reason for Cloud Functions not being triggered by Cloud Storage events is that the bucket's notification configuration is missing or misconfigured. Cloud Functions relies on Pub/Sub notifications from the bucket to invoke the function; if the notification is not set for the correct event type (e.g., `OBJECT_FINALIZE`), the function will never be triggered. Therefore, verifying the notification configuration is the first and most direct diagnostic step.

Exam trap

Google Cloud often tests the misconception that performance tuning (memory or timeout) is the first step to fix trigger issues, when the root cause is almost always a missing or misconfigured event notification.

How to eliminate wrong answers

Option B is wrong because checking logs for function execution assumes the function was invoked, but if the trigger is not firing, there will be no execution logs to review. Option C is wrong because increasing memory allocation addresses performance issues like out-of-memory errors, not trigger failures. Option D is wrong because increasing the timeout addresses function execution duration limits, not the absence of invocation.

591

MCQeasy

An organization wants to automate their batch data processing pipeline using Cloud Composer. The pipeline consists of multiple tasks: extract from Cloud Storage, transform with Dataflow, and load into BigQuery. Which Airflow operator should be used to run Dataflow jobs?

A.BigQueryInsertJobOperator

B.DataflowCreatePythonJobOperator

C.GCSToBigQueryOperator

D.DataprocSubmitJobOperator

AnswerB

This operator submits a Dataflow job written in Python.

Why this answer

B is correct because the DataflowCreatePythonJobOperator is specifically designed to submit and manage Apache Beam pipelines written in Python as Dataflow jobs in Google Cloud. This operator handles the creation of a Dataflow job from a Python file, which aligns with the requirement to run Dataflow transformations within a Cloud Composer DAG.

Exam trap

Google Cloud often tests the distinction between Dataflow and Dataproc operators, so the trap here is that candidates might confuse DataprocSubmitJobOperator (for Hadoop/Spark) with Dataflow operators, especially when the question mentions 'transform' without specifying the processing framework.

How to eliminate wrong answers

Option A is wrong because BigQueryInsertJobOperator is used to run BigQuery jobs (e.g., queries, load jobs), not to submit Dataflow pipelines. Option C is wrong because GCSToBigQueryOperator loads data directly from Cloud Storage to BigQuery without using Dataflow for transformation, bypassing the required transform step. Option D is wrong because DataprocSubmitJobOperator submits jobs to Dataproc (Hadoop/Spark clusters), not to Dataflow, which is a different processing service.

592

MCQeasy

A data engineer needs to process streaming data from thousands of IoT devices and generate real-time dashboards. The data volume is low but requires exactly-once processing semantics. Which Google Cloud service combination should they use?

A.Cloud Pub/Sub + Cloud Data Fusion

B.Cloud Pub/Sub + Cloud Dataproc

C.Cloud Pub/Sub + Cloud Dataflow

D.Cloud Pub/Sub + Cloud Dataprep

AnswerC

Cloud Pub/Sub for ingestion with at-least-once delivery, combined with Cloud Dataflow which provides exactly-once processing via its streaming engine.

Why this answer

Dataflow supports exactly-once processing via its streaming engine and checkpointing. Pub/Sub is the ingest service. Together they provide the required semantics.
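The gap between at-least-once delivery and exactly-once processing can be illustrated with a toy deduplication step. This is only a sketch of the general idea, not Dataflow's actual mechanism; the message shape and the `deduplicate` helper are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Message = Tuple[str, dict]  # (message_id, payload); a hypothetical shape


def deduplicate(messages: Iterable[Message]) -> Iterator[Message]:
    """Yield each message once, keyed by its ID, dropping redeliveries."""
    seen = set()
    for message_id, payload in messages:
        if message_id in seen:
            continue  # redelivered copy: already processed
        seen.add(message_id)
        yield message_id, payload


# At-least-once delivery: message "m2" arrives twice.
delivered = [("m1", {"temp": 20}), ("m2", {"temp": 21}), ("m2", {"temp": 21})]
processed = list(deduplicate(delivered))
print(len(processed))
```

The in-memory set here grows without bound; a managed service has to bound and checkpoint that state, which is part of what the streaming engine takes off the developer's hands.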

593

Multi-Selectmedium

A data scientist wants to use Vertex AI Workbench for exploratory data analysis. Which TWO statements are true about Vertex AI Workbench?

Select 2 answers

A.It is a serverless service that scales to zero when not in use.

B.It supports custom container images for the notebook environment.

C.It can only be used with TensorFlow.

D.It provides a managed JupyterLab environment with pre-installed ML libraries.

E.It includes a built-in SQL query editor for BigQuery.

AnswersB, D

Correct: supports custom containers.

Why this answer

Vertex AI Workbench provides managed Jupyter notebooks with pre-installed ML frameworks. It integrates with BigQuery via the Python client. It does not provide a SQL editor; that's BigQuery.

It supports custom containers. It is not serverless; it runs on Compute Engine VMs.

594

MCQmedium

A data team wants to load millions of small JSON files (each <1 MB) from GCS into BigQuery daily with the lowest cost and fastest performance. They need exactly-once semantics and the ability to detect new files automatically. Which approach is most suitable?

A.Use Dataflow to read files from GCS, combine them, and write to BigQuery using the Storage Write API in exactly-once mode

B.Use Cloud Functions to trigger on new files and stream each row via the legacy streaming inserts API

C.Use Storage Transfer Service to copy files to a staging bucket, then run a scheduled query to load them

D.Use BigQuery batch load jobs with a wildcard URI to load all files directly

AnswerA

Dataflow can efficiently combine small files and write to BigQuery with exactly-once semantics using the Storage Write API.

Why this answer

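BigQuery batch load jobs with a wildcard URI can load many files in one job, but on their own they neither detect new files automatically nor stop a file from being loaded twice across runs, and millions of very small files add per-file overhead. Combining the files before writing avoids that overhead.

Dataflow can watch the GCS location for new files, combine their contents, and write to BigQuery with exactly-once semantics through the Storage Write API, which supports both streaming and batch-style writes. The legacy streaming inserts API offers only best-effort deduplication and is billed for the data ingested. Storage Transfer Service copies data between storage systems; it does not load data into BigQuery.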

595

MCQhard

A company runs a data pipeline that ingests clickstream events from multiple websites into Cloud Pub/Sub; Dataflow then processes the events to generate user sessions and writes the results to BigQuery for analytics. The pipeline runs 24/7. Recently, the team noticed that some sessions are incomplete due to missing events, and data quality checks reveal that about 2% of sessions have gaps of more than 30 minutes. The pipeline uses fixed 30-minute windows for sessionization, with allowed lateness set to 10 minutes. They have Cloud Monitoring dashboards tracking system throughput and pipeline lag but do not have custom metrics tracking per-element delays or watermark progress. The team suspects two possible causes: (a) the Pub/Sub subscription accumulates backlog and some messages are delivered after the window end; (b) the Dataflow job has insufficient workers causing checkpoint failures. The team needs to determine the root cause and improve data quality. What is the best first course of action?

A.Change the Pub/Sub subscription to pull mode with more aggressive flow control settings.

B.Increase the number of Dataflow workers and set autoscaling to the maximum allowed.

C.Modify the Dataflow pipeline to use session windows instead of fixed windows, and increase allowed lateness to 60 minutes.

D.Set up a Dataflow monitoring dashboard that tracks the watermark delay and create an alert when it exceeds the allowed lateness.

AnswerD

This directly monitors the pipeline's ability to process events within the window, confirming if late data is the root cause.

Why this answer

To determine whether late-arriving messages are the issue, the team should monitor the Dataflow watermark delay, which indicates how far behind the pipeline is compared to the event time. Setting up a metric and alert on watermark delay > allowed lateness will confirm if late data is being dropped.
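The alert condition described above (watermark delay greater than allowed lateness) can be sketched in a few lines. The function names and the sample timestamps are hypothetical; in practice the watermark value would come from the pipeline's monitoring metrics rather than being passed in by hand.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=10)  # value taken from the question


def watermark_lag(now: datetime, watermark: datetime) -> timedelta:
    """How far the watermark trails the current wall-clock time."""
    return now - watermark


def should_alert(now: datetime, watermark: datetime,
                 allowed_lateness: timedelta = ALLOWED_LATENESS) -> bool:
    """True when the lag is larger than the allowed lateness."""
    return watermark_lag(now, watermark) > allowed_lateness


# Hypothetical samples: a healthy pipeline and one that has fallen behind.
now = datetime(2024, 1, 1, 12, 0)
print(should_alert(now, datetime(2024, 1, 1, 11, 55)))
print(should_alert(now, datetime(2024, 1, 1, 11, 45)))
```

If the alert fires around the times the incomplete sessions were produced, late delivery is the likely cause; if it never fires, the team can turn to the worker-capacity hypothesis.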

596

MCQmedium

A BigQuery table contains streaming data from Cloud Pub/Sub. The table is partitioned by ingestion time. A user runs a query that accesses data from the last 5 minutes and gets correct results. After 90 minutes, the user runs the same query again but notices that some rows are missing. What is the most likely cause?

A.The query is using time travel to a snapshot before the streaming buffer was committed

B.The query is using cached results that exclude recent data

C.The schema of the table was modified after the initial query

D.The table has a partition expiration of 30 days

AnswerA

Time travel queries return data from a snapshot; if the snapshot is before the buffer is flushed, recent data is missing.

Why this answer

Option A is correct because BigQuery's streaming buffer provides low-latency access to recently ingested data, but this data is not immediately committed to managed storage. After the streaming buffer is flushed (typically within 90 minutes), the data becomes available in the table's base storage. If the user runs a query using time travel (e.g., `FOR SYSTEM_TIME AS OF`) to a snapshot taken before the buffer was committed, the query will only see data that was in managed storage at that snapshot time, missing rows that were still in the streaming buffer at that point.

Exam trap

Google Cloud often tests the misconception that cached results or schema changes are responsible for data inconsistencies, when the real issue is the separation between BigQuery's streaming buffer and managed storage, and how time travel queries only see committed data.

How to eliminate wrong answers

Option B is wrong because BigQuery caches query results only for identical queries within a 24-hour period, but the user ran the same query after 90 minutes; if cached results were used, they would include the same rows as the initial query, not missing rows. Option C is wrong because schema modifications do not cause rows to disappear from query results; they may affect column access or data types but do not remove existing rows. Option D is wrong because a partition expiration of 30 days would only remove partitions older than 30 days, not affect data from the last 5 minutes or 90 minutes.

597

MCQhard

A team is using BigQuery to analyze petabyte-scale data. They notice that queries are slow and expensive due to full table scans. They have already partitioned by date. What additional optimization should they implement?

A.Use materialized views

B.Cluster by frequently filtered columns

C.Convert to native tables

D.Use query caching

AnswerB

Clustering reduces bytes read when filtering on those columns.

Why this answer

Clustering by frequently filtered columns (option B) organizes data within each partition based on the sort order of those columns. This allows BigQuery to prune blocks during query execution, significantly reducing the amount of data scanned and improving both performance and cost. Since the table is already partitioned by date, clustering adds a secondary ordering that targets the most common filter predicates, avoiding full table scans within each partition.

Exam trap

Google Cloud often tests the distinction between partitioning and clustering, where candidates mistakenly believe partitioning alone is sufficient for all filtering scenarios, but clustering is required to avoid full scans on non-date columns.

How to eliminate wrong answers

Option A is wrong because materialized views precompute and store query results, which can speed up repeated aggregations but do not reduce the scan cost of ad-hoc filters on raw data; they are not a substitute for physical data organization like clustering. Option C is wrong because BigQuery tables are already native (managed) tables; converting to native tables is not a valid operation and does not address scan efficiency. Option D is wrong because query caching only returns results for identical queries run within 24 hours, but it does not reduce the scan cost or improve performance for new or slightly different queries that still trigger full table scans.
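As a rough illustration of adding clustering to a date-partitioned table, the helper below builds a `CREATE TABLE ... PARTITION BY ... CLUSTER BY ... AS SELECT` statement. The table names, column names, and the helper itself are hypothetical, and the sketch assumes the partitioning column is of type DATE.

```python
from typing import Sequence

MAX_CLUSTER_COLUMNS = 4  # BigQuery accepts at most four clustering columns


def clustered_copy_ddl(source: str, target: str, date_column: str,
                       cluster_columns: Sequence[str]) -> str:
    """Build DDL that copies `source` into a partitioned, clustered table."""
    if not 1 <= len(cluster_columns) <= MAX_CLUSTER_COLUMNS:
        raise ValueError("expected between 1 and 4 clustering columns")
    cluster_list = ", ".join(cluster_columns)
    return (
        f"CREATE TABLE `{target}`\n"
        f"PARTITION BY {date_column}\n"
        f"CLUSTER BY {cluster_list}\n"
        f"AS SELECT * FROM `{source}`"
    )


# Hypothetical names; put the most frequently filtered column first.
ddl = clustered_copy_ddl(
    "proj.ds.events", "proj.ds.events_clustered",
    "event_date", ["customer_id", "country"],
)
print(ddl)
```

Column order matters: filters on the leading clustering column prune the most blocks, so the most common predicate should come first.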

598

MCQhard

Your Vertex AI model deployed on an endpoint is experiencing high tail latency during online predictions. The model uses a large embedding layer, and the input size varies. You have enabled automatic scaling with a minimum of 2 replicas and maximum of 10. What is the most likely cause of the latency spikes and the best first step to diagnose?

A.The model's SavedModel is too large due to the embedding layer; reduce embedding dimensions to lower latency.

B.The endpoint's target CPU utilization might be set too low, causing rapid scale-down and cold starts. Check Cloud Logging for scaling events.

C.The model uses a custom prediction routine that is not optimized; use tf.function to improve performance.

D.Enable model monitoring for online prediction and add a buffer to the endpoint's machine type.

AnswerB

If target utilization is low, replicas scale down quickly; cold starts on new requests cause latency. Logs show scaling.

Why this answer

High tail latency with variable input sizes and a large embedding layer often points to cold starts from aggressive scaling. When the target CPU utilization is set too low, the endpoint scales down quickly during lulls, and a subsequent burst of requests forces new replicas to spin up, causing latency spikes. Checking Cloud Logging for scaling events is the best first step because it directly reveals whether the endpoint is scaling down and then experiencing cold starts.

Exam trap

Google Cloud often tests the misconception that high tail latency is always due to model size or inference optimization, when in fact the most common cause in managed serving environments is autoscaling misconfiguration leading to cold starts.

How to eliminate wrong answers

Option A is wrong because reducing embedding dimensions would lower model accuracy and does not address the root cause of latency spikes from scaling dynamics; the model size is not the primary driver of tail latency in this scenario. Option C is wrong because while a custom prediction routine could be suboptimal, the question describes a standard model with a large embedding layer and variable input size, and the latency pattern (spikes) is more characteristic of cold starts than of per-request optimization issues; tf.function would help steady-state performance but not sudden spikes. Option D is wrong because model monitoring detects drift or anomalies but does not diagnose scaling-related latency, and adding a buffer to the machine type (e.g., increasing memory) does not fix the scaling policy that causes cold starts.

599

MCQmedium

A data engineer notices that BigQuery queries are slower than expected. They want to identify the most expensive stages in the query execution. Which tool or command should they use?

A.Use bq show to view job statistics

B.Use bq query --format=prettyjson and look at statistics

C.Use Cloud Monitoring to view query execution graphs

D.Use EXPLAIN statement in BigQuery

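AnswerA

A finished job's statistics include the query plan, with timing and record counts for each stage.

Why this answer

BigQuery has no EXPLAIN statement. The query plan is attached to the job's statistics once the query has run, so `bq show --format=prettyjson -j <job_id>` returns every stage with its wait, read, compute, and write timings and its slot time. The same information appears in the console under Execution details. Option B is wrong because `bq query` returns the query results, not the plan, and option C is wrong because Cloud Monitoring reports aggregate metrics such as slot usage, not per-stage plans.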
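Per-stage statistics for a finished query are recorded in the job's metadata under `statistics.query.queryPlan`. The sketch below ranks stages by slot time, assuming the metadata has been saved as JSON (for example from `bq show --format=prettyjson -j <job_id>`). The sample values and the helper name are made up for illustration; note that the API encodes 64-bit integers as strings.

```python
import json

# Hypothetical, heavily trimmed job metadata for a finished query job.
SAMPLE_JOB_JSON = """
{
  "statistics": {
    "query": {
      "queryPlan": [
        {"name": "S00: Input", "slotMs": "1200", "recordsRead": "5000000"},
        {"name": "S01: Join+", "slotMs": "98000", "recordsRead": "5000000"},
        {"name": "S02: Output", "slotMs": "300", "recordsRead": "1000"}
      ]
    }
  }
}
"""


def most_expensive_stages(job: dict, top_n: int = 3) -> list:
    """Return (stage name, slot-milliseconds) pairs, costliest first."""
    stages = job.get("statistics", {}).get("query", {}).get("queryPlan", [])
    ranked = sorted(stages, key=lambda s: int(s.get("slotMs", 0)), reverse=True)
    return [(s["name"], int(s.get("slotMs", 0))) for s in ranked[:top_n]]


job = json.loads(SAMPLE_JOB_JSON)
for name, slot_ms in most_expensive_stages(job, top_n=2):
    print(f"{name}: {slot_ms} slot-ms")
```

Slot time is only one way to rank stages; the same approach works with the per-stage timing fields if wall-clock bottlenecks matter more than total compute.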

600

Multi-Selecteasy

Which TWO approaches are recommended for handling late-arriving data in a streaming Dataflow pipeline?

Select 2 answers

A.Use side inputs to provide default values for late data.

B.Use fixed windows with a duration of 1 second to minimize lateness.

C.Configure allowed lateness on the window to accept late data.

D.Set the trigger to fire only at the end of the window.

E.Use a filter transform to drop late-arriving elements.

AnswersA, C

Side inputs can supply missing data.

Why this answer

Option A is correct because side inputs in Apache Beam (the programming model underlying Dataflow) allow you to provide default values or supplementary data to handle late-arriving elements gracefully. When a late element arrives after the window has been emitted, a side input can supply a fallback value, ensuring the pipeline can still process the data without discarding it. This approach is recommended for handling late data in streaming pipelines where completeness is not critical.

Exam trap

Google Cloud often tests the misconception that simply using small windows or dropping late data is a valid handling strategy, when in fact the recommended approaches involve configuring allowed lateness and using side inputs for graceful fallback.
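The effect of allowed lateness (option C) can be sketched with a simplified model: an element is on time if the watermark has not yet passed its window's end when it arrives, late but accepted if the watermark is within the allowed lateness, and dropped otherwise. This is an approximation of the Beam model for illustration only; the function name and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def classify_arrival(window_end: datetime,
                     watermark_at_arrival: datetime,
                     allowed_lateness: timedelta) -> str:
    """Classify an element by where the watermark stood when it arrived."""
    if watermark_at_arrival <= window_end:
        return "on_time"
    if watermark_at_arrival <= window_end + allowed_lateness:
        return "late_accepted"  # still admitted to the window
    return "dropped"  # the window is closed to new elements


window_end = datetime(2024, 1, 1, 12, 0)
lateness = timedelta(minutes=10)
for minutes_past in (-1, 5, 15):
    watermark = window_end + timedelta(minutes=minutes_past)
    print(minutes_past, classify_arrival(window_end, watermark, lateness))
```

With allowed lateness left at zero, the middle case disappears and every element that arrives behind the watermark is dropped.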
