Google Professional Data Engineer PDE Questions 676–750 | Page 10/14

676

MCQ | hard

A data scientist is training a binary classification model on an imbalanced dataset (95% negative, 5% positive) using AutoML Tables. Which strategy should they use to handle the class imbalance?

A. Set the budget to a higher value to allow more training on minority class.

B. Use SMOTE in a Dataflow pipeline before importing the data to AutoML Tables.

C. Specify a weight column with higher weights for positive examples in the dataset.

D. Create duplicate copies of the positive class rows to balance the dataset.

Answer: C

AutoML Tables supports a weight column to give more importance to the minority class.

Why this answer

AutoML Tables automatically handles class imbalance by applying class weights and downsampling. Users can also specify a weight column explicitly.
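One way to populate such a weight column is inverse-frequency weighting, sketched below in plain Python. This is an illustrative assumption rather than AutoML Tables' own method; the column names `label` and `weight` and the helper functions are hypothetical.

```python
from collections import Counter


def class_weights(labels):
    """Return one weight per class, inversely proportional to class frequency."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    # A perfectly balanced dataset yields a weight of 1.0 for every class.
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


def add_weight_column(rows, label_key="label", weight_key="weight"):
    """Return copies of the rows with a hypothetical weight column added."""
    weights = class_weights([row[label_key] for row in rows])
    return [{**row, weight_key: weights[row[label_key]]} for row in rows]


# 95% negative, 5% positive, as in the question.
rows = [{"label": 0}] * 95 + [{"label": 1}] * 5
weighted = add_weight_column(rows)
```

With this split, each positive row ends up weighted far more heavily than each negative row, which is the effect the weight column is meant to achieve.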

677

Multi-Select | medium

Which TWO security best practices should be applied to secure data in transit for a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery? (Choose 2)

Select 2 answers

A. Use Cloud Key Management Service (Cloud KMS) to encrypt data in transit

B. Enable TLS encryption on all endpoints

C. Use VPC Service Controls to create a service perimeter

D. Use Cloud Armor to protect against DDoS

E. Use private IP addresses for Dataflow workers

Answers: B, C

TLS ensures data encryption between Google Cloud services, which is already enabled by default but should be verified.

Why this answer

Option B is correct because TLS (Transport Layer Security) encryption ensures that data is encrypted during transmission between endpoints, such as between Cloud Pub/Sub and Dataflow workers, and between Dataflow workers and BigQuery. This is a fundamental security best practice for protecting data in transit against eavesdropping and man-in-the-middle attacks.

Exam trap

The trap here is that candidates often confuse encryption at rest (Cloud KMS) with encryption in transit, or assume that using private IPs alone secures data in transit without needing TLS.

678

MCQ | easy

A team has multiple versions of a model and wants to manage them centrally, including tracking metadata and promoting versions to production. Which tool should they use?

A. Cloud Storage

B. BigQuery

C. GitHub

D. Vertex AI Model Registry

Answer: D

Centralized model versioning and metadata.

Why this answer

Vertex AI Model Registry is the correct tool because it is purpose-built for centrally managing multiple model versions, tracking metadata (such as training parameters, evaluation metrics, and lineage), and promoting versions through stages like staging to production. Unlike generic storage or version control systems, it provides native integration with Vertex AI Pipelines and endpoints for controlled rollout and rollback.

Exam trap

Google Cloud often tests the misconception that a general-purpose version control system like GitHub is sufficient for ML model management, but the exam expects candidates to recognize that model registries provide specialized metadata tracking and lifecycle promotion features absent in code-only repositories.

How to eliminate wrong answers

Option A is wrong because Cloud Storage is an object storage service for raw data and artifacts, not a model management system; it lacks built-in version tracking, metadata indexing, and promotion workflows for ML models. Option B is wrong because BigQuery is a serverless data warehouse for analytics and SQL queries, not designed to store or manage ML model versions or their lifecycle. Option C is wrong because GitHub is a code repository for version control of source code and configuration files, but it does not natively handle ML model artifacts, track model-specific metadata (e.g., evaluation metrics, training hyperparameters), or provide staging-to-production promotion workflows without extensive custom tooling.

679

MCQ | hard

A data pipeline uses Cloud Pub/Sub to ingest events, then a Dataflow job writes to Cloud Storage in Avro format. The Dataflow job uses Global windows with a 10-minute trigger. The data is later loaded into BigQuery. They notice duplicate rows in BigQuery because the trigger produced multiple panes. What should the Dataflow pipeline change to eliminate duplicates?

A. Enable exactly-once sink to BigQuery via Dataflow

B. Use a sharded output to Cloud Storage with unique filenames

C. Write to a staging table and use a MERGE statement in BigQuery

D. Use a session window instead of global window

Answer: A

Dataflow's exactly-once sink to BigQuery uses record IDs to deduplicate, preventing duplicates caused by trigger panes.

Why this answer

Option A is correct because enabling exactly-once sinks in Dataflow ensures that each record is written to the sink only once, even if the pipeline produces multiple panes due to triggers. In this scenario, the 10-minute trigger on a global window causes multiple output panes, leading to duplicate rows in BigQuery. Exactly-once sinks use idempotent writes and deduplication mechanisms to prevent duplicates, directly addressing the issue without changing the windowing or trigger logic.
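The idea of deduplicating on a record ID can be illustrated with a toy in-memory sink. This is only a sketch of idempotent writes, not how Dataflow or BigQuery implement them; the class `IdempotentSink` and the record IDs are hypothetical.

```python
class IdempotentSink:
    """Toy sink that keeps at most one row per record ID."""

    def __init__(self):
        self._rows = {}

    def write(self, record_id, row):
        # Writing the same record ID again overwrites instead of appending,
        # so replays from later panes do not create duplicate rows.
        self._rows[record_id] = row

    def rows(self):
        return list(self._rows.values())


# Two trigger panes that both contain record "b".
pane_1 = [("a", {"clicks": 1}), ("b", {"clicks": 2})]
pane_2 = [("b", {"clicks": 2}), ("c", {"clicks": 3})]

sink = IdempotentSink()
for pane in (pane_1, pane_2):
    for record_id, row in pane:
        sink.write(record_id, row)
```

A naive append-only sink would hold four rows after both panes; keying on the record ID leaves three.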

Exam trap

Google Cloud often tests the misconception that changing windowing or output file naming can solve duplicate data issues, when the real solution is to enable exactly-once processing guarantees at the sink level.

How to eliminate wrong answers

Option B is wrong because sharded output with unique filenames only prevents file-level collisions in Cloud Storage, but does not eliminate duplicate rows within the Avro files; duplicates from multiple panes still exist. Option C is wrong because writing to a staging table and using a MERGE statement is a workaround that does not fix the root cause in the Dataflow pipeline; it adds complexity and latency, and is not a Dataflow-native solution. Option D is wrong because session windows group events based on activity gaps, not time intervals; they do not prevent duplicate panes from triggers and are inappropriate for a global-windowed pipeline that needs to deduplicate across all data.

680

MCQ | easy

A data engineer needs to inspect a BigQuery table for sensitive data such as credit card numbers and email addresses before sharing it with a third party. The engineer also wants to de-identify the data by masking the sensitive columns. Which Google Cloud service should be used?

A. Dataplex

B. BigQuery column-level security

C. Data Catalog

D. Cloud DLP

Answer: D

Cloud DLP can inspect BigQuery tables for sensitive info and apply de-identification transformations like masking, tokenization, etc.

Why this answer

Cloud DLP (Data Loss Prevention) is the correct service because it is specifically designed to inspect, classify, and de-identify sensitive data such as credit card numbers and email addresses. It provides built-in infoType detectors for over 150 types of sensitive data and supports masking, tokenization, and other de-identification techniques. The engineer can use Cloud DLP to scan BigQuery tables and then apply transformations to mask the sensitive columns before sharing.
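To make character masking concrete, the sketch below masks email addresses and card-like digit runs with plain regular expressions. It is a simplified stand-in for the concept only and does not call the Cloud DLP API; the patterns and the `mask` helper are hypothetical and far less robust than DLP's infoType detectors.

```python
import re

# Simplified, illustrative patterns (assumptions, not DLP's detectors).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def mask(text, masking_char="#"):
    """Replace letters and digits inside each detected match with masking_char."""

    def _mask_match(match):
        return re.sub(r"[A-Za-z0-9]", masking_char, match.group(0))

    # Mask emails first, then card-like numbers; separators are preserved.
    return CARD.sub(_mask_match, EMAIL.sub(_mask_match, text))


masked = mask("card 4111 1111 1111 1111 mail a@b.com")
```

Keeping separators such as spaces and `@` preserves the shape of the value while hiding its content, which is the usual goal of masking before sharing data.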

Exam trap

The trap here is that candidates confuse BigQuery column-level security (access control) with data masking, but BigQuery column-level security only hides data from unauthorized users and does not inspect or transform the data itself, whereas Cloud DLP performs actual de-identification.

How to eliminate wrong answers

Option A is wrong because Dataplex is a data fabric service for managing, governing, and cataloging data across lakes, warehouses, and marts, but it does not natively inspect or de-identify sensitive data; it relies on integration with Cloud DLP for such tasks. Option B is wrong because BigQuery column-level security (using policy tags) only controls access at the column level by granting or denying read permissions, but it does not inspect content for sensitive data or perform masking/de-identification. Option C is wrong because Data Catalog is a metadata management service for discovering and tagging data assets, but it cannot scan for sensitive data patterns or apply de-identification transformations.

681

Multi-Select | medium

A data engineer needs to schedule a recurring batch load of CSV files from an on-premises SFTP server into BigQuery. The files are generated daily and need to be loaded into a partitioned table by date. Which THREE steps should the engineer take? (Choose THREE.)

Select 3 answers

A. Create a Cloud Function triggered by Cloud Scheduler to load files directly from SFTP to BigQuery

B. Use Storage Transfer Service to copy files from the SFTP server to Cloud Storage every day

C. Use BigQuery Data Transfer Service with SFTP as a source

D. Set up a scheduled BigQuery load job using the Cloud Console or `bq` command to load from Cloud Storage

E. Configure the load job to write to a specific partition using `--time_partitioning_field` or `--range_partitioning`

Answers: B, D, E

Storage Transfer Service supports scheduled transfers from SFTP to Cloud Storage.

Why this answer

Option B is correct because the Storage Transfer Service is designed to move data from on-premises sources (including SFTP servers) into Cloud Storage on a scheduled basis. This is the recommended first step for ingesting files from an external SFTP server into Google Cloud, as it handles the network transfer, retries, and scheduling natively without requiring custom code.
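Once the files are in Cloud Storage, the daily load into a date partition (steps D and E) could be assembled roughly as follows. This is a sketch that only builds the `bq load` argument list; the dataset, table, and bucket names are hypothetical, and it assumes a day-partitioned table addressed with a `$YYYYMMDD` partition decorator.

```python
from datetime import date


def bq_load_command(dataset, table, day, gcs_uri):
    """Build a `bq load` command that targets one daily partition."""
    # The partition decorator ($YYYYMMDD) directs the load to a single day.
    destination = f"{dataset}.{table}${day:%Y%m%d}"
    return [
        "bq",
        "load",
        "--source_format=CSV",
        "--skip_leading_rows=1",  # assumes each file has a header row
        "--time_partitioning_type=DAY",
        destination,
        gcs_uri,
    ]


# Hypothetical names, for illustration only.
cmd = bq_load_command(
    "sales", "events", date(2024, 1, 15), "gs://example-bucket/2024-01-15/*.csv"
)
```

The resulting list can be handed to a scheduler or to `subprocess.run` on a machine where the `bq` CLI is installed and authenticated.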

Exam trap

Google Cloud often tests the misconception that BigQuery Data Transfer Service can directly ingest from SFTP, but it only supports a limited set of SaaS and cloud sources, not on-premises SFTP servers.

682

Multi-Select | easy

Which TWO actions can help reduce prediction latency for a Vertex AI endpoint?

Select 2 answers

A. Increase the number of features

B. Optimize the model architecture to reduce size

C. Use a custom prediction container with optimized dependencies

D. Use a larger machine type with more vCPUs

E. Set min replicas to 0 to save cost

Answers: B, C

Smaller models predict faster.

Why this answer

Optimizing the model architecture to reduce size directly decreases the computational load during inference, which lowers prediction latency. Smaller models require fewer floating-point operations (FLOPs) per prediction, enabling faster response times on Vertex AI endpoints.

Exam trap

Google Cloud often tests the misconception that adding more compute resources (larger machine types) always reduces latency, when in fact it can increase overhead and does not address the root cause of slow inference, which is model complexity.

683

MCQ | medium

Your team uses Cloud Composer to run Apache Airflow DAGs. One DAG uses a BigQueryInsertJobOperator to run a query and then uses BigQueryCheckOperator to verify the results. The DAG is failing intermittently because the query result is not ready when the check operator runs. How should you modify the DAG to ensure the check operator runs only after the query completes successfully?

A. Add a BigQueryTableExistenceSensor before the BigQueryCheckOperator to wait for the table to be created.

B. Change the BigQueryInsertJobOperator to use deferrable mode to make it async.

C. Use a PythonOperator to call the BigQuery API directly and wait for the job to finish.

D. Increase the timeout of the BigQueryCheckOperator.

Answer: A

Sensors wait for a condition; this ensures the table exists before the check runs.

Why this answer

The BigQueryInsertJobOperator is synchronous by default, so the DAG should already wait for completion. However, if the check operator fails due to data not being written, adding a sensor operator (e.g., BigQueryTableExistenceSensor) between the two tasks ensures the table exists before checking.

684

MCQ | medium

A company uses Cloud Storage as a data lake with raw, curated, and processed zones. Data in the raw zone should be automatically moved to a cheaper storage class after 30 days, and deleted after 1 year. What is the most efficient way to implement this?

A. Use Object Lifecycle Management with rules to transition to Coldline after 30 days and delete after 365 days.

B. Write a Cloud Function that runs daily, checks object ages, and moves/deletes them.

C. Use Cloud Scheduler to run a script that changes storage class and deletes objects.

D. Set a retention policy on the raw zone to prevent deletion and manually clean up.

Answer: A

Correct: Lifecycle rules automate this efficiently.

Why this answer

Object Lifecycle Management in Cloud Storage allows you to set rules based on object age. You can transition objects to a lower-cost storage class (e.g., Nearline or Coldline) after 30 days and delete after 365 days.
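The two rules described above can be written as a lifecycle configuration in JSON. The sketch below builds that document in Python; the `raw/` prefix is a hypothetical assumption for a bucket that holds all three zones, and the helper name `lifecycle_config` is illustrative.

```python
import json


def lifecycle_config(transition_days=30, delete_days=365,
                     storage_class="COLDLINE", prefix=None):
    """Build a Cloud Storage lifecycle configuration as a Python dict."""

    def condition(age):
        cond = {"age": age}
        if prefix is not None:
            # Optional: restrict the rule to objects under one prefix.
            cond["matchesPrefix"] = [prefix]
        return cond

    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": storage_class},
                "condition": condition(transition_days),
            },
            {"action": {"type": "Delete"}, "condition": condition(delete_days)},
        ]
    }


# Hypothetical "raw/" prefix; omit it if the raw zone has its own bucket.
config_json = json.dumps(lifecycle_config(prefix="raw/"), indent=2)
```

The serialized JSON can be saved to a file and applied to the bucket with the Cloud Storage tooling of choice, so no scheduled code has to run.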

685

MCQ | easy

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

A. Dataproc Serverless with PySpark

B. Dataflow with batch mode

C. Cloud Data Fusion

D. BigQuery Data Transfer Service

Answer: A

Dataproc Serverless is cost-effective and suitable for batch processing of large CSVs.

Why this answer

Dataproc Serverless with PySpark is the most cost-effective choice because it eliminates cluster management overhead and automatically scales resources based on workload, charging only for the processing time used. For 10 GB CSV files processed daily within a 24-hour window, the serverless model avoids the fixed costs of a persistent cluster, making it ideal for batch, non-time-sensitive jobs. PySpark's native support for CSV parsing and BigQuery integration via the Spark BigQuery connector ensures efficient data loading without additional services.

Exam trap

The trap here is that candidates often choose Dataflow (Option B) because it is a popular batch processing service, but they overlook that Dataproc Serverless is more cost-effective for non-time-sensitive, large CSV batch jobs due to its serverless pricing model and native Spark support for CSV processing.

How to eliminate wrong answers

Option B is wrong because Dataflow in batch mode, while capable, is considered to carry higher per-job overhead and cost for simple batch CSV processing, especially when the data is not time-sensitive and can tolerate longer processing windows. Option C is wrong because Cloud Data Fusion is a visual ETL tool designed for complex data pipelines and integration scenarios, not for cost-effective batch processing of large CSV files; it adds unnecessary abstraction and cost for a straightforward load operation. Option D is wrong because BigQuery Data Transfer Service is designed for scheduled imports from SaaS applications (e.g., Google Ads, YouTube) or from Cloud Storage, where it loads files as they are; it cannot apply custom transformations or PySpark logic, making it unsuitable when raw CSV files must be processed before loading.

686

Multi-Select | medium

A company deploys an ML model using Vertex AI Pipelines. They want to ensure reproducibility and traceability. Which TWO practices should they implement?

Select 2 answers

A. Pin all dependency versions

B. Record dataset version using Vertex AI Dataset

C. Use custom containers for every step

D. Store pipeline run metadata in Vertex AI Experiments

E. Use Kubeflow Pipelines instead

Answers: A, D

Pinning versions ensures consistent environments across runs.

Why this answer

Pinning all dependency versions (Option A) ensures that every pipeline run uses the exact same library versions, eliminating variability from package updates. This is a fundamental practice for reproducibility because even a minor version bump can change model behavior or break code. In Vertex AI Pipelines, dependencies are typically specified in a `requirements.txt` or `Dockerfile`, and pinning them (e.g., `tensorflow==2.12.0`) guarantees consistent execution environments across runs.
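A pinning policy like this can be checked automatically before a pipeline image is built. The sketch below flags any requirement line that is not pinned with `==`; the helper name and the sample file contents are hypothetical, and the pattern deliberately handles only simple `name==version` lines.

```python
import re

# A pinned requirement looks like "name==version" (extras such as name[gpu] allowed).
PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+==[^=\s]+$")


def unpinned_requirements(requirements_text):
    """Return the requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


sample = """
tensorflow==2.12.0
pandas>=1.5   # range, not a pin
numpy
"""
problems = unpinned_requirements(sample)
```

Running such a check in CI and failing the build when the list is non-empty is one way to keep environments consistent across runs.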

Exam trap

Google Cloud often tests the misconception that dataset versioning (Option B) is a core requirement for reproducibility in Vertex AI Pipelines, but the exam emphasizes that dependency pinning and experiment metadata storage are the two primary practices for ensuring reproducibility and traceability in ML pipelines.

687

MCQ | hard

A company uses BigQuery with partitioned tables by ingestion time. They notice that queries scanning recent partitions are fast but queries scanning older partitions are slow. What is the most likely cause?

A. Older partitions are stored on slower storage tiers

B. The older partitions lack clustering metadata because clustering was enabled after data was ingested

C. The table has too many partitions, causing high metadata overhead

D. Queries are using different SQL syntax for older partitions

Answer: B

Clustering is applied only to data ingested after it is enabled. Older partitions remain unclustered, slowing queries.

Why this answer

Clustering can improve query performance by sorting data within partitions. If older partitions were created before clustering was enabled, they are not clustered, leading to slower scans. Re-clustering only applies to new data unless you manually rewrite older partitions.

688

MCQ | medium

A data engineer is building a data lake on Google Cloud and needs to separate raw ingested data, curated/cleaned data, and processed/aggregated data. Which Cloud Storage bucket structure is recommended?

A. Create three separate folders in a single bucket: raw, curated, processed.

B. Store all data in one bucket and use object labels to distinguish raw, curated, and processed.

C. Store raw data in a different project for security isolation.

D. Use different storage classes for raw, curated, and processed data within the same bucket.

Answer: A

Using prefixes (folders) within a bucket is a standard pattern for organizing data lake zones, allowing different lifecycle rules per prefix.

Why this answer

A common best practice for data lakes on GCS is to use separate buckets or folders within a bucket (e.g., raw, curated, processed) to manage different stages of data refinement and apply appropriate lifecycle policies.

689

MCQ | easy

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

A. roles/bigquery.admin

B. roles/bigquery.user

C. roles/bigquery.jobUser

D. roles/bigquery.dataEditor

Answer: D

Includes bigquery.tables.get.

Why this answer

The error indicates the service account lacks the `bigquery.tables.get` permission, which is required to read table metadata. `roles/bigquery.dataEditor` includes this permission along with `bigquery.tables.getData`, `bigquery.tables.update`, and `bigquery.tables.export`, making it the least-privileged of the listed roles that resolves the access denied error for a Dataflow job reading from a BigQuery table.

Exam trap

Google Cloud often tests the misconception that `roles/bigquery.user` or `roles/bigquery.jobUser` provide sufficient read access for Dataflow jobs, when in fact they lack the specific `bigquery.tables.get` permission needed for table metadata retrieval.

How to eliminate wrong answers

Option A is wrong because `roles/bigquery.admin` grants full control over BigQuery resources, including dataset deletion and IAM policy management, which is excessive and violates the principle of least privilege for a Dataflow job that only needs to read table data. Option B is wrong because `roles/bigquery.user` provides `bigquery.datasets.get` and `bigquery.jobs.create` but does not include `bigquery.tables.get`, so it would not resolve the specific permission error. Option C is wrong because `roles/bigquery.jobUser` only allows creating and managing jobs (e.g., queries) but does not grant any direct table read permissions like `bigquery.tables.get`.

690

MCQ | medium

A company is building a real-time streaming pipeline to ingest clickstream events from web servers, enrich them with user profile data from Cloud Bigtable, and aggregate metrics into BigQuery. The expected throughput is 10,000 events per second with occasional spikes up to 50,000. The data must be processed with low latency (seconds) and exactly-once semantics. Which Google Cloud service should be the core processing engine?

A. Cloud Dataflow (Apache Beam runner)

B. Cloud Pub/Sub with Cloud Functions

C. Cloud Dataproc with Apache Spark Streaming

D. Cloud Data Fusion

Answer: A

Dataflow provides auto-scaling, exactly-once semantics, low latency, and native integration with BigQuery and Bigtable.

Why this answer

Cloud Dataflow, as a managed Apache Beam runner, is the correct choice because it provides exactly-once processing semantics, low-latency streaming (sub-second to seconds), and autoscaling to handle throughput spikes from 10,000 to 50,000 events per second. Its unified batch and streaming model allows you to enrich clickstream events with user profile data from Cloud Bigtable via side inputs or asynchronous lookups, and write aggregated metrics to BigQuery with exactly-once guarantees using the Beam BigQuery I/O connector.

Exam trap

Google Cloud often tests the misconception that Cloud Pub/Sub with Cloud Functions is sufficient for low-latency streaming, but candidates overlook that Cloud Functions lacks stateful processing and exactly-once semantics, making it unsuitable for aggregation and enrichment at high throughput.

How to eliminate wrong answers

Option B (Cloud Pub/Sub with Cloud Functions) is wrong because Cloud Functions has a maximum timeout of 9 minutes and does not support exactly-once processing semantics; it is at-least-once by default and lacks checkpointing for stateful operations like aggregation. Option C (Cloud Dataproc with Apache Spark Streaming) is wrong because Spark Streaming's micro-batch architecture introduces a minimum latency of several seconds (typically 5-10 seconds), which does not meet the 'seconds' low-latency requirement, and managing exactly-once semantics requires additional configuration (e.g., Kafka offsets) that is not natively handled by the managed service. Option D (Cloud Data Fusion) is wrong because it is a visual ETL tool designed for batch-oriented data integration and does not support real-time streaming ingestion or exactly-once processing; its pipelines are not suitable for sub-second latency or high-throughput event streams.

691

MCQ | hard

A company uses Cloud Dataproc to run Spark jobs on ephemeral clusters. The input data is in Cloud Storage and output is also to Cloud Storage. The cluster is created and deleted daily. The cost is high due to spinning up nodes. Which change can reduce cost without sacrificing performance?

A. Use standard VMs with a larger number of smaller machines

B. Use Cloud Dataflow instead

C. Use a combination of standard and preemptible VMs for worker nodes

D. Use preemptible VMs for all nodes

Answer: C

Preemptible VMs for workers reduce cost significantly; standard VMs for the master and a few worker nodes ensure reliability.

Why this answer

Option C is correct because using a combination of standard and preemptible VMs for worker nodes reduces cost significantly while maintaining performance. Preemptible VMs are up to 80% cheaper than standard VMs, and since Spark is fault-tolerant and recovers from node preemptions by re-running the lost tasks, the job can still complete without significant performance degradation. Standard VMs for master nodes ensure cluster stability, while preemptible workers handle the bulk of data processing.
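The saving can be estimated with simple arithmetic. The sketch below compares an all-standard worker pool with a mixed pool; the hourly rate of 0.10, the worker counts, and the job duration are hypothetical, and the 80% discount is the upper bound quoted above rather than a guaranteed price.

```python
def cluster_cost(standard_workers, preemptible_workers, hours,
                 standard_rate, discount=0.8):
    """Estimate worker cost for one run, given an hourly rate per standard VM."""
    preemptible_rate = standard_rate * (1 - discount)
    hourly = (standard_workers * standard_rate
              + preemptible_workers * preemptible_rate)
    return hours * hourly


# Hypothetical figures: 10 workers, a 2-hour job, 0.10 per standard VM-hour.
all_standard = cluster_cost(10, 0, hours=2, standard_rate=0.10)
mixed = cluster_cost(2, 8, hours=2, standard_rate=0.10)
savings_fraction = 1 - mixed / all_standard
```

Under these assumptions the mixed pool costs roughly a third of the all-standard pool, while the two standard workers keep some capacity that cannot be preempted.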

Exam trap

Google Cloud often tests the misconception that preemptible VMs can be used for all nodes, but the trap here is that the master node must be a standard VM to avoid cluster instability, while workers can safely use preemptible VMs due to Spark's fault tolerance.

How to eliminate wrong answers

Option A is wrong because using a larger number of smaller machines increases overhead from inter-node communication and task scheduling, potentially degrading performance and not necessarily reducing cost. Option B is wrong because Cloud Dataflow is a different service for batch and stream processing, not a direct replacement for Spark on Dataproc; migrating would require rewriting jobs and may not preserve existing Spark-specific logic or performance characteristics. Option D is wrong because using preemptible VMs for all nodes, including the master node, risks cluster failure if the master is preempted, as Dataproc does not automatically recover the master; this sacrifices reliability and can cause job failures.

692

MCQ | hard

A team needs to run hybrid transactional/analytical workloads on PostgreSQL-compatible data with low latency. They require high performance on both OLTP and OLAP queries, leveraging a columnar engine. Which Google Cloud service is best suited?

A. AlloyDB

B. Cloud SQL for PostgreSQL

C. Cloud Spanner

D. BigQuery

Answer: A

AlloyDB combines PostgreSQL compatibility with a columnar engine for fast analytics.

Why this answer

AlloyDB is the correct choice because it is a fully managed PostgreSQL-compatible database service on Google Cloud that combines a columnar engine for fast analytical queries with high transactional performance. It uses a columnar query accelerator to offload analytical workloads from the transactional engine, enabling low-latency hybrid transactional/analytical processing (HTAP) without data movement.

Exam trap

The trap here is that candidates may confuse BigQuery's columnar storage with PostgreSQL compatibility, or assume Cloud SQL's PostgreSQL support is sufficient for HTAP workloads, overlooking the need for a dedicated columnar engine.

How to eliminate wrong answers

Option B (Cloud SQL for PostgreSQL) is wrong because it lacks a columnar engine and is optimized primarily for OLTP workloads, so analytical queries would suffer from high latency and poor performance. Option C (Cloud Spanner) is wrong because it is a globally distributed, strongly consistent relational database designed for horizontal scalability and high availability, not for columnar analytics or PostgreSQL compatibility. Option D (BigQuery) is wrong because it is a serverless data warehouse with a columnar storage engine but is not PostgreSQL-compatible and is designed for OLAP, not low-latency OLTP transactions.

693

MCQ | medium

A company uses Looker Studio to create dashboards from BigQuery data. They notice that dashboard queries take several seconds to load. They want to improve performance without changing the underlying data or creating materialized views. Which option should they use?

A. Enable BigQuery BI Engine for the project

B. Increase the number of BigQuery slots

C. Switch to Looker instead of Looker Studio

D. Replicate the data to Cloud SQL for faster queries

Answer: A

BI Engine accelerates BI queries by caching data in memory, reducing latency.

Why this answer

BigQuery BI Engine is an in-memory analysis service that accelerates queries from Looker Studio (and other BI tools) by caching data in memory, significantly reducing latency. Replicating data to Cloud SQL would add complexity and may not handle the volume. Using Looker instead of Looker Studio doesn't inherently speed up queries.

Increasing BigQuery slots would help but is more expensive and not as targeted for BI tools.

694

MCQ | hard

A healthcare company uses Vertex AI to deploy a medical image classification model. The model is deployed on a private endpoint with automatic scaling (minReplicaCount=2, maxReplicaCount=10). The model uses a custom container with a GPU for inference. Recently, during peak business hours (9 AM - 5 PM), users report that prediction requests frequently time out after 60 seconds, and the error rate increases. The team checks Cloud Monitoring and observes that CPU utilization averages 40%, GPU utilization averages 30%, and the number of replicas stays at 2. There are no errors in the container logs. The model serves a few hundred requests per second during peak. The team suspects the issue is not resource saturation but something else. What should they do to resolve the problem?

A. Switch from online prediction to batch prediction using Vertex AI Batch Prediction.

B. Increase the minReplicaCount to 5 to ensure more replicas are always available.

C. Increase the request timeout setting on the load balancer to 120 seconds.

D. Optimize the prediction container to handle requests faster by reducing image pre-processing and using async I/O.

Answer: D

Improving request handling efficiency directly addresses the timeout. Likely the container is blocking on I/O or serialization.

Why this answer

Option D is correct because the symptoms—low CPU/GPU utilization, replicas stuck at 2, and timeouts—indicate that the container is taking too long to process each request, not that resources are saturated. Optimizing the container (e.g., reducing image pre-processing, using async I/O) reduces per-request latency, allowing the model to handle the same request rate within the 60-second timeout. This directly addresses the root cause without changing scaling or timeout settings.
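The effect of async I/O on a handler that mostly waits can be shown with a small `asyncio` sketch. It is a toy model, not a Vertex AI serving container: `handle_request` and its 50 ms simulated wait are hypothetical stand-ins for I/O-bound work such as fetching an image.

```python
import asyncio
import time


async def handle_request(request_id, io_seconds=0.05):
    """Simulate a request whose time is dominated by waiting on I/O."""
    await asyncio.sleep(io_seconds)  # non-blocking wait; other requests proceed
    return request_id


async def serve_concurrently(n_requests):
    """Handle all requests on one event loop, overlapping their I/O waits."""
    return await asyncio.gather(*(handle_request(i) for i in range(n_requests)))


def timed_run(n_requests=20):
    start = time.perf_counter()
    results = asyncio.run(serve_concurrently(n_requests))
    return results, time.perf_counter() - start


results, elapsed = timed_run()
# Handled one at a time with blocking waits, 20 requests would need about 1 second.
```

Because the waits overlap, total time stays close to a single request's wait even though CPU use remains low, which matches the symptom of timeouts without resource saturation.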

Exam trap

The trap here is that candidates assume low resource utilization means the system is under-provisioned (leading them to increase replicas or timeout), when in fact the bottleneck is per-request latency within the container, which autoscaling cannot fix.

How to eliminate wrong answers

Option A is wrong because switching to batch prediction would not solve real-time inference timeouts; batch prediction is for offline, non-latency-sensitive workloads and would break the real-time requirement. Option B is wrong because increasing minReplicaCount to 5 does not address the fact that existing replicas are underutilized (30-40% CPU/GPU) and requests are timing out due to slow processing, not lack of replicas. Option C is wrong because increasing the load balancer timeout to 120 seconds would only mask the symptom; the container still cannot process requests fast enough, and the underlying latency issue would persist, potentially causing cascading failures.

695

MCQ | easy

A startup is deploying a machine learning model for real-time fraud detection. They need low latency and automatic scaling during peak hours. Which Google Cloud service should they use?

A. Cloud Functions

B. Batch Prediction on Vertex AI

C. Cloud AI Platform Prediction with custom containers

D. Vertex AI Endpoints

Answer: D

Vertex AI Endpoints provides managed online prediction with automatic scaling and low latency.

Why this answer

Vertex AI Endpoints provide managed, autoscaling infrastructure designed for low-latency online predictions, making them ideal for real-time fraud detection. They automatically scale the number of compute nodes based on incoming traffic, ensuring peak-hour demand is met without manual intervention.

Exam trap

The trap here is that candidates confuse Cloud Functions or Batch Prediction for real-time serving, overlooking that Vertex AI Endpoints are the only option purpose-built for low-latency, autoscaling online predictions in the modern Vertex AI ecosystem.

How to eliminate wrong answers

Option A is wrong because Cloud Functions is a serverless compute service for event-driven, short-lived tasks, not designed for sustained, low-latency model serving with autoscaling for prediction traffic. Option B is wrong because Batch Prediction on Vertex AI is intended for asynchronous, offline predictions on large datasets, not for real-time, low-latency inference. Option C is wrong because Cloud AI Platform Prediction with custom containers is a legacy service that lacks the integrated autoscaling and endpoint management capabilities of Vertex AI Endpoints, which is the modern, recommended service for online predictions.

696

MCQeasy

A company wants to use BigQuery to query data stored in Cloud Storage as Parquet files without loading the data into BigQuery storage. Which feature should they use?

A.BigQuery ingestion from Cloud Storage using load jobs

B.Cloud Storage FUSE to mount bucket and query

C.BigQuery federated queries with Cloud Storage

D.BigQuery external tables

AnswerD

External tables enable querying data directly in Cloud Storage without loading.

Why this answer

Option D is correct because BigQuery external tables allow querying data stored in Cloud Storage (including Parquet files) directly without loading it into BigQuery storage. This is achieved by defining a table schema that references the external data source, enabling BigQuery to read the Parquet files on-the-fly using its federated query engine.

Exam trap

The trap here is that candidates confuse 'federated queries' (which typically query external databases such as Cloud SQL through EXTERNAL_QUERY) with the ability to query Cloud Storage files, which is specifically implemented through external tables.

How to eliminate wrong answers

Option A is wrong because BigQuery ingestion using load jobs imports data into BigQuery's internal storage, which contradicts the requirement to query data without loading it. Option B is wrong because Cloud Storage FUSE mounts a bucket as a filesystem, but BigQuery cannot directly query files via FUSE; it requires a different integration mechanism. Option C is wrong because BigQuery federated queries with Cloud Storage is not a distinct feature; the correct term is 'external tables' or 'federated data sources', and 'federated queries' typically refers to querying external databases like Cloud SQL, not Cloud Storage files.
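As a rough illustration of the external-table approach, the sketch below builds a BigQuery `CREATE EXTERNAL TABLE` statement for Parquet files. The helper name, table name, and bucket path are hypothetical; only the DDL shape (`OPTIONS` with `format` and `uris`) follows BigQuery's syntax, and the statement would still need to be run in BigQuery.

```python
def external_table_ddl(table, uris, file_format="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement (hypothetical helper).

    The data stays in Cloud Storage; BigQuery only stores the table definition.
    """
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"OPTIONS (format = '{file_format}', uris = [{uri_list}])"
    )


# Hypothetical project, dataset, and bucket names for illustration only.
ddl = external_table_ddl(
    "my_project.my_dataset.sales_ext",
    ["gs://my-bucket/sales/*.parquet"],
)
print(ddl)
```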

697

MCQhard

Your company uses Vertex AI Pipelines to automate the ML lifecycle. The pipeline includes training, evaluation, and deployment steps. You want to ensure that if a pipeline run fails due to a transient error (e.g., resource quota shortage), it automatically retries before marking the run as failed. What is the best way to implement this?

A.Configure Vertex AI Pipelines to automatically restart failed runs.

B.In the pipeline component code, implement retry logic using exponential backoff for specific exceptions.

C.Set a high timeout value for the pipeline so that transient errors resolve before timeout.

D.Use Cloud Tasks to schedule pipeline runs and retry upon failure.

AnswerB

Retrying within the component handles transient failures gracefully without failing the entire pipeline.

Why this answer

Vertex AI Pipelines does not automatically restart a failed run. You can wrap each step's logic to catch transient errors and retry, or use a retry mechanism in the container itself. A Kubeflow Pipelines retry policy can also be specified on individual tasks.

Modifying pipeline code is the most direct way.
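A minimal sketch of component-level retry with exponential backoff might look like the following. `TransientError` and `retry_with_backoff` are hypothetical names introduced for illustration; real component code would catch whichever exceptions its client library raises for quota or availability errors.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a quota shortage)."""


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: let the step fail.
            # Delay doubles each attempt, capped at max_delay, plus up to 10% jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

Non-transient exceptions are deliberately not caught, so genuine bugs still fail the step immediately instead of being retried.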

698

MCQmedium

An organization needs to prevent data exfiltration from BigQuery by ensuring all traffic to BigQuery APIs goes through VPC boundaries and is restricted to a specific service perimeter. Which Google Cloud security control should they use?

A.Access Transparency

B.IAM conditions on BigQuery roles

C.Cloud Armor

D.VPC Service Controls

AnswerD

VPC Service Controls define a perimeter that restricts data movement to authorized networks and prevents exfiltration.

Why this answer

VPC Service Controls (D) is the correct answer because it allows you to define a service perimeter around BigQuery APIs, ensuring that all traffic to BigQuery must originate from within the defined VPC boundaries. This prevents data exfiltration by blocking unauthorized access from outside the perimeter, even if valid credentials are used. It works by enforcing context-aware access policies at the Google Cloud network edge, not at the application layer.

Exam trap

The trap here is that candidates confuse IAM conditions (which control who can access data) with VPC Service Controls (which control where data can be accessed from), leading them to pick Option B instead of D.

How to eliminate wrong answers

Option A is wrong because Access Transparency provides logs of Google personnel access to your data, not network-level controls for data exfiltration. Option B is wrong because IAM conditions on BigQuery roles control authorization based on attributes like IP address or time, but they do not restrict traffic to VPC boundaries or create a service perimeter; they are identity-based, not network-based. Option C is wrong because Cloud Armor is a web application firewall (WAF) for HTTP(S) traffic to load balancers, not for BigQuery API traffic, and it cannot enforce VPC boundaries or service perimeters.

699

MCQeasy

A large retail company processes point-of-sale transactions from thousands of stores daily. The current batch pipeline runs on Cloud Dataproc using Spark and takes 3 hours to complete. The business wants to reduce processing time to under 30 minutes. The pipeline reads from Cloud Storage, joins with inventory data from BigQuery, performs aggregations, and writes to Cloud SQL for reporting. What is the most effective optimization?

A.Migrate the pipeline to Cloud Dataflow with Apache Beam for auto-scaling

B.Read inventory data from BigQuery and pre-join in BigQuery, then export to Cloud Storage as ORC files

C.Write intermediate results to Cloud SQL instead of BigQuery for faster access

D.Increase the number of worker nodes in the Dataproc cluster

AnswerB

Reduces data shuffle in Spark and speeds up processing.

Why this answer

Option B is correct because it offloads the join operation to BigQuery, which is optimized for large-scale analytics and can process the join much faster than Spark. By pre-joining and exporting the result as ORC files (a columnar format optimized for Spark), the pipeline avoids the expensive shuffle and data transfer between Cloud Storage and BigQuery, significantly reducing the overall processing time to meet the 30-minute target.

Exam trap

The trap here is that candidates often assume that simply scaling up the existing infrastructure (more workers or auto-scaling) is the most effective optimization, but Google Cloud tests the understanding that architectural changes to reduce data movement and leverage service-specific strengths (like BigQuery for joins) are far more impactful than brute-force scaling.

How to eliminate wrong answers

Option A is wrong because migrating to Cloud Dataflow with Apache Beam introduces auto-scaling but does not address the fundamental bottleneck of joining large datasets across Cloud Storage and BigQuery; the join operation would still require significant data movement and processing, likely not achieving the required speedup. Option C is wrong because writing intermediate results to Cloud SQL instead of BigQuery would actually slow down the pipeline, as Cloud SQL is a transactional database not designed for high-throughput batch writes, and it would introduce additional latency and potential contention. Option D is wrong because simply increasing the number of worker nodes in the Dataproc cluster may improve parallelism but does not eliminate the costly shuffle and data transfer inherent in the join between Cloud Storage and BigQuery; it would also increase costs without guaranteeing the 6x performance improvement needed.

700

MCQmedium

A company uses Cloud Pub/Sub for a real-time data pipeline. The subscription has a backlog of millions of messages that are not being processed quickly enough. In Cloud Monitoring, you observe that the 'subscription/num_undelivered_messages' metric is high and growing, while 'subscription/oldest_unacked_message_age' is also increasing. Which action is MOST likely to reduce the backlog?

A.Delete the subscription and recreate it with a larger message retention duration.

B.Reduce the acknowledgment deadline to force faster processing.

C.Change the subscription type from push to pull.

D.Increase the number of subscribers or the throughput capacity of the existing subscribers.

AnswerD

Adding more subscribers (e.g., scaling out Dataflow workers) increases the rate of message processing, reducing the backlog.

Why this answer

Option D is correct because the backlog indicates that subscribers cannot keep up with the message flow. Increasing the number of subscribers or scaling their throughput capacity directly addresses the processing bottleneck, allowing messages to be pulled and acknowledged faster. Cloud Pub/Sub scales horizontally, so adding more pull subscribers or increasing the resources of existing ones (e.g., more worker threads, higher CPU/memory) reduces the backlog.

Exam trap

Google Cloud often tests the misconception that reducing the acknowledgment deadline or changing subscription type will speed up processing, when in reality these actions can increase redeliveries or do not address the root cause of insufficient subscriber capacity.

How to eliminate wrong answers

Option A is wrong because deleting and recreating the subscription with a larger message retention duration does not increase processing speed; it only keeps messages longer, which does not reduce the existing backlog. Option B is wrong because reducing the acknowledgment deadline forces subscribers to acknowledge messages faster, but if they cannot process them in time, it leads to more redeliveries and can worsen the backlog. Option C is wrong because changing from push to pull does not inherently increase throughput; both modes can be scaled, and the bottleneck is subscriber capacity, not the delivery mechanism.
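The capacity argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes steady publish and processing rates, which real workloads rarely have; the function name and the numbers in the usage lines are hypothetical.

```python
def backlog_drain_seconds(backlog, publish_rate, per_subscriber_rate, subscribers):
    """Estimate seconds to clear a backlog; None if consumers cannot keep up.

    Rates are messages per second and are assumed constant (a simplification).
    """
    net_rate = subscribers * per_subscriber_rate - publish_rate
    if net_rate <= 0:
        return None  # Backlog grows (or stays flat) at this capacity.
    return backlog / net_rate


# Hypothetical numbers: 5M backlog, 2,000 msg/s published, 500 msg/s per subscriber.
print(backlog_drain_seconds(5_000_000, 2000, 500, subscribers=4))
print(backlog_drain_seconds(5_000_000, 2000, 500, subscribers=8))
```

With four subscribers the consumers only match the publish rate, so the backlog never shrinks; doubling them leaves spare capacity that drains it.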

701

MCQeasy

A company wants to share a large BigQuery dataset with a partner for analysis. The partner needs read-only access to a specific snapshot of the data as of a certain point in time, and the company wants to avoid additional storage costs for the partner. What is the most cost-effective approach?

A.Create a BigQuery table snapshot at the desired point in time and share it.

B.Create a BigQuery table clone at the desired point in time and share it with the partner.

C.Export the table to Cloud Storage as Avro and share a signed URL.

D.Grant the partner access to the original table with an authorized view.

AnswerB

A table clone is a lightweight copy that shares storage with the base table, so it does not incur additional storage costs until data in the clone or the base table changes. It provides point-in-time consistency.

Why this answer

BigQuery table clones are zero-copy clones that share the underlying storage with the base table. They do not incur additional storage costs until the data in the clone is modified. Snapshots incur storage costs for the snapshot.

Authorized views or datasets require the partner to query the base table, which may incur analysis costs but no extra storage; however, the partner may see changes to the base table. The question specifies a point-in-time snapshot, so a clone is best.

702

MCQeasy

Your data engineering team needs to process a continuous stream of clickstream events from a website and update a real-time dashboard showing user activity over the last hour. The pipeline should have minimal operational overhead and support exactly-once processing semantics. Which Google Cloud service should you use?

A.Cloud Dataproc with Apache Spark Streaming

B.Cloud Data Fusion with batch pipelines

C.Cloud Dataflow with Apache Beam

D.Cloud Pub/Sub Lite with push subscriptions

AnswerC

Dataflow is fully managed, supports streaming with exactly-once semantics, and integrates well with Pub/Sub and BigQuery for real-time dashboards.

Why this answer

Dataflow with Apache Beam is the only service among the options that natively supports exactly-once processing and can handle both streaming and batch pipelines with minimal operational overhead. Dataflow is fully managed, handles autoscaling, and provides exactly-once guarantees for streaming data.

703

MCQmedium

After migrating a production Cloud SQL for PostgreSQL database to a larger machine type, the team notices slower queries. What is the best step to identify the cause?

A.Reindex all tables to improve index efficiency.

B.Enable query caching through the database flags.

C.Enable pg_stat_statements and review query execution times.

D.Increase max_connections to handle more concurrent queries.

AnswerC

This extension captures per-query statistics, allowing identification of regressed queries.

Why this answer

Option C is correct because pg_stat_statements is a PostgreSQL extension that provides detailed query execution statistics, including total execution time, number of calls, and I/O metrics. After migrating to a larger machine type, slower queries often stem from plan changes due to different hardware characteristics or configuration settings; reviewing pg_stat_statements output helps pinpoint which queries are underperforming and why.

Exam trap

Google Cloud often tests the misconception that performance issues after a migration are always due to indexing or connection limits, when in fact the most effective first step is to gather query-level metrics using built-in tools like pg_stat_statements.

How to eliminate wrong answers

Option A is wrong because reindexing all tables is a maintenance task that can improve index bloat but does not address the root cause of slower queries after a migration; it is a reactive measure without diagnostic value. Option B is wrong because Cloud SQL for PostgreSQL does not support a generic 'query caching' database flag; PostgreSQL relies on shared buffers and the buffer cache, and enabling any such flag would not provide diagnostic insight into query performance. Option D is wrong because increasing max_connections can actually degrade performance by increasing context switching and memory contention; it does not help identify why queries are slower and may worsen the issue.

704

Multi-Selectmedium

Which THREE features of Cloud Pub/Sub guarantee at-least-once delivery and enable exactly-once processing downstream? (Choose three.)

Select 3 answers

A.Subscriber-retry policy with exponential backoff.

B.Exactly-once delivery source feature (enabled by default in current gcloud).

C.Message ordering by message key.

D.Cloud Dataproc integration for message replay.

E.Acknowledgment deadlines and message persistence.

AnswersA, B, E

Retries ensure messages are eventually delivered on failure.

Why this answer

Option A is correct because a subscriber-retry policy with exponential backoff ensures that messages that fail to be processed are retried with increasing delays, preventing transient failures from causing message loss. This mechanism, combined with Pub/Sub's persistent storage, guarantees that each message is delivered at least once, as the subscriber will keep retrying until it acknowledges the message.

Exam trap

Google Cloud often tests the misconception that message ordering or a Dataproc integration contributes to delivery guarantees, when in fact ordering is only about sequence and Dataproc is not a Pub/Sub replay mechanism; the key trap is confusing 'exactly-once delivery' (a Pub/Sub setting on pull subscriptions) with 'exactly-once processing' (which still requires subscriber-side idempotency).
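Subscriber-side idempotency can be sketched as deduplication on a message ID. This is an illustrative in-memory version only: `process_once` is a hypothetical helper, and a real subscriber would need a durable store for seen IDs so that duplicates are still recognized after a restart.

```python
def process_once(messages, handler, seen_ids=None):
    """Apply handler to each (message_id, payload) pair at most once.

    seen_ids is an in-memory set here (an assumption for illustration).
    """
    seen = set() if seen_ids is None else seen_ids
    for message_id, payload in messages:
        if message_id in seen:
            continue  # Redelivered duplicate: skip without side effects.
        handler(payload)
        seen.add(message_id)  # Record only after the handler succeeds.
    return seen


handled = []
seen = process_once([("m1", "a"), ("m2", "b"), ("m1", "a")], handled.append)
print(handled)
```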

705

MCQeasy

An e-commerce company processes real-time clickstream data using Pub/Sub and Dataflow. They want to ensure that if a Dataflow worker fails, the pipeline can resume processing from the point of failure without data loss. Which feature should they enable?

A.At-least-once delivery mode

B.Exactly-once processing mode

C.Snapshot-based recovery

D.Streaming engine

AnswerC

Allows periodic saving of pipeline state and resumption from saved snapshots.

Why this answer

Snapshot-based recovery (Option C) is the correct feature because Dataflow snapshots capture the entire pipeline state, including the current position in each Pub/Sub subscription and the state of all transforms. If a worker fails, the pipeline can be resumed from the exact snapshot point, ensuring no data loss and exactly-once processing semantics for the recovered data.

Exam trap

Google Cloud often tests the misconception that exactly-once processing alone guarantees failure recovery, but it only prevents duplicates during normal operation, not resumption after a worker crash.

How to eliminate wrong answers

Option A is wrong because at-least-once delivery mode ensures messages are delivered at least once but does not provide a mechanism to resume from a specific point of failure; it may cause duplicate processing but not lossless recovery. Option B is wrong because exactly-once processing mode is a processing guarantee that prevents duplicates but does not inherently provide a recovery mechanism to resume from a failure point; it relies on other features like snapshots for stateful resumption. Option D is wrong because Streaming Engine is a Dataflow feature that moves state and shuffle data to a backend service to reduce worker resource usage, but it does not directly provide a point-of-failure recovery mechanism; snapshots are required for that.

706

MCQeasy

A startup is using Cloud Build to automate the training and deployment of their machine learning models. The workflow is defined in cloudbuild.yaml and includes steps to: 1) Run a training job on AI Platform Training, 2) Build a custom prediction container, 3) Deploy the container to Cloud Run for serving. The deployment step fails intermittently with the error: 'Cloud Run service already exists and is not owned by the calling user.' You need to fix this so that deployments are reliable. What should you do?

A.Ensure the Cloud Build service account has the 'run.services.update' permission on the Cloud Run service.

B.Delete the existing Cloud Run service manually before each build.

C.Use 'gcloud run deploy --replace' in the build step to force replace the existing service.

D.Use Cloud Run for Anthos instead of fully managed Cloud Run to avoid ownership issues.

AnswerA

The error suggests a permissions issue; granting the correct role to the Cloud Build service account resolves it.

Why this answer

The error indicates that the Cloud Run service already exists and the Cloud Build service account does not own it. The Cloud Build service account needs the 'run.services.update' IAM permission on the specific Cloud Run service to modify it during deployment. Granting this permission allows the service account to update the existing service reliably, resolving the intermittent failure.

Exam trap

Google Cloud often tests the misconception that a command-line flag like '--replace' can bypass IAM permission errors, but the root cause here is an IAM misconfiguration, not a missing flag.

How to eliminate wrong answers

Option B is wrong because manually deleting the service before each build is not a scalable or automated solution; it introduces manual steps and potential downtime, and does not address the underlying permission issue. Option C is wrong because 'gcloud run deploy' has no '--replace' flag, and the error is about ownership and permissions, not about forcing replacement. Option D is wrong because switching to Cloud Run for Anthos does not resolve the ownership issue; the same IAM permission model applies, and the error would persist if the service account lacks the necessary permissions.

707

MCQeasy

Your team uses Cloud Dataproc to run a Spark ML training job. The job is failing with an error: 'Container killed by YARN for exceeding memory limits.' What should you do to fix this?

A.Increase the spark.executor.memory property

B.Use preemptible VMs for faster execution

C.Increase the number of worker nodes

D.Enable the external shuffle service

AnswerA

This directly addresses the memory limit for each executor.

Why this answer

The error 'Container killed by YARN for exceeding memory limits' indicates that the Spark executor process is using more memory than the YARN container allows. Increasing `spark.executor.memory` allocates a larger YARN container for each executor, providing the necessary headroom for the Spark application's memory demands, including overhead for off-heap memory and JVM internals.

Exam trap

The trap here is that candidates often confuse scaling horizontally (adding nodes) with scaling vertically (increasing per-node resources), and assume more nodes will fix memory limits when the issue is per-container allocation.

How to eliminate wrong answers

Option B is wrong because preemptible VMs are cheaper but can be terminated at any time, which does not address memory limits and can actually cause more failures due to preemption. Option C is wrong because increasing the number of worker nodes adds more executors but does not increase the memory per executor; the existing executors will still exceed their container limits. Option D is wrong because the external shuffle service helps with shuffle data persistence and reduces executor memory pressure during shuffle operations, but it does not increase the per-executor memory allocation; the root cause is insufficient container memory, not shuffle management.
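To see how executor memory relates to the YARN container size, the sketch below applies Spark's default overhead rule on YARN: the larger of 384 MiB or 10% of `spark.executor.memory`, unless `spark.executor.memoryOverhead` is set explicitly. The function name is hypothetical, and YARN's rounding of requests up to its allocation increment is not modeled.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None,
                      overhead_factor=0.10, min_overhead_mb=384):
    """Approximate the memory Spark requests from YARN per executor, in MiB."""
    if overhead_mb is None:
        # Default overhead: max(384 MiB, 10% of executor memory).
        overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead_mb


print(yarn_container_mb(2048))
print(yarn_container_mb(8192))
print(yarn_container_mb(8192, overhead_mb=2048))
```

If the executor's off-heap usage exceeds the overhead portion, the container is still killed, which is why the overhead setting is sometimes raised alongside the executor memory.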

708

MCQhard

A data science team uses AI Platform Training with hyperparameter tuning. They observe that some trials fail due to transient errors. To improve solution quality and reduce costs, what should they do?

A.Enable early stopping using a Bayesian optimization algorithm.

B.Set the maxFailedTrials parameter to a high value (e.g., 10).

C.Use larger machine types for each trial.

D.Increase the number of parallel trials.

AnswerB

This allows the tuning job to tolerate transient failures and continue searching without aborting, improving completion rate and model quality.

Why this answer

Option B is correct because setting maxFailedTrials to a high value (e.g., 10) allows the hyperparameter tuning job to continue even when some trials fail due to transient errors. This improves solution quality by ensuring that enough successful trials complete to explore the search space, and it reduces costs by avoiding premature job termination that would waste resources on restarts.

Exam trap

Google Cloud often tests the misconception that early stopping (Option A) is the best way to handle transient errors, but early stopping is for performance-based pruning, not for fault tolerance; the trap is confusing optimization strategies with error-handling parameters.

How to eliminate wrong answers

Option A is wrong because early stopping using Bayesian optimization is designed to stop poorly performing trials early to save resources, but it does not address transient errors that cause trials to fail; in fact, early stopping could prematurely terminate trials that might have succeeded after a transient error. Option C is wrong because using larger machine types for each trial increases cost per trial and does not prevent transient errors from causing failures; it is a brute-force approach that does not improve reliability. Option D is wrong because increasing the number of parallel trials can increase the chance of transient errors occurring simultaneously and may lead to more failed trials, increasing costs without addressing the root cause of transient failures.

709

Multi-Selecteasy

You want to query data across Google Cloud and AWS using a single SQL interface without moving data. Which TWO services can you use?

Select 2 answers

A.BigQuery Data Transfer Service

B.Cloud Spanner

C.BigQuery Omni

D.Cloud Data Fusion

E.BigQuery cross-cloud query with Omni

AnswersC, E

BigQuery Omni enables multi-cloud queries.

Why this answer

BigQuery Omni allows querying data in AWS (and Azure) using BigQuery SQL, running the query where the data resides. Standard BigQuery storage itself is Google Cloud only.

Cloud Spanner and Data Fusion are not for multi-cloud SQL queries across clouds.

710

Multi-Selectmedium

A data engineer needs to set up a Dataplex data quality scan to run weekly on a BigQuery table. The scan should check that: (1) the 'email' column is not null, (2) the 'age' column is between 0 and 120, and (3) the 'country_code' column matches a list of valid ISO codes. Which THREE Dataplex features should the engineer use?

Select 3 answers

A.Schedule the data quality scan using Dataplex scheduling options

B.Use Cloud DLP to inspect the table for data quality issues

C.Create a column rule for the 'age' column using SQL condition 'age BETWEEN 0 AND 120'

D.Create a Dataplex asset and add a tag template to enforce constraints

E.Create a row rule for the 'email' column using SQL condition 'email IS NOT NULL'

AnswersA, C, E

Dataplex allows scheduling data quality tasks (e.g., weekly) directly from the UI or API.

Why this answer

Dataplex data quality tasks use SQL-based rules. Row rules validate conditions per row (e.g., IS NOT NULL, BETWEEN). Column rules validate column-level conditions (e.g., in set).

Schedule is set via the scheduling feature. The other options are not Dataplex data quality features: DLP is for sensitive data, Data Catalog is for metadata.

711

MCQmedium

A media company streams real-time viewer data from Pub/Sub to BigQuery using a Dataflow pipeline. They need to handle occasional malformed messages without losing valid data. Which pattern should they implement?

A.Raise an exception in the pipeline and stop processing

B.Use retry logic in the pipeline to reprocess malformed messages indefinitely

C.Implement a dead letter sink to store malformed messages for later analysis

D.Discard malformed messages and log an error

AnswerC

Dead letter sinks store problematic records without blocking the pipeline, enabling later inspection and reprocessing.

Why this answer

Option C is correct because a dead letter sink (e.g., a separate Pub/Sub topic or a BigQuery error table) allows the Dataflow pipeline to route malformed messages out of the main processing path while continuing to process valid data. This pattern ensures no valid data is lost and provides a durable location for later analysis or reprocessing of the malformed records, which is essential for streaming pipelines where data quality issues are intermittent.

Exam trap

Google Cloud often tests the dead letter pattern to see if candidates understand that streaming pipelines must handle bad data gracefully without stopping or losing valid records, and the trap is that many candidates choose retry logic (Option B) because they confuse transient errors with permanent data quality issues.

How to eliminate wrong answers

Option A is wrong because raising an exception and stopping the pipeline would cause all processing to halt, leading to data loss for valid messages and violating the requirement to handle malformed messages without losing valid data. Option B is wrong because retrying malformed messages indefinitely would cause the pipeline to stall on bad records, potentially blocking the processing of subsequent valid messages and increasing latency; Dataflow's retry mechanisms are intended for transient errors, not for permanently malformed data. Option D is wrong because discarding malformed messages and logging an error results in permanent data loss, which contradicts the requirement to preserve data for later analysis and violates best practices for data integrity in streaming pipelines.
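The dead letter pattern can be sketched in plain Python as a split into two outputs. This is not Apache Beam code; in a real Dataflow pipeline the same idea is typically expressed with a main output and a separate error output. The function name and the required fields are hypothetical.

```python
import json


def split_valid_and_dead_letter(raw_messages, required_fields=("user_id", "event")):
    """Route parseable records to `valid` and everything else to `dead_letter`."""
    valid, dead_letter = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            missing = [f for f in required_fields if f not in record]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            valid.append(record)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            # Keep the original payload and the reason for later analysis.
            dead_letter.append({"raw": raw, "error": str(err)})
    return valid, dead_letter


valid, dead = split_valid_and_dead_letter(
    ['{"user_id": 1, "event": "play"}', "not json", '{"user_id": 2}']
)
print(len(valid), len(dead))
```

Malformed input never raises out of the loop, so valid records keep flowing while bad ones are preserved with their error message.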

712

MCQeasy

Your team needs to store time-series data from millions of IoT devices. Each device sends a reading every 5 minutes, and the total data volume is about 2 TB per month. The most common query pattern is retrieving all readings for a specific device over a time range (e.g., last 24 hours). Which storage service should you choose?

A.Cloud Storage (objects per device per time interval)

B.BigQuery

C.Cloud Bigtable

D.Cloud Spanner

AnswerC

Bigtable is ideal for time-series data with high write throughput and row-key-based range scans for device/time.

Why this answer

Cloud Bigtable is a fully managed, scalable NoSQL database designed for high-throughput, low-latency time-series data. It supports single-row key lookups and range scans, making it ideal for retrieving all readings for a specific device over a time range (e.g., last 24 hours) from millions of IoT devices generating 2 TB/month. Its row key design (e.g., device_id + timestamp) enables efficient time-range queries without full table scans, unlike object storage or analytical warehouses.

Exam trap

Google Cloud often tests the misconception that BigQuery is suitable for operational, low-latency time-series queries, but the trap here is that BigQuery is an analytical warehouse optimized for large-scale batch queries, not for repeated, sub-second per-device range scans, which is a classic NoSQL (Bigtable) workload.

How to eliminate wrong answers

Option A is wrong because Cloud Storage (object storage) is optimized for immutable blob storage and lacks native indexing for time-range queries; retrieving all readings for a device over a time range would require listing and filtering millions of objects, which is slow and costly. Option B is wrong because BigQuery is a serverless data warehouse designed for analytical SQL queries on large datasets, not for real-time, high-throughput point lookups or range scans with low-millisecond latency; it would incur high query costs and latency for repeated per-device time-range retrievals. Option D is wrong because Cloud Spanner is a globally distributed relational database with strong consistency and ACID transactions, which is overkill for time-series IoT data and would be more expensive and slower for high-volume, simple key-value range scans compared to Bigtable.
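The row key idea (device_id + timestamp) can be illustrated with a sorted list standing in for Bigtable's lexicographically ordered rows. This is a simulation, not the Bigtable client API; the `#` separator, the zero-padding width, and the function names are assumptions for the example.

```python
import bisect


def make_row_key(device_id, epoch_seconds):
    """Build a key that sorts by device, then by time."""
    # Zero-pad so lexicographic order matches numeric time order.
    return f"{device_id}#{epoch_seconds:010d}"


def scan_device_range(sorted_keys, device_id, start_ts, end_ts):
    """Return keys for one device with start_ts <= timestamp <= end_ts."""
    lo = bisect.bisect_left(sorted_keys, make_row_key(device_id, start_ts))
    hi = bisect.bisect_right(sorted_keys, make_row_key(device_id, end_ts))
    return sorted_keys[lo:hi]  # Contiguous slice: no full scan needed.


keys = sorted(
    make_row_key(device, ts)
    for device, ts in [("dev-a", 100), ("dev-b", 100), ("dev-a", 400),
                       ("dev-b", 400), ("dev-a", 700)]
)
print(scan_device_range(keys, "dev-a", 300, 800))
```

Because all readings for a device are adjacent in key order, a time-range query touches only that contiguous block of rows.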

713

MCQmedium

A user named Charlie needs to deploy a model to a Vertex AI Endpoint and also create training jobs. Which role should be assigned to Charlie?

A.roles/aiplatform.user

B.roles/owner

C.roles/aiplatform.modelUser

D.roles/editor

AnswerA

aiplatform.user allows creating models, deploying endpoints, and running training jobs.

Why this answer

Charlie needs to deploy a model to a Vertex AI Endpoint and create training jobs. The `roles/aiplatform.user` role grants the necessary permissions to use all Vertex AI resources, including creating and managing endpoints, training jobs, models, and predictions. This role is the minimum required for a user to interact with Vertex AI services without granting broader project-level permissions.

Exam trap

The trap here is that candidates often confuse `roles/aiplatform.user` with `roles/aiplatform.modelUser`, mistakenly thinking the latter is sufficient for creating training jobs, when in fact it only allows prediction on existing models.

How to eliminate wrong answers

Option B is wrong because `roles/owner` grants full project-level access, including the ability to delete resources and manage IAM policies, which is excessive and violates the principle of least privilege. Option C is wrong because `roles/aiplatform.modelUser` only allows a user to deploy and predict from an existing model, but does not include permissions to create training jobs or manage endpoints. Option D is wrong because `roles/editor` grants broad project-level edit permissions across all Google Cloud services, not just Vertex AI, and is too permissive for the specific task of deploying models and creating training jobs.

714

MCQmedium

You are designing a Dataflow pipeline that reads from Pub/Sub, performs transformations, and writes to BigQuery. The pipeline must handle schema changes in the incoming data (e.g., new fields appearing). The BigQuery schema should evolve automatically to accept new fields without failing. Which approach should you use?

A.Use Dataprep to clean and standardise the data before loading into BigQuery.

B.Predefine the BigQuery table schema with all possible fields and use UPDATE to add missing fields.

C.Use a JavaScript UDF to parse incoming data and map to a fixed schema, ignoring new fields.

D.Set the table schema to allow unknown fields and use BigQuery's schema auto-update feature in the pipeline.

AnswerD

BigQuery can automatically add nullable columns when new fields appear if configured correctly.

Why this answer

Using BigQuery's schema auto-detection together with a WRITE_APPEND write disposition and tolerance for unknown fields can handle some schema drift. However, the most robust approach is to keep the schema flexible in the pipeline and set BigQuery's schema update options (such as ALLOW_FIELD_ADDITION) so that new fields are added automatically as NULLABLE columns.
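
The idea of evolving a schema by appending nullable columns can be sketched without any cloud library. The snippet below is only an illustration under assumptions: the `evolve_schema` and `infer_type` helpers and the field dictionaries are hypothetical, and a real pipeline would rely on the BigQuery sink's own schema update settings rather than this logic.

```python
from typing import Dict, List

# A field is described the way BigQuery schemas usually are: name, type, mode.
Field = Dict[str, str]

# Hypothetical mapping from Python types to BigQuery type names.
_TYPE_MAP = {bool: "BOOLEAN", int: "INTEGER", float: "FLOAT", str: "STRING"}


def infer_type(value: object) -> str:
    """Map a Python value to a type name, defaulting to STRING."""
    # Exact type lookup, so True/False map to BOOLEAN rather than INTEGER.
    return _TYPE_MAP.get(type(value), "STRING")


def evolve_schema(schema: List[Field], record: Dict[str, object]) -> List[Field]:
    """Return a new schema with any unseen record fields appended as NULLABLE."""
    known = {field["name"] for field in schema}
    added = [
        {"name": key, "type": infer_type(value), "mode": "NULLABLE"}
        for key, value in record.items()
        if key not in known
    ]
    # Existing fields are never altered or removed; only additions are allowed.
    return schema + added
```

New columns are added as NULLABLE because rows written before the change have no value for them.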

715

MCQeasy

A company runs a nightly Dataproc batch job to process large log files. The job is idempotent and can tolerate node failures if restarted. Minimizing cost is critical. What is the most cost-effective cluster design?

A.Use preemptible instances for all nodes and enable automatic restart

B.Use standard instances with autoscaling based on YARN memory

C.Use all preemptible instances and configure the cluster to delete after the job completes

D.Use a single-node cluster with a high-memory machine type

AnswerA

Preemptible instances are 60-80% cheaper, and automatic restart allows the job to continue after a preemption.

Why this answer

Preemptible instances cost roughly 60-80% less than standard instances, making them the most cost-effective choice for fault-tolerant, idempotent batch jobs. Enabling automatic restart ensures that if a preemptible instance is terminated (which can happen at any time), Dataproc will automatically recreate it, maintaining cluster capacity without manual intervention. This design minimizes cost while preserving the job's ability to complete despite node failures.

Exam trap

Google Cloud often tests the misconception that deleting the cluster after the job completes is the primary cost-saving measure, but the trap here is that without automatic restart, preemptible instances alone can cause job failure due to node preemption, negating cost benefits.

How to eliminate wrong answers

Option B is wrong because standard instances are significantly more expensive than preemptible instances, and autoscaling based on YARN memory does not reduce cost as effectively as using preemptible instances for a fault-tolerant batch job. Option C is wrong because configuring the cluster to delete after the job completes is a good practice for cost savings, but using all preemptible instances without enabling automatic restart risks job failure if preemptible instances are reclaimed, as the cluster may lose nodes and become unable to complete the job. Option D is wrong because a single-node cluster with a high-memory machine type is not cost-effective for processing large log files; it lacks fault tolerance and scalability, and high-memory instances are expensive compared to using multiple preemptible instances.

716

MCQmedium

A company has deployed a machine learning model on Vertex AI Prediction that serves real-time predictions for a customer-facing application. The model was trained using a custom container and is hosted on a single endpoint with a minimum number of nodes. Recently, the team noticed that during peak traffic, prediction latency increases significantly and some requests time out. The endpoint is configured with a baseline traffic split of 100% on the current model version. Which action should the team take to reduce latency and improve reliability?

A.Reduce the minimum number of nodes to zero to allow scale-to-zero during low traffic.

B.Place a Google Cloud Load Balancer in front of the Vertex AI endpoint to distribute requests across multiple endpoints.

C.Configure horizontal autoscaling with a higher maximum number of nodes and set a CPU utilization target.

D.Implement A/B testing by splitting traffic between two model versions to distribute load.

AnswerC

Autoscaling allows the endpoint to add nodes during high traffic, reducing latency and preventing timeouts.

Why this answer

Option C is correct because configuring horizontal autoscaling with a higher maximum number of nodes and a CPU utilization target allows Vertex AI Prediction to automatically add more nodes during peak traffic, distributing the inference load and reducing latency. This directly addresses the root cause—insufficient compute resources under high demand—without requiring architectural changes or sacrificing availability.
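
How a CPU utilization target translates into a node count can be illustrated with the proportional rule used by common autoscalers (for example, the Kubernetes Horizontal Pod Autoscaler). This is a sketch under assumptions: the source does not describe Vertex AI's exact scaling algorithm, and the function name and parameters below are hypothetical.

```python
def desired_replicas(current: int, cpu_percent: int, target_percent: int,
                     min_nodes: int, max_nodes: int) -> int:
    """Proportional scaling: size the fleet so average CPU lands near the target."""
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")
    # Integer ceiling of current * cpu / target, avoiding float rounding issues.
    raw = -(-current * cpu_percent // target_percent)
    # Clamp to the configured minimum and maximum node counts.
    return max(min_nodes, min(max_nodes, raw))
```

With 2 nodes at 90% CPU and a 60% target, the rule asks for 3 nodes. A maximum that is too low caps the result, which is why raising the maximum matters as much as setting the target.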

Exam trap

The trap here is that candidates often confuse load balancing (Option B) with autoscaling, thinking that distributing requests across multiple endpoints is the same as adding more compute capacity, but Vertex AI endpoints are single resources that cannot be fronted by a load balancer to increase capacity—they require autoscaling to add nodes.

How to eliminate wrong answers

Option A is wrong because reducing the minimum number of nodes to zero would cause cold starts when traffic arrives, increasing latency rather than reducing it, and scale-to-zero is not suitable for a customer-facing application requiring real-time predictions. Option B is wrong because placing a Google Cloud Load Balancer in front of a single Vertex AI endpoint does not distribute requests across multiple endpoints—it would only add unnecessary network hops and complexity without solving the resource bottleneck. Option D is wrong because A/B testing splits traffic between model versions for evaluation purposes, not for load distribution; it does not increase the total compute capacity available to handle peak traffic.

717

MCQmedium

A company needs to store petabytes of time-series IoT sensor data and query it with single-digit millisecond latency at millions of reads per second. The data has a simple key-value structure with timestamps. Which Google Cloud database is MOST appropriate?

A.Cloud Bigtable

B.BigQuery

C.Cloud Spanner

D.Firestore

AnswerA

Bigtable is the correct choice: wide-column NoSQL, designed for time-series and IoT workloads, single-digit ms latency, and scales to millions of QPS with additional nodes.

Why this answer

Cloud Bigtable is a fully managed, scalable NoSQL database designed for large analytical and operational workloads, handling petabytes of data with consistent sub-10ms latency at millions of reads per second. Its key-value model with timestamps directly matches the time-series IoT sensor data structure, and it supports high-throughput, low-latency access via the HBase API or Bigtable client libraries.
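
Because Bigtable stores rows sorted by key, the way the key combines sensor ID and timestamp determines how efficiently time-series reads work. The following is a minimal sketch of one commonly used layout (sensor ID plus a reversed timestamp so the newest reading sorts first). The key format, the `row_key` helper, and the 13-digit bound are assumptions for illustration and are not taken from the question.

```python
# Upper bound for epoch milliseconds that fits in 13 digits (assumption).
MAX_TS_MS = 10**13 - 1


def row_key(sensor_id: str, ts_ms: int) -> str:
    """Build a key like 'sensor-42#8299999999999' with the newest reading first."""
    if not 0 <= ts_ms <= MAX_TS_MS:
        raise ValueError("timestamp out of supported range")
    # Subtracting from the maximum reverses the sort order of timestamps.
    reversed_ts = MAX_TS_MS - ts_ms
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{sensor_id}#{reversed_ts:013d}"
```

Leading with the sensor ID rather than the timestamp spreads writes across many key ranges instead of concentrating them on whichever range holds the current time.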

Exam trap

Google Cloud often tests the distinction between operational databases (Bigtable) and analytical warehouses (BigQuery), so the trap here is assuming that 'petabytes of data' automatically requires a data warehouse like BigQuery, ignoring the real-time, low-latency key-value access pattern that Bigtable is purpose-built for.

How to eliminate wrong answers

Option B (BigQuery) is wrong because it is a serverless data warehouse optimized for analytical SQL queries on large datasets, not for single-digit millisecond point reads at millions of operations per second; it incurs higher latency (typically hundreds of milliseconds) and is not designed for real-time key-value lookups. Option C (Cloud Spanner) is wrong because it is a globally distributed relational database with strong consistency and SQL support, but its latency and throughput for simple key-value reads are higher than Bigtable, and it is overkill for time-series data that does not require relational joins or transactions. Option D (Firestore) is wrong because it is a mobile and web document database optimized for real-time updates and moderate throughput, not for petabyte-scale time-series data with millions of reads per second; it has throughput limits (e.g., 10,000 writes/second per database) and higher latency for such high-volume workloads.

718

MCQmedium

Your organization uses Vertex AI Feature Store to serve features for a real-time fraud detection model. The model is deployed on a Vertex AI endpoint. After a data pipeline update, the model's online predictions became inconsistent. What is the most likely cause?

A.The model's prediction server is running out of memory.

B.The feature store's online serving values are not synchronized with the batch feature values used during training.

C.The model was retrained with a different training dataset.

D.The online serving endpoint's model version was accidentally rolled back.

AnswerB

If the pipeline update changed how features are computed or stored, online serving might use out-of-sync values, leading to inconsistent predictions.

Why this answer

In Vertex AI Feature Store, batch feature values used during model training and online serving values are stored separately. If a data pipeline update changes the batch feature values but the online serving values are not updated or synchronized, the model will receive different feature values at inference time than it was trained on, leading to inconsistent predictions. This is the most common cause of prediction drift after a pipeline change.

Exam trap

The trap here is that candidates may confuse a data pipeline update with a model retraining or version rollback, but the key is recognizing that feature store synchronization between batch and online stores is a distinct operational concern that directly causes prediction inconsistency.

How to eliminate wrong answers

Option A is wrong because running out of memory on the prediction server would cause errors or timeouts, not inconsistent predictions; the model would either fail or produce no output, not produce varying results. Option C is wrong because retraining with a different dataset would produce a new model version, but the question states predictions became inconsistent after a data pipeline update, not after a retraining event; a retrained model would be deployed as a new version, not cause inconsistency in the existing model's outputs. Option D is wrong because a rollback of the model version would revert to a previous consistent state, not introduce inconsistency; the predictions would be consistent with the older model version, not inconsistent.

719

MCQeasy

A company runs a batch ETL pipeline on Cloud Dataproc. During peak hours, the job takes longer than expected. The pipeline reads from Cloud Storage, transforms data, and writes to BigQuery. What is the most cost-effective way to improve performance without redesigning the pipeline?

A.Add a secondary worker group using preemptible VMs and increase the number of workers.

B.Enable local SSDs on all worker nodes.

C.Increase the master node's machine type to n1-highmem-32.

D.Use Cloud Composer to schedule the job with a higher priority.

AnswerA

Preemptible VMs are cost-effective and add parallelism.

Why this answer

Adding a secondary worker group with preemptible VMs is the most cost-effective way to improve performance because it allows you to scale out the cluster horizontally with compute instances that are significantly cheaper (up to 80% discount) than regular VMs. This directly addresses the bottleneck of processing capacity during peak hours without requiring any pipeline redesign, as Cloud Dataproc can automatically distribute work across additional workers.

Exam trap

The trap here is that candidates assume scaling up the master node or improving local storage will help, but the exam tests understanding that horizontal scaling with cheap, ephemeral workers is the most cost-effective approach for batch processing workloads that are CPU-bound and fault-tolerant.

How to eliminate wrong answers

Option B is wrong because enabling local SSDs on all worker nodes improves I/O performance for intermediate data, but the pipeline reads from Cloud Storage and writes to BigQuery, which are network-based operations; the bottleneck is CPU/memory for transformation, not local disk speed, making this an expensive upgrade with minimal impact. Option C is wrong because increasing the master node's machine type to n1-highmem-32 only improves the coordination and management of the cluster, not the actual data processing capacity; the master node does not perform data transformation work, so this does not address the performance bottleneck. Option D is wrong because Cloud Composer is a workflow orchestration tool that schedules and monitors jobs, but it does not directly improve the runtime performance of the ETL pipeline; setting a higher priority only affects scheduling order, not execution speed.

720

MCQeasy

Which Google Cloud service provides a serverless Spark environment where you can run Spark jobs without provisioning or managing a cluster?

A.Dataflow

B.Dataproc Serverless

C.Dataprep

D.Cloud Data Fusion

AnswerB

Dataproc Serverless provides a fully managed, serverless Spark runtime. You submit jobs and Google Cloud manages the cluster.

Why this answer

Dataproc Serverless allows you to submit Spark jobs that run on auto-scaled infrastructure without cluster management. Dataflow is for Beam pipelines. Dataprep is for data wrangling. Data Fusion is for visual ETL.

721

MCQmedium

You are designing a Dataflow pipeline to process streaming data. The pipeline may encounter malformed records. You need to handle these errors without failing the entire pipeline and store the bad records for later analysis. What is the best practice?

A.Use a dead letter sink to write malformed records to a separate Pub/Sub topic or GCS location.

B.Catch the exception and log it, then continue processing.

C.Write all records to BigQuery using the Storage Write API and handle errors in the write operation.

D.Raise an exception in the DoFn to stop the pipeline for manual intervention.

AnswerA

This is the recommended pattern: isolate bad records for later reprocessing while allowing the pipeline to continue.

Why this answer

Dead letter sinks are a common pattern: route erroneous records to a separate output (e.g., Pub/Sub topic or GCS) for later investigation. Writing to BigQuery using Storage Write API with error handling is good, but for malformed records you want to isolate them. Raising exceptions would fail the pipeline. Logging only loses the data.
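
The dead letter pattern can be sketched in plain Python, independent of any pipeline framework. The `route_records` function and the shape of the dead-letter entries below are hypothetical. In an actual Beam pipeline the same split would typically be expressed with a main output and an additional tagged output, each written to its own sink.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def route_records(
    raw_records: Iterable[str],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Split raw JSON strings into parsed records and dead-letter entries."""
    good: List[Dict[str, Any]] = []
    dead: List[Dict[str, str]] = []
    for raw in raw_records:
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            good.append(parsed)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Keep the original payload and the reason so it can be replayed later.
            dead.append({"raw": raw, "error": str(exc)})
    return good, dead
```

Keeping the raw payload alongside the error is what distinguishes this from logging only: the bad record remains available for reprocessing once the cause is fixed.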

722

MCQhard

A data engineering team uses Cloud Composer (Airflow) for workflow orchestration. They notice DAG runs frequently fail, and the error indicates insufficient Airflow workers. The team wants to ensure reliable execution. Which approach best addresses the issue?

A.Switch from Cloud Composer to Cloud Scheduler for simpler workloads.

B.Reduce the concurrency of all DAGs to fit within available workers.

C.Use the GKE-based Composer environment, which provides autoscaling of Airflow workers.

D.Increase the parallelism setting in the Airflow configuration.

AnswerC

GKE-based Composer auto-scales worker pods, handling variable loads effectively.

Why this answer

Option C is correct because Cloud Composer environments backed by GKE (Google Kubernetes Engine) can automatically scale the number of Airflow workers based on the workload. This autoscaling capability directly addresses the 'insufficient Airflow workers' error by dynamically adding worker pods when the queue of tasks grows, ensuring reliable execution without manual intervention.

Exam trap

Google Cloud often tests the distinction between configuration parameters that control task scheduling (like `parallelism` or `concurrency`) and infrastructure-level scaling mechanisms; the trap here is assuming that increasing `parallelism` alone can resolve worker shortages, when in fact it only increases the demand on a fixed pool of workers.

How to eliminate wrong answers

Option A is wrong because Cloud Scheduler is a cron-like service for simple, scheduled jobs and lacks the workflow orchestration, retries, and dependency management that Airflow provides; switching to it would not solve worker scaling issues and would lose critical DAG functionality. Option B is wrong because reducing DAG concurrency only limits the number of tasks that can run simultaneously, which may prevent the error but also reduces throughput and does not address the root cause of needing more workers to handle the actual workload. Option D is wrong because increasing the `parallelism` setting in Airflow configuration tells the scheduler how many tasks can run at once, but if the underlying worker infrastructure (e.g., the number of Celery workers or Kubernetes pods) is fixed and insufficient, tasks will still fail due to resource exhaustion; parallelism alone does not add compute capacity.

723

Multi-Selecthard

A large enterprise is migrating its data warehouse from Teradata to BigQuery. They need to transfer historical data (100 TB) and set up ongoing daily incremental loads. They also need to transform the data using dbt. Which THREE Google Cloud services should they use?

Select 3 answers

A.Datastream

B.BigQuery Data Transfer Service for Teradata

C.Transfer Appliance

D.dbt (data build tool)

E.Cloud Composer

AnswersB, D, E

BigQuery Data Transfer Service for Teradata supports both backfill and incremental transfers from Teradata to BigQuery.

Why this answer

BigQuery Data Transfer Service for Teradata handles both historical and incremental loads, dbt runs on BigQuery for transformations, and Cloud Composer can orchestrate the dbt runs on a schedule.

724

MCQmedium

You need to create a reusable Dataflow pipeline for transforming CSV files in Cloud Storage into Avro files in another bucket. The pipeline should be configurable via runtime parameters (e.g., input and output paths). Which approach should you use?

A.Use Cloud Functions triggered by Cloud Storage events.

B.Create a Dataflow Classic Template with the pipeline code and parameters.

C.Use Cloud Run Jobs to run the transformation as a container.

D.Create a Dataflow Flex Template with a Docker image and parameterized metadata.

AnswerD

Flex Templates allow full customization and runtime parameters, making the pipeline reusable.

Why this answer

Option D is correct because Dataflow Flex Templates allow you to package your pipeline code and dependencies into a Docker image, and define parameterized metadata (e.g., input and output paths) that are exposed as runtime parameters. This approach provides full customization of the execution environment and supports reusable, configurable pipelines for transforming CSV to Avro across different Cloud Storage buckets.
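
To make "parameterized metadata" concrete, the sketch below builds a metadata document of the kind a Flex Template uses to declare its runtime parameters. The field names (`name`, `parameters`, `label`, `helpText`, `isOptional`) follow the metadata format as commonly documented and should be verified against the current Dataflow documentation. The template name, the parameter names `input_path` and `output_path`, and the `build_metadata` helper are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def build_metadata(name: str, params: List[Tuple[str, str]]) -> Dict[str, object]:
    """Describe required runtime parameters as (parameter_name, help_text) pairs."""
    return {
        "name": name,
        "parameters": [
            {
                "name": param,
                # Human-readable label derived from the parameter name.
                "label": param.replace("_", " ").title(),
                "helpText": help_text,
                "isOptional": False,
            }
            for param, help_text in params
        ],
    }


metadata = build_metadata(
    "csv-to-avro",
    [
        ("input_path", "Cloud Storage path of the CSV files to read."),
        ("output_path", "Cloud Storage path prefix for the Avro output."),
    ],
)
metadata_json = json.dumps(metadata, indent=2)
```

Each launch of the template can then supply different values for the declared parameters without rebuilding the container image.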

Exam trap

Google Cloud often tests the distinction between Classic Templates (runtime parameters limited to `ValueProvider` support, no custom Docker image) and Flex Templates (full parameterization, custom Docker image), and the trap here is that candidates may choose Classic Templates because they are simpler, overlooking the requirement for configurable runtime parameters and reusable custom transformations.

How to eliminate wrong answers

Option A is wrong because Cloud Functions triggered by Cloud Storage events are stateless, have a limited execution timeout (9 minutes for event-driven functions), and are not designed for long-running or complex data transformations like converting CSV to Avro; they also lack Dataflow's parallel processing and schema handling. Option B is wrong because Dataflow Classic Templates freeze the pipeline graph when the template is staged, accept runtime parameters only through the `ValueProvider` interface, and do not support custom Docker images or advanced dependency management, making them less flexible than Flex Templates. Option C is wrong because Cloud Run Jobs simply run a container to completion and are not a distributed data-processing service; they lack Dataflow's autoscaling, parallel work distribution, and built-in file I/O connectors.

725

MCQmedium

A Dataflow batch job fails consistently with the error shown. The job uses a custom container image and runs in a VPC with a private IP. What should the engineer do to resolve the issue?

A.Request a CPU quota increase in the region.

B.Verify that the VPC has Private Google Access enabled and that Cloud NAT is configured for outbound internet access if needed.

C.Rebuild the custom container image and upload it to Container Registry.

D.Check that the custom image is based on the latest Dataflow SDK version.

AnswerB

In a private VPC, workers need connectivity to Dataflow API and container registry.

Why this answer

The error indicates that the Dataflow batch job cannot access required resources (e.g., container image, dependencies) because the VPC with private IPs lacks outbound internet connectivity. Option B is correct because enabling Private Google Access allows the VMs to reach Google APIs (like Container Registry) via the Google network, and Cloud NAT provides outbound internet access for non-Google APIs or external dependencies. Without these, the job fails to pull the custom container image or download necessary artifacts.

Exam trap

The trap here is that candidates often assume the error is due to the container image or SDK version, overlooking the VPC networking prerequisites (Private Google Access and Cloud NAT) that are required for Dataflow jobs using private IPs.

How to eliminate wrong answers

Option A is wrong because a CPU quota increase would not resolve connectivity issues; the error is about network access, not resource limits. Option C is wrong because rebuilding the container image does not fix the underlying network configuration problem; the image itself is not the cause of the failure. Option D is wrong because the Dataflow SDK version in the custom image is irrelevant to VPC networking; the job fails due to lack of outbound connectivity, not SDK compatibility.

726

MCQhard

You manage a large-scale machine learning system that recommends products to users. The model is a deep neural network trained on TensorFlow and deployed on Vertex AI Endpoint with global load balancing. The model receives over 10,000 requests per second. Recently, the team added a new feature: the user's current geographic location (latitude/longitude). After deploying the updated model, you notice that the average prediction latency has doubled, and the error rate has increased, particularly for requests from regions far from the model's primary training data (North America). You suspect the location feature is causing issues. What should you do to diagnose and mitigate the problem?

A.Remove the location feature from the model and retrain without it to restore performance.

B.Increase the number of replicas for the endpoint to handle the increased latency.

C.Switch to a regional endpoint in North America to reduce latency for the majority of users.

D.Examine the latency breakdown using Cloud Monitoring to see if the location feature is causing computationally expensive operations, then consider feature engineering like bucketing coordinates.

AnswerD

Understanding the latency source and engineering the feature properly can resolve the issue without sacrificing model accuracy.

727

MCQeasy

A company needs to process streaming data from IoT devices with sub-second latency and exactly-once processing guarantees. Which Google Cloud service should they use?

A.BigQuery

B.Cloud Dataproc

C.Cloud Dataflow

D.Cloud Pub/Sub

AnswerC

Dataflow supports streaming with auto-scaling and exactly-once processing, meeting the requirements.

Why this answer

Cloud Dataflow is the correct choice because it provides a unified stream and batch processing model with exactly-once processing guarantees and sub-second latency via its Apache Beam SDK. It supports event-time processing, watermarks, and triggers to handle out-of-order data from IoT devices while ensuring each record is processed exactly once, even in the case of failures.

Exam trap

Google Cloud often tests the distinction between data ingestion (Pub/Sub) and data processing (Dataflow), so the trap here is that candidates confuse Pub/Sub's streaming ingestion capability with the processing guarantees needed for exactly-once semantics.

How to eliminate wrong answers

Option A is wrong because BigQuery is a serverless data warehouse designed for analytical queries on large datasets, not for real-time stream processing with sub-second latency and exactly-once guarantees; it can ingest streaming data but does not provide the fine-grained per-record processing semantics required. Option B is wrong because Cloud Dataproc is a managed Hadoop/Spark service that can process streaming data via Spark Streaming, but it does not natively guarantee exactly-once processing out of the box and typically has higher latency due to micro-batching. Option D is wrong because Cloud Pub/Sub is a messaging and ingestion service that provides at-least-once delivery by default and does not perform data processing; it is a transport layer, not a processing engine.

728

MCQhard

A company is building a continuous training pipeline that retrains a model daily using new data from a feature store. The training data must include features computed up to the timestamp of each training run. Which architecture should be used to ensure time-consistent feature values without label leakage?

A.Train on a fixed window of the most recent features without considering timestamps.

B.Use Vertex AI Feature Store with point-in-time lookup enabled to retrieve features as of the training timestamp.

C.Store all features in a Cloud SQL database and perform a join at training time.

D.Use Pub/Sub to stream new features into Cloud Storage and train on the latest snapshot.

AnswerB

Point-in-time lookups ensure that for each training example, features are retrieved as they existed at the prediction time, preventing leakage.

Why this answer

Option B is correct because Vertex AI Feature Store's point-in-time lookup retrieves the exact feature values as they existed at the specified training timestamp, ensuring time-consistency and preventing label leakage. This mechanism avoids using future data that would not have been available at the time of prediction, which is critical for realistic model evaluation and production performance.
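
The semantics of a point-in-time lookup can be shown with a small, library-free sketch: for a given timestamp, return the most recent feature value recorded at or before it, and never a later one. The `point_in_time_lookup` function and the `(timestamp, value)` history layout are assumptions for illustration and are not the Feature Store API.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_lookup(
    history: List[Tuple[int, float]], as_of: int
) -> Optional[float]:
    """Return the latest value with timestamp <= as_of, or None if none exists.

    `history` must be sorted by timestamp in ascending order.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first entry strictly after as_of.
    idx = bisect_right(timestamps, as_of)
    # Everything at or after idx is "future" data and must not be used.
    return history[idx - 1][1] if idx else None
```

A join that simply takes the latest value would return the most recent entry regardless of `as_of`, which is exactly how future information leaks into training examples.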

Exam trap

Google Cloud often tests the misconception that simply using the most recent data or a snapshot is sufficient for time-consistency, but the key requirement is to retrieve features as of the exact training timestamp to prevent label leakage, which only point-in-time lookup guarantees.

How to eliminate wrong answers

Option A is wrong because training on a fixed window of the most recent features without considering timestamps can introduce label leakage by including future feature values relative to the label timestamp, and it ignores the temporal ordering required for time-series data. Option C is wrong because storing all features in Cloud SQL and performing a join at training time lacks point-in-time semantics, meaning the join may inadvertently use features from after the label timestamp, causing leakage and inconsistent feature values. Option D is wrong because using Pub/Sub to stream new features into Cloud Storage and training on the latest snapshot does not guarantee that features are retrieved as of the exact training timestamp; the snapshot may include data that arrived after the label was generated, leading to leakage.

729

MCQmedium

A financial services firm uses Cloud Pub/Sub to ingest real-time market data. The data is processed by a Cloud Dataflow streaming pipeline that aggregates trades per symbol and writes to BigQuery. The pipeline currently uses a single global window with a trigger that fires every minute. The firm now needs to support late data up to 5 minutes and also wants to reduce the number of writes to BigQuery to avoid hitting the table limit of 1,500 inserts per second. The current pipeline writes every minute, which is acceptable for inserts per second, but after adding late data handling, the number of writes doubles. How can you redesign the pipeline to handle late data while keeping write volume low?

A.Use fixed windows of 5 minutes with allowed lateness 5 minutes and trigger every 30 seconds

B.Increase the global window duration to 10 minutes and keep the same trigger

C.Discard all late data and keep the current windowing

D.Use session windows with a gap duration of 5 minutes and a count-based trigger that fires after accumulating 1000 elements

AnswerD

Session windows group events; count-based trigger reduces writes by batching.

Why this answer

Option D is correct because session windows naturally group events into bursts of activity separated by a gap duration (5 minutes), which reduces the number of writes by accumulating many trades per symbol before emitting a pane. Adding a count-based trigger that fires after 1000 elements further limits write frequency, keeping the insert rate well below BigQuery's 1,500 per second limit while still allowing late data up to the gap duration. This design handles late data implicitly within the session gap and avoids the write amplification seen with fixed windows and frequent triggers.

Exam trap

The trap here is that candidates assume fixed windows with allowed lateness are the only way to handle late data, overlooking that session windows naturally accommodate late arrivals while reducing write frequency through event grouping and count-based triggers.

How to eliminate wrong answers

Option A is wrong because fixed windows of 5 minutes with a trigger every 30 seconds would increase the number of writes (up to 10 panes per window per key, plus further panes during the allowed-lateness period) rather than reduce them, exacerbating the BigQuery insert rate issue. Option B is wrong because increasing the global window duration to 10 minutes does not change the trigger frequency (still every minute), so the number of writes remains the same and late data handling is not addressed. Option C is wrong because discarding late data violates the requirement to support late data up to 5 minutes and is not a valid redesign for the stated need.
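
The write amplification from frequent triggers is simple arithmetic. The sketch below gives a rough upper bound on panes emitted per window per key, assuming a repeating time-based trigger that keeps firing through the allowed-lateness period and that every firing has new data to emit. The helper name is hypothetical.

```python
def max_panes_per_key(window_s: int, trigger_s: int,
                      allowed_lateness_s: int = 0) -> int:
    """Upper bound on trigger firings for one window and one key."""
    if trigger_s <= 0:
        raise ValueError("trigger_s must be positive")
    # The trigger can fire for as long as the window accepts data.
    return (window_s + allowed_lateness_s) // trigger_s
```

For a 5-minute window with a 30-second trigger, the bound is 10 panes within the window and 20 once 5 minutes of allowed lateness are included, compared with 5 for a 1-minute trigger and no lateness.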

730

MCQmedium

A data team needs to run complex analytical queries on a dataset that is frequently updated with new rows. They want to minimize query costs and avoid scanning old data that is rarely queried. Which BigQuery feature should they use?

A.Partitioned tables with partition expiration

B.BigQuery materialized views

C.Clustered tables

D.BigQuery BI Engine

AnswerA

Partitioning allows querying only relevant partitions, and partition expiration can automatically delete old partitions.

Why this answer

Partitioned tables with partition expiration allow you to divide a table into segments based on a date/timestamp column, and automatically delete partitions that are older than a specified duration. This minimizes query costs by only scanning relevant partitions and eliminates storage costs for old, rarely queried data without manual intervention.
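
Why partitioning reduces query cost can be illustrated with a toy model in which each daily partition has a known size and a query with a date filter reads only the matching partitions. The `bytes_scanned` helper and the sizes are made up for the example. Actual billing depends on the columns read and other factors not modeled here.

```python
from datetime import date
from typing import Dict


def bytes_scanned(partitions: Dict[date, int], start: date, end: date) -> int:
    """Sum the sizes of partitions whose date falls in [start, end]."""
    return sum(size for day, size in partitions.items() if start <= day <= end)


# Hypothetical daily partitions and their sizes in bytes.
partitions = {
    date(2024, 1, 1): 100,
    date(2024, 1, 2): 200,
    date(2024, 1, 3): 300,
}

pruned = bytes_scanned(partitions, date(2024, 1, 2), date(2024, 1, 3))
full_scan = sum(partitions.values())
```

Here the filtered query reads 500 bytes instead of 600. Partition expiration shrinks the table itself by dropping partitions past the configured age.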

Exam trap

Google Cloud often tests the distinction between performance optimization features (clustering, materialized views, BI Engine) and data lifecycle management features (partition expiration), leading candidates to choose a performance feature when the question explicitly asks about minimizing costs and avoiding scanning old data.

How to eliminate wrong answers

Option B is wrong because BigQuery materialized views precompute and cache query results for faster reads, but they do not automatically expire old data or reduce storage costs for rarely queried rows. Option C is wrong because clustered tables sort data within partitions to improve query performance and reduce bytes scanned, but they do not provide automatic deletion of old data or partition expiration. Option D is wrong because BigQuery BI Engine is an in-memory analysis service that accelerates interactive queries but does not manage data lifecycle or expiration of old rows.

731

MCQmedium

A data engineer needs to store quarterly financial data that must remain immutable for 7 years to meet regulatory compliance. The data is accessed infrequently after the first year. Which Cloud Storage feature should be used to enforce immutability?

A.Object Lifecycle Rules

B.Object Lock with Retention Policy

C.Object Versioning

D.IAM Conditions

AnswerB

A retention policy on the bucket, made permanent with Cloud Storage's Bucket Lock feature, prevents objects from being deleted or overwritten for a specified duration (WORM).

Why this answer

A locked retention policy enforces immutability by preventing object deletion or replacement until each object reaches the configured age. For regulatory compliance requiring 7-year immutable storage, a bucket-level retention policy with a 7-year retention period, locked with Bucket Lock, ensures that objects cannot be overwritten or deleted, even by project owners, and that the policy itself cannot be removed or shortened. This directly meets the requirement for data that must remain unchanged for the full 7-year period.

Exam trap

Google Cloud often tests the misconception that Object Versioning alone provides immutability. Versioning only protects against accidental deletion by preserving old versions; it does not prevent intentional deletion or overwriting of the current version, which is required for true immutability.

How to eliminate wrong answers

Option A is wrong because Object Lifecycle Rules manage transitions and deletions based on age or other criteria, but they do not enforce immutability—objects can still be modified or deleted by users with appropriate permissions. Option C is wrong because Object Versioning preserves previous versions of objects but does not prevent deletion or overwriting of the current version; it allows recovery but does not enforce a write-once, read-many (WORM) model. Option D is wrong because IAM Conditions control access based on attributes like IP address or time, but they do not prevent deletion or modification of objects; they only restrict who can perform actions, not enforce data immutability.

732

MCQhard

You are optimizing a BigQuery query that scans 1 TB of data every day. The query joins a large fact table (partitioned by date) with a small dimension table. You notice that the query always scans the entire fact table, even though you only need the last 7 days of data. Which optimization will MOST reduce the bytes scanned?

A.Create a materialized view that pre-aggregates the data by day.

B.Add a WHERE clause that filters on the date column used for partitioning.

C.Cluster the fact table on the join key used in the query.

D.Change the table to use time-unit column partitioning with a 1-day partition interval.

AnswerB

This enables partition pruning, so only the last 7 days' partitions are scanned, reducing bytes from 1 TB to ~19 GB (about 1/52, assuming the table holds roughly a year of evenly sized daily partitions).

Why this answer

Using a WHERE clause on the partitioning column (e.g., date) allows BigQuery to perform partition pruning, drastically reducing scanned bytes. Clustering on the join key can improve performance but does not reduce bytes scanned as much as partition pruning. Materialized views precompute aggregations but may not reduce scans if the underlying query still reads the full table.

The fact table is already partitioned by date, so changing the partitioning scheme does not help: without a filter on the partitioning column, every partition is still scanned.
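The size of the saving can be estimated with a few lines of arithmetic. The sketch below assumes a 1 TB table (taken as 1000 GB) spread evenly over 365 daily partitions; both the even distribution and the one-year span are assumptions for illustration, and the helper name `scanned_gb` is hypothetical.

```python
def scanned_gb(table_gb: float, total_days: int, days_needed: int) -> float:
    """Estimate bytes scanned (in GB) when only `days_needed` daily
    partitions out of `total_days` equally sized partitions are read."""
    return table_gb * days_needed / total_days

full_scan = scanned_gb(1000, 365, 365)   # no partition filter: whole table
pruned_scan = scanned_gb(1000, 365, 7)   # filter on the last 7 days

print(f"full scan:   {full_scan:.0f} GB")
print(f"pruned scan: {pruned_scan:.1f} GB")
```

Under these assumptions the filtered query reads roughly one fifty-second of the table, which matches the ~19 GB figure quoted in the answer.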

733

MCQhard

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

A.Use a Cloud Bigtable table as a side input via a RichSDF.

B.Use a side input from a PCollection and broadcast it.

C.Increase the number of workers to distribute the side input.

D.Increase the worker memory to 16 GB per worker.

AnswerA

Bigtable provides scalable key-value lookups without loading all data into memory.

Why this answer

Option A is correct because using a Cloud Bigtable table as a side input via a RichSDF (Rich Splittable DoFn) allows the pipeline to perform point lookups on the large (10 GB) lookup table without loading it entirely into worker memory. This avoids OOM errors and reduces latency by leveraging Bigtable's low-latency, scalable key-value storage, which is ideal for high-throughput streaming pipelines that require frequent, random access to a large, frequently updated dataset.
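The difference between broadcasting the table and looking keys up on demand can be sketched in plain Python. This is not Beam or Bigtable code: `EXTERNAL_STORE` is a hypothetical in-memory stand-in for the external key-value store, and `lookup` and `enrich` are illustrative names. The point is only that worker memory is bounded by the cache size rather than by the size of the lookup table.

```python
from functools import lru_cache

# Hypothetical stand-in for an external key-value store such as Bigtable.
EXTERNAL_STORE = {f"key-{i}": f"value-{i}" for i in range(10_000)}
store_reads = 0  # counts how often the "remote" store is actually queried

@lru_cache(maxsize=100)  # memory is bounded by maxsize, not by the table size
def lookup(key: str):
    """Point lookup against the external store, cached per worker."""
    global store_reads
    store_reads += 1
    return EXTERNAL_STORE.get(key)

def enrich(events):
    """Attach the looked-up value to each incoming event key."""
    return [(event, lookup(event)) for event in events]

events = ["key-1", "key-2", "key-1", "key-1", "key-3"]
enriched = enrich(events)
print(enriched)
print("store reads:", store_reads)
```

Because the scenario's table changes hourly, a real pipeline would also need some way to expire cached entries; that is left out of this sketch.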

Exam trap

The trap here is that candidates often assume increasing resources (memory or workers) is the solution to memory pressure, but the real issue is the architectural pattern of broadcasting a large, frequently updated dataset—requiring a shift to an external, queryable store like Bigtable.

How to eliminate wrong answers

Option B is wrong because broadcasting a 10 GB PCollection as a side input would require every worker to hold the entire lookup table in memory, causing OOM errors and high latency due to serialization and shuffle overhead. Option C is wrong because increasing the number of workers does not reduce the per-worker memory footprint of a broadcast side input; each worker still needs to load the full 10 GB table, so OOM errors persist. Option D is wrong because simply increasing worker memory to 16 GB per worker is a temporary workaround that does not scale—if the lookup table grows or multiple side inputs are used, OOM errors will recur, and it does not address the fundamental issue of loading the entire dataset into memory.

734

MCQmedium

An e-commerce application uses Cloud SQL (MySQL) for transaction processing. To improve read performance for reporting queries, the team wants to offload read traffic to a separate database instance that stays in sync with the primary. Which Cloud SQL feature should they use?

A.Configure a high availability (HA) replica

B.Use Cloud SQL's failover replica

C.Create a cross-region read replica

D.Enable automatic backups and point-in-time recovery

AnswerC

Read replicas can be in the same or different region and serve read-only traffic.

Why this answer

Option C is correct because Cloud SQL read replicas are designed to offload read traffic from the primary instance while staying in sync using asynchronous replication. Cross-region read replicas specifically allow you to place a replica in a different region, improving read performance for geographically distributed reporting queries without affecting the primary's transaction processing.

Exam trap

The trap here is that candidates confuse high availability (HA) replicas with read replicas, assuming that an HA standby can also serve read traffic, but in Cloud SQL the HA standby is not directly accessible for reads and is only used for failover.

How to eliminate wrong answers

Option A is wrong because a high availability (HA) replica is a synchronous standby that provides automatic failover for high availability, not a separate instance for offloading read traffic; it does not serve read queries independently. Option B is wrong because Cloud SQL does not have a separate 'failover replica' feature; failover is handled by the HA configuration, and the term is often confused with a read replica, but failover replicas are not used for read offloading. Option D is wrong because automatic backups and point-in-time recovery are disaster recovery features that protect data, not mechanisms to offload read traffic or improve query performance.

735

Multi-Selectmedium

A company is evaluating BigQuery for a data warehouse migration. They have a mix of reporting queries and ad-hoc analytical queries. They want to control query costs and prevent runaway queries. Which THREE strategies should they implement?

Select 3 answers

A.Grant authorized view access to limit data visibility

B.Set a custom quota for concurrent queries

C.Partition and cluster tables to reduce bytes processed

D.Create materialized views for all reporting queries

E.Use BigQuery reservations (flex slots) for predictable workloads

AnswersB, C, E

Custom quotas limit the number of concurrent queries or bytes processed, preventing excessive resource usage.

Why this answer

Custom quotas cap query usage. Reservation models (flex slots) provide dedicated resources and predictable pricing. Partitioning and clustering reduce data scanned, lowering cost.

Authorized views control access but not cost. Materialized views can reduce cost but are not a cost control mechanism per se.

736

MCQmedium

A company is designing a streaming pipeline using Dataflow to process real-time clickstream data. The pipeline reads from Pub/Sub, performs user sessionization using Apache Beam's Session window, and writes to BigQuery. The team notices that the pipeline's lag is growing and the worker utilization is low. What is the most likely cause and recommended fix?

A.Too many workers are created; reduce the number of workers.

B.The pipeline is not using autoscaling; enable autoscaling.

C.Insufficient disk space per worker; increase the boot disk size.

D.The session window gap duration is too large, causing excessive state per key; reduce the gap duration.

AnswerD

Large gap leads to long-lived state, causing lag and low utilization.

Why this answer

D is correct because a large session window gap duration causes Dataflow to maintain excessive state per key (user session), leading to high memory pressure and slow processing. This results in growing pipeline lag despite low worker utilization, as workers spend more time managing state than processing data. Reducing the gap duration limits the state size and improves throughput.
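The effect of the gap duration on session size can be illustrated with a small plain-Python sketch. It is not Apache Beam code; `sessionize` is a hypothetical helper that groups one user's event timestamps (in seconds) into sessions whenever consecutive events are closer together than the gap.

```python
def sessionize(timestamps, gap_seconds):
    """Group event timestamps into sessions: a new session starts when the
    time since the previous event is at least `gap_seconds`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap_seconds:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # gap exceeded: open a new session
    return sessions

# Hypothetical click times for a single user, in seconds.
clicks = [0, 20, 50, 400, 430, 2000]

short_gap = sessionize(clicks, gap_seconds=60)
long_gap = sessionize(clicks, gap_seconds=1800)

print("60 s gap:  ", short_gap)
print("1800 s gap:", long_gap)
```

With the larger gap every click falls into one long-lived session, so the state held for that key keeps growing and the session cannot be emitted until much later.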

Exam trap

Google Cloud often tests the misconception that low worker utilization means too many workers, but the real cause is often state bloat from session windows, not resource overprovisioning.

How to eliminate wrong answers

Option A is wrong because low worker utilization indicates workers are underutilized, not overprovisioned; reducing workers would worsen lag. Option B is wrong because autoscaling is enabled by default in Dataflow streaming pipelines, and low utilization suggests the issue is not scaling but state management. Option C is wrong because insufficient disk space typically causes worker failures or OOM errors, not low utilization with growing lag; the symptom here points to state size, not disk I/O.

737

MCQeasy

A company wants to migrate 500 TB of on-premises archival data to Cloud Storage. The data is stored on a SAN and the network link is limited to 1 Gbps. The migration must complete within 10 days. What is the MOST cost-effective approach?

A.Set up a Cloud VPN and use rsync over the encrypted connection.

B.Use BigQuery Data Transfer Service to load the data directly into BigQuery.

C.Order a Transfer Appliance, copy data locally, and ship it to Google for ingestion.

D.Use Storage Transfer Service to copy data from on-premises to GCS over the existing network.

AnswerC

Transfer Appliance is designed for large offline transfers when network speed is a constraint.

Why this answer

Option C is correct because the Transfer Appliance is designed for large-scale data migrations where network bandwidth is insufficient. With 500 TB at 1 Gbps, the theoretical transfer time is over 46 days, far exceeding the 10-day window. The appliance allows you to physically ship the data, bypassing network constraints entirely, making it the most cost-effective and timely solution.
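The 46-day figure can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the 70% utilization case is an illustrative assumption, not a figure from the question, and `transfer_days` is a hypothetical helper name.

```python
def transfer_days(size_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move `size_tb` terabytes over a `link_gbps` link
    running at the given fraction of its line rate."""
    bits = size_tb * 1e12 * 8                          # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)   # bits / (bits per second)
    return seconds / 86_400                            # seconds -> days

ideal = transfer_days(500, 1.0)
realistic = transfer_days(500, 1.0, utilization=0.7)

print(f"{ideal:.1f} days at full line rate")
print(f"{realistic:.1f} days at 70% utilization")
```

Even the ideal case is more than four times the 10-day window, which is why the network-based options can be ruled out before comparing their costs.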

Exam trap

The trap here is that candidates underestimate the time required for network transfer at 1 Gbps and overestimate the practicality of compression or incremental sync, failing to recognize that physical shipping is the only viable option for hundreds of terabytes within a tight deadline.

How to eliminate wrong answers

Option A is wrong because rsync over a 1 Gbps Cloud VPN would take approximately 46 days for 500 TB (assuming full utilization, which is unrealistic due to overhead and encryption), far exceeding the 10-day deadline. Option B is wrong because BigQuery Data Transfer Service is for loading data from SaaS applications (e.g., Google Ads, Amazon S3) or other cloud sources into BigQuery, not for ingesting on-premises archival data into Cloud Storage. Option D is wrong because Storage Transfer Service relies on the existing 1 Gbps network link, which would require over 46 days for 500 TB, violating the 10-day requirement.

738

MCQeasy

Your team is using Cloud Data Fusion to build batch ETL pipelines that load data from Cloud Storage into BigQuery. You have several pipelines that run daily. Recently, one pipeline started failing with a 'Permission denied' error when trying to read a new CSV file uploaded to a specific Cloud Storage bucket. Other pipelines using the same bucket succeed. The failing pipeline has a Cloud Storage source plugin that uses a service account with the roles/storage.objectViewer role. The bucket has uniform bucket-level access enabled. What is likely causing the issue?

A.Create a custom IAM role with storage.buckets.get and storage.objects.get permissions and assign it to the service account.

B.Check that the service account used by the failing pipeline's Data Fusion instance has the correct permissions, and ensure that the service account is the same as the one used by working pipelines.

C.Disable uniform bucket-level access and add bucket ACLs for the service account.

D.Add the service account as a member of the Cloud Storage bucket with the roles/storage.objectViewer role.

AnswerB

The root cause is likely a different service account or misconfiguration in the failing pipeline's Data Fusion instance.

Why this answer

The correct answer is B because the error is likely due to the Data Fusion instance's service account, not the source plugin's service account. In Cloud Data Fusion, the pipeline execution uses the service account attached to the Data Fusion instance itself to access Cloud Storage, even if the source plugin specifies a different service account. Since other pipelines using the same bucket succeed, the issue is that the failing pipeline's Data Fusion instance uses a service account that lacks the roles/storage.objectViewer role on the bucket, while working pipelines use an instance with the correct permissions.

Exam trap

Google Cloud often tests the misconception that the service account specified in a plugin (e.g., Cloud Storage source) is the one used for authentication, when in fact the Data Fusion instance's service account is the effective identity for all pipeline operations.

How to eliminate wrong answers

Option A is wrong because the roles/storage.objectViewer role already includes storage.objects.get permission, and storage.buckets.get is not required for reading objects; adding a custom role is unnecessary and does not address the root cause. Option C is wrong because disabling uniform bucket-level access and using ACLs is an outdated approach that contradicts best practices; the issue is not about access control mode but about which service account is being used. Option D is wrong because the service account used by the source plugin already has roles/storage.objectViewer on the bucket (as stated), but the pipeline fails because the Data Fusion instance's service account, not the plugin's service account, is the one making the request.

739

MCQmedium

A data pipeline is built with Cloud Dataflow that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is experiencing high latency and occasional data loss during worker failures. The engineer wants to improve reliability and performance. Which two actions should they take?

A.Switch to Cloud Dataproc and use Spark Structured Streaming

B.Increase the number of workers and use at-least-once delivery to BigQuery

C.Enable Dataflow Streaming Engine and use BigQuery exactly-once sink

D.Use a global window and disable triggers

AnswerC

Streaming Engine reduces latency and improves reliability; BigQuery exactly-once sink prevents duplicates.

Why this answer

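Enabling Streaming Engine moves state management to the Dataflow service backend, reducing latency and making the pipeline more resilient to worker failures. Using an exactly-once sink (such as BigQuery with exactly-once guarantees) ensures that retries after a failure neither drop nor duplicate records.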

740

MCQhard

A company runs a Cloud Dataflow streaming pipeline that reads from Cloud Pub/Sub, performs a fixed window of 10 seconds, joins with a slowly-changing dimension table stored in Cloud Bigtable, and writes results to BigQuery. The pipeline has been running for months but recently started exhibiting increasing latency and occasional data loss. The pipeline uses default settings with autoscaling enabled (min 2, max 20 workers). The Bigtable cluster has 3 nodes. The dimensions are updated infrequently. The latency has grown from seconds to minutes. Examining the Dataflow monitoring UI, you see that the 'System Lag' metric is increasing, and some windows are not being emitted. The CPU utilization on Bigtable nodes is below 50%. There are no errors in the logs. Which action is most likely to resolve the issue?

A.Set the pipeline option --maxNumWorkers to a value between 5 and 10.

B.Increase the window duration to 30 seconds to reduce the number of windows.

C.Redesign the pipeline to use a side input for the dimension table instead of a lookup.

D.Increase the number of Bigtable nodes to reduce lookup latency.

AnswerA

Prevents over-scaling and shuffle overhead.

Why this answer

With Bigtable CPU below 50% and no errors in the logs, neither the lookup store nor a code fault explains the increasing system lag and unemitted windows. The rationale given for this answer is over-scaling: with autoscaling free to range from 2 to 20 workers, each scaling event redistributes work and state, and the resulting shuffle overhead can outweigh the extra parallelism. Capping maxNumWorkers at 5–10 keeps enough parallelism for the load while limiting that churn, allowing the pipeline to catch up and emit windows reliably.

Exam trap

Google Cloud often tests the misconception that Bigtable or the join strategy is the bottleneck when the real issue lies in autoscaling behavior, leading candidates to choose scaling Bigtable or redesigning the join instead of adjusting autoscaling limits.

How to eliminate wrong answers

Option B is wrong because increasing the window duration to 30 seconds would only delay window emission, not resolve the root cause of increasing system lag or data loss; it could even worsen latency by accumulating more data per window. Option C is wrong because using a side input for a slowly-changing dimension table would require periodic re-reading of the entire table, increasing memory pressure and shuffle overhead, and would not fix a worker-scaling bottleneck. Option D is wrong because Bigtable CPU is below 50%, indicating the lookup latency is not the issue; adding nodes would be unnecessary and would not address the pipeline’s inability to keep up with the streaming throughput.

741

MCQhard

A data engineer is using Spark on Dataproc to process a large dataset. They notice the job is slow due to excessive shuffling. They want to optimize the job by using a more efficient data structure that reduces serialization overhead and provides better memory management. Which Spark API should they use?

A.Spark SQL

B.Spark Streaming

C.RDDs

D.DataFrames or Datasets

AnswerD

DataFrames and Datasets use the Catalyst optimizer and Tungsten execution engine, improving performance and memory efficiency.

Why this answer

Spark DataFrames/Datasets use the Tungsten execution engine, which provides optimized serialization and memory management. RDDs lack these optimizations. Spark SQL is a module, not an API.

Spark Streaming is for stream processing and does not address shuffle or serialization overhead.

742

MCQmedium

Your Dataflow streaming pipeline is processing financial transactions and writing results to BigQuery. You need to monitor the pipeline for data freshness (end-to-end latency) and alert if it exceeds 5 minutes. The pipeline uses fixed windows of 1 minute. Which metrics should you use for alerting?

A.System Lag metric from Dataflow monitoring.

B.Data Freshness metric from BigQuery monitoring.

C.Element Count metric from Dataflow monitoring.

D.Worker Threads Utilization metric from Dataflow monitoring.

AnswerA

System Lag tracks the maximum time an element has been waiting in the pipeline to be processed; if it exceeds 5 minutes, alert.

Why this answer

System Lag in Dataflow measures the maximum time that a data element waits in the pipeline before being processed, which directly reflects end-to-end latency. For a streaming pipeline with 1-minute fixed windows, System Lag indicates how far behind the pipeline is from real-time, making it the correct metric to alert on when data freshness exceeds 5 minutes.

Exam trap

Google Cloud often tests the distinction between pipeline-level latency metrics (System Lag) and sink-level freshness metrics (BigQuery Data Freshness), trapping candidates who confuse the two or assume any 'freshness' metric is appropriate for the pipeline itself.

How to eliminate wrong answers

Option B is wrong because BigQuery Data Freshness measures the time since the last successful write to a table, not the end-to-end latency of the Dataflow pipeline; it would only detect if writes stop entirely, not if data is delayed within the pipeline. Option C is wrong because Element Count tracks the number of elements processed per second, which is a throughput metric and does not indicate latency or freshness. Option D is wrong because Worker Threads Utilization measures CPU usage of worker threads, which is a resource utilization metric unrelated to data freshness or end-to-end latency.

743

MCQhard

A company stores IoT sensor readings in BigQuery. The table is partitioned by day and clustered by sensor_id. Query performance has degraded as data grows; many queries filter by a date range and a single sensor_id. Which optimization should be applied first?

A.Remove clustering on sensor_id as it may cause overhead.

B.Add a WHERE clause to filter by partition date even if the query already filters by a date range.

C.Increase the number of BigQuery slots assigned to the project.

D.Recluster the table to ensure data is sorted by sensor_id within each partition.

AnswerD

Clustering improves filter performance by reducing scanned data.

Why this answer

Option D is correct because reclustering the table ensures that within each daily partition, the data is physically sorted by sensor_id. This optimizes the performance of queries that filter by a date range and a single sensor_id, as BigQuery can use the clustering metadata to prune blocks and read only the relevant data, reducing the amount of data scanned and improving query speed.

Exam trap

Google Cloud often tests the misconception that adding more compute resources (slots) or redundant WHERE clauses will fix performance issues caused by poor data layout, when the correct first step is to optimize data organization through clustering and partitioning.

How to eliminate wrong answers

Option A is wrong because removing clustering on sensor_id would eliminate the physical sorting that helps prune blocks for queries filtering by sensor_id, likely worsening performance. Option B is wrong because adding a WHERE clause to filter by partition date is redundant if the query already filters by a date range; BigQuery automatically performs partition pruning based on the date filter, so this would not improve performance. Option C is wrong because increasing the number of BigQuery slots addresses compute resource contention, not the underlying data layout issue; if the query is scanning too much data due to poor clustering, more slots will not reduce the bytes processed.

744

MCQmedium

A company wants to use BigQuery materialized views to accelerate queries on a table that is updated every hour. Which statement about materialized views is true?

A.Materialized views cannot be clustered.

B.Materialized views must be manually refreshed by the user.

C.Materialized views can only be created on ingestion-time partitioned tables.

D.Materialized views are automatically updated when the base table changes.

AnswerD

Yes, BigQuery manages incremental refreshes.

Why this answer

BigQuery materialized views are automatically refreshed when base tables are changed.

745

MCQhard

The query above runs slowly on the 10 TB table. Which optimization would most improve performance?

A.Use a subquery to filter item.category first

B.Cluster the table by customer_id

C.Create a materialized view that pre-aggregates by customer_id and item category

D.Partition the table by order_date

AnswerC

A materialized view pre-computes the COUNT for each (customer_id, category), so the query reads a small pre-aggregated table.

Why this answer

Option C is correct because a materialized view can pre-compute and store the aggregated results by customer_id and item category, eliminating the need to scan the full 10 TB table for each query. This dramatically reduces I/O and computation time, especially when the underlying aggregation is expensive and the query pattern is predictable.

Exam trap

Google Cloud often tests the misconception that partitioning or clustering alone can accelerate arbitrary aggregation queries, when in fact they only help with filter-based pruning or specific join patterns, not with reducing the full scan required for grouping without a WHERE clause.

How to eliminate wrong answers

Option A is wrong because using a subquery to filter item.category first does not reduce the scan size; the database still must read the entire 10 TB table to evaluate the subquery, and the optimizer may not push the filter down effectively. Option B is wrong because clustering by customer_id improves range scans and joins on that column, but it does not help with aggregation queries that group by customer_id and item category; the table still must be fully scanned to compute the aggregates. Option D is wrong because partitioning by order_date only prunes partitions when queries filter on order_date; the query in question does not filter by date, so all partitions would be scanned, providing no performance benefit.

746

Multi-Selectmedium

A data engineer needs to design a BigQuery dataset for a multi-team environment. Each team should have read access only to specific tables, and the data must be protected from accidental deletion. Which THREE steps should they take?

Select 3 answers

A.Create authorized views for each team and grant access to the views

B.Cluster tables by team_id to improve performance

C.Grant bigquery.dataViewer on the dataset to all teams

D.Use table-level IAM to assign bigquery.dataViewer per table to each team

E.Enable deletion protection on critical tables

AnswersA, D, E

Authorized views restrict access to specific columns/rows per team.

Why this answer

Authorized views allow sharing specific table data without direct table access. Dataset-level IAM is too broad. Deletion protection prevents accidental table drops.

Clustering improves performance but not access control.

747

MCQeasy

A team is using Kubeflow Pipelines on Google Kubernetes Engine to orchestrate ML workflows. They need to track parameters, metrics, and artifacts for each run. Which tool should they integrate?

A.Cloud Monitoring

B.Cloud Logging

C.BigQuery

D.Vertex ML Metadata

AnswerD

Vertex ML Metadata is designed to track ML artifacts, parameters, and metrics across pipeline runs.

Why this answer

Vertex ML Metadata is the correct choice because it is purpose-built for tracking parameters, metrics, and artifacts in ML workflows, and it integrates natively with Kubeflow Pipelines on Google Kubernetes Engine. It stores metadata for each pipeline run, enabling lineage tracking, comparison, and reproducibility of experiments.

Exam trap

Google Cloud often tests the distinction between general-purpose monitoring/logging tools and ML-specific metadata stores, so the trap here is that candidates may confuse Cloud Monitoring or Cloud Logging with a tool that can track ML metrics, when in fact they lack the structured schema and lineage capabilities required for ML workflow orchestration.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring is designed for infrastructure and application performance monitoring (e.g., CPU utilization, latency), not for tracking ML-specific parameters, metrics, and artifacts. Option B is wrong because Cloud Logging collects and stores log data (e.g., text logs from applications), not structured ML metadata like hyperparameters or model artifacts. Option C is wrong because BigQuery is a serverless data warehouse for analytical queries on large datasets, not a metadata store for ML pipeline runs.

748

MCQhard

You are designing a streaming data pipeline that must guarantee exactly-once processing semantics for financial transactions. The pipeline reads from Cloud Pub/Sub and writes to Cloud Bigtable. Each transaction has a unique transaction ID. Which features do you need to implement to ensure exactly-once semantics end-to-end?

A.Use Cloud Pub/Sub with synchronous pull and manually commit offsets after successfully writing to Bigtable.

B.Use Dataflow with exactly-once processing, and ensure the Bigtable sink uses idempotent mutations based on the transaction ID.

C.Use Dataflow with at-least-once processing and implement deduplication in a windowed transform using the transaction ID.

D.Use Cloud Pub/Sub with exactly-once delivery enabled, and write to Bigtable using single-row transactions.

AnswerB

Dataflow deduplicates records using unique identifiers; idempotent Bigtable writes (e.g., a row keyed by transaction ID and written with an explicit cell timestamp) ensure that even if a mutation is retried, the result is the same.

Why this answer

Option B is correct because Dataflow's exactly-once processing guarantees that each record is processed precisely once, and idempotent Bigtable mutations (keyed by transaction ID) ensure that even if a mutation is retried, the result is the same. This combination provides end-to-end exactly-once semantics: Dataflow handles source-side deduplication and checkpointing, while Bigtable's idempotent writes prevent duplicates at the sink.
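The role of the idempotent sink can be shown with a small plain-Python sketch. It does not use the Bigtable client: `FakeTable` is a hypothetical in-memory stand-in, and the `txn#` row-key format is an assumption for illustration. The sketch contrasts a write keyed by transaction ID with a naive append when the same message is delivered twice.

```python
class FakeTable:
    """Hypothetical in-memory stand-in for a key-value table."""
    def __init__(self):
        self.rows = {}

    def upsert(self, row_key, value):
        # Writing the same key with the same value twice leaves one row.
        self.rows[row_key] = value

def write_idempotent(table, txn):
    """Row key derived from the transaction ID, so retries overwrite."""
    table.upsert(f"txn#{txn['id']}", txn["amount"])

# Simulated deliveries: transaction 42 arrives twice (a retry or redelivery).
deliveries = [
    {"id": 41, "amount": 10.0},
    {"id": 42, "amount": 25.5},
    {"id": 42, "amount": 25.5},
]

table = FakeTable()
append_log = []  # non-idempotent sink for comparison
for txn in deliveries:
    write_idempotent(table, txn)
    append_log.append(txn)

print("idempotent sink rows:", len(table.rows))
print("append-only sink rows:", len(append_log))
```

The keyed table ends up with one row per transaction regardless of how many times a write is retried, while the append-only log records the duplicate.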

Exam trap

Google Cloud often tests the misconception that Pub/Sub's exactly-once delivery feature or manual acknowledgement handling alone can achieve end-to-end exactly-once semantics. Those guarantees stop at the subscription: a write to the sink can still be retried, so a processing framework like Dataflow combined with an idempotent sink is needed to achieve exactly-once end-to-end.

How to eliminate wrong answers

Option A is wrong because Pub/Sub has no consumer-managed offsets (messages are acknowledged individually), and synchronous pull with acknowledgement after the Bigtable write is still at-least-once: if the acknowledgement is lost or late, the message is redelivered and written again. Option C is wrong because at-least-once processing in Dataflow inherently allows duplicates, and windowed deduplication using transaction ID is not sufficient for end-to-end exactly-once semantics; it only handles duplicates within a window and does not address failures during checkpointing or sink writes. Option D is wrong because Pub/Sub exactly-once delivery only guarantees that an acknowledged message is not redelivered; a crash between the Bigtable write and the acknowledgement still causes a second write, and single-row transactions in Bigtable do not by themselves make that second write a no-op.

749

MCQhard

A company is using Pub/Sub to ingest clickstream events and Dataflow to write to BigQuery. They observe that some events are malformed and cause the pipeline to fail. They need a solution that captures malformed events without blocking the pipeline and allows reprocessing later. Which Dataflow pattern should they implement?

A.Use a side input to filter malformed events before the main pipeline

B.Use the Reshuffle transform to reattempt failures

C.Write malformed events to a dead letter sink (e.g., another Pub/Sub topic or GCS bucket) and continue processing healthy events

D.Use logging alerts to notify the team and stop the pipeline on error

AnswerC

Dead letter sink is the correct pattern: isolate bad records and let the pipeline proceed.

Why this answer

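A dead letter sink (often called a dead letter queue, DLQ) is the standard pattern for handling bad records in Dataflow. The pipeline catches the parsing or validation error, writes the malformed record to a separate sink (e.g., a Pub/Sub topic or GCS bucket) for later analysis and reprocessing, and keeps processing healthy events. Side inputs are for enriching data, not error handling.

Reshuffle redistributes elements across workers; it does not isolate or repair bad records. Stopping the pipeline on error blocks healthy events as well. Tagged outputs (side outputs) are the usual way to implement the dead letter pattern: failing records go to a separate output, which is then written to the dead letter sink.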
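
The routing logic behind the dead letter pattern can be sketched in plain Python. This is an illustration under stated assumptions, not Beam or Dataflow API code: `parse_event` and `route` are hypothetical names, events are assumed to be JSON objects, and `user_id` is an assumed required field. In a real pipeline the two lists would correspond to the main output and a tagged dead letter output.

```python
import json


def parse_event(raw: bytes):
    """Return ("ok", event) for a valid event, else ("dead_letter", record)."""
    try:
        event = json.loads(raw.decode("utf-8"))
        if not isinstance(event, dict) or "user_id" not in event:
            raise ValueError("missing required field: user_id")
        return "ok", event
    except ValueError as err:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        # Keep the raw payload and the error so the record can be reprocessed.
        return "dead_letter", {
            "raw": raw.decode("utf-8", errors="replace"),
            "error": str(err),
        }


def route(raw_events):
    """Split events into healthy records and dead letter records."""
    valid, dead_letters = [], []
    for raw in raw_events:
        tag, record = parse_event(raw)
        (valid if tag == "ok" else dead_letters).append(record)
    return valid, dead_letters


events = [
    b'{"user_id": "u1", "page": "/home"}',  # healthy
    b'{"page": "/cart"}',                   # valid JSON, missing user_id
    b"not json",                            # malformed payload
]
valid, dead_letters = route(events)
print(len(valid), len(dead_letters))  # 1 2
```

Storing the original payload next to the error message is what makes later reprocessing possible: once the parser or the upstream producer is fixed, the dead letter records can be replayed through the same function.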

750

MCQeasy

A team wants to store semi-structured user profile data for a web application. The data is accessed via a REST API and requires security rules to control read/write access. Which database fits best?

A.BigQuery

B.Firestore

C.Cloud SQL

D.Cloud Bigtable

AnswerB

Document NoSQL with Security Rules and REST API.

Why this answer

Firestore is a NoSQL document database that natively stores semi-structured data (JSON-like documents) and integrates directly with Firebase Authentication and security rules to control read/write access per document or collection. Its REST API support and real-time capabilities make it ideal for web application user profiles that require flexible schemas and fine-grained access control.

Exam trap

Google Cloud often tests the distinction between NoSQL databases optimized for different workloads (document vs. wide-column vs. analytical), and the trap here is assuming any NoSQL database (like Bigtable) is suitable for semi-structured user profiles without considering the need for built-in security rules and REST API integration.

How to eliminate wrong answers

Option A is wrong because BigQuery is a serverless data warehouse designed for analytical queries on large datasets, not for transactional REST API access with per-document security rules. Option C is wrong because Cloud SQL is a relational database (MySQL, PostgreSQL, SQL Server) that requires a fixed schema and does not natively support document-level security rules or semi-structured data without additional abstraction. Option D is wrong because Cloud Bigtable is a wide-column NoSQL database optimized for high-throughput, low-latency time-series or analytical workloads, not for semi-structured user profiles with fine-grained access control via REST API.
