Google Professional Data Engineer PDE Questions 901–975 | Page 13/14

901

MCQ medium

You are designing a Dataflow pipeline that joins two unbounded PCollections from different sources. Which transform should you use?

A. ParDo

B. Flatten

C. CoGroupByKey

D. GroupByKey

Answer: C

CoGroupByKey joins multiple PCollections by key.

Why this answer

CoGroupByKey performs a key-based join of multiple PCollections. It can handle unbounded streams with appropriate windowing.
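
As a rough illustration of what a key-based co-group produces, the following plain-Python sketch groups two keyed collections by key, as would happen within a single window. It does not use the Apache Beam API; the function name `co_group_by_key` and the sample data are hypothetical.

```python
from collections import defaultdict


def co_group_by_key(left, right):
    """Group two lists of (key, value) pairs by key.

    Plain-Python model of a co-group within one window; not the Beam API.
    Returns {key: (left_values, right_values)}.
    """
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


# Hypothetical sample data: orders and payments keyed by order ID.
orders = [("o1", "book"), ("o2", "pen")]
payments = [("o1", 12.5), ("o3", 3.0)]
joined = co_group_by_key(orders, payments)
```

Keys present in only one input still appear in the result with an empty list on the other side, which is what lets the caller choose inner or outer join behavior afterwards.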

902

MCQ hard

A company wants to use BigQuery's PIVOT operator to transform their sales data. They have a table with columns: 'year', 'quarter', 'revenue'. They want to create a report where each row is a year and each column is a quarter (Q1, Q2, Q3, Q4) showing revenue. Which SQL statement is correct?

A. SELECT * FROM sales PIVOT(SUM(revenue) FOR quarter IN ('Q1','Q2','Q3','Q4'))

B. SELECT * FROM sales PIVOT(revenue FOR quarter IN (Q1, Q2, Q3, Q4))

C. PIVOT sales ON quarter USING SUM(revenue)

D. SELECT * FROM (SELECT year, quarter, revenue FROM sales) PIVOT(SUM(revenue) FOR quarter IN (Q1, Q2, Q3, Q4))

Answer: A

Correct syntax: an aggregate function, the pivot column, and a list of quoted pivot values.

Why this answer

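PIVOT in BigQuery requires an aggregate function, the pivot column, and the list of pivot column values, which must be constants such as 'Q1'. Option A satisfies all three, and because the table contains only 'year', 'quarter', and 'revenue', the remaining column 'year' becomes the grouping column. An equivalent form that selects the input columns explicitly is: SELECT * FROM (SELECT year, quarter, revenue FROM sales) PIVOT(SUM(revenue) FOR quarter IN ('Q1','Q2','Q3','Q4')). Option B omits the aggregate function, option C is not valid BigQuery syntax, and option D uses unquoted pivot values.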
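
To make the reshaping concrete, here is a minimal plain-Python sketch of what a SUM pivot over 'quarter' computes, with one output row per year. It is only a model of the semantics, not BigQuery code; the function name `pivot_sum` and the sample rows are hypothetical.

```python
from collections import defaultdict


def pivot_sum(rows, quarters=("Q1", "Q2", "Q3", "Q4")):
    """Sum revenue per (year, quarter) and emit one dict per year.

    Plain-Python model of PIVOT(SUM(revenue) FOR quarter IN (...)).
    Quarters with no input rows stay None, as a SQL NULL would.
    """
    totals = defaultdict(dict)
    for year, quarter, revenue in rows:
        year_totals = totals[year]  # every year gets a row, even if empty
        if quarter in quarters:
            year_totals[quarter] = year_totals.get(quarter, 0) + revenue
    return [
        {"year": year, **{q: totals[year].get(q) for q in quarters}}
        for year in sorted(totals)
    ]


# Hypothetical sample rows: (year, quarter, revenue).
sales = [(2023, "Q1", 100), (2023, "Q1", 50), (2023, "Q2", 70), (2024, "Q4", 30)]
report = pivot_sum(sales)
```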

903

MCQ easy

A company wants to ingest IoT sensor data from thousands of devices into BigQuery for near-real-time analytics. The data volume is approximately 10 GB per hour. Which combination of Google Cloud services should they use for a cost-effective and scalable solution?

A. Pub/Sub → Dataflow → BigQuery

B. Cloud IoT Core → Cloud Functions → BigQuery

C. Cloud IoT Core → Cloud Dataproc → BigQuery

D. Cloud IoT Core → Cloud Storage → BigQuery load jobs

Answer: A

Pub/Sub ingests events, Dataflow streams them to BigQuery, scaling automatically.

Why this answer

Pub/Sub provides a scalable, managed ingestion layer for high-volume IoT data, decoupling producers from consumers. Dataflow (Apache Beam) processes the streaming data in near-real-time with exactly-once semantics and auto-scaling, writing directly to BigQuery for analytics. This combination minimizes operational overhead and cost by avoiding intermediate storage and manual scaling.
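
For a sense of scale, the stated volume can be converted into a sustained rate. The short calculation below assumes decimal gigabytes and a perfectly even arrival rate, neither of which the question states.

```python
# Back-of-envelope ingest rate for 10 GB per hour (decimal gigabytes assumed).
BYTES_PER_GB = 10**9
SECONDS_PER_HOUR = 3600

bytes_per_second = 10 * BYTES_PER_GB / SECONDS_PER_HOUR
megabytes_per_second = bytes_per_second / 10**6
```

Under those assumptions the rate is a little under 3 MB per second. The volume is modest, but it arrives continuously, which is why a streaming path fits better than periodic batch loads.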

Exam trap

Google Cloud often tests the misconception that Cloud Functions can handle streaming workloads, but its synchronous nature and timeout limit make it unsuitable for sustained high-throughput ingestion, whereas Pub/Sub + Dataflow is the standard pattern for near-real-time analytics.

How to eliminate wrong answers

Option B is wrong because Cloud Functions has a 9-minute timeout and is not designed for sustained high-throughput streaming (10 GB/hour), leading to timeouts and data loss. Option C is wrong because Cloud Dataproc (managed Spark/Hadoop) is optimized for batch processing, not near-real-time streaming; it adds latency and complexity compared to Dataflow's native streaming. Option D is wrong because Cloud Storage load jobs are batch-oriented, introducing minutes-to-hours latency and requiring manual orchestration, which fails the near-real-time requirement.

904

Multi-Select hard

Which THREE considerations are important when designing a data lake on Google Cloud using Cloud Storage?

Select 3 answers

A. Use Cloud Storage's eventual consistency model for cost savings.

B. Define a schema when writing data to enforce data quality.

C. Choose the appropriate storage class based on access patterns.

D. Enable encryption at rest using CMEK or CSEK.

E. Use object lifecycle management to transition data to colder storage classes.

Answers: C, D, E

Storage class impacts cost and latency.

Why this answer

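Option C is correct because selecting the appropriate storage class (e.g., Standard, Nearline, Coldline, Archive) based on data access patterns directly optimizes cost and performance in Cloud Storage. For a data lake, where data may be accessed frequently initially and rarely later, matching the storage class to the access pattern avoids paying premium rates for infrequently accessed data. Option D is correct because encrypting data at rest with customer-managed (CMEK) or customer-supplied (CSEK) keys gives the organization control over the keys that protect the lake. Option E is correct because object lifecycle management automates the transition to colder storage classes as data ages, so the cost savings do not depend on manual moves.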
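
A lifecycle configuration that moves aging objects to colder classes might look like the sketch below, which builds the rule list as a Python dict in the JSON shape used by Cloud Storage lifecycle configuration files. The helper name `build_lifecycle_config` and the 30-day and 90-day thresholds are illustrative assumptions, not recommendations.

```python
import json


def build_lifecycle_config(nearline_after_days=30, coldline_after_days=90):
    """Build a Cloud Storage lifecycle configuration as a Python dict.

    The day thresholds are illustrative assumptions, not recommendations.
    """
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {"age": coldline_after_days},
            },
        ]
    }


config = build_lifecycle_config()
config_json = json.dumps(config, indent=2)  # text that could be saved to a file
```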

Exam trap

Google Cloud often tests the misconception that Cloud Storage uses eventual consistency, but it provides strong consistency for object reads, writes, deletes, and listings, and consistency is not a setting that can be traded for a lower price, making option A a trap for those who assume otherwise.

905

Multi-Select hard

A company uses Pub/Sub to ingest IoT sensor data and wants to process it with a Dataflow pipeline that uses fixed windows of 1 minute to compute average temperature. The pipeline also needs to handle malformed messages by routing them to a dead letter queue. Which TWO configurations should the engineer implement? (Choose TWO.)

Select 2 answers

A. Configure the pipeline to ignore malformed messages using `withCoder()`

B. Add a `GroupByKey` transform to deduplicate messages based on a unique ID

C. Enable exactly-once processing for the Pub/Sub subscription used by the pipeline

D. Set up a Cloud Function to reprocess messages from the dead letter topic

E. Use a dead letter sink (e.g., another Pub/Sub topic) for messages that exceed the retry limit

Answers: C, E

Exactly-once processing ensures each message is processed once, preventing duplicates and gaps.

Why this answer

To handle malformed messages, a dead letter sink (e.g., Pub/Sub topic or BigQuery table) should be used to capture messages that fail processing after retries. The Dataflow pipeline should use the Apache Beam `PubsubIO` source and apply windowing. Enabling exactly-once processing for the Pub/Sub subscription ensures that messages are not lost or duplicated during failures.

Option C is correct because exactly-once processing protects data integrity in a streaming pipeline. Option E is correct because a dead letter sink captures messages that fail processing. Option A is not recommended because it discards data.

Option B is unnecessary because deduplication is not needed once exactly-once processing is enabled. Option D is not required for this scenario.
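
The windowed aggregation in this question can be modeled in a few lines. The sketch below assigns each reading to a 1-minute fixed window by event time and averages per window. It is plain Python rather than the Beam API, and the function name `average_per_window` and the sample readings are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute fixed windows, as in the question


def average_per_window(readings):
    """Average temperature per 1-minute fixed window.

    `readings` is a list of (event_time_seconds, temperature) pairs.
    Plain-Python model of fixed windowing; not the Beam API.
    Returns {window_start_seconds: average_temperature}.
    """
    buckets = defaultdict(list)
    for event_time, temperature in readings:
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        buckets[window_start].append(temperature)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}


# Hypothetical readings: two in the first minute, one in the second.
averages = average_per_window([(5, 20.0), (59, 22.0), (61, 30.0)])
```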

906

MCQ hard

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. Some incoming messages are malformed and fail to parse. How should you handle these messages to ensure the pipeline continues processing without data loss?

A. Configure Pub/Sub to retry indefinitely until the message is processed

B. Use a try-catch block in the pipeline and ignore malformed messages

C. Write malformed messages to a dead-letter sink (e.g., Pub/Sub topic or GCS) and continue processing

D. Set the pipeline to fail and alert the team via Cloud Monitoring

Answer: C

Why this answer

The recommended pattern is to use a dead-letter queue (e.g., a separate Pub/Sub topic or a GCS bucket) to store failed messages after a retry threshold is reached. This preserves messages for later analysis without blocking the main pipeline.
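
The routing logic behind the dead-letter pattern can be sketched as follows. This is a plain-Python model, not the Beam API: it assumes messages are JSON strings, and the function name `route_messages` and the shape of the dead-letter entries are hypothetical.

```python
import json


def route_messages(raw_messages):
    """Split raw messages into parsed records and dead-letter entries.

    Plain-Python model of the dead-letter pattern; not the Beam API.
    Malformed messages are kept with their error instead of being dropped.
    """
    parsed, dead_letter = [], []
    for raw in raw_messages:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError as err:
            dead_letter.append({"payload": raw, "error": str(err)})
    return parsed, dead_letter


# Hypothetical input: one valid JSON message and one malformed message.
good, bad = route_messages(['{"id": 1}', "{not json"])
```

Keeping the original payload alongside the error is what makes later analysis or replay possible; in a real pipeline the second list would be written to the dead-letter sink.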

907

Multi-Select easy

Which THREE Google Cloud services are considered fully managed serverless data processing services? (Choose THREE.)

Select 3 answers

A. Cloud Dataproc

B. Cloud Functions

C. Cloud Composer

D. Cloud Data Fusion

E. Cloud Dataflow

Answers: B, D, E

B is correct because Cloud Functions is a serverless compute service often used for data transformation.

Why this answer

Cloud Functions is a fully managed serverless data processing service because it executes code in response to events without requiring any server provisioning or management. It automatically scales from zero to thousands of instances based on incoming requests, and you pay only for compute time used while your code runs. This makes it ideal for lightweight, event-driven data processing tasks such as transforming data in Cloud Storage or reacting to Pub/Sub messages.

Exam trap

Google Cloud often tests the distinction between 'fully managed' and 'serverless'—the trap here is that Cloud Dataproc and Cloud Composer are fully managed (Google handles infrastructure) but still require you to manage cluster resources or worker nodes, so they are not serverless; candidates mistakenly equate 'fully managed' with 'serverless'.

908

MCQ hard

A company is designing a data lake on Google Cloud. The data lake will store raw, curated, and analytics-ready data. Security requirements include: data must be encrypted at rest and in transit, access must be controlled based on data sensitivity (public, internal, confidential), and all access to sensitive data must be audited. The company also wants to minimize data transfer costs for frequently accessed curated datasets. Which combination of services and configurations best meets these requirements?

A. Use Cloud Storage with default encryption, bucket policies, and Cloud Audit Logs. For frequent access, use Cloud CDN.

B. Use Cloud Storage with CMEK, and use Cloud HSM for key storage. Use Cloud Audit Logs. Avoid caching to ensure security.

C. Use Cloud Storage with SSE-C, bucket policies, and Cloud Audit Logs. Use Cloud Load Balancing for caching.

D. Use Cloud Storage with CMEK, bucket-level IAM, and object ACLs. Use Cloud Data Loss Prevention API to classify data. Enable Cloud Audit Logs. Use Cloud CDN to cache curated datasets.

Answer: D

CMEK ensures customer-controlled encryption; IAM+ACLs give granular access; DLP inspects and classifies; audit logs capture access; CDN caches data for lower latency and cost.

Why this answer

Option D is correct because it combines CMEK for customer-controlled encryption at rest, bucket-level IAM and object ACLs for granular access control based on data sensitivity, the Cloud Data Loss Prevention API to classify data by sensitivity, Cloud Audit Logs for auditing access to sensitive data, and Cloud CDN to cache curated datasets, reducing data transfer costs for frequently accessed data. This configuration meets all security requirements (encryption at rest and in transit, access control, auditing) while optimizing cost for frequent access.

Exam trap

Google Cloud often tests the misconception that caching (Cloud CDN) is inherently insecure or that it cannot be used with sensitive data, but in reality, Cloud CDN can be secured with signed URLs, IAM, and encryption, and it is the correct way to reduce data transfer costs for frequently accessed data.

How to eliminate wrong answers

Option A is wrong because default encryption uses Google-managed keys rather than customer-managed keys (CMEK), which may not satisfy compliance requirements for controlling encryption keys, and it includes nothing to classify data by sensitivity; its use of Cloud CDN is not the flaw, since option D relies on Cloud CDN too. Option B is wrong because 'Avoid caching to ensure security' contradicts the requirement to minimize data transfer costs for frequently accessed curated datasets; caching with Cloud CDN is secure when properly configured (e.g., signed URLs, IAM), and avoiding it increases costs. Option C is wrong because SSE-C (server-side encryption with customer-provided keys, which Cloud Storage calls customer-supplied encryption keys, or CSEK) requires the client to manage keys and is not integrated with Cloud HSM or Cloud KMS; Cloud Load Balancing does not cache data (it distributes traffic), so it does not reduce data transfer costs for frequent access.

909

MCQ easy

A data engineer needs to schedule recurring nightly loads from Amazon S3 to Google Cloud Storage. The data is in CSV format and the volume is approximately 500 GB per night. Which Google Cloud service should they use?

A. Transfer Appliance

B. Storage Transfer Service

C. BigQuery Data Transfer Service

D. Datastream

Answer: B

Storage Transfer Service is designed for online transfers between storage systems including S3 to GCS.

Why this answer

The Storage Transfer Service is designed for online data transfers from external cloud providers like Amazon S3 to Google Cloud Storage. It supports scheduling recurring nightly transfers, handles large volumes (500 GB/night), and automatically retries failed transfers, making it the correct choice for this use case.

Exam trap

Google Cloud often tests the distinction between services that transfer data to Cloud Storage (Storage Transfer Service) versus services that load data directly into BigQuery (BigQuery Data Transfer Service), causing candidates to confuse the destination.

How to eliminate wrong answers

Option A is wrong because Transfer Appliance is a physical device for offline data transfer, used for very large datasets (hundreds of TB to PB) where network transfer is impractical, not for recurring nightly loads. Option C is wrong because BigQuery Data Transfer Service loads data into BigQuery tables from sources such as Google Ads or Amazon S3; it does not transfer files to Cloud Storage. Option D is wrong because Datastream is for real-time change data capture (CDC) from databases like MySQL or PostgreSQL to BigQuery or Cloud Storage, not for batch CSV file transfers from S3.

910

MCQ easy

A company wants to analyze server logs stored in Cloud Storage using SQL. They need to get results in seconds without setting up any clusters. Which service should they use?

A. Cloud Dataflow

B. Cloud Logging

C. BigQuery

D. Cloud Dataproc

Answer: C

BigQuery supports federated queries on Cloud Storage using SQL, providing fast results without clusters.

Why this answer

BigQuery is a serverless, highly scalable data warehouse. It lets you analyze petabytes of data using standard SQL without provisioning or managing any clusters, which makes it well suited to querying server logs in Cloud Storage, either directly through external tables or after loading the data into BigQuery for faster queries.

Exam trap

Google Cloud often tests the distinction between serverless SQL analytics (BigQuery) and managed compute frameworks (Dataflow, Dataproc), where candidates mistakenly choose Dataflow or Dataproc for SQL-like analysis without recognizing the need for cluster management or pipeline setup.

How to eliminate wrong answers

Option A is wrong because Cloud Dataflow is a unified stream and batch data processing service that requires setting up and managing pipelines (though serverless, it is not primarily for ad-hoc SQL queries on stored logs). Option B is wrong because Cloud Logging is a real-time log management and analysis service for monitoring and debugging, not designed for complex SQL analytics on large historical log datasets stored in Cloud Storage. Option D is wrong because Cloud Dataproc is a managed Spark and Hadoop service that requires provisioning clusters (even if ephemeral) and is not serverless SQL querying.

911

MCQ hard

You are designing a streaming pipeline using Cloud Dataflow with exactly-once semantics. The source is Pub/Sub and the sink is Cloud Bigtable. The pipeline must handle late data up to 10 minutes. You need to minimize cost while maintaining correctness. Which configuration should you use?

A. Fixed windows of 1 minute with allowed lateness 10 minutes and accumulating fired panes

B. Sliding windows of 1 minute with allowed lateness 10 minutes and accumulating fired panes

C. Global window with allowed lateness 10 minutes and trigger=afterWatermark with early firings

D. Session windows of 5 minutes with gap duration 1 minute and discarding fired panes

Answer: C

Global window with watermark-based triggers handles late data efficiently.

Why this answer

Option C is correct because a global window with an after-watermark trigger and early firings is the most cost-effective way to handle unbounded data from Pub/Sub with exactly-once semantics, while allowing up to 10 minutes of lateness. Fixed or sliding windows would create many small window states, increasing Bigtable write costs and shuffle overhead. The global window minimizes state and processing, and the trigger ensures results are emitted promptly without accumulating panes.

Exam trap

Google Cloud often tests the misconception that windowing is always required for streaming pipelines, but here the sink (Bigtable) stores individual records, so a global window with triggers is the most efficient and correct choice, not fixed or sliding windows.

How to eliminate wrong answers

Option A is wrong because fixed windows of 1 minute with accumulating panes would create a new window every minute, leading to excessive state and write amplification in Bigtable, increasing cost without benefit for a global sink. Option B is wrong because sliding windows of 1 minute would create overlapping windows, multiplying state and processing overhead even more than fixed windows, which is wasteful for a use case that doesn't require windowed aggregations. Option D is wrong because session windows with a 1-minute gap duration and discarding panes are designed for grouping events by activity sessions, not for a simple streaming pipeline to Bigtable; discarding panes also risks losing late data that arrives within the 10-minute allowed lateness, violating correctness.

912

MCQ hard

A data scientist uses Vertex AI Workbench notebooks for model development. They want to share the environment with team members while maintaining version control. Which approach should they use?

A. Use Cloud Shell and clone the repo

B. Use a user-managed notebook instance with multiple users

C. Share the notebook via Cloud Storage

D. Store notebooks in Cloud Source Repositories

Answer: B

Allows collaboration with version control.

Why this answer

A user-managed notebook instance with multiple users is the correct approach because Vertex AI Workbench supports collaboration by allowing multiple users to access the same instance via IAM permissions, while the underlying Git integration enables version control. This setup provides a shared, persistent environment where team members can work on the same codebase without duplicating work, and changes can be tracked through Git repositories.

Exam trap

The trap here is that candidates confuse storing notebooks in a version control system (like Cloud Source Repositories) with having a shared, interactive development environment, overlooking that version control alone does not provide the compute and collaboration features of a user-managed notebook instance.

How to eliminate wrong answers

Option A is wrong because Cloud Shell is a temporary, per-user environment with limited resources and no persistent storage, making it unsuitable for sharing a development environment with version control across a team. Option C is wrong because sharing notebooks via Cloud Storage is a static file-sharing method that does not provide version control, collaborative editing, or a live execution environment. Option D is wrong because Cloud Source Repositories is a Git repository hosting service for storing code, not a shared interactive development environment; it lacks the compute and runtime capabilities needed for model development.

913

MCQ medium

A data pipeline uses Dataflow to read from Pub/Sub, window messages into 1-minute fixed windows, and write to BigQuery. The pipeline occasionally has late-arriving data. How should they configure the pipeline to allow late data up to 5 minutes and then trigger a final pane?

A. withAllowedLateness(Duration.standardMinutes(5)).triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))

B. withAllowedLateness(Duration.standardMinutes(5)).triggering(AfterWatermark.pastEndOfWindow().withLateFirings(AfterPane.elementCountAtLeast(1)))

C. triggering(AfterWatermark.pastEndOfWindow()).withAllowedLateness(Duration.standardMinutes(5))

D. withAllowedLateness(Duration.standardMinutes(5)).accumulatingFiredPanes()

Answer: B

Allows 5 min lateness and fires a final pane after watermark passes end of window.

Why this answer

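In Beam, allowed lateness and triggering combine to handle late data. AfterWatermark.pastEndOfWindow() fires the on-time pane when the watermark passes the end of the 1-minute window, withLateFirings(AfterPane.elementCountAtLeast(1)) fires an additional pane whenever late data arrives, and withAllowedLateness(Duration.standardMinutes(5)) keeps the window open for 5 more minutes, after which later data is dropped. Option C sets the lateness but defines no late firings, option A fires on every element rather than on the watermark, and option D only sets the accumulation mode.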
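
The interaction between the watermark and allowed lateness can be modeled with a small classifier. The sketch below is a simplified plain-Python model, not Beam code: it uses whole seconds, ignores Beam's exact boundary handling, and the function name `classify` is hypothetical.

```python
WINDOW_SECONDS = 60             # 1-minute fixed windows
ALLOWED_LATENESS_SECONDS = 300  # 5 minutes of allowed lateness


def classify(event_time, watermark):
    """Classify an element by the watermark at the moment it arrives.

    Simplified model: data is on time until the watermark passes the window
    end, late until the watermark passes the window end plus the allowed
    lateness, and dropped after that.
    """
    window_end = (event_time // WINDOW_SECONDS + 1) * WINDOW_SECONDS
    if watermark < window_end:
        return "on-time"
    if watermark < window_end + ALLOWED_LATENESS_SECONDS:
        return "late"
    return "dropped"
```

In this model, an element with event time 30 belongs to the window ending at 60: it is on time while the watermark is below 60, late while the watermark is below 360, and dropped afterwards.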

914

Multi-Select medium

A team monitors a deployed Vertex AI model and notices an increasing number of prediction errors with status code 413 (Request Entity Too Large). Which TWO actions should they consider to resolve this issue?

Select 2 answers

A. Implement client-side pre-processing to compress or downsample input data

B. Switch the model to batch prediction to handle large payloads offline

C. Increase the number of replicas to handle load

D. Decrease the machine type to reduce resource consumption

E. Increase the maximum request size limit in the endpoint configuration

Answers: A, E

Reducing input size prevents exceeding the limit.

Why this answer

Option A is correct because status code 413 indicates the HTTP request payload exceeds the server's size limit. Implementing client-side pre-processing to compress or downsample input data reduces the payload size before it reaches the Vertex AI endpoint, directly addressing the root cause. This approach is efficient because it shifts the computational burden to the client and avoids hitting the server-imposed request size cap, which is typically 1.5 MB for online predictions in Vertex AI.
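
A client-side guard along these lines could measure the request body before sending it and downsample until it fits. This is a minimal sketch under stated assumptions: the limit is taken as 1,500,000 bytes, the request body is the JSON object `{"instances": [...]}`, the model is assumed to tolerate a shorter input, and the helper names are hypothetical.

```python
import json

MAX_REQUEST_BYTES = 1_500_000  # 1.5 MB limit cited above; exact byte count assumed


def payload_size(instances):
    """Size in bytes of the JSON request body for the given instances."""
    return len(json.dumps({"instances": instances}).encode("utf-8"))


def downsample_to_fit(values, max_bytes=MAX_REQUEST_BYTES):
    """Keep every second value, repeatedly, until the request body fits."""
    while len(values) > 1 and payload_size([values]) > max_bytes:
        values = values[::2]
    return values


# Hypothetical oversized input: one instance with 400,000 numeric features.
signal = [0.5] * 400_000
reduced = downsample_to_fit(signal)
```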

Exam trap

Google Cloud often tests the misconception that scaling resources (replicas or machine type) can fix request size errors, but 413 is a protocol-level limit that must be addressed by reducing payload size, not by increasing infrastructure capacity.

915

MCQ medium

A company needs to process data from a legacy system that outputs CSV files daily. They want to visually build transformations without writing code. Which Google Cloud service should they use?

A. Dataproc

B. Dataprep

C. Dataflow

D. Cloud Data Fusion

Answer: B

Dataprep provides a visual interface for transformations.

Why this answer

Dataprep is a visual data wrangling tool for exploring and cleaning data.

916

MCQ hard

A company needs to serve predictions for a model that runs an expensive computation on each request. The model is used by a batch job that processes millions of records each night, and also by a real-time API for a few thousand queries per hour. Which prediction strategy minimizes cost and latency for both use cases?

A. Deploy two identical models, one on a Compute Engine VM for batch, one on Vertex AI for online, and synchronize updates.

B. Use Vertex AI batch prediction for the nightly job and a separate online endpoint with auto-scaling for the real-time API.

C. Use Vertex AI batch prediction for both workloads.

D. Use a single online Vertex AI endpoint with auto-scaling to handle both workloads.

Answer: B

This separates concerns: batch prediction is optimized for throughput, online endpoint for low-latency, and auto-scaling handles varying traffic.

Why this answer

Option B is correct because it separates the batch and online workloads to optimize cost and latency. Vertex AI batch prediction is designed for high-throughput, asynchronous processing of large datasets at lower cost, while a separate online endpoint with auto-scaling ensures low-latency responses for real-time API queries by scaling resources based on demand. This avoids over-provisioning for the batch job and prevents the batch workload from interfering with the latency-sensitive API.

Exam trap

Google Cloud often tests the misconception that a single endpoint can handle both batch and online workloads efficiently, but the trap is that batch and online have fundamentally different latency and throughput requirements, and using the same infrastructure for both leads to cost or performance penalties.

How to eliminate wrong answers

Option A is wrong because deploying two identical models on separate infrastructure (Compute Engine VM and Vertex AI) introduces unnecessary management overhead and synchronization complexity, and does not leverage Vertex AI's managed batch prediction service for cost efficiency. Option C is wrong because using batch prediction for the real-time API would introduce unacceptable latency, as batch predictions are asynchronous and not designed for sub-second response times required by online queries. Option D is wrong because using a single online endpoint for both workloads would cause the batch job's high-volume requests to consume resources and potentially throttle or delay the real-time API queries, increasing latency and cost due to over-provisioning to handle peak loads.

917

MCQ hard

A company needs to process sensitive healthcare data with strict compliance requirements. They want to use Cloud Dataflow but must ensure data is encrypted end-to-end and audit logs are retained. Which combination of features should they enable?

A. Use Customer-Managed Encryption Keys (CMEK) and VPC Service Controls.

B. Use Data Loss Prevention API to redact sensitive data.

C. Enable Cloud Audit Logs and VPC Service Controls.

D. Enable default encryption at rest and in transit.

Answer: A

Provides control and exfiltration prevention.

Why this answer

Option A is correct because Customer-Managed Encryption Keys (CMEK) allow the company to control the encryption keys used to protect data at rest in Cloud Dataflow, while VPC Service Controls provide a security perimeter that prevents data exfiltration and ensures end-to-end encryption boundaries. Together, they address the compliance requirement for encryption control and audit logging by restricting data movement within a VPC service perimeter and using customer-managed keys for data encryption.

Exam trap

The trap here is that candidates often assume default encryption (Option D) or audit logs alone (Option C) satisfy compliance requirements, but they overlook the need for customer-managed keys and network-level exfiltration controls that VPC Service Controls provide.

How to eliminate wrong answers

Option B is wrong because the Data Loss Prevention (DLP) API is used for inspecting and redacting sensitive data (e.g., PII), not for ensuring end-to-end encryption or audit log retention; it does not provide encryption key management or network-level controls. Option C is wrong because while Cloud Audit Logs capture API activity and VPC Service Controls provide a security perimeter, this combination lacks customer-managed encryption keys (CMEK), which are required for the 'encrypted end-to-end' and key control compliance mandate. Option D is wrong because default encryption at rest and in transit uses Google-managed keys, not customer-managed keys, and does not include VPC Service Controls to enforce data exfiltration prevention or audit log retention policies.

918

MCQ hard

A company is running a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. They notice that the number of workers is not scaling up to handle increased throughput, causing latency spikes. The pipeline uses a GlobalWindow with default triggering. What is the most likely cause of the under-scaling?

A. The pipeline includes a GroupByKey that creates a hot key, limiting parallelism

B. The Pub/Sub subscription has a large backlog, but Dataflow automatically scales to handle it

C. The pipeline uses the default worker machine type, which is too small

D. The pipeline is using legacy streaming inserts instead of the Storage Write API

Answer: A

Hot keys prevent splitting the work across workers, causing underutilization and scaling issues.

Why this answer

Dataflow's autoscaling is based on CPU utilization and throughput. If the pipeline uses a GroupByKey with hot keys, parallelism is limited and workers may not scale effectively.
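
One way to spot this kind of skew is to sample the keys and check whether any single key dominates. The sketch below is plain Python, not a Dataflow feature; the function name `hot_keys`, the 50% threshold, and the sample keys are illustrative assumptions.

```python
from collections import Counter


def hot_keys(keys, threshold=0.5):
    """Return keys that account for more than `threshold` of all elements.

    The 50% threshold is an arbitrary illustrative choice.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    return [key for key, count in counts.items() if count / total > threshold]


# Hypothetical key distribution: one device dominates the stream.
sample = ["device-1"] * 8 + ["device-2", "device-3"]
skewed = hot_keys(sample)
```

All values for one key are processed together after a GroupByKey, so a key that carries most of the stream pins most of the work to a single worker no matter how many workers are added.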

919

MCQ medium

Refer to the exhibit. What is the cause of this error?

A. The machine type flag is only used during model deployment, not endpoint creation

B. The endpoint name already exists

C. The user must specify a model name

D. The region is missing

Answer: A

Correct: machine type is a property of the deployed model, not the endpoint.

Why this answer

The error occurs because the `machine_type` flag is only valid during model deployment (when creating a deployment in Vertex AI), not during endpoint creation. When creating an endpoint, you specify the endpoint name and region, but the machine type is configured later when deploying a model to that endpoint. Attempting to set `machine_type` during endpoint creation causes a validation error because the API does not accept that parameter at that stage.

Exam trap

Google Cloud often tests the distinction between endpoint creation and model deployment parameters, trapping candidates who assume all model-serving configuration happens at endpoint creation time.

How to eliminate wrong answers

Option B is wrong because if the endpoint name already exists, the error would be a 409 Conflict or 'Already exists' message, not a validation error about an invalid parameter. Option C is wrong because a model name is not required when creating an endpoint; the endpoint is a container that can host multiple models, and models are specified during deployment. Option D is wrong because the region is a required parameter for endpoint creation, and if it were missing, the error would indicate a missing required field, not an invalid parameter like `machine_type`.

920

MCQ medium

A company deploys a model to Vertex AI Endpoint. They want to run a canary deployment to test a new model version with 10% of traffic. How should they configure this?

A. Deploy to a new endpoint and update the application to call both

B. Use Cloud Load Balancing to route traffic

C. Deploy the new model to the same endpoint and set traffic split

D. Deploy to Cloud Run and use gradual rollout

Answer: C

Traffic splitting allows canary.

Why this answer

Option C is correct because Vertex AI Endpoints natively support traffic splitting between model versions deployed to the same endpoint. By deploying the new model version to the same endpoint and setting a traffic split of 10% to the new version and 90% to the current version, the company can perform a canary deployment without changing the application code or infrastructure.
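
The effect of a 90/10 split can be simulated with weighted random routing. The sketch below is a plain-Python simulation, not the Vertex AI SDK; the model IDs in `TRAFFIC_SPLIT` and the function name `pick_model` are hypothetical.

```python
import random

# Hypothetical deployed-model IDs mapped to traffic percentages (sum to 100).
TRAFFIC_SPLIT = {"current-model": 90, "canary-model": 10}


def pick_model(rng, split=TRAFFIC_SPLIT):
    """Pick a deployed model ID with probability proportional to its share."""
    models = list(split)
    weights = [split[m] for m in models]
    return rng.choices(models, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the simulation is repeatable
picks = [pick_model(rng) for _ in range(10_000)]
canary_share = picks.count("canary-model") / len(picks)
```

Over many requests the canary's share settles near 10%, while any individual request may go to either model, which is why the application does not need to know which version served it.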

Exam trap

Google Cloud often tests the misconception that canary deployments require separate endpoints or external load balancers, when in fact Vertex AI Endpoints provide a built-in traffic splitting feature that handles this at the model version level.

How to eliminate wrong answers

Option A is wrong because deploying to a new endpoint and updating the application to call both endpoints adds unnecessary complexity and defeats the purpose of a canary deployment, which should be transparent to the application. Option B is wrong because Cloud Load Balancing operates at the network layer and cannot route traffic based on model version within a single Vertex AI Endpoint; it is designed for distributing traffic across regional endpoints or backends, not for model version canary testing. Option D is wrong because deploying to Cloud Run and using gradual rollout is not the native way to manage model versions in Vertex AI; Vertex AI Endpoints provide built-in traffic splitting for model versions, which is the recommended approach for canary deployments in this context.

921

MCQ hard

You are designing a Cloud Storage bucket to hold sensitive financial documents that must not be deleted or overwritten for 7 years. After the retention period, the documents can be deleted automatically. Which configuration should you use?

A. Set a retention policy on the bucket for 7 years and enable Object Versioning.

B. Use Object Lock with WORM mode and set a retention period of 7 years. After the period, objects are automatically deleted.

C. Set a lifecycle rule to delete objects after 7 years and enable Bucket Lock.

D. Use Bucket Lock with a retention policy of 7 years and configure a lifecycle rule to delete objects after 7 years.

Answer: D

Bucket Lock retains objects for 7 years; lifecycle rule deletes them after that period.

Why this answer

Option D is correct because a bucket retention policy enforces WORM (Write Once, Read Many) behavior: objects cannot be deleted or overwritten until they reach the retention period. Bucket Lock makes that policy permanent, so the 7-year period cannot be shortened or removed. A lifecycle rule configured to delete objects at an age of 7 years then removes them automatically once the retention period expires.

This combination meets both the retention and automatic deletion requirements.

Exam trap

Google Cloud often tests the misconception that a retention policy alone or a lifecycle rule alone can cover both retention and automatic deletion, when in fact you need a locked retention policy for the WORM hold and a lifecycle rule for the scheduled deletion.

How to eliminate wrong answers

Option A is wrong because a retention policy combined with Object Versioning does nothing to delete objects automatically after 7 years, and an unlocked retention policy can still be shortened or removed. Option B is wrong because "Object Lock with WORM mode" is Amazon S3 terminology, and a retention period never deletes objects on its own; you must explicitly configure a lifecycle rule for deletion. Option C is wrong because Bucket Lock only locks an existing retention policy; with no retention policy set, a lifecycle rule alone cannot prevent deletion or overwrite during the first 7 years.

922

MCQeasy

Which BigQuery feature allows you to share query results with specific users without giving them direct access to the underlying tables?

A.IAM roles

B.Authorized views

C.Dataset access controls

D.Materialized views

AnswerB

Authorized views allow sharing query results securely.

Why this answer

Authorized views allow sharing results without granting access to the base tables.

923

Multi-Selecthard

A company stores data in a Cloud Storage bucket with versioning enabled. They want to automatically delete objects that are noncurrent (i.e., previous versions) after 30 days, and also delete the current version if it is older than 365 days. Which three Object Lifecycle Management conditions can be used together? (Choose three.)

Select 3 answers

A.lastAccessTime: 30

B.age: 365

C.numNewerVersions: 1

D.daysSinceCustomTime: 30

E.noncurrentTimeBefore: 30

AnswersB, C, E

Deletes current version when older than 365 days.

Why this answer

Option B is correct because the `age` condition in Object Lifecycle Management specifies the number of days since object creation, and setting it to 365 will delete the current version when it is older than 365 days. This directly meets the requirement to delete current versions older than a year.
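
As a rough illustration of how such rules fit together, the sketch below builds a lifecycle configuration in the JSON shape Cloud Storage accepts. It is one possible layout, not the question's official key: `isLive` and `daysSinceNoncurrentTime` are real lifecycle conditions that do not appear among the answer options, and they are used here because `noncurrentTimeBefore` takes a date rather than a number of days.

```python
import json

# Illustrative lifecycle configuration; the rule layout is an assumption.
lifecycle_config = {
    "rule": [
        {
            # Delete live (current) objects more than 365 days after creation.
            "action": {"type": "Delete"},
            "condition": {"age": 365, "isLive": True},
        },
        {
            # Delete noncurrent versions 30 days after they became noncurrent.
            "action": {"type": "Delete"},
            "condition": {"daysSinceNoncurrentTime": 30},
        },
    ]
}

print(json.dumps(lifecycle_config, indent=2))
```

The printed JSON could be saved to a file and applied to a bucket as its lifecycle configuration.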

Exam trap

Google Cloud often tests whether candidates can tell genuine Cloud Storage lifecycle conditions from look-alikes, so candidates mistakenly select `lastAccessTime` (which is not a Cloud Storage lifecycle condition) or confuse `daysSinceCustomTime` with `noncurrentTimeBefore`.

924

MCQhard

You are building a machine learning pipeline for credit risk assessment. The dataset has a severe class imbalance (1% default rate). You want to use AutoML Tables on Vertex AI. Which strategy should you incorporate to handle imbalance?

A.Downsample the majority class to a 50-50 ratio

B.Apply SMOTE in a Dataflow pipeline before training

C.Upsample the minority class using BigQuery SQL

D.Use the `class_weight` parameter in the AutoML Tables model

AnswerD

AutoML Tables supports adjusting class weights to handle imbalance.

Why this answer

AutoML Tables automatically applies class imbalance handling (e.g., class weighting) by default. You can adjust the weight strategy. SMOTE is not directly supported in AutoML Tables; you would need custom training.

Downsampling and upsampling are manual steps not needed.

925

MCQmedium

A company uses BigQuery for analytics. They need to ensure data quality by preventing duplicate records from being inserted. Which approach is most effective?

A.Use BigQuery ML to train a model that identifies anomalies.

B.Use a DML MERGE statement that filters out duplicates based on a unique key.

C.Use Cloud Data Loss Prevention API to scan for duplicates.

D.Use COUNT DISTINCT in queries to ignore duplicates.

AnswerB

MERGE with deduplication logic ensures only one copy of each record is inserted, maintaining data quality.

Why this answer

Option B is correct because BigQuery's DML MERGE statement can be used to atomically insert rows only when a unique key does not already exist in the target table. By using a MERGE with a WHEN NOT MATCHED THEN INSERT clause, the operation prevents duplicate records from being inserted in a single, transactional statement, ensuring data quality without requiring external tools or post-processing.
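
A minimal sketch of such a statement follows, generated from Python so the table names are easy to swap. The table names and key column are hypothetical, and `INSERT ROW` assumes the staging table has the same columns as the target.

```python
# Hypothetical table and column names; adjust to the real schema.
TARGET = "analytics.orders"
STAGING = "analytics.orders_staging"
KEY = "order_id"


def build_dedup_merge(target: str, staging: str, key: str) -> str:
    """Return a MERGE statement that inserts only rows whose key is new."""
    return f"""
MERGE `{target}` AS t
USING (
  -- Keep one row per key from the incoming batch.
  SELECT * EXCEPT(rn) FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY {key} ORDER BY {key}) AS rn
    FROM `{staging}`
  ) WHERE rn = 1
) AS s
ON t.{key} = s.{key}
WHEN NOT MATCHED THEN
  INSERT ROW
""".strip()


merge_sql = build_dedup_merge(TARGET, STAGING, KEY)
print(merge_sql)
```

The inner `ROW_NUMBER()` filter removes duplicates within the incoming batch, while the `ON` clause and `WHEN NOT MATCHED` branch skip keys that already exist in the target.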

Exam trap

Google Cloud often tests the misconception that data quality tools like DLP or ML can solve structural data integrity problems, when in fact the correct approach is to use native DML operations (like MERGE) that enforce uniqueness at write time.

How to eliminate wrong answers

Option A is wrong because BigQuery ML is designed for machine learning tasks like forecasting or classification, not for enforcing data integrity constraints such as preventing duplicate rows; training a model for anomaly detection would be overkill and unreliable for exact duplicate prevention. Option C is wrong because the Cloud Data Loss Prevention (DLP) API is used for inspecting and de-identifying sensitive data (e.g., PII), not for detecting or preventing duplicate records; it has no concept of unique keys or row-level deduplication. Option D is wrong because COUNT DISTINCT only ignores duplicates in query results for aggregation purposes; it does not prevent duplicate records from being inserted into the table, so duplicates can still accumulate over time.

926

MCQeasy

A data engineer wants to automatically detect when the distribution of input features to a production model has shifted significantly. Which Vertex AI feature should they enable?

A.Vertex AI Vizier

B.Vertex AI Model Monitoring

C.Vertex AI Explainable AI

D.Vertex AI Feature Store

AnswerB

Monitors prediction and feature drift/skew.

Why this answer

Vertex AI Model Monitoring is the correct service because it is specifically designed to continuously detect feature distribution drift and prediction skew in production models. It automatically compares the current input feature distribution against a baseline (e.g., training data) and triggers alerts when significant statistical shifts occur, enabling proactive retraining or investigation.

Exam trap

The trap here is that candidates confuse 'monitoring model performance' (e.g., accuracy, latency) with 'monitoring input feature distribution drift', leading them to incorrectly choose Vertex AI Vizier or Explainable AI, which address different aspects of model lifecycle management.

How to eliminate wrong answers

Option A is wrong because Vertex AI Vizier is a hyperparameter tuning service that optimizes model performance through black-box optimization, not for monitoring distribution shifts in production. Option C is wrong because Vertex AI Explainable AI provides feature attributions and explanations for individual predictions, but it does not monitor aggregate distribution changes over time. Option D is wrong because Vertex AI Feature Store is a centralized repository for storing, serving, and sharing feature data, but it lacks built-in drift detection or alerting capabilities.

927

Multi-Selectmedium

A data engineer needs to perform a one-time migration of 10 TB of data from on-premises Hadoop HDFS to Cloud Storage. The network link is 1 Gbps. Which TWO services or tools should they consider? (Choose 2)

Select 2 answers

A.Dataproc with DistCp

B.Cloud Storage Transfer Service

C.BigQuery Data Transfer Service

D.gsutil rsync with parallel composite uploads

E.Transfer Appliance

AnswersB, E

Can transfer from HDFS via a Hadoop URL, suitable for this volume over a 1 Gbps link.

Why this answer

Storage Transfer Service can move data from HDFS to Cloud Storage over the network, using transfer agents installed on-premises. Transfer Appliance is also feasible for large volumes, especially if usable bandwidth is limited.
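
A quick back-of-the-envelope calculation helps judge whether the link is a bottleneck. The sketch below assumes decimal terabytes and, for the second figure, an arbitrary 70% of the link being usable; both are assumptions rather than values from the question.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8  # decimal terabytes to bits
    usable_bps = link_gbps * 1e9 * efficiency
    return bits / usable_bps / 3600


ideal = transfer_hours(10, 1)                      # full line rate
realistic = transfer_hours(10, 1, efficiency=0.7)  # assumed 70% usable bandwidth
print(f"ideal: {ideal:.1f} h, at 70% efficiency: {realistic:.1f} h")
```

At full line rate the transfer takes about 22 hours, and about 32 hours at the assumed 70% efficiency, so an online transfer is practical unless the link is shared or otherwise constrained.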

928

MCQhard

You manage a team that deploys multiple versions of a computer vision model for A/B testing on Vertex AI Endpoints. You need to route a small percentage of traffic to a canary version while the rest goes to the stable version. You also need to gradually increase the canary traffic over time based on performance metrics. Which approach should you take?

A.Create two separate endpoints, one for each version, and use a separate load balancer to route a percentage of requests to the canary endpoint.

B.Deploy both models to the same endpoint and configure traffic splitting percentages using the Vertex AI console or API.

C.Use Cloud Armor with weighted backend services to route a portion of requests to the canary version.

D.Implement feature flags in the application code to randomly select the model version for each prediction request.

AnswerB

Vertex AI endpoints natively support traffic splitting between deployed models, allowing gradual rollout and canary testing.

Why this answer

Vertex AI Endpoints natively support traffic splitting between model versions deployed to the same endpoint. This allows you to assign a percentage of traffic (e.g., 5%) to a canary version and the remainder to the stable version, and then adjust the split over time via the console or API as performance metrics dictate. This approach avoids the complexity and latency of external load balancers or application-level routing.
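
A traffic split on an endpoint is a map from deployed model ID to a percentage, and the percentages must total 100. The sketch below only computes that map for a gradual rollout; the deployed model IDs and the rollout schedule are hypothetical, and applying each split would still go through the console or API.

```python
def canary_split(stable_id: str, canary_id: str, canary_percent: int) -> dict[str, int]:
    """Build a traffic split map: deployed model ID -> percent of traffic."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {stable_id: 100 - canary_percent, canary_id: canary_percent}


# Hypothetical deployed model IDs and rollout schedule.
STABLE, CANARY = "1111", "2222"
for pct in (5, 25, 50, 100):
    split = canary_split(STABLE, CANARY, pct)
    assert sum(split.values()) == 100  # the percentages must total 100
    print(split)
```

In practice each step of the schedule would be gated on the canary's performance metrics rather than applied unconditionally.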

Exam trap

Google Cloud often tests the misconception that you need an external load balancer or separate endpoints for canary deployments, when in fact Vertex AI's native traffic splitting is the correct and simplest approach.

How to eliminate wrong answers

Option A is wrong because creating two separate endpoints with an external load balancer adds unnecessary infrastructure complexity, latency, and cost; Vertex AI already provides built-in traffic splitting within a single endpoint. Option C is wrong because Cloud Armor is a web application firewall and DDoS protection service, not a traffic routing mechanism for model versions; it cannot perform weighted backend routing for Vertex AI endpoints. Option D is wrong because implementing feature flags in application code for model selection bypasses Vertex AI's managed traffic splitting, introduces custom logic that must be maintained, and does not leverage the platform's native canary deployment capabilities.

929

MCQmedium

A company has a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is failing with 'deadline exceeded' errors during peak hours. The team suspects that the pipeline cannot keep up with the incoming data rate. They also notice that the autoscaling algorithm sets maxNumWorkers to 10, but the pipeline only scales to 5 workers. What is the most likely cause of the inadequate scaling?

A.The maxNumWorkers setting is too low and should be reduced to trigger more aggressive scaling

B.BigQuery streaming quota is limiting the number of concurrent writes

C.The Pub/Sub subscription has a per-subscriber throughput limit of 5 workers

D.The pipeline is CPU-bound and the autoscaler evaluates that adding more workers would not improve throughput

AnswerD

Autoscaler uses utilization metrics; if workers are already saturated, it may not add more.

Why this answer

Option D is correct because the autoscaler in Dataflow evaluates CPU utilization and throughput per worker. If the pipeline is CPU-bound, adding more workers does not reduce per-worker CPU load or improve throughput, so the autoscaler stops at 5 workers even though maxNumWorkers is 10. This is a classic symptom of a bottleneck that cannot be parallelized further, such as a single-threaded transformation or a hot key in a GroupByKey operation.

Exam trap

The trap here is that candidates assume autoscaling always scales to maxNumWorkers when there is a backlog, but the autoscaler only adds workers if they will actually improve throughput, and a CPU-bound pipeline is a common reason for scaling to stall.

How to eliminate wrong answers

Option A is wrong because reducing maxNumWorkers would further restrict scaling, not trigger more aggressive scaling; the autoscaler already has permission to scale to 10 but chooses not to. Option B is wrong because BigQuery streaming quota limits the rate of inserts, not the number of concurrent workers; quota exhaustion would cause insert errors, not prevent the autoscaler from adding workers. Option C is wrong because Pub/Sub subscriptions have a per-subscriber throughput limit that is very high (typically hundreds of MB/s per subscriber), and the pipeline is not hitting that limit; the limit is on throughput, not on the number of subscribers.

930

MCQhard

A BigQuery table has a REQUIRED column 'user_id' that now needs to accept NULL values due to upstream data changes. You want to alter the schema with minimal downtime and no data loss. What should you do?

A.Run `ALTER TABLE dataset.table ALTER COLUMN user_id DROP NOT NULL;`

B.Use the bq command: `bq update --set_nullable_fields user_id dataset.table`

C.Create a view that casts user_id to NULLABLE and use the view instead.

D.Drop the table and recreate it with the column as NULLABLE.

AnswerA

This BigQuery DDL statement changes the column to nullable without downtime or data loss.

Why this answer

BigQuery allows changing a column from REQUIRED to NULLABLE using the `ALTER TABLE ... ALTER COLUMN ... DROP NOT NULL` statement. This operation is a metadata change and does not require table recreation or data copy. Dropping and recreating the table would cause downtime and data loss.

Using a view is a workaround but doesn't change the underlying schema. Exporting and reloading is disruptive.

931

Multi-Selecthard

A company uses Workflows to orchestrate a multi-step data pipeline. One step calls an HTTP endpoint that may take up to 10 minutes, but the default Workflows timeout is too short. They also need to handle transient errors with retries. Which TWO configurations should they apply? (Choose 2)

Select 2 answers

A.Set a step timeout of 600 seconds for the HTTP call step

B.Configure a dead letter queue for failed steps

C.Use the default retry policy on the step

D.Set the workflow execution timeout to 600 seconds

E.Add a retry policy on the step with appropriate conditions for transient errors

AnswersA, E

This extends the timeout for that specific step to 10 minutes.

Why this answer

To extend the timeout, set a step timeout of 600 seconds (10 minutes). To handle transient errors, use a retry policy with appropriate conditions. Setting the entire workflow timeout to 10 minutes is not necessary if individual step timeouts are set.

The default retry policy does not cover all transient errors. Adding a dead letter queue is for event-driven patterns, not Workflows.
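
The two settings can be sketched as a single Workflows step, shown here as a Python dictionary dumped to JSON (Workflows accepts JSON as well as YAML). The step name, URL, and retry numbers are hypothetical. The built-in `http.default_retry_predicate` is used for brevity; if it misses errors that matter for a given service, a custom predicate subworkflow could take its place.

```python
import json

# One Workflows step in JSON form. Step name, URL, and retry values are hypothetical.
long_http_step = {
    "call_slow_service": {
        "try": {
            "call": "http.get",
            "args": {
                "url": "https://example.com/long-running-job",
                "timeout": 600,  # seconds to wait for the HTTP response
            },
            "result": "job_response",
        },
        "retry": {
            # Built-in predicate that retries on common transient HTTP errors.
            "predicate": "${http.default_retry_predicate}",
            "max_retries": 5,
            "backoff": {"initial_delay": 2, "max_delay": 60, "multiplier": 2},
        },
    }
}

print(json.dumps([long_http_step], indent=2))
```

Wrapping the call in `try` with a `retry` block keeps the retry behavior attached to the one slow step instead of the whole workflow.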

932

MCQeasy

A team has trained a scikit-learn model and wants to deploy it to AI Platform Prediction for online predictions. What is the required format for the model artifact?

A.A model.joblib file (or model.pkl) along with any custom code.

B.A single .h5 file containing the model weights.

C.A SavedModel directory containing the model for TensorFlow.

D.A model.pt file for PyTorch models.

AnswerA

AI Platform supports joblib/pickle for scikit-learn.

Why this answer

AI Platform Prediction (now Vertex AI) supports scikit-learn models natively. The required artifact format is a serialized model file (model.joblib or model.pkl) optionally accompanied by any custom code dependencies. This is because scikit-learn models are pickled objects, and the platform deserializes them using the same Python environment specified in the runtime version.

Exam trap

Google Cloud often tests the misconception that all ML frameworks use a single universal model file format, when in fact each framework (scikit-learn, TensorFlow, PyTorch) has its own required artifact format for AI Platform Prediction.

How to eliminate wrong answers

Option B is wrong because .h5 files are specific to Keras/TensorFlow models, not scikit-learn; for scikit-learn, AI Platform Prediction expects a serialized joblib or pickle file. Option C is wrong because a SavedModel directory is the required format for TensorFlow models, not for scikit-learn models. Option D is wrong because model.pt files are the PyTorch serialization format, not a scikit-learn artifact; serving PyTorch on AI Platform Prediction requires a custom prediction routine or a custom container rather than a raw .pt file.

933

MCQeasy

To enable data lineage tracking in BigQuery, which feature should be activated?

A.BigQuery Audit Logs

B.Data Catalog

C.Dataplex Lineage

D.BigQuery Lineage API

AnswerD

The Lineage API provides data lineage for BigQuery assets.

Why this answer

BigQuery Lineage API allows tracking data lineage for tables and views. Dataplex Lineage also provides lineage but requires Dataplex. BigQuery Audit Logs capture metadata changes but not lineage specifically.

Data Catalog is for metadata management but not lineage tracking.

934

Multi-Selecteasy

A company is designing a data processing pipeline for real-time sensor data. They want to ensure low latency and exactly-once processing semantics. Which two Google services should they combine to achieve this? (Choose 2)

Select 2 answers

A.Cloud Dataproc with Spark Streaming

B.Cloud Functions with Cloud Pub/Sub triggers

C.Cloud Pub/Sub with exactly-once delivery

D.Cloud Dataflow with exactly-once processing mode

E.Cloud IoT Core with device gateways

AnswersC, D

Pub/Sub can be configured for exactly-once delivery to subscribers.

Why this answer

Cloud Pub/Sub with exactly-once delivery (Option C) ensures that each message is delivered to subscribers exactly once, preventing duplicates in the pipeline. Cloud Dataflow with exactly-once processing mode (Option D) provides end-to-end exactly-once semantics by leveraging consistent snapshots and idempotent sinks, which is critical for real-time sensor data pipelines requiring low latency and accuracy.

Exam trap

Google Cloud often tests the misconception that Cloud Pub/Sub alone provides end-to-end exactly-once processing, but candidates must recognize that Pub/Sub only guarantees delivery exactly once to subscribers, while Dataflow is needed to ensure processing exactly once across transformations and sinks.

935

MCQhard

A Dataflow streaming pipeline reads from Pub/Sub, processes events with a fixed window of 1 minute, and writes to BigQuery. Some events arrive late due to network issues. You need to ensure late events are still included in the correct window but the pipeline must not wait indefinitely. What configuration should you use?

A.Set allowed lateness to 5 minutes and use the default trigger

B.Use a sliding window of 1 minute with a 1-minute period

C.Use a global window with a trigger that fires every 10 seconds

D.Increase the watermark estimate to 10 minutes

AnswerA

This accepts late events for up to 5 minutes after the window ends; the default trigger fires when the watermark passes the end of the window and fires again as late events arrive within the allowed lateness.

Why this answer

Allowed lateness controls how long the pipeline keeps a window open for late data after the watermark passes the end of the window. The default trigger fires when the watermark passes the end of the window, and late events are still processed until the allowed lateness expires; events arriving after that are dropped, so the pipeline never waits indefinitely.
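
The decision logic can be modeled in a few lines of plain Python. This is a simplified illustration of the concept, not the Apache Beam API, and exact boundary handling in a real runner may differ; times are plain seconds for readability.

```python
WINDOW_SIZE = 60        # fixed 1-minute windows, in seconds
ALLOWED_LATENESS = 300  # 5 minutes, in seconds


def window_for(event_time: int) -> tuple[int, int]:
    """Return the [start, end) fixed window that contains event_time."""
    start = event_time - (event_time % WINDOW_SIZE)
    return start, start + WINDOW_SIZE


def classify(event_time: int, watermark: int) -> str:
    """Classify an element given the watermark at the moment it arrives."""
    _, end = window_for(event_time)
    if watermark < end:
        return "on time"
    if watermark < end + ALLOWED_LATENESS:
        return "late, still added to its window"
    return "dropped"


# One event with timestamp 30, observed at three different watermark positions.
event = 30
for watermark in (45, 200, 400):
    print(window_for(event), watermark, classify(event, watermark))
```

The event always belongs to the same window; only the watermark position at arrival decides whether it is counted on time, counted late, or discarded.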

936

MCQmedium

A data engineer is designing a batch ETL pipeline using Cloud Composer and Dataflow. The pipeline must be self-healing and retry on failures. Which Composer feature should they configure?

A.Use Cloud Tasks for retries

B.Retry policy on the DAG

C.Cloud Composer with high availability

D.Dataflow retries

AnswerB

Composer DAGs can have retry policies for tasks.

Why this answer

Option B is correct because Cloud Composer (based on Apache Airflow) allows you to configure a retry policy directly on the DAG or individual tasks. This enables the pipeline to automatically retry failed tasks according to parameters like `retries`, `retry_delay`, and `retry_exponential_backoff`, making the ETL pipeline self-healing without external services.
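
In Airflow these parameters are commonly supplied once through a `default_args` dictionary passed to the DAG, so every task inherits them. The values below are illustrative assumptions, and the dictionary is shown on its own so the snippet runs without Airflow installed.

```python
from datetime import timedelta

# Values are illustrative assumptions; tune them for the real pipeline.
default_args = {
    "retries": 3,                              # retry each failed task up to 3 times
    "retry_delay": timedelta(minutes=5),       # wait before the first retry
    "retry_exponential_backoff": True,         # lengthen the wait on each retry
    "max_retry_delay": timedelta(minutes=30),  # cap on the backoff wait
}

print(default_args)
```

Individual tasks can still override any of these values, for example to give the Dataflow launch task more retries than lightweight tasks.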

Exam trap

Google Cloud often tests the distinction between orchestration-level retries (Composer DAG) and execution-level retries (Dataflow), leading candidates to pick Dataflow retries (Option D) when the question explicitly asks for a Composer feature.

How to eliminate wrong answers

Option A is wrong because Cloud Tasks is a fully managed queue service for asynchronous task execution, not a feature of Cloud Composer; it would introduce unnecessary complexity and is not the native way to handle retries within a Composer DAG. Option C is wrong because high availability (HA) for Cloud Composer ensures the Airflow components are resilient to zone failures, but it does not configure task-level retry behavior for pipeline failures. Option D is wrong because Dataflow retries handle failures at the Dataflow job level (e.g., worker failures), but the question asks for a Composer feature to manage retries of the overall pipeline orchestration, not the underlying data processing job.

937

Matchingmedium

Match each BigQuery feature to its description.

Matches

Clustering → Sorting data within partitions to improve query performance

Partitioning → Dividing tables into segments based on a date/timestamp column

Slot → Unit of computational capacity in BigQuery

Materialized view → Pre-computed query results for faster access

Why these pairings

BigQuery features that optimize performance and cost.

938

MCQeasy

A company uses Cloud Dataflow to process streaming data. They notice that the pipeline's throughput is lower than expected and the system is experiencing high latency. What is the most likely cause?

A.Using batch mode instead of streaming mode

B.Too many workers

C.Too few workers

D.Incorrect watermark setting

AnswerC

Insufficient workers cause backpressure and latency.

Why this answer

In Cloud Dataflow, streaming pipelines require sufficient worker resources to handle the incoming data rate and maintain low latency. When too few workers are provisioned, the pipeline cannot process data quickly enough, leading to increased backlog and higher latency. This is the most likely cause of reduced throughput and high latency in a streaming pipeline.

Exam trap

Google Cloud often tests whether candidates can pick the most direct cause: too few workers explains both the high latency and the low throughput of a streaming pipeline, whereas an incorrect watermark or batch mode does not.

How to eliminate wrong answers

Option A is wrong because batch mode is a separate execution mode for bounded data, and using batch mode instead of streaming mode would not cause high latency in a streaming pipeline—it would simply not process unbounded data correctly. Option B is wrong because too many workers would typically improve throughput and reduce latency, not cause high latency, unless there is excessive overhead from worker coordination, but that is less common than underprovisioning. Option D is wrong because an incorrect watermark setting affects event-time processing and windowing accuracy, but it does not directly cause lower throughput or high latency; it may cause late data handling issues or incorrect results.

939

MCQeasy

A team deployed a model to Vertex AI Endpoint and notices latency spikes during peak hours. What should they first investigate?

A.Switch to batch prediction

B.Reduce number of features

C.Increase machine type

D.Check if autoscaling is enabled and configured correctly

AnswerD

Autoscaling misconfiguration is a common cause of latency spikes during traffic surges.

Why this answer

Latency spikes during peak hours typically indicate that the serving infrastructure is unable to handle the increased request volume. The first step is to check if autoscaling is enabled and configured correctly on the Vertex AI Endpoint, as this determines whether additional compute nodes are automatically provisioned to match demand. Without proper autoscaling, the endpoint will be overwhelmed, leading to queuing delays and latency spikes.

Exam trap

Google Cloud often tests the misconception that latency spikes are always due to model complexity or feature engineering, when in fact the first diagnostic step should always be to verify the serving infrastructure's scaling configuration.

How to eliminate wrong answers

Option A is wrong because switching to batch prediction is for asynchronous, non-real-time inference and does not address the root cause of latency spikes during online serving. Option B is wrong because reducing the number of features may lower model complexity but does not directly resolve infrastructure scaling issues; latency spikes are typically due to insufficient compute resources, not feature count. Option C is wrong because increasing the machine type (e.g., using a larger VM) may improve per-request performance but does not solve the problem of handling concurrent peak traffic; without autoscaling, a single larger machine can still be overwhelmed.

940

MCQhard

A data team needs to share a BigQuery dataset with another business unit. They want to provide a point-in-time snapshot of the data without incurring additional storage costs for the copy. Which BigQuery feature should they use?

A.BigQuery table snapshots

B.BigQuery table clones

C.BigQuery authorized views

D.BigQuery export to Cloud Storage

AnswerB

Clones are writable and share storage with the base table, so no extra cost for the initial copy. They can be updated independently.

Why this answer

Clones use the same underlying storage as the source table; snapshots also share storage but are immutable. Both are cost-effective. For regular updates, clones are more flexible.

941

Multi-Selecteasy

A company wants to implement a data lake on Google Cloud. They need to store raw, structured data in open formats and allow querying directly from BigQuery without loading. Which TWO services or features should they use? (Choose 2)

Select 2 answers

A.Cloud Storage (GCS)

B.Dataproc

C.Dataflow

D.Cloud SQL

E.BigLake

AnswersA, E

GCS is the underlying storage for the data lake, storing data in open formats like Parquet/ORC.

Why this answer

Cloud Storage (GCS) is the correct choice because it serves as the underlying storage layer for a data lake on Google Cloud, allowing raw structured data to be stored in open formats such as Parquet, Avro, or ORC. BigQuery can directly query data stored in GCS using external tables, eliminating the need to load data into BigQuery storage. This decouples compute from storage, enabling cost-effective and scalable data lake architectures.

Exam trap

Google Cloud often tests the misconception that Dataproc or Dataflow are required for querying data in a data lake, when in fact BigQuery external tables and BigLake provide direct querying without loading. The key is to recognize that the storage layer (GCS) and the table layer that exposes it to BigQuery (BigLake) are the correct services.

942

Matchingmedium

Match each Google Cloud data service to its primary use case.

Matches

BigQuery → Serverless data warehouse for analytics

Cloud Storage → Object storage for unstructured data

Cloud Spanner → Globally distributed relational database

Bigtable → NoSQL wide-column database for low-latency workloads

Pub/Sub → Asynchronous messaging service for event-driven systems

Why these pairings

These are core Google Cloud data services with distinct primary use cases.

943

MCQmedium

You train a BigQuery ML linear regression model to predict house prices. The model has high bias during evaluation. Which action BEST reduces bias?

A.Decrease the learning rate in the training options

B.Add more features like number of bedrooms and square footage

C.Remove features that have low correlation with the label

D.Increase L2 regularization

AnswerB

Adding relevant features helps capture patterns, reducing bias.

Why this answer

High bias indicates underfitting. Adding more relevant features (or derived ones such as polynomial features) increases model complexity and reduces bias. Increasing L2 regularization (option D) increases bias.

Removing features (option C) increases bias. Decreasing the learning rate (option A) does not help underfitting; it may slow convergence.

944

MCQmedium

Refer to the exhibit. A team uses this Cloud Build configuration to deploy a service to Cloud Run. The deployment step fails with a 'Permission denied' error. What is the most likely cause?

A.The Dockerfile is missing from the repository.

B.The Docker image tag is missing or malformed.

C.The region 'us-central1' is incorrect for Cloud Run.

D.The Cloud Build service account does not have the Cloud Run Admin role.

AnswerD

The deploy step requires IAM permissions to create/update Cloud Run services; typically the Cloud Build service account needs roles/run.admin.

Why this answer

The Cloud Build service account (typically the default compute engine service account or a user-specified service account) must have the Cloud Run Admin role (roles/run.admin) to deploy services to Cloud Run. Without this IAM permission, the deployment step fails with a 'Permission denied' error, even if the build itself succeeds. The error occurs because Cloud Build attempts to call the Cloud Run Admin API (run.googleapis.com) to create or update the service, and the service account lacks the required authorization.

Exam trap

Google Cloud often tests the distinction between build-time errors (e.g., missing Dockerfile, malformed tags) and deployment-time permission errors, expecting candidates to recognize that a 'Permission denied' error specifically points to IAM misconfiguration rather than build configuration issues.

How to eliminate wrong answers

Option A is wrong because a missing Dockerfile would cause a build failure (e.g., 'unable to prepare context: path not found'), not a deployment-time 'Permission denied' error. Option B is wrong because a missing or malformed image tag would cause a push or pull error (e.g., 'invalid reference format'), not a permission error during deployment. Option C is wrong because 'us-central1' is a valid Cloud Run region; an incorrect region would result in a 'region not found' or 'location not found' error, not a permission error.

945

MCQhard

A healthcare analytics company runs a nightly Dataproc workflow that reads radiology reports from Cloud Storage (CSV files), transforms them using PySpark, and writes results to BigQuery. The workflow is orchestrated by Cloud Composer. Recently, the job has started failing with 'Disk quota exceeded' errors on the worker nodes. The data volume has grown 5x over the past month. Currently, the cluster uses 5 n1-standard-4 workers (each 10GB persistent disk). The PySpark jobs heavily use intermediate shuffles. You need a cost-effective solution that avoids future failures as data grows. What should you do?

A.Upgrade the worker machine type to n1-standard-8 with local SSDs for shuffle storage.

B.Increase the persistent disk size on each worker node to 100 GB.

C.Add more preemptible workers to the cluster and keep boot disk size at 10GB.

D.Use Cloud Dataflow instead of Dataproc, as it handles disk management transparently.

AnswerB

More disk space per worker allows shuffles to complete without quota errors.

Why this answer

The 'Disk quota exceeded' error occurs because the 10 GB persistent disks on the n1-standard-4 workers are too small to accommodate the intermediate shuffle data, which has grown 5x. Increasing the persistent disk size to 100 GB directly addresses the storage bottleneck without changing the machine type or incurring the cost of local SSDs, making it a cost-effective solution that scales with data growth.

Exam trap

The trap here is that candidates may over-engineer the solution by upgrading machine types or switching to a different service (Dataflow) when the root cause is simply insufficient disk space for shuffle data, which is easily fixed by increasing the persistent disk size.

How to eliminate wrong answers

Option A is wrong because upgrading to n1-standard-8 with local SSDs is overkill and more expensive; the issue is disk space for shuffle data, not CPU or memory, and local SSDs are ephemeral and not cost-effective for persistent storage needs. Option C is wrong because adding more preemptible workers does not increase the persistent disk size per worker; each worker still has only 10 GB, so shuffle data will still exceed the disk quota on those nodes. Option D is wrong because migrating to Cloud Dataflow is a significant architectural change that incurs migration costs and learning curve, and it does not address the immediate disk quota issue in the existing Dataproc workflow; Dataflow also has its own disk management limits.

946

Multi-Selecthard

You are designing a data pipeline for ML training with Vertex AI. You need to split time-series data into train/validation/test sets without leaking future data. Which THREE practices should you follow?

Select 3 answers

A.Use a sliding window validation approach for hyperparameter tuning.

B.Ensure that all data points for a given time period are in the same split.

C.Use Looker to generate the splits automatically.

D.Randomly assign rows to each split to ensure statistical distribution.

E.Use a date column to define the split boundaries.

AnswersA, B, E

Correct: sliding window respects time order.

Why this answer

Time-series data requires temporal splitting: use a date column to define the split boundaries, keep all data points for a given time period in the same split, and use sliding window validation for sequential data. Random splitting (D) is wrong because it leaks future data into training. Looker (C) is irrelevant.
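
A minimal sketch of a date-boundary split, assuming rows are plain dictionaries with a date field. The function name `temporal_split`, the field name `day`, and the cutoff dates are hypothetical and chosen only for illustration.

```python
from datetime import date

def temporal_split(rows, date_key, train_end, valid_end):
    """Split rows chronologically using two cutoff dates.

    train:      date <= train_end
    validation: train_end < date <= valid_end
    test:       date > valid_end
    """
    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        d = row[date_key]
        if d <= train_end:
            splits["train"].append(row)
        elif d <= valid_end:
            splits["validation"].append(row)
        else:
            splits["test"].append(row)
    return splits

# Hypothetical sample data: two rows share the same day and must stay together.
rows = [
    {"day": date(2024, 1, 5), "sales": 10},
    {"day": date(2024, 2, 5), "sales": 12},
    {"day": date(2024, 2, 5), "sales": 7},
    {"day": date(2024, 3, 5), "sales": 9},
]
splits = temporal_split(
    rows, "day", train_end=date(2024, 1, 31), valid_end=date(2024, 2, 29)
)
```

Because the assignment depends only on the date, every row from the same period lands in the same split, and no training row is later than any validation or test row.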

947

Multi-Selectmedium

A company uses Cloud Composer to orchestrate data pipelines. They have a DAG that runs hourly and processes files from Cloud Storage. The DAG is triggered by a Pub/Sub message sent from a Cloud Storage bucket notification. Recently, some DAG runs are not starting even though the Pub/Sub messages are published. Which two likely causes should the team investigate? (Choose TWO.)

Select 2 answers

A.The Cloud Storage bucket notification is not sending messages to the correct Pub/Sub topic, or the subscription's ack deadline is too short.

B.The DAG's start_date is set in the past and catchup is set to False, so DAG runs are only triggered on schedule.

C.The total number of DAGs in the environment exceeds the maximum limit of 100, causing DAG processing to stop.

D.The DAG's schedule interval is set too frequently, causing the executor queue to be full and new runs are skipped.

E.The Cloud Composer environment is using a pull subscription instead of a push subscription for the Pub/Sub sensor.

AnswersA, D

A is correct because misconfiguration of the notification or subscription can cause message loss.

Why this answer

Option A is correct because if the Cloud Storage bucket notification is misconfigured to send messages to the wrong Pub/Sub topic, the Pub/Sub sensor in the DAG will never receive the trigger message, causing DAG runs to not start. Additionally, if the subscription's ack deadline is too short, messages can expire and be redelivered before the sensor finishes handling them, which makes triggering unreliable. Both issues directly prevent the DAG from being triggered reliably by Pub/Sub messages.

Exam trap

Google Cloud often tests the misconception that a push subscription is required for Pub/Sub sensors in Cloud Composer, when in fact the sensor uses a pull subscription and the ack deadline is the critical parameter to manage.

948

MCQmedium

A global e-commerce platform requires a relational database that can handle millions of transactions per second across regions with strong consistency and automatic failover. The database must also support SQL joins. Which database should they choose?

A.Cloud Spanner

B.Cloud SQL

C.Cloud Firestore

D.Cloud Bigtable

AnswerA

Spanner provides global distribution, strong consistency, SQL support, and automatic failover, meeting all requirements.

Why this answer

Cloud Spanner is the correct choice because it is a globally distributed, horizontally scalable relational database that provides strong consistency and automatic failover across regions, while fully supporting SQL joins. It combines the benefits of a traditional relational database with the horizontal scalability of NoSQL systems, making it ideal for high-throughput, globally distributed applications requiring ACID transactions.

Exam trap

Google Cloud often tests the misconception that Cloud SQL is suitable for global-scale, high-consistency workloads because it is a relational database, but it lacks the global distribution and automatic failover capabilities required for millions of transactions per second across regions.

How to eliminate wrong answers

Option B (Cloud SQL) is wrong because it is a single-region, vertically scalable relational database that cannot handle millions of transactions per second across regions with automatic failover; it lacks global distribution and strong consistency across regions. Option C (Cloud Firestore) is wrong because it is a NoSQL document database that does not support SQL joins and is designed for mobile and web apps, not for high-throughput relational workloads. Option D (Cloud Bigtable) is wrong because it is a NoSQL wide-column database that does not support SQL joins or relational queries; it is optimized for analytical and time-series workloads, not transactional applications requiring ACID properties.

949

MCQhard

A retail company uses BigQuery to store sales data and wants to forecast weekly demand for the next 8 weeks using historical data from the past 2 years. They need to account for seasonality and holidays. Which BigQuery ML model type and configuration is most appropriate?

A.ARIMA_PLUS with holiday_region parameter

B.Boosted tree classifier

C.Linear regression with engineered time features

D.Time-series DECOMPOSE model

AnswerA

ARIMA_PLUS is designed for time-series forecasting with automatic seasonality detection and holiday support.

Why this answer

BigQuery ML's ARIMA_PLUS model is designed for time-series forecasting, automatically detecting seasonality and handling holiday effects via the holiday_region parameter. Linear regression would require manual feature engineering for time components. Time-series DECOMPOSE is not a model type.

Boosted trees are not natively time-series aware without feature engineering.

950

MCQmedium

You need to create a Cloud Storage bucket for a data lake that will store raw ingested data. The data must be immutable and cannot be deleted or overwritten for a compliance period of 5 years. Which feature should you enable?

A.Object Versioning

B.Lifecycle rules to delete objects after 5 years

C.Object Lock with governance mode

D.Bucket Lock with a retention policy of 5 years

AnswerD

Correct: Bucket Lock enforces immutability for the specified period.

Why this answer

Bucket Lock with a retention policy enforces a minimum retention period on all objects in the bucket. During the retention period, objects cannot be deleted or overwritten. This is exactly what a compliance-mandated immutability period requires.

951

MCQmedium

A data team uses Looker Studio to create a report that combines data from two different BigQuery tables: one with sales transactions and another with customer demographics. They need to join these tables in the report without writing SQL. Which feature should they use?

A.Data blending

B.Creating a report with multiple charts

C.Custom query in BigQuery connector

D.Calculated fields

AnswerA

Data blending allows combining data from different sources via a common key without SQL.

Why this answer

Looker Studio's data blending feature allows combining data from multiple sources (including BigQuery tables) using a common key, without writing SQL. It provides a graphical interface to define joins. Creating a custom query requires SQL.

Looker Studio reports support multiple charts, but blending is the specific feature for joining data. Calculated fields transform data within a single source.

952

MCQeasy

A company uses BigQuery for real-time analytics. They stream data from IoT devices into a BigQuery table. After a few hours, some of the recent data becomes visible in the table although it was streamed less than 10 minutes ago. The data team confirms that no one ran any manual queries. What is the most likely reason for the data visibility?

A.The data was stored in the streaming buffer for more than 24 hours, and BigQuery automatically flushes it to the table.

B.BigQuery time travel allows querying data from the past, including data still in the streaming buffer.

C.The table has an expiration set, and the data is made visible as soon as the table is about to expire.

D.The streaming buffer reached its maximum capacity (default 90 minutes) and automatically flushed the data to the table.

AnswerD

D is correct because the streaming buffer flushes data approximately every 90 minutes, making it visible.

Why this answer

Option D is correct because BigQuery's streaming buffer has a maximum capacity limit, typically around 90 minutes. When the buffer reaches this capacity, BigQuery automatically flushes the buffered data to the table, making it visible. This explains why data streamed less than 10 minutes ago became visible after a few hours.

Exam trap

The trap here is that candidates often assume streaming data is immediately visible or that time travel is responsible for visibility, but BigQuery's streaming buffer has a finite capacity that triggers automatic flushes, making data visible after a delay.

How to eliminate wrong answers

Option A is wrong because the streaming buffer does not have a 24-hour retention; data is flushed automatically within about 90 minutes or when the buffer reaches capacity, not after 24 hours. Option B is wrong because BigQuery time travel allows querying historical data within a 7-day window, but it does not cause data in the streaming buffer to become visible; it only affects how you query already-committed data. Option C is wrong because table expiration settings control when the table is deleted, not when streaming data becomes visible; data visibility is independent of table expiration.

953

Multi-Selectmedium

A data engineer is planning a time-series forecasting model using BigQuery ML ARIMA+ on a dataset with daily sales data spanning 3 years. Which TWO actions are required to prepare the data for ARIMA+? (Choose 2.)

Select 2 answers

A.Create a partition on the time column to improve performance.

B.Remove any rows with NULL values in the time column.

C.Sort the data by the time column in ascending order.

D.Ensure the time column is of type DATE or TIMESTAMP.

E.Encode the target variable using one-hot encoding.

AnswersC, D

ARIMA+ expects the data to be ordered by time.

Why this answer

ARIMA+ requires a time column and a numeric target column. The time column must be in a date/timestamp format. Additionally, the data should be sorted by time.

Missing values should be handled (e.g., filled with 0 or interpolated) but that's not a requirement of the function itself.

954

Multi-Selectmedium

Which TWO are best practices for monitoring a deployed machine learning model in production on Vertex AI?

Select 2 answers

A.Set up a weekly retraining pipeline triggered by calendar schedule

B.Enable Vertex AI Model Monitoring to track feature drift and skew

C.Monitor the training job duration to detect anomalies

D.Monitor the distribution of predictions over time to detect concept drift

E.Monitor the model's file size to ensure it hasn't changed

AnswersB, D

Model Monitoring automatically detects drift.

Why this answer

Option B is correct because Vertex AI Model Monitoring automatically tracks feature skew and drift by comparing the serving data distribution against a baseline (the training data for skew, earlier serving data for drift) using distance measures such as Jensen-Shannon divergence for numerical features and L-infinity distance for categorical features. This is a best practice for detecting data quality issues that can degrade model performance in production.
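
To make the idea of comparing distributions concrete, here is a minimal pure-Python sketch that scores the difference between a training histogram and a serving histogram with Jensen-Shannon divergence. It illustrates the general technique only and is not the Model Monitoring implementation; the bucket counts and the `DRIFT_THRESHOLD` value are hypothetical.

```python
import math

def normalize(counts):
    """Turn raw bucket counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical bucketed feature values at training time and at serving time.
training_hist = normalize([50, 30, 20])
serving_hist = normalize([20, 30, 50])

DRIFT_THRESHOLD = 0.05  # hypothetical alerting threshold

score = js_divergence(training_hist, serving_hist)
drifted = score > DRIFT_THRESHOLD
```

With base-2 logarithms the score is bounded between 0 (identical distributions) and 1, which makes a fixed threshold easier to reason about than an unbounded measure.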

Exam trap

The trap here is that candidates confuse operational maintenance tasks (like scheduled retraining) with monitoring tasks, or they focus on infrastructure metrics (like job duration or file size) instead of data and prediction distribution monitoring, which directly impact model accuracy in production.

955

MCQmedium

Your company stores sensitive customer data in Cloud Storage. You need to inspect the data for personally identifiable information (PII) and de-identify it before sharing with a third party. Which Google Cloud service should you use?

A.Security Command Center

B.Dataplex

C.Cloud Data Loss Prevention (DLP)

D.Cloud KMS

AnswerC

DLP is designed for inspecting and de-identifying sensitive data.

Why this answer

Cloud Data Loss Prevention (DLP) is the correct service because it is specifically designed to inspect, classify, and de-identify sensitive data such as PII in Cloud Storage. It provides built-in infoType detectors for over 150 types of PII and supports de-identification techniques like masking, tokenization, and encryption. This directly matches the requirement to inspect and de-identify data before sharing with a third party.

Exam trap

Google Cloud often tests the distinction between data security services (like Cloud KMS for encryption) and data inspection/de-identification services (like Cloud DLP), leading candidates to mistakenly choose Cloud KMS because they associate 'de-identify' with encryption, but Cloud KMS only manages keys, not the inspection or transformation of data content.

How to eliminate wrong answers

Option A is wrong because Security Command Center is a security and risk management platform that provides threat detection, vulnerability scanning, and compliance monitoring, but it does not have native capabilities to inspect or de-identify PII in data objects. Option B is wrong because Dataplex is a data governance and management service that helps organize, catalog, and manage data across lakes and warehouses, but it lacks built-in PII inspection and de-identification features. Option D is wrong because Cloud KMS is a key management service for creating, storing, and managing encryption keys, but it does not inspect data for PII or perform de-identification; it only provides encryption/decryption operations.

956

MCQmedium

Your team is using Vertex AI Pipelines to orchestrate a model retraining workflow. The pipeline includes a data validation step, a training step, and a model evaluation step. You want to ensure that if the evaluation step fails due to low model performance, the pipeline stops and does not deploy the model. Which approach should you use?

A.Run the evaluation step after deployment and roll back if performance is low

B.Configure the evaluation step to retry up to 3 times on failure

C.Use a Conditional in the pipeline to check evaluation metrics and only run the deployment step if metrics pass thresholds

D.Create a separate pipeline for deployment and trigger it manually after review

AnswerC

Conditionals allow pipeline to branch based on results.

Why this answer

Option C is correct because Vertex AI Pipelines supports conditional execution via the `dsl.Condition` construct from the Kubeflow Pipelines SDK (`dsl.If` in newer SDK releases), which allows you to evaluate model performance metrics (e.g., accuracy, RMSE) and gate subsequent steps. By placing the deployment step inside a conditional branch that only executes when evaluation metrics meet predefined thresholds, the pipeline automatically stops and avoids deploying a poor-performing model. This approach aligns with MLOps best practices for automated gating in production pipelines.

Exam trap

The trap here is that candidates confuse retry logic (Option B) with conditional gating, mistakenly thinking that retrying a failed evaluation step will somehow improve model performance, when in fact retries only handle transient errors, not metric-based failures.

How to eliminate wrong answers

Option A is wrong because running the evaluation step after deployment and then rolling back violates the principle of failing fast; it wastes compute resources and risks serving a bad model to users before rollback. Option B is wrong because retrying the evaluation step on failure does not address the root cause — low model performance — and would simply re-run the same evaluation, potentially masking the failure or delaying the pipeline. Option D is wrong because creating a separate pipeline for manual deployment defeats the purpose of automation and introduces human latency and error, contradicting the goal of an automated orchestrated workflow.
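
The gating logic can be sketched in plain Python. This is a stand-in for the check a pipeline conditional would perform, not Kubeflow Pipelines code; the function names, metric names, and threshold values are all hypothetical.

```python
def passes_thresholds(metrics, thresholds):
    """Return True only if every metric meets its threshold.

    thresholds maps a metric name to (direction, limit):
      "min" means the metric must be >= limit (e.g. accuracy)
      "max" means the metric must be <= limit (e.g. a loss value)
    """
    for name, (direction, limit) in thresholds.items():
        value = metrics[name]
        if direction == "min" and value < limit:
            return False
        if direction == "max" and value > limit:
            return False
    return True

def run_pipeline_tail(metrics, thresholds):
    """Simulate the end of the pipeline: deploy only when the gate passes."""
    steps = ["evaluate"]
    if passes_thresholds(metrics, thresholds):
        steps.append("deploy")
    return steps

# Hypothetical thresholds and evaluation results.
thresholds = {"accuracy": ("min", 0.90), "log_loss": ("max", 0.35)}
good_run = run_pipeline_tail({"accuracy": 0.93, "log_loss": 0.30}, thresholds)
bad_run = run_pipeline_tail({"accuracy": 0.85, "log_loss": 0.30}, thresholds)
```

A retry would re-run `evaluate` and get the same metrics, which is why only the conditional branch keeps a weak model out of deployment.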

957

MCQhard

A company runs a batch data processing workload using Dataproc clusters that are auto-scaled based on YARN memory utilization. During peak times, jobs take much longer than expected. Analysis shows the cluster is not scaling up despite high YARN memory utilization. What is the most likely cause?

A.Spark dynamic allocation is disabled, preventing executors from using added workers

B.The cluster autoscaler is misconfigured to scale based on CPU, not memory

C.The autoscaler is set to scale down secondary workers, not up

D.The cluster is using primary workers only; auto-scaling only adds secondary workers

AnswerD

Auto-scaling adds secondary workers, not primary; if only primary workers exist, no scale-up occurs.

Why this answer

Dataproc clusters have two types of workers: primary workers (which run both HDFS and compute) and secondary workers (compute-only). The autoscaler can only add or remove secondary workers; it cannot scale primary workers. If the cluster uses only primary workers, the autoscaler has no secondary workers to add, so it cannot scale up even under high YARN memory utilization.

This explains why the cluster remains static during peak times.

Exam trap

The trap here is that candidates assume autoscaling applies to all worker nodes equally, overlooking the Dataproc-specific distinction between primary and secondary workers and the autoscaler's limitation to secondary workers only.

How to eliminate wrong answers

Option A is wrong because Spark dynamic allocation controls how executors are distributed within existing nodes, not how the cluster adds new nodes; even if disabled, the autoscaler would still attempt to add workers if configured correctly. Option B is wrong because the question explicitly states the autoscaler is based on YARN memory utilization, not CPU; a misconfiguration to CPU would cause scaling based on CPU metrics, but the symptom here is no scaling at all, not scaling on the wrong metric. Option C is wrong because the autoscaler is designed to scale up secondary workers when utilization is high; a misconfiguration to scale down would cause premature removal of workers, not a failure to scale up.

958

MCQhard

A team is training a large model using a custom container with TensorFlow on Vertex AI Training. They need to use multiple GPUs across several machines. Which strategy should they implement to maximize training throughput?

A.Use Cloud TPU Pods for distributed training

B.Use Dataflow for distributed training

C.Use Vertex AI Training with a custom job specifying workerPoolSpecs and MultiWorkerMirroredStrategy

D.Use a single worker with multiple GPUs and TensorFlow MirroredStrategy

AnswerC

MultiWorkerMirroredStrategy distributes across multiple machines.

Why this answer

Option C is correct because Vertex AI Training's custom job with workerPoolSpecs enables multi-machine, multi-GPU distributed training, and TensorFlow's MultiWorkerMirroredStrategy is specifically designed for synchronous distributed training across multiple workers. This combination maximizes throughput by efficiently synchronizing gradients across all GPUs on all machines using all-reduce communication, which is essential for large model training.

Exam trap

Google Cloud often tests the distinction between single-machine multi-GPU strategies (MirroredStrategy) and multi-machine distributed strategies (MultiWorkerMirroredStrategy), leading candidates to pick D when they overlook the requirement for multiple machines.

How to eliminate wrong answers

Option A is wrong because Cloud TPU Pods are specialized hardware for Tensor Processing Units, not GPUs, and the question explicitly requires using multiple GPUs across several machines. Option B is wrong because Dataflow is a serverless, fully managed service for batch and stream data processing (e.g., Apache Beam pipelines), not designed for distributed model training with TensorFlow on GPUs. Option D is wrong because a single worker with multiple GPUs and TensorFlow MirroredStrategy only scales within one machine, failing to leverage multiple machines for distributed training across a cluster.
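
As a rough illustration of what the multi-machine layout looks like, the sketch below builds a `workerPoolSpecs` list as a plain Python structure using the REST field names of a Vertex AI custom job. The helper name, machine type, accelerator type, and image URI are assumptions for illustration; the training code inside the container would still need to create a `tf.distribute.MultiWorkerMirroredStrategy` itself.

```python
import json

def build_worker_pool_specs(image_uri, worker_count, gpus_per_machine):
    """Build a two-pool spec: pool 0 is the chief, pool 1 holds the other workers."""
    machine = {
        "machineType": "n1-standard-8",        # assumed machine type
        "acceleratorType": "NVIDIA_TESLA_T4",  # assumed GPU type
        "acceleratorCount": gpus_per_machine,
    }
    container = {"imageUri": image_uri}
    chief = {"machineSpec": machine, "replicaCount": 1, "containerSpec": container}
    workers = {
        "machineSpec": machine,
        "replicaCount": worker_count,
        "containerSpec": container,
    }
    return [chief, workers]

# Hypothetical custom training image.
specs = build_worker_pool_specs(
    "us-docker.pkg.dev/my-project/training/tf-trainer:latest",
    worker_count=3,
    gpus_per_machine=2,
)
total_gpus = sum(
    pool["replicaCount"] * pool["machineSpec"]["acceleratorCount"] for pool in specs
)
print(json.dumps(specs, indent=2))
```

Here one chief plus three workers with two GPUs each gives eight GPUs in total, which is the scale-out that a single-machine MirroredStrategy setup cannot provide.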

959

MCQmedium

A company needs to run a Spark ML training job on a Dataproc cluster with high memory per node, but the cluster should automatically scale down when idle to save costs. Which configuration should they use?

A.Use a single-node cluster with preemptible VMs

B.Enable Dataproc's default autoscaling with primary workers as preemptible

C.Create a cluster with custom machine types and no autoscaling

D.Use a Dataproc cluster with preemptible secondary workers and cluster autoscaling

AnswerD

Why this answer

Option D is correct because it combines preemptible secondary workers for cost-effective high-memory compute with cluster autoscaling, which automatically scales down the cluster when idle. Preemptible VMs are ideal for stateless Spark ML training tasks, and autoscaling ensures the cluster shrinks to save costs during inactivity. This configuration meets the requirement of high memory per node (via primary workers) while minimizing costs through idle scaling.

Exam trap

Google Cloud often tests the misconception that preemptible VMs can be used as primary workers or that autoscaling works with preemptible primary workers, but in Dataproc, preemptible VMs are restricted to secondary workers to maintain cluster stability.

How to eliminate wrong answers

Option A is wrong because a single-node cluster cannot provide high memory per node in a distributed sense and preemptible VMs on a single node risk job failure if the VM is reclaimed; also, there is no autoscaling. Option B is wrong because Dataproc's default autoscaling does not support primary workers as preemptible—preemptible VMs are only allowed as secondary workers, and using them as primary would cause instability. Option C is wrong because custom machine types without autoscaling do not automatically scale down when idle, leading to unnecessary costs.

960

MCQeasy

A company wants to use BigQuery to query data stored in Parquet files in Cloud Storage without loading the data into BigQuery. Which BigQuery feature should they use?

A.BigQuery Omni

B.BigQuery ML

C.BigQuery external tables

D.BigQuery BI Engine

AnswerC

External tables allow querying data directly from GCS without loading into BigQuery storage.

Why this answer

BigQuery external tables allow querying data stored in Cloud Storage (including Parquet files) directly without loading it into BigQuery storage. This feature uses a federated query engine that reads the data on the fly, supporting formats like Parquet, Avro, ORC, CSV, and JSON. Option C is correct because it directly addresses the requirement to query Parquet files in Cloud Storage without ingestion.

Exam trap

Google Cloud often tests the distinction between features that query external data (external tables) versus features that process data within BigQuery (like BI Engine) or across clouds (Omni), leading candidates to confuse Omni's multi-cloud capability with external data access in the same cloud.

How to eliminate wrong answers

Option A is wrong because BigQuery Omni is designed to query data across multi-cloud environments (AWS, Azure) using BigQuery's interface, not for querying Parquet files in Cloud Storage without loading. Option B is wrong because BigQuery ML is a machine learning feature that enables creating and executing models using SQL, not for querying external data files. Option D is wrong because BigQuery BI Engine is an in-memory analysis service that accelerates dashboard queries on data already stored in BigQuery, not for querying external Parquet files in Cloud Storage.

961

MCQhard

A data engineer needs to design a Bigtable row key for a time-series IoT application where each device sends data every second. The query pattern is to retrieve all data for a specific device over a time range. Which row key design minimizes hotspots?

A.device_id#timestamp (e.g., device123#2024-03-15-10:30:00)

B.hash(device_id)#timestamp (e.g., a3f2#2024-03-15-10:30:00)

C.timestamp#device_id (e.g., 2024-03-15-10:30:00#device123)

D.device_type#device_id#timestamp

AnswerB

Hashing the device ID distributes writes across tablets, and appending timestamp allows efficient time-range scans.

Why this answer

To avoid hotspots (where all writes hit a single tablet server), the row key should start with a hash of the device ID to distribute writes across the cluster, then append the timestamp for range scans.
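
A minimal sketch of the hashed-prefix layout from option B, assuming a four-character SHA-256 prefix and the timestamp format used in the options. The function name `row_key` and the prefix length are hypothetical; a prefix this short can collide across devices, so a real design might also keep the device ID in the key.

```python
import hashlib

def row_key(device_id, timestamp, prefix_len=4):
    """Build a row key of the form <hash prefix>#<timestamp>.

    The prefix is derived only from the device ID, so all rows for one
    device share it and stay contiguous for time-range scans.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}#{timestamp}"

timestamps = [
    "2024-03-15-10:30:00",
    "2024-03-15-10:30:01",
    "2024-03-15-10:30:02",
]
keys = [row_key("device123", ts) for ts in timestamps]
prefixes = {key.split("#")[0] for key in keys}
```

The hash spreads different devices across the key space, while the fixed-width timestamp keeps each device's rows in chronological order under lexicographic sorting.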

962

Multi-Selectmedium

A data warehouse team uses Cloud BigQuery for analytics. They want to optimize query performance and reduce costs. Which three actions should they take? (Choose 3)

Select 3 answers

A.Use partitioned tables on time columns

B.Use clustered tables on frequently filtered columns

C.Use automatic reclustering

D.Use materialized views for aggregations

E.Use BI Engine for all queries

AnswersA, B, D

Partitioning allows queries to skip irrelevant partitions, reducing cost and improving speed.

Why this answer

Option A is correct because partitioning tables on time columns (e.g., DATE, TIMESTAMP) in BigQuery allows the query engine to perform partition pruning, scanning only the relevant partitions instead of the entire table. This directly reduces the amount of data read, lowering query costs and improving performance by limiting I/O to the necessary time range.

Exam trap

Google Cloud often tests the distinction between automatic reclustering as a passive maintenance feature versus an active optimization action, leading candidates to mistakenly select it as a cost-saving measure when it is actually a built-in behavior that does not require manual intervention.

963

Multi-Selectmedium

Which TWO steps are required to deploy a custom scikit-learn model to Vertex AI for online predictions?

Select 2 answers

A.Write a custom prediction routine

B.Containerize the model using Docker

C.Save the model using joblib or pickle

D.Create a Vertex AI Endpoint manually

E.Upload the model to Vertex AI Model Registry

AnswersC, E

Vertex AI expects a saved model artifact.

Why this answer

Option C is correct because scikit-learn models must be serialized using joblib or pickle to be saved as a model artifact that can be uploaded to Vertex AI. Vertex AI's pre-built prediction containers for scikit-learn expect the model file to be in this format (typically model.joblib or model.pkl) to serve online predictions.

Exam trap

Google Cloud often tests the misconception that you must always write a custom prediction routine or containerize your model, when in fact Vertex AI provides pre-built containers for popular frameworks like scikit-learn, making steps A and B unnecessary for standard deployments.

964

MCQhard

A data pipeline using Cloud Dataflow reads from a Pub/Sub subscription that has a dead letter topic configured. Some messages are being sent to the dead letter topic. Upon investigation, the engineer finds that the messages contain valid data but are malformed according to the schema. What is the most likely reason for the messages being dead-lettered?

A.The Pub/Sub topic has a schema that the messages do not comply with

B.The Pub/Sub topic is not configured with a schema

C.The Dataflow pipeline is using at-least-once delivery guarantee

D.The subscription's ack deadline is too short

AnswerA

Topic schema enforcement causes non-compliant messages to be rejected and sent to dead letter.

Why this answer

Schema enforcement on the topic validates incoming messages; if a message doesn't conform to the schema, it is forwarded to the dead letter topic.

965

Multi-Selecthard

A company is migrating their on-premises Hadoop/Spark workloads to Google Cloud. They need a fully managed service that supports existing Spark jobs with minimal code changes, allows autoscaling, and provides integration with Cloud Storage and BigQuery. The team also wants to avoid managing cluster infrastructure and pay only for what they use. Which TWO services meet these requirements? (Choose two.)

Select 2 answers

A.Dataproc Serverless (Spark)

B.Dataproc on GKE

C.Standard Dataproc cluster with preemptible workers

D.Cloud Composer with Spark

E.Dataflow with Spark Runner

AnswersA, B

Dataproc Serverless runs Spark jobs without cluster management, supports autoscaling, and integrates with Cloud Storage and BigQuery.

Why this answer

Dataproc Serverless allows running Spark jobs without managing clusters, with autoscaling and pay-per-use pricing. Dataproc on GKE enables running Spark on Kubernetes with autoscaling and is fully managed. Standard Dataproc requires cluster management and is not serverless.

Dataflow is for Beam, not Spark. Cloud Composer is for orchestration, not data processing.

966

MCQhard

A company stores sensitive customer data in BigQuery and Cloud Storage. They want to encrypt the data with customer-managed encryption keys (CMEK) and ensure that access to the key material is restricted to only approved networks. Which additional Google Cloud control should they implement to enforce network-based access to the encryption keys?

A.Identity-Aware Proxy (IAP)

B.Private Google Access

C.VPC Service Controls

D.Cloud Armor

AnswerC

VPC Service Controls allow you to define a security perimeter around Google Cloud services, including Cloud KMS, to restrict access based on network origin.

Why this answer

VPC Service Controls (VPC-SC) can create a security perimeter around Cloud KMS and BigQuery/Cloud Storage resources, preventing data exfiltration and restricting access to approved networks. VPC-SC works with CMEK to add an extra layer of network-based access control. Cloud Armor protects applications behind HTTP(S) load balancers, IAP controls access based on user identity, and Private Google Access lets VMs without external IP addresses reach Google APIs and services.

967

MCQeasy

A data scientist wants to automate retraining of a classification model when new labeled data arrives. The model is deployed on AI Platform Prediction. Which Google Cloud service should be used to orchestrate the retraining pipeline?

A.AI Platform Prediction

B.AI Platform Pipelines

C.AI Platform Continuous Evaluation

D.Cloud Dataflow

AnswerB

AI Platform Pipelines provides a way to build and orchestrate ML pipelines.

Why this answer

AI Platform Pipelines (succeeded by Vertex AI Pipelines, which is serverless) is the correct service because it is a managed orchestration engine for building, deploying, and running machine learning pipelines. It runs Kubeflow Pipelines and TensorFlow Extended (TFX) workflows, so the team can automate retraining when new labeled data arrives and get continuous training and model versioning without manual intervention.

Exam trap

Google Cloud often tests the distinction between services that execute ML tasks (like prediction or evaluation) versus services that orchestrate the workflow; the trap here is that candidates confuse AI Platform Prediction (serving) or Cloud Dataflow (data processing) with pipeline orchestration, missing that AI Platform Pipelines is purpose-built for automating multi-step ML workflows.

How to eliminate wrong answers

Option A is wrong because AI Platform Prediction is a serving endpoint for deploying trained models to make predictions; it does not orchestrate retraining pipelines. Option C is wrong because AI Platform Continuous Evaluation is a service for monitoring model performance and detecting drift, not for orchestrating retraining workflows. Option D is wrong because Cloud Dataflow is a stream and batch data processing service (based on Apache Beam) used for data transformation and ETL, not for orchestrating end-to-end ML pipelines with conditional retraining logic.
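
To make the orchestration-versus-execution distinction concrete, here is a minimal sketch in plain Python. It does not use the Kubeflow Pipelines or Vertex AI SDKs; the step names and the `run_pipeline` helper are hypothetical. The point is only that an orchestrator owns the dependency graph and the run order, while each step does the actual work.

```python
from graphlib import TopologicalSorter

# Hypothetical retraining steps: each maps to the set of steps it depends on.
RETRAINING_DAG = {
    "ingest_labels": set(),
    "train": {"ingest_labels"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_pipeline(dag, steps):
    """Run each step once, in an order that respects the dependencies."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        steps[name]()          # the step does the work...
        executed.append(name)  # ...the orchestrator only sequences it
    return executed


# Stand-in step implementations that do nothing.
noop_steps = {name: (lambda: None) for name in RETRAINING_DAG}
order = run_pipeline(RETRAINING_DAG, noop_steps)
print(order)
```

A serving endpoint, an evaluation job, or a Dataflow job could each sit behind one of these steps, which is why none of them is itself the orchestrator.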

968

MCQhard

A company has a production machine learning model deployed on Vertex AI Endpoint that predicts customer churn. The model is retrained weekly using a Vertex AI Pipeline that pulls new data from BigQuery. Recently, the model's accuracy has been declining. The data science team suspects data drift but is unsure. They have enabled Vertex AI Model Monitoring but have not set up any alerts. The team wants to diagnose and address the issue quickly. The pipeline runs successfully, and no errors are reported. The model endpoint is serving predictions with average latency of 200ms. What should the team do first?

A.Immediately trigger a retraining pipeline with more recent data

B.Increase the number of replicas to reduce latency

C.Examine Cloud Logging for prediction errors

D.Review Vertex AI Model Monitoring drift reports and set up alerts for significant drift

AnswerD

Directly addresses drift detection.

Why this answer

Option D is correct because the team has already enabled Vertex AI Model Monitoring, which automatically tracks feature distributions and prediction statistics over time. The first diagnostic step should be to review the drift reports generated by Model Monitoring to confirm whether data drift is occurring, and then set up alerts so the team is proactively notified of significant drift in the future. This directly addresses the suspected root cause without unnecessary operational changes.

Exam trap

Google Cloud often tests the misconception that any model performance decline must be fixed by immediate retraining or infrastructure scaling, when the correct first step is always to diagnose the root cause using the monitoring tools already in place.

How to eliminate wrong answers

Option A is wrong because blindly retraining with more recent data without first confirming data drift may waste resources and could even degrade model performance if the new data is not representative or contains label errors. Option B is wrong because increasing replicas addresses latency, not accuracy decline; the current 200ms latency is well within acceptable bounds and is unrelated to the accuracy problem. Option C is wrong because Cloud Logging captures prediction errors (e.g., runtime exceptions, invalid inputs), but the pipeline runs successfully with no errors, so examining logs for errors will not reveal gradual accuracy degradation caused by data drift.
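
As an illustration of the kind of quantity a drift report summarizes, the sketch below compares a training-time feature histogram with a serving-time one using Jensen-Shannon divergence, one common distance between distributions. This is plain Python rather than the Model Monitoring API, and the bucket counts and the 0.05 alert threshold are made-up values for the example.

```python
import math


def normalize(counts):
    """Turn raw bucket counts into probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; buckets with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(baseline_counts, serving_counts):
    """Jensen-Shannon divergence between two histograms over the same buckets."""
    p = normalize(baseline_counts)
    q = normalize(serving_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical histograms of one feature at training time and in production.
baseline = [50, 30, 20]
serving = [20, 30, 50]
ALERT_THRESHOLD = 0.05  # assumed value; real thresholds are chosen per feature

drift = js_divergence(baseline, serving)
should_alert = drift > ALERT_THRESHOLD
print(round(drift, 4), should_alert)
```

With base-2 logarithms the score ranges from 0 (identical distributions) to 1 (no overlap), which makes a fixed alert threshold easier to reason about.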

969

MCQhard

A financial services company deploys a fraud detection model on Vertex AI using a custom prediction container that runs a PyTorch model. The model requires GPU acceleration. The deployment succeeds but predictions return an error: 'CUDA error: out of memory'. What should the team do to resolve this issue?

A.Change the container to use a CPU-only image to avoid CUDA errors

B.Increase the GPU machine type to one with more memory (e.g., from NVIDIA T4 to A100)

C.Enable Vertex AI Model Monitoring to automatically scale the endpoint

D.Add CPU replicas to distribute the inferencing load

AnswerB

The CUDA out of memory error indicates the current GPU cannot hold the model; a larger GPU or model optimization is needed.

Why this answer

Option B is correct because the CUDA out-of-memory error indicates that the GPU's VRAM is insufficient to load the PyTorch model or process the inference batch. Increasing the GPU machine type to one with more memory, such as from an NVIDIA T4 (16 GB) to an A100 (40 or 80 GB), directly resolves the capacity issue. Vertex AI prediction endpoints allow you to select different accelerator types and sizes, and this change ensures the model fits within GPU memory.

Exam trap

The trap here is that candidates may confuse a resource exhaustion error (out of memory) with a scaling or monitoring issue, leading them to choose options like Model Monitoring or adding CPU replicas, rather than recognizing the need for a larger GPU machine type.

How to eliminate wrong answers

Option A is wrong because switching to a CPU-only image would avoid CUDA errors but would likely cause severe performance degradation or timeout, as the model requires GPU acceleration for acceptable inference latency. Option C is wrong because Vertex AI Model Monitoring is designed for detecting data drift and feature skew, not for scaling endpoints or resolving out-of-memory errors; it does not automatically adjust machine resources. Option D is wrong because adding CPU replicas does not address the GPU memory exhaustion; the error occurs on the GPU, and distributing load across CPU replicas would still route requests to GPU-backed instances that lack sufficient VRAM.
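
A rough back-of-the-envelope check shows why the fix is a larger GPU. The sketch below estimates a model's memory footprint from its parameter count and compares it with the GPU memory sizes quoted above. The 10-billion-parameter model, 2-byte (fp16) weights, and 20% overhead factor are assumptions chosen for illustration; real usage also depends on batch size and framework overhead.

```python
# Memory sizes in GB, as quoted in the explanation above.
GPU_MEMORY_GB = {"T4": 16, "A100-40GB": 40, "A100-80GB": 80}


def estimated_footprint_gb(num_params, bytes_per_param=2, overhead=1.2):
    """Weights plus an assumed 20% allowance for activations and CUDA context."""
    return num_params * bytes_per_param * overhead / 1e9


def gpus_that_fit(num_params):
    """Names of GPUs whose memory covers the estimated footprint."""
    need = estimated_footprint_gb(num_params)
    return [name for name, gb in GPU_MEMORY_GB.items() if gb >= need]


# Hypothetical model size: 10 billion parameters in fp16.
footprint = estimated_footprint_gb(10_000_000_000)
candidates = gpus_that_fit(10_000_000_000)
print(round(footprint, 1), candidates)
```

Under these assumptions the estimate is about 24 GB, which exceeds a T4 but fits either A100 variant. Reducing the inference batch size or quantizing the model are the other levers when a larger accelerator is not an option.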

970

MCQeasy

Your company has a machine learning model that predicts customer churn. The model is deployed on Vertex AI Endpoints with autoscaling. After a marketing campaign, traffic to the endpoint increases by 10x. Some predictions start failing with 'HTTP 503 Service Unavailable' errors. What is the most likely cause?

A.The model container has a memory leak.

B.The model's accuracy has degraded due to data drift.

C.The autoscaling configuration has insufficient maximum nodes to handle the traffic.

D.The model is using an older version that is not supported.

AnswerC

Autoscaling with too few max nodes cannot scale up to meet demand, causing overload and 503 errors.

Why this answer

A 503 Service Unavailable error from Vertex AI Endpoints indicates that the endpoint is overwhelmed and cannot handle the incoming request volume. With a 10x traffic spike and autoscaling configured, the most likely cause is that the autoscaling configuration has insufficient maximum nodes, so the endpoint cannot scale out enough to handle the load, causing requests to be rejected.

Exam trap

Google Cloud often tests the distinction between model-level errors (e.g., data drift, accuracy degradation) and infrastructure-level errors (e.g., 503, 429, timeout), so the trap here is that candidates confuse a model performance issue with a scaling/availability issue.

How to eliminate wrong answers

Option A is wrong because a memory leak in the model container would cause gradual performance degradation or OOM kills, not a sudden 503 error under high traffic; Vertex AI would still attempt to serve requests until the container crashes. Option B is wrong because data drift affects prediction accuracy (e.g., wrong predictions), not the availability or HTTP status of the endpoint; 503 errors are infrastructure-level, not model-level. Option D is wrong because using an unsupported older version would cause deployment or startup failures, not transient 503 errors under load; Vertex AI would reject the deployment or return a different error (e.g., 400 or 404) if the version is incompatible.
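
The capacity arithmetic behind answer C can be sketched as follows. All numbers are hypothetical: a baseline of 100 requests per second, 50 requests per second per replica, and an autoscaling ceiling of 5 replicas.

```python
import math


def replicas_needed(requests_per_second, per_replica_rps):
    """Smallest replica count that can absorb the given traffic."""
    return math.ceil(requests_per_second / per_replica_rps)


def unserved_rps(requests_per_second, per_replica_rps, max_replicas):
    """Traffic that cannot be served once the autoscaler hits its ceiling."""
    replicas = min(replicas_needed(requests_per_second, per_replica_rps), max_replicas)
    capacity = replicas * per_replica_rps
    return max(0, requests_per_second - capacity)


BASELINE_RPS = 100      # assumed traffic before the campaign
PER_REPLICA_RPS = 50    # assumed throughput of one replica
MAX_REPLICAS = 5        # assumed autoscaling ceiling

spike_rps = BASELINE_RPS * 10
print(replicas_needed(spike_rps, PER_REPLICA_RPS))             # replicas the spike requires
print(unserved_rps(spike_rps, PER_REPLICA_RPS, MAX_REPLICAS))  # requests per second left unserved
```

In this sketch the spike needs 20 replicas but only 5 are allowed, so 750 requests per second have nowhere to go; those are the requests that surface as 503 errors. Raising the maximum replica count on the deployed model removes the ceiling.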

971

MCQmedium

A company runs a Dataflow pipeline that reads from Pub/Sub, aggregates events in a 10-minute fixed window, and writes to BigQuery. Recently, the pipeline has been failing with 'high uncommitted bytes' errors during periods of high traffic. What is the most likely cause and recommended action?

A.Reduce the window size from 10 minutes to 1 minute to decrease the amount of data per window.

B.Increase the number of worker machines to handle higher throughput.

C.Use a global window with a trigger that fires early based on element count to reduce the number of open windows.

D.Set a maximum number of workers and use a Pub/Sub flow control setting to limit incoming messages.

AnswerC

A global window with early triggers can reduce the number of panes and mitigate the high uncommitted bytes problem.

Why this answer

The 'high uncommitted bytes' error in Dataflow occurs when the system holds too much data in memory across many open windows, exceeding the default 200 MB limit. Using a global window with an early trigger based on element count reduces the number of simultaneous open windows and allows data to be committed more frequently, preventing memory pressure. This approach is recommended over reducing window size or scaling workers because the root cause is window fan-out, not throughput or parallelism.

Exam trap

Google Cloud often tests the misconception that scaling workers or reducing window size solves memory pressure, when the real issue is the number of open windows in a stateful pipeline.

How to eliminate wrong answers

Option A is wrong because reducing the window size from 10 minutes to 1 minute increases the number of open windows (from 6 per hour to 60 per hour), which would worsen the 'high uncommitted bytes' issue by creating more in-memory state. Option B is wrong because increasing worker machines does not address the fundamental problem of excessive open windows consuming memory; it may temporarily mask the issue but will not reduce the per-worker uncommitted bytes. Option D is wrong because setting a maximum number of workers and Pub/Sub flow control limits incoming messages but does not reduce the number of open windows or the memory used by uncommitted data; it may cause backpressure and data loss without fixing the window state explosion.
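
The window counts cited for Option A can be checked with a short sketch of fixed-window assignment. This is plain Python, not Apache Beam; it assumes one event per second over a single hour and uses hypothetical helper names.

```python
def fixed_window_start(event_time_s, window_size_s):
    """Start (in seconds) of the fixed window containing the event timestamp."""
    return event_time_s - (event_time_s % window_size_s)


def windows_per_hour(window_size_s):
    """Count distinct fixed windows touched by one event per second for an hour."""
    one_event_per_second = range(3600)
    return len({fixed_window_start(t, window_size_s) for t in one_event_per_second})


print(windows_per_hour(600))  # 10-minute windows
print(windows_per_hour(60))   # 1-minute windows
```

Ten-minute windows yield 6 windows per hour and one-minute windows yield 60, matching the figures in the explanation.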

972

MCQmedium

Your company ingests millions of events per second into a Pub/Sub topic. The downstream consumer must process events with minimal latency and high throughput. However, the consumer occasionally falls behind during traffic spikes, and you need to ensure no data loss while minimizing costs. Which subscription type and configuration should you choose?

A.Push subscription with a load balancer

B.Pull subscription with flow control settings

C.Push subscription with endpoint on Cloud Run

D.Pull subscription with exactly-once delivery disabled

AnswerB

Pull subscriptions allow the subscriber to pull messages at its own pace, and flow control helps prevent overwhelming the consumer. This combination handles high throughput efficiently.

Why this answer

Pull subscriptions let the subscriber control its own intake rate by batching messages and applying flow control, which suits high-throughput workloads. Messages the consumer has not yet acknowledged stay in the subscription during spikes, so at-least-once delivery combined with idempotent processing (or exactly-once delivery, where available) prevents data loss. Push subscriptions have lower throughput limits and are not suited to millions of events per second.
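
The effect of flow control can be illustrated with a small simulation. This is a plain-Python model, not the Pub/Sub client library; `max_outstanding` stands in for a subscriber-side flow-control limit, and the arrival pattern and processing rate are invented for the example.

```python
def simulate_pull(arrivals_per_tick, max_outstanding, processed_per_tick):
    """Simulate a pull subscriber that never holds more than max_outstanding messages.

    Returns (peak_in_flight, processed_total, backlog_left_in_subscription).
    """
    backlog = 0          # unacknowledged messages retained by the subscription
    in_flight = 0        # messages currently held by the consumer
    peak_in_flight = 0
    processed = 0
    for arrivals in arrivals_per_tick:
        backlog += arrivals
        # Pull only as much as the flow-control limit allows.
        pulled = min(backlog, max_outstanding - in_flight)
        backlog -= pulled
        in_flight += pulled
        peak_in_flight = max(peak_in_flight, in_flight)
        # Process and acknowledge up to the consumer's capacity.
        done = min(in_flight, processed_per_tick)
        in_flight -= done
        processed += done
    return peak_in_flight, processed, backlog


# A spike of 50 messages in the second tick, followed by quiet ticks.
arrivals = [5, 50, 5, 0, 0, 0, 0, 0]
result = simulate_pull(arrivals, max_outstanding=10, processed_per_tick=10)
print(result)
```

The consumer never holds more than 10 messages at once, the spike waits in the subscription instead of overwhelming the consumer, and all 60 messages are eventually processed once traffic subsides.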

973

MCQmedium

A company is using Dataflow to stream data from Cloud Pub/Sub to BigQuery. The pipeline includes a custom ParDo transformation that enriches the data with external API calls. The pipeline is experiencing high latency and occasional failures due to API timeouts. What strategy should be employed to improve reliability and performance?

A.Remove the enrichment step and store raw data in BigQuery.

B.Use a global window to accumulate all data before enrichment.

C.Use a DoFn with stateful processing and batch API calls using asynchronous HTTP client.

D.Increase the number of workers to parallelize API calls.

AnswerC

Batching and async calls reduce per-element latency and handle timeouts gracefully.

Why this answer

Option C is correct because using a DoFn with stateful processing and an asynchronous HTTP client allows the pipeline to batch API calls and handle timeouts without blocking the main processing thread. This reduces latency by enabling concurrent requests and improves reliability through retry logic and state management, which is essential for external API enrichment in Dataflow.

Exam trap

Google Cloud often tests the misconception that scaling workers (Option D) is a universal fix for performance issues, but the trap here is that API timeouts are often caused by the external service's capacity, not the pipeline's parallelism, and stateful batching with async calls is the correct architectural pattern.

How to eliminate wrong answers

Option A is wrong because removing the enrichment step defeats the purpose of the pipeline and does not address the underlying issue of API call reliability. Option B is wrong because using a global window to accumulate all data before enrichment would introduce unbounded state and memory pressure, and it does not solve API timeout problems; it would also break the streaming nature of the pipeline. Option D is wrong because simply increasing the number of workers does not fix API timeouts; it may even exacerbate the problem by overwhelming the external API with more concurrent requests, leading to more failures.
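
The batching, timeout, and retry pattern behind answer C can be sketched outside of Dataflow with `asyncio`. This is a standalone illustration rather than a Beam `DoFn`: `fake_enrich_batch` is a hypothetical stand-in for the external API, and the batch size, concurrency limit, timeout, and backoff values are arbitrary.

```python
import asyncio


async def fake_enrich_batch(batch):
    """Stand-in for one bulk call to the external API (hypothetical)."""
    await asyncio.sleep(0.01)
    return [{"id": item, "enriched": True} for item in batch]


async def call_with_retry(batch, timeout_s=1.0, attempts=3):
    """Bound each call with a timeout and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(fake_enrich_batch(batch), timeout_s)
        except asyncio.TimeoutError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            await asyncio.sleep(0.05 * 2 ** attempt)


async def enrich_all(items, batch_size=3, max_concurrency=2):
    """Group items into batches and keep at most max_concurrency calls in flight."""
    limiter = asyncio.Semaphore(max_concurrency)

    async def guarded(batch):
        async with limiter:
            return await call_with_retry(batch)

    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    results = await asyncio.gather(*(guarded(b) for b in batches))
    return [row for batch_rows in results for row in batch_rows]


enriched = asyncio.run(enrich_all(list(range(7))))
print(len(enriched))
```

Capping concurrency protects the external service from the overload described for Option D, while batching cuts the number of round trips per element. In a pipeline, the batching half would live in the stateful `DoFn` that the answer describes.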

974

Multi-Selectmedium

Which THREE metrics should be monitored to detect model drift in a production ML system?

Select 3 answers

A.Training loss convergence.

B.Prediction distribution (prediction drift).

C.Feature distribution (data drift).

D.CPU utilization of the serving nodes.

E.Model performance metrics (e.g., accuracy, precision, recall) on a ground truth dataset.

AnswersB, C, E

Changes in prediction distribution can indicate concept drift.

Why this answer

Prediction distribution (prediction drift) is a key metric for detecting model drift because it monitors changes in the model's output probabilities or class frequencies over time. A significant shift in prediction distribution often indicates that the underlying data relationships have changed, even if feature distributions remain stable. This is a direct signal of model decay and is commonly tracked using statistical tests like the Population Stability Index (PSI) or Kolmogorov-Smirnov (KS) test on the prediction scores.
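
A minimal sketch of the Population Stability Index mentioned above, computed over bucketed prediction scores. The bucket counts are invented for the example, and the small `eps` floor is one assumed way to handle empty buckets.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over matching score buckets."""
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / exp_total, eps)  # floor avoids log(0) for empty buckets
        a_pct = max(a / act_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


# Hypothetical prediction-score histograms: at launch versus this week.
launch_scores = [50, 50]
current_scores = [90, 10]

stable = psi(launch_scores, launch_scores)
shifted = psi(launch_scores, current_scores)
print(stable, round(shifted, 3))
```

PSI is 0 when the two distributions match and grows as they diverge. The same function applies to feature histograms, which covers option C as well; option E instead requires ground-truth labels to recompute accuracy, precision, or recall.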

Exam trap

Google Cloud often tests the misconception that training metrics like loss convergence are relevant for production monitoring, when in fact they apply only during the training phase and play no role in detecting post-deployment drift.

975

MCQeasy

A user gets the above error when trying to get online predictions. The model was created and the endpoint exists. What is the most likely reason?

A.The endpoint does not exist.

B.The endpoint is in a different region than the model.

C.No version of the model is deployed to the endpoint.

D.The model does not exist.

AnswerC

A model must be deployed (a model version) to the endpoint to serve predictions.

Why this answer

Option C is correct because the error 'No version of the model is deployed to the endpoint' occurs when the endpoint exists but no model version has been deployed to it. An endpoint must have at least one deployed model to serve online predictions. Without one, the endpoint is effectively empty and cannot handle inference requests, even though the endpoint resource itself has been created.

Exam trap

Google Cloud often tests the misconception that creating an endpoint automatically deploys the latest model version, when in fact you must explicitly deploy a model to the endpoint before it can serve predictions.

How to eliminate wrong answers

Option A is wrong because the user explicitly states 'the endpoint exists,' so the error is not due to a missing endpoint. Option B is wrong because a model can only be deployed to an endpoint in the same region, so a region mismatch would be caught at deployment time rather than producing this error. Option D is wrong because the model exists (the user says 'the model was created'), and the error is specifically about deployment status, not model existence.
