CCNA Design Data Systems Questions — Page 3 of 3

151

MCQ (hard)

A company needs to process sensitive healthcare data with strict compliance requirements. They want to use Cloud Dataflow but must ensure data is encrypted end-to-end and audit logs are retained. Which combination of features should they enable?

A. Use Customer-Managed Encryption Keys (CMEK) and VPC Service Controls.

B. Use Data Loss Prevention API to redact sensitive data.

C. Enable Cloud Audit Logs and VPC Service Controls.

D. Enable default encryption at rest and in transit.

Answer: A

CMEK provides control over encryption keys, and VPC Service Controls prevent data exfiltration.

Why this answer

Option A is correct because Customer-Managed Encryption Keys (CMEK) let the company control the keys that protect Dataflow data at rest, while VPC Service Controls define a service perimeter that restricts where that data can move and so prevents exfiltration. VPC Service Controls do not encrypt anything themselves; they complement CMEK by keeping the data inside the perimeter. Together, the two features address the key-control and data-movement parts of the compliance requirement.

Exam trap

The trap here is that candidates often assume default encryption (Option D) or audit logs alone (Option C) satisfy compliance requirements, but they overlook the need for customer-managed keys and network-level exfiltration controls that VPC Service Controls provide.

How to eliminate wrong answers

Option B is wrong because the Data Loss Prevention (DLP) API inspects and redacts sensitive data (e.g., PII); it does not provide encryption key management, network-level controls, or audit log retention. Option C is wrong because while Cloud Audit Logs capture API activity and VPC Service Controls provide a security perimeter, this combination lacks customer-managed encryption keys (CMEK), which the key-control part of the compliance mandate requires. Option D is wrong because default encryption at rest and in transit uses Google-managed keys, not customer-managed keys, and it does not include VPC Service Controls to prevent data exfiltration.

152

MCQ (medium)

A company has a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is failing with 'deadline exceeded' errors during peak hours. The team suspects that the pipeline cannot keep up with the incoming data rate. They also notice that maxNumWorkers is set to 10, but the pipeline only scales to 5 workers. What is the most likely cause of the inadequate scaling?

A. The maxNumWorkers setting is too low and should be reduced to trigger more aggressive scaling

B. BigQuery streaming quota is limiting the number of concurrent writes

C. The Pub/Sub subscription has a per-subscriber throughput limit of 5 workers

D. The pipeline is CPU-bound and the autoscaler evaluates that adding more workers would not improve throughput

Answer: D

Autoscaler uses utilization metrics; if workers are already saturated, it may not add more.

Why this answer

Option D is correct because the autoscaler in Dataflow evaluates CPU utilization and throughput per worker. If the pipeline is CPU-bound on work that cannot be spread across more workers, adding workers does not reduce the load on the saturated workers or improve throughput, so the autoscaler stops at 5 workers even though maxNumWorkers is 10. This is a classic symptom of a bottleneck that cannot be parallelized further, such as a single-threaded transformation or a hot key in a GroupByKey operation.
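The hot-key case can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name `busiest_worker_load`, the sample traffic, and the round-robin key assignment (a stand-in for hash-based key placement) are all hypothetical, and none of this is the Dataflow autoscaler's actual algorithm. It shows that when all records for one key must land on the same worker, doubling the worker count barely changes the load on the busiest worker.

```python
from collections import Counter


def busiest_worker_load(records_per_key, num_workers):
    """Return the record count on the most heavily loaded worker.

    All records for a given key are assigned to a single worker,
    as they would be for a grouping step. Keys are spread round-robin
    in sorted order, a simplified stand-in for hash partitioning.
    """
    load = Counter()
    for index, (key, count) in enumerate(sorted(records_per_key.items())):
        load[index % num_workers] += count
    return max(load.values())


# Hypothetical traffic: one hot key carries 90% of the records.
traffic = {"sensor-hot": 9000}
traffic.update({f"sensor-{n}": 100 for n in range(10)})

load_with_5 = busiest_worker_load(traffic, 5)
load_with_10 = busiest_worker_load(traffic, 10)
print(load_with_5, load_with_10)
```

In this model the busiest worker can never carry fewer than the 9,000 records of the hot key, no matter how many workers are added, which is the kind of situation in which extra workers do not raise throughput.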

Exam trap

The trap here is that candidates assume autoscaling always scales to maxNumWorkers when there is a backlog, but the autoscaler only adds workers if they will actually improve throughput, and a CPU-bound pipeline is a common reason for scaling to stall.

How to eliminate wrong answers

Option A is wrong because reducing maxNumWorkers would further restrict scaling, not trigger more aggressive scaling; the autoscaler already has permission to scale to 10 but chooses not to. Option B is wrong because BigQuery streaming quota limits the rate of inserts, not the number of concurrent workers; quota exhaustion would cause insert errors, not prevent the autoscaler from adding workers. Option C is wrong because Pub/Sub limits are expressed as throughput, not as a number of subscribers or workers, so nothing in Pub/Sub would pin the pipeline at 5 workers.

153

Multi-Select (easy)

A company is designing a data processing pipeline for real-time sensor data. They want to ensure low latency and exactly-once processing semantics. Which two Google services should they combine to achieve this? (Choose 2)

Select 2 answers

A. Cloud Dataproc with Spark Streaming

B. Cloud Functions with Cloud Pub/Sub triggers

C. Cloud Pub/Sub with exactly-once delivery

D. Cloud Dataflow with exactly-once processing mode

E. Cloud IoT Core with device gateways

Answers: C, D

Pub/Sub can be configured for exactly-once delivery to subscribers.

Why this answer

Cloud Pub/Sub with exactly-once delivery (Option C) ensures that each message is delivered to subscribers exactly once, preventing duplicates from entering the pipeline. Cloud Dataflow with exactly-once processing mode (Option D) extends that guarantee through the pipeline by deduplicating records and checkpointing state, and it relies on idempotent sinks for the final write, which is critical for real-time sensor data pipelines requiring low latency and accuracy.
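The role of an idempotent sink can be sketched in a few lines. This is a minimal illustration, not Dataflow's internal mechanism: the class `IdempotentSink`, the function `process`, and the sample messages are hypothetical names introduced here. The idea is that a write keyed by a unique message ID has the same effect no matter how many times the same message is delivered.

```python
class IdempotentSink:
    """Toy sink that stores at most one row per message ID."""

    def __init__(self):
        self.rows = {}

    def write(self, message_id, payload):
        """Store the payload once; return False if the ID was already written."""
        if message_id in self.rows:
            return False
        self.rows[message_id] = payload
        return True


def process(messages, sink):
    """Write every (message_id, payload) pair and count the effective writes."""
    written = 0
    for message_id, payload in messages:
        if sink.write(message_id, payload):
            written += 1
    return written


# Hypothetical deliveries: message "m1" arrives twice.
deliveries = [("m1", 20.5), ("m2", 21.0), ("m1", 20.5)]
sink = IdempotentSink()
effective_writes = process(deliveries, sink)
print(effective_writes, len(sink.rows))
```

Even though three deliveries arrive, only two rows are stored, so a redelivered sensor reading does not distort the result.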

Exam trap

Google Cloud often tests the misconception that Cloud Pub/Sub alone provides end-to-end exactly-once processing, but candidates must recognize that Pub/Sub only guarantees delivery exactly once to subscribers, while Dataflow is needed to ensure processing exactly once across transformations and sinks.

154

MCQ (medium)

A data engineer is designing a batch ETL pipeline using Cloud Composer and Dataflow. The pipeline must be self-healing and retry on failures. Which Composer feature should they configure?

A. Use Cloud Tasks for retries

B. Retry policy on the DAG

C. Cloud Composer with high availability

D. Dataflow retries

Answer: B

Composer DAGs can have retry policies for tasks.

Why this answer

Option B is correct because Cloud Composer (based on Apache Airflow) allows you to configure a retry policy directly on the DAG or individual tasks. This enables the pipeline to automatically retry failed tasks according to parameters like `retries`, `retry_delay`, and `retry_exponential_backoff`, making the ETL pipeline self-healing without external services.
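As a rough sketch of what such a retry policy looks like, the dictionary below uses the three Airflow parameter names mentioned above in the form typically passed to a DAG as `default_args`. The values are illustrative assumptions, and the helper `approximate_retry_delays` (with its `max_delay` parameter) is hypothetical: it doubles the wait on each attempt to show the general shape of exponential backoff, whereas Airflow's own computation differs in detail.

```python
from datetime import timedelta

# Task-level retry settings, usually supplied to a DAG as default_args.
# The values here are illustrative assumptions.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
}


def approximate_retry_delays(args, max_delay=timedelta(hours=1)):
    """Return a simplified list of waits before each retry attempt.

    This doubles the base delay per attempt and caps it at max_delay.
    It approximates exponential backoff and is not Airflow's exact formula.
    """
    delays = []
    for attempt in range(args["retries"]):
        delay = args["retry_delay"]
        if args["retry_exponential_backoff"]:
            delay = delay * (2 ** attempt)
        delays.append(min(delay, max_delay))
    return delays


delays = approximate_retry_delays(default_args)
print([str(d) for d in delays])
```

With these settings a failed task would be retried up to three times with growing pauses before the run is marked as failed.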

Exam trap

Google Cloud often tests the distinction between orchestration-level retries (Composer DAG) and execution-level retries (Dataflow), leading candidates to pick Dataflow retries (Option D) when the question explicitly asks for a Composer feature.

How to eliminate wrong answers

Option A is wrong because Cloud Tasks is a fully managed queue service for asynchronous task execution, not a feature of Cloud Composer; it would introduce unnecessary complexity and is not the native way to handle retries within a Composer DAG. Option C is wrong because high availability (HA) for Cloud Composer ensures the Airflow components are resilient to zone failures, but it does not configure task-level retry behavior for pipeline failures. Option D is wrong because Dataflow retries handle failures at the Dataflow job level (e.g., worker failures), but the question asks for a Composer feature to manage retries of the overall pipeline orchestration, not the underlying data processing job.

155

MCQ (easy)

A company uses Cloud Dataflow to process streaming data. They notice that the pipeline's throughput is lower than expected and the system is experiencing high latency. What is the most likely cause?

A. Using batch mode instead of streaming mode

B. Too many workers

C. Too few workers

D. Incorrect watermark setting

Answer: C

Insufficient workers cause backpressure and latency.

Why this answer

Option C is correct because too few workers is a common cause of low throughput and high latency: autoscaling may not be enabled, or the worker count may be too low for the incoming rate. Option A is wrong because a pipeline that is processing streaming data is not running in batch mode.

Option D is wrong because watermark settings affect how late data is handled, not throughput. Option B is wrong because too many workers would not cause high latency.
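The relationship between worker count and backlog can be made concrete with simple arithmetic. The sketch below is a simplified model, assuming every worker sustains the same fixed rate; the function names and the numbers are hypothetical and are not taken from any Dataflow API.

```python
import math


def workers_needed(arrival_rate, per_worker_rate):
    """Smallest worker count whose combined rate covers the arrival rate."""
    return math.ceil(arrival_rate / per_worker_rate)


def backlog_after(seconds, arrival_rate, per_worker_rate, workers):
    """Unprocessed records after the given time, never below zero."""
    capacity = per_worker_rate * workers
    return max(0, (arrival_rate - capacity) * seconds)


# Hypothetical load: 10,000 records/s arriving, 1,500 records/s per worker.
needed = workers_needed(10_000, 1_500)
backlog_with_4 = backlog_after(60, 10_000, 1_500, workers=4)
backlog_with_7 = backlog_after(60, 10_000, 1_500, workers=7)
print(needed, backlog_with_4, backlog_with_7)
```

With four workers the backlog grows by 4,000 records every second, which shows up as rising latency; with seven workers capacity exceeds the arrival rate and the backlog stays at zero.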

156

Matching (medium)

Match each Google Cloud data service to its primary use case.

Concepts

Matches

Serverless data warehouse for analytics

Object storage for unstructured data

Globally distributed relational database

NoSQL wide-column database for low-latency workloads

Asynchronous messaging service for event-driven systems

Why these pairings

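These are core Google Cloud data services with distinct primary use cases: BigQuery is the serverless data warehouse for analytics, Cloud Storage is object storage for unstructured data, Cloud Spanner is the globally distributed relational database, Bigtable is the NoSQL wide-column database for low-latency workloads, and Pub/Sub is the asynchronous messaging service for event-driven systems.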

157

MCQ (hard)

A company runs a batch data processing workload using Dataproc clusters that are auto-scaled based on YARN memory utilization. During peak times, jobs take much longer than expected. Analysis shows the cluster is not scaling up despite high YARN memory utilization. What is the most likely cause?

A. Spark dynamic allocation is disabled, preventing executors from using added workers

B. The cluster autoscaler is misconfigured to scale based on CPU, not memory

C. The autoscaler is set to scale down secondary workers, not up

D. The cluster is using primary workers only; auto-scaling only adds secondary workers

Answer: D

Auto-scaling adds secondary workers, not primary; if only primary workers exist, no scale-up occurs.

Why this answer

Dataproc clusters have two types of workers: primary workers (which run both HDFS and compute) and secondary workers (compute-only). The autoscaler can only add or remove secondary workers; it cannot scale primary workers. If the cluster uses only primary workers, the autoscaler has no secondary workers to add, so it cannot scale up even under high YARN memory utilization.

This explains why the cluster remains static during peak times.

Exam trap

The trap here is that candidates assume autoscaling applies to all worker nodes equally, overlooking the Dataproc-specific distinction between primary and secondary workers and the autoscaler's limitation to secondary workers only.

How to eliminate wrong answers

Option A is wrong because Spark dynamic allocation controls how executors are distributed within existing nodes, not how the cluster adds new nodes; even if disabled, the autoscaler would still attempt to add workers if configured correctly. Option B is wrong because the question explicitly states the autoscaler is based on YARN memory utilization, not CPU; a misconfiguration to CPU would cause scaling based on CPU metrics, but the symptom here is no scaling at all, not scaling on the wrong metric. Option C is wrong because the autoscaler is designed to scale up secondary workers when utilization is high; a misconfiguration to scale down would cause premature removal of workers, not a failure to scale up.

158

Multi-Select (medium)

A data warehouse team uses Cloud BigQuery for analytics. They want to optimize query performance and reduce costs. Which three actions should they take? (Choose 3)

Select 3 answers

A. Use partitioned tables on time columns

B. Use clustered tables on frequently filtered columns

C. Use automatic reclustering

D. Use materialized views for aggregations

E. Use BI Engine for all queries

Answers: A, B, D

Partitioning allows queries to skip irrelevant partitions, reducing cost and improving speed.

Why this answer

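Option A is correct because partitioning tables on time columns (e.g., DATE, TIMESTAMP) in BigQuery allows the query engine to perform partition pruning, scanning only the relevant partitions instead of the entire table. This directly reduces the amount of data read, lowering query costs and improving performance by limiting I/O to the necessary time range. Options B and D are correct for similar reasons: clustering sorts data by the chosen columns so that filters on them can skip blocks, and materialized views precompute aggregations so that repeated queries read less data.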
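The effect of partition pruning on bytes read can be sketched with a small model. This is an illustration only: the function `bytes_scanned`, the table layout, and the sizes are hypothetical, and BigQuery's actual billing and storage behavior is more involved. The model treats a table as one partition per day and compares a date-filtered query with and without partitioning.

```python
from datetime import date


def bytes_scanned(partitions, start, end, partitioned=True):
    """Bytes a date-range query reads in this simplified model.

    Without partitioning the whole table is read; with partitioning
    only the daily partitions inside [start, end] are read.
    """
    if not partitioned:
        return sum(partitions.values())
    return sum(size for day, size in partitions.items() if start <= day <= end)


GIB = 1024 ** 3

# Hypothetical table: 30 daily partitions of 10 GiB each.
table = {date(2024, 1, day): 10 * GIB for day in range(1, 31)}

pruned = bytes_scanned(table, date(2024, 1, 10), date(2024, 1, 12))
full = bytes_scanned(table, date(2024, 1, 10), date(2024, 1, 12), partitioned=False)
print(pruned // GIB, full // GIB)
```

In this model a three-day query reads 30 GiB from the partitioned table against 300 GiB from the unpartitioned one, a tenfold difference in data read.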

Exam trap

Google Cloud often tests the distinction between automatic reclustering as a passive maintenance feature versus an active optimization action, leading candidates to mistakenly select it as a cost-saving measure when it is actually a built-in behavior that does not require manual intervention.

159

Multi-Select (medium)

A company is planning to migrate a legacy batch ETL pipeline to Google Cloud. The pipeline involves reading from a relational database, transforming data, and writing to a data warehouse. Which three Google Cloud services can be used as the orchestration layer? (Choose three.)

Select 3 answers

A. Cloud Dataproc

B. Cloud Scheduler

C. Cloud Dataflow

D. Cloud Workflows

E. Cloud Composer

Answers: B, D, E

Cloud Scheduler can trigger jobs on a schedule, acting as a simple orchestrator.

Why this answer

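Cloud Scheduler is a fully managed cron job service that can trigger work on a schedule. It is correct because it can start a batch ETL pipeline by calling an HTTP endpoint (for example on Cloud Run or Cloud Functions) or by publishing a message to Pub/Sub, which makes it a lightweight orchestration trigger for scheduled batch jobs. Cloud Workflows and Cloud Composer are also correct: Workflows chains service calls into ordered steps, and Composer runs Apache Airflow DAGs that manage dependencies between tasks.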

Exam trap

Google Cloud often tests the distinction between data processing services (Dataproc, Dataflow) and orchestration services (Workflows, Composer, Scheduler), so candidates mistakenly select Dataproc or Dataflow thinking they can orchestrate, when they are actually execution engines.
