PDE Designing data processing systems — All Questions With Answers

Question 1 (medium, multiple choice)

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

Question 2 (hard, multiple choice)

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

Question 3 (easy, multiple choice)

A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?

Question 4 (medium, multiple choice)

A financial company processes transactions in real-time and requires exactly-once processing semantics. They also need to reprocess historical data for backtesting. Which Google Cloud service should they use?

Question 5 (hard, multiple choice)

A company is building a data lake on Cloud Storage with data from multiple sources. They need to apply schema-on-read and support ad-hoc SQL queries. Which architecture is most suitable?

Question 6 (easy, multiple choice)

A company wants to stream data from Cloud Pub/Sub into BigQuery with minimal latency. They have a small team and limited operational resources. Which approach is best?

Question 7 (medium, multiple choice)

A company has a batch ETL job that runs daily using Cloud Dataflow. The job reads from Cloud Storage, transforms data, and writes to BigQuery. Recently, the job started failing with 'Resources have been exhausted' errors. What is the most likely cause?

Question 8 (hard, multiple choice)

A company needs to process sensitive healthcare data with strict compliance requirements. They want to use Cloud Dataflow but must ensure data is encrypted end-to-end and audit logs are retained. Which combination of features should they enable?

Question 9 (easy, multiple choice)

A company is running a Cloud Dataflow streaming pipeline that aggregates events in 1-minute windows. They notice that the watermark is lagging significantly behind real-time. What is the most likely cause?

Question 10 (medium, multi select)

A data engineer is designing a batch processing system using Cloud Dataproc. Which TWO practices improve performance and reduce costs? (Choose TWO.)

Question 11 (hard, multi select)

A company is migrating an on-premises Hadoop cluster to Google Cloud. They need to run existing Spark jobs with minimal modification. Which THREE strategies should they consider? (Choose THREE.)

Question 12 (easy, multi select)

A data pipeline uses Cloud Pub/Sub to ingest events, then a Cloud Dataflow job writes to BigQuery. The Dataflow job is failing with 'deadline exceeded' errors. Which TWO actions can resolve this? (Choose TWO.)

Question 13 (hard, multiple choice)

The exhibit shows a Spark job submitted to Dataproc that fails with an out-of-memory error. Which change should be made to the submission command to resolve the issue?

Question 14 (medium, multiple choice)

The exhibit shows a Cloud Logging query result. A data engineer sees this log for a streaming Dataflow job. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
resource.type="dataflow_step"
resource.labels.job_id="2023-01-01_000000-12345678"
"worker pool exhausted"
```

Question 15 (easy, multiple choice)

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

Exhibit

Refer to the exhibit.

```
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
```

Question 16 (hard, multiple choice)

A company runs a Cloud Dataflow streaming pipeline that reads from Cloud Pub/Sub, applies 10-second fixed windows, joins with a slowly changing dimension table stored in Cloud Bigtable, and writes results to BigQuery. The pipeline has been running for months but recently started exhibiting increasing latency and occasional data loss. The pipeline uses default settings with autoscaling enabled (min 2, max 20 workers). The Bigtable cluster has 3 nodes. The dimensions are updated infrequently. The latency has grown from seconds to minutes. Examining the Dataflow monitoring UI, you see that the 'System Lag' metric is increasing, and some windows are not being emitted. The CPU utilization on Bigtable nodes is below 50%. There are no errors in the logs. Which action is most likely to resolve the issue?

Question 17 (medium, multiple choice)

A company uses Cloud Dataproc to run nightly Spark ETL jobs that process about 500 GB of data each night. The jobs currently take 4 hours to complete. The company wants to reduce the runtime to under 2 hours to meet a new SLA. The cluster is configured with 10 worker nodes (n1-standard-4) and 1 master node (n1-standard-4). The jobs are CPU-bound and use only default settings. The cluster is deleted after each job and recreated. The data is stored in Cloud Storage. The company is open to increasing cost but wants the most cost-effective solution to meet the SLA. Which approach should they take?
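For the sizing arithmetic in this scenario, a rough Python sketch follows. It assumes ideal linear scaling (runtime inversely proportional to worker count), which CPU-bound Spark jobs only approximate. `workers_needed` is a hypothetical helper, not a Dataproc API.

```python
import math

VCPUS_PER_N1_STANDARD_4 = 4  # an n1-standard-4 machine has 4 vCPUs


def workers_needed(current_workers: int, current_hours: float, target_hours: float) -> int:
    """Smallest worker count that meets target_hours, assuming linear scaling."""
    return math.ceil(current_workers * current_hours / target_hours)


# Figures from the scenario: 10 workers, 4-hour runtime, 2-hour target.
needed = workers_needed(current_workers=10, current_hours=4.0, target_hours=2.0)
total_vcpus = needed * VCPUS_PER_N1_STANDARD_4
print(needed, "workers,", total_vcpus, "vCPUs")
```

Because the SLA is strictly under 2 hours, a real cluster would need some headroom beyond this break-even figure.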

Question 18 (easy, multiple choice)

A company runs a batch ETL pipeline on Cloud Dataproc. During peak hours, the job takes longer than expected. The pipeline reads from Cloud Storage, transforms data, and writes to BigQuery. What is the most cost-effective way to improve performance without redesigning the pipeline?

Question 19 (medium, multiple choice)

A retail company processes real-time clickstream data using Cloud Pub/Sub and Dataflow. The pipeline aggregates events by user session and writes to Bigtable for low-latency queries. However, users report that session data is sometimes missing or duplicated. What is the most likely cause?

Question 20 (hard, multiple choice)

A financial services firm processes sensitive transactions using Cloud Dataflow. The pipeline reads from Pub/Sub, performs stateful processing (e.g., fraud detection), and writes to Cloud Spanner. Compliance requires exactly-once processing semantics. Which configuration ensures exactly-once processing?

Question 21 (easy, multiple choice)

A logistics company uses Cloud Functions to process incoming tracking events from IoT devices. Events are sent via HTTP triggers. During peak hours, some events fail with 500 errors. What is the best strategy to handle this reliably?

Question 22 (medium, multiple choice)

A media company ingests video files from partners via a REST API. Files are stored in Cloud Storage, and metadata is written to Firestore. A Cloud Function is triggered on object finalize to transcode video using Transcoder API. Sometimes, the function fails because the file is still being uploaded when triggered. How should this be fixed?

Question 23 (hard, multiple choice)

A healthcare company streams patient monitoring data to Cloud Pub/Sub. A Dataflow pipeline reads the stream, enriches with patient records from BigQuery, and writes to Bigtable for real-time queries. The BigQuery lookup is slow and causes pipeline lag. What is the best approach to improve performance?

Question 24 (easy, multiple choice)

A company uses Cloud Dataproc to run Spark ML jobs. The jobs are memory-intensive and often fail with OutOfMemory errors. Which action would most effectively reduce memory pressure without changing the Spark code?

Question 25 (medium, multi select)

Which TWO statements are correct about designing a data pipeline using Cloud Dataflow for processing unbounded data?

Question 26 (hard, multi select)

Which THREE considerations are important when designing a data lake on Google Cloud using Cloud Storage?

Question 27 (easy, multi select)

Which TWO approaches are recommended for handling late-arriving data in a streaming Dataflow pipeline?

Question 28 (hard, multiple choice)

A multinational e-commerce company runs a real-time recommendation system. The architecture: user click events are sent via HTTP to a Cloud Run service, which publishes them to a Cloud Pub/Sub topic. A Dataflow streaming pipeline reads from the subscription, joins with user profile data from Firestore, computes recommendations using a TensorFlow model (loaded as a side input), and writes results to a Redis cache (Memorystore) for low-latency serving. The pipeline is deployed in us-central1. Recently, the team noticed that recommendation latency has increased from 50ms to 500ms, and the pipeline's backlog is growing. The Dataflow monitoring shows high CPU utilization on workers, and the SystemLag metric is 2 minutes and increasing. The Redis cluster shows no performance issues. The Firestore queries are within normal latency. The team suspects the TensorFlow model inference is the bottleneck. The model is a large neural network (500MB) loaded in each worker's memory. The pipeline uses 10 n1-standard-4 workers. The pipeline is using Dataflow's streaming engine. The team wants to reduce latency without increasing cost significantly. What should they do?

Question 29 (medium, multiple choice)

Your company is building a real-time fraud detection system using Google Cloud. Transactions are streamed into Pub/Sub, and you need to process them with low latency (under 100ms per event) and aggregate data over sliding windows. Which Google Cloud service is best suited for this processing logic?

Question 30 (hard, multi select)

Which TWO statements about designing a data processing pipeline on Google Cloud are correct? (Choose 2.)

Question 31 (easy, multiple choice)

Based on the exhibit, what is the most likely cause of the out-of-memory error?

Exhibit

Refer to the exhibit.

```
# Dataflow pipeline error log:
Workflow failed. Causes: S02:ReadPubSub/Read+Transform/ParDo(ExtractTimestamps)+ ... (4b9c3d2e)
The job failed because a worker experienced a "out of memory" error.
```

Pipeline configuration:
- Streaming engine: disabled
- Worker machine type: n1-standard-4 (4 vCPU, 15 GB memory)
- Number of workers: 2 (autoscaling enabled, max 10)
- Input: Pub/Sub topic with 1000 messages/sec, each message ~50 KB
- Transform: Parse JSON, enrich with external API call, window into 1-minute fixed windows, write to BigQuery
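
The configuration above implies a raw ingest rate that can be estimated directly. The sketch below uses decimal units and counts only message payload bytes. Parsed objects, in-flight API calls, and buffering overhead are not modeled, so the figures are a lower bound rather than a diagnosis.

```python
# Figures taken from the pipeline configuration in the exhibit.
MESSAGES_PER_SEC = 1000
MESSAGE_KB = 50
WINDOW_SEC = 60          # 1-minute fixed windows
WORKERS = 2              # initial worker count
WORKER_MEMORY_GB = 15    # n1-standard-4

# Raw payload arriving per second, in decimal MB.
ingest_mb_per_sec = MESSAGES_PER_SEC * MESSAGE_KB / 1000

# Raw payload accumulated over one window, in decimal GB.
window_gb = ingest_mb_per_sec * WINDOW_SEC / 1000

# Total memory across the initial workers.
total_worker_memory_gb = WORKERS * WORKER_MEMORY_GB

print(f"{ingest_mb_per_sec} MB/s, {window_gb} GB per window, "
      f"{total_worker_memory_gb} GB total worker memory")
```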

Question 32 (hard, multiple choice)

You are a data engineer at a global e-commerce company. Your team manages a real-time recommendation system that ingests user clickstream events from a Pub/Sub topic (topic-clickstream). The pipeline uses Dataflow to read events, join with user profile data from Cloud Bigtable, compute recommendations using a machine learning model hosted on Cloud Run, and write results to a BigQuery table for analytics. The pipeline has been running smoothly for months, but recently the Dataflow job started failing with the error: "Workflow failed. Causes: S01:ReadPubSub/Read+Transform/ParDo(ExtractUserID)+ ... (5a3b2c1d) The job failed because a worker encountered an out-of-memory error." The Dataflow job uses the Streaming Engine feature with a worker type of n2-standard-8 (8 vCPU, 32 GB memory) and autoscaling from 2 to 20 workers. The clickstream event rate has increased from 500 events/second to 5000 events/second over the past week. The user profile data in Bigtable has also grown, with average row size increasing from 1 KB to 10 KB due to additional fields. You need to resolve the out-of-memory errors without completely redesigning the pipeline. What should you do?
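
The two growth figures in this scenario compound. As a rough sketch, assuming one Bigtable profile row is fetched per event (an assumption; the scenario does not state the lookup pattern), the volume of profile data pulled into workers grows as follows:

```python
# Before and after figures from the scenario.
old_events_per_sec, new_events_per_sec = 500, 5000
old_row_kb, new_row_kb = 1, 10

# Assumed: one profile row read per event.
old_profile_kb_per_sec = old_events_per_sec * old_row_kb
new_profile_kb_per_sec = new_events_per_sec * new_row_kb

growth_factor = new_profile_kb_per_sec / old_profile_kb_per_sec
print(old_profile_kb_per_sec, "KB/s ->", new_profile_kb_per_sec, "KB/s,",
      f"{growth_factor:.0f}x")
```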

Question 33 (medium, drag order)

Drag and drop the steps to create a Cloud Storage bucket with uniform bucket-level access into the correct order.

Question 34 (medium, drag order)

Drag and drop the steps to configure a VPC network with private Google access for on-premises connectivity using Cloud VPN into the correct order.

Question 35 (medium, drag order)

Drag and drop the steps to set up a Pub/Sub topic with a push subscription to an HTTPS endpoint into the correct order.

Question 36 (medium, matching)

Match each Google Cloud data service to its primary use case.

Matches

Serverless data warehouse for analytics

Object storage for unstructured data

Globally distributed relational database

NoSQL wide-column database for low-latency workloads

Asynchronous messaging service for event-driven systems

Question 37 (medium, matching)

Match each Google Cloud service to its data processing capability.

Matches

Unified stream and batch processing (Apache Beam)

Managed Spark and Hadoop clusters

Workflow orchestration (Apache Airflow)

Visual data integration and pipeline builder

Question 38 (medium, matching)

Match each Google Cloud monitoring/logging service to its function.

Matches

Metrics and alerting for cloud resources

Centralized log storage and analysis

Aggregates and analyzes application errors

Records administrative and data access activities

Question 39 (easy, multiple choice)

A company uses Dataflow to process streaming data from Pub/Sub. They notice increased processing latency. What is the most likely cause?

Question 40 (medium, multiple choice)

A data pipeline uses Cloud Composer to orchestrate Dataflow and BigQuery jobs. The pipeline fails intermittently with dependency errors. Which design change can improve reliability?

Question 41 (hard, multiple choice)

A company needs to process sensitive data in BigQuery with column-level security. They want to allow analysts to see aggregated data but not individual records. Which approach should they use?

Question 42 (easy, multi select)

A company uses Dataproc for transient clusters. Which TWO actions can reduce costs?

Question 43 (medium, multi select)

A data engineer is migrating on-premises Hadoop jobs to Dataproc. Which TWO considerations are important?

Question 44 (hard, multi select)

A company is building a real-time analytics pipeline with Pub/Sub and Dataflow. Which THREE best practices should they follow?

Question 45 (easy, multiple choice)

A data engineer tries to grant a service account read access to a Cloud Storage bucket using the IAM policy shown in the exhibit. The service account still cannot read objects. What is the most likely reason?

Exhibit

Refer to the exhibit.

{
  "bindings": [
    {
      "role": "roles/storage.objectViewer",
      "members": ["serviceAccount:sa@project.iam.gserviceaccount.com"],
      "condition": {
        "title": "limit_time",
        "expression": "request.time < timestamp('2023-01-01T00:00:00Z')"
      }
    }
  ]
}
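
The `condition.expression` in the exhibit compares the request time with a fixed timestamp. A minimal Python sketch of that comparison follows. It mirrors only this one expression and is not an IAM or CEL evaluator; `binding_applies` is a hypothetical name.

```python
from datetime import datetime, timezone

# Cutoff taken from the exhibit: timestamp('2023-01-01T00:00:00Z')
CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)


def binding_applies(request_time: datetime) -> bool:
    """True when request.time < CUTOFF, i.e. the conditional binding grants the role."""
    return request_time < CUTOFF


# A request made before the cutoff satisfies the condition; later ones do not.
print(binding_applies(datetime(2022, 12, 31, 23, 59, 59, tzinfo=timezone.utc)))
print(binding_applies(datetime(2024, 6, 1, tzinfo=timezone.utc)))
```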

Question 46 (medium, multiple choice)

A BigQuery query fails with the error shown in the exhibit. What is the most likely cause?

Exhibit

Refer to the exhibit.

Error: Resources exceeded during query execution.
Query statement: SELECT * 
FROM `project.dataset.table` 
WHERE date >= '2023-01-01'

Question 47 (hard, multiple choice)

A Dataflow pipeline as described in the exhibit has increasing lag. Which optimization is most likely to reduce the lag?

Exhibit

Refer to the exhibit.

Pipeline description:
- Source: PubSubIO.read()
- Transform: ParDo(Process)
- Window: Window.into(FixedWindows of 1 minute)
- Transform: GroupByKey
- Sink: Write to BigQuery using StreamingInserts
- Estimated throughput: 10MB/s
- Observed lag: increasing
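
The `Window.into(FixedWindows of 1 minute)` step in the exhibit assigns each element to a window based on its event timestamp, and the following `GroupByKey` groups per window. A minimal pure-Python sketch of that assignment (not Beam code; `fixed_window` and `group_by_window` are hypothetical names, and timestamps are plain seconds):

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE_SEC = 60  # 1-minute fixed windows, as in the exhibit


def fixed_window(event_ts: float, size: int = WINDOW_SIZE_SEC) -> Tuple[int, int]:
    """Return the half-open window [start, end) that contains event_ts."""
    start = int(event_ts // size) * size
    return (start, start + size)


def group_by_window(events: Iterable[Tuple[float, str]]) -> Dict[Tuple[int, int], List[str]]:
    """Group (timestamp, value) pairs by the fixed window of the timestamp."""
    grouped = defaultdict(list)
    for ts, value in events:
        grouped[fixed_window(ts)].append(value)
    return dict(grouped)


print(group_by_window([(5.0, "a"), (59.9, "b"), (60.0, "c"), (125.5, "d")]))
```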

Question 48 (easy, multiple choice)

A company needs to process large files (100GB each) from Cloud Storage using Dataproc. They want to minimize job execution time. Which configuration is most appropriate?

Question 49 (medium, multiple choice)

A data pipeline uses Cloud Pub/Sub to ingest events and Cloud Functions to transform and write to BigQuery. The system is experiencing data loss during Pub/Sub subscription outages. Which design change improves reliability?

Question 50 (hard, multiple choice)

A company wants to implement a near-real-time lake architecture using Cloud Storage and BigQuery. They need to enable queries on data within 5 minutes of arrival. Which approach meets the requirement with minimal operational overhead?

Question 51 (easy, multiple choice)

A data engineer needs to design a batch pipeline that processes daily log files from Cloud Storage and writes aggregated results to BigQuery. Which service is most appropriate for this ETL job?

Question 52 (medium, multiple choice)

A company uses BigQuery to run reporting queries on a table that is partitioned by date and clustered by customer_id. Queries filtering by customer_id and a date range are performing poorly. What is the most likely cause?

Question 53 (hard, multiple choice)

A financial services company needs to process high-frequency trading data with strict ordering guarantees. They use Pub/Sub with ordering keys and Dataflow. The pipeline occasionally produces out-of-order results. What is the most likely cause?

Question 54 (easy, multiple choice)

An e-commerce company processes real-time clickstream data using Pub/Sub and Dataflow. They want to ensure that if a Dataflow worker fails, the pipeline can resume processing from the point of failure without data loss. Which feature should they enable?

Question 55 (medium, multiple choice)

A financial services company uses Cloud Composer to orchestrate daily batch jobs. One job extracts data from MongoDB to Cloud Storage, then loads into BigQuery, and finally runs a Dataflow pipeline for aggregations. The Dataflow job fails intermittently. They want to automatically restart only the failed Dataflow job without re-running the earlier extraction and load. Which Airflow operator configuration should they use?

Question 56 (hard, multiple choice)

A company uses Cloud Storage to store IoT sensor data in JSON format. The data is ingested using a Cloud Function triggered by Cloud Storage events. They notice that when many files are uploaded simultaneously, some files are not processed and the Cloud Function logs show 'function execution timeout'. What is the most likely cause and solution?

Question 57 (easy, multiple choice)

An online retailer uses BigQuery for analytics. They have a time-series table with 5 billion rows and new data arrives every day. They want to optimize query performance and reduce costs by ensuring that queries scan only the partitions they need. Which table design should they use?

Question 58 (medium, multiple choice)

A data engineering team needs to process a large volume of CSV files stored in Cloud Storage using Dataproc. The files are generated hourly and each contains millions of rows. They want to minimize the number of Dataproc cluster nodes to reduce cost while processing within an hour. Which configuration should they recommend?

Question 59 (hard, multiple choice)

A gaming company uses Pub/Sub to ingest player events and Dataflow for real-time analytics. They notice that the Pub/Sub subscription backlog is growing despite the Dataflow pipeline running continuously. The pipeline has a 1-hour window for aggregations. What is the most effective way to reduce the backlog?

Question 60 (easy, multiple choice)

A startup wants to build a data lake on Google Cloud using Cloud Storage. They need to store raw data in its original format for future analysis. Which storage class should they use to optimize for cost given that data will be accessed occasionally after the first month?

Question 61 (medium, multiple choice)

A media company uses Cloud Data Loss Prevention (DLP) API to inspect and de-identify sensitive data before loading into BigQuery. They want to reduce costs by sampling the data during inspection. Which configuration should they use?

Question 62 (hard, multiple choice)

A company runs a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. They experience a sudden spike in data volume causing BigQuery write throughput to be exceeded, resulting in errors. Which strategy should they implement to handle this gracefully?

Question 63 (easy, multi select)

A company is designing a data processing pipeline for real-time sensor data. They want to ensure low latency and exactly-once processing semantics. Which two Google services should they combine to achieve this? (Choose 2)

Question 64 (medium, multi select)

A data warehouse team uses Cloud BigQuery for analytics. They want to optimize query performance and reduce costs. Which three actions should they take? (Choose 3)

Question 65 (hard, multi select)

A company uses Cloud Dataproc for large-scale Spark jobs. They notice that some jobs are failing due to insufficient memory on the worker nodes. They want to improve memory management without over-provisioning. Which three configurations should they apply? (Choose 3)

Question 66 (easy, multiple choice)

Given the query plan, what is the most likely reason this query is efficient despite processing 10 billion rows?

Exhibit

Refer to the exhibit.

```sql
SELECT product_id, SUM(amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY product_id
```
The job metadata shows: Input: 10 billion rows, Output: 500 million rows, Slot time: 20000 seconds, Elapsed time: 10 minutes, Shuffle: 100% locally, Joins: 0.
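
The job metadata permits a quick estimate of average parallelism, since slot time divided by elapsed time gives the mean number of slots in use. A short sketch using only the figures quoted above:

```python
# Figures from the job metadata in the exhibit.
SLOT_TIME_SEC = 20_000
ELAPSED_SEC = 10 * 60          # 10 minutes
INPUT_ROWS = 10_000_000_000    # 10 billion

# Mean number of slots busy over the life of the query.
avg_slots = SLOT_TIME_SEC / ELAPSED_SEC

# Mean input rows consumed per second of wall-clock time.
rows_per_sec = INPUT_ROWS / ELAPSED_SEC

print(f"average slots: {avg_slots:.1f}, rows/sec: {rows_per_sec:,.0f}")
```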

Question 67 (medium, multiple choice)

What is the most likely cause of data duplication after this command?

Question 68 (hard, multiple choice)

What is the root cause of this error and the correct solution?

Exhibit

Refer to the exhibit.

```json
{
  "error": {
    "code": 403,
    "message": "The service account 'dataflow-sa@project.iam.gserviceaccount.com' does not have 'bigquery.tables.getData' permission on table 'project:dataset.table'.",
    "status": "PERMISSION_DENIED"
  }
}
```
This error occurs when a Dataflow pipeline tries to read from a BigQuery table.

Question 69 (easy, multiple choice)

A company is designing a streaming data pipeline to process real-time clickstream events. They need to aggregate events by session window with a 5-minute gap and enable exactly-once processing semantics. Which Google Cloud service should they use?
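
Session windowing with a 5-minute gap groups a key's events until a silence of at least the gap length occurs. The pure-Python sketch below illustrates that grouping for a single key. It is a simplified model rather than pipeline code, it works on an already complete list of timestamps in seconds, and `sessionize` is a hypothetical name.

```python
from typing import Iterable, List

GAP_SEC = 5 * 60  # 5-minute session gap from the question


def sessionize(timestamps: Iterable[float], gap: float = GAP_SEC) -> List[List[float]]:
    """Split event timestamps into sessions separated by at least `gap` seconds."""
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):
        # Extend the current session if the event falls within the gap
        # of the session's latest event; otherwise start a new session.
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


print(sessionize([0, 100, 299, 700, 1000, 1001]))
# [[0, 100, 299], [700], [1000, 1001]]
```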

Question 70 (medium, multiple choice)

A data engineer is designing a batch data pipeline that reads Avro files from Cloud Storage, transforms data using Apache Beam, and writes to BigQuery. The pipeline must handle daily runs and backfills. Which runner should they use?

Question 71 (hard, multiple choice)

A company processes IoT sensor data in near real-time. They ingest data via Cloud Pub/Sub, then a Dataflow streaming pipeline writes to Bigtable for low-latency queries. Recently, they observed increased Pub/Sub message backlog during traffic spikes. What is the most effective scaling strategy?

Question 72 (easy, multiple choice)

A team needs to migrate an existing on-premises Hadoop Hive workload to Google Cloud. They want to minimize code changes and use a managed service for transient clusters. Which service should they choose?

Question 73 (medium, multiple choice)

A financial company needs to process batch trade data daily and ensure that if a transformation step fails, the entire daily run is retried from the beginning. Which design pattern is appropriate?

Question 74 (hard, multiple choice)

A data pipeline uses Cloud Pub/Sub to ingest events, then a Dataflow job writes to Cloud Storage in Avro format. The Dataflow job uses Global windows with a 10-minute trigger. The data is later loaded into BigQuery. They notice duplicate rows in BigQuery because the trigger produced multiple panes. What should the Dataflow pipeline change to eliminate duplicates?
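
To see how repeated trigger firings in a global window can yield overlapping output, the pure-Python simulation below contrasts two pane accumulation behaviors. It is a simplified model, not Beam code, and `fire_panes` is a hypothetical helper; each inner list stands for the elements that arrive between two trigger firings.

```python
from typing import List


def fire_panes(batches: List[List[str]], accumulating: bool) -> List[List[str]]:
    """Return the contents of each pane emitted by successive trigger firings."""
    panes: List[List[str]] = []
    buffer: List[str] = []
    for batch in batches:
        buffer.extend(batch)
        panes.append(list(buffer))   # the trigger fires and emits the buffer
        if not accumulating:
            buffer.clear()           # discarding mode: drop already emitted elements
    return panes


batches = [["a", "b"], ["c"], ["d", "e"]]
accumulated = fire_panes(batches, accumulating=True)
discarded = fire_panes(batches, accumulating=False)

# Rows written downstream if every pane is appended as-is.
rows_accumulating = sum(len(p) for p in accumulated)
rows_discarding = sum(len(p) for p in discarded)
print(rows_accumulating, rows_discarding)
```

In this model, appending every accumulating pane writes earlier elements again, while discarding panes contain each element once.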

Question 75 (easy, multiple choice)

A company wants to analyze server logs stored in Cloud Storage using SQL. They need to get results in seconds without setting up any clusters. Which service should they use?

Question 76 (medium, multiple choice)

A data pipeline processes streaming data from Pub/Sub to BigQuery. The pipeline needs to handle late-arriving data that is up to 1 hour late. Which Dataflow feature should be used?
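
Whether an element that arrives after its window has closed is still processed depends on how far the watermark has advanced past the end of the window. The sketch below is a simplified model of that decision with a 1-hour lateness bound. It is not Beam code, the exact boundary handling is an assumption, and `classify` is a hypothetical name.

```python
LATENESS_BOUND_SEC = 60 * 60  # tolerate data up to 1 hour late, per the question


def classify(window_end: float, watermark: float,
             lateness_bound: float = LATENESS_BOUND_SEC) -> str:
    """Classify an arriving element for a window, given the current watermark."""
    if watermark < window_end:
        return "on time"      # the window is still open
    if watermark <= window_end + lateness_bound:
        return "late"         # window closed, but still within the lateness bound
    return "dropped"          # past the bound: the window's state is gone


# Window ending at t=60s, observed at three different watermark positions.
for wm in (30, 1800, 4000):
    print(wm, classify(window_end=60, watermark=wm))
```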

Question 77 (hard, multiple choice)

A company uses Cloud Dataproc to run Spark jobs on ephemeral clusters. The input data is in Cloud Storage and output is also to Cloud Storage. The cluster is created and deleted daily. The cost is high due to spinning up nodes. Which change can reduce cost without sacrificing performance?

Question 78 (medium, multi select)

A company is designing a data lake on Google Cloud. They need to store raw data in multiple formats (CSV, Parquet, Avro) and allow various downstream processing frameworks. Which two storage solutions provide flexibility and scalability? (Choose two.)

Question 79 (hard, multi select)

A streaming pipeline uses Cloud Pub/Sub and Dataflow to process financial transactions. The pipeline must guarantee that each transaction is processed exactly once and in order per customer key. Which two configurations are necessary? (Choose two.)

Question 80 (medium, multi select)

A company is planning to migrate a legacy batch ETL pipeline to Google Cloud. The pipeline involves reading from a relational database, transforming data, and writing to a data warehouse. Which three Google Cloud services can be used as the orchestration layer? (Choose three.)

Question 81 (easy, multiple choice)

A data engineer runs this Dataflow template to load CSV files from Cloud Storage into BigQuery. The job fails with a 'File pattern not matching any files' error. What is the most likely cause?

Exhibit

Refer to the exhibit.
```
gcloud dataflow jobs run my-batch-job \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters inputFilePattern=gs://my-bucket/input/*.csv,outputTable=my-project:my_dataset.my_table
```

Question 82 (hard, multiple choice)

A team configured a garbage collection rule on a Cloud Bigtable column family with max_age of 100 seconds. After 2 minutes, they notice that data older than 100 seconds is still present. What is the most likely reason?

Exhibit

Refer to the exhibit.
```
# bigtable.gcloud
{
  "gc_rules": [
    {
      "column_family": "cf1",
      "max_age": "100s"
    }
  ]
}
# apply using gcloud bigtable app profiles create ...
```

Question 83 (easy, multiple choice)

A team has set up a push subscription to an HTTPS endpoint. They notice that messages are not being acknowledged and are resent every 10 seconds. What is the most likely issue?

Exhibit

Refer to the exhibit.
```
# Pub/Sub subscription config
projects/my-project/subscriptions/my-sub:
  topic: projects/my-project/topics/my-topic
  ackDeadlineSeconds: 10
  pushConfig:
    pushEndpoint: https://my-endpoint.example.com/push
    oidcToken:
      serviceAccountEmail: sa@my-project.iam.gserviceaccount.com
```

Question 84mediummultiple choice

Read the full Designing data processing systems explanation →

A company processes real-time clickstream data from websites. They need to aggregate user sessions that may span multiple hours and handle events that arrive late due to network delays. The pipeline must avoid discarding late data. Which Dataflow feature should they configure?
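To make the scenario concrete, here is a minimal, library-free sketch of gap-based sessionization over event timestamps. It is an illustration only, not the Dataflow API: the function name `sessionize`, the 30-minute gap, and the sample click times are all assumptions introduced for this example.

```python
from datetime import datetime, timedelta


def sessionize(event_times, gap):
    """Group event timestamps into sessions.

    A new session starts whenever the time since the previous event
    (ordered by event time) is at least `gap`.
    """
    sessions = []
    for ts in sorted(event_times):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still within the inactivity gap
        else:
            sessions.append([ts])     # gap exceeded: open a new session
    return sessions


# Hypothetical clicks; the minute-8 click is listed last to mimic late arrival.
base = datetime(2024, 1, 1, 9, 0)
clicks = [
    base,
    base + timedelta(minutes=5),
    base + timedelta(minutes=50),
    base + timedelta(minutes=8),
]
sessions = sessionize(clicks, gap=timedelta(minutes=30))
```

Because the grouping uses event time rather than arrival order, the out-of-order click still lands in the session it belongs to. A streaming engine needs extra configuration to keep a window open long enough for such events, which is what the question asks about.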

Question 85hardmultiple choice

Read the full Designing data processing systems explanation →

A data analyst frequently queries a BigQuery table that contains an array of structs representing product purchases. The query below runs slowly:

SELECT customer_id, COUNT(purchase) as total_purchases FROM sales, UNNEST(purchases) as purchase GROUP BY customer_id

What change would most improve query performance?

Question 86mediummultiple choice

Read the full Designing data processing systems explanation →

A Dataflow pipeline reads events from Pub/Sub and transforms them. Some events contain invalid product IDs that should be filtered out. The list of valid product IDs is stored in a frequently updated BigQuery table. What is the best approach to filter out invalid events?

Question 87hardmultiple choice

Read the full Designing data processing systems explanation →

A manufacturing company wants to detect anomalies in sensor data from thousands of IoT devices in real time. The data is streaming into Pub/Sub. The solution must use a machine learning model served from AI Platform that scores sensor readings aggregated over 5-minute windows. Which pipeline design meets these requirements?

Question 88easymultiple choice

Read the full Designing data processing systems explanation →

A company runs a nightly Dataproc batch job to process large log files. The job is idempotent and can tolerate node failures if restarted. Minimizing cost is critical. What is the most cost-effective cluster design?

Question 89easymultiple choice

Read the full Designing data processing systems explanation →

A company wants to implement a data lake on Google Cloud to store raw sensor data (unstructured binary files) and allow data scientists to run SQL queries on processed data. They expect to store terabytes of data and have different access patterns. Which combination of GCP services best meets these requirements?

Question 90mediummultiple choice

Read the full Designing data processing systems explanation →

A data engineering team needs to build a data integration pipeline that involves connecting to multiple sources, performing data transformations with visual editing, and then running custom machine learning algorithms. The team has both data analysts and data scientists. Which approach is most suitable?

Question 91mediummultiple choice

Read the full Designing data processing systems explanation →

A gaming company uses Avro schemas for its streaming event data. They anticipate adding new optional fields to events over time. They need to ensure backward compatibility so that existing pipelines continue to work. Which strategy should they adopt?

Question 92hardmultiple choice

Read the full Designing data processing systems explanation →

A financial services company must comply with GDPR "right to be forgotten". They store customer transactions in BigQuery partitioned by date. When a user requests deletion, all their data must be removed within 48 hours. The deletion requests are received via a Pub/Sub topic. What is the most scalable and cost-effective approach?

Question 93mediummulti select

Read the full Designing data processing systems explanation →

You are designing a streaming Dataflow pipeline that processes high-throughput data. Which two features can help minimize cost? (Choose TWO.)

Question 94hardmulti select

Read the full Designing data processing systems explanation →

A payment processing company needs to detect fraudulent transactions in real time. The system must have sub-second latency for high-value transactions and use a machine learning model. Which two components should be part of the architecture? (Choose TWO.)

Question 95hardmulti select

Read the full Designing data processing systems explanation →

You are designing a streaming pipeline that must guarantee exactly-once processing. Which three services or features can help achieve this? (Choose THREE.)

Question 96mediummultiple choice

Read the full Designing data processing systems explanation →

What is the most likely cause of this error?

Exhibit

Refer to the exhibit.

Error log from Dataflow pipeline:
"java.lang.IllegalArgumentException: Unable to convert value '2024-08-15T10:23:45.123Z' for field 'timestamp' from type 'STRING' to type 'TIMESTAMP' at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.addRecordToBatch"

Question 97hardmultiple choice

Read the full Designing data processing systems explanation →

The query in the exhibit runs slowly on the 10 TB table. Which optimization would most improve performance?

Exhibit

Refer to the exhibit.

Table details: orders (10 TB, daily partitioned by order_date, no clustering).
Query:
SELECT customer_id, COUNT(item) as items_purchased
FROM orders, UNNEST(items) as item
WHERE item.category = 'electronics'
GROUP BY customer_id
ORDER BY items_purchased DESC

Question 98easymultiple choice

Read the full Designing data processing systems explanation →

The push endpoint is returning 500 errors. What is the most likely cause?

Exhibit

Refer to the exhibit.

Created subscription 'my-sub' on topic 'my-topic' (project:my-project)
Push endpoint: https://myapp.example.com/events
Ack deadline: 10 seconds
Retry policy: Retry after 10 seconds, maximum 10 retries
No authentication configured.

Question 99mediummultiple choice

Read the full Designing data processing systems explanation →

A company is designing a streaming pipeline using Dataflow to process real-time clickstream data. The pipeline reads from Pub/Sub, performs user sessionization using Apache Beam's Session window, and writes to BigQuery. The team notices that the pipeline's lag is growing and the worker utilization is low. What is the most likely cause and recommended fix?

Question 100easymultiple choice

Read the full Designing data processing systems explanation →

A company wants to ingest IoT sensor data from thousands of devices into BigQuery for near-real-time analytics. The data volume is approximately 10 GB per hour. Which combination of Google Cloud services should they use for a cost-effective and scalable solution?

Question 101hardmultiple choice

Read the full Designing data processing systems explanation →

A Dataflow streaming pipeline uses stateful transformations with per-key state and timers. After a deployment, the team observes that the pipeline is reprocessing events from the last 30 minutes every time it restarts. The pipeline's checkpoint is configured to persist every 10 seconds. Which change should be made to prevent unnecessary reprocessing?

Question 102mediummultiple choice

Read the full Designing data processing systems explanation →

A data team uses Cloud Dataproc to run nightly Spark jobs. The job volume has increased, and the cluster is often underutilized during the day. They want to reduce costs while ensuring jobs can scale when needed. Which strategy should they adopt?

Question 103easymultiple choice

Read the full Designing data processing systems explanation →

A company processes CSV files that are uploaded to Cloud Storage by external partners. Each file is around 500 MB, and they need to be parsed and loaded into BigQuery. The processing must start as soon as the file arrives. What is the most efficient serverless architecture?

Question 104hardmultiple choice

Read the full Designing data processing systems explanation →

A company stores IoT sensor readings in BigQuery. The table is partitioned by day and clustered by sensor_id. Query performance has degraded as data grows; many queries filter by a date range and a single sensor_id. Which optimization should be applied first?

Question 105mediummultiple choice

Read the full Designing data processing systems explanation →

A data engineering team uses Cloud Data Fusion to build ETL pipelines. They have a pipeline that reads from Cloud SQL, transforms data using Wrangler, and writes to BigQuery. The pipeline fails intermittently with a 'connection timeout' error from Cloud SQL. What is the best way to handle this?

Question 106easymultiple choice

Read the full Designing data processing systems explanation →

An organization wants to automate their batch data processing pipeline using Cloud Composer. The pipeline consists of multiple tasks: extract from Cloud Storage, transform with Dataflow, and load into BigQuery. Which Airflow operator should be used to run Dataflow jobs?

Question 107hardmultiple choice

Read the full Designing data processing systems explanation →

A Dataflow streaming pipeline reads from Pub/Sub, applies a ParDo that uses a side input from a BigQuery table (refreshed hourly), and writes to BigQuery. The side input is large and causes increased latency and worker OOM errors. Which design change solves this?

Question 108mediummulti select

Read the full Designing data processing systems explanation →

A data engineer is monitoring a Dataflow streaming pipeline and notices that the 'System Lag' metric is increasing. Which TWO actions should be taken to diagnose the issue?

Question 109hardmulti select

Read the full Designing data processing systems explanation →

A company is designing a data lake on Cloud Storage for analytics. They need to store data in various formats (Avro, Parquet, CSV) and enable efficient querying with BigQuery and Dataproc. Which THREE practices should they follow?

Question 110easymulti select

Read the full Designing data processing systems explanation →

A company uses Pub/Sub to decouple services. They have a topic with two subscriptions: Subscription A is a push subscription that sends messages to a Cloud Function; Subscription B is a pull subscription used by a Dataflow job. They need to ensure that messages are processed in order for a specific device_id. Which TWO configurations should they apply?

Question 111easymultiple choice

Read the full Designing data processing systems explanation →

A streaming Dataflow job is processing messages from Cloud Pub/Sub. The job is underutilizing resources and the throughput is lower than expected. Which parameter should be adjusted to increase parallelism?

Question 112mediummultiple choice

Read the full Designing data processing systems explanation →

A company stores IoT sensor data in BigQuery. Queries that filter on a timestamp column and a device_id column are slow even though the table is partitioned by day. What should the data engineer do to improve query performance?

Question 113hardmultiple choice

Read the full Designing data processing systems explanation →

A financial services company uses Cloud Pub/Sub with ordering keys to process transactions in order. Some messages are failing processing and getting stuck. The team wants to ensure that if a message fails, it can be reprocessed later without blocking subsequent messages. What should they implement?

Question 114easymultiple choice

Read the full Designing data processing systems explanation →

A data engineer is running a Dataproc cluster for a batch ETL job that needs to process 10 TB of data. The job is memory-intensive. The cluster currently uses n1-standard-4 workers. Performance is poor. What is the most cost-effective change to improve performance?

Question 115mediummultiple choice

Read the full Designing data processing systems explanation →

A team is designing an event-driven data pipeline. They need to process messages from Cloud Pub/Sub, transform them, and write to BigQuery. The messages have variable volume and spikes. What is the best serverless compute option for this workload?

Question 116hardmultiple choice

Read the full Designing data processing systems explanation →

A Dataflow pipeline reads from Cloud Pub/Sub and writes to Cloud Storage. The pipeline needs to guarantee exactly-once processing despite worker failures. Which configuration ensures exactly-once semantics?

Question 117easymultiple choice

Read the full Designing data processing systems explanation →

A data engineer needs to automatically move objects in a Cloud Storage bucket to Nearline storage after 7 days and delete them after 30 days. Which configuration should they use?

Question 118mediummultiple choice

Read the full Designing data processing systems explanation →

A BigQuery table contains streaming data from Cloud Pub/Sub. The table is partitioned by ingestion time. A user runs a query that accesses data from the last 5 minutes and gets correct results. After 90 minutes, the user runs the same query again but notices that some rows are missing. What is the most likely cause?

Question 119hardmultiple choice

Read the full Designing data processing systems explanation →

In Cloud Composer, a DAG has two tasks: task_A (runs an Apache Spark job on Dataproc) and task_B (loads data from Cloud Storage to BigQuery). task_B must start after task_A completes. The DAG is scheduled to run hourly. Sometimes task_B starts before task_A finishes because task_A's Dataproc job appears to complete in the Airflow metadata but the data is not yet available. What is the best way to ensure task_B only runs after the data is fully written?

Question 120easymulti select

Read the full Designing data processing systems explanation →

Which TWO roles are required to allow a service account to run a Dataflow job and write results to BigQuery? (Choose two.)

Question 121mediummulti select

Read the full Designing data processing systems explanation →

A data engineer is designing a BigQuery table for time-series data that will be queried frequently by time range and also by a customer_id. Which TWO design decisions will improve query performance and manage costs? (Choose two.)

Question 122hardmulti select

Read the full Designing data processing systems explanation →

A company uses Cloud Pub/Sub with pull subscriptions to process orders. The application requires at-least-once delivery and the ability to process orders in order per customer_id. Which THREE features should they configure? (Choose three.)

Question 123easymultiple choice

Read the full Designing data processing systems explanation →

A company uses Cloud Dataflow to process streaming data. They notice that the pipeline's throughput is lower than expected and the system is experiencing high latency. What is the most likely cause?

Question 124easymultiple choice

Read the full Designing data processing systems explanation →

A data engineer needs to design a data processing system that ingests large volumes of sensor data from IoT devices. The data should be stored in a schema-less format and allow for real-time analytics. Which Google Cloud service is most appropriate?

Question 125mediummultiple choice

Read the full Designing data processing systems explanation →

A company is migrating their on-premises Apache Spark jobs to Dataproc. They want to minimize code changes and take advantage of serverless infrastructure. Which Dataproc feature should they use?

Question 126mediummultiple choice

Read the full Designing data processing systems explanation →

A data pipeline using Cloud Pub/Sub and Cloud Dataflow is experiencing duplicate messages. The source system publishes messages at least once. What Dataflow technique ensures exactly-once processing?

Question 127hardmultiple choice

Read the full Designing data processing systems explanation →

A company processes financial transactions using Cloud Dataflow. They need to ensure that late-arriving data is handled correctly for fraud detection. The pipeline uses event time processing. Which approach should they use to handle late data?

Question 128mediummultiple choice

Read the full Designing data processing systems explanation →

A data engineer is designing a batch ETL pipeline using Cloud Composer and Dataflow. The pipeline must be self-healing and retry on failures. Which Composer feature should they configure?

Question 129hardmultiple choice

Read the full Designing data processing systems explanation →

A team is using BigQuery to analyze petabyte-scale data. They notice that queries are slow and expensive due to full table scans. They have already partitioned by date. What additional optimization should they implement?

Question 130easymultiple choice

Read the full Designing data processing systems explanation →

A company needs to stream real-time user click events from a web application to BigQuery for analysis. Which Google Cloud architecture is most suitable?

Question 131mediummultiple choice

Read the full Designing data processing systems explanation →

A data pipeline reading from Cloud Storage and writing to BigQuery using Dataflow is experiencing high cost. The data is CSV and needs schema inference. What change reduces cost?

Question 132mediummulti select

Read the full Designing data processing systems explanation →

A data engineer is designing a streaming pipeline with Cloud Pub/Sub and Cloud Dataflow. They need to guarantee at-least-once delivery and handle occasional duplicates. Which TWO configurations should they implement?

Question 133hardmulti select

Read the full Designing data processing systems explanation →

A company uses Cloud Composer to orchestrate Dataproc and BigQuery jobs. They need to implement retry logic for transient failures. Which THREE features can help?

Question 134mediummulti select

Read the full Designing data processing systems explanation →

A data warehouse in BigQuery is experiencing performance issues. Which THREE techniques can improve performance without moving data to a different storage system?

Question 135hardmultiple choice

Read the full Designing data processing systems explanation →

A company runs a streaming data pipeline on Google Cloud using Cloud Pub/Sub, Cloud Dataflow, and BigQuery. The pipeline processes real-time sensor data for predictive maintenance. Recently, the Dataflow job's lag has increased from seconds to minutes, and the system shows backpressure. The pipeline uses fixed windows of 1 minute and writes results to BigQuery. The data volume has doubled. The team has already increased the number of workers. What should they do next? Options: A. Use session windows instead of fixed windows. B. Enable Streaming Engine and use Upsert to BigQuery. C. Decrease the window duration. D. Use Cloud Storage as temporary sink.

Question 136mediummultiple choice

Read the full Designing data processing systems explanation →

A data engineer is responsible for a batch ETL pipeline that runs daily using Cloud Composer and Dataproc. The pipeline extracts data from Cloud SQL, transforms it with Spark, and loads it into BigQuery. Last night, the pipeline failed because the Spark job ran out of memory. The team needs a solution that prevents future failures without manual intervention. Which option should they choose? Options: A. Use a larger machine type for Dataproc. B. Enable Dataproc autoscaling and configure memory-based scaling. C. Split the Spark job into multiple stages. D. Use Cloud Functions to retry the job.

Question 137hardmultiple choice

Read the full Designing data processing systems explanation →

A company uses Cloud Dataflow to process financial transactions from Pub/Sub to BigQuery. The pipeline must ensure exactly-once semantics. Recently, they noticed duplicate rows in BigQuery. The source publishes with at-least-once. The Dataflow pipeline uses idempotent writes. What is the most likely cause? Options: A. The pipeline uses GlobalWindows. B. The pipeline has autoscaling enabled. C. The pipeline uses file loads as a sink. D. The pipeline's watermark is misconfigured.

Question 138easymultiple choice

Read the full Designing data processing systems explanation →

A company needs to stream data from a fleet of IoT devices to BigQuery for near-real-time analytics. The data volume is unpredictable and can spike during certain events. Which Google Cloud service should be used as the ingestion point to handle variable throughput with minimal operational overhead?

Question 139mediummultiple choice

Read the full Designing data processing systems explanation →

A team runs a Dataflow streaming pipeline that reads from Pub/Sub, windows events by processing time, and writes to BigQuery. Some late-arriving events are being dropped. The requirement is to include all events that arrive within 10 minutes of the watermark. Which pipeline configuration should be used?
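The interaction between a watermark and a lateness allowance can be sketched in a few lines. This is a simplified model that assumes event-time fixed windows and integer timestamps in seconds. The helper names `window_end` and `is_dropped` are hypothetical, and the exact boundary behaviour of a real runner may differ.

```python
def window_end(event_ts, size):
    """Exclusive end of the fixed window of length `size` containing event_ts."""
    return (event_ts // size + 1) * size


def is_dropped(event_ts, watermark, size, allowed_lateness):
    """Return True if the event's window has already expired on arrival.

    Simplified rule: a window expires once the watermark has moved past
    the window end plus the allowed lateness.
    """
    return watermark > window_end(event_ts, size) + allowed_lateness


# An event stamped t=30s arrives when the watermark is already at t=400s.
late_without_allowance = is_dropped(30, watermark=400, size=60, allowed_lateness=0)
late_with_allowance = is_dropped(30, watermark=400, size=60, allowed_lateness=600)
```

With no allowance the event's window (ending at 60s) has long expired. With a 600-second allowance the same event is still accepted.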

Question 140hardmultiple choice

Read the full Designing data processing systems explanation →

A company runs a batch data processing workload using Dataproc clusters that are auto-scaled based on YARN memory utilization. During peak times, jobs take much longer than expected. Analysis shows the cluster is not scaling up despite high YARN memory utilization. What is the most likely cause?

Question 141easymulti select

Read the full Designing data processing systems explanation →

A company is designing a data processing system that must handle both batch and streaming workloads with unified pipeline code. Which two Google Cloud services are most suitable for implementing a unified batch and streaming pipeline? (Choose TWO.)

Question 142mediummulti select

Read the full Designing data processing systems explanation →

An organization is moving on-premises Hadoop workloads to Google Cloud. They need to minimize code changes and manage transient clusters for cost savings. Which two Google Cloud services should they consider? (Choose TWO.)

Question 143hardmulti select

Read the full Designing data processing systems explanation →

A data pipeline reads thousands of JSON files from Cloud Storage, processes them with Cloud Dataflow, and writes to BigQuery. The pipeline sometimes fails because of malformed JSON records. Which three steps should the data engineering team take to improve pipeline reliability? (Choose THREE.)

Question 144easymultiple choice

Read the full Designing data processing systems explanation →

A startup is building a real-time dashboard that shows aggregated metrics from social media feeds. They expect up to 10,000 events per second. The data must be near-real-time (< 30 seconds latency) and stored in BigQuery for historical analysis. They have limited experience managing infrastructure. The CTO suggests using Apache Kafka on Compute Engine for ingestion. However, the data engineer recommends a fully managed solution. Which approach should the team adopt?

Question 145easymultiple choice

Read the full Designing data processing systems explanation →

A large retail company processes point-of-sale transactions from thousands of stores daily. The current batch pipeline runs on Cloud Dataproc using Spark and takes 3 hours to complete. The business wants to reduce processing time to under 30 minutes. The pipeline reads from Cloud Storage, joins with inventory data from BigQuery, performs aggregations, and writes to Cloud SQL for reporting. What is the most effective optimization?

Question 146mediummultiple choice

Read the full Designing data processing systems explanation →

A financial services company uses a Dataflow streaming pipeline to process real-time stock trades. The pipeline reads from Pub/Sub, enriches with reference data from Cloud Bigtable, and writes to BigQuery. Recently, they noticed an increase in processing latency during market open hours. Investigation shows that the pipeline is data-skewed: a few stock symbols generate 90% of the traffic. The team wants to reduce latency without changing the pipeline structure. What should they do?

Question 147mediummultiple choice

Read the full Designing data processing systems explanation →

An e-commerce company runs a daily batch pipeline that processes clickstream data from Cloud Storage using Cloud Dataproc with Spark. The pipeline includes a join between a large fact table and a small dimension table. The dimension table is stored in Cloud Storage as a CSV file. The join is slow due to shuffling. The data engineer considers broadcasting the dimension table. However, the dimension table is updated daily and the pipeline reads the latest version. What is the best approach to implement this optimization?
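For readers unfamiliar with the term, a broadcast (map-side) join can be sketched without Spark. The small table becomes an in-memory lookup that every worker holds, so the large table is scanned once and never shuffled by key. The function name, column names, and sample rows below are assumptions made for illustration; this is not the Spark API.

```python
def broadcast_join(fact_rows, dimension_rows, key):
    """Inner-join a large row stream against a small in-memory table."""
    # Build the lookup once; in a cluster this copy is what gets "broadcast".
    lookup = {row[key]: row for row in dimension_rows}
    joined = []
    for fact in fact_rows:
        dim = lookup.get(fact[key])
        if dim is not None:               # inner join: skip unmatched facts
            joined.append({**dim, **fact})
    return joined


# Hypothetical data: three clickstream facts, two known products.
facts = [
    {"product_id": 1, "clicks": 3},
    {"product_id": 2, "clicks": 5},
    {"product_id": 9, "clicks": 1},
]
dims = [
    {"product_id": 1, "name": "lamp"},
    {"product_id": 2, "name": "desk"},
]
result = broadcast_join(facts, dims, key="product_id")
```

The lookup is rebuilt on every call, so a run that loads the latest dimension file at startup always joins against that day's version.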

Question 148mediummultiple choice

Read the full Designing data processing systems explanation →

A company has a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is failing with 'deadline exceeded' errors during peak hours. The team suspects that the pipeline cannot keep up with the incoming data rate. They also notice that maxNumWorkers is set to 10, but the pipeline only scales to 5 workers. What is the most likely cause of the inadequate scaling?

Question 149hardmultiple choice

Read the full Designing data processing systems explanation →

A healthcare company processes patient data using a Dataflow pipeline that reads from Cloud Storage, transforms data, and writes to BigQuery. They need to ensure that the processing is idempotent to handle failures and retries without duplicating records. The data arrives in daily batches and may be re-delivered if earlier processing failed. What approach should they take to guarantee exactly-once processing in BigQuery?
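The idea of an idempotent load can be illustrated with a tiny keyed upsert: applying the same batch twice leaves the target unchanged. This sketch uses a plain dictionary as a stand-in for a table, and the field name `record_id` is a hypothetical unique key. It is loosely analogous to a keyed merge, not a description of any specific BigQuery feature.

```python
def merge_batch(table, batch, key="record_id"):
    """Upsert each row by its unique key; replaying a batch is a no-op."""
    for row in batch:
        table[row[key]] = row   # insert, or overwrite the identical row
    return table


# Hypothetical daily batch that is delivered twice after a failed run.
batch = [
    {"record_id": "p-001", "value": 72},
    {"record_id": "p-002", "value": 65},
]
table = {}
merge_batch(table, batch)
size_after_first_load = len(table)
merge_batch(table, batch)       # re-delivery of the same batch
size_after_replay = len(table)
```

The design choice is that the key is derived from the data itself, not from the load attempt, so a retry cannot create new rows.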

Question 150hardmultiple choice

Read the full Designing data processing systems explanation →

A company runs a Dataproc cluster with 10 worker nodes for a Spark streaming job that processes data from Pub/Sub (via Pub/Sub Lite) and writes to Cloud Storage. They observe that the job is producing many small files in Cloud Storage, leading to high costs and performance issues in downstream batch pipelines. The team wants to consolidate output files while maintaining low latency. What is the best solution?

Question 151hardmultiple choice

Read the full Designing data processing systems explanation →

A media company uses Cloud Dataflow to process video metadata from a Pub/Sub stream. The pipeline enriches metadata using a lookup table stored in Cloud Bigtable. Recently, they noticed increased latency and occasional 'Bigtable operation timeout' errors. The Bigtable instance has 3 nodes and the data is highly distributed. The Dataflow pipeline uses default settings. What is the most likely cause of the timeouts?

Question 152hardmultiple choice

Read the full Designing data processing systems explanation →

A company runs a production Dataflow streaming pipeline that reads from Pub/Sub, groups events by customer ID, and writes to BigQuery. The pipeline uses global windows with triggers. After a recent code change, the pipeline started generating duplicate events in BigQuery for the same customer ID. The previous version did not have duplicates. The team reviews the code and sees that the trigger was changed from 'afterProcessingTime' to 'afterWatermark'. What is the most likely reason for duplicates?

Question 153easymultiple choice

Read the full Designing data processing systems explanation →

A company is designing a real-time clickstream analytics pipeline using Pub/Sub and Dataflow. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which Dataflow feature should be configured to handle late data correctly?

Question 154mediummultiple choice

Read the full Designing data processing systems explanation →

A data team runs regular analytical queries on a BigQuery table that stores 2 years of sales data (approximately 10 TB). Queries frequently filter on a `sale_date` column and also group by `product_id`. To optimize cost and performance, which design approach is most effective?

Question 155hardmultiple choice

Read the full Designing data processing systems explanation →

A Dataflow streaming job is processing high-volume sensor data from thousands of IoT devices. The job uses global windows with a 10-minute processing time trigger. Recently, the job's CPU utilization has been close to 100% and it is falling behind. Which action is most likely to reduce CPU load while maintaining data freshness?

Question 156easymultiple choice

Read the full Designing data processing systems explanation →

A company is building a data lake on Cloud Storage for log analysis. Log files (CSV) arrive every 5 minutes from multiple sources. The files should be ingested into BigQuery for reporting within 15 minutes. Which approach best meets the requirements with minimal operational overhead?

Question 157mediummulti select

Read the full Designing data processing systems explanation →

A healthcare company stores patient records as JSON files in Cloud Storage for analysis. They want to design a data lake that enables querying the data with BigQuery while minimizing storage costs and maintaining data security. Which two actions should they take? (Choose two.)

Question 158hardmultiple choice

Read the full Designing data processing systems explanation →

A data engineer configures the lifecycle rule shown in the exhibit on a Cloud Storage bucket that stores daily log files. After 60 days, they notice that files older than 30 days have been transitioned to Nearline, but files older than 90 days are still present. What is the most likely cause?

Exhibit

Refer to the exhibit.
```
# Cloud Storage Object Lifecycle Rule (JSON)
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 90}
      }
    ]
  }
}
```

Question 159hardmultiple choice

Read the full Designing data processing systems explanation →

A large e-commerce company is migrating its on-premises Hadoop cluster to Google Cloud using Dataproc for batch processing. The cluster processes daily sales data from multiple sources, generates aggregated reports, and performs ad-hoc analysis. The migration is complete, but users report that jobs are running 30% slower than on-premises. The data is stored in Cloud Storage as Parquet files partitioned by date. The Dataproc cluster uses preemptible VMs for worker nodes, and the master node uses a standard VM. The jobs rely heavily on shuffling data between stages. The cluster's autoscaling is enabled with a minimum of 10 and a maximum of 50 workers. During job execution, CPU utilization on workers is low, but disk I/O is high, especially on local SSDs. The network utilization is moderate. The team suspects that the shuffle operation is causing the slowdown. Which action should the team take to improve job performance?