Google Professional Data Engineer PDE practice test

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

Trap 1: Dataflow with batch mode

Dataflow is more expensive for batch than Dataproc Serverless.

Trap 2: Cloud Data Fusion

Data Fusion is a full ETL tool with higher costs and complexity.

Trap 3: BigQuery Data Transfer Service

Data Transfer Service is for scheduled transfers, not processing.

A
Dataproc Serverless with PySpark
Dataproc Serverless is cost-effective and suitable for batch processing of large CSVs.
B
Dataflow with batch mode
Why wrong: Dataflow is more expensive for batch than Dataproc Serverless.
C
Cloud Data Fusion
Why wrong: Data Fusion is a full ETL tool with higher costs and complexity.
D
BigQuery Data Transfer Service
Why wrong: Data Transfer Service is for scheduled transfers, not processing.

Question 2hardmultiple choice

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

Trap 1: Use a side input from a PCollection and broadcast it.

Broadcasting a 10 GB PCollection will cause OOM on each worker.

Trap 2: Increase the number of workers to distribute the side input.

Distributing the side input still requires each worker to hold a copy, causing OOM.

Trap 3: Increase the worker memory to 16 GB per worker.

16 GB may still not be sufficient if multiple side input copies are needed.

A
Use a Cloud Bigtable table as a side input via a RichSDF.
Bigtable provides scalable key-value lookups without loading all data into memory.
B
Use a side input from a PCollection and broadcast it.
Why wrong: Broadcasting a 10 GB PCollection will cause OOM on each worker.
C
Increase the number of workers to distribute the side input.
Why wrong: Distributing the side input still requires each worker to hold a copy, causing OOM.
D
Increase the worker memory to 16 GB per worker.
Why wrong: 16 GB may still not be sufficient if multiple side input copies are needed.

Question 3mediummultiple choice

Your company uses Vertex AI Pipelines to automate model retraining. The pipeline has three steps: data extraction from BigQuery, feature engineering using Dataflow, and model training using a custom container on Vertex AI Training. Recently, the pipeline has been failing intermittently at the Dataflow step with a 'The job encountered a transient error. Please retry.' message. You have enabled pipeline retries with 3 attempts. However, the pipeline still fails after 3 retries. You check the logs and find that the Dataflow job requires more resources than the default worker configuration provides. Which change should you make to reduce the failure rate?

Trap 1: Increase the number of Dataflow workers to improve parallelism

More workers may increase resource contention.

Trap 2: Increase the number of retries in the pipeline to 5

Retries don't fix the underlying resource issue.

Trap 3: Replace Dataflow with Dataproc to run the feature engineering step

This is a major change and may not resolve the resource issue.

A
Increase the number of Dataflow workers to improve parallelism
Why wrong: More workers may increase resource contention.
B
Increase the number of retries in the pipeline to 5
Why wrong: Retries don't fix the underlying resource issue.
C
Replace Dataflow with Dataproc to run the feature engineering step
Why wrong: This is a major change and may not resolve the resource issue.
D
Increase the Dataflow worker machine type to have more memory and CPU in the pipeline step configuration
More resources prevent the transient resource exhaustion errors.

Question 4hardmultiple choice

A data science team uses Vertex AI Pipelines to automate retraining. They want to ensure that only models with performance above a threshold are deployed. Which component should they add to the pipeline?

Trap 1: Vertex AI Feature Store

Used for feature management, not evaluation.

Trap 2: Cloud Build trigger

Cloud Build is for building containers.

Trap 3: Cloud Monitoring alert

Alerts are reactive, not pre-deployment gates.

A
Vertex AI Feature Store
Why wrong: Used for feature management, not evaluation.
B
Vertex AI Model Evaluation
Evaluates model and can block deployment if threshold not met.
C
Cloud Build trigger
Why wrong: Cloud Build is for building containers.
D
Cloud Monitoring alert
Why wrong: Alerts are reactive, not pre-deployment gates.

Read the full NAT/PAT explanation →

Question 5easymultiple choice

A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?

Trap 1: Cloud Pub/Sub, Cloud Dataproc, Cloud Storage

Dataproc overhead.

Trap 2: Cloud Pub/Sub, Cloud Dataflow, Cloud Spanner

Spanner expensive for analytics.

Trap 3: Cloud Pub/Sub, Cloud Dataflow, Cloud Storage

Cloud Storage not SQL-queryable.

A
Cloud Pub/Sub, Cloud Dataproc, Cloud Storage
Why wrong: Dataproc overhead.
B
Cloud Pub/Sub, Cloud Dataflow, Cloud Spanner
Why wrong: Spanner expensive for analytics.
C
Cloud Pub/Sub, Cloud Dataflow, BigQuery
Best for real-time SQL analytics.
D
Cloud Pub/Sub, Cloud Dataflow, Cloud Storage
Why wrong: Cloud Storage not SQL-queryable.

Question 6easymultiple choice

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

Exhibit

Refer to the exhibit.

```
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
```

Trap 1: roles/bigquery.admin

Too broad, but also correct; however dataEditor is sufficient.

Trap 2: roles/bigquery.user

Does not include tables.get.

Trap 3: roles/bigquery.jobUser

Does not include tables.get.

A
roles/bigquery.admin
Why wrong: Too broad, but also correct; however dataEditor is sufficient.
B
roles/bigquery.user
Why wrong: Does not include tables.get.
C
roles/bigquery.jobUser
Why wrong: Does not include tables.get.
D
roles/bigquery.dataEditor
Includes bigquery.tables.get.