Sample questions
Google Professional Data Engineer practice questions
A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?
- A
Dataproc Serverless with PySpark
Dataproc Serverless is cost-effective and suitable for batch processing of large CSVs.
- B
Dataflow with batch mode
Why wrong: Dataflow is more expensive for batch than Dataproc Serverless.
- C
Cloud Data Fusion
Why wrong: Data Fusion is a full ETL tool with higher costs and complexity.
- D
BigQuery Data Transfer Service
Why wrong: Data Transfer Service is for scheduled transfers, not processing.
A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?
- A
Store the lookup table in Cloud Bigtable and look up keys per element from a DoFn instead of using a side input.
Bigtable provides scalable key-value lookups without loading all data into memory.
- B
Use a side input from a PCollection and broadcast it.
Why wrong: Broadcasting a 10 GB PCollection will cause OOM on each worker.
- C
Increase the number of workers to distribute the side input.
Why wrong: Distributing the side input still requires each worker to hold a copy, causing OOM.
- D
Increase the worker memory to 16 GB per worker.
Why wrong: 16 GB may still not be sufficient if multiple side input copies are needed.
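The idea behind the key-value lookup answer can be sketched in plain Python. This is a minimal illustration, not Beam or Bigtable code: `InMemoryStore` is a hypothetical stand-in for a remote table, and `LookupEnricher`, `user_id`, and `segment` are made-up names. The point is that each worker holds only a small, bounded cache rather than the whole 10 GB table.

```python
from collections import OrderedDict


class InMemoryStore:
    """Hypothetical stand-in for a remote key-value table such as Bigtable."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0  # counts simulated remote reads

    def get(self, key):
        self.reads += 1
        return self._rows.get(key)


class LookupEnricher:
    """Enriches events one at a time, caching only recently used keys."""

    def __init__(self, store, max_cache_entries=1000):
        self._store = store
        self._cache = OrderedDict()
        self._max = max_cache_entries

    def _lookup(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = self._store.get(key)
        self._cache[key] = value
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict least recently used
        return value

    def process(self, event):
        enriched = dict(event)
        enriched["segment"] = self._lookup(event["user_id"])
        return enriched


store = InMemoryStore({"u1": "gold", "u2": "silver"})
enricher = LookupEnricher(store, max_cache_entries=1)
results = [enricher.process({"user_id": u}) for u in ["u1", "u1", "u2", "u1"]]
```

Memory use is bounded by `max_cache_entries` regardless of table size, at the cost of a remote read on every cache miss.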
Your company uses Vertex AI Pipelines to automate model retraining. The pipeline has three steps: data extraction from BigQuery, feature engineering using Dataflow, and model training using a custom container on Vertex AI Training. Recently, the pipeline has been failing intermittently at the Dataflow step with a 'The job encountered a transient error. Please retry.' message. You have enabled pipeline retries with 3 attempts. However, the pipeline still fails after 3 retries. You check the logs and find that the Dataflow job requires more resources than the default worker configuration provides. Which change should you make to reduce the failure rate?
- A
Increase the number of Dataflow workers to improve parallelism
Why wrong: Adding workers increases parallelism but does not give each worker more memory or CPU.
- B
Increase the number of retries in the pipeline to 5
Why wrong: Retries don't fix the underlying resource issue.
- C
Replace Dataflow with Dataproc to run the feature engineering step
Why wrong: This is a major change and may not resolve the resource issue.
- D
Increase the Dataflow worker machine type to have more memory and CPU in the pipeline step configuration
More resources prevent the transient resource exhaustion errors.
A data science team uses Vertex AI Pipelines to automate retraining. They want to ensure that only models with performance above a threshold are deployed. Which component should they add to the pipeline?
- A
Vertex AI Feature Store
Why wrong: Used for feature management, not evaluation.
- B
Vertex AI Model Evaluation
Evaluates the model and can block deployment if the threshold is not met.
- C
Cloud Build trigger
Why wrong: Cloud Build is a CI/CD service for builds and deployments; a trigger does not evaluate model metrics.
- D
Cloud Monitoring alert
Why wrong: Alerts are reactive, not pre-deployment gates.
A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?
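- A
Cloud Pub/Sub, Cloud Dataproc, Cloud Storage
Why wrong: Dataproc adds cluster management overhead, and Cloud Storage is not a data warehouse.
- B
Cloud Pub/Sub, Cloud Dataflow, Cloud Spanner
Why wrong: Spanner is a transactional database and is expensive for analytical workloads.
- C
Cloud Pub/Sub, Cloud Dataflow, BigQuery
BigQuery is a serverless data warehouse that supports streaming ingestion and SQL analytics.
- D
Cloud Pub/Sub, Cloud Dataflow, Cloud Storage
Why wrong: Cloud Storage is object storage, not a SQL data warehouse.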
The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?
Exhibit
Refer to the exhibit.
```
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
```
- A
roles/bigquery.admin
Why wrong: Grants the permission but is far broader than needed; roles/bigquery.dataEditor is sufficient.
- B
roles/bigquery.user
Why wrong: Does not include tables.get.
- C
roles/bigquery.jobUser
Why wrong: Does not include tables.get.
- D
roles/bigquery.dataEditor
Includes bigquery.tables.get.
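Granting an extra role amounts to adding a binding to the policy shown in the exhibit. The sketch below only manipulates the JSON structure in memory. `add_member` is a hypothetical helper, and how the policy is fetched and written back is out of scope here.

```python
import copy


def add_member(policy, role, member):
    """Return a copy of policy with member added to the binding for role."""
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return updated
    # No binding for this role yet, so create one.
    bindings.append({"role": role, "members": [member]})
    return updated


member = "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
policy = {
    "bindings": [
        {"role": "roles/bigquery.dataViewer", "members": [member]}
    ]
}
new_policy = add_member(policy, "roles/bigquery.dataEditor", member)
```

The helper is idempotent: applying it twice with the same role and member leaves the policy unchanged after the first call.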
A data scientist uses Vertex AI Workbench notebooks for model development. They want to share the environment with team members while maintaining version control. Which approach should they use?
- A
Use Cloud Shell and clone the repo
Why wrong: Limited resources and not scalable.
- B
Use a user-managed notebook instance with multiple users
Allows collaboration with version control.
- C
Share the notebook via Cloud Storage
Why wrong: No version control or collaboration.
- D
Store notebooks in Cloud Source Repositories
Why wrong: No interactive environment.
Drag and drop the steps to deploy a Cloud Dataflow pipeline from a template into the correct order.
Drag and drop the steps to migrate an on-premises MySQL database to Cloud SQL using Database Migration Service into the correct order.
Drag and drop the steps to set up Cloud IAP (Identity-Aware Proxy) for an App Engine app into the correct order.
Drag and drop the steps to set up a Pub/Sub topic with a push subscription to an HTTPS endpoint into the correct order.
A company has deployed a machine learning model on Vertex AI Prediction that serves real-time predictions for a customer-facing application. The model was trained using a custom container and is hosted on a single endpoint with a minimum number of nodes. Recently, the team noticed that during peak traffic, prediction latency increases significantly and some requests time out. The endpoint is configured with a baseline traffic split of 100% on the current model version. Which action should the team take to reduce latency and improve reliability?
- A
Reduce the minimum number of nodes to zero to allow scale-to-zero during low traffic.
Why wrong: Reducing min nodes would increase cold start latency and not help during peak traffic.
- B
Place a Google Cloud Load Balancer in front of the Vertex AI endpoint to distribute requests across multiple endpoints.
Why wrong: Vertex AI Prediction endpoints already have built-in load balancing; an external load balancer adds complexity without benefit.
- C
Configure horizontal autoscaling with a higher maximum number of nodes and set a CPU utilization target.
Autoscaling allows the endpoint to add nodes during high traffic, reducing latency and preventing timeouts.
- D
Implement A/B testing by splitting traffic between two model versions to distribute load.
Why wrong: A/B testing is for evaluating model performance, not for scaling to handle traffic spikes.
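A utilization target drives replica count roughly as follows. This is a generic target-tracking sketch under the assumption that load spreads evenly across replicas; it is not the exact algorithm Vertex AI uses, and `desired_replicas` and its parameters are illustrative names.

```python
import math


def desired_replicas(current, cpu_utilization, target, min_replicas, max_replicas):
    """Scale replica count so average CPU utilization moves toward target.

    cpu_utilization and target are percentages (for example 90 and 60).
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    raw = math.ceil(current * cpu_utilization / target)
    # Clamp to the configured bounds; a low maximum caps scale-out.
    return max(min_replicas, min(max_replicas, raw))


peak = desired_replicas(current=2, cpu_utilization=90, target=60,
                        min_replicas=1, max_replicas=10)
capped = desired_replicas(current=2, cpu_utilization=90, target=60,
                          min_replicas=1, max_replicas=2)
```

With a maximum of 2 the endpoint cannot add capacity at peak, which is why raising the maximum node count matters as much as setting the target.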
A data engineer is designing a batch ETL pipeline that reads CSV files from Cloud Storage, transforms them using Dataproc, and writes the results to BigQuery. The data volume is expected to grow 10x in the next year. Which design approach best balances cost and performance?
- A
Create a single large persistent Dataproc cluster to handle the peak load.
Why wrong: A persistent cluster is costly and underutilized during low traffic.
- B
Use Cloud Data Fusion to visually design the pipeline and run it on Dataproc.
Why wrong: Data Fusion adds complexity and cost, and may not handle 10x growth seamlessly.
- C
Use a Dataproc cluster with preemptible worker nodes and autoscaling enabled.
Preemptible VMs are cost-effective, and autoscaling handles growth.
- D
Migrate the pipeline to Dataflow with Apache Beam and use FlexRS for cost savings.
Why wrong: FlexRS reduces the cost of Dataflow batch jobs, but migrating means rewriting the pipeline in Apache Beam, and it may still cost more than Dataproc for large volumes.
A financial services company deploys a regression model to predict loan default risk. The model is served using Vertex AI Endpoints with autoscaling. After deployment, latency increases significantly during peak hours, causing timeouts. The model uses scikit-learn and has a large feature set. Which action should the team take to reduce latency while maintaining prediction accuracy?
- A
Switch to batch prediction for all requests.
Why wrong: Batch prediction is not appropriate for real-time serving.
- B
Increase the minimum number of replicas in the endpoint to handle peak load.
Why wrong: This addresses capacity but not per-request latency.
- C
Increase the memory allocation for the serving container.
Why wrong: More memory may not reduce computation time.
- D
Apply feature selection to reduce the number of input features.
Reducing features decreases model size and inference time.
Which TWO actions are recommended to improve the reliability of a Cloud Dataflow streaming pipeline that processes event data from Pub/Sub?
- A
Use a pull subscription with a 10-second acknowledgment deadline.
Why wrong: Short ack deadlines can cause duplicates and processing failures.
- B
Enable Dataflow Streaming Engine.
Streaming Engine offloads state management to the backend, improving reliability.
- C
Use sinks that support exactly-once writes (e.g., BigQuery via the Storage Write API).
Exactly-once processing prevents duplicate data.
- D
Disable autoscaling to prevent worker churn.
Why wrong: Autoscaling helps handle load spikes; disabling it can cause failures under load.
- E
Use micro-batch processing with a small batch size.
Why wrong: Small batches increase overhead and may not improve reliability.
A team is designing a data lake on Google Cloud using Cloud Storage and BigQuery. They need to ensure that sensitive data (e.g., PII) is encrypted at rest and have the ability to audit access. Which approach meets these requirements?
- A
Use Customer-Managed Encryption Keys (CMEK) and enable VPC Service Controls.
Why wrong: VPC Service Controls reduce exfiltration risk but do not provide access auditing.
- B
Use Customer-Managed Encryption Keys (CMEK) and enable Cloud Audit Logs.
CMEK provides control over encryption keys, and Cloud Audit Logs record access to data.
- C
Use Default Encryption and enable Data Loss Prevention (DLP) API.
Why wrong: Default Encryption does not allow customer control over keys.
- D
Use Customer-Supplied Encryption Keys (CSEK) and enable VPC Service Controls.
Why wrong: CSEK requires the customer to supply the key material, which may not be desirable for all scenarios.
A company is building a real-time streaming pipeline using Pub/Sub and Dataflow to process clickstream data. The pipeline writes aggregated metrics to BigQuery every 10 seconds using a fixed window. During peak traffic, some windows produce duplicate rows in BigQuery. What is the most likely cause?
- A
Dataflow retries a BigQuery streaming insert after a timeout, and the retried insert is written even though the original insert had already succeeded.
This is a known scenario: BigQuery streaming inserts are deduplicated only on a best-effort basis, so retries can produce duplicate rows.
- B
The pipeline uses default triggers instead of after-watermark triggers.
Why wrong: Trigger type does not cause duplicates.
- C
The fixed window duration is too short, causing overlapping windows.
Why wrong: Fixed windows are non-overlapping.
- D
The pipeline is using too many Dataflow workers, causing load balancing issues.
Why wrong: Load balancing does not cause duplicate rows.
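Duplicates from retries are avoided when the write is idempotent, that is, when a retried row carries the same identifier as the original. The sketch below shows the general technique with a hypothetical in-memory sink; `DedupingSink` and `window_insert_id` are illustrative names and do not describe how BigQuery works internally.

```python
class DedupingSink:
    """Hypothetical sink that ignores rows whose insert ID was already written."""

    def __init__(self):
        self.rows = []
        self._seen_ids = set()

    def insert(self, insert_id, row):
        if insert_id in self._seen_ids:
            return False  # duplicate delivery, e.g. a retry after a timeout
        self._seen_ids.add(insert_id)
        self.rows.append(row)
        return True


def window_insert_id(metric, window_start):
    """Build a deterministic ID so a retried write reuses the same key."""
    return f"{metric}:{window_start}"


sink = DedupingSink()
row = {"metric": "clicks", "window_start": 1700000000, "count": 42}
insert_id = window_insert_id(row["metric"], row["window_start"])
first = sink.insert(insert_id, row)
retry = sink.insert(insert_id, row)  # simulated retry of the same window
```

Deriving the ID from the metric name and window start, rather than generating a random one per attempt, is what makes the retry harmless.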
A data engineering team is operationalizing a machine learning model for real-time fraud detection. The model must process transactions with sub-100ms latency and be highly available. Which TWO strategies should the team implement?
You have a batch prediction job on Vertex AI that processes millions of records. The job is failing with an out-of-memory error. What is the best way to resolve this?
- A
Increase the minNodes and maxNodes for the batch prediction job
Why wrong: Adding replicas raises throughput but does not give each replica more memory.
- B
Split the input data into smaller files and run multiple batch prediction jobs
Why wrong: This is a workaround that adds operational overhead and does not raise the memory available to each replica.
- C
Enable autoscaling on the batch prediction job
Why wrong: Scaling out adds replicas of the same size, so each one can still run out of memory.
- D
Use a machine type with more memory for the batch prediction job
Increasing memory directly solves OOM.
A financial services company uses Cloud Composer to orchestrate a daily workflow that includes a Dataproc job for risk analysis. The workflow sometimes fails because the Dataproc cluster creation times out. The cluster creation typically takes 3 minutes, but occasionally takes over 10 minutes. What is the most effective way to handle this variability?
- A
Create a long-running Dataproc cluster that remains idle and reuse it for each workflow.
Reusing an existing cluster eliminates the creation step and associated timeout.
- B
Implement a retry loop with exponential backoff in the DAG.
Why wrong: Retries may still hit timeouts if the issue persists.
- C
Use preemptible VMs for the cluster to reduce cost and improve creation speed.
Why wrong: Preemptible VMs may have longer creation times due to resource availability.
- D
Increase the cluster creation timeout in the Airflow configuration.
Why wrong: This merely masks the problem without addressing the root cause.
Your team is using Vertex AI Pipelines to orchestrate a model retraining workflow. The pipeline includes a data validation step, a training step, and a model evaluation step. You want to ensure that if the evaluation step fails due to low model performance, the pipeline stops and does not deploy the model. Which approach should you use?
- A
Run the evaluation step after deployment and roll back if performance is low
Why wrong: This would deploy a poorly performing model before rollback.
- B
Configure the evaluation step to retry up to 3 times on failure
Why wrong: Retrying does not address low model performance.
- C
Use a Conditional in the pipeline to check evaluation metrics and only run the deployment step if metrics pass thresholds
A conditional lets the pipeline branch on the evaluation results, so deployment runs only when the metrics pass.
- D
Create a separate pipeline for deployment and trigger it manually after review
Why wrong: Manual trigger is not automated and may cause delays.
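The gating logic behind a conditional deployment step can be sketched independently of any pipeline SDK. The step names, the `auc` metric, and the 0.90 threshold are all hypothetical; in a real pipeline the same check would sit inside whatever conditional construct the orchestrator provides.

```python
def passes_thresholds(metrics, thresholds):
    """Return True only if every required metric meets its minimum value."""
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            return False
    return True


def run_pipeline(metrics, thresholds):
    """Return the ordered list of steps that would run for these metrics."""
    steps = ["validate_data", "train", "evaluate"]
    if passes_thresholds(metrics, thresholds):
        steps.append("deploy")  # reached only when the gate passes
    return steps


thresholds = {"auc": 0.90}
good_run = run_pipeline({"auc": 0.93}, thresholds)
bad_run = run_pipeline({"auc": 0.85}, thresholds)
```

A missing metric is treated as a failure, so a broken evaluation step cannot let a model through by accident.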
Which TWO are best practices for monitoring a deployed machine learning model in production on Vertex AI?
- A
Set up a weekly retraining pipeline triggered by calendar schedule
Why wrong: This is not monitoring; it's a fixed schedule.
- B
Enable Vertex AI Model Monitoring to track feature drift and skew
Model Monitoring automatically detects drift.
- C
Monitor the training job duration to detect anomalies
Why wrong: Training duration is not a production monitoring metric.
- D
Monitor the distribution of predictions over time to detect concept drift
Monitoring predictions helps identify when the model's behavior changes.
- E
Monitor the model's file size to ensure it hasn't changed
Why wrong: File size is not indicative of model quality.
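Drift detection boils down to comparing a baseline distribution with a recent one. Jensen-Shannon divergence is one common distance measure for that comparison; the sketch below computes it over two histograms with the same bins. The bin counts are made up, and no claim is made here about the thresholds or internals of any managed monitoring service.

```python
import math


def normalize(counts):
    """Convert raw bin counts into probabilities that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram must contain at least one observation")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(baseline_counts, current_counts):
    """Jensen-Shannon divergence between two histograms over the same bins."""
    p = normalize(baseline_counts)
    q = normalize(current_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


baseline = [50, 30, 20]
same = js_divergence(baseline, [50, 30, 20])
shifted = js_divergence(baseline, [10, 30, 60])
disjoint = js_divergence([10, 0], [0, 10])
```

With base-2 logarithms the score ranges from 0 for identical distributions to 1 for distributions with no overlap, which makes it convenient to compare against a fixed alert threshold.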
A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?
- A
Use Cloud Dataproc Serverless for all Spark jobs.
Why wrong: Serverless may not support custom Spark configurations.
- B
Migrate jobs to Cloud Dataflow.
Why wrong: Dataflow is not Spark-compatible.
- C
Run Spark on Compute Engine instances with startup scripts.
Why wrong: Requires manual cluster management.
- D
Use Dataproc clusters with autoscaling and preemptible VMs.
Reduces cost and operational overhead.
A data science team has built a model using scikit-learn. They want to operationalize it on Google Cloud without rewriting the code. Which approach should they take?
- A
Export the model as a PMML file and use BigQuery ML
Why wrong: BigQuery ML does not support PMML.
- B
Use AI Platform Training to host the model directly
Why wrong: AI Platform Training is for training, not serving.
- C
Package the model in a custom container and deploy to Vertex AI Endpoints
Custom containers allow any framework without code changes.
- D
Convert the scikit-learn model to TensorFlow SavedModel format
Why wrong: This requires rewriting and may not be exact.
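A custom serving container mostly needs a thin handler around the existing model object. The sketch below assumes the request body is JSON with an `instances` list and the reply is JSON with a `predictions` list. `StandInModel` is a hypothetical placeholder for a loaded scikit-learn estimator, and the HTTP server and health route are omitted.

```python
import json


class StandInModel:
    """Hypothetical stand-in for a loaded scikit-learn estimator."""

    def predict(self, rows):
        # A real estimator would return model outputs; this sums each row.
        return [sum(row) for row in rows]


def handle_predict(model, request_body):
    """Turn a JSON request with 'instances' into a JSON 'predictions' reply."""
    payload = json.loads(request_body)
    instances = payload["instances"]
    predictions = model.predict(instances)
    # list() makes array-like outputs JSON serializable.
    return json.dumps({"predictions": list(predictions)})


response = handle_predict(StandInModel(), '{"instances": [[1, 2], [3, 4]]}')
```

Because the handler only calls `predict`, the training code and the model artifact stay unchanged; only the wrapper is new.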