Google Professional Data Engineer (PDE): Questions 976-990

990 questions total · 14 pages · All types, answers revealed

Page 14 of 14

976
MCQ · hard

An e-commerce company uses Cloud Spanner for order processing. They need to query orders by customer ID and retrieve all order items. Which schema design pattern should they use for optimal performance?

A. Use interleaved tables where Orders is the parent and OrderItems is an interleaved child table with the same primary key prefix.
B. Store all data in a single table with nullable columns for order item attributes.
C. Denormalize by storing order items as a repeated field in the orders table.
D. Create two separate tables with a secondary index on customer_id in the orders table and a secondary index on order_id in the order_items table.
Answer: A

Interleaving co-locates child rows with their parent, enabling efficient joins and strong consistency.

Why this answer

Interleaved tables in Cloud Spanner physically co-locate each parent row with its child rows in primary-key order, so retrieving an order together with all of its items is a single, fast key-range scan rather than a join across splits. When customer_id leads the shared primary key, all of a customer's orders and items are adjacent as well. This design exploits Spanner's hierarchical storage model to minimize latency and maximize throughput for this access pattern.
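The co-location argument can be illustrated with a small in-memory sketch. It is not Spanner code: the table contents, the `interleave` and `scan_customer` helpers, and the assumption that `customer_id` leads the shared primary key are all hypothetical. It only shows why sorting parent and child rows by a shared key prefix turns "orders plus items for one customer" into one contiguous range.

```python
from bisect import bisect_left

# Hypothetical data. Orders are keyed by (customer_id, order_id) and
# OrderItems by (customer_id, order_id, item_id), so they share a key prefix.
orders = {
    ("c1", "o1"): {"status": "shipped"},
    ("c1", "o2"): {"status": "pending"},
    ("c2", "o3"): {"status": "shipped"},
}
order_items = {
    ("c1", "o1", "i1"): {"sku": "pen"},
    ("c1", "o1", "i2"): {"sku": "ink"},
    ("c1", "o2", "i1"): {"sku": "pad"},
    ("c2", "o3", "i1"): {"sku": "cup"},
}


def interleave(parents, children):
    """Store parent and child rows in one key-ordered sequence.

    A parent key is a prefix of its children's keys, so each parent sorts
    immediately before its own children.
    """
    rows = [(key, "Orders", row) for key, row in parents.items()]
    rows += [(key, "OrderItems", row) for key, row in children.items()]
    return sorted(rows, key=lambda r: r[0])


def scan_customer(storage, customer_id):
    """Return every row for one customer with a single contiguous range scan."""
    keys = [r[0] for r in storage]
    start = bisect_left(keys, (customer_id,))
    result = []
    for row in storage[start:]:
        if row[0][0] != customer_id:
            break  # left the customer's key range, stop scanning
        result.append(row)
    return result


storage = interleave(orders, order_items)
c1_rows = scan_customer(storage, "c1")
for key, table, row in c1_rows:
    print(table, key, row)
```

With separate, non-interleaved tables the same lookup would need two independent range scans whose rows may live on different splits.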

Exam trap

Google Cloud often tests the misconception that secondary indexes alone are sufficient for performance. In Spanner's distributed architecture, joins across non-interleaved tables can span splits and become expensive, whereas interleaving provides physical co-location that avoids those network round-trips.

How to eliminate wrong answers

Option B is wrong because a single table with nullable columns mixes order rows with order-item rows, wastes storage, and forces every query to filter one kind of row from the other, giving up the performance benefit of interleaving. Option C is wrong because storing order items as a repeated field (for example, in an ARRAY or JSON column) prevents independent indexing and filtering of individual items, and updating a single item requires rewriting the entire order row, causing contention and poor concurrency. Option D is wrong because two separate tables with secondary indexes on customer_id and order_id require a distributed join across splits, incurring cross-node communication and higher latency compared to the co-located access of interleaved tables.

977
Multi-Select · easy

A data engineer is preparing a dataset for ML training in Vertex AI. The dataset includes a timestamp column, a categorical column with high cardinality (1000 distinct values), and a numerical column with outliers. Which two preprocessing steps should they apply? (Choose TWO)

Select 2 answers
A. Drop the timestamp column
B. Label encode the categorical column
C. Winsorize the numerical column to cap outliers
D. Normalize the numerical column using Z-score
E. One-hot encode the categorical column
Answers: B, C

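Label encoding maps each category to an integer, so the column stays a single feature instead of expanding into 1000.

Why this answer

One-hot encoding a column with 1000 distinct values produces 1000 sparse features, so label (ordinal) encoding is the more compact choice here. Winsorizing caps the numerical column's outliers at chosen percentile bounds instead of dropping rows. Z-score normalization (Option D) rescales the values but leaves the outliers in place, where they distort the mean and standard deviation. Dropping the timestamp (Option A) discards a potentially useful signal.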
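As a rough illustration of the two selected steps, here is a minimal standard-library sketch. The helper names, the sample values, and the 10th/90th percentile bounds are assumptions chosen for the example; a real pipeline would more likely use a library encoder and tune the bounds.

```python
def label_encode(values):
    """Map each distinct category to an integer code (a single output column)."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def winsorize(values, lower_pct=10, upper_pct=90):
    """Cap values below/above the given percentiles instead of dropping them.

    Percentile bounds are picked by index into the sorted data, which is a
    simplification of the interpolating methods statistics libraries offer.
    """
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[(n - 1) * lower_pct // 100]
    high = ordered[(n - 1) * upper_pct // 100]
    return [min(max(v, low), high) for v in values]


# Hypothetical sample data.
categories = ["sku_b", "sku_a", "sku_b", "sku_c"]
amounts = [10, 12, 11, 13, 9, 1000, 12, 11, 10, -500, 14]

codes, mapping = label_encode(categories)
capped = winsorize(amounts)

print(codes)   # one integer per row, regardless of cardinality
print(capped)  # the outliers 1000 and -500 are pulled in to the bounds
```

Note that integer codes imply an ordering between categories, which tree-based models tolerate better than linear ones.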

978
MCQ · hard

Refer to the exhibit. A team is trying to run a custom prediction container on Vertex AI Endpoint. They get this error when the container starts. What is the most likely cause?

A. The container image is too large
B. The entry point is missing or incorrect
C. The container is built for a different CPU architecture
D. The model file is missing from the container
Answer: B

The error message directly states to ensure the container has an entry point.

Why this answer

The container fails as it starts, before it can answer the initial health check. Vertex AI endpoints need a valid entry point (for example, CMD or ENTRYPOINT in the Dockerfile) to launch the prediction server. If the entry point is missing or incorrect, the container never launches, which produces the observed error.

Exam trap

Google Cloud often tests the distinction between container startup failures (entry point issues) and runtime failures (missing model files or architecture mismatches), leading candidates to confuse a missing model file with a startup error.

How to eliminate wrong answers

Option A is wrong because container image size does not prevent startup; Vertex AI supports images up to 10 GB, and a large image would only affect pull time, not the container's ability to start. Option C is wrong because a CPU architecture mismatch typically surfaces as an 'exec format error' rather than the entry-point message shown in the exhibit, and Vertex AI uses x86_64 architecture by default. Option D is wrong because a missing model file would typically cause a model load failure or a prediction-time error, not the entry-point error, as the container process can still be launched.

979
Multi-Select · easy

Your team is using Cloud Dataprep to clean and transform a dataset. Which TWO features of Cloud Dataprep help you understand data quality issues before running the pipeline? (Choose 2.)

Select 2 answers
A. Scheduling data quality jobs
B. Column histograms
C. Joining datasets
D. Recipe steps
E. Data quality profiling
Answers: B, E

Histograms visually display the distribution of values, helping to spot unexpected patterns.

Why this answer

Data quality profiling provides statistics and distributions to identify anomalies. Column histograms visualize each column's distribution and outliers. Scheduling and recipe steps are execution features, not exploratory analysis, and joins are transformations, not profiling.

980
MCQ · medium

A company wants to use Cloud Data Fusion to build ETL pipelines. They need to connect to a legacy on-premises database using JDBC and also want to use prebuilt transforms from the Hub. Which two features should they use?

A. Cloud SQL JDBC driver and Cloud Functions
B. Dataproc Metastore and Cloud Storage sink
C. Wrangler and Dataproc
D. CDAP JDBC plugin and the Hub
Answer: D

CDAP JDBC plugin connects to on-prem DB; Hub provides prebuilt transforms.

Why this answer

Cloud Data Fusion is built on CDAP and uses CDAP plugins for JDBC connections, while the Hub provides prebuilt transforms. Plugins are the mechanism; the Hub is where they are sourced. Wrangler is for interactive data preparation, not for connectivity. Dataproc is where Data Fusion executes pipelines behind the scenes, not a feature you select to obtain connectors or transforms.

981
MCQ · easy

Which Google Cloud service would you use to create a unified data catalog that automatically captures lineage from BigQuery, Cloud Storage, and other sources?

A. Cloud Composer
B. Dataflow
C. Data Catalog
D. Dataplex
Answer: D

Dataplex includes a unified catalog, lineage, and governance.

Why this answer

Dataplex provides a unified data catalog (Universal Catalog) with automated lineage, discovery, and governance across GCP. Data Catalog is the older standalone service; Dataplex is the recommended unified solution. Cloud Composer and Dataflow are orchestration/processing tools.

982
MCQ · hard

A company uses BigQuery flat-rate pricing with 500 slots purchased as a committed use discount. During peak hours, they need additional capacity but do not want to buy more committed slots. They have a secondary project used for ad-hoc queries by analysts. How can they provide burst capacity to the primary project during peak times without increasing committed spend?

A. Create flex slots in the secondary project, create a reservation in the secondary project, and assign the reservation to the primary project.
B. Enable autoscaling slot management in the primary project's reservation, allowing slots to scale up based on demand.
C. Upgrade the primary project's edition to Enterprise Plus to allow bursting.
D. Purchase additional committed use slots in the primary project and apply them to the reservation.
Answer: A

Flex slots provide temporary slots; they can be assigned to the primary project via a reservation in the secondary project.

Why this answer

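Flex slots are short-term slot commitments that can be cancelled after 60 seconds. Purchasing them in the secondary project, placing them in a reservation there, and assigning that reservation to the primary project provides burst capacity during peak hours without committing to long-term purchases.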

983
MCQ · medium

A data engineer needs to run an existing Spark job on Google Cloud with minimal code changes. The job requires Hive metastore access. Which Dataproc feature should they use to provide a managed Hive metastore?

A. Cloud SQL for MySQL
B. Dataproc Metastore
C. BigQuery as a Hive metastore
D. Dataproc on GKE
Answer: B

Dataproc Metastore is a managed Hive metastore service that works with Dataproc clusters.

Why this answer

Dataproc Metastore provides a fully managed Hive metastore that integrates with Dataproc clusters, allowing existing Spark jobs to use it without code changes.

984
MCQ · medium

You are moving an on-premises Hadoop workload to Google Cloud. The workload uses Hive for metadata and HDFS for storage. Which services should you use to minimise reconfiguration?

A. Dataproc with HDFS and Cloud Bigtable for metadata
B. Dataproc with Cloud Storage and Cloud SQL for Hive metastore
C. Dataflow with Cloud Storage and BigQuery
D. Dataproc with Cloud Storage and Dataproc Metastore
Answer: D

Dataproc Metastore is a fully managed Hive metastore. Cloud Storage replaces HDFS seamlessly.

Why this answer

Dataproc Metastore provides a fully managed Hive metastore service that can be used with Dataproc clusters. Cloud Storage can replace HDFS through the Cloud Storage connector, so jobs keep their directory layout and only swap the hdfs:// prefix for gs://. This minimises code changes.
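The path change involved in moving from HDFS to Cloud Storage can be sketched as a small helper. This is only an illustration: the `to_gcs_uri` function, the bucket name `example-bucket`, and the namenode address are hypothetical, and it assumes the directory layout is copied unchanged into the bucket.

```python
from urllib.parse import urlparse


def to_gcs_uri(hdfs_uri, bucket):
    """Rewrite an hdfs:// URI to the equivalent gs:// URI in the given bucket.

    The namenode host and port are dropped; the directory path is kept as-is.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri}")
    return f"gs://{bucket}{parsed.path}"


# Hypothetical input path and bucket name.
old_path = "hdfs://namenode:8020/warehouse/sales/part-0.parquet"
new_path = to_gcs_uri(old_path, "example-bucket")
print(new_path)
```

In practice the same substitution is usually applied to job arguments and table locations rather than inside application code.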

985
Multi-Select · easy

Which TWO options can help reduce costs for a Dataflow batch pipeline that processes 100 GB of data daily from Cloud Storage? (Choose 2)

Select 2 answers
A. Use Dataflow Prime
B. Use high-memory machine types
C. Use Streaming Engine
D. Use FlexRS (Flexible Resource Scheduling)
E. Use preemptible VMs for Dataflow workers
Answers: D, E

FlexRS offers discounted pricing for batch jobs that are flexible on start time.

Why this answer

FlexRS (Flexible Resource Scheduling) allows you to run batch workloads on a discounted, flexible schedule. It reduces costs by offering lower prices in exchange for the job being able to wait up to 6 hours for resources to become available. The discount comes from combining delayed scheduling with a mix of preemptible and regular worker VMs, which is also why Option E lowers cost. This is ideal for a daily 100 GB batch pipeline that can tolerate some scheduling delay.

Exam trap

Google Cloud often tests the distinction between batch and streaming optimizations, so the trap here is that candidates might select Streaming Engine (Option C) thinking it reduces costs in batch pipelines, when it is only relevant for streaming.

986
Multi-Select · medium

A data team wants to use Approximate Aggregation Functions in BigQuery to get faster query results. Which two functions can they use? (Choose 2)

Select 2 answers
A. APPROX_SUM
B. APPROX_AVG
C. APPROX_QUANTILES
D. APPROX_COUNT_DISTINCT
E. APPROX_MEDIAN
Answers: C, D

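APPROX_QUANTILES returns approximate quantile boundaries, and APPROX_COUNT_DISTINCT returns an approximate distinct count.

Why this answer

BigQuery provides APPROX_COUNT_DISTINCT for approximate distinct counts and APPROX_QUANTILES for approximate quantiles. Other approximate functions include APPROX_TOP_COUNT and APPROX_TOP_SUM. APPROX_SUM, APPROX_AVG, and APPROX_MEDIAN are not BigQuery functions; an approximate median can be read from the output of APPROX_QUANTILES.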

987
MCQ · medium

A company uses a custom container image for model serving. The image is large (10 GB). During deployment, they get timeouts. What should they do?

A. Pre-pull the image on all nodes
B. Increase the timeout in the deployment config
C. Switch to a larger machine type
D. Use a smaller base image
Answer: D

Smaller image reduces pull time and deployment time.

Why this answer

Option D is correct because a smaller base image addresses the root cause of the timeout: the 10 GB image takes too long to download from the container registry when the serving container starts. Reducing the image size (for example, with a slim or distroless base image) shortens the pull so that it completes within the deployment's time limit, without requiring infrastructure changes.

Exam trap

Google Cloud often tests the misconception that increasing timeouts or scaling up hardware solves performance bottlenecks, when the correct answer is to optimize the artifact itself (image size) to meet the system's implicit constraints.

How to eliminate wrong answers

Option A is wrong because pre-pulling the image on all nodes is a manual workaround that does not solve the underlying issue of a bloated image; it also adds operational overhead and fails in dynamic clusters where new nodes are added. Option B is wrong because increasing the timeout in the deployment config only masks the symptom and does not reduce the pull time, potentially leading to other timeouts in the cluster. Option C is wrong because switching to a larger machine type does not address the slow image download; it mainly provides more local compute and memory.

988
Multi-Select · medium

A company is planning to migrate a legacy batch ETL pipeline to Google Cloud. The pipeline involves reading from a relational database, transforming data, and writing to a data warehouse. Which three Google Cloud services can be used as the orchestration layer? (Choose three.)

Select 3 answers
A. Cloud Dataproc
B. Cloud Scheduler
C. Cloud Dataflow
D. Cloud Workflows
E. Cloud Composer
Answers: B, D, E

Cloud Scheduler can trigger jobs on a schedule, acting as a simple orchestrator.

Why this answer

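Cloud Scheduler is a fully managed cron job service that can trigger orchestration workflows on a schedule. It is correct because it can initiate batch ETL pipelines by calling HTTP endpoints (such as Cloud Run or Cloud Functions) or publishing to Pub/Sub, making it a lightweight orchestration trigger for scheduled batch jobs. Cloud Workflows sequences calls to services and APIs as ordered steps, and Cloud Composer (managed Apache Airflow) schedules and monitors multi-step DAGs.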

Exam trap

Google Cloud often tests the distinction between data processing services (Dataproc, Dataflow) and orchestration services (Workflows, Composer, Scheduler), so candidates mistakenly select Dataproc or Dataflow thinking they can orchestrate, when they are actually execution engines.

989
Multi-Select · medium

A data scientist needs to perform feature engineering for a machine learning model using Vertex AI. They want to preprocess data using a pipeline that includes scaling, one-hot encoding, and handling missing values. Which TWO services can they use to define and execute this preprocessing pipeline? (Choose 2.)

Select 2 answers
A. Cloud Dataproc
B. Vertex AI Pipelines
C. BigQuery SQL with ML.TRANSFORM
D. Cloud Dataflow
E. Cloud Functions
Answers: B, C

Vertex AI Pipelines allows you to build and run end-to-end ML pipelines, including preprocessing.

Why this answer

Vertex AI Pipelines is the recommended service for building and running ML pipelines, including preprocessing steps. Alternatively, you can use BigQuery SQL for feature engineering directly on the data, then export the processed data for training. Cloud Dataflow is an option for batch/streaming data processing but is not specific to ML pipelines, and Cloud Functions and Dataproc are less suitable for this purpose.
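The three preprocessing steps named in the question (handling missing values, scaling, one-hot encoding) can be sketched independently of any service. The functions and sample columns below are hypothetical; in Vertex AI Pipelines this kind of logic would sit inside a pipeline component, and in BigQuery it would be expressed in SQL instead.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    return [0.0 if span == 0 else (v - low) / span for v in values]


def one_hot(values):
    """Expand a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


# Hypothetical input columns.
ages = [20, None, 40]
plans = ["free", "pro", "free"]

scaled_ages = min_max_scale(impute_mean(ages))
plan_vectors, plan_categories = one_hot(plans)

print(scaled_ages)
print(plan_categories, plan_vectors)
```

Imputing before scaling keeps the filled-in value inside the scaled range, which is why the steps are ordered this way.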

990
MCQ · medium

A company is building a real-time streaming pipeline using Pub/Sub and Dataflow to process clickstream data. The pipeline writes aggregated metrics to BigQuery every 10 seconds using a fixed window. During peak traffic, some windows produce duplicate rows in BigQuery. What is the most likely cause?

A. Dataflow is retrying BigQuery streaming inserts after a timeout, and the retries succeed even though the original insert succeeded.
B. The pipeline uses default triggers instead of after-watermark triggers.
C. The fixed window duration is too short, causing overlapping windows.
D. The pipeline is using too many Dataflow workers, causing load balancing issues.
Answer: A

This is a known scenario: BigQuery streaming inserts are not idempotent, and retries can lead to duplicates.

Why this answer

Option A is correct because Dataflow uses at-least-once semantics for streaming inserts into BigQuery. When a streaming insert times out, Dataflow retries the insert, and if the original insert actually succeeded but the acknowledgment was lost, the retry produces a duplicate row. This is a known behavior of BigQuery streaming inserts with retry logic.
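The retry scenario can be reproduced with a toy in-memory sink. Nothing here calls BigQuery or Dataflow: `FakeSink`, `write_with_retry`, and the window IDs are hypothetical stand-ins, and the de-duplication shown is a generic idempotent-write technique keyed on an insert ID (BigQuery's legacy streaming API accepts an `insertId` for best-effort de-duplication along similar lines).

```python
class FakeSink:
    """In-memory stand-in for a table. This is not the BigQuery API."""

    def __init__(self, dedupe):
        self.dedupe = dedupe
        self.rows = []
        self.seen_ids = set()

    def insert(self, insert_id, row):
        # An idempotent sink ignores an insert ID it has already stored.
        if self.dedupe and insert_id in self.seen_ids:
            return
        self.seen_ids.add(insert_id)
        self.rows.append(row)


def write_with_retry(sink, insert_id, row, ack_lost):
    """At-least-once write: the first attempt always lands in the sink.

    If the acknowledgement is lost, the writer cannot tell that the insert
    succeeded, so it sends the same row again.
    """
    sink.insert(insert_id, row)
    if ack_lost:
        sink.insert(insert_id, row)


# Hypothetical window results; the second write loses its acknowledgement.
windows = [
    ("w1", {"clicks": 10}, False),
    ("w2", {"clicks": 7}, True),
]

naive = FakeSink(dedupe=False)
idempotent = FakeSink(dedupe=True)
for window_id, row, ack_lost in windows:
    write_with_retry(naive, window_id, row, ack_lost)
    write_with_retry(idempotent, window_id, row, ack_lost)

print(len(naive.rows), len(idempotent.rows))
```

The naive sink ends up with an extra copy of the second window's row, which matches the duplicate pattern described in the question.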

Exam trap

The trap here is that candidates often confuse trigger behavior (Option B) with the root cause of duplicates, not realizing that duplicates stem from retry semantics in the sink, not from windowing or parallelism.

How to eliminate wrong answers

Option B is wrong because the default trigger already fires when the watermark passes the end of the window; triggers affect when results are emitted, not whether the sink writes a row twice. Option C is wrong because fixed windows of 10 seconds do not overlap by design; overlapping windows would require a sliding window, not a fixed window. Option D is wrong because using too many Dataflow workers can cause resource inefficiency or shuffle issues, but it does not directly cause duplicate rows in BigQuery output.
