CCNA Pde Designing Data Systems Questions

35 of 110 questions · Page 2/2 · Pde Designing Data Systems topic · Answers revealed

76

MCQmedium

A data pipeline is built with Cloud Dataflow that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is experiencing high latency and occasional data loss during worker failures. The engineer wants to improve reliability and performance. Which two actions should they take?

A.Switch to Cloud Dataproc and use Spark Structured Streaming

B.Increase the number of workers and use at-least-once delivery to BigQuery

C.Enable Dataflow Streaming Engine and use BigQuery exactly-once sink

D.Use a global window and disable triggers

AnswerC

Streaming Engine reduces latency and improves reliability; BigQuery exactly-once sink prevents duplicates.

Why this answer

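Enabling Streaming Engine moves pipeline state and streaming shuffle off the worker VMs and into the Dataflow service backend, which reduces latency and makes the pipeline more resilient to worker failures. Writing through a BigQuery sink with exactly-once semantics means that retries after a failure neither drop nor duplicate rows.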

77

MCQmedium

A company wants to use BigQuery materialized views to accelerate queries on a table that is updated every hour. Which statement about materialized views is true?

A.Materialized views cannot be clustered.

B.Materialized views must be manually refreshed by the user.

C.Materialized views can only be created on ingestion-time partitioned tables.

D.Materialized views are automatically updated when the base table changes.

AnswerD

Yes, BigQuery manages incremental refreshes.

Why this answer

BigQuery materialized views are automatically refreshed when base tables are changed.

78

Multi-Selectmedium

A data engineer needs to design a BigQuery dataset for a multi-team environment. Each team should have read access only to specific tables, and the data must be protected from accidental deletion. Which THREE steps should they take?

Select 3 answers

A.Create authorized views for each team and grant access to the views

B.Cluster tables by team_id to improve performance

C.Grant bigquery.dataViewer on the dataset to all teams

D.Use table-level IAM to assign bigquery.dataViewer per table to each team

E.Enable deletion protection on critical tables

AnswersA, D, E

Authorized views restrict access to specific columns/rows per team.

Why this answer

Authorized views allow sharing specific table data without direct table access. Dataset-level IAM is too broad. Deletion protection prevents accidental table drops.

Clustering improves performance but not access control.

79

MCQeasy

A data engineer needs to create a BigQuery table that is optimized for queries that filter on a 'customer_id' column and sort by 'transaction_date'. The table will be used for interactive analysis. Which combination of table features should be used?

A.Partition by customer_id and cluster by transaction_date

B.Cluster by both customer_id and transaction_date

C.Use a materialized view with customer_id and transaction_date

D.Partition by transaction_date and cluster by customer_id

AnswerD

Partitioning by date allows BigQuery to prune partitions for queries with date ranges. Clustering by customer_id sorts data within partitions, speeding up customer-level queries.

Why this answer

Clustering sorts data based on one or more columns, improving query performance for filters and sorts on those columns. Partitioning by date/timestamp can further improve performance for time-range queries. For this scenario, clustering on customer_id and partitioning by transaction_date is optimal.

80

MCQeasy

You need to choose a messaging service for a real-time streaming application that requires low cost and can tolerate occasional message loss. Which service is MOST suitable?

A.Cloud Scheduler

B.Pub/Sub Lite

C.Pub/Sub

D.Cloud Tasks

AnswerB

Pub/Sub Lite is cheaper and suitable for applications that can tolerate some message loss.

Why this answer

Pub/Sub Lite is a lower-cost alternative to Pub/Sub that trades away some availability and requires you to provision capacity yourself. It is designed for cost-sensitive streaming workloads where that trade-off, including occasional loss, is acceptable.

81

MCQeasy

You are migrating on-premises Hadoop jobs to Google Cloud. The existing jobs use Spark for ETL and Hive for querying. You want to minimize changes to the existing code and maintain the ability to use Hive queries with the same metastore across multiple clusters. Which service combination should you use?

A.Cloud Dataflow with Beam SQL

B.Cloud Dataproc with Dataproc on GKE

C.Cloud BigQuery with external tables on Cloud Storage

D.Cloud Dataproc with Cloud Storage and Dataproc Metastore

AnswerD

Dataproc runs Spark and Hive, and Dataproc Metastore provides a shared Hive metastore. Cloud Storage replaces HDFS.

Why this answer

Dataproc is the managed Hadoop/Spark service that allows you to run existing Spark and Hive code without modification. Dataproc Metastore provides a fully managed Hive metastore that can be shared across multiple Dataproc clusters. Cloud Storage is used instead of HDFS for storing data, but the metastore is the key component.

82

MCQmedium

You need to transform and clean messy CSV data using a visual interface without writing code. The transformation should be scheduled to run weekly. Which Google Cloud service should you use?

A.Cloud Dataprep

B.Cloud Dataflow

C.Dataproc

D.Cloud Data Fusion

AnswerA

Dataprep is specifically designed for visual data wrangling with scheduling capabilities.

Why this answer

Cloud Dataprep (Trifacta) provides a visual interface for data wrangling, allowing users to create recipes and schedule jobs.

83

Multi-Selectmedium

A company uses Cloud Pub/Sub for event ingestion. They want to ensure that if a subscriber fails to process a message after 5 attempts, the message is sent to a dead letter topic for analysis. Which TWO configurations are needed?

Select 2 answers

A.Set max delivery attempts to 5 on the subscription.

B.Set the subscription's ack deadline to 600 seconds.

C.Enable message ordering on the subscription.

D.Create a dead letter topic and attach it to the subscription.

E.Set the subscription type to push.

AnswersA, D

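Max delivery attempts sets the threshold (5) after which Pub/Sub stops redelivering; the attached dead letter topic is where the message is forwarded once that threshold is reached.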

Why this answer

Dead letter topics require setting max delivery attempts and specifying a dead letter topic.
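The interaction between the two settings can be sketched with a toy in-memory model. This is not the Pub/Sub client API: `ToySubscription`, `deliver`, and `always_fails` are hypothetical names used only to illustrate the retry-then-forward behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class ToySubscription:
    """Toy in-memory model of a subscription with a dead-letter policy."""
    max_delivery_attempts: int
    dead_letter_topic: list = field(default_factory=list)

    def deliver(self, message, handler):
        """Redeliver until the handler succeeds or attempts run out."""
        for attempt in range(1, self.max_delivery_attempts + 1):
            try:
                handler(message)
                return attempt  # acknowledged on this attempt
            except Exception:
                continue  # treated as a nack, so the message is redelivered
        # Attempts exhausted: forward the message instead of retrying forever.
        self.dead_letter_topic.append(message)
        return None


def always_fails(message):
    """Hypothetical subscriber that can never process its input."""
    raise ValueError("cannot parse " + message)


sub = ToySubscription(max_delivery_attempts=5)
result = sub.deliver("bad-event", always_fails)        # ends up dead-lettered
ok_attempt = sub.deliver("good-event", lambda m: None)  # succeeds first time
```

Without both pieces the model breaks down in the same way the real configuration would: no attempt limit means endless redelivery, and no dead letter topic means there is nowhere to forward the message.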

84

MCQhard

A financial services company has a BigQuery dataset containing sensitive customer data. They need to share a subset of this data (excluding PII columns) with an external analytics partner. The partner should be able to query the data using their own BigQuery account, but the company must maintain full control over the underlying table and ensure the partner cannot see or access the original table. Which approach should they use?

A.Use dataset-level ACLs to deny the partner access to the original table and grant access to a view.

B.Export the filtered data to a new BigQuery dataset and grant the partner access to that dataset.

C.Create an authorized view in the same dataset, excluding PII columns, and share only the view with the partner's BigQuery account.

D.Create a materialized view and grant the partner the bigquery.dataViewer role on the dataset.

AnswerC

Authorized views allow precise control. The view is defined in the dataset and can be shared with specific users. The partner can query the view but cannot access the underlying table, even if they have permissions on the view. The company retains full control.

Why this answer

Authorized views allow you to share a view with specific users/groups while restricting access to the underlying table. The view can be defined to exclude PII columns and can be shared with the partner's account. The partner queries the view directly, but cannot access the base table.

This maintains data control and security.

85

MCQmedium

An organization runs periodic Apache Spark jobs on Dataproc to process data from Cloud Storage. They want to reduce costs by using preemptible instances for worker nodes. What is a key consideration when using preemptible instances in Dataproc?

A.Preemptible instances cannot be used with the standard cluster mode

B.Jobs must be designed to handle node preemption, and overall job runtime may increase

C.Preemptible instances are only available in certain regions

D.Jobs will automatically restart from the last checkpoint without any performance impact

AnswerB

Preemption can cause task re-execution, so fault tolerance is required and runtime may increase.

Why this answer

Preemptible VMs can be terminated at any time, so Spark jobs must be fault-tolerant. Dataproc handles this by automatically rescheduling failed tasks, but the job may take longer.

86

MCQmedium

You need to analyse streaming data from thousands of IoT devices, each sending temperature readings every second. You want to calculate the average temperature per device over the last 5 minutes, updating every minute. Which windowing strategy should you use in Dataflow?

A.Sliding windows of length 5 minutes with a period of 1 minute

B.Global windows with a trigger firing every minute

C.Fixed windows of 5 minutes

D.Session windows with a gap duration of 1 minute

AnswerA

Sliding windows produce overlapping windows every minute, exactly what is needed.

Why this answer

Sliding windows of length 5 minutes with a period of 1 minute give the desired overlapping windows: every minute, you get the average over the last 5 minutes.
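To see why the windows overlap, the sketch below computes which sliding windows a single reading falls into. It is plain Python rather than Beam code; `sliding_windows` is a hypothetical helper, and times are seconds since an arbitrary origin.

```python
def sliding_windows(timestamp, size=300, period=60):
    """Return the [start, end) windows, in seconds, that contain timestamp."""
    # The most recent window start at or before the timestamp.
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards one period at a time while the window still covers it.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# A reading at t=610s with 5-minute windows that start every minute.
windows = sliding_windows(timestamp=610)
```

Each reading lands in size / period = 5 windows, which is what lets a fresh 5-minute average be emitted every minute.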

87

MCQmedium

A company uses Dataproc Serverless for Spark batch jobs. They notice that some jobs are failing due to out-of-memory (OOM) errors. Which configuration parameter should they adjust to allocate more memory per executor?

A.Use a custom image with more memory

B.Set spark.driver.memory to a higher value

C.Increase the number of workers by setting --num-workers

D.Set spark.executor.memory to a higher value, e.g., 8g

AnswerD

This directly increases memory per executor, fixing OOM errors.

Why this answer

In Dataproc Serverless, Spark properties can be set via --properties. The spark.executor.memory property controls the memory per executor. Increasing it can resolve OOM errors.

88

MCQhard

Your company uses Cloud Data Fusion to build ETL pipelines. You have a pipeline that reads from Cloud Storage, transforms data using a custom Wrangler recipe, and writes to BigQuery. The pipeline is failing with an error indicating that the Wrangler directive is invalid. You have verified the recipe works in the Cloud Data Fusion Studio. What is the most likely cause of the failure?

A.The pipeline is using a different version of the Wrangler plugin

B.The Cloud Storage bucket is in a different region than the Data Fusion instance

C.The Wrangler plugin is not deployed in the Cloud Data Fusion instance

D.The service account used in the pipeline does not have permissions to write to BigQuery

AnswerC

The Cloud Data Fusion studio uses a different environment than the pipeline runtime. The Wrangler plugin must be deployed in the runtime environment; otherwise, directives fail.

Why this answer

When a pipeline that works in the studio fails at runtime, common issues include differences in environment (e.g., runtime arguments, service account permissions, or plugin versions). But the most likely cause is that the pipeline configuration does not include the necessary plugins or the runtime environment is missing the required artifacts. In Cloud Data Fusion, the studio uses a local or preview environment, while the pipeline runs on a separate Cloud Data Fusion instance with its own set of plugins.

If the Wrangler plugin is not deployed to the runtime environment, the directive will fail.

89

MCQeasy

A developer wants to create a BigQuery table that automatically expires data older than 30 days to reduce storage costs. Which table design feature should be used?

A.Authorized view

B.Clustered table

C.Materialized view

D.Partitioned table with partition expiration

AnswerD

Partition expiration automatically deletes partitions older than a specified number of days. This is ideal for time-based data retention.

Why this answer

Partitioned tables with a partition expiration allow automatic deletion of partitions. Clustering does not affect data expiration. Materialized views are for pre-computed aggregates, not data lifecycle.

Authorized views control access.
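The retention behaviour can be illustrated with a small model in which each partition is keyed by its date. This is a sketch, not BigQuery's implementation: `expire_partitions` is a hypothetical helper, and treating the cutoff day itself as retained is an assumption made for the example.

```python
from datetime import date, timedelta


def expire_partitions(partitions, today, expiration_days=30):
    """Keep only partitions whose date is within expiration_days of today."""
    cutoff = today - timedelta(days=expiration_days)
    # Whole partitions are dropped; individual rows are never inspected.
    return {day: rows for day, rows in partitions.items() if day >= cutoff}


today = date(2024, 3, 31)
partitions = {
    date(2024, 2, 15): ["old row"],
    date(2024, 3, 1): ["row at cutoff"],
    date(2024, 3, 30): ["recent row"],
}
kept = expire_partitions(partitions, today)
```

Because expiry works on whole partitions, the table has to be partitioned on a date or timestamp for this to apply, which is why clustering alone does not help.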

90

MCQmedium

A company uses BigQuery for analytics. They have a table that is queried frequently by date range. To reduce costs, they want to ensure queries only scan the relevant partitions. They also want to improve performance for queries filtering on a specific customer_id. Which table design should they use?

A.Partition by ingestion time and cluster by customer_id

B.Use a materialized view that filters by date and customer_id

C.Cluster by date column and partition by customer_id

D.Partition by date column and cluster by customer_id

AnswerD

Partitioning reduces scan to relevant dates; clustering improves filtering on customer_id.

Why this answer

Partitioning by date allows pruning irrelevant partitions; clustering on customer_id orders data within partitions for efficient filtering. Clustering alone doesn't prune partitions. Ingestion-time partitioning is based on arrival time, not logical date.
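The two mechanisms can be modelled in a few lines. This is a toy illustration, not how BigQuery stores data: the `rows` sample, the `query` helper, and the row-count bookkeeping are all made up for the example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# (date, customer_id, amount) sample rows, invented for the example.
rows = [
    ("2024-01-01", 7, 10.0),
    ("2024-01-01", 3, 25.0),
    ("2024-01-02", 7, 5.0),
    ("2024-01-02", 9, 40.0),
    ("2024-01-03", 7, 12.5),
]

# Partition by date, then keep each partition sorted by customer_id,
# which stands in for clustering.
partitions = defaultdict(list)
for day, customer_id, amount in rows:
    partitions[day].append((customer_id, amount))
for part in partitions.values():
    part.sort()


def query(start_date, end_date, customer_id):
    """Return (matching amounts, rows read) for a date range and customer."""
    matches, rows_read = [], 0
    for day, part in partitions.items():
        if not (start_date <= day <= end_date):
            continue  # partition pruning: this partition is never touched
        keys = [cid for cid, _ in part]
        # The sort order lets a binary search locate the customer's rows.
        lo = bisect_left(keys, customer_id)
        hi = bisect_right(keys, customer_id)
        rows_read += hi - lo
        matches.extend(amount for _, amount in part[lo:hi])
    return matches, rows_read


amounts, rows_read = query("2024-01-02", "2024-01-03", 7)
```

The date filter removes the 2024-01-01 partition entirely, and within the remaining partitions only the rows for customer 7 are read.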

91

MCQmedium

A data pipeline ingests streaming events into Pub/Sub. You need to guarantee that each event is processed exactly once downstream in Dataflow. Which combination of Pub/Sub and Dataflow configurations should you use?

A.Use Pub/Sub with exactly-once delivery enabled and Dataflow with exactly-once processing

B.Use Pub/Sub with a unique message ID and Dataflow with idempotent writes or Dataflow's exactly-once sink

C.Use Pub/Sub with message deduplication and Dataflow with at-least-once processing

D.Use Pub/Sub with a dead letter topic and Dataflow with automatic retries

AnswerB

By using a unique ID, you can deduplicate in Dataflow. Dataflow's exactly-once sinks also help ensure no duplicates.

Why this answer

Pub/Sub offers at-least-once delivery. To achieve exactly-once processing, the pipeline must be idempotent or use Dataflow's exactly-once sinks. Using a unique message ID for deduplication is a common approach.
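The deduplication idea can be sketched as follows. This is a minimal in-memory illustration, not Dataflow or Pub/Sub code: `process_once` is a hypothetical function, and a real pipeline would need durable, bounded state rather than an unbounded Python set.

```python
def process_once(messages, sink):
    """Write each message's payload to sink once, keyed by message ID."""
    seen_ids = set()
    for message_id, payload in messages:
        if message_id in seen_ids:
            continue  # redelivery of a message that was already handled
        seen_ids.add(message_id)
        sink.append(payload)
    return sink


# At-least-once delivery: "m1" arrives twice.
delivered = [
    ("m1", "login"),
    ("m2", "purchase"),
    ("m1", "login"),
    ("m3", "logout"),
]
output = process_once(delivered, [])
```

The duplicate delivery of `m1` is absorbed, so the downstream result is the same as if every message had arrived exactly once.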

92

Multi-Selectmedium

A data team is building a near-real-time dashboard that displays aggregated metrics from Kafka topics. They want to use Pub/Sub as a managed messaging service and Dataflow for stream processing. They need to ingest data from Kafka into Pub/Sub with minimal custom code. Which THREE Google Cloud services should they use together? (Choose three.)

Select 3 answers

A.Dataflow

B.Pub/Sub

C.Kafka Connect (with Pub/Sub connector)

D.Cloud NAT

E.Cloud Functions

AnswersA, B, C

Dataflow processes the streaming data for aggregation.

Why this answer

Pub/Sub is the target messaging system, and Dataflow performs the stream processing and aggregation. Kafka Connect with the Pub/Sub connector is the standard way to stream data from Kafka into Pub/Sub, and it is configuration-driven, which satisfies the "minimal custom code" requirement.

Cloud NAT is not needed. Cloud Functions could be used but is not required and adds complexity. Dataflow can also read from Kafka directly with its Kafka I/O connector and write to Pub/Sub, which would remove the need for Kafka Connect, but that requires writing pipeline code. The best answer set is therefore Pub/Sub, Dataflow, and Kafka Connect.

93

MCQhard

A data engineer is designing a real-time fraud detection system using Dataflow. The system must detect patterns across events from multiple users within a sliding window of 10 minutes. Events arrive on Pub/Sub topics per user. Which approach should they use to join the streams?

A.Use a side input to read one stream as a map and enrich the other stream

B.Use Flatten to merge the streams and then Partition

C.Use CoGroupByKey on the two streams using a common key like user_id

D.Use Union to combine both streams into one and then apply GroupByKey

AnswerC

CoGroupByKey joins multiple streams by key within the same window.

Why this answer

CoGroupByKey joins multiple PCollections by key. Using user_id as common key, both streams can be joined. Side inputs and Union are not for joining.

Flatten merges PCollections of same type.

94

MCQmedium

Your team is migrating a legacy batch processing system that uses Apache Spark on-premises. The migration must be completed with minimal code changes and support both batch and streaming in the future. You want to use a fully managed service. Which Google Cloud service is most appropriate?

A.Cloud Data Fusion

B.Cloud Dataflow

C.Cloud Dataproc Serverless

D.Cloud Dataproc (standard cluster)

AnswerD

Dataproc standard clusters support both batch and streaming Spark jobs with minimal code changes. It is managed, though not fully serverless.

Why this answer

Cloud Dataflow uses Apache Beam, a different programming model from Spark, so it would require rewriting the jobs. Dataproc is the managed Spark service that runs existing Spark code with minimal changes. Dataproc Serverless eliminates cluster management, but it currently supports only batch workloads, not streaming.

Because the requirement covers both batch now and streaming in the future, a standard Dataproc cluster is the best answer: it runs batch jobs and Spark Structured Streaming on the same cluster, at the cost of not being fully serverless.

95

MCQhard

A Dataflow pipeline processes a high-volume stream of JSON events. The pipeline has a bottleneck where a ParDo transformation performs an external API call for each element, causing high latency. Which strategy would BEST improve throughput without sacrificing correctness?

A.Increase the number of workers in the pipeline.

B.Switch from ParDo to MapElements.

C.Use a side input to batch elements and make fewer API calls.

D.Use a GroupByKey to group elements with the same key and then make one API call per group.

AnswerC

Batching reduces API calls, improving throughput.

Why this answer

Using side inputs to batch data before API calls can reduce the number of calls and improve throughput.
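The effect of batching on call count can be shown with a generic sketch. It is plain Python, not Beam code: `batched` and `fake_enrich_api` are hypothetical stand-ins for the batching step and the external service.

```python
def batched(elements, batch_size):
    """Yield consecutive slices of at most batch_size elements."""
    for start in range(0, len(elements), batch_size):
        yield elements[start:start + batch_size]


api_calls = []  # records every call so the number of round trips is visible


def fake_enrich_api(batch):
    """Hypothetical external API that accepts a whole batch per request."""
    api_calls.append(list(batch))
    return [event.upper() for event in batch]


events = ["a", "b", "c", "d", "e"]
results = [
    enriched
    for batch in batched(events, 2)
    for enriched in fake_enrich_api(batch)
]
```

Five elements are enriched with three calls instead of five, and every element still appears once in the output, so correctness is preserved. In Beam itself, the `GroupIntoBatches` transform is one way to form such batches.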

96

Multi-Selectmedium

A retail company uses Dataflow to process real-time clickstream data. They need to enrich each event with customer profile data from Cloud Bigtable and session metadata from Cloud Spanner. Which two Dataflow features should they use?

Select 2 answers

A.ParDo

B.Windowing

C.CoGroupByKey

D.GroupByKey

E.Side inputs

AnswersA, E

ParDo is used for per-element transformation, such as looking up enrichment data.

Why this answer

Side inputs allow reading from Bigtable and Spanner in a non-blocking way. ParDo is for per-element processing where enrichment occurs. GroupByKey and Windowing are not needed for this enrichment step.

97

MCQeasy

Which Google Cloud service provides a fully managed, serverless Spark environment without requiring cluster provisioning?

A.Dataproc on GKE

B.Dataflow

C.Dataproc Serverless

D.Cloud Data Fusion

AnswerC

Serverless Spark is a feature of Dataproc Serverless.

Why this answer

Dataproc Serverless allows running Spark workloads without managing clusters.

98

MCQeasy

You need to process a large Spark ML training job on a Dataproc cluster. The job is fault-tolerant and can handle occasional node failures. To reduce costs, which type of worker nodes should you use?

A.Preemptible worker nodes

B.Standard worker nodes

C.High-memory worker nodes

D.Sole-tenant nodes

AnswerA

Preemptible VMs offer up to 80% discount and are suitable for fault-tolerant workloads.

Why this answer

Preemptible VMs are significantly cheaper but can be terminated at any time. Since the job is fault-tolerant, preemptible workers can be used for cost savings.

99

MCQmedium

You are designing a Dataflow pipeline that joins two unbounded PCollections from different sources. Which transform should you use?

A.ParDo

B.Flatten

C.CoGroupByKey

D.GroupByKey

AnswerC

CoGroupByKey joins multiple PCollections by key.

Why this answer

CoGroupByKey performs a key-based join of multiple PCollections. It can handle unbounded streams with appropriate windowing.
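The shape of the result can be illustrated without Beam. The sketch below is a plain-Python model of the grouping semantics for a single window; `co_group_by_key` is a hypothetical function, not the Beam transform, and the sample data is invented.

```python
from collections import defaultdict


def co_group_by_key(**tagged_collections):
    """Group several keyed collections by key, one list per input tag."""
    # Every key gets an entry for every tag, even if that input had no values.
    grouped = defaultdict(lambda: {tag: [] for tag in tagged_collections})
    for tag, pairs in tagged_collections.items():
        for key, value in pairs:
            grouped[key][tag].append(value)
    return dict(grouped)


logins = [("alice", "10:00"), ("bob", "10:02")]
payments = [("alice", 25.0), ("alice", 40.0), ("carol", 5.0)]
joined = co_group_by_key(logins=logins, payments=payments)
```

Keys that appear in only one input still show up, with an empty list for the other input, so inner or outer join behaviour is decided by the step that consumes the grouped result.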

100

MCQmedium

A data pipeline uses Dataflow to read from Pub/Sub, window messages into 1-minute fixed windows, and write to BigQuery. The pipeline occasionally has late-arriving data. How should they configure the pipeline to allow late data up to 5 minutes and then trigger a final pane?

A.withAllowedLateness(Duration.standardMinutes(5)).triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))

B.withAllowedLateness(Duration.standardMinutes(5)).triggering(AfterWatermark.pastEndOfWindow().withLateFirings(AfterPane.elementCountAtLeast(1)))

C.triggering(AfterWatermark.pastEndOfWindow()).withAllowedLateness(Duration.standardMinutes(5))

D.withAllowedLateness(Duration.standardMinutes(5)).accumulatingFiredPanes()

AnswerB

Allows 5 min lateness and fires a final pane after watermark passes end of window.

Why this answer

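In Beam, allowed lateness and triggering combine to handle late data. AfterWatermark.pastEndOfWindow() fires the on-time pane when the watermark passes the end of the window, withLateFirings(AfterPane.elementCountAtLeast(1)) fires again for each late element, and withAllowedLateness(Duration.standardMinutes(5)) keeps the window open to late data for 5 minutes, after which the window closes and later elements are dropped.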

101

MCQmedium

A company needs to process data from a legacy system that outputs CSV files daily. They want to visually build transformations without writing code. Which Google Cloud service should they use?

A.Dataproc

B.Dataprep

C.Dataflow

D.Cloud Data Fusion

AnswerB

Dataprep provides a visual interface for transformations.

Why this answer

Dataprep is a visual data wrangling tool for exploring and cleaning data.

102

MCQeasy

Which BigQuery feature allows you to share query results with specific users without giving them direct access to the underlying tables?

A.IAM roles

B.Authorized views

C.Dataset access controls

D.Materialized views

AnswerB

Authorized views allow sharing query results securely.

Why this answer

Authorized views allow sharing results without granting access to the base tables.

103

MCQhard

A Dataflow streaming pipeline reads from Pub/Sub, processes events with a fixed window of 1 minute, and writes to BigQuery. Some events arrive late due to network issues. You need to ensure late events are still included in the correct window but the pipeline must not wait indefinitely. What configuration should you use?

A.Set allowed lateness to 5 minutes and use the default trigger

B.Use a sliding window of 1 minute with a 1-minute period

C.Use a global window with a trigger that fires every 10 seconds

D.Increase the watermark estimate to 10 minutes

AnswerA

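This allows late events up to 5 minutes after the window end. The default trigger fires when the watermark passes the end of the window and fires again for late events that arrive within the allowed lateness; anything later is dropped, so the pipeline does not wait indefinitely.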

Why this answer

Setting a watermark estimate and allowed lateness with a trigger controls how long the pipeline waits for late data. The default trigger fires at the end of the window, and with allowed lateness, late events are still processed until the allowed time expires.
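The decision made for each element can be sketched as a small function. This is a simplified model of the behaviour described above, not Beam code: `classify` is a hypothetical helper, times are in seconds, and the exact boundary handling is an assumption of the sketch.

```python
def classify(event_time, watermark, window_size=60, allowed_lateness=300):
    """Return (window, status) for an element given the current watermark."""
    # The fixed 1-minute window is chosen by event time, not arrival time.
    start = event_time - (event_time % window_size)
    end = start + window_size
    if watermark < end:
        status = "on time"
    elif watermark < end + allowed_lateness:
        status = "late, still included"
    else:
        status = "dropped"
    return (start, end), status


# The same event (t=30s, window [0, 60)) seen at three watermark positions.
on_time = classify(event_time=30, watermark=45)
late = classify(event_time=30, watermark=200)
too_late = classify(event_time=30, watermark=400)
```

The window assignment never changes; only the element's fate does, depending on how far the watermark has advanced past the end of the window.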

104

MCQhard

A data pipeline using Cloud Dataflow reads from a Pub/Sub subscription that has a dead letter topic configured. Some messages are being sent to the dead letter topic. Upon investigation, the engineer finds that the messages contain valid data but are malformed according to the schema. What is the most likely reason for the messages being dead-lettered?

A.The Pub/Sub topic has a schema that the messages do not comply with

B.The Pub/Sub topic is not configured with a schema

C.The Dataflow pipeline is using at-least-once delivery guarantee

D.The subscription's ack deadline is too short

AnswerA

Topic schema enforcement causes non-compliant messages to be rejected and sent to dead letter.

Why this answer

The topic's schema enforcement validates incoming messages; if a message doesn't conform to the schema, it is forwarded to the dead letter topic.

105

Multi-Selecthard

A company is migrating their on-premises Hadoop/Spark workloads to Google Cloud. They need a fully managed service that supports existing Spark jobs with minimal code changes, allows autoscaling, and provides integration with Cloud Storage and BigQuery. The team also wants to avoid managing cluster infrastructure and pay only for what they use. Which TWO services meet these requirements? (Choose two.)

Select 2 answers

A.Dataproc Serverless (Spark)

B.Dataproc on GKE

C.Standard Dataproc cluster with preemptible workers

D.Cloud Composer with Spark

E.Dataflow with Spark Runner

AnswersA, B

Dataproc Serverless runs Spark jobs without cluster management, supports autoscaling, and integrates with Cloud Storage and BigQuery.

Why this answer

Dataproc Serverless allows running Spark jobs without managing clusters, with autoscaling and pay-per-use pricing. Dataproc on GKE enables running Spark on Kubernetes with autoscaling and is fully managed. Standard Dataproc requires cluster management and is not serverless.

Dataflow is for Beam, not Spark. Cloud Composer is for orchestration, not data processing.

106

MCQmedium

Your company ingests millions of events per second into a Pub/Sub topic. The downstream consumer must process events with minimal latency and high throughput. However, the consumer occasionally falls behind during traffic spikes, and you need to ensure no data loss while minimizing costs. Which subscription type and configuration should you choose?

A.Push subscription with a load balancer

B.Pull subscription with flow control settings

C.Push subscription with endpoint on Cloud Run

D.Pull subscription with exactly-once delivery disabled

AnswerB

Pull subscriptions allow the subscriber to pull messages at its own pace, and flow control helps prevent overwhelming the consumer. This combination handles high throughput efficiently.

Why this answer

Pull subscriptions allow the subscriber to control the throughput by batching messages and setting flow control, which is ideal for high-throughput scenarios. Using a pull subscription with exactly-once delivery (if available) or at-least-once combined with idempotent processing ensures no data loss. Push subscriptions have limitations on throughput and are not suitable for millions of events per second.

107

Multi-Selecteasy

Your team is using Cloud Dataprep to clean and transform a dataset. Which TWO features of Cloud Dataprep help you understand data quality issues before running the pipeline? (Choose 2.)

Select 2 answers

A.Scheduling data quality jobs

B.Column histograms

C.Joining datasets

D.Recipe steps

E.Data quality profiling

AnswersB, E

Histograms visually display the distribution of values, helping to spot unexpected patterns.

Why this answer

Data quality profiling provides statistics and distributions to identify anomalies. Column histograms visualize data distribution and outliers. Scheduling and recipe steps are execution features, not exploratory analysis.

Joins are transformations, not profiling.

108

MCQmedium

A company wants to use Cloud Data Fusion to build ETL pipelines. They need to connect to a legacy on-premises database using JDBC and also want to use prebuilt transforms from the Hub. Which two features should they use?

A.Cloud SQL JDBC driver and Cloud Functions

B.Dataproc Metastore and Cloud Storage sink

C.Wrangler and Dataproc

D.CDAP JDBC plugin and the Hub

AnswerD

CDAP JDBC plugin connects to on-prem DB; Hub provides prebuilt transforms.

Why this answer

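Cloud Data Fusion uses CDAP plugins for JDBC connections and the Hub provides prebuilt transforms. Plugins are the mechanism; Hub is where they are sourced. Wrangler is for interactive data preparation, not for connecting to a database.

Dataproc is not a feature the team needs to select here: Data Fusion provisions its own execution environment (ephemeral Dataproc clusters by default) when a pipeline runs.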

109

MCQmedium

A data engineer needs to run an existing Spark job on Google Cloud with minimal code changes. The job requires Hive metastore access. Which Dataproc feature should they use to provide a managed Hive metastore?

A.Cloud SQL for MySQL

B.Dataproc Metastore

C.BigQuery as a Hive metastore

D.Dataproc on GKE

AnswerB

Dataproc Metastore is a managed Hive metastore service that works with Dataproc clusters.

Why this answer

Dataproc Metastore provides a fully managed Hive metastore that integrates with Dataproc clusters, allowing existing Spark jobs to use it without code changes.

110

MCQmedium

You are moving an on-premises Hadoop workload to Google Cloud. The workload uses Hive for metadata and HDFS for storage. Which services should you use to minimise reconfiguration?

A.Dataproc with HDFS and Cloud Bigtable for metadata

B.Dataproc with Cloud Storage and Cloud SQL for Hive metastore

C.Dataflow with Cloud Storage and BigQuery

D.Dataproc with Cloud Storage and Dataproc Metastore

AnswerD

Dataproc Metastore is a fully managed Hive metastore. Cloud Storage replaces HDFS seamlessly.

Why this answer

Dataproc Metastore provides a fully managed Hive metastore service that can be used with Dataproc clusters. Cloud Storage can replace HDFS through the Cloud Storage connector, so jobs typically only need their hdfs:// path prefixes changed to gs://. This minimises code changes.
