DP-203 Design and develop data processing Practice Test 2 — 15 Questions

Question 1

You are a data engineer for a large e-commerce company. The company uses Azure Data Lake Storage Gen2 (ADLS Gen2) as its data lake. A team of data scientists needs to process a massive dataset (approximately 5 TB) stored in Parquet format in the data lake. The dataset contains sales transactions from the past 10 years. The data scientists run a Spark job daily using Azure Synapse Analytics (serverless Spark pool) to compute aggregated sales metrics by product category and region. The job reads the entire dataset each day, performs transformations, and writes the aggregated results back to the data lake. Over the past few weeks, the job has been taking longer to complete, and the data scientists have reported that the job now takes over 6 hours, exceeding the acceptable SLA of 4 hours. They suspect the issue is related to data skew or suboptimal partitioning. You need to optimize the job to reduce execution time. Which approach should you take?

Accepted Answer

Implement incremental processing using Auto Loader with 'directoryListing' mode to process only new files since the last run. This answer is correct because the job rereads the entire 5 TB dataset every day, which is inefficient when only new data needs processing. Auto Loader in 'directoryListing' mode identifies new files by listing the input directory and processes only those that arrived since the last run, which sharply reduces the data volume and execution time. This addresses the root cause of the SLA breach, which is rereading unchanged historical data, rather than tuning resources or partitioning. Note that Auto Loader is an Azure Databricks feature; on a Synapse Spark pool the same incremental pattern would typically be built with a checkpointed Structured Streaming file source.

Answer

Increase the executor memory and cores in the Spark pool configuration to handle larger shuffles.

Answer

Repartition the data on the 'product_category' column with a higher number of partitions (e.g., 2000).

Answer

Use a broadcast join hint on the fact table to reduce shuffle operations.
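
As a rough illustration of the incremental idea behind the accepted answer to Question 1, the sketch below keeps a set of already processed file names and selects only unseen files on each run. It is plain Python, not Auto Loader or Spark code; the file names and the `run_incremental` helper are hypothetical.

```python
from typing import Iterable, List, Set


def select_new_files(listing: Iterable[str], processed: Set[str]) -> List[str]:
    """Return files from the directory listing that have not been processed yet."""
    return sorted(name for name in listing if name not in processed)


def run_incremental(listing: Iterable[str], processed: Set[str]) -> List[str]:
    """Process only new files and record them in the checkpoint set."""
    new_files = select_new_files(listing, processed)
    # A real job would read and aggregate only these files here.
    processed.update(new_files)
    return new_files


# Hypothetical daily listings of the input directory.
checkpoint: Set[str] = set()
day1 = ["sales/2024-01-01.parquet", "sales/2024-01-02.parquet"]
day2 = day1 + ["sales/2024-01-03.parquet"]

first = run_incremental(day1, checkpoint)   # both files are new
second = run_incremental(day2, checkpoint)  # only the third file is new
```

The second run touches one file instead of three, which is the effect the explanation relies on at the 5 TB scale.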

Question 2

You are a data engineer at a healthcare analytics company. The company uses Azure Data Factory (ADF) to orchestrate data pipelines that ingest patient data from on-premises SQL Server databases into Azure Synapse Analytics. Recently, the pipeline has been failing intermittently with the following error: 'Failure happened on 'Sink' side. ErrorCode=SqlFailedToConnect, Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Cannot connect to SQL Server Database. The TCP connection to the host <server_name>, port 1433 has failed. Error: 'Connection timed out.'.' The on-premises SQL Server is behind a corporate firewall. The ADF self-hosted integration runtime (SHIR) is installed on a VM inside the corporate network. You have verified that the SHIR is running and that the SQL Server is accessible from the SHIR VM using SQL Server Management Studio (SSMS). The error occurs sporadically, not consistently. What is the most likely cause of the intermittent connection timeout?

Accepted Answer

The corporate firewall or network device is closing idle TCP connections to the SQL Server database. The intermittent nature of the timeout, combined with the fact that the SHIR VM can connect to SQL Server via SSMS, strongly suggests that the corporate firewall or a network intermediary (such as a load balancer or NAT device) is closing idle TCP connections. ADF pipelines may hold connections open between activities or during long-running data transfers, and if no keep-alive packets are sent within the firewall's idle timeout window (commonly 4–30 minutes), the firewall drops the TCP session. When ADF attempts to reuse that connection, it receives a 'Connection timed out' error because the socket is no longer valid.

Answer

The data being transferred is skewed, causing the sink to be overwhelmed.

Answer

The SQL Server database is experiencing high CPU utilization during the pipeline execution window.

Answer

The self-hosted integration runtime is running out of memory during peak loads.
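
To make the keep-alive mechanism mentioned in the accepted answer to Question 2 concrete, here is a minimal sketch that enables TCP keep-alive on a socket with Python's standard `socket` module. It only illustrates the operating-system option; it is not how ADF or the SHIR is configured, and the 60-second idle value and the `make_keepalive_socket` helper are assumptions.

```python
import socket


def make_keepalive_socket(idle_seconds: int = 60) -> socket.socket:
    """Create a TCP socket that sends keep-alive probes on idle connections."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is platform-specific (for example, Linux defines it),
    # so it is set only where the constant exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock


sock = make_keepalive_socket(60)
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

If the probe interval is shorter than the firewall's idle timeout, the session keeps carrying traffic and is less likely to be dropped as idle.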

Question 3

A data engineering team is designing a batch processing solution using Azure Databricks. The data is stored in Azure Data Lake Storage Gen2 (ADLS Gen2) and must be processed daily with minimal cost. The team needs to choose between using a Delta Lake table or a Parquet file format for the processed output. Which TWO factors should the team consider when making this decision?

Accepted Answer

Delta Lake provides time travel capabilities for accessing historical data versions. This answer is correct because Delta Lake's time travel feature allows querying previous versions of data by timestamp or version number, which supports auditing, rollback, and reproducing historical reports. The capability is built on the transaction log that tracks every change, making it a key differentiator from plain Parquet files.

Answer

Parquet is easier to implement for schema evolution than Delta Lake.

Answer

Delta Lake reduces storage costs by automatically compressing data.

Answer

Parquet files are not natively supported by Azure Databricks.

Question 4

You are a data engineer at a retail company. You have designed a near real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs, which receives clickstream events from the company's e-commerce website. The output is written to an Azure SQL Database table for reporting. Each event includes fields: UserId, ProductId, EventType (e.g., 'click', 'purchase'), and Timestamp. The requirement is to calculate the number of purchases per product in a 5-minute tumbling window and update a SQL table. The Stream Analytics job has been running for a week, but the reporting team notices that the purchase counts in SQL are consistently lower than expected compared to a direct count from Event Hubs. You suspect that late-arriving events are being dropped. The job's configuration includes a 5-minute tumbling window with no late arrival policy. What should you do to fix the issue without losing data?

Accepted Answer

Modify the query to use a larger tumbling window (e.g., 10 minutes) and add a late arrival policy with a 5-minute grace period to allow late events to be included. This answer is correct because the current 5-minute tumbling window has no late arrival policy, so any event that arrives after its window has ended is dropped. Increasing the window size to 10 minutes and adding a 5-minute late arrival grace period allows events that arrive up to 5 minutes late to still be included in the correct window, so the SQL counts match the direct count from Event Hubs.

Answer

Modify the query to use TIMESTAMP BY on the EventHubs enqueued time instead of the event's Timestamp field.

Answer

Change the tumbling window to a hopping window with a 1-minute hop size to increase the frequency of output updates.

Answer

Add a second Stream Analytics job to process late-arriving events separately and union the results.
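
The effect described in the accepted answer to Question 4 can be sketched with a simplified model of a tumbling-window count. This is plain Python, not Stream Analytics query language; the event fields, the sample values, and the choice to drop (rather than adjust) late events are assumptions made for illustration.

```python
from collections import Counter

WINDOW = 300  # 5-minute tumbling window, in seconds


def count_purchases(events, late_tolerance):
    """Count purchases per (window_start, product), dropping events that arrive too late."""
    counts = Counter()
    for event in events:
        if event["type"] != "purchase":
            continue
        delay = event["arrival"] - event["event_time"]
        if delay > late_tolerance:
            continue  # simplified policy: a late event is dropped
        window_start = (event["event_time"] // WINDOW) * WINDOW
        counts[(window_start, event["product"])] += 1
    return counts


# Hypothetical events; times are seconds since the start of the stream.
events = [
    {"product": "P1", "type": "purchase", "event_time": 10, "arrival": 12},
    {"product": "P1", "type": "purchase", "event_time": 290, "arrival": 500},
    {"product": "P1", "type": "click", "event_time": 20, "arrival": 21},
]

strict = count_purchases(events, late_tolerance=5)
tolerant = count_purchases(events, late_tolerance=300)
```

With the small tolerance the second purchase, which arrives 210 seconds after its timestamp, is missing from the count; with the 5-minute tolerance it is counted in the window its timestamp belongs to.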

Question 5

You are designing an Azure Stream Analytics job to process real-time IoT data from thousands of devices. The job must handle late-arriving events (up to 1 hour late) and out-of-order events (up to 5 minutes). Which two temporal policies should you configure?

Accepted Answer

Out of order tolerance window: 5 minutes; Late arrival tolerance window: 1 hour. Azure Stream Analytics uses two distinct temporal policies to handle event timing issues. The 'Late arrival tolerance window' defines the maximum accepted delay between an event's timestamp and its arrival at the input (up to 1 hour in this scenario), while the 'Out of order tolerance window' defines how far out of sequence events may arrive and still be accepted (up to 5 minutes). This answer configures both policies to match the requirements.

Answer

Out of order tolerance window: 1 hour; Late arrival tolerance window: 5 minutes

Answer

Use Event Hubs capture to handle late events; no additional configuration needed
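
The difference between the two policies in Question 5 can be shown with a simplified classifier. It is a sketch only: the real service can adjust timestamps instead of rejecting events and applies its own ordering rules, and the label names and sample events below are hypothetical.

```python
LATE_TOLERANCE = 3600          # late arrival tolerance: 1 hour, in seconds
OUT_OF_ORDER_TOLERANCE = 300   # out of order tolerance: 5 minutes, in seconds


def classify(events):
    """Label each (event_time, arrival_time) pair under the two tolerances."""
    labels = []
    max_event_time = None
    for event_time, arrival_time in events:
        # Late arrival compares the event's own timestamp with its arrival time.
        if arrival_time - event_time > LATE_TOLERANCE:
            labels.append("too_late")
            continue
        # Out of order compares the timestamp with the latest timestamp seen so far.
        if max_event_time is not None and event_time < max_event_time - OUT_OF_ORDER_TOLERANCE:
            labels.append("too_out_of_order")
            continue
        labels.append("accepted")
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return labels


events = [
    (1000, 1010),  # arrives 10 s after its timestamp
    (900, 1020),   # 100 s behind the latest timestamp, within tolerance
    (5000, 5005),
    (4600, 5010),  # 400 s behind the latest timestamp
    (1000, 5020),  # arrives 4020 s after its timestamp
]
labels = classify(events)
```

The fourth event fails only the out-of-order check and the fifth fails only the late-arrival check, which is why the two windows are configured separately.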

Question 6

You are designing a data processing solution for a retail company. The solution must ingest streaming sales data from point-of-sale (POS) systems and batch uploads from stores that are offline. The total data volume is 5 TB daily. The solution must allow real-time dashboards and periodic batch processing. Which combination of services and ingestion patterns is most cost-effective and scalable?

Accepted Answer

Use Azure Event Hubs with Kafka protocol for all incoming data. Use Stream Analytics for real-time dashboards and Event Hubs Capture to land data in ADLS for batch processing. This answer is correct because Azure Event Hubs with Kafka protocol provides a unified ingestion endpoint for both streaming POS data and batch offline uploads, eliminating the need for separate ingestion services. Stream Analytics feeds the real-time dashboards, while Event Hubs Capture automatically lands the data in Azure Data Lake Storage for batch processing, which keeps the design scalable and cost-effective at 5 TB per day.

Answer

Use Azure IoT Hub for POS streaming and Azure Blob Storage for offline store uploads, then process with Stream Analytics and Data Factory

Answer

Use Azure Data Lake Storage for all data, then use Azure Databricks structured streaming for real-time and batch

Answer

Use Azure Stream Analytics directly on POS data and store offline uploads in Blob Storage, then batch process with U-SQL

Question 7

You are implementing a data pipeline using Azure Data Factory. The source is an on-premises SQL Server database. Which Azure Data Factory component is required to connect to the on-premises data source?

Accepted Answer

Self-hosted Integration Runtime. A self-hosted integration runtime (IR) is required to connect Azure Data Factory to on-premises SQL Server because it provides the compute environment for data movement between on-premises networks and Azure. It must be installed on a machine inside the corporate firewall, enabling secure communication via outbound HTTPS (port 443) to Azure. This is the only IR type that can access private, on-premises data sources directly.

Answer

Azure Integration Runtime

Answer

Managed Virtual Network Integration Runtime

Answer

Azure Data Factory Gateway

Question 8

You have a Synapse Analytics dedicated SQL pool. You need to load 100 GB of CSV data from Azure Data Lake Storage Gen2 into a fact table. The table has a hash-distributed column. Which pattern is most efficient for loading with minimal impact on concurrent queries?

Accepted Answer

Use CREATE TABLE AS SELECT (CTAS) with the hash-distributed column. This answer is correct because CTAS with a hash-distributed column loads data directly into a table with the target distribution scheme, avoiding extra data movement and minimizing resource contention. The pattern is optimized for bulk loading large datasets into a hash-distributed fact table because it uses the dedicated SQL pool's MPP architecture to parallelize the load without blocking concurrent queries.

Answer

Use PolyBase INSERT...SELECT with rowstore table

Answer

Use COPY INTO command with a round-robin distribution

Answer

Use Azure Data Factory Copy activity with staging enabled

Question 9

You are designing a stream processing solution using Azure Stream Analytics. The job must reference a static lookup table (product catalog) stored in Azure Blob Storage. The catalog is updated once daily. The job should automatically pick up the latest version without restarting. Which two configurations are required? (Choose two.)

Accepted Answer

Set the reference input's 'Path pattern' to include date and time placeholders. This answer is correct because Azure Stream Analytics reference data inputs support the {date} and {time} path pattern placeholders, which let the job resolve the latest blob dynamically. When a new version of the lookup table is uploaded under a path that matches the pattern, the job loads it without a restart. The path pattern must match the naming convention of the uploaded files, such as 'catalog/{date}/{time}/products.csv'.

Answer

Configure the reference input with a static blob path

Answer

Use Azure Event Grid to trigger job restart on blob update

Answer

Store the reference data in Azure SQL Database instead of Blob Storage
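
To illustrate how a path pattern such as 'catalog/{date}/{time}/products.csv' from Question 9 can identify the current snapshot, the sketch below picks the blob whose encoded date and time is the most recent one not after a given moment. It is plain Python, not the service's implementation; the YYYY-MM-DD and HH-mm formats, the blob names, and the `latest_snapshot` helper are assumptions.

```python
import re
from datetime import datetime

# Assumed layout: catalog/<YYYY-MM-DD>/<HH-mm>/products.csv
PATTERN = re.compile(r"^catalog/(\d{4}-\d{2}-\d{2})/(\d{2}-\d{2})/products\.csv$")


def latest_snapshot(blob_names, now):
    """Return the blob whose encoded date/time is the most recent one not after `now`."""
    candidates = []
    for name in blob_names:
        match = PATTERN.match(name)
        if not match:
            continue  # ignore blobs that do not follow the pattern
        stamp = datetime.strptime(f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H-%M")
        if stamp <= now:
            candidates.append((stamp, name))
    return max(candidates)[1] if candidates else None


# Hypothetical daily uploads of the product catalog.
blobs = [
    "catalog/2024-05-01/00-00/products.csv",
    "catalog/2024-05-02/00-00/products.csv",
    "catalog/2024-05-03/00-00/products.csv",
    "catalog/readme.txt",
]
current = latest_snapshot(blobs, datetime(2024, 5, 2, 12, 0))
```

A static blob path has no such ordering information, which is why it cannot signal that a newer version exists.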

Question 10

You are designing a data transformation pipeline using Azure Databricks. The pipeline reads from Azure Data Lake Storage Gen2, performs aggregations, and writes to a Synapse dedicated SQL pool. Which three configurations should you implement to optimize performance and minimize cost? (Choose three.)

Accepted Answer

Enable Delta Lake on the storage account. This answer is correct because using Delta Lake for the data in the storage account gives you Delta tables, which provide ACID transactions, scalable metadata handling, and unified batch/streaming capabilities. These properties support reliable and performant data transformations in Azure Databricks, especially when reading from ADLS Gen2 and writing to Synapse.

Answer

Use a single-node cluster to reduce cost

Answer

Disable autoscaling to avoid cost variability

Answer

Use default Spark shuffle partitions (200)

Question 11

Which Azure service is primarily used for orchestrating data pipelines in a cloud-native ETL workflow?

Accepted Answer

Azure Data Factory. Azure Data Factory (ADF) is the correct answer because it is a cloud-native, serverless data integration service specifically designed for orchestrating and automating data pipelines. ADF provides a code-free visual interface, supports over 90 built-in connectors, and enables complex ETL/ELT workflows with control flow, data flow, and trigger-based scheduling, making it the primary orchestration tool in Azure.

Answer

Azure Synapse Analytics

Answer

Azure HDInsight

Answer

Azure Databricks

Question 12

When designing a data processing solution using Azure Databricks, what is the recommended approach to handle schema evolution when reading data from Delta Lake tables?

Accepted Answer

Set the option 'mergeSchema' to 'true' on write. This answer is correct because Delta Lake enables schema evolution when the 'mergeSchema' option is set to 'true' on a write operation. New columns in the incoming data are then added to the table schema, and compatible column types can be safely widened, without manual intervention and while preserving existing data and metadata integrity.

Answer

Set the option 'overwriteSchema' to 'true' on write

Answer

Manually alter the table schema using ALTER TABLE

Answer

Ignore schema changes and use 'failOnDataLoss' flag
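
The behaviour that 'mergeSchema' switches on in Question 12 can be modelled with dictionaries standing in for schemas. This is a conceptual sketch in plain Python, not the Delta Lake API; the `append_schema` helper and the column names are hypothetical, and type widening is left out for brevity.

```python
def append_schema(table_schema, incoming_schema, merge_schema=False):
    """Return the table schema after an append, mimicking schema enforcement."""
    result = dict(table_schema)
    for column, dtype in incoming_schema.items():
        if column not in result:
            if not merge_schema:
                # Without schema evolution, an unknown column is rejected.
                raise ValueError(f"unexpected column: {column}")
            result[column] = dtype  # the new column is added to the table schema
        elif result[column] != dtype:
            raise TypeError(f"type mismatch for column: {column}")
    return result


table = {"order_id": "long", "amount": "double"}
incoming = {"order_id": "long", "amount": "double", "coupon_code": "string"}

evolved = append_schema(table, incoming, merge_schema=True)

try:
    append_schema(table, incoming, merge_schema=False)
    rejected = False
except ValueError:
    rejected = True
```

The same write succeeds or fails depending only on the flag, and the original table schema object is left unchanged in both cases.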

Question 13

You are building a streaming pipeline in Azure Stream Analytics that reads from an Azure Event Hubs input with 10 partitions. The query performs a GROUP BY on a column that is not the partition key. To ensure consistency, which partitioning scheme should you use?

Accepted Answer

Use 'PartitionBy' with the GROUP BY column. This answer is correct because when a GROUP BY uses a column that is not the partition key, the query must use the PARTITION BY clause so that all rows with the same grouping value are processed by the same Stream Analytics node. This keeps the aggregation consistent and correct, because no group is split across nodes.

Answer

Use 'Passthrough' partitioning

Answer

Increase the number of SUs to handle skew

Answer

Use 'INTO' with a 'PARTITION BY' clause
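
The consistency argument in Question 13 rests on every row with the same grouping value reaching the same node. The sketch below shows that property with a stable hash in plain Python; it is not Stream Analytics code, and the `assign_partition` helper, the CRC32 hash, and the sample keys are assumptions.

```python
import zlib
from collections import Counter, defaultdict


def assign_partition(key, partitions):
    """Map a grouping key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def count_by_key(keys, partitions):
    """Count keys independently inside each partition."""
    per_partition = defaultdict(Counter)
    for key in keys:
        per_partition[assign_partition(key, partitions)][key] += 1
    return per_partition


keys = ["P1", "P2", "P1", "P3", "P2", "P1"]
per_partition = count_by_key(keys, partitions=10)

# Each key lives in exactly one partition, so merging partial results is a plain union.
merged = Counter()
for counts in per_partition.values():
    merged.update(counts)
```

If rows for one key were spread over several partitions instead, each partition would emit a partial count and the results would need a second aggregation step.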

Question 14

Which of the following are valid activities in an Azure Data Factory pipeline? (Choose two.)

Accepted Answer

Copy. This answer is correct because the Copy activity is a fundamental data movement activity in Azure Data Factory (ADF) that ingests data from a wide range of supported source stores (e.g., Azure Blob Storage, SQL Server) into sink stores (e.g., Azure Synapse Analytics, Azure Data Lake Storage). It uses the underlying integration runtime to perform the actual data transfer, supporting both built-in connectors and self-hosted runtimes for on-premises sources. This makes it a core, valid activity for building ETL/ELT pipelines.

Answer

Assign

Answer

Execute Pipeline

Question 15

You are designing a data processing solution that requires exactly-once processing semantics for streaming data. Which two Azure services support exactly-once processing? (Choose two.)

Accepted Answer

Azure Stream Analytics. Azure Stream Analytics supports exactly-once processing by using checkpointing and event sourcing to ensure that each event is processed exactly once, even in the event of failures or restarts. It achieves this through its internal state management and the use of checkpoint offsets in the output sink, guaranteeing no duplicate or missed events.

Answer

Azure Event Hubs

Answer

Azure Data Lake Storage Gen2
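
The combination of checkpointing and replay-safe output described in the accepted answer to Question 15 can be sketched as follows. This is a toy model in plain Python, not the internal mechanism of Azure Stream Analytics; the keyed dictionary sink, the checkpoint interval, and the simulated crash are assumptions.

```python
def run(events, sink, checkpoint, crash_at=None):
    """Replay events from the checkpointed offset into a keyed sink."""
    for offset in range(checkpoint["offset"], len(events)):
        if offset == crash_at:
            raise RuntimeError("simulated crash")
        # Keyed write: a replayed offset overwrites itself instead of duplicating.
        sink[offset] = events[offset]
        if offset % 2 == 1:
            checkpoint["offset"] = offset + 1  # checkpoint after every second event


events = ["a", "b", "c", "d", "e"]
sink, checkpoint = {}, {"offset": 0}

try:
    run(events, sink, checkpoint, crash_at=3)  # crashes after writing offsets 0 to 2
except RuntimeError:
    pass

run(events, sink, checkpoint)  # restart: offset 2 is replayed, then 3 and 4 follow
```

Offset 2 is written twice across the two runs, yet the sink ends with one entry per event, because the write is keyed by offset. Without the keyed write, the replay after the checkpoint would produce a duplicate.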