DP-203 Practice Questions

Question 1

You are designing a data storage solution for IoT sensor data. The data is written thousands of times per second and requires low-latency reads for real-time dashboards. Which Azure storage solution should you use?

Accepted Answer

Azure Cosmos DB. Azure Cosmos DB is the correct choice because it provides single-digit millisecond read and write latency at any scale, with automatic indexing and multi-region distribution. Its support for multiple APIs (SQL, MongoDB, Cassandra, etc.) and configurable consistency levels makes it ideal for IoT sensor data requiring high-throughput writes and low-latency reads for real-time dashboards.

Answer

Azure Blob Storage

Answer

Azure SQL Database

Answer

Azure Data Lake Storage Gen2

Question 2

A data processing job in Azure Synapse Analytics writes results to a table in the dedicated SQL pool. After a failure, the job restarts from the beginning, causing duplicates. Which design pattern should you implement to ensure idempotent writes?

Accepted Answer

Use a staging table and then swap partitions with the target table. This approach is correct because using a staging table with partition swapping ensures idempotent writes by atomically replacing the target partition with a fully loaded staging partition. This avoids duplicates even if the job restarts, as the swap operation is transactional and the staging table can be truncated before each run. In Azure Synapse dedicated SQL pool, partition switching is a metadata-only operation that provides consistency without data movement.

Answer

Use a TRUNCATE statement before each insert.

Answer

Use a MERGE statement with a unique key to upsert data.

Answer

Use CREATE TABLE AS SELECT (CTAS) with a unique constraint.
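
The staging-and-swap pattern from the accepted answer can be illustrated with a small in-memory simulation. This is only a sketch, not Synapse code: the `target` dictionary, the `load_partition` function, and the sample rows are hypothetical stand-ins for a partitioned table, and the dictionary assignment stands in for the metadata-only partition switch.

```python
# In-memory stand-in for a partitioned target table:
# partition key -> list of rows. All names here are hypothetical.
target = {}


def load_partition(partition_key, source_rows):
    """Load one partition idempotently via a staging area and a swap."""
    staging = []                      # start from an empty staging table on every run
    for row in source_rows:
        staging.append(row)           # a failure here leaves the target untouched
    target[partition_key] = staging   # single-step replacement, standing in for the partition switch


rows = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
load_partition("2023-12-25", rows)
load_partition("2023-12-25", rows)    # simulated restart: same input, same end state

print(len(target["2023-12-25"]))
```

Because each run replaces the partition rather than appending to it, rerunning the load with the same input leaves the target in the same state.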

Question 3

A multinational corporation uses Azure Data Lake Storage Gen2 to store petabytes of parquet files partitioned by date and hour. Data scientists report that queries on the last 7 days of data take over 30 minutes, while queries on older data are fast. The storage account uses the default Azure Blob Storage hierarchical namespace. Which action will MOST improve query performance on recent data?

Accepted Answer

Optimize the partition layout by partitioning by date first, then by hour, to reduce the number of partitions scanned for recent data. This option is correct because partitioning by date first, then by hour, ensures that queries filtering on the last 7 days scan only the relevant date partitions, drastically reducing the amount of data read. In Azure Data Lake Storage Gen2, the hierarchical namespace allows partition pruning at the directory level, so a date-first layout minimizes the number of partitions scanned for recent data, directly addressing the performance bottleneck.

Answer

Convert the parquet files to CSV format to reduce metadata overhead

Answer

Enable soft delete on the storage account to reduce read latency

Answer

Apply Z-order clustering on the parquet files using Azure Databricks

Question 4

You are designing a data processing solution in Azure that must handle both batch and streaming data. The solution should use a common storage layer for both and support schema evolution. Which TWO technologies should you recommend?

Accepted Answer

Azure Event Hubs. Azure Event Hubs is correct because it is a fully managed, real-time data ingestion service that can capture streaming data and store it in Azure Data Lake Storage Gen2 or Blob Storage, enabling a unified storage layer for both batch and streaming pipelines. It supports schema evolution through Avro or JSON serialization, allowing downstream consumers to adapt to schema changes without breaking existing processes.

Answer

Azure SQL Database

Answer

Apache Kafka on HDInsight

Answer

Azure Data Lake Storage Gen2
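
The schema-evolution point in the accepted answer can be made concrete with a small reader that tolerates an added field. This is a minimal sketch in plain Python rather than Avro or Event Hubs code: the field names, the `DEFAULTS` table, and the `normalize` function are hypothetical, and the default-filling step only loosely mirrors how a reader schema with default values handles records written before a field existed.

```python
# Hypothetical schema for a sensor reading. The "unit" field was added later.
DEFAULTS = {"device_id": None, "temperature": None, "unit": "celsius"}


def normalize(record):
    """Return the record with every known field present, filling defaults for missing ones."""
    return {field: record.get(field, default) for field, default in DEFAULTS.items()}


old_record = {"device_id": "d1", "temperature": 21.5}                        # written before the change
new_record = {"device_id": "d2", "temperature": 70.1, "unit": "fahrenheit"}  # written after the change

normalized = [normalize(r) for r in (old_record, new_record)]
print(normalized)
```

Both record versions come out with the same set of fields, so a batch job and a streaming job reading the same storage layer can share one downstream schema.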

Question 5

A company ingests streaming data from IoT devices into Azure Event Hubs. The data must be processed in near real-time to detect anomalies and stored in Azure Data Lake Storage Gen2 for historical analysis. The solution must minimize latency and avoid duplicate processing. Which Azure service should be used for processing?

Accepted Answer

Azure Stream Analytics. Azure Stream Analytics is the correct choice because it is purpose-built for near real-time stream processing with sub-second latency, directly integrates with Event Hubs as input and Data Lake Storage Gen2 as output, and provides built-in exactly-once delivery semantics to avoid duplicate processing. It also supports temporal windowing and anomaly detection functions natively, making it ideal for this IoT anomaly detection scenario.

Answer

Azure Data Factory

Answer

Azure Databricks with Structured Streaming

Answer

Azure Functions with Event Hubs trigger
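
The temporal windowing and duplicate avoidance mentioned in the accepted answer can be sketched in plain Python. This is not Stream Analytics query code and does not reproduce its anomaly detection functions: the window size, the threshold, the event tuples, and the `detect_anomalies` function are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 10      # hypothetical tumbling-window size
THRESHOLD = 50.0         # hypothetical anomaly threshold on the window average


def detect_anomalies(events):
    """Group (event_id, timestamp, value) tuples into tumbling windows and flag high averages."""
    seen_ids = set()
    windows = defaultdict(list)
    for event_id, timestamp, value in events:
        if event_id in seen_ids:      # skip redelivered events so they are not counted twice
            continue
        seen_ids.add(event_id)
        window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start].append(value)
    return [
        start
        for start, values in sorted(windows.items())
        if sum(values) / len(values) > THRESHOLD
    ]


events = [
    ("e1", 1, 20.0),
    ("e2", 4, 22.0),
    ("e2", 4, 22.0),    # duplicate delivery of e2
    ("e3", 12, 90.0),
    ("e4", 15, 80.0),
]
anomalous_windows = detect_anomalies(events)
print(anomalous_windows)
```

Each event lands in exactly one non-overlapping window, and the duplicate of `e2` is ignored, so only the window starting at second 10 is flagged.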

Question 6

Which TWO actions are appropriate when designing a data processing solution that must meet strict SLAs for latency and throughput?

Accepted Answer

Partition data by date and hour to improve query performance. Partitioning data by date and hour is appropriate because it enables partition elimination, where queries scan only the relevant partitions rather than the entire dataset. This directly reduces latency and improves throughput by minimizing the I/O and compute resources needed for time-range queries, which is critical for meeting strict SLAs in data processing solutions.

Answer

Process all data synchronously to ensure consistency

Answer

Use a single large cluster for all workloads to simplify management

Answer

Use a single node for orchestration to reduce complexity

Question 7

Which THREE factors should be considered when choosing between Azure Stream Analytics and Azure Databricks for a real-time data processing solution?

Accepted Answer

Need for complex transformations and machine learning model integration. This factor matters because Azure Databricks provides native support for complex transformations (e.g., multi-step ETL and custom Python or Scala logic) and seamless integration with machine learning libraries (e.g., MLflow, Spark MLlib), which go beyond what Azure Stream Analytics offers natively. Stream Analytics uses a SQL-like query language and is optimized for simpler, declarative transformations, making Databricks the better choice when advanced analytics or ML model scoring is required in real-time pipelines.

Answer

Integration with Power BI for real-time dashboards

Answer

Maximum allowed latency for late-arriving data

Question 8

You are designing a data lake on Azure Data Lake Storage Gen2. The data will be used by both batch processing (Spark) and interactive querying (Azure Synapse Serverless SQL). The data is partitioned by date and stored as Parquet. What is the optimal folder structure to minimize cross-partition scans for both workloads?

Accepted Answer

/year/month/day/ (e.g., /2023/12/25/). The /year/month/day/ layout is optimal because it gives each day its own directory, which both Spark and Azure Synapse Serverless SQL can use for partition pruning. Spark can read only the relevant directories, and Synapse Serverless SQL can filter on file path metadata, minimizing cross-partition scans and reducing data read overhead. Note that strict Hive-style partitioning names directories as key=value pairs (e.g., /year=2023/month=12/day=25/), which is the form Spark's automatic partition discovery expects.

Answer

All files in a single folder

Answer

/yyyy-mm-dd/ (e.g., /2023-12-25/)

Answer

Files named by date (e.g., data_20231225.parquet)
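
To show how a /year/month/day/ layout limits what a job has to read, the following sketch lists the folders that cover a date range. It is a plain-Python illustration only: the `day_folders` function and the `/data` root are hypothetical, and a real Spark or Synapse Serverless SQL query would pass such paths (or wildcards over them) to its own reader.

```python
from datetime import date, timedelta


def day_folders(start, end, root="/data"):
    """Return one /year/month/day/ folder per day in the inclusive range [start, end]."""
    folders = []
    current = start
    while current <= end:
        folders.append(f"{root}/{current:%Y/%m/%d}/")
        current += timedelta(days=1)
    return folders


folders = day_folders(date(2023, 12, 30), date(2024, 1, 2))
for folder in folders:
    print(folder)
```

A four-day query touches four folders regardless of how many years of data sit beside them, which is the effect partition pruning aims for.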

Question 9

A company uses Azure Data Factory to copy sensitive data from on-premises SQL Server to Azure Blob Storage. They must ensure that data is encrypted in transit and at rest. Which combination of features should they use?

Accepted Answer

Configure the copy activity to use TLS and enable Azure Storage Service Encryption. This combination is correct because Azure Data Factory's copy activity uses TLS (Transport Layer Security) to encrypt data in transit between the on-premises SQL Server and Azure Blob Storage, and Azure Storage Service Encryption (SSE) automatically encrypts data at rest using 256-bit AES encryption. It satisfies both encryption requirements without additional complexity.

Answer

Use Always Encrypted in SQL Server and customer-managed keys in Blob Storage.

Answer

Set up a VPN between on-premises and Azure, and use Azure Disk Encryption.

Answer

Use HTTPS for the copy activity and enable Azure Storage Service Encryption.
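
As a generic illustration of enforcing encryption in transit on the client side, the sketch below builds a TLS context that rejects protocol versions older than TLS 1.2. It uses only Python's standard `ssl` module and is not an Azure Data Factory setting; how a copy activity negotiates TLS is handled by the service and the integration runtime.

```python
import ssl

# Build a client-side TLS context with certificate verification enabled (the default).
context = ssl.create_default_context()

# Refuse anything older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.minimum_version.name)
```

Encryption at rest is a separate concern handled by the storage service, which is why the accepted answer pairs TLS with Azure Storage Service Encryption.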

Question 10

You are a data engineer at a healthcare analytics company. The company uses Azure Data Factory (ADF) to orchestrate data pipelines that ingest patient data from on-premises SQL Server databases into Azure Synapse Analytics. Recently, the pipeline has been failing intermittently with the following error: 'Failure happened on 'Sink' side. ErrorCode=SqlFailedToConnect, Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Cannot connect to SQL Server Database. The TCP connection to the host <server_name>, port 1433 has failed. Error: 'Connection timed out.'.' The on-premises SQL Server is behind a corporate firewall. The ADF self-hosted integration runtime (SHIR) is installed on a VM inside the corporate network. You have verified that the SHIR is running and that the SQL Server is accessible from the SHIR VM using SQL Server Management Studio (SSMS). The error occurs sporadically, not consistently. What is the most likely cause of the intermittent connection timeout?

Accepted Answer

The corporate firewall or network device is closing idle TCP connections to the SQL Server database. The intermittent nature of the timeout, combined with the fact that the SHIR VM can connect to SQL Server via SSMS, strongly suggests that the corporate firewall or a network intermediary (such as a load balancer or NAT device) is closing idle TCP connections. ADF pipelines may hold connections open between activities or during long-running data transfers, and if no keep-alive packets are sent within the firewall's idle timeout window (commonly 4–30 minutes), the firewall drops the TCP session. When ADF attempts to reuse that connection, it receives a 'Connection timed out' error because the socket is no longer valid.

Answer

The data being transferred is skewed, causing the sink to be overwhelmed.

Answer

The SQL Server database is experiencing high CPU utilization during the pipeline execution window.

Answer

The self-hosted integration runtime is running out of memory during peak loads.
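
The keep-alive mechanism referred to in the accepted answer can be shown at the socket level. This is a general TCP sketch using Python's standard `socket` module, not a self-hosted integration runtime or SQL Server setting; the 60-second idle value is a hypothetical choice, and the `TCP_KEEPIDLE` option is only applied where the platform exposes it.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask the OS to send keep-alive probes on this connection when it is idle.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Some platforms expose the idle time before the first probe; others may not.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)  # hypothetical 60-second value

keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print(keepalive_enabled)
sock.close()
```

Probes sent more often than the firewall's idle timeout keep the session entry alive, so the connection is not silently dropped between uses.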

Question 11

Which TWO of the following are valid methods to secure data at rest in Azure Data Lake Storage Gen2?

Accepted Answer

Use customer-managed keys in Azure Key Vault. This method is correct because using customer-managed keys (CMK) in Azure Key Vault allows you to control and rotate the encryption keys used for Azure Storage Service Encryption (SSE), providing an additional layer of security for data at rest. This is a valid method to secure data at rest in Azure Data Lake Storage Gen2, as it ensures that only authorized parties with access to the key vault can decrypt the data.

Answer

Assign RBAC roles for data access

Answer

Configure storage firewall rules

Answer

Enable TLS 1.2 for all connections

Question 12

Which THREE of the following are required to implement column-level security in Azure Synapse Analytics dedicated SQL pool?

Accepted Answer

A GRANT statement on specific columns to users or roles. This is correct because column-level security in Azure Synapse Analytics dedicated SQL pool is implemented using GRANT statements on specific columns. By granting SELECT on only certain columns to a user or role, you restrict access to sensitive data at the column level without needing to create views or modify schemas. This is the native mechanism provided by SQL Server and Azure Synapse for column-level security.

Answer

A VIEW that selects only the allowed columns

Answer

A DENY statement on specific columns to users or roles

Answer

A row-level security policy must be in place
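
To make the column-level GRANT concrete, the helper below assembles such a statement as a string. It is a sketch under stated assumptions: the `column_grant` function, the `dbo.Patients` table, its columns, and `AnalystRole` are hypothetical, and the helper does no identifier quoting or validation, so it is not suitable for untrusted input.

```python
def column_grant(table, columns, principal):
    """Build a column-level GRANT SELECT statement as a string."""
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(columns)
    return f"GRANT SELECT ON {table} ({column_list}) TO {principal};"


# Hypothetical table, columns, and role, used only for illustration.
statement = column_grant("dbo.Patients", ["PatientId", "VisitDate"], "AnalystRole")
print(statement)
```

A principal granted SELECT on only these columns cannot select the remaining columns of the table.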

Question 13

A company uses Azure Synapse Analytics with dedicated SQL pools. They notice that query performance degrades significantly during peak hours. They have already scaled up the Data Warehouse Units (DWU) to the maximum. Which action should they take next to improve performance?

Accepted Answer

Enable result-set caching. When a dedicated SQL pool is already at maximum DWU, further scaling is not possible. Enabling result-set caching stores query results in the user database of the SQL pool, allowing repeated queries to be served directly from cache without re-scanning data or re-computing aggregations. This reduces I/O and CPU pressure during peak hours, improving performance for recurring queries without requiring additional compute resources.

Answer

Rebuild all clustered columnstore indexes.

Answer

Increase the number of concurrency slots.

Answer

Move the data to Azure Data Lake Storage Gen2.
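
The effect of result-set caching on repeated queries can be modeled with a toy class. This is a conceptual sketch, not how the dedicated SQL pool is implemented: the `CachedWarehouse` class, its methods, and the sample rows are hypothetical, and clearing the cache on insert is a simplified stand-in for invalidation when underlying data changes.

```python
class CachedWarehouse:
    """Toy model of a query engine that reuses results for repeated queries."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.cache = {}
        self.scans = 0          # counts how many times the data was actually scanned

    def total_by_region(self, region):
        key = ("total_by_region", region)
        if key not in self.cache:
            self.scans += 1
            self.cache[key] = sum(r["amount"] for r in self.rows if r["region"] == region)
        return self.cache[key]

    def insert(self, row):
        self.rows.append(row)
        self.cache.clear()      # changed data invalidates cached results


warehouse = CachedWarehouse([
    {"region": "EU", "amount": 100},
    {"region": "EU", "amount": 50},
    {"region": "US", "amount": 70},
])
first = warehouse.total_by_region("EU")
second = warehouse.total_by_region("EU")   # served from the cache, no new scan
print(first, second, warehouse.scans)
```

The second call returns the stored result without touching the rows, which is why caching helps most when many users run the same dashboard queries during peak hours.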

Question 14

You need to configure encryption for an Azure SQL Database to protect data at rest. Which Azure service or feature should you enable?

Accepted Answer

Transparent Data Encryption (TDE). Transparent Data Encryption (TDE) is the correct choice because it performs real-time I/O encryption and decryption of the data and log files at rest, protecting against unauthorized access to the physical storage media. TDE uses an AES-256 encryption algorithm and is fully transparent to the application, requiring no changes to the database schema or queries.

Answer

Dynamic Data Masking

Answer

Always Encrypted

Answer

Azure Information Protection

Question 15

Which THREE factors should you consider when choosing between rowstore and columnstore indexes in Azure Synapse Analytics?

Accepted Answer

The table size is expected to be over 1 TB. This factor is relevant because columnstore indexes in Azure Synapse Analytics are optimized for large-scale data warehousing workloads, where table sizes exceeding 1 TB benefit from high compression and columnar storage, significantly improving scan and aggregation performance. Rowstore indexes, in contrast, are less efficient for such large datasets due to higher I/O and storage overhead.

Answer

The table contains many NULL values in indexed columns.

Answer

The table will be partitioned frequently.
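
The compression benefit of columnar storage mentioned in the accepted answer can be illustrated with run-length encoding over a single column. This is a simplified sketch, not a description of the exact encoding Azure Synapse uses: the sample rows and the `run_length_encode` function are hypothetical.

```python
from itertools import groupby

# Hypothetical rows of (order_id, country). Stored row by row, the values of one column are interleaved.
rows = [(1, "US"), (2, "US"), (3, "US"), (4, "DE"), (5, "DE"), (6, "US")]

# Column-oriented layout: all values of one column stored together.
country_column = [country for _, country in rows]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded = run_length_encode(country_column)
print(encoded)
```

Six stored values shrink to three pairs here; on large tables with repetitive columns the same idea reduces both storage and the I/O needed for scans and aggregations.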

Microsoft Azure Data Engineer Associate DP-203 practice questions

You are designing a data pipeline that ingests JSON files from Azure Blob Storage into Azure Synapse Analytics using PolyBase. The files contain nested JSON arrays. What should you do to ensure that the data is loaded correctly?

Refer to the exhibit. A custom RBAC role is defined as shown. A user is assigned this role at the resource group scope. Which operation can the user perform?

Exhibit

Which THREE components are part of a defense-in-depth strategy for data security in Azure?

A company uses Azure Synapse Analytics dedicated SQL pool for a data warehouse. They notice that some queries are using more memory than expected, causing resource contention. Which TWO actions should they take to diagnose and optimize memory usage?

A company is using Azure Data Factory to copy data from an on-premises SQL Server to Azure Blob Storage. The data must be encrypted in transit using TLS 1.2. The on-premises SQL Server is configured to support TLS 1.2. Which Data Factory property should be configured?

You are designing a data solution in Azure that requires all data in transit between Azure Databricks and Azure Storage to be encrypted using a customer-managed key. Which configuration meets this requirement?
