DP-203 Exam Questions and Answers

Your organization uses Azure Synapse Analytics dedicated SQL pool. You need to ensure that all data at rest in the SQL pool is encrypted using a customer-managed key stored in Azure Key Vault. What should you configure?

Implement Always Encrypted with column encryption keys stored in Azure Key Vault.

Configure Dynamic Data Masking to obfuscate sensitive data.

Enable Azure Storage Service Encryption with a customer-managed key.

Enable Transparent Data Encryption (TDE) with a customer-managed key in Azure Key Vault.

TDE with customer-managed key provides encryption at rest for the entire database, meeting the requirement.

Why: Option C is correct because Transparent Data Encryption (TDE) with customer-managed keys in Azure Key Vault provides the required encryption. Option A is wrong because Azure Storage Service Encryption is for storage accounts, not SQL pools. Option B is wrong because Always Encrypted protects data in transit and at rest in application logic, not at rest in the database. Option D is wrong because Dynamic Data Masking does not encrypt data.

You are designing a data processing solution in Azure Databricks that uses Unity Catalog. The security team requires that all users authenticate using Microsoft Entra ID and that access to tables is governed by attribute-based access control (ABAC) using table tags. Which feature should you enable?

Column-level security masks.

Dynamic views with user context functions.

Row-level security filters.

Table tags with access control lists (ACLs) in Unity Catalog.

Unity Catalog supports ABAC using tags and ACLs, and integrates with Microsoft Entra ID for authentication.

Why: Option B is correct because Unity Catalog supports ABAC through tags and Azure Databricks can use Microsoft Entra ID for authentication. Option A is wrong because row-level security is for filtering rows, not ABAC with tags. Option C is wrong because column-level security is for columns. Option D is wrong because dynamic views are for custom logic, not attribute-based access control with tags.

Your team uses Azure Data Factory to copy data from on-premises SQL Server to Azure Blob Storage. You need to ensure that data in transit is encrypted using TLS 1.2. What should you configure?

Ensure the Azure Data Factory pipeline uses HTTPS for the copy activity.

HTTPS uses TLS to encrypt data in transit, and Azure Data Factory supports this by default.

Set up a site-to-site VPN between the on-premises network and Azure.

Configure the on-premises SQL Server to use SSL certificates.

Use Azure ExpressRoute with private peering.

Why: Option A is correct because Azure Data Factory uses HTTPS by default, which enforces TLS encryption for data in transit. Option B is wrong because VPN is not required for TLS encryption. Option C is wrong because Azure ExpressRoute provides a private connection but does not replace TLS. Option D is wrong because configuring the on-premises SQL Server to use SSL certificates is necessary but not sufficient; Azure Data Factory must also use HTTPS.

You have an Azure Databricks workspace that uses a managed resource group. The security team requires that all cluster nodes use no public IP addresses and that all outbound traffic goes through a firewall. What should you configure?

Configure service endpoints for Azure Storage and Azure Data Lake Storage.

Deploy the workspace in a VNet with forced tunneling enabled and a firewall.

VNet injection with forced tunneling ensures cluster nodes have no public IPs and all outbound traffic goes through the firewall.

Apply network security groups (NSGs) to the subnet that restrict outbound traffic.

Enable Azure Private Link for the Databricks workspace.

Why: Option D is correct because Azure Databricks can be deployed in a VNet injected configuration with no public IPs and route all traffic through a firewall. Option A is wrong because private endpoints are for data plane services, not for cluster nodes. Option B is wrong because service endpoints do not eliminate public IPs. Option C is wrong because network security groups control inbound/outbound rules but do not force traffic through a firewall.

Refer to the exhibit. You are creating an Azure Storage account using an ARM template with the above snippet. After deployment, a security auditor reviews the configuration and notes that the storage account is not using a customer-managed key for encryption. What is the most likely reason?

The 'keyVersion' is missing a specific version, so Azure Storage defaults to Microsoft-managed key.

For customer-managed key, a specific key version is required; an empty version may cause Azure to use the latest but if the key is not accessible, it falls back to Microsoft-managed key.

The 'keySource' should be 'Microsoft.Storage' for customer-managed key.

The storage account requires double encryption to use customer-managed key.

The 'infrastructureEncryption' setting is enabled, which overrides customer-managed key.

Why: Option B is correct because the keyVersion is empty, which means Azure Storage will use the latest version of the key, but the key must be present in the vault. However, the snippet does not specify the 'keyVaultProperties' correctly; the property 'keyName' is valid, but the 'keyVersion' being empty might cause Azure to default to a Microsoft-managed key if the key does not exist or if the vault is not accessible. Option A is wrong because infrastructure encryption is independent of key source. Option C is wrong because 'infrastructureEncryption' is a separate setting. Option D is wrong because double encryption is not the issue.

You are monitoring an Azure Synapse Analytics dedicated SQL pool and notice that some queries are taking longer than expected. You need to identify queries that are experiencing significant memory pressure. Which dynamic management view (DMV) should you query?

sys.dm_pdw_exec_requests

This DMV includes memory_grant and memory_used columns to assess memory pressure.

sys.dm_pdw_wait_stats

sys.dm_pdw_query_stats_xe

sys.dm_pdw_nodes_os_performance_counters

Why: Option C is correct because sys.dm_pdw_exec_requests shows memory grants for queries. Option A is wrong because sys.dm_pdw_nodes_os_performance_counters shows OS-level counters. Option B is wrong because sys.dm_pdw_wait_stats shows wait statistics. Option D is wrong because sys.dm_pdw_query_stats_xe shows extended events.

Want more Secure, monitor, and optimize data storage and data processing practice?

All Design and develop data processing questions

Domain 2: Design and develop data processing

A company uses Azure Synapse Analytics to process large datasets. They need to transform JSON data stored in Azure Data Lake Storage Gen2 into a star schema. Which data processing approach minimizes data movement and leverages the compute closest to the data?

Use Azure Data Factory to copy the JSON data into Azure SQL Database, then use T-SQL to transform.

Use Azure Data Factory with SSIS to transform and load into dedicated SQL pool.

Load data into a Spark DataFrame in Synapse notebooks, transform, and write back.

Create external tables on the JSON files using PolyBase, then use CREATE EXTERNAL TABLE AS SELECT (CETAS) to write transformed Parquet files.

Minimizes movement by querying in place.

Why: Option D is correct because it uses PolyBase external tables and CETAS to transform JSON data directly in Azure Data Lake Storage Gen2, minimizing data movement by leveraging the compute power of the dedicated SQL pool or serverless SQL pool closest to the data. This approach reads JSON in place, transforms it into Parquet format, and writes the star schema tables back to the data lake without copying data to an intermediate store.

You are designing a batch processing pipeline that reads CSV files from Azure Blob Storage, performs aggregations using Azure Databricks, and writes results to Azure Synapse Analytics. The pipeline must handle schema drift (new columns appearing in source files). Which approach should you recommend?

Use Azure Data Factory mapping data flows with schema drift enabled, mapping to a fixed sink schema.

Define a fixed schema in the source and ignore any new columns.

Use Spark with mergeSchema option when reading, and write using a Delta table to evolve schema automatically.

Handles schema drift automatically.

Use Azure Stream Analytics to pre-process and enforce schema.

Why: Option C is correct because Spark's `mergeSchema` option, when used with Delta Lake, automatically evolves the schema to accommodate new columns in CSV files. This allows the batch pipeline to handle schema drift without manual intervention, and writing to a Delta table ensures the schema evolution is persisted and compatible with downstream writes to Azure Synapse Analytics.

A company is running a Spark job on Azure Databricks that processes 500 GB of data daily. The job frequently fails with 'OutOfMemoryError' during shuffles. The cluster uses 10 workers of type Standard_DS3_v2 (14 GB memory each). Which configuration change should you make to improve stability without over-provisioning?

Set spark.sql.shuffle.partitions to a higher value, e.g., 500.

Reduces data per partition, easing memory.

Increase the driver memory to 28 GB.

Increase the number of workers to 20.

Reduce spark.sql.shuffle.partitions to 100.

Why: The 'OutOfMemoryError' during shuffles indicates that individual partitions are too large for the executor memory. Increasing `spark.sql.shuffle.partitions` to 500 reduces the amount of data per partition, lowering memory pressure during shuffle operations. This directly addresses the error without adding more hardware.

You need to design a near-real-time data processing solution that ingests IoT telemetry data from millions of devices. The data must be aggregated per minute and stored in Azure Cosmos DB for low-latency queries. Which Azure service combination should you use?

Azure Event Hubs -> Azure HDInsight (Kafka) -> Azure Cosmos DB

Azure Event Hubs -> Azure Stream Analytics -> Azure Cosmos DB

Stream Analytics provides near-real-time aggregation.

Azure IoT Hub -> Azure Databricks (Structured Streaming) -> Azure Cosmos DB

Azure Event Hubs -> Azure Data Factory -> Azure Cosmos DB

Why: Option B is correct because Azure Stream Analytics provides native, low-latency windowed aggregation (e.g., TumblingWindow for per-minute aggregates) directly on data ingested from Event Hubs, and it has a built-in output sink to Azure Cosmos DB. This combination meets the near-real-time requirement without needing an intermediate compute or storage layer, minimizing end-to-end latency.

A data processing job in Azure Synapse Analytics writes results to a table in the dedicated SQL pool. After a failure, the job restarts from the beginning, causing duplicates. Which design pattern should you implement to ensure idempotent writes?

Use a TRUNCATE statement before each insert.

Use a MERGE statement with a unique key to upsert data.

Use a staging table and then swap partitions with the target table.

Atomic swap ensures idempotency.

Use CREATE TABLE AS SELECT (CTAS) with a unique constraint.

Why: Option C is correct because using a staging table with partition swapping ensures idempotent writes by atomically replacing the target partition with a fully loaded staging partition. This avoids duplicates even if the job restarts, as the swap operation is transactional and the staging table can be truncated before each run. In Azure Synapse dedicated SQL pool, partition switching is a metadata-only operation that provides consistency without data movement.

You are designing a data processing solution in Azure that must handle both batch and streaming data. The solution should use a common storage layer for both and support schema evolution. Which TWO technologies should you recommend?

Azure Event Hubs

Common ingestion for batch and streaming.

Azure SQL Database

Apache Kafka on HDInsight

Delta Lake (on Azure Databricks)

Supports batch and streaming, schema evolution.

Azure Data Lake Storage Gen2

Why: Azure Event Hubs is correct because it is a fully managed, real-time data ingestion service that can capture streaming data and store it in Azure Data Lake Storage Gen2 or Blob Storage, enabling a unified storage layer for both batch and streaming pipelines. It supports schema evolution through Avro or JSON serialization, allowing downstream consumers to adapt to schema changes without breaking existing processes.

Want more Design and develop data processing practice?

All Design and implement data security questions

Domain 3: Design and implement data security

A company uses Azure Synapse Analytics dedicated SQL pool. They need to ensure that only users with a specific Azure AD group can query a particular schema. Which approach should they use?

Configure a server-level firewall rule to block other users.

Use the GRANT statement to grant SELECT on the schema to the Azure AD group.

GRANT schema permission controls access at schema level.

Create a row-level security policy on all tables in the schema.

Apply dynamic data masking to the schema.

Why: The GRANT statement in Azure Synapse dedicated SQL pool allows you to assign permissions directly to Azure AD groups. By granting SELECT on the schema to the specific Azure AD group, only members of that group can query objects within that schema, meeting the requirement precisely.

You are designing data security for Azure Data Lake Storage Gen2. The requirement is to prevent data from being accessed by anyone outside the corporate network. Which feature should you enable?

Use Azure Private Endpoint or service endpoint with a VNet.

Private endpoint ensures data is accessed only from within the VNet.

Assign RBAC roles to deny access to all except corporate users.

Configure IP firewall rules to allow only corporate IP ranges.

Enable encryption at rest using customer-managed keys.

Why: Azure Private Endpoint or service endpoint with a VNet ensures that all traffic to the storage account stays within the corporate network and never traverses the public internet. Private Endpoint assigns a private IP from the VNet to the storage account, effectively isolating it from public access. This meets the requirement to prevent access from outside the corporate network by enforcing network-level isolation.

A company uses Azure Data Factory to copy sensitive data from on-premises SQL Server to Azure Blob Storage. They must ensure that data is encrypted in transit and at rest. Which combination of features should they use?

Use Always Encrypted in SQL Server and customer-managed keys in Blob Storage.

Set up a VPN between on-premises and Azure, and use Azure Disk Encryption.

Configure the copy activity to use TLS and enable Azure Storage Service Encryption.

TLS encrypts data in transit; Storage Service Encryption encrypts at rest automatically.

Use HTTPS for the copy activity and enable Azure Storage Service Encryption.

Why: Option C is correct because Azure Data Factory's copy activity uses TLS (Transport Layer Security) to encrypt data in transit between the on-premises SQL Server and Azure Blob Storage, and Azure Storage Service Encryption (SSE) automatically encrypts data at rest using 256-bit AES encryption. This combination satisfies both encryption requirements without additional complexity.

You need to audit all data access to an Azure Storage account. Which Azure service should you enable?

Azure Storage analytics logs and send to Log Analytics workspace

Storage logs capture access details; Log Analytics enables querying.

Azure Policy to audit storage account access

Azure Monitor metrics

Azure Security Center

Why: Azure Storage analytics logs capture detailed information about successful and failed requests to a storage account, including authentication details, IP addresses, and operation types. By sending these logs to a Log Analytics workspace, you can query and analyze them using KQL, enabling comprehensive auditing of all data access. This is the correct service for auditing because it provides the granular, queryable logs required for security and compliance audits.

A company uses Azure Key Vault to store secrets for data pipelines. They need to rotate the secrets automatically every 90 days. What should they implement?

Use Azure Policy to enforce secret expiration.

Assign RBAC roles to a service principal to update the secret.

Create a manual process to update the secret in Key Vault.

Configure Key Vault secret rotation with an expiration date of 90 days.

Key Vault can automatically rotate secrets based on expiration.

Why: Option C is correct because Key Vault supports automatic rotation with a specified expiration period. Option A is wrong because manual rotation is not automatic. Option B is wrong because Azure Policy does not rotate secrets. Option D is wrong because RBAC controls access, not rotation.

Which TWO of the following are valid methods to secure data at rest in Azure Data Lake Storage Gen2?

Assign RBAC roles for data access

Configure storage firewall rules

Use customer-managed keys in Azure Key Vault

Customer-managed keys provide additional control over encryption.

Use Azure Storage Service Encryption (SSE)

SSE encrypts data at rest automatically.

Enable TLS 1.2 for all connections

Why: Option C is correct because using customer-managed keys (CMK) in Azure Key Vault allows you to control and rotate the encryption keys used for Azure Storage Service Encryption (SSE), providing an additional layer of security for data at rest. This is a valid method to secure data at rest in Azure Data Lake Storage Gen2, as it ensures that only authorized parties with access to the key vault can decrypt the data.

Want more Design and implement data security practice?

All Monitor and optimize data storage and processing questions

Domain 4: Monitor and optimize data storage and processing

A company runs a mission-critical Azure Data Factory pipeline that ingests data every hour from Azure Blob Storage into Azure Synapse Dedicated SQL Pool. Recently, the pipeline has been failing with timeout errors during the copy activity. The source blob files are around 500 MB each. Which configuration change would MOST effectively reduce the likelihood of timeout errors?

Decrease the 'Batch size' for the copy activity.

Change the sink to use PolyBase with staging enabled.

Increase the Data Integration Unit (DIU) to 8.

Enable 'Enable staging' and set 'Degree of copy parallelism' to a higher value.

Increases parallelism, reducing copy time and timeout likelihood.

Why: Option D is correct because enabling staging allows the copy activity to use Azure Blob Storage as an intermediate staging area, which breaks the 500 MB files into manageable chunks and uses parallel staging writes to the Dedicated SQL Pool. This reduces the load on the single copy session and prevents timeout errors by leveraging the staging engine's retry and parallelization capabilities.

You are designing a data processing solution using Azure Databricks with Delta Lake. The data is partitioned by date and ingested daily. You notice that the Delta table has many small files, causing slow read performance. Which strategy should you recommend to optimize the table for faster queries?

Run OPTIMIZE on the table to compact small files.

OPTIMIZE merges small files into larger ones.

Run ZORDER BY on the date column.

Run VACUUM to delete old files.

Increase the number of partitions by adding a new partition column.

Why: Option A is correct because running OPTIMIZE on a Delta Lake table compacts many small files into larger ones, reducing the number of files that need to be read during queries. This directly addresses the slow read performance caused by the small file problem, which is common in daily partitioned ingestion. OPTIMIZE uses bin-packing to merge files up to a target size (default 256 MB), improving scan efficiency without changing the data.

A data engineer monitors an Azure Stream Analytics job that processes real-time data. The job is falling behind, and the SU utilization is at 100%. Which action should be taken to improve performance?

Increase the number of Streaming Units (SU).

More SU provides more processing power.

Reduce the number of Streaming Units.

Change the query compatibility level to 1.0.

Deploy a second Stream Analytics job and split the input.

Why: When SU utilization reaches 100%, the job is fully saturated and cannot process incoming data fast enough. Increasing the number of Streaming Units (SU) allocates more compute resources (CPU and memory) to the job, allowing it to handle higher throughput and reduce backlog. This is the direct and recommended action for resolving performance bottlenecks caused by insufficient SU capacity.

You have an Azure Data Lake Storage Gen2 account that stores large volumes of parquet files. A reporting application frequently queries a specific subset of data filtered by a 'region' column. To minimize query latency and cost, which optimization should you implement?

Partition the data by region in the folder structure.

Partition elimination reduces data scanned.

Create a clustered index on the region column.

Compress the parquet files using gzip.

Enable hierarchical namespace on the storage account.

Why: Partitioning the data by region in the folder structure (e.g., /region=NorthAmerica/...) enables Azure Data Lake Storage Gen2 and query engines like Azure Synapse or PolyBase to perform partition pruning. This skips scanning irrelevant files entirely, reducing I/O and query latency while lowering cost by minimizing data processed.

A company uses Azure Data Lake Storage Gen2 with Azure Databricks. They notice that the job to write data into Delta Lake tables takes too long. The data is coming from a streaming source with a high velocity of small writes. Which approach should be taken to optimize write performance?

Configure the streaming to write in micro-batches with a higher trigger interval.

Batching reduces the number of small file writes.

Increase the cluster size to 16 nodes.

Enable 'auto optimize' and 'optimized writes' on the Delta table.

Change the output format from Delta to Parquet.

Why: Option A is correct because increasing the trigger interval for micro-batches reduces the frequency of writes, allowing more data to accumulate per batch. This minimizes the overhead of small file commits and metadata operations in Delta Lake, which is the primary bottleneck for high-velocity streaming writes. By batching more records together, the job writes fewer, larger files, improving overall throughput.

Which TWO actions should you take to reduce costs associated with an Azure Synapse Dedicated SQL Pool that is used for reporting during business hours only?

Pause the pool during non-business hours.

Stops compute billing when not in use.

Enable advanced data compression on all tables.

Scale down the pool during business hours.

Change the distribution of large tables to ROUND_ROBIN.

Implement result set caching for frequently run queries.

Reduces compute usage for repeated queries.

Why: Option A is correct because pausing a Dedicated SQL Pool stops billing for compute resources (DWU) while retaining storage costs. Since the pool is only needed for reporting during business hours, pausing it during non-business hours directly eliminates compute charges for that period, which is the most significant cost driver.

Want more Monitor and optimize data storage and processing practice?

All Design and implement data storage questions

Domain 5: Design and implement data storage

A company is designing a data lake solution on Azure Data Lake Storage Gen2. Data will be ingested from IoT devices at high frequency (every 5 seconds). Each device sends a JSON payload of 2 KB. The data must be stored in a hierarchical namespace and partitioned by date and device ID to optimize query performance. Which partition strategy should be used?

Use Azure SQL Database with clustered columnstore index on date and device ID.

Organize folders as /YYYY/MM/DD/DeviceID/ in ADLS Gen2 and use file naming that includes timestamp.

This folder structure enables efficient partition pruning based on date and device ID.

Use Azure Table Storage with PartitionKey set to date and RowKey set to device ID.

Use Azure Cosmos DB with partition key on (date, device ID) and TTL for data retention.

Why: Option B is correct because ADLS Gen2 with a hierarchical namespace allows folder-based partitioning by date and device ID (e.g., /YYYY/MM/DD/DeviceID/), which directly maps to the query optimization requirement. This structure enables efficient partition pruning for time-range and device-specific queries, and the high-frequency 2 KB JSON payloads are well-suited for append-friendly file naming with timestamps.

You are designing a near-real-time analytics pipeline for a retail company. Transaction data is generated in Azure SQL Database and must be replicated to Azure Synapse Analytics (dedicated SQL pool) with less than 5 minutes latency. The source table has 50 million rows and 200 columns, but only 30 columns are needed for analytics. Which approach should you recommend?

Use Azure SQL Database Change Tracking and push changes to Azure Event Hubs, then use Azure Stream Analytics to write to Synapse.

Enable Change Data Capture (CDC) on the source table and use Azure Data Factory with a 1-minute tumbling window to copy changes into Synapse.

CDC captures only changed rows, and ADF can run frequently to meet latency target.

Use Azure Synapse PolyBase to directly query the source SQL database every 5 minutes.

Schedule a full copy of the entire table every 5 minutes using Azure Data Factory.

Why: Option B is correct because Azure Data Factory (ADF) with Change Data Capture (CDC) on the source SQL database can incrementally copy only changed rows (inserts, updates, deletes) into Azure Synapse Analytics using a 1-minute tumbling window, meeting the sub-5-minute latency requirement while minimizing data volume. This approach efficiently handles 50 million rows by transferring only the 30 needed columns, avoiding full table scans and reducing network load.

A data engineer needs to store semi-structured JSON log files from a web application. Each log entry is about 1 KB. The logs are rarely queried (once a month) and must be retained for 7 years for compliance. The solution must minimize storage cost. Which storage option should be used?

Store the logs in Azure SQL Database as a table.

Store the logs in Azure Files share.

Store the logs in Azure Blob Storage with cool access tier.

Blob Storage cool tier is low-cost for infrequent access, suitable for logs.

Store the logs in Azure Cosmos DB with a JSON container.

Why: Azure Blob Storage with the cool access tier is the correct choice because it is optimized for storing large amounts of semi-structured data (like JSON logs) at low cost, with infrequent access (once a month) and long retention (7 years). The cool tier offers lower storage costs than hot or premium tiers, while still providing high durability and the ability to query logs using tools like Azure Data Lake Storage or serverless SQL. This meets the compliance requirement without the high compute or transaction costs of a database solution.

You are designing a solution to store streaming data from multiple sources into Azure Data Lake Storage Gen2. The data must be organized by ingestion time and source system. Each source system produces data in a different format: CSV, JSON, and Parquet. The solution must allow efficient querying using Azure Synapse Serverless SQL and must support partitioning on ingestion date. What is the recommended folder structure?

/data/{date}/{source_system}/ (e.g., /data/2023-01-01/SourceA/)

/data/{source_system}/{date}/ (e.g., /data/SourceA/2023-01-01/)

This structure separates sources and dates, enabling efficient query pruning.

/data/{source_system}/ with files named {timestamp}.csv/.json/.parquet

/data/{source_system}/{year}/{month}/{day}/ (e.g., /data/SourceA/2023/01/01/)

Why: Option B is correct because it places the source system partition first, which aligns with Azure Synapse Serverless SQL's partition elimination behavior when querying by source system. The date partition at the end allows efficient pruning on ingestion date, and the hierarchical folder structure maps directly to Hive-style partitioning, which Synapse Serverless SQL natively supports for CSV, JSON, and Parquet formats.

A healthcare company stores sensitive patient data in Azure Data Lake Storage Gen2. They need to ensure that only authorized users can access data and that all access is audited. They also need to prevent data from being accessed by unauthorized Azure services. Which combination of security features should be used?

Use a private endpoint and Azure AD authentication, disable public access.

Use Azure RBAC and ACLs for authorization, enable firewall and virtual network service endpoints, and enable diagnostic settings for auditing.

Combination provides layered security and full audit.

Use managed identity for service access and disable public access.

Use Azure AD authentication and SAS tokens for access, enabling diagnostic logs for auditing.

Why: Option B is correct because it combines Azure RBAC and ACLs for fine-grained authorization, a firewall with virtual network service endpoints to restrict access to authorized networks, and diagnostic settings to capture audit logs. This layered approach ensures that only authorized users can access the data, all access is audited, and unauthorized Azure services are blocked by the firewall and service endpoints.

Which TWO of the following are supported storage options for use as a source in Azure Synapse Pipeline Copy Activity?

Azure Data Lake Storage Gen2

ADLS Gen2 is a supported source.

Azure Analysis Services

Azure Cognitive Search

Azure Purview

Azure Blob Storage

Azure Blob Storage is a supported source.

Why: Azure Data Lake Storage Gen2 is a supported source for Azure Synapse Pipeline Copy Activity because it combines a hierarchical file system with Azure Blob Storage APIs, enabling efficient data ingestion. The Copy Activity can read data from ADLS Gen2 using the AzureBlobFS linked service, which supports both file and folder-level reads for structured and unstructured data.

Want more Design and implement data storage practice?

All Develop data processing questions

Domain 6: Develop data processing

You are designing a data processing pipeline in Azure Synapse Analytics that ingests streaming data from Azure Event Hubs and stores it in a dedicated SQL pool. The data volume is approximately 500 GB per hour with peak spikes. The pipeline must minimize data loss during transient failures. Which feature should you implement?

Use Azure Synapse Pipeline with Auto-commit and checkpointing to process streaming data.

Auto-commit with checkpointing in Synapse Pipeline provides fault tolerance and exactly-once processing for streaming data.

Use PolyBase to load data directly from Event Hubs to the dedicated SQL pool.

Use COPY INTO statement to ingest data from Event Hubs into the dedicated SQL pool.

Enable Event Hubs Capture to write data to Azure Data Lake Storage and then load using PolyBase.

Why: Option A is correct because Azure Synapse Pipeline with Auto-commit and checkpointing provides exactly-once processing semantics for streaming data from Event Hubs, ensuring no data loss during transient failures by committing offsets only after successful writes to the dedicated SQL pool. This feature is designed for high-volume streaming (500 GB/hour) and handles peak spikes through parallelization and retry logic, making it the optimal choice for minimizing data loss.

You are designing a batch processing solution using Azure Databricks. The data source is a large Parquet dataset stored in Azure Data Lake Storage Gen2 (ADLS Gen2). The processing requires joining two datasets: one with 10 billion rows and another with 1 million rows. The cluster uses Photon runtime. Which optimization should you apply to minimize shuffle?

Broadcast the smaller table (1 million rows) to all worker nodes.

Broadcasting the smaller table avoids shuffling the large table, significantly reducing data movement.

Increase the cluster size to reduce shuffle overhead.

Create bucketed tables on the join key for both datasets.

Use Delta Lake and optimize file layout with OPTIMIZE command.

Why: Broadcasting the smaller table (1 million rows) to all worker nodes is the correct optimization because it eliminates the need for a full shuffle during the join. With Photon runtime, broadcast joins are highly efficient as they replicate the small table to each executor, allowing map-side joins that avoid costly data movement across the network. Given the 10:1 row ratio, the 1-million-row table is well within the default broadcast threshold (10 MB compressed, configurable via spark.sql.autoBroadcastJoinThreshold), making this the most effective shuffle-minimization technique.

You are running a Spark job in Azure Synapse Analytics that reads from a Delta Lake table and performs multiple transformations. The job fails with an out-of-memory error on the executors. Which action should you take first to resolve the issue?

Enable checkpointing to truncate the lineage.

Decrease the number of partitions to reduce overhead.

Increase the executor memory setting in the Spark configuration.

Increasing executor memory provides more heap space to avoid OOM errors.

Use the cache() action on intermediate DataFrames.

Why: Option C is correct because an out-of-memory error on executors indicates that the available memory per executor is insufficient for the data being processed. Increasing the executor memory setting in the Spark configuration directly addresses this by allocating more heap space, allowing transformations to complete without spilling to disk or failing. This is the first and most straightforward action to take before optimizing partitioning or caching.

You are designing a data pipeline in Azure Data Factory (ADF) that copies data from an on-premises SQL Server database to Azure Synapse Analytics dedicated SQL pool. The pipeline must run daily and handle incremental loads efficiently. Which sink dataset type and copy method should you use?

Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and use the Copy activity with PolyBase enabled.

Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and enable the built-in Upsert option.

Use Azure Blob Storage as the sink dataset, then use PolyBase to load into the dedicated SQL pool.

Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and use Stored Procedure with staging table and PolyBase.

This combination enables high-throughput ingestion and supports incremental loading via merge logic in the stored procedure.

Why: Option D is correct because it uses a staging table and PolyBase to efficiently load incremental data into Azure Synapse Analytics dedicated SQL pool. PolyBase provides high-throughput parallel loading, and the stored procedure handles the merge logic (upsert) to manage incremental changes. This approach is recommended for large-scale, daily incremental loads to Synapse.

You are implementing a streaming solution using Azure Stream Analytics. The input is from an IoT Hub receiving telemetry from thousands of devices. The output is to Azure Synapse Analytics dedicated SQL pool. The requirement is to compute rolling averages over a 5-minute tumbling window and write results every minute. Which windowing function and output configuration should you use?

Use a TumblingWindow with duration of 5 minutes and output every 5 minutes.

Use a SlidingWindow with duration 5 minutes and output every 1 minute.

Use a HoppingWindow with size 5 minutes and hop 1 minute.

Hopping windows with a 1-minute hop produce results every minute, each covering the last 5 minutes.

Use a SessionWindow with timeout 5 minutes and maximum duration 10 minutes.

Why: Option C is correct because a HoppingWindow with a size of 5 minutes and a hop of 1 minute allows you to compute rolling averages over a 5-minute window while producing results every minute. This satisfies the requirement of outputting results at a higher frequency than the window duration, which is not possible with a TumblingWindow (which only outputs at the end of the window) or a SlidingWindow (which outputs on each event, not at fixed intervals).

You are optimizing a Spark DataFrame transformation in Azure Synapse Analytics. The DataFrame has 20 columns and 100 million rows. You notice that the job is slow due to many small files being written to the output. Which two actions can you take to reduce the number of output files? (Choose two.)

Use coalesce() to reduce the number of partitions without a shuffle.

Coalesce reduces partitions and thus output files, minimizing shuffle.

Enable caching on the DataFrame before writing.

Apply bucketing on a column to group data.

Increase the number of partitions using repartition() with a larger number.

Use repartition() with a smaller number of partitions.

Fewer partitions mean fewer files written.

Why: Option A is correct because `coalesce()` reduces the number of partitions without triggering a full shuffle, which minimizes the number of output files while preserving performance. Since the DataFrame already has 100 million rows and 20 columns, coalescing to fewer partitions directly reduces the number of files written, addressing the small-file problem efficiently.

Want more Develop data processing practice?

Browse all DP-203 questions Take a timed practice test

Frequently asked questions

How many questions are on the DP-203 exam?

The DP-203 exam has 50 questions and must be completed in 120 minutes. The passing score is 700/1000.

What types of questions appear on the DP-203 exam?

Scenario-based questions covering exam objectives with detailed answer explanations.

How are DP-203 questions organised by domain?

The exam covers 6 domains: Secure, monitor, and optimize data storage and data processing, Design and develop data processing, Design and implement data security, Monitor and optimize data storage and processing, Design and implement data storage, Develop data processing. Questions are weighted by domain — higher-weight domains appear more on your actual exam.

Are these the actual DP-203 exam questions?

No. These are original exam-style practice questions written against the official Microsoft DP-203 exam objectives. They are not copied from the real exam. Courseiva focuses on genuine understanding, not memorisation of braindumps.

Ready to practice all 60 DP-203 questions?

Courseiva tracks your accuracy per domain and routes you toward weak areas automatically. Free, no account required.

Microsoft · Free Practice Questions · Last reviewed May 2026

DP-203 Exam Questions and Answers

36real exam-style questions organised by domain, each with the correct answer highlighted and a plain-English explanation of why it's right — and why the others are wrong.

50 exam questions

120 min time limit

Pass: 700/1000 / 1000

6 exam domains

Overview Domain Blueprint Study Guide All QuestionsSample by Domain

Domain 1: Secure, monitor, and optimize data storage and data processing

All Secure, monitor, and optimize data storage and data processing questions

Implement Always Encrypted with column encryption keys stored in Azure Key Vault.

Configure Dynamic Data Masking to obfuscate sensitive data.

Enable Azure Storage Service Encryption with a customer-managed key.

Enable Transparent Data Encryption (TDE) with a customer-managed key in Azure Key Vault.

TDE with customer-managed key provides encryption at rest for the entire database, meeting the requirement.

Column-level security masks.

Dynamic views with user context functions.

Row-level security filters.

Table tags with access control lists (ACLs) in Unity Catalog.

Unity Catalog supports ABAC using tags and ACLs, and integrates with Microsoft Entra ID for authentication.

Your team uses Azure Data Factory to copy data from on-premises SQL Server to Azure Blob Storage. You need to ensure that data in transit is encrypted using TLS 1.2. What should you configure?

Ensure the Azure Data Factory pipeline uses HTTPS for the copy activity.

HTTPS uses TLS to encrypt data in transit, and Azure Data Factory supports this by default.

Set up a site-to-site VPN between the on-premises network and Azure.

Configure the on-premises SQL Server to use SSL certificates.

Use Azure ExpressRoute with private peering.

Configure service endpoints for Azure Storage and Azure Data Lake Storage.

Deploy the workspace in a VNet with forced tunneling enabled and a firewall.

VNet injection with forced tunneling ensures cluster nodes have no public IPs and all outbound traffic goes through the firewall.

Apply network security groups (NSGs) to the subnet that restrict outbound traffic.

Enable Azure Private Link for the Databricks workspace.

The 'keyVersion' is missing a specific version, so Azure Storage defaults to Microsoft-managed key.

For customer-managed key, a specific key version is required; an empty version may cause Azure to use the latest but if the key is not accessible, it falls back to Microsoft-managed key.

The 'keySource' should be 'Microsoft.Storage' for customer-managed key.

The storage account requires double encryption to use customer-managed key.

The 'infrastructureEncryption' setting is enabled, which overrides customer-managed key.

sys.dm_pdw_exec_requests

This DMV includes memory_grant and memory_used columns to assess memory pressure.

sys.dm_pdw_wait_stats

sys.dm_pdw_query_stats_xe

sys.dm_pdw_nodes_os_performance_counters

Want more Secure, monitor, and optimize data storage and data processing practice?

All Design and develop data processing questions

Domain 2: Design and develop data processing

Use Azure Data Factory to copy the JSON data into Azure SQL Database, then use T-SQL to transform.

Use Azure Data Factory with SSIS to transform and load into dedicated SQL pool.

Load data into a Spark DataFrame in Synapse notebooks, transform, and write back.

Create external tables on the JSON files using PolyBase, then use CREATE EXTERNAL TABLE AS SELECT (CETAS) to write transformed Parquet files.

Minimizes movement by querying in place.

Use Azure Data Factory mapping data flows with schema drift enabled, mapping to a fixed sink schema.

Define a fixed schema in the source and ignore any new columns.

Use Spark with mergeSchema option when reading, and write using a Delta table to evolve schema automatically.

Handles schema drift automatically.

Use Azure Stream Analytics to pre-process and enforce schema.

Set spark.sql.shuffle.partitions to a higher value, e.g., 500.

Reduces data per partition, easing memory.

Increase the driver memory to 28 GB.

Increase the number of workers to 20.

Reduce spark.sql.shuffle.partitions to 100.

Azure Event Hubs -> Azure HDInsight (Kafka) -> Azure Cosmos DB

Azure Event Hubs -> Azure Stream Analytics -> Azure Cosmos DB

Stream Analytics provides near-real-time aggregation.

Azure IoT Hub -> Azure Databricks (Structured Streaming) -> Azure Cosmos DB

Azure Event Hubs -> Azure Data Factory -> Azure Cosmos DB

Use a TRUNCATE statement before each insert.

Use a MERGE statement with a unique key to upsert data.

Use a staging table and then swap partitions with the target table.

Atomic swap ensures idempotency.

Use CREATE TABLE AS SELECT (CTAS) with a unique constraint.

Azure Event Hubs

Common ingestion for batch and streaming.

Azure SQL Database

Apache Kafka on HDInsight

Delta Lake (on Azure Databricks)

Supports batch and streaming, schema evolution.

Azure Data Lake Storage Gen2

Want more Design and develop data processing practice?

All Design and implement data security questions

Domain 3: Design and implement data security

A company uses Azure Synapse Analytics dedicated SQL pool. They need to ensure that only users with a specific Azure AD group can query a particular schema. Which approach should they use?

Configure a server-level firewall rule to block other users.

Use the GRANT statement to grant SELECT on the schema to the Azure AD group.

GRANT schema permission controls access at schema level.

Create a row-level security policy on all tables in the schema.

Apply dynamic data masking to the schema.

You are designing data security for Azure Data Lake Storage Gen2. The requirement is to prevent data from being accessed by anyone outside the corporate network. Which feature should you enable?

Use Azure Private Endpoint or service endpoint with a VNet.

Private endpoint ensures data is accessed only from within the VNet.

Assign RBAC roles to deny access to all except corporate users.

Configure IP firewall rules to allow only corporate IP ranges.

Enable encryption at rest using customer-managed keys.

Use Always Encrypted in SQL Server and customer-managed keys in Blob Storage.

Set up a VPN between on-premises and Azure, and use Azure Disk Encryption.

Configure the copy activity to use TLS and enable Azure Storage Service Encryption.

TLS encrypts data in transit; Storage Service Encryption encrypts at rest automatically.

Use HTTPS for the copy activity and enable Azure Storage Service Encryption.

You need to audit all data access to an Azure Storage account. Which Azure service should you enable?

Azure Storage analytics logs and send to Log Analytics workspace

Storage logs capture access details; Log Analytics enables querying.

Azure Policy to audit storage account access

Azure Monitor metrics

Azure Security Center

A company uses Azure Key Vault to store secrets for data pipelines. They need to rotate the secrets automatically every 90 days. What should they implement?

Use Azure Policy to enforce secret expiration.

Assign RBAC roles to a service principal to update the secret.

Create a manual process to update the secret in Key Vault.

Configure Key Vault secret rotation with an expiration date of 90 days.

Key Vault can automatically rotate secrets based on expiration.

Which TWO of the following are valid methods to secure data at rest in Azure Data Lake Storage Gen2?

Assign RBAC roles for data access

Configure storage firewall rules

Use customer-managed keys in Azure Key Vault

Customer-managed keys provide additional control over encryption.

Use Azure Storage Service Encryption (SSE)

SSE encrypts data at rest automatically.

Enable TLS 1.2 for all connections

Want more Design and implement data security practice?

All Monitor and optimize data storage and processing questions

Domain 4: Monitor and optimize data storage and processing

Decrease the 'Batch size' for the copy activity.

Change the sink to use PolyBase with staging enabled.

Increase the Data Integration Unit (DIU) to 8.

Enable 'Enable staging' and set 'Degree of copy parallelism' to a higher value.

Increases parallelism, reducing copy time and timeout likelihood.

Run OPTIMIZE on the table to compact small files.

OPTIMIZE merges small files into larger ones.

Run ZORDER BY on the date column.

Run VACUUM to delete old files.

Increase the number of partitions by adding a new partition column.

Increase the number of Streaming Units (SU).

More SU provides more processing power.

Reduce the number of Streaming Units.

Change the query compatibility level to 1.0.

Deploy a second Stream Analytics job and split the input.

Partition the data by region in the folder structure.

Partition elimination reduces data scanned.

Create a clustered index on the region column.

Compress the parquet files using gzip.

Enable hierarchical namespace on the storage account.

Configure the streaming to write in micro-batches with a higher trigger interval.

Batching reduces the number of small file writes.

Increase the cluster size to 16 nodes.

Enable 'auto optimize' and 'optimized writes' on the Delta table.

Change the output format from Delta to Parquet.

Which TWO actions should you take to reduce costs associated with an Azure Synapse Dedicated SQL Pool that is used for reporting during business hours only?

Pause the pool during non-business hours.

Stops compute billing when not in use.

Enable advanced data compression on all tables.

Scale down the pool during business hours.

Change the distribution of large tables to ROUND_ROBIN.

Implement result set caching for frequently run queries.

Reduces compute usage for repeated queries.

Want more Monitor and optimize data storage and processing practice?

All Design and implement data storage questions

Domain 5: Design and implement data storage

Use Azure SQL Database with clustered columnstore index on date and device ID.

Organize folders as /YYYY/MM/DD/DeviceID/ in ADLS Gen2 and use file naming that includes timestamp.

This folder structure enables efficient partition pruning based on date and device ID.

Use Azure Table Storage with PartitionKey set to date and RowKey set to device ID.

Use Azure Cosmos DB with partition key on (date, device ID) and TTL for data retention.

Use Azure SQL Database Change Tracking and push changes to Azure Event Hubs, then use Azure Stream Analytics to write to Synapse.

Enable Change Data Capture (CDC) on the source table and use Azure Data Factory with a 1-minute tumbling window to copy changes into Synapse.

CDC captures only changed rows, and ADF can run frequently to meet latency target.

Use Azure Synapse PolyBase to directly query the source SQL database every 5 minutes.

Schedule a full copy of the entire table every 5 minutes using Azure Data Factory.

Store the logs in Azure SQL Database as a table.

Store the logs in Azure Files share.

Store the logs in Azure Blob Storage with cool access tier.

Blob Storage cool tier is low-cost for infrequent access, suitable for logs.

Store the logs in Azure Cosmos DB with a JSON container.

/data/{date}/{source_system}/ (e.g., /data/2023-01-01/SourceA/)

/data/{source_system}/{date}/ (e.g., /data/SourceA/2023-01-01/)

This structure separates sources and dates, enabling efficient query pruning.

/data/{source_system}/ with files named {timestamp}.csv/.json/.parquet

/data/{source_system}/{year}/{month}/{day}/ (e.g., /data/SourceA/2023/01/01/)

Use a private endpoint and Azure AD authentication, disable public access.

Use Azure RBAC and ACLs for authorization, enable firewall and virtual network service endpoints, and enable diagnostic settings for auditing.

Combination provides layered security and full audit.

Use managed identity for service access and disable public access.

Use Azure AD authentication and SAS tokens for access, enabling diagnostic logs for auditing.

Which TWO of the following are supported storage options for use as a source in Azure Synapse Pipeline Copy Activity?

Azure Data Lake Storage Gen2

ADLS Gen2 is a supported source.

Azure Analysis Services

Azure Cognitive Search

Azure Purview

Azure Blob Storage

Azure Blob Storage is a supported source.

Want more Design and implement data storage practice?

All Develop data processing questions

Domain 6: Develop data processing

Use Azure Synapse Pipeline with Auto-commit and checkpointing to process streaming data.

Auto-commit with checkpointing in Synapse Pipeline provides fault tolerance and exactly-once processing for streaming data.

Use PolyBase to load data directly from Event Hubs to the dedicated SQL pool.

Use COPY INTO statement to ingest data from Event Hubs into the dedicated SQL pool.

Enable Event Hubs Capture to write data to Azure Data Lake Storage and then load using PolyBase.

Broadcast the smaller table (1 million rows) to all worker nodes.

Broadcasting the smaller table avoids shuffling the large table, significantly reducing data movement.

Increase the cluster size to reduce shuffle overhead.

Create bucketed tables on the join key for both datasets.

Use Delta Lake and optimize file layout with OPTIMIZE command.

Enable checkpointing to truncate the lineage.

Decrease the number of partitions to reduce overhead.

Increase the executor memory setting in the Spark configuration.

Increasing executor memory provides more heap space to avoid OOM errors.

Use the cache() action on intermediate DataFrames.

Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and use the Copy activity with PolyBase enabled.

Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and enable the built-in Upsert option.

Use Azure Blob Storage as the sink dataset, then use PolyBase to load into the dedicated SQL pool.

Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and use Stored Procedure with staging table and PolyBase.

This combination enables high-throughput ingestion and supports incremental loading via merge logic in the stored procedure.

Use a TumblingWindow with duration of 5 minutes and output every 5 minutes.

Use a SlidingWindow with duration 5 minutes and output every 1 minute.

Use a HoppingWindow with size 5 minutes and hop 1 minute.

Hopping windows with a 1-minute hop produce results every minute, each covering the last 5 minutes.

Use a SessionWindow with timeout 5 minutes and maximum duration 10 minutes.

Use coalesce() to reduce the number of partitions without a shuffle.

Coalesce reduces partitions and thus output files, minimizing shuffle.

Enable caching on the DataFrame before writing.

Apply bucketing on a column to group data.

Increase the number of partitions using repartition() with a larger number.

Use repartition() with a smaller number of partitions.

Fewer partitions mean fewer files written.

Want more Develop data processing practice?