Practice DP-203 Design and develop data processing questions with full explanations on every answer.
Start practicing
Design and develop data processing — choose a session length
Free · No account required
Click any question to see the full explanation and answer options, or start a focused practice session above.
A company uses Azure Synapse Analytics to process large datasets. They need to transform JSON data stored in Azure Data Lake Storage Gen2 into a star schema. Which data processing approach minimizes data movement and leverages the compute closest to the data?
2You are designing a batch processing pipeline that reads CSV files from Azure Blob Storage, performs aggregations using Azure Databricks, and writes results to Azure Synapse Analytics. The pipeline must handle schema drift (new columns appearing in source files). Which approach should you recommend?
3A company is running a Spark job on Azure Databricks that processes 500 GB of data daily. The job frequently fails with 'OutOfMemoryError' during shuffles. The cluster uses 10 workers of type Standard_DS3_v2 (14 GB memory each). Which configuration change should you make to improve stability without over-provisioning?
4You need to design a near-real-time data processing solution that ingests IoT telemetry data from millions of devices. The data must be aggregated per minute and stored in Azure Cosmos DB for low-latency queries. Which Azure service combination should you use?
5A data processing job in Azure Synapse Analytics writes results to a table in the dedicated SQL pool. After a failure, the job restarts from the beginning, causing duplicates. Which design pattern should you implement to ensure idempotent writes?
6You are designing a data processing solution in Azure that must handle both batch and streaming data. The solution should use a common storage layer for both and support schema evolution. Which TWO technologies should you recommend?
7A company uses Azure Databricks to process streaming data from Event Hubs. The data is written to a Delta table. The job occasionally fails due to checkpoint corruption. Which THREE measures should you implement to improve reliability?
8A company ingests streaming data from IoT devices into Azure Event Hubs. The data must be processed in near real-time to detect anomalies and stored in Azure Data Lake Storage Gen2 for historical analysis. The solution must minimize latency and avoid duplicate processing. Which Azure service should be used for processing?
9A data engineer is designing a batch processing pipeline that reads data from Azure Blob Storage, transforms it using Azure Databricks, and writes the output to Azure Synapse Analytics. The source files are in CSV format and arrive daily at 02:00 UTC. The transformation must be idempotent and the pipeline should handle late-arriving data (up to 2 hours). What is the best approach to trigger the pipeline?
10A multinational corporation uses Azure Data Lake Storage Gen2 to store petabytes of parquet files partitioned by date and hour. Data scientists report that queries on the last 7 days of data take over 30 minutes, while queries on older data are fast. The storage account uses the default Azure Blob Storage hierarchical namespace. Which action will MOST improve query performance on recent data?
11A company uses Azure Data Factory to orchestrate an ETL pipeline that copies data from an on-premises SQL Server to Azure Synapse Analytics. The pipeline runs hourly and uses a self-hosted integration runtime. Recently, the pipeline started failing with timeout errors. The on-premises SQL Server is healthy and the network is stable. What is the most likely cause and solution?
12A data engineer needs to process a large dataset stored in Azure Blob Storage using Azure Databricks. The dataset consists of millions of small CSV files. The processing job is slow due to the overhead of reading many small files. Which technique should be used to improve performance?
13Which TWO actions are appropriate when designing a data processing solution that must meet strict SLAs for latency and throughput?
14Which THREE factors should be considered when choosing between Azure Stream Analytics and Azure Databricks for a real-time data processing solution?
15Refer to the exhibit. An Azure Data Factory instance uses a self-hosted integration runtime. The exhibit shows the properties of the integration runtime. The data engineer notices that copy activities are failing with errors indicating that the integration runtime is not available. What is the most likely cause?
16You are a data engineer for a large e-commerce company. The company uses Azure Data Lake Storage Gen2 (ADLS Gen2) as its data lake. A team of data scientists needs to process a massive dataset (approximately 5 TB) stored in Parquet format in the data lake. The dataset contains sales transactions from the past 10 years. The data scientists run a Spark job daily using Azure Synapse Analytics (serverless Spark pool) to compute aggregated sales metrics by product category and region. The job reads the entire dataset each day, performs transformations, and writes the aggregated results back to the data lake. Over the past few weeks, the job has been taking longer to complete, and the data scientists have reported that the job now takes over 6 hours, exceeding the acceptable SLA of 4 hours. They suspect the issue is related to data skew or suboptimal partitioning. You need to optimize the job to reduce execution time. Which approach should you take?
17You are a data engineer at a healthcare analytics company. The company uses Azure Data Factory (ADF) to orchestrate data pipelines that ingest patient data from on-premises SQL Server databases into Azure Synapse Analytics. Recently, the pipeline has been failing intermittently with the following error: 'Failure happened on 'Sink' side. ErrorCode=SqlFailedToConnect, Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Cannot connect to SQL Server Database. The TCP connection to the host <server_name>, port 1433 has failed. Error: 'Connection timed out.'.' The on-premises SQL Server is behind a corporate firewall. The ADF self-hosted integration runtime (SHIR) is installed on a VM inside the corporate network. You have verified that the SHIR is running and that the SQL Server is accessible from the SHIR VM using SQL Server Management Studio (SSMS). The error occurs sporadically, not consistently. What is the most likely cause of the intermittent connection timeout?
18A data engineering team is designing a batch processing solution using Azure Databricks. The data is stored in Azure Data Lake Storage Gen2 (ADLS Gen2) and must be processed daily with minimal cost. The team needs to choose between using a Delta Lake table or a Parquet file format for the processed output. Which TWO factors should the team consider when making this decision?
19You are a data engineer at a retail company. You have designed a near real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs, which receives clickstream events from the company's e-commerce website. The output is written to an Azure SQL Database table for reporting. Each event includes fields: UserId, ProductId, EventType (e.g., 'click', 'purchase'), and Timestamp. The requirement is to calculate the number of purchases per product in a 5-minute tumbling window and update a SQL table. The Stream Analytics job has been running for a week, but the reporting team notices that the purchase counts in SQL are consistently lower than expected compared to a direct count from Event Hubs. You suspect that late-arriving events are being dropped. The job's configuration includes a 5-minute tumbling window with no late arrival policy. What should you do to fix the issue without losing data?
20You are designing an Azure Stream Analytics job to process real-time IoT data from thousands of devices. The job must handle late-arriving events (up to 1 hour late) and out-of-order events (up to 5 minutes). Which two temporal policies should you configure?
21You are designing a data processing solution for a retail company. The solution must ingest streaming sales data from point-of-sale (POS) systems and batch uploads from stores that are offline. The total data volume is 5 TB daily. The solution must allow real-time dashboards and periodic batch processing. Which combination of services and ingestion patterns is most cost-effective and scalable?
22You are implementing a data pipeline using Azure Data Factory. The source is an on-premises SQL Server database. Which Azure Data Factory component is required to connect to the on-premises data source?
23You have a Synapse Analytics dedicated SQL pool. You need to load 100 GB of CSV data from Azure Data Lake Storage Gen2 into a fact table. The table has a hash-distributed column. Which pattern is most efficient for loading with minimal impact on concurrent queries?
24You are designing a stream processing solution using Azure Stream Analytics. The job must reference a static lookup table (product catalog) stored in Azure Blob Storage. The catalog is updated once daily. The job should automatically pick up the latest version without restarting. Which two configurations are required? (Choose two.)
25You are designing a data transformation pipeline using Azure Databricks. The pipeline reads from Azure Data Lake Storage Gen2, performs aggregations, and writes to a Synapse dedicated SQL pool. Which three configurations should you implement to optimize performance and minimize cost? (Choose three.)
26Which Azure service is primarily used for orchestrating data pipelines in a cloud-native ETL workflow?
27When designing a data processing solution using Azure Databricks, what is the recommended approach to handle schema evolution when reading data from Delta Lake tables?
28You are building a streaming pipeline in Azure Stream Analytics that reads from an Azure Event Hubs input with 10 partitions. The query performs a GROUP BY on a column that is not the partition key. To ensure consistency, which partitioning scheme should you use?
29Which of the following are valid activities in an Azure Data Factory pipeline? (Choose two.)
30You are designing a data processing solution that requires exactly-once processing semantics for streaming data. Which two Azure services support exactly-once processing? (Choose two.)
31Match the Azure service to its primary data processing use case. Drag each service on the left to the correct use case on the right. Services: Azure Databricks, Azure Stream Analytics, Azure Data Factory, Azure Synapse Analytics Use Cases: - Real-time event processing - Orchestration of ETL pipelines - Big data analytics with Spark - Enterprise data warehousing
32Drag and drop the steps to implement incremental data loading using Azure Data Factory into the correct order.
33Drag and drop the steps to implement Slowly Changing Dimension (SCD) Type 2 in Azure Synapse Analytics dedicated SQL pool into the correct order.
34Drag and drop the steps to set up Azure Data Factory pipeline with parameterization and dynamic expressions into the correct order.
35Match each storage redundancy option to its description in Azure Storage.
36Match each Azure data integration tool to its typical use case.
37Match each Azure service tier to its description.
38Refer to the exhibit. A data engineer runs a Synapse Spark job that fails with the error shown. Which configuration change is most likely to resolve the issue?
39Refer to the exhibit. A data engineer notices that the target SQL table contains duplicate rows after a pipeline run. Which change to the pipeline configuration would prevent duplicates?
40Refer to the exhibit. A Stream Analytics job shows increasing watermark delay and input deserialization errors. Which action should be taken first to troubleshoot?
41Refer to the exhibit. A user with Storage Blob Data Reader role on the container rawdata cannot list files under /2023/07/. What is the most likely reason?
42Refer to the exhibit. A data engineer notices that Spark jobs on this cluster are running slower than expected. The cluster is using spot instances with fallback. Which factor is most likely causing the performance degradation?
The Design and develop data processing domain covers the key concepts tested in this area of the DP-203 exam blueprint published by Microsoft. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all DP-203 domains — no account required.
The Courseiva DP-203 question bank contains 42 questions in the Design and develop data processing domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
Yes — the session launcher on this page draws questions exclusively from the Design and develop data processing domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
Save your results, see per-domain analytics, and get readiness scores — free, for every certification.
Sign Up FreeFree forever · Every certification included