DP-203 Design and develop data processing — All Questions With Answers

Question 1

A company uses Azure Synapse Analytics to process large datasets. They need to transform JSON data stored in Azure Data Lake Storage Gen2 into a star schema. Which data processing approach minimizes data movement and leverages the compute closest to the data?

Accepted Answer

Create external tables on the JSON files using PolyBase, then use CREATE EXTERNAL TABLE AS SELECT (CETAS) to write transformed Parquet files. Option D is correct because it uses PolyBase external tables and CETAS to transform JSON data directly in Azure Data Lake Storage Gen2, minimizing data movement by leveraging the compute power of the dedicated SQL pool or serverless SQL pool closest to the data. This approach reads JSON in place, transforms it into Parquet format, and writes the star schema tables back to the data lake without copying data to an intermediate store.

Answer

Use Azure Data Factory to copy the JSON data into Azure SQL Database, then use T-SQL to transform.

Answer

Use Azure Data Factory with SSIS to transform and load into dedicated SQL pool.

Answer

Load data into a Spark DataFrame in Synapse notebooks, transform, and write back.

Question 2

You are designing a batch processing pipeline that reads CSV files from Azure Blob Storage, performs aggregations using Azure Databricks, and writes results to Azure Synapse Analytics. The pipeline must handle schema drift (new columns appearing in source files). Which approach should you recommend?

Accepted Answer

Use Spark with mergeSchema option when reading, and write using a Delta table to evolve schema automatically. Option C is correct because Spark's `mergeSchema` option, when used with Delta Lake, automatically evolves the schema to accommodate new columns in CSV files. This allows the batch pipeline to handle schema drift without manual intervention, and writing to a Delta table ensures the schema evolution is persisted and compatible with downstream writes to Azure Synapse Analytics.

Answer

Use Azure Data Factory mapping data flows with schema drift enabled, mapping to a fixed sink schema.

Answer

Define a fixed schema in the source and ignore any new columns.

Answer

Use Azure Stream Analytics to pre-process and enforce schema.

Question 3

A company is running a Spark job on Azure Databricks that processes 500 GB of data daily. The job frequently fails with 'OutOfMemoryError' during shuffles. The cluster uses 10 workers of type Standard_DS3_v2 (14 GB memory each). Which configuration change should you make to improve stability without over-provisioning?

Accepted Answer

Set spark.sql.shuffle.partitions to a higher value, e.g., 500. The 'OutOfMemoryError' during shuffles indicates that individual partitions are too large for the executor memory. Increasing `spark.sql.shuffle.partitions` to 500 reduces the amount of data per partition, lowering memory pressure during shuffle operations. This directly addresses the error without adding more hardware.

Answer

Increase the driver memory to 28 GB.

Answer

Increase the number of workers to 20.

Answer

Reduce spark.sql.shuffle.partitions to 100.
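
For the accepted answer to Question 3, the effect of raising `spark.sql.shuffle.partitions` can be estimated with simple arithmetic. The sketch below assumes, purely for illustration, that the full 500 GB passes through the shuffle and is spread evenly across partitions. Real jobs are rarely that uniform, and skewed keys can still produce oversized partitions. The function name is hypothetical.

```python
def avg_shuffle_partition_gb(total_shuffle_gb: float, num_partitions: int) -> float:
    """Average size of one shuffle partition, assuming an even spread."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return total_shuffle_gb / num_partitions


DAILY_DATA_GB = 500  # from the question; assumed here to be shuffled in full

# 200 is Spark's default value for spark.sql.shuffle.partitions.
for partitions in (100, 200, 500):
    size_gb = avg_shuffle_partition_gb(DAILY_DATA_GB, partitions)
    print(f"{partitions:>4} partitions -> {size_gb:.1f} GB per partition")
```

In a notebook, the setting itself would typically be applied with `spark.conf.set("spark.sql.shuffle.partitions", "500")` before the shuffle stage runs.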

Question 4

You need to design a near-real-time data processing solution that ingests IoT telemetry data from millions of devices. The data must be aggregated per minute and stored in Azure Cosmos DB for low-latency queries. Which Azure service combination should you use?

Accepted Answer

Azure Event Hubs -> Azure Stream Analytics -> Azure Cosmos DB. Option B is correct because Azure Stream Analytics provides native, low-latency windowed aggregation (e.g., TumblingWindow for per-minute aggregates) directly on data ingested from Event Hubs, and it has a built-in output sink to Azure Cosmos DB. This combination meets the near-real-time requirement without needing an intermediate compute or storage layer, minimizing end-to-end latency.

Answer

Azure Event Hubs -> Azure HDInsight (Kafka) -> Azure Cosmos DB

Answer

Azure IoT Hub -> Azure Databricks (Structured Streaming) -> Azure Cosmos DB

Answer

Azure Event Hubs -> Azure Data Factory -> Azure Cosmos DB
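
To make the per-minute aggregation in Question 4 concrete, here is a minimal plain-Python sketch of what a one-minute tumbling window computes. It is not Stream Analytics code: the event tuple layout and the function name are assumptions, and the half-open `[start, end)` bucketing is chosen for simplicity and may differ from the boundary convention the service applies.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int, float]  # (device_id, event time in epoch seconds, reading)


def tumbling_window_avg(
    events: Iterable[Event], window_seconds: int = 60
) -> Dict[Tuple[str, int], float]:
    """Average reading per device per fixed, non-overlapping window."""
    totals = defaultdict(lambda: [0.0, 0])  # (device, window_start) -> [sum, count]
    for device_id, ts, value in events:
        window_start = ts - (ts % window_seconds)  # each event lands in exactly one window
        acc = totals[(device_id, window_start)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


sample = [("dev1", 0, 10.0), ("dev1", 30, 20.0), ("dev1", 60, 30.0), ("dev2", 59, 5.0)]
print(tumbling_window_avg(sample))
```

Each key in the result corresponds to one document that would be written to the output store for that device and minute.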

Question 5

A data processing job in Azure Synapse Analytics writes results to a table in the dedicated SQL pool. After a failure, the job restarts from the beginning, causing duplicates. Which design pattern should you implement to ensure idempotent writes?

Accepted Answer

Use a staging table and then swap partitions with the target table. Option C is correct because using a staging table with partition swapping ensures idempotent writes by atomically replacing the target partition with a fully loaded staging partition. This avoids duplicates even if the job restarts, as the swap operation is transactional and the staging table can be truncated before each run. In Azure Synapse dedicated SQL pool, partition switching is a metadata-only operation that provides consistency without data movement.

Answer

Use a TRUNCATE statement before each insert.

Answer

Use a MERGE statement with a unique key to upsert data.

Answer

Use CREATE TABLE AS SELECT (CTAS) with a unique constraint.
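
The idempotency argument in Question 5 can be illustrated with a small in-memory simulation. This is only a sketch: dictionaries stand in for tables, the function names and sample rows are hypothetical, and a single dictionary assignment stands in for the partition switch (in a dedicated SQL pool that step would be an `ALTER TABLE ... SWITCH PARTITION` statement).

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (order id, amount); made-up sample shape


def append_load(target: Dict[str, List[Row]], partition: str, rows: List[Row]) -> None:
    """Non-idempotent: every rerun appends the same rows again."""
    target.setdefault(partition, []).extend(rows)


def staged_swap_load(target: Dict[str, List[Row]], partition: str, rows: List[Row]) -> None:
    """Idempotent: rebuild the partition in staging, then replace it in one step."""
    staging: List[Row] = []      # staging starts empty on every run (the truncate step)
    staging.extend(rows)         # full load into staging
    target[partition] = staging  # one replacement stands in for the partition switch


daily_rows = [("order-1", 100), ("order-2", 250)]
appended: Dict[str, List[Row]] = {}
swapped: Dict[str, List[Row]] = {}

for _ in range(2):  # a failed run followed by a restart from the beginning
    append_load(appended, "2024-01-01", daily_rows)
    staged_swap_load(swapped, "2024-01-01", daily_rows)

print(len(appended["2024-01-01"]))  # 4: the restart duplicated the rows
print(len(swapped["2024-01-01"]))   # 2: the restart left the partition unchanged
```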

Question 6

You are designing a data processing solution in Azure that must handle both batch and streaming data. The solution should use a common storage layer for both and support schema evolution. Which TWO technologies should you recommend?

Accepted Answer

Azure Event Hubs. Azure Event Hubs is correct because it is a fully managed, real-time data ingestion service that can capture streaming data and store it in Azure Data Lake Storage Gen2 or Blob Storage, enabling a unified storage layer for both batch and streaming pipelines. It supports schema evolution through Avro or JSON serialization, allowing downstream consumers to adapt to schema changes without breaking existing processes.

Answer

Azure SQL Database

Answer

Apache Kafka on HDInsight

Answer

Azure Data Lake Storage Gen2

Question 7

A company uses Azure Databricks to process streaming data from Event Hubs. The data is written to a Delta table. The job occasionally fails due to checkpoint corruption. Which THREE measures should you implement to improve reliability?

Accepted Answer

Configure checkpointing to a durable storage like Azure Data Lake Storage. Option A is correct because checkpointing to durable storage like Azure Data Lake Storage (ADLS) ensures that streaming progress metadata is persisted across cluster restarts and failures. ADLS provides high durability and availability, preventing checkpoint corruption that can occur with local or ephemeral storage, thereby enabling exactly-once processing guarantees in Structured Streaming.

Answer

Increase the batch interval to reduce load.

Answer

Increase the cluster size to handle spikes.

Question 8

A company ingests streaming data from IoT devices into Azure Event Hubs. The data must be processed in near real-time to detect anomalies and stored in Azure Data Lake Storage Gen2 for historical analysis. The solution must minimize latency and avoid duplicate processing. Which Azure service should be used for processing?

Accepted Answer

Azure Stream Analytics. Azure Stream Analytics is the correct choice because it is purpose-built for near real-time stream processing with sub-second latency, directly integrates with Event Hubs as input and Data Lake Storage Gen2 as output, and provides built-in exactly-once delivery semantics to avoid duplicate processing. It also supports temporal windowing and anomaly detection functions natively, making it ideal for this IoT anomaly detection scenario.

Answer

Azure Data Factory

Answer

Azure Databricks with Structured Streaming

Answer

Azure Functions with Event Hubs trigger

Question 9

A data engineer is designing a batch processing pipeline that reads data from Azure Blob Storage, transforms it using Azure Databricks, and writes the output to Azure Synapse Analytics. The source files are in CSV format and arrive daily at 02:00 UTC. The transformation must be idempotent and the pipeline should handle late-arriving data (up to 2 hours). What is the best approach to trigger the pipeline?

Accepted Answer

Tumbling window trigger with window size of 1 day and a late arrival window of 2 hours. Option C is correct because a tumbling window trigger in Azure Data Factory fires for fixed-size, non-overlapping windows (1 day here), and its delay setting (2 hours here) postpones each run until after the window closes, so late-arriving files are already present when processing starts. Because every run is bound to a specific window start and end time, rerunning a window reprocesses the same slice of data, which supports idempotent processing. This matches the requirement for daily batch processing at 02:00 UTC while handling data arriving up to 2 hours late.

Answer

Storage event trigger using Azure Event Grid

Answer

Schedule trigger set to 02:00 UTC daily

Answer

Event trigger on blob creation in the container
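
The timing implied by the accepted answer to Question 9 can be sketched as follows, assuming the trigger starts each run a fixed delay after the window closes. The start date, the 02:00 UTC window alignment, and the function name are illustrative assumptions, not values taken from a real trigger definition.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Window = Tuple[datetime, datetime, datetime]  # (window start, window end, run time)


def window_schedule(
    first_start: datetime, interval: timedelta, delay: timedelta, count: int
) -> List[Window]:
    """Fixed, back-to-back windows; each run fires `delay` after its window ends."""
    schedule = []
    for i in range(count):
        start = first_start + i * interval
        end = start + interval
        schedule.append((start, end, end + delay))
    return schedule


first = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)  # hypothetical first window start
for start, end, run_at in window_schedule(first, timedelta(days=1), timedelta(hours=2), 3):
    print(f"window {start:%Y-%m-%d %H:%M} -> {end:%Y-%m-%d %H:%M}, run at {run_at:%Y-%m-%d %H:%M}")
```

Since each run receives the same window start and end on a rerun, the transformation can filter and overwrite by those bounds, which keeps the output deterministic.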

Question 10

A multinational corporation uses Azure Data Lake Storage Gen2 to store petabytes of parquet files partitioned by date and hour. Data scientists report that queries on the last 7 days of data take over 30 minutes, while queries on older data are fast. The storage account uses the default Azure Blob Storage hierarchical namespace. Which action will MOST improve query performance on recent data?

Accepted Answer

Optimize the partition layout by partitioning by date first, then by hour, to reduce the number of partitions scanned for recent data. Option C is correct because partitioning by date first, then by hour, ensures that queries filtering on the last 7 days scan only the relevant date partitions, drastically reducing the amount of data read. In Azure Data Lake Storage Gen2, the hierarchical namespace allows partition pruning at the directory level, so a date-first layout minimizes the number of partitions scanned for recent data, directly addressing the performance bottleneck.

Answer

Convert the parquet files to CSV format to reduce metadata overhead

Answer

Enable soft delete on the storage account to reduce read latency

Answer

Apply Z-order clustering on the parquet files using Azure Databricks
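
The directory-level pruning described in the accepted answer to Question 10 amounts to reading only the date folders that a query needs. A minimal sketch, assuming a hypothetical `date=YYYY-MM-DD/` folder layout (with hour folders nested below it) and a made-up root path:

```python
from datetime import date, timedelta
from typing import List


def recent_partition_prefixes(root: str, end_day: date, days: int) -> List[str]:
    """Date-level folder prefixes covering the last `days` days, newest first."""
    return [
        f"{root}/date={(end_day - timedelta(days=offset)).isoformat()}/"
        for offset in range(days)
    ]


# Only these 7 prefixes need to be listed and read, regardless of
# how many years of history sit next to them under the same root.
for prefix in recent_partition_prefixes("sales", date(2024, 3, 10), 7):
    print(prefix)
```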

Question 11

A company uses Azure Data Factory to orchestrate an ETL pipeline that copies data from an on-premises SQL Server to Azure Synapse Analytics. The pipeline runs hourly and uses a self-hosted integration runtime. Recently, the pipeline started failing with timeout errors. The on-premises SQL Server is healthy and the network is stable. What is the most likely cause and solution?

Accepted Answer

The self-hosted integration runtime is under-provisioned; scale up the VM or add more nodes. Option B is correct because timeouts often occur when the self-hosted IR is overloaded or has insufficient resources, and scaling up or adding nodes resolves the issue. Option A is wrong because the integration runtime version is automatically updated. Option C is wrong because the source query timeout is set to 120 seconds by default and increasing it may mask the problem. Option D is wrong because staging is used for large data transfers, but the issue is likely IR performance.

Answer

The self-hosted integration runtime version is outdated; update it to the latest version

Answer

The copy activity is not using staging; enable staging through Azure Blob Storage

Answer

The source query timeout in the copy activity is too low; increase it to 3600 seconds

Question 12

A data engineer needs to process a large dataset stored in Azure Blob Storage using Azure Databricks. The dataset consists of millions of small CSV files. The processing job is slow due to the overhead of reading many small files. Which technique should be used to improve performance?

Accepted Answer

Coalesce the small files into larger files using a Databricks notebook. Option C is correct because coalescing the millions of small CSV files into larger files reduces the metadata overhead and I/O operations when reading from Azure Blob Storage. Databricks can then process fewer, larger files more efficiently, as each task handles a substantial data chunk rather than incurring the cost of opening and closing many small files.

Answer

Increase the number of worker nodes in the cluster

Answer

Convert the CSV files to Parquet format

Answer

Use Delta Lake caching to store the data in memory
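
The compaction idea behind the accepted answer to Question 12 can be shown on a local folder with the standard library alone. This is a sketch under stated assumptions, not Databricks code: every source file is assumed to share the same header row, and the function name, output file naming, and `rows_per_file` parameter are hypothetical.

```python
import csv
from pathlib import Path
from typing import List


def compact_csv_files(src_dir: Path, dest_dir: Path, rows_per_file: int) -> List[Path]:
    """Merge many small CSV files into fewer files of up to rows_per_file data rows."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    header = None
    out_handle = None
    writer = None
    rows_in_current = 0
    try:
        for src in sorted(src_dir.glob("*.csv")):
            with src.open(newline="") as handle:
                reader = csv.reader(handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                elif file_header != header:
                    raise ValueError(f"{src} has a different header")
                for row in reader:
                    # Start a new output file when none is open or the current one is full.
                    if writer is None or rows_in_current >= rows_per_file:
                        if out_handle is not None:
                            out_handle.close()
                        out_path = dest_dir / f"part-{len(outputs):05d}.csv"
                        out_handle = out_path.open("w", newline="")
                        writer = csv.writer(out_handle)
                        writer.writerow(header)
                        outputs.append(out_path)
                        rows_in_current = 0
                    writer.writerow(row)
                    rows_in_current += 1
    finally:
        if out_handle is not None:
            out_handle.close()
    return outputs
```

Calling `compact_csv_files(Path("small"), Path("compacted"), rows_per_file=1_000_000)` would rewrite the folder into files of at most one million data rows each. On a cluster the same effect is usually achieved by reading the folder once and writing it back with fewer output partitions.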

Question 13

Which TWO actions are appropriate when designing a data processing solution that must meet strict SLAs for latency and throughput?

Accepted Answer

Partition data by date and hour to improve query performance. Partitioning data by date and hour (Option A) is appropriate because it enables partition elimination, where queries only scan relevant partitions rather than the entire dataset. This directly reduces latency and improves throughput by minimizing I/O and compute resources needed for time-range queries, which is critical for meeting strict SLAs in data processing solutions.

Answer

Process all data synchronously to ensure consistency

Answer

Use a single large cluster for all workloads to simplify management

Answer

Use a single node for orchestration to reduce complexity

Question 14

Which THREE factors should be considered when choosing between Azure Stream Analytics and Azure Databricks for a real-time data processing solution?

Accepted Answer

Need for complex transformations and machine learning model integration. Option B is correct because Azure Databricks provides native support for complex transformations (e.g., windowed aggregations, multi-step ETL) and seamless integration with machine learning libraries (e.g., MLflow, Spark MLlib), which are not natively available in Azure Stream Analytics. Stream Analytics uses a SQL-like query language and is optimized for simpler, declarative transformations, making Databricks the better choice when advanced analytics or ML model scoring is required in real-time pipelines.

Answer

Integration with Power BI for real-time dashboards

Answer

Maximum allowed latency for late-arriving data

Question 15

Refer to the exhibit. An Azure Data Factory instance uses a self-hosted integration runtime. The exhibit shows the properties of the integration runtime. The data engineer notices that copy activities are failing with errors indicating that the integration runtime is not available. What is the most likely cause?

Accepted Answer

The integration runtime version is outdated and needs to be manually updated. Option C is correct because the exhibit shows the integration runtime version as '5.24.8345.1' and the status as 'Online', but copy activities are failing. The most likely cause is that the self-hosted IR version is outdated and no longer compatible with the Azure Data Factory service endpoints, leading to connectivity failures. Auto-update being disabled (Option B) would prevent automatic updates, but the core issue is the outdated version itself, which requires manual intervention to update.

Answer

The integration runtime status is "Offline"

Answer

Auto-update is disabled, preventing the IR from updating

Answer

Self-contained interactive authoring is disabled, causing connectivity issues

Question 16

You are a data engineer for a large e-commerce company. The company uses Azure Data Lake Storage Gen2 (ADLS Gen2) as its data lake. A team of data scientists needs to process a massive dataset (approximately 5 TB) stored in Parquet format in the data lake. The dataset contains sales transactions from the past 10 years. The data scientists run a Spark job daily using Azure Synapse Analytics (serverless Spark pool) to compute aggregated sales metrics by product category and region. The job reads the entire dataset each day, performs transformations, and writes the aggregated results back to the data lake. Over the past few weeks, the job has been taking longer to complete, and the data scientists have reported that the job now takes over 6 hours, exceeding the acceptable SLA of 4 hours. They suspect the issue is related to data skew or suboptimal partitioning. You need to optimize the job to reduce execution time. Which approach should you take?

Accepted Answer

Implement incremental processing using Auto Loader with 'directoryListing' mode to process only new files since the last run. Option C is correct because the job reads the entire 5 TB dataset daily, which is inefficient when only new data needs processing. Auto Loader with 'directoryListing' mode incrementally identifies and processes only new files since the last run, drastically reducing the data volume and execution time. This directly addresses the root cause of the SLA breach (reading unchanged historical data repeatedly) rather than tuning resources or partitioning.

Answer

Increase the executor memory and cores in the Spark pool configuration to handle larger shuffles.

Answer

Repartition the data on the 'product_category' column with a higher number of partitions (e.g., 2000).

Answer

Use a broadcast join hint on the fact table to reduce shuffle operations.

Question 17

You are a data engineer at a healthcare analytics company. The company uses Azure Data Factory (ADF) to orchestrate data pipelines that ingest patient data from on-premises SQL Server databases into Azure Synapse Analytics. Recently, the pipeline has been failing intermittently with the following error: 'Failure happened on 'Sink' side. ErrorCode=SqlFailedToConnect, Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Cannot connect to SQL Server Database. The TCP connection to the host <server_name>, port 1433 has failed. Error: 'Connection timed out.'.' The on-premises SQL Server is behind a corporate firewall. The ADF self-hosted integration runtime (SHIR) is installed on a VM inside the corporate network. You have verified that the SHIR is running and that the SQL Server is accessible from the SHIR VM using SQL Server Management Studio (SSMS). The error occurs sporadically, not consistently. What is the most likely cause of the intermittent connection timeout?

Accepted Answer

The corporate firewall or network device is closing idle TCP connections to the SQL Server database. The intermittent nature of the timeout, combined with the fact that the SHIR VM can connect to SQL Server via SSMS, strongly suggests that the corporate firewall or a network intermediary (such as a load balancer or NAT device) is closing idle TCP connections. ADF pipelines may hold connections open between activities or during long-running data transfers, and if no keep-alive packets are sent within the firewall's idle timeout window (commonly 4–30 minutes), the firewall drops the TCP session. When ADF attempts to reuse that connection, it receives a 'Connection timed out' error because the socket is no longer valid.

Answer

The data being transferred is skewed, causing the sink to be overwhelmed.

Answer

The SQL Server database is experiencing high CPU utilization during the pipeline execution window.

Answer

The self-hosted integration runtime is running out of memory during peak loads.

Question 18

A data engineering team is designing a batch processing solution using Azure Databricks. The data is stored in Azure Data Lake Storage Gen2 (ADLS Gen2) and must be processed daily with minimal cost. The team needs to choose between using a Delta Lake table or a Parquet file format for the processed output. Which TWO factors should the team consider when making this decision?

Accepted Answer

Delta Lake provides time travel capabilities for accessing historical data versions. Option A is correct because Delta Lake's time travel feature allows querying previous versions of data using a timestamp or version number, which is essential for auditing, rollback, and reproducing historical reports. This capability is built on the transaction log that tracks every change, making it a key differentiator from plain Parquet files.

Answer

Parquet is easier to implement for schema evolution than Delta Lake.

Answer

Delta Lake reduces storage costs by automatically compressing data.

Answer

Parquet files are not natively supported by Azure Databricks.

Question 19

You are a data engineer at a retail company. You have designed a near real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs, which receives clickstream events from the company's e-commerce website. The output is written to an Azure SQL Database table for reporting. Each event includes fields: UserId, ProductId, EventType (e.g., 'click', 'purchase'), and Timestamp. The requirement is to calculate the number of purchases per product in a 5-minute tumbling window and update a SQL table. The Stream Analytics job has been running for a week, but the reporting team notices that the purchase counts in SQL are consistently lower than expected compared to a direct count from Event Hubs. You suspect that late-arriving events are being dropped. The job's configuration includes a 5-minute tumbling window with no late arrival policy. What should you do to fix the issue without losing data?

Accepted Answer

Modify the query to use a larger tumbling window (e.g., 10 minutes) and add a late arrival policy with a 5-minute grace period to allow late events to be included. Option A is correct because the current 5-minute tumbling window has no late arrival policy, so any event that arrives after the window ends is dropped. By increasing the window size to 10 minutes and adding a 5-minute late arrival grace period, you allow events that arrive up to 5 minutes late to still be included in the correct window, matching the actual purchase count from Event Hubs.

Answer

Modify the query to use TIMESTAMP BY on the EventHubs enqueued time instead of the event's Timestamp field.

Answer

Change the tumbling window to a hopping window with a 1-minute hop size to increase the frequency of output updates.

Answer

Add a second Stream Analytics job to process late-arriving events separately and union the results.

Question 20

You are designing an Azure Stream Analytics job to process real-time IoT data from thousands of devices. The job must handle late-arriving events (up to 1 hour late) and out-of-order events (up to 5 minutes). Which two temporal policies should you configure?

Accepted Answer

Out of order tolerance window: 5 minutes; Late arrival tolerance window: 1 hour. Azure Stream Analytics uses two distinct temporal policies to handle event timing issues. The 'Late arrival tolerance window' defines how long the system waits for events that arrive after their timestamp (up to 1 hour in this scenario), while the 'Out of order tolerance window' specifies the maximum time difference allowed for events that arrive out of sequence (up to 5 minutes). Option A correctly configures both policies to match the requirements.

Answer

Out of order tolerance window: 1 hour; Late arrival tolerance window: 5 minutes

Answer

Use Event Hubs capture to handle late events; no additional configuration needed
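
The two tolerance windows in Question 20 measure different things, which the following simplified sketch tries to separate. It is an illustration only and not the service's actual algorithm: the function name and labels are hypothetical, lateness is taken as arrival time minus event time, and disorder is taken as how far an event's timestamp trails the newest timestamp already seen. What happens to events outside a tolerance (adjusting the timestamp or dropping the event) depends on the job's configured policy and is not modeled here.

```python
from datetime import datetime, timedelta
from typing import Optional

LATE_ARRIVAL_TOLERANCE = timedelta(hours=1)     # from the question
OUT_OF_ORDER_TOLERANCE = timedelta(minutes=5)   # from the question


def classify_event(
    event_time: datetime,
    arrival_time: datetime,
    max_event_time_seen: Optional[datetime],
) -> str:
    """Label an event as 'late', 'out_of_order', or 'accepted' under the two tolerances."""
    # Lateness: the gap between when the event happened and when it arrived.
    if arrival_time - event_time > LATE_ARRIVAL_TOLERANCE:
        return "late"
    # Disorder: how far the event trails the newest event time already processed.
    if (
        max_event_time_seen is not None
        and max_event_time_seen - event_time > OUT_OF_ORDER_TOLERANCE
    ):
        return "out_of_order"
    return "accepted"


base = datetime(2024, 1, 1, 12, 0)
print(classify_event(base, base + timedelta(minutes=30), None))
print(classify_event(base, base + timedelta(hours=2), None))
print(classify_event(base, base + timedelta(minutes=1), base + timedelta(minutes=10)))
```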

You are implementing a data pipeline using Azure Data Factory. The source is an on-premises SQL Server database. Which Azure Data Factory component is required to connect to the on-premises data source?

You have a Synapse Analytics dedicated SQL pool. You need to load 100 GB of CSV data from Azure Data Lake Storage Gen2 into a fact table. The table has a hash-distributed column. Which pattern is most efficient for loading with minimal impact on concurrent queries?

Which Azure service is primarily used for orchestrating data pipelines in a cloud-native ETL workflow?

When designing a data processing solution using Azure Databricks, what is the recommended approach to handle schema evolution when reading data from Delta Lake tables?

You are building a streaming pipeline in Azure Stream Analytics that reads from an Azure Event Hubs input with 10 partitions. The query performs a GROUP BY on a column that is not the partition key. To ensure consistency, which partitioning scheme should you use?

Which of the following are valid activities in an Azure Data Factory pipeline? (Choose two.)

You are designing a data processing solution that requires exactly-once processing semantics for streaming data. Which two Azure services support exactly-once processing? (Choose two.)

Drag and drop the steps to implement incremental data loading using Azure Data Factory into the correct order.

Drag and drop the steps to implement Slowly Changing Dimension (SCD) Type 2 in Azure Synapse Analytics dedicated SQL pool into the correct order.

Drag and drop the steps to set up Azure Data Factory pipeline with parameterization and dynamic expressions into the correct order.

Match each storage redundancy option to its description in Azure Storage.

Match each Azure data integration tool to its typical use case.

Match each Azure service tier to its description.

Refer to the exhibit. A data engineer runs a Synapse Spark job that fails with the error shown. Which configuration change is most likely to resolve the issue?

Refer to the exhibit. A data engineer notices that the target SQL table contains duplicate rows after a pipeline run. Which change to the pipeline configuration would prevent duplicates?

Exhibit

Refer to the exhibit. A Stream Analytics job shows increasing watermark delay and input deserialization errors. Which action should be taken first to troubleshoot?

Exhibit

Refer to the exhibit. A user with Storage Blob Data Reader role on the container rawdata cannot list files under /2023/07/. What is the most likely reason?

Refer to the exhibit. A data engineer notices that Spark jobs on this cluster are running slower than expected. The cluster is using spot instances with fallback. Which factor is most likely causing the performance degradation?

Exhibit
