DP-203 Data Processing Questions

42 questions · Data Processing topic · All types, answers revealed

1
MCQ · hard

You are a data engineer for a large e-commerce company. The company uses Azure Data Lake Storage Gen2 (ADLS Gen2) as its data lake. A team of data scientists needs to process a massive dataset (approximately 5 TB) stored in Parquet format in the data lake. The dataset contains sales transactions from the past 10 years. The data scientists run a Spark job daily using Azure Synapse Analytics (serverless Spark pool) to compute aggregated sales metrics by product category and region. The job reads the entire dataset each day, performs transformations, and writes the aggregated results back to the data lake. Over the past few weeks, the job has been taking longer to complete, and the data scientists have reported that the job now takes over 6 hours, exceeding the acceptable SLA of 4 hours. They suspect the issue is related to data skew or suboptimal partitioning. You need to optimize the job to reduce execution time. Which approach should you take?

A. Increase the executor memory and cores in the Spark pool configuration to handle larger shuffles.
B. Repartition the data on the 'product_category' column with a higher number of partitions (e.g., 2000).
C. Implement incremental processing using Auto Loader with 'directoryListing' mode to process only new files since the last run.
D. Use a broadcast join hint on the fact table to reduce shuffle operations.
Answer: C

Auto Loader incrementally ingests new files, avoiding a full scan of the 5 TB dataset daily. This directly reduces the data processed and speeds up the job.

Why this answer

Option C is correct because the job reads the entire 5 TB dataset daily, which is inefficient when only new data needs processing. Auto Loader with 'directoryListing' mode incrementally identifies and processes only new files since the last run, drastically reducing the data volume and execution time. This directly addresses the root cause of the SLA breach—reading unchanged historical data repeatedly—rather than tuning resources or partitioning.

Exam trap

The trap here is that candidates focus on tuning Spark parameters (memory, partitions, joins) to handle the existing workload, but the real issue is the unnecessary reprocessing of unchanged data, which only incremental loading can solve.

How to eliminate wrong answers

Option A is wrong because increasing executor memory and cores only improves shuffle performance for existing data volumes; it does not reduce the amount of data read or processed, so the job would still read the full 5 TB daily and likely remain over 6 hours. Option B is wrong because repartitioning on 'product_category' with 2000 partitions may help with data skew but does not eliminate the need to read the entire dataset each day; it also introduces a costly shuffle operation that could worsen performance. Option D is wrong because a broadcast join hint is used to optimize joins by broadcasting a small table to all executors, but the fact table (sales transactions) is massive (5 TB) and cannot be broadcast; this approach would cause out-of-memory errors or be ignored by Spark.
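
The idea behind incremental file discovery can be sketched in plain Python. This is not the Auto Loader API: the function name `select_new_files`, the file paths, and the in-memory set standing in for a checkpoint are all assumptions made for illustration.

```python
def select_new_files(listed_files, processed_files):
    """Return files in the listing that were not processed before, sorted."""
    return sorted(set(listed_files) - set(processed_files))


# Hypothetical paths; a real job would list the data lake directory instead.
processed = {"sales/2024-01-01.parquet", "sales/2024-01-02.parquet"}
listing = [
    "sales/2024-01-01.parquet",
    "sales/2024-01-02.parquet",
    "sales/2024-01-03.parquet",
]

new_files = select_new_files(listing, processed)
processed.update(new_files)  # record progress so the next run skips these files
print(new_files)
```

Only the third file is selected, so the daily run touches one new file rather than the full history.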

2
MCQ · medium

You need to design a near-real-time data processing solution that ingests IoT telemetry data from millions of devices. The data must be aggregated per minute and stored in Azure Cosmos DB for low-latency queries. Which Azure service combination should you use?

A. Azure Event Hubs -> Azure HDInsight (Kafka) -> Azure Cosmos DB
B. Azure Event Hubs -> Azure Stream Analytics -> Azure Cosmos DB
C. Azure IoT Hub -> Azure Databricks (Structured Streaming) -> Azure Cosmos DB
D. Azure Event Hubs -> Azure Data Factory -> Azure Cosmos DB
Answer: B

Stream Analytics provides near-real-time aggregation.

Why this answer

Option B is correct because Azure Stream Analytics provides native, low-latency windowed aggregation (e.g., TumblingWindow for per-minute aggregates) directly on data ingested from Event Hubs, and it has a built-in output sink to Azure Cosmos DB. This combination meets the near-real-time requirement without needing an intermediate compute or storage layer, minimizing end-to-end latency.

Exam trap

The trap here is that candidates often over-engineer the solution by adding a big-data processing layer (like HDInsight or Databricks) when a simpler, fully managed stream analytics service (Azure Stream Analytics) is the correct choice for fixed-window aggregation and direct Cosmos DB output.

How to eliminate wrong answers

Option A is wrong because Azure HDInsight (Kafka) introduces unnecessary complexity and latency for simple per-minute aggregation; it requires manual stream processing setup and is not optimized for direct, low-latency output to Cosmos DB. Option C is wrong because Azure Databricks Structured Streaming, while capable, adds startup and cluster management overhead that is not ideal for sub-minute latency, and IoT Hub is typically used for device management and bi-directional communication, not purely for high-throughput telemetry ingestion. Option D is wrong because Azure Data Factory is a batch-oriented orchestration service, not designed for near-real-time stream processing or windowed aggregation.
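
To make the per-minute tumbling window concrete, here is a minimal plain-Python sketch of the aggregation itself. It does not use Stream Analytics; the event tuple layout `(device_id, timestamp, value)` and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def tumbling_minute_average(events):
    """Average `value` per (device_id, minute) in non-overlapping one-minute windows."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, event count]
    for device_id, timestamp, value in events:
        # Truncating to the minute assigns each event to exactly one window.
        window_start = timestamp.replace(second=0, microsecond=0)
        bucket = sums[(device_id, window_start)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical telemetry readings.
events = [
    ("dev-1", datetime(2024, 1, 1, 12, 0, 5), 20.0),
    ("dev-1", datetime(2024, 1, 1, 12, 0, 45), 22.0),
    ("dev-1", datetime(2024, 1, 1, 12, 1, 10), 30.0),
]
averages = tumbling_minute_average(events)
```

The first two readings fall in the 12:00 window and the third in the 12:01 window, so no event is counted twice.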

3
MCQ · hard

Refer to the exhibit. A Stream Analytics job shows increasing watermark delay and input deserialization errors. Which action should be taken first to troubleshoot?

A. Check the input data schema and ensure it matches the query
B. Change the output to a different sink
C. Increase the number of Streaming Units (SUs)
D. Set the watermark delay threshold higher
Answer: A

Deserialization errors are often due to schema mismatch; fixing the data or query resolves the root cause.

Why this answer

The input deserialization errors indicate that some incoming data cannot be parsed correctly. This can cause backpressure and increase watermark delay. Checking the event schema or data format is the first step.

The watermark delay of 120 seconds (max 300) is high but the root cause is likely the deserialization errors.

4
Multi-Select · hard

A company uses Azure Databricks to process streaming data from Event Hubs. The data is written to a Delta table. The job occasionally fails due to checkpoint corruption. Which THREE measures should you implement to improve reliability?

Select 3 answers
A. Configure checkpointing to a durable storage like Azure Data Lake Storage.
B. Increase the batch interval to reduce load.
C. Increase the cluster size to handle spikes.
D. Use Structured Streaming with `failOnDataLoss` set to false.
E. Implement a retry policy with exponential backoff for transient failures.
Answers: A, D, E

Checkpointing to durable storage keeps streaming progress metadata intact across cluster restarts and failures.

Why this answer

Option A is correct because checkpointing to durable storage like Azure Data Lake Storage (ADLS) ensures that streaming progress metadata is persisted across cluster restarts and failures. ADLS provides high durability and availability, preventing checkpoint corruption that can occur with local or ephemeral storage, thereby enabling exactly-once processing guarantees in Structured Streaming.

Exam trap

The trap here is that candidates may confuse scaling solutions (increasing cluster size or batch interval) with reliability mechanisms, failing to recognize that checkpoint durability and data loss tolerance are the core mitigations for corruption and streaming failures.
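
Option E's retry policy with exponential backoff can be sketched as follows. This is a generic illustration, not a Databricks feature: the function name, the choice of `ConnectionError` as the transient error type, and the default delays are assumptions.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:  # assumed to represent a transient failure
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Demo: a hypothetical operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky_operation():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # capture the waits instead of actually sleeping
result = retry_with_backoff(flaky_operation, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the demo instant; in a real job the default `time.sleep` would apply the waits.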

5
Multi-Select · medium

A data engineering team is designing a batch processing solution using Azure Databricks. The data is stored in Azure Data Lake Storage Gen2 (ADLS Gen2) and must be processed daily with minimal cost. The team needs to choose between using a Delta Lake table or a Parquet file format for the processed output. Which TWO factors should the team consider when making this decision?

Select 2 answers
A. Delta Lake provides time travel capabilities for accessing historical data versions.
B. Parquet is easier to implement for schema evolution than Delta Lake.
C. Delta Lake reduces storage costs by automatically compressing data.
D. Delta Lake supports ACID transactions, ensuring data consistency during concurrent writes.
E. Parquet files are not natively supported by Azure Databricks.
Answers: A, D

Delta Lake's time travel feature allows querying previous versions of data.

Why this answer

Option A is correct because Delta Lake's time travel feature allows querying previous versions of data using a timestamp or version number, which is essential for auditing, rollback, and reproducing historical reports. This capability is built on the transaction log that tracks every change, making it a key differentiator from plain Parquet files.

Exam trap

The trap here is that candidates assume Parquet is not natively supported in Databricks. In reality, Delta Lake stores its data files in Parquet format and Databricks reads and writes Parquet natively; the key differentiators are ACID transactions and time travel, not format compatibility.

6
MCQ · hard

Refer to the exhibit. An Azure Data Factory instance uses a self-hosted integration runtime. The exhibit shows the properties of the integration runtime. The data engineer notices that copy activities are failing with errors indicating that the integration runtime is not available. What is the most likely cause?

A. The integration runtime status is "Offline"
B. Auto-update is disabled, preventing the IR from updating
C. The integration runtime version is outdated and needs to be manually updated
D. Self-contained interactive authoring is disabled, causing connectivity issues
Answer: C

The version is behind the pushed version, indicating auto-update has not applied the latest update.

Why this answer

Option C is correct because the exhibit shows the integration runtime version as '5.24.8345.1' and the status as 'Online', but copy activities are failing. The most likely cause is that the self-hosted IR version is outdated and no longer compatible with the Azure Data Factory service endpoints, leading to connectivity failures. Auto-update being disabled (Option B) would prevent automatic updates, but the core issue is the outdated version itself, which requires manual intervention to update.

Exam trap

The trap here is that candidates see the status 'Online' and assume the IR is fully functional, overlooking that version incompatibility can cause operational failures even when the IR appears connected.

How to eliminate wrong answers

Option A is wrong because the exhibit clearly shows the integration runtime status as 'Online', not 'Offline', so the IR is technically reachable. Option B is wrong because while auto-update being disabled can lead to an outdated version, the question asks for the most likely cause of the copy activity failures, and the direct cause is the outdated version (Option C), not the disabled auto-update setting itself. Option D is wrong because self-contained interactive authoring is a feature for authoring and debugging in the self-hosted IR environment, and disabling it does not cause the IR to become unavailable for copy activities; it only affects authoring capabilities.

7
Drag & Drop · medium

Drag and drop the steps to set up Azure Data Factory pipeline with parameterization and dynamic expressions into the correct order.

Why this order

Create the pipeline, define parameters, use them in activities and linked services, then trigger with values.

8
MCQ · hard

Refer to the exhibit. A data engineer runs a Synapse Spark job that fails with the error shown. Which configuration change is most likely to resolve the issue?

A. Change the executor cores to 1
B. Switch to a different Spark pool with the same configuration
C. Reduce the number of executors to 1 for Job1
D. Increase executor memory to 4g for Job1
Answer: D

Increasing executor memory provides more heap space, which directly addresses the OutOfMemoryError.

Why this answer

The error 'Java heap space' indicates the executor memory is insufficient. Job1 uses 2g memory with 2 executors, while Job2 succeeds with 4g and 4 executors. Increasing executor memory or adding executors can resolve the OOM error.

9
MCQ · hard

A multinational corporation uses Azure Data Lake Storage Gen2 to store petabytes of parquet files partitioned by date and hour. Data scientists report that queries on the last 7 days of data take over 30 minutes, while queries on older data are fast. The storage account uses the default Azure Blob Storage hierarchical namespace. Which action will MOST improve query performance on recent data?

A. Convert the parquet files to CSV format to reduce metadata overhead
B. Enable soft delete on the storage account to reduce read latency
C. Optimize the partition layout by partitioning by date first, then by hour, to reduce the number of partitions scanned for recent data
D. Apply Z-order clustering on the parquet files using Azure Databricks
Answer: C

Recent data queries scan fewer partitions, improving performance.

Why this answer

Option C is correct because partitioning by date first, then by hour, ensures that queries filtering on the last 7 days scan only the relevant date partitions, drastically reducing the amount of data read. In Azure Data Lake Storage Gen2, the hierarchical namespace allows partition pruning at the directory level, so a date-first layout minimizes the number of partitions scanned for recent data, directly addressing the performance bottleneck.

Exam trap

The trap here is that candidates often confuse partition layout optimization with data format or clustering techniques, overlooking that the hierarchical namespace in ADLS Gen2 makes directory-level partition pruning the most impactful lever for time-range queries.

How to eliminate wrong answers

Option A is wrong because converting parquet to CSV would increase file size and read overhead, as CSV lacks columnar compression and predicate pushdown capabilities, making queries slower, not faster. Option B is wrong because soft delete is a data protection feature that adds metadata overhead for deleted objects and does not reduce read latency; it actually increases storage costs and can degrade performance due to additional index lookups. Option D is wrong because Z-order clustering in Azure Databricks optimizes data layout within a partition for multi-dimensional queries, but it does not reduce the number of partitions scanned; the primary issue is scanning too many partitions, not intra-partition data skew.
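
Directory-level pruning with a date-first layout can be illustrated with a short sketch that lists the only prefixes a 7-day query needs to touch. The `sales/date=YYYY-MM-DD/` path convention and the function name are assumptions, not taken from the question.

```python
from datetime import date, timedelta


def recent_partition_prefixes(today, days, root="sales"):
    """Directory prefixes a date-first layout needs to scan for the last `days` days."""
    return [
        f"{root}/date={(today - timedelta(days=offset)).isoformat()}/"
        for offset in range(days)
    ]


# Hypothetical "today"; hour subdirectories would sit beneath each date prefix.
prefixes = recent_partition_prefixes(date(2024, 3, 10), days=7)
for prefix in prefixes:
    print(prefix)
```

Seven prefixes are produced regardless of how many years of history sit alongside them, which is the effect partition pruning relies on.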

10
MCQ · hard

A company is running a Spark job on Azure Databricks that processes 500 GB of data daily. The job frequently fails with 'OutOfMemoryError' during shuffles. The cluster uses 10 workers of type Standard_DS3_v2 (14 GB memory each). Which configuration change should you make to improve stability without over-provisioning?

A. Set spark.sql.shuffle.partitions to a higher value, e.g., 500.
B. Increase the driver memory to 28 GB.
C. Increase the number of workers to 20.
D. Reduce spark.sql.shuffle.partitions to 100.
Answer: A

Raising the shuffle partition count reduces the data per partition, easing executor memory pressure.

Why this answer

The 'OutOfMemoryError' during shuffles indicates that individual partitions are too large for the executor memory. Increasing `spark.sql.shuffle.partitions` to 500 reduces the amount of data per partition, lowering memory pressure during shuffle operations. This directly addresses the error without adding more hardware.

Exam trap

The trap here is that candidates often assume adding more workers (Option C) is the only way to fix memory errors, but the question tests understanding that partition size, not just cluster size, is the root cause of shuffle OOM errors.

How to eliminate wrong answers

Option B is wrong because increasing driver memory does not help with executor-side shuffle memory issues; the driver is not involved in shuffle data processing. Option C is wrong because adding more workers increases parallelism but does not reduce the size of each partition unless the number of partitions is also increased; it would over-provision resources without fixing the root cause. Option D is wrong because reducing `spark.sql.shuffle.partitions` to 100 would make each partition larger, worsening the memory pressure and increasing the likelihood of OutOfMemoryError.
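
The effect of the partition count on partition size is simple arithmetic. The sketch below assumes, for illustration only, that the full 500 GB is shuffled and spread evenly with no skew; real shuffle volume depends on the query. The value 200 is Spark's default for `spark.sql.shuffle.partitions`.

```python
def avg_partition_size_gb(shuffle_data_gb, shuffle_partitions):
    """Average shuffle partition size, assuming data is spread evenly (no skew)."""
    return shuffle_data_gb / shuffle_partitions


# Compare option D (100), the Spark default (200), and option A (500).
for partitions in (100, 200, 500):
    size = avg_partition_size_gb(500, partitions)
    print(f"{partitions} partitions -> {size:.1f} GB per partition")
```

Under these assumptions, going from 100 to 500 partitions shrinks the average partition from 5.0 GB to 1.0 GB, which is why option A lowers memory pressure and option D raises it.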

11
Drag & Drop · medium

Drag and drop the steps to implement incremental data loading using Azure Data Factory into the correct order.

Why this order

Incremental loading requires a watermark column. The pipeline retrieves the last watermark, copies data changed since then, then updates the watermark for next run.
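
The watermark pattern described above can be sketched in plain Python. This is not Data Factory pipeline code; the `modified_at` column name, the row layout, and the function name are assumptions for illustration.

```python
from datetime import datetime


def incremental_load(source_rows, last_watermark):
    """Return rows modified after `last_watermark` and the new watermark value."""
    changed = [row for row in source_rows if row["modified_at"] > last_watermark]
    # If nothing changed, keep the old watermark so the next run starts from the same point.
    new_watermark = max((row["modified_at"] for row in changed), default=last_watermark)
    return changed, new_watermark


# Hypothetical source table.
source_rows = [
    {"id": 1, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "modified_at": datetime(2024, 1, 5)},
    {"id": 3, "modified_at": datetime(2024, 1, 9)},
]
watermark = datetime(2024, 1, 3)  # value stored by the previous run

changed, watermark = incremental_load(source_rows, watermark)
```

Rows 2 and 3 are copied and the stored watermark advances to the latest `modified_at`, so a second run with no new changes copies nothing.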

12
Multi-Select · hard

Which THREE factors should be considered when choosing between Azure Stream Analytics and Azure Databricks for a real-time data processing solution?

Select 3 answers
A. Integration with Power BI for real-time dashboards
B. Need for complex transformations and machine learning model integration
C. Volume of data per second (throughput)
D. Requirement for exactly-once semantics
E. Maximum allowed latency for late-arriving data
Answers: B, C, D

Databricks supports complex ML pipelines natively.

Why this answer

Option B is correct because Azure Databricks provides native support for complex transformations (e.g., windowed aggregations, multi-step ETL) and seamless integration with machine learning libraries (e.g., MLflow, Spark MLlib), which are not natively available in Azure Stream Analytics. Stream Analytics uses a SQL-like query language and is optimized for simpler, declarative transformations, making Databricks the better choice when advanced analytics or ML model scoring is required in real-time pipelines.

Exam trap

The trap here is that candidates often assume Power BI integration or late-arriving data handling are unique to one service, when in fact both services support these features, and the key differentiators are throughput scalability, exactly-once semantics, and the ability to perform complex transformations with ML integration.

13
MCQ · medium

When designing a data processing solution using Azure Databricks, what is the recommended approach to handle schema evolution when reading data from Delta Lake tables?

A. Set the option 'mergeSchema' to 'true' on write
B. Set the option 'overwriteSchema' to 'true' on write
C. Manually alter the table schema using ALTER TABLE
D. Ignore schema changes and use 'failOnDataLoss' flag
Answer: A

Why this answer

Option A is correct because in Delta Lake, schema evolution is automatically handled by setting the 'mergeSchema' option to 'true' on write operations. This allows new columns to be added or existing column types to be safely widened without manual intervention, preserving existing data and metadata integrity.

Exam trap

The trap here is that candidates often confuse 'mergeSchema' with 'overwriteSchema', mistakenly thinking both handle schema evolution similarly, but 'overwriteSchema' replaces the entire schema and can cause data loss, while 'mergeSchema' safely merges new columns or type changes.

Why the other options are wrong

B

overwriteSchema replaces the entire schema and can cause data loss.

C

Manual approach is error-prone and not recommended for automated pipelines.

D

failOnDataLoss is for streaming jobs, not for schema evolution.
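
The difference between merging and overwriting a schema can be modeled with plain dictionaries. This is a simplified conceptual sketch, not Delta Lake code: the function names are hypothetical, and this model rejects every type change, whereas the explanation above notes that real schema merging permits some safe widening.

```python
def merge_schema(table_schema, incoming_schema):
    """Keep existing columns and append new ones; reject conflicting types."""
    merged = dict(table_schema)
    for column, dtype in incoming_schema.items():
        if column in merged and merged[column] != dtype:
            raise TypeError(f"type conflict on column {column!r}")
        merged.setdefault(column, dtype)
    return merged


def overwrite_schema(table_schema, incoming_schema):
    """Discard the existing schema and adopt the incoming one as-is."""
    return dict(incoming_schema)


# Hypothetical table and incoming batch schemas.
table = {"id": "long", "amount": "double"}
incoming = {"id": "long", "channel": "string"}

merged = merge_schema(table, incoming)
overwritten = overwrite_schema(table, incoming)
```

After merging, `amount` survives and `channel` is added; after overwriting, `amount` is gone, which is the data-loss risk the trap describes.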

14
MCQ · easy

You are designing a batch processing pipeline that reads CSV files from Azure Blob Storage, performs aggregations using Azure Databricks, and writes results to Azure Synapse Analytics. The pipeline must handle schema drift (new columns appearing in source files). Which approach should you recommend?

A. Use Azure Data Factory mapping data flows with schema drift enabled, mapping to a fixed sink schema.
B. Define a fixed schema in the source and ignore any new columns.
C. Use Spark with mergeSchema option when reading, and write using a Delta table to evolve schema automatically.
D. Use Azure Stream Analytics to pre-process and enforce schema.
Answer: C

Reading with mergeSchema and writing to a Delta table handles schema drift automatically.

Why this answer

Option C is correct because Spark's `mergeSchema` option, when used with Delta Lake, automatically evolves the schema to accommodate new columns in CSV files. This allows the batch pipeline to handle schema drift without manual intervention, and writing to a Delta table ensures the schema evolution is persisted and compatible with downstream writes to Azure Synapse Analytics.

Exam trap

The trap here is that candidates often confuse schema drift handling with schema enforcement, assuming that a fixed sink schema or streaming pre-processing can accommodate dynamic schema changes, when in fact only a schema-on-read approach like Spark's `mergeSchema` with Delta Lake provides the necessary flexibility for batch pipelines.

How to eliminate wrong answers

Option A is wrong because Azure Data Factory mapping data flows with schema drift enabled can handle new columns, but mapping to a fixed sink schema would discard or fail on those new columns, defeating the purpose of handling drift. Option B is wrong because defining a fixed schema and ignoring new columns directly contradicts the requirement to handle schema drift, leading to data loss or pipeline failures. Option D is wrong because Azure Stream Analytics is designed for real-time streaming data, not batch processing, and it enforces a fixed schema rather than evolving it dynamically.

15
MCQ · medium

You are a data engineer at a healthcare analytics company. The company uses Azure Data Factory (ADF) to orchestrate data pipelines that ingest patient data from on-premises SQL Server databases into Azure Synapse Analytics. Recently, the pipeline has been failing intermittently with the following error: 'Failure happened on 'Sink' side. ErrorCode=SqlFailedToConnect, Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Cannot connect to SQL Server Database. The TCP connection to the host <server_name>, port 1433 has failed. Error: 'Connection timed out.'.' The on-premises SQL Server is behind a corporate firewall. The ADF self-hosted integration runtime (SHIR) is installed on a VM inside the corporate network. You have verified that the SHIR is running and that the SQL Server is accessible from the SHIR VM using SQL Server Management Studio (SSMS). The error occurs sporadically, not consistently. What is the most likely cause of the intermittent connection timeout?

A. The data being transferred is skewed, causing the sink to be overwhelmed.
B. The corporate firewall or network device is closing idle TCP connections to the SQL Server database.
C. The SQL Server database is experiencing high CPU utilization during the pipeline execution window.
D. The self-hosted integration runtime is running out of memory during peak loads.
Answer: B

Firewalls often drop idle connections after a timeout period. When the pipeline uses a connection from the pool that has been idle, the connection is no longer valid, causing a timeout. This explains the intermittent nature.

Why this answer

The intermittent nature of the timeout, combined with the fact that the SHIR VM can connect to SQL Server via SSMS, strongly suggests that the corporate firewall or a network intermediary (such as a load balancer or NAT device) is closing idle TCP connections. ADF pipelines may hold connections open between activities or during long-running data transfers, and if no keep-alive packets are sent within the firewall's idle timeout window (commonly 4–30 minutes), the firewall drops the TCP session. When ADF attempts to reuse that connection, it receives a 'Connection timed out' error because the socket is no longer valid.

Exam trap

The trap here is that candidates assume the error is due to resource exhaustion (CPU, memory, or data skew) because those are common causes of intermittent failures, but the specific 'Connection timed out' error points to a network-layer issue, not a server-side performance bottleneck.

How to eliminate wrong answers

Option A is wrong because data skew would cause performance issues like slow writes or out-of-memory errors on the sink, not a TCP connection timeout to the source SQL Server. Option C is wrong because high CPU utilization on SQL Server would manifest as query timeouts or slow performance, not a TCP-level connection timeout (which occurs before any query is sent). Option D is wrong because SHIR running out of memory would produce out-of-memory exceptions or pipeline failures with different error codes, not a TCP connection timeout to the database.

16
Multi-Select · easy

Which of the following are valid activities in an Azure Data Factory pipeline? (Choose two.)

Select 2 answers
A. Copy
B. Databricks Notebook
C. Assign
D. Execute Pipeline
Answers: A, B

Why this answer

A is correct because the Copy activity is a fundamental data movement activity in Azure Data Factory (ADF) that allows you to ingest data from a wide range of supported source stores (e.g., Azure Blob Storage, SQL Server) to sink stores (e.g., Azure Synapse Analytics, Azure Data Lake Storage). It uses the underlying integration runtime to perform the actual data transfer, supporting both built-in connectors and self-hosted runtimes for on-premises sources. This makes it a core, valid activity for building ETL/ELT pipelines.

Exam trap

The trap here is that candidates may confuse the 'Execute Pipeline Activity' (a valid control flow activity) with the incomplete 'Execute Pipeline' option, or mistakenly think 'Assign' is a real activity due to its similarity to variable assignment in programming languages, when in fact ADF uses 'Set Variable' for that purpose.

Why the other options are wrong

C

No such activity in ADF; the correct term is 'Set Variable'.

D

This is a valid activity but the question asks for two, and the first two are more common.

17
MCQ · easy

You are a data engineer at a retail company. You have designed a near real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs, which receives clickstream events from the company's e-commerce website. The output is written to an Azure SQL Database table for reporting. Each event includes fields: UserId, ProductId, EventType (e.g., 'click', 'purchase'), and Timestamp. The requirement is to calculate the number of purchases per product in a 5-minute tumbling window and update a SQL table. The Stream Analytics job has been running for a week, but the reporting team notices that the purchase counts in SQL are consistently lower than expected compared to a direct count from Event Hubs. You suspect that late-arriving events are being dropped. The job's configuration includes a 5-minute tumbling window with no late arrival policy. What should you do to fix the issue without losing data?

A. Modify the query to use a larger tumbling window (e.g., 10 minutes) and add a late arrival policy with a 5-minute grace period to allow late events to be included.
B. Modify the query to use TIMESTAMP BY on the EventHubs enqueued time instead of the event's Timestamp field.
C. Change the tumbling window to a hopping window with a 1-minute hop size to increase the frequency of output updates.
D. Add a second Stream Analytics job to process late-arriving events separately and union the results.
Answer: A

A larger window with a late arrival policy captures late-arriving events.

Why this answer

Option A is correct because the current 5-minute tumbling window has no late arrival policy, so any event that arrives after the window ends is dropped. By increasing the window size to 10 minutes and adding a 5-minute late arrival grace period, you allow events that arrive up to 5 minutes late to still be included in the correct window, matching the actual purchase count from Event Hubs.

Exam trap

The trap here is that candidates may think increasing the window size or changing the window type (hopping) will fix the issue, but the core problem is the lack of a late arrival policy to handle events that arrive after the window closes.

How to eliminate wrong answers

Option B is wrong because using TIMESTAMP BY on the Event Hubs enqueued time does not solve the late arrival issue; it only changes the timestamp used for windowing, but late events still arrive after the window closes and would be dropped without a late arrival policy. Option C is wrong because changing to a hopping window with a 1-minute hop size increases output frequency but does not address late-arriving events; late events would still be dropped if they arrive after the window end time. Option D is wrong because adding a second Stream Analytics job to process late events separately is unnecessarily complex and introduces data duplication and reconciliation challenges; the correct approach is to use a late arrival policy within a single job.
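
The effect of a late arrival tolerance on window counts can be simulated in plain Python. This is a simplified model, not Stream Analytics' exact semantics: here an event is simply dropped when its arrival delay exceeds the tolerance. The tuple layout and function names are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def window_start(ts):
    """Start of the 5-minute tumbling window containing `ts`."""
    return ts - timedelta(minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond)


def count_purchases(events, tolerance):
    """Count purchases per product and window, dropping events later than `tolerance`."""
    counts = Counter()
    for product_id, event_time, arrival_time in events:
        if arrival_time - event_time <= tolerance:
            counts[(product_id, window_start(event_time))] += 1
    return counts


# Hypothetical purchases: (product, event time, arrival time).
events = [
    ("p1", datetime(2024, 1, 1, 12, 1, 0), datetime(2024, 1, 1, 12, 1, 2)),   # 2 s delay
    ("p1", datetime(2024, 1, 1, 12, 3, 0), datetime(2024, 1, 1, 12, 6, 30)),  # 3.5 min delay
]
key = ("p1", datetime(2024, 1, 1, 12, 0))

strict = count_purchases(events, timedelta(seconds=5))
lenient = count_purchases(events, timedelta(minutes=5))
```

With a tight tolerance the second purchase is lost and the 12:00 window undercounts; with a 5-minute tolerance both purchases land in the window their event time belongs to.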

18
MCQ · easy

A data engineer needs to process a large dataset stored in Azure Blob Storage using Azure Databricks. The dataset consists of millions of small CSV files. The processing job is slow due to the overhead of reading many small files. Which technique should be used to improve performance?

A. Increase the number of worker nodes in the cluster
B. Convert the CSV files to Parquet format
C. Coalesce the small files into larger files using a Databricks notebook
D. Use Delta Lake caching to store the data in memory
Answer: C

Coalescing reduces the file count and improves read performance.

Why this answer

Option C is correct because coalescing the millions of small CSV files into larger files reduces the metadata overhead and I/O operations when reading from Azure Blob Storage. Databricks can then process fewer, larger files more efficiently, as each task handles a substantial data chunk rather than incurring the cost of opening and closing many small files.

Exam trap

The trap here is that candidates often assume performance issues are always solved by scaling out (Option A) or by switching formats (Option B), but the DP-203 exam specifically tests the understanding that small file overhead is a distinct problem requiring file consolidation.

How to eliminate wrong answers

Option A is wrong because simply adding more worker nodes does not address the root cause of small file overhead; it may even exacerbate the problem by increasing the number of tasks that each try to read a small file, leading to more scheduler and I/O contention. Option B is wrong because converting CSV to Parquet improves compression and columnar read performance but does not reduce the number of files; the overhead of opening millions of small Parquet files remains similar to CSV. Option D is wrong because Delta Lake caching stores data in memory after it is read, but it does not reduce the initial read overhead of millions of small files; the first read still suffers from the same small file penalty.
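
How much consolidation reduces the file count can be estimated with a small planning sketch. It only groups file sizes into batches and does not read or write any data; the 128 MB target and the function name are assumptions for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of at most roughly `target_mb` each."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [1] * 1000  # hypothetical: 1,000 files of 1 MB each
batches = plan_compaction(small_files)
print(len(small_files), "files ->", len(batches), "output files")
```

Each batch would become one larger output file, so the reader opens a handful of files where it previously opened a thousand.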

19
MCQ · hard

Refer to the exhibit. A data engineer notices that Spark jobs on this cluster are running slower than expected. The cluster is using spot instances with fallback. Which factor is most likely causing the performance degradation?

A. The spark.sql.adaptive settings are misconfigured
B. The cluster is using spot instances which may be frequently reclaimed
C. The node type Standard_DS3_v2 is too small
D. The autoscale configuration limits max workers to 8
Answer: B

Spot instances are cheaper but can be terminated at any time, causing job failures or delays due to recomputation.

Why this answer

Spot instances can be preempted, causing delays. The configuration sets 'first_on_demand' to 1, meaning only 1 node is on-demand, and the rest are spot. Spot instances can be reclaimed, leading to recomputation and slower performance.

The adaptive query execution settings are generally beneficial, not harmful.

20
Multi-Select · medium

You are designing a data processing solution in Azure that must handle both batch and streaming data. The solution should use a common storage layer for both and support schema evolution. Which TWO technologies should you recommend?

Select 2 answers
A.Azure Event Hubs
B.Azure SQL Database
C.Apache Kafka on HDInsight
D.Delta Lake (on Azure Databricks)
E.Azure Data Lake Storage Gen2
AnswersA, D

Common ingestion for batch and streaming.

Why this answer

Azure Event Hubs is correct because it is a fully managed, real-time data ingestion service that can capture streaming data and store it in Azure Data Lake Storage Gen2 or Blob Storage, enabling a unified storage layer for both batch and streaming pipelines. It supports schema evolution through Avro or JSON serialization, allowing downstream consumers to adapt to schema changes without breaking existing processes.

Exam trap

The trap here is that candidates often mistake Azure Data Lake Storage Gen2 for a processing technology rather than a storage layer, or they overlook that Event Hubs is the streaming ingestion service that pairs with Delta Lake for unified batch/streaming and schema evolution.

21
MCQmedium

You have a Synapse Analytics dedicated SQL pool. You need to load 100 GB of CSV data from Azure Data Lake Storage Gen2 into a fact table. The table has a hash-distributed column. Which pattern is most efficient for loading with minimal impact on concurrent queries?

A.Use PolyBase INSERT...SELECT with rowstore table
B.Use CREATE TABLE AS SELECT (CTAS) with the hash-distributed column
C.Use COPY INTO command with a round-robin distribution
D.Use Azure Data Factory Copy activity with staging enabled
AnswerB

Why this answer

Option B is correct because CTAS with a hash-distributed column loads data directly into the target table with the same distribution scheme, avoiding data movement and minimizing resource contention. This pattern is optimized for bulk loading large datasets into a hash-distributed fact table, as it leverages the Synapse SQL pool's MPP architecture to parallelize the operation without blocking concurrent queries.
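
Why a matching hash distribution avoids data movement can be shown with a toy model. This is a sketch under stated assumptions: `zlib.crc32` stands in for the pool's internal hash function, and the customer keys are made up. The figure of 60 distributions is how a dedicated SQL pool spreads a table.

```python
import zlib

DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions

def hash_distribution(key: str) -> int:
    """Deterministically map a distribution-column value to a distribution.
    crc32 is only a stand-in for the engine's real hash function."""
    return zlib.crc32(key.encode("utf-8")) % DISTRIBUTIONS

def round_robin_distribution(row_index: int) -> int:
    """Assign rows in turn, ignoring their content."""
    return row_index % DISTRIBUTIONS

# Hypothetical data: 10,000 rows spread over 500 customer keys.
customer_ids = [f"customer-{i % 500}" for i in range(10_000)]

hash_placement = {}
round_robin_placement = {}
for index, cid in enumerate(customer_ids):
    hash_placement.setdefault(cid, set()).add(hash_distribution(cid))
    round_robin_placement.setdefault(cid, set()).add(round_robin_distribution(index))
```

With hashing, every row for a given key lands in one distribution, so joins and aggregations on that key need no reshuffling. With round-robin, the same key is scattered over several distributions.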

Exam trap

The trap here is that candidates often assume PolyBase or COPY INTO are always the fastest for bulk loading, but they overlook that CTAS with the correct distribution key avoids the costly data redistribution step required by other methods.

Why the other options are wrong

A

INSERT...SELECT logs each row and may cause concurrency issues; not as efficient as CTAS for large loads.

C

COPY INTO is efficient but round-robin distribution may cause data movement later; CTAS with hash distribution avoids extra steps.

D

Copy activity with staging uses PolyBase under the hood but adds overhead; CTAS is more direct.

22
MCQhard

You are designing a data processing solution for a retail company. The solution must ingest streaming sales data from point-of-sale (POS) systems and batch uploads from stores that are offline. The total data volume is 5 TB daily. The solution must allow real-time dashboards and periodic batch processing. Which combination of services and ingestion patterns is most cost-effective and scalable?

A.Use Azure IoT Hub for POS streaming and Azure Blob Storage for offline store uploads, then process with Stream Analytics and Data Factory
B.Use Azure Event Hubs with Kafka protocol for all incoming data. Use Stream Analytics for real-time dashboards and Event Hubs Capture to land data in ADLS for batch processing
C.Use Azure Data Lake Storage for all data, then use Azure Databricks structured streaming for real-time and batch
D.Use Azure Stream Analytics directly on POS data and store offline uploads in Blob Storage, then batch process with U-SQL
AnswerB

Why this answer

Option B is correct because Azure Event Hubs with Kafka protocol provides a unified ingestion endpoint for both streaming POS data and batch offline uploads, eliminating the need for separate services. Stream Analytics enables real-time dashboards, while Event Hubs Capture automatically lands data into Azure Data Lake Storage for cost-effective batch processing, making this the most scalable and cost-effective solution for 5 TB daily.

Exam trap

The trap here is that candidates often assume IoT Hub is required for streaming data, but Event Hubs with Kafka protocol is more cost-effective and scalable for high-volume POS data, and they overlook Event Hubs Capture as a built-in mechanism for batch landing.

Why the other options are wrong

A

Two separate ingestion services increase management overhead and cost; IoT Hub is designed for device-to-cloud, not necessarily POS systems.

C

ADLS is storage, not an ingestion service; Databricks structured streaming can read from ADLS but not directly ingest streaming data without a queue.

D

Stream Analytics cannot directly ingest from POS systems without a messaging layer; offline store uploads still need ingestion.

23
MCQmedium

Refer to the exhibit. A data engineer notices that the target SQL table contains duplicate rows after a pipeline run. Which change to the pipeline configuration would prevent duplicates?

A.Remove the 'preCopyScript'
B.Change 'writeBatchSize' to 5000
C.Add a 'Upsert' setting in SqlSink with a key column
D.Set 'recursive' to false
AnswerC

Upsert ensures that rows are updated or inserted based on a key, preventing duplicates.

Why this answer

The preCopyScript truncates the table before the copy, but if the pipeline runs multiple times and the truncation fails or is skipped, duplicates can occur. Making the truncation reliable or adding a watermark could help, but the more robust practice is a merge/upsert pattern.

Among the options, enabling upsert on the sink with a key column is the most effective.
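
The difference between a plain insert and a keyed upsert can be demonstrated with an in-memory SQLite table. This is a minimal sketch, not the Data Factory sink itself; the `sales` table, its columns, and SQLite's `INSERT OR REPLACE` are stand-ins chosen for illustration.

```python
import sqlite3

def load(conn, rows, upsert):
    """Load rows into `sales`; with upsert=True, rows are keyed on order_id."""
    verb = "INSERT OR REPLACE" if upsert else "INSERT"
    conn.executemany(f"{verb} INTO sales (order_id, amount) VALUES (?, ?)", rows)
    conn.commit()

def row_count_after_two_runs(upsert):
    conn = sqlite3.connect(":memory:")
    # The primary key exists only in the upsert case, mirroring a key column on the sink.
    key = "PRIMARY KEY" if upsert else ""
    conn.execute(f"CREATE TABLE sales (order_id INTEGER {key}, amount REAL)")
    batch = [(1, 10.0), (2, 25.5)]  # hypothetical source rows
    load(conn, batch, upsert)
    load(conn, batch, upsert)  # simulate the pipeline running twice
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

print("append:", row_count_after_two_runs(upsert=False))
print("upsert:", row_count_after_two_runs(upsert=True))
```

Running the same batch twice doubles the row count with a plain insert, while the keyed upsert leaves one row per key.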

24
Matchingmedium

Match each storage redundancy option to its description in Azure Storage.

Concepts
Matches

Three synchronous copies within a single data center

Three copies across multiple availability zones in a region

Geo-redundant storage with read access in secondary region

Geo-zone-redundant storage with read access in secondary region

Why these pairings

These redundancy options are key for data durability and availability.

25
Multi-Selectmedium

You are designing a data processing solution that requires exactly-once processing semantics for streaming data. Which two Azure services support exactly-once processing? (Choose two.)

Select 2 answers
A.Azure Event Hubs
B.Azure Stream Analytics
C.Azure Databricks with Structured Streaming
D.Azure Data Lake Storage Gen2
AnswersB, C

Why this answer

Azure Stream Analytics supports exactly-once processing by using checkpointing and event sourcing to ensure that each event is processed exactly once, even in the event of failures or restarts. It achieves this through its internal state management and the use of checkpoint offsets in the output sink, guaranteeing no duplicate or missed events.

Exam trap

The trap here is that candidates often confuse the ingestion guarantee of Event Hubs (at-least-once) with the processing guarantee of Stream Analytics, or mistakenly think that a storage service like Data Lake Storage Gen2 inherently provides processing semantics.

Why the other options are wrong

A

Event Hubs provides at-least-once delivery; exactly-once requires processing logic.

D

It is a storage service, not a processing engine.
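
The gap between at-least-once delivery and exactly-once results is usually closed by idempotent processing logic. The sketch below illustrates that idea in plain Python. It is not how Stream Analytics or Structured Streaming work internally, and the event tuple layout is an assumption.

```python
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[int, int, str]  # (partition_id, offset, payload), a hypothetical layout

def process_exactly_once(events: Iterable[Event]) -> Dict[str, int]:
    """Count payloads while ignoring redelivered (partition, offset) pairs."""
    seen: Set[Tuple[int, int]] = set()
    counts: Dict[str, int] = {}
    for partition_id, offset, payload in events:
        key = (partition_id, offset)
        if key in seen:
            continue  # duplicate delivery from an at-least-once source
        seen.add(key)
        counts[payload] = counts.get(payload, 0) + 1
    return counts

# The event at partition 0, offset 2 is delivered twice.
delivered = [(0, 1, "sale"), (0, 2, "refund"), (0, 2, "refund"), (1, 1, "sale")]
result = process_exactly_once(delivered)
```

The redelivered event is counted once, so the output matches what a single delivery would have produced.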

26
MCQeasy

Which Azure service is primarily used for orchestrating data pipelines in a cloud-native ETL workflow?

A.Azure Data Factory
B.Azure Synapse Analytics
C.Azure HDInsight
D.Azure Databricks
AnswerA

Why this answer

Azure Data Factory (ADF) is the correct answer because it is a cloud-native, serverless data integration service specifically designed for orchestrating and automating data pipelines. ADF provides a code-free visual interface, supports over 90 built-in connectors, and enables complex ETL/ELT workflows with control flow, data flow, and trigger-based scheduling, making it the primary orchestration tool in Azure.

Exam trap

The trap here is that candidates mistake Azure Synapse Analytics (which includes pipeline capabilities) for the primary orchestrator, but Synapse pipelines are built on Azure Data Factory, and the exam expects you to identify ADF as the dedicated, cloud-native orchestration service.

Why the other options are wrong

B

Synapse is an analytics platform that includes pipelines but is not primarily an orchestration-only service.

C

HDInsight is a managed Hadoop/Spark cluster, not a pipeline orchestrator.

D

Databricks is a collaborative data engineering environment, not an orchestration service.

27
MCQmedium

A company uses Azure Data Factory to orchestrate an ETL pipeline that copies data from an on-premises SQL Server to Azure Synapse Analytics. The pipeline runs hourly and uses a self-hosted integration runtime. Recently, the pipeline started failing with timeout errors. The on-premises SQL Server is healthy and the network is stable. What is the most likely cause and solution?

A.The self-hosted integration runtime version is outdated; update it to the latest version
B.The copy activity is not using staging; enable staging through Azure Blob Storage
C.The self-hosted integration runtime is under-provisioned; scale up the VM or add more nodes
D.The source query timeout in the copy activity is too low; increase it to 3600 seconds
AnswerC

Under-provisioned IR can cause timeouts; scaling resolves the issue.

Why this answer

Option C is correct because timeouts often occur when the self-hosted IR is overloaded or has insufficient resources, and scaling up the VM or adding nodes resolves the issue. Option A is wrong because the self-hosted integration runtime updates itself automatically by default. Option D is wrong because the source query timeout already defaults to 120 minutes, so changing it would at best mask the problem.

Option B is wrong because staging helps with large data transfers, but the issue here is most likely IR capacity.

28
Matchingmedium

Match each Azure service tier to its description.

Concepts
Matches

Hierarchical namespace for Azure Data Lake Storage

Optimized for frequent data access

Optimized for infrequent access with lower cost

Lowest cost for rarely accessed data

Why these pairings

These tiers apply to Azure Blob Storage and Data Lake Storage.

29
Multi-Selectmedium

Which TWO actions are appropriate when designing a data processing solution that must meet strict SLAs for latency and throughput?

Select 2 answers
A.Partition data by date and hour to improve query performance
B.Implement Auto-Tune for Spark workloads in Azure Synapse Analytics
C.Process all data synchronously to ensure consistency
D.Use a single large cluster for all workloads to simplify management
E.Use a single node for orchestration to reduce complexity
AnswersA, B

Partitioning reduces data scanned and improves throughput.

Why this answer

Partitioning data by date and hour (Option A) is appropriate because it enables partition elimination, where queries only scan relevant partitions rather than the entire dataset. This directly reduces latency and improves throughput by minimizing I/O and compute resources needed for time-range queries, which is critical for meeting strict SLAs in data processing solutions.
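
Partition elimination can be sketched as computing which date/hour folders a time-range query needs to touch. The folder layout `sales/date=.../hour=...` and the function names below are hypothetical, chosen only to illustrate the idea.

```python
from datetime import datetime, timedelta
from typing import List

def partition_path(ts: datetime) -> str:
    """Build a date/hour partition path for a timestamp (hypothetical layout)."""
    return ts.strftime("sales/date=%Y-%m-%d/hour=%H")

def partitions_for_range(start: datetime, end: datetime) -> List[str]:
    """List only the hourly partitions that overlap [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while current < end:
        paths.append(partition_path(current))
        current += timedelta(hours=1)
    return paths

# A query covering 22:30 to 01:00 touches three hourly folders, not the whole dataset.
needed = partitions_for_range(datetime(2024, 3, 1, 22, 30), datetime(2024, 3, 2, 1, 0))
```

Every folder outside `needed` is skipped entirely, which is where the I/O savings come from.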

Exam trap

The trap here is that candidates confuse synchronous processing with data consistency guarantees, overlooking that distributed systems can achieve consistency via idempotent writes or checkpointing without sacrificing latency and throughput.

30
Multi-Selecthard

You are designing a stream processing solution using Azure Stream Analytics. The job must reference a static lookup table (product catalog) stored in Azure Blob Storage. The catalog is updated once daily. The job should automatically pick up the latest version without restarting. Which two configurations are required? (Choose two.)

Select 2 answers
A.Configure the reference input with a static blob path
B.Set the reference input's 'Path pattern' to include date and time placeholders
C.Enable 'Automatic refresh' and set the refresh rate to 1 day
D.Use Azure Event Grid to trigger job restart on blob update
E.Store the reference data in Azure SQL Database instead of Blob Storage
AnswersB, C

Why this answer

Option B is correct because Azure Stream Analytics reference data inputs support path pattern placeholders like {date} and {time} to dynamically resolve the latest blob file. This allows the job to automatically load a new version of the static lookup table when the blob is updated, without requiring a job restart. The path pattern must be structured to match the naming convention of the uploaded file, such as 'catalog/{date}/{time}/products.csv'.
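
How a date/time path pattern selects a snapshot can be modelled in a few lines. This is a simplified sketch rather than the service's actual resolution logic; the date format `YYYY-MM-DD`, the time format `HH-mm`, and the helper names are assumptions.

```python
from datetime import datetime
from typing import List, Optional

PATTERN = "catalog/{date}/{time}/products.csv"
DATE_FORMAT = "%Y-%m-%d"  # assumed format for the {date} placeholder
TIME_FORMAT = "%H-%M"     # assumed format for the {time} placeholder

def blob_path(ts: datetime) -> str:
    """Fill the placeholders for the snapshot published at ts."""
    return PATTERN.format(date=ts.strftime(DATE_FORMAT), time=ts.strftime(TIME_FORMAT))

def effective_snapshot(snapshots: List[datetime], now: datetime) -> Optional[str]:
    """Pick the newest snapshot whose encoded timestamp is not in the future."""
    eligible = [ts for ts in snapshots if ts <= now]
    return blob_path(max(eligible)) if eligible else None
```

For example, with daily snapshots at midnight on 2024-05-01 and 2024-05-02, a lookup at 09:30 on 2024-05-02 resolves to `catalog/2024-05-02/00-00/products.csv` without any job restart.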

Exam trap

The trap here is that candidates often think a static path or Event Grid restart is needed, but the question specifically tests the combination of dynamic path patterns and automatic refresh to achieve zero-downtime updates in Azure Stream Analytics.

Why the other options are wrong

A

Static path does not trigger auto-refresh; you need a pattern to detect new blobs.

D

Restarting the job is not required; auto-refresh avoids restart.

E

While SQL Database is an option, the question specifies Blob Storage; auto-refresh works with Blob Storage.

31
Matchingmedium

Match each Azure data integration tool to its typical use case.

Concepts
Matches

Query external data in Azure Storage using T-SQL

High-throughput data ingestion into Synapse SQL

Orchestrate data movement and transformation

Complex data engineering with notebooks

Why these pairings

These tools serve different data integration needs.

32
MCQhard

You are building a streaming pipeline in Azure Stream Analytics that reads from an Azure Event Hubs input with 10 partitions. The query performs a GROUP BY on a column that is not the partition key. To ensure consistency, which partitioning scheme should you use?

A.Use 'Passthrough' partitioning
B.Use 'PartitionBy' with the GROUP BY column
C.Increase the number of SUs to handle skew
D.Use 'INTO' with a 'PARTITION BY' clause
AnswerB

Why this answer

Option B is correct because when performing a GROUP BY on a column that is not the partition key, you must use the PARTITION BY clause in the query to ensure that all rows with the same grouping value are processed by the same Stream Analytics node. This guarantees consistency and correctness of the aggregation, as it avoids data being split across multiple nodes without proper alignment.
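
The effect of repartitioning on the grouping column can be sketched outside of Stream Analytics. In the toy model below, workers stand in for processing nodes and `zlib.crc32` stands in for whatever hash the service uses; both are assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

def repartition(events: List[Tuple[str, int]], workers: int) -> Dict[int, List[Tuple[str, int]]]:
    """Route each (key, value) event to a worker chosen by hashing the key."""
    buckets = defaultdict(list)
    for key, value in events:
        buckets[zlib.crc32(key.encode("utf-8")) % workers].append((key, value))
    return buckets

def aggregate(events: List[Tuple[str, int]], workers: int = 4) -> Dict[str, int]:
    """Sum values per key, one worker at a time, then combine the disjoint results."""
    totals: Dict[str, int] = {}
    for worker_events in repartition(events, workers).values():
        local = defaultdict(int)
        for key, value in worker_events:
            local[key] += value
        # Keys never overlap between workers, so a plain update is safe.
        totals.update(local)
    return totals
```

Because all rows for a key reach the same worker, each worker's partial result is already final for its keys, and no cross-worker merge is needed.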

Exam trap

The trap here is that candidates often confuse 'Passthrough' partitioning with automatic handling of GROUP BY, or they think increasing SUs can fix data skew, but the core requirement is explicit repartitioning via PARTITION BY to align the data with the grouping key.

Why the other options are wrong

A

Passthrough keeps the original partition scheme, which may not align with the GROUP BY column.

C

Scaling SUs does not fix partitioning alignment issues.

D

INTO is for output, not for repartitioning within the query.

33
Multi-Selectmedium

You are designing an Azure Stream Analytics job to process real-time IoT data from thousands of devices. The job must handle late-arriving events (up to 1 hour late) and out-of-order events (up to 5 minutes). Which two temporal policies should you configure?

Select 2 answers
A.Out of order tolerance window: 5 minutes; Late arrival tolerance window: 1 hour
B.Out of order tolerance window: 1 hour; Late arrival tolerance window: 5 minutes
C.Watermark delay: 1 hour; Out of order tolerance: 5 minutes
D.Use Event Hubs capture to handle late events; no additional configuration needed
AnswersA, C

Why this answer

Azure Stream Analytics uses two distinct temporal policies to handle event timing issues. The 'Late arrival tolerance window' defines how long the system waits for events that arrive after their timestamp (up to 1 hour in this scenario), while the 'Out of order tolerance window' specifies the maximum time difference allowed for events that arrive out of sequence (up to 5 minutes). Option A correctly configures both policies to match the requirements.
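
The two tolerance windows measure different things, which a small model makes concrete. This sketch assumes a drop policy and a simplified ordering check; the real service can also adjust timestamps instead of dropping, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

LATE_ARRIVAL = timedelta(hours=1)     # arrival time minus event time
OUT_OF_ORDER = timedelta(minutes=5)   # newest event time seen minus this event's time

def accept_events(events: List[Tuple[datetime, datetime]]) -> List[bool]:
    """Return accept/drop decisions for (event_time, arrival_time) pairs in arrival order."""
    decisions = []
    newest_event_time = None
    for event_time, arrival_time in events:
        too_late = arrival_time - event_time > LATE_ARRIVAL
        too_disordered = (
            newest_event_time is not None
            and newest_event_time - event_time > OUT_OF_ORDER
        )
        accepted = not (too_late or too_disordered)
        decisions.append(accepted)
        if accepted and (newest_event_time is None or event_time > newest_event_time):
            newest_event_time = event_time
    return decisions
```

An event can be 30 minutes late and still be accepted, while an event only 10 minutes behind the newest one seen is rejected by the out-of-order check.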

Exam trap

The trap here is confusing the two tolerance windows (late arrival vs. out-of-order) or mistaking Spark-specific terminology like 'watermark delay' for Azure Stream Analytics policies.

Why the other options are wrong

B

This swaps the policies; late arrival should be larger than out-of-order.

D

Event Hubs capture is for storing raw events, not for handling out-of-order or late arrival in Stream Analytics.

34
MCQmedium

A company uses Azure Synapse Analytics to process large datasets. They need to transform JSON data stored in Azure Data Lake Storage Gen2 into a star schema. Which data processing approach minimizes data movement and leverages the compute closest to the data?

A.Use Azure Data Factory to copy the JSON data into Azure SQL Database, then use T-SQL to transform.
B.Use Azure Data Factory with SSIS to transform and load into dedicated SQL pool.
C.Load data into a Spark DataFrame in Synapse notebooks, transform, and write back.
D.Create external tables on the JSON files using PolyBase, then use CREATE EXTERNAL TABLE AS SELECT (CETAS) to write transformed Parquet files.
AnswerD

Minimizes movement by querying in place.

Why this answer

Option D is correct because it uses PolyBase external tables and CETAS to transform JSON data directly in Azure Data Lake Storage Gen2, minimizing data movement by leveraging the compute power of the dedicated SQL pool or serverless SQL pool closest to the data. This approach reads JSON in place, transforms it into Parquet format, and writes the star schema tables back to the data lake without copying data to an intermediate store.

Exam trap

The trap here is that candidates often assume Spark notebooks (Option C) are always the best for JSON transformation, but PolyBase with CETAS is more efficient for minimizing data movement because it processes data in-place using SQL compute without loading entire datasets into memory.

How to eliminate wrong answers

Option A is wrong because it copies JSON data into Azure SQL Database first, incurring unnecessary data movement and network transfer, and uses T-SQL in a separate compute environment rather than leveraging compute closest to the data lake. Option B is wrong because it uses SSIS, which requires an Azure-SSIS Integration Runtime and moves data through a separate orchestration layer, adding latency and cost without utilizing Synapse-native processing. Option C is wrong because while Spark DataFrames in Synapse notebooks can process JSON, they require spinning up a Spark pool and loading data into memory, which involves more data movement and overhead compared to the serverless or dedicated SQL pool PolyBase approach that processes data directly in the storage layer.

35
MCQeasy

You are implementing a data pipeline using Azure Data Factory. The source is an on-premises SQL Server database. Which Azure Data Factory component is required to connect to the on-premises data source?

A.Azure Integration Runtime
B.Self-hosted Integration Runtime
C.Managed Virtual Network Integration Runtime
D.Azure Data Factory Gateway
AnswerB

Why this answer

A self-hosted integration runtime (IR) is required to connect Azure Data Factory to on-premises SQL Server because it provides the compute environment for data movement between on-premises networks and Azure. It must be installed on a machine inside the corporate firewall, enabling secure communication via outbound HTTPS (port 443) to Azure. This is the only IR type that can access private, on-premises data sources directly.

Exam trap

The trap here is that candidates often confuse the Self-hosted Integration Runtime with the Azure Integration Runtime, not realizing that only the self-hosted variant can bridge on-premises and cloud networks, while the Azure IR is restricted to cloud-to-cloud scenarios.

Why the other options are wrong

A

Azure IR runs in the cloud and cannot access on-premises networks directly.

C

Managed VNet IR is for secure access to Azure resources, not on-premises.

D

While historically called Gateway, the correct term is Self-hosted Integration Runtime.

36
MCQmedium

A data engineer is designing a batch processing pipeline that reads data from Azure Blob Storage, transforms it using Azure Databricks, and writes the output to Azure Synapse Analytics. The source files are in CSV format and arrive daily at 02:00 UTC. The transformation must be idempotent and the pipeline should handle late-arriving data (up to 2 hours). What is the best approach to trigger the pipeline?

A.Storage event trigger using Azure Event Grid
B.Schedule trigger set to 02:00 UTC daily
C.Tumbling window trigger with window size of 1 day and a late arrival window of 2 hours
D.Event trigger on blob creation in the container
AnswerC

Ensures idempotency and handles late data by allowing up to 2 hours delay.

Why this answer

Option C is correct because a tumbling window trigger in Azure Data Factory allows you to define a fixed-size window (1 day) and a late arrival window (2 hours), which ensures idempotent processing by automatically rerunning the window for late-arriving data within the specified delay. This matches the requirement for daily batch processing at 02:00 UTC while handling data arriving up to 2 hours late.
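
The relationship between window boundaries and the delayed run time can be worked out directly. The sketch below is plain date arithmetic, not the Data Factory trigger engine; the function name and the start date are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

def tumbling_windows(start: datetime, size: timedelta, delay: timedelta,
                     count: int) -> List[Tuple[datetime, datetime, datetime]]:
    """Return (window_start, window_end, earliest_run_time) for consecutive windows."""
    windows = []
    for i in range(count):
        window_start = start + i * size
        window_end = window_start + size
        # The run waits `delay` past the window end so late files are included.
        windows.append((window_start, window_end, window_end + delay))
    return windows

# Daily windows anchored at 02:00 UTC with a 2-hour allowance for late data.
schedule = tumbling_windows(datetime(2024, 1, 1, 2, 0), timedelta(days=1), timedelta(hours=2), 2)
```

Windows are contiguous and never overlap, so rerunning one window reprocesses exactly the same slice of data, which is what makes the run repeatable.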

Exam trap

Microsoft often tests the distinction between schedule triggers (fixed time) and tumbling window triggers (window-based with late arrival handling), where candidates mistakenly choose a simple schedule trigger because they overlook the late-arriving data requirement.

How to eliminate wrong answers

Option A is wrong because a Storage event trigger using Azure Event Grid fires on every blob creation event, which would cause duplicate processing for late-arriving data and does not guarantee idempotency without custom deduplication logic. Option B is wrong because a Schedule trigger set to 02:00 UTC daily cannot handle late-arriving data; it runs only at the scheduled time and misses files that arrive after the trigger execution. Option D is wrong because an Event trigger on blob creation in the container is event-driven and will process each blob individually, leading to non-idempotent behavior and potential out-of-order processing for late-arriving files.

37
Multi-Selectmedium

You are designing a data transformation pipeline using Azure Databricks. The pipeline reads from Azure Data Lake Storage Gen2, performs aggregations, and writes to a Synapse dedicated SQL pool. Which three configurations should you implement to optimize performance and minimize cost? (Choose three.)

Select 3 answers
A.Enable Delta Lake on the storage account
B.Use Auto Optimize and Optimized Writes on Delta tables
C.Enable Photon engine for Spark SQL operations
D.Use a single-node cluster to reduce cost
E.Disable autoscaling to avoid cost variability
F.Use default Spark shuffle partitions (200)
AnswersA, B, C

Why this answer

Option A is correct because enabling Delta Lake on the storage account allows you to use Delta tables, which provide ACID transactions, scalable metadata handling, and unified batch/streaming capabilities. This is essential for reliable and performant data transformations in Azure Databricks, especially when reading from ADLS Gen2 and writing to Synapse.

Exam trap

The trap here is that candidates often assume cost savings come from reducing cluster size (single-node) or disabling autoscaling, but in practice these choices hurt performance and can increase total cost due to longer runtimes and resource contention.

Why the other options are wrong

D

Single-node cluster cannot handle large data volumes efficiently.

E

Autoscaling helps minimize cost by scaling down during idle periods.

F

Default may not be optimal; tuning shuffle partitions is recommended.

38
MCQmedium

Match the Azure service to its primary data processing use case. Drag each service on the left to the correct use case on the right.

Services: Azure Databricks, Azure Stream Analytics, Azure Data Factory, Azure Synapse Analytics

Use cases:
- Real-time event processing
- Orchestration of ETL pipelines
- Big data analytics with Spark
- Enterprise data warehousing

A.Azure Databricks - Big data analytics with Spark
B.Azure Stream Analytics - Real-time event processing
C.Azure Data Factory - Orchestration of ETL pipelines
D.Azure Synapse Analytics - Enterprise data warehousing

Why this answer

Azure Databricks is used for big data analytics with Spark. Azure Stream Analytics is for real-time event processing. Azure Data Factory is for orchestrating ETL pipelines.

Azure Synapse Analytics is for enterprise data warehousing.

Exam trap

Candidates might confuse Azure Databricks with Azure Synapse Analytics because both can run Spark. However, Databricks is primarily a collaborative Spark environment, while Synapse is a data warehouse with integrated analytics.

Why the other options are wrong

A

Correct pair

B

Correct pair

C

Correct pair

D

Correct pair

39
MCQeasy

A data processing job in Azure Synapse Analytics writes results to a table in the dedicated SQL pool. After a failure, the job restarts from the beginning, causing duplicates. Which design pattern should you implement to ensure idempotent writes?

A.Use a TRUNCATE statement before each insert.
B.Use a MERGE statement with a unique key to upsert data.
C.Use a staging table and then swap partitions with the target table.
D.Use CREATE TABLE AS SELECT (CTAS) with a unique constraint.
AnswerC

Atomic swap ensures idempotency.

Why this answer

Option C is correct because using a staging table with partition swapping ensures idempotent writes by atomically replacing the target partition with a fully loaded staging partition. This avoids duplicates even if the job restarts, as the swap operation is transactional and the staging table can be truncated before each run. In Azure Synapse dedicated SQL pool, partition switching is a metadata-only operation that provides consistency without data movement.
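
The stage-then-swap pattern can be modelled with a dictionary of partitions. This is a conceptual sketch only: a Python reference assignment stands in for the metadata-only partition switch, and the partition names and rows are made up.

```python
from typing import Dict, List

def load_partition(target: Dict[str, List[int]], partition: str, rows: List[int],
                   fail_midway: bool = False) -> None:
    """Build the partition in staging, then replace the target partition in one step."""
    staging: List[int] = []  # starts empty on every run, like a truncated staging table
    for i, row in enumerate(rows):
        if fail_midway and i == len(rows) // 2:
            raise RuntimeError("simulated failure before the swap")
        staging.append(row)
    target[partition] = staging  # the swap: a single reference assignment

table = {"2024-01-01": [1, 2]}
try:
    load_partition(table, "2024-01-02", [3, 4, 5, 6], fail_midway=True)
except RuntimeError:
    pass  # the target is untouched because the swap never happened
load_partition(table, "2024-01-02", [3, 4, 5, 6])  # retry from the beginning
load_partition(table, "2024-01-02", [3, 4, 5, 6])  # an accidental second run
```

A failed run leaves no partial rows in the target, and repeated successful runs produce the same final state.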

Exam trap

The trap here is that candidates often choose MERGE (Option B) thinking it inherently provides idempotency, but they overlook that MERGE in Synapse dedicated SQL pool is not atomic across retries and can still cause duplicates if the job fails after partial execution, whereas partition switching provides true atomic replacement.

How to eliminate wrong answers

Option A is wrong because TRUNCATE before each insert would remove all existing data in the table, which is destructive and not suitable for incremental or partial loads; it also does not handle concurrent access or partial failures gracefully. Option B is wrong because MERGE with a unique key can still produce duplicates if the job restarts mid-operation (e.g., after inserts but before updates), and MERGE in Synapse dedicated SQL pool is not fully atomic for large-scale upserts due to potential deadlocks and transaction log overhead. Option D is wrong because CREATE TABLE AS SELECT (CTAS) with a unique constraint does not prevent duplicates on restart—CTAS creates a new table each time, and the unique constraint only enforces uniqueness within that single execution, not across retries; additionally, CTAS does not provide a mechanism to swap or replace existing data atomically.

40
MCQmedium

Refer to the exhibit. A user with Storage Blob Data Reader role on the container rawdata cannot list files under /2023/07/. What is the most likely reason?

A.The directory ACL does not grant 'execute' permission to the user
B.The user does not have Storage Blob Data Contributor role
C.The user is not the owner of the directory
D.The container name is misspelled
AnswerA

To list directory contents, the user needs read and execute permission on the directory. 'other' has no permissions, so the user (not being owner or group) is denied.

Why this answer

The ACL shows that 'other' has no permissions (---). The user does not have explicit ACL entries, so they fall under 'other'. Without execute permission (--x) on the directory, they cannot traverse it.

Although they have Reader role at container level, POSIX ACLs on the directory restrict access.
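
The owner/group/other evaluation described above can be sketched as a small permission check. This is a simplified POSIX-style model that ignores named ACL entries, masks, and any interaction with role assignments; the ACL string and principal names in the usage note are hypothetical, since the exhibit is not reproduced here.

```python
from typing import Set

def effective_permissions(acl: str, user: str, owner: str, group_members: Set[str]) -> str:
    """Pick the owner, group, or other triplet from a 9-character string like 'rwxr-x---'."""
    if user == owner:
        return acl[0:3]
    if user in group_members:
        return acl[3:6]
    return acl[6:9]

def can_list(acl: str, user: str, owner: str, group_members: Set[str]) -> bool:
    """Listing a directory needs both read (r) and execute (x) on it."""
    perms = effective_permissions(acl, user, owner, group_members)
    return perms[0] == "r" and perms[2] == "x"
```

With an ACL of `rwxr-x---`, `can_list` is true for the owner and for group members, but false for any other user, who falls into the `---` triplet.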

41
Drag & Dropmedium

Drag and drop the steps to implement Slowly Changing Dimension (SCD) Type 2 in Azure Synapse Analytics dedicated SQL pool into the correct order.

Steps
Order

Why this order

SCD Type 2: stage data, merge to find changes, expire old records, insert new versions, and update attributes (if needed).
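
The expire-and-insert core of SCD Type 2 can be sketched on an in-memory list of rows. This is an illustration of the logic, not T-SQL for a dedicated SQL pool; the column names (`customer_id`, `city`, `valid_from`, `valid_to`, `is_current`) are assumptions.

```python
from datetime import date
from typing import Dict, List

def apply_scd2(dimension: List[Dict], staged: List[Dict], load_date: date) -> None:
    """Expire changed current rows and append new versions, in place."""
    current = {row["customer_id"]: row for row in dimension if row["is_current"]}
    for incoming in staged:
        existing = current.get(incoming["customer_id"])
        if existing and existing["city"] == incoming["city"]:
            continue  # no change in the tracked attribute
        if existing:
            existing["is_current"] = False  # expire the old version
            existing["valid_to"] = load_date
        dimension.append({
            "customer_id": incoming["customer_id"],
            "city": incoming["city"],
            "valid_from": load_date,
            "valid_to": None,
            "is_current": True,
        })
```

A changed customer ends up with two rows, one expired and one current, while an unchanged customer is left alone, so applying the same staged batch again adds nothing.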

42
MCQeasy

A company ingests streaming data from IoT devices into Azure Event Hubs. The data must be processed in near real-time to detect anomalies and stored in Azure Data Lake Storage Gen2 for historical analysis. The solution must minimize latency and avoid duplicate processing. Which Azure service should be used for processing?

A.Azure Data Factory
B.Azure Databricks with Structured Streaming
C.Azure Functions with Event Hubs trigger
D.Azure Stream Analytics
AnswerD

Azure Stream Analytics provides low-latency stream processing with exactly-once semantics and integrates with Event Hubs and Data Lake Storage.

Why this answer

Azure Stream Analytics is the correct choice because it is purpose-built for near real-time stream processing with sub-second latency, directly integrates with Event Hubs as input and Data Lake Storage Gen2 as output, and provides built-in exactly-once delivery semantics to avoid duplicate processing. It also supports temporal windowing and anomaly detection functions natively, making it ideal for this IoT anomaly detection scenario.

Exam trap

The trap here is that candidates often choose Azure Databricks with Structured Streaming because of its flexibility and popularity, but they overlook the specific requirement for minimal latency and built-in exactly-once processing, which Azure Stream Analytics handles more efficiently without the overhead of a Spark cluster.

How to eliminate wrong answers

Option A is wrong because Azure Data Factory is a batch-oriented ETL orchestration service, not designed for near real-time streaming or sub-second latency processing. Option B is wrong because Azure Databricks with Structured Streaming introduces higher latency due to Spark job initialization and micro-batch processing, and requires additional configuration for exactly-once semantics, making it less optimal for minimal latency and duplicate avoidance. Option C is wrong because Azure Functions with Event Hubs trigger processes events one at a time or in small batches, lacks native windowing and anomaly detection operators, and can lead to duplicate processing if not carefully managed with checkpointing and idempotent logic.
