CCNA Develop data processing Questions

72 of 297 questions · Page 4/4 · Develop data processing · Answers revealed

226

MCQ · medium

Your company uses Azure Data Lake Storage Gen2 as a data lake. You need to process CSV files that arrive in a 'raw' container, transform them into Parquet format, and write them to a 'curated' container. The transformation includes filtering out rows with null values in the 'customer_id' column and adding a partition column 'year' based on the 'order_date'. You use Azure Synapse Pipelines. Which activity should you use for the transformation?

A. Stored procedure activity

B. Notebook activity with PySpark

C. Copy data activity

D. Data flow activity

Answer: D

Data flows provide visual transformation with built-in mapping.

Why this answer

Option D is correct because a data flow activity in Azure Synapse Pipelines can perform transformations such as filtering rows and adding computed columns, and can write to ADLS Gen2 in Parquet format. Option C is wrong because the Copy data activity moves data and can convert file formats, but it cannot filter rows or add derived columns. Option B is wrong because a Notebook activity requires writing Spark code; a data flow is simpler for this scenario.

Option A is wrong because the Stored procedure activity runs SQL, not file transformations.

227

Multi-Select · medium

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query data in Azure Data Lake Storage Gen2. The data is stored as Parquet files partitioned by date. Which TWO of the following statements are true regarding querying this data? (Select TWO.)

Select 2 answers

A. You can use the filepath() function in the query to retrieve the partition column values.

B. Partition elimination is automatically applied when filtering on the partition column in the WHERE clause.

C. You must create an external table to query Parquet files; OPENROWSET is not supported.

D. You can create indexes on the serverless SQL pool to improve query performance.

E. You can only query a single file at a time; wildcards are not supported.

Answers: A, B

The filepath function returns the partition path values.

Why this answer

Option A is correct because the `filepath()` function in a serverless SQL pool query returns the file path of the row being read. When data is partitioned by date in Azure Data Lake Storage Gen2, the partition column values are embedded in the folder structure (e.g., `/year=2023/month=01/day=15/`). Using `filepath(1)`, `filepath(2)`, etc., you can extract these values directly in the query without needing to parse the path manually.
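
To make the path-to-value mapping concrete, here is a small Python sketch (not Synapse code) that mimics what `filepath(n)` would hand back for a wildcard pattern over date-partitioned folders. The helper name `wildcard_values`, the pattern, and the sample file name are hypothetical and chosen only for illustration.

```python
import re

def wildcard_values(pattern: str, path: str) -> list[str]:
    """Return the path fragments matched by each '*' in pattern, in order."""
    # Escape the literal parts and turn every '*' into a capture group
    # that matches within a single path segment.
    regex = "".join(
        "([^/]*)" if part == "*" else re.escape(part)
        for part in re.split(r"(\*)", pattern)
    )
    match = re.fullmatch(regex, path)
    return list(match.groups()) if match else []

# Hypothetical pattern and file path following the year=/month=/day= layout.
pattern = "/year=*/month=*/day=*/*.parquet"
path = "/year=2023/month=01/day=15/part-0001.parquet"
values = wildcard_values(pattern, path)
print(values)  # ['2023', '01', '15', 'part-0001']
```

In this sketch the first three entries correspond to the year, month, and day values that a query would read with `filepath(1)`, `filepath(2)`, and `filepath(3)`.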

Exam trap

The trap here is that candidates often assume serverless SQL pools behave like dedicated SQL pools, leading them to think indexes are needed or that external tables are mandatory, when in fact serverless pools are schema-on-read and rely on file metadata and statistics for performance.

228

MCQ · medium

Refer to the exhibit. You deploy this ARM template to create a managed integration runtime in Azure Synapse Analytics. You notice that the integration runtime shows as 'Running' but copy activities using it are slow. The data volume is 500 GB per run. What is the most likely cause of the poor performance?

A. The integration runtime type should be 'Self-hosted' for better performance.

B. The node size (Standard_D3_v2) is too small for the data volume.

C. The number of nodes (2) is insufficient for 500 GB of data.

D. The location is set to 'AutoResolve' which may cause data movement across regions.

Answer: B

Standard_D3_v2 has limited resources for large data processing.

Why this answer

Option B is correct because the node size Standard_D3_v2 has limited memory and CPU for large data volumes. A larger node size (e.g., Standard_D8_v3) would improve performance. Option C is wrong because 2 nodes is reasonable; increasing the node count could help, but the node size is more critical.

Option D is wrong because the 'AutoResolve' location is fine. Option A is wrong because the 'Managed' integration runtime type is appropriate for Azure-based copy.

229

MCQ · medium

You are reviewing a Spark job definition in Azure Synapse Analytics. The job aggregates sales data. The job runs successfully but takes longer than expected. You notice that dynamic allocation is disabled and the executor instances are fixed at 10. The cluster has a maximum of 20 nodes. What is the most likely reason for the slow performance?

A. The file path is incorrect, causing data read errors.

B. The job cannot scale out beyond 10 executors because dynamic allocation is disabled.

C. The job is not parallelized because of a single partition.

D. The executor memory is too low for the aggregation.

Answer: B

With dynamic allocation off, the job is limited to 10 executors.

Why this answer

With dynamic allocation disabled and executor instances fixed at 10, the Spark job cannot utilize additional cluster resources even though the cluster supports up to 20 nodes. This means the job is artificially constrained to 10 executors, limiting parallelism and causing slower performance despite available compute capacity.
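
The scaling limit can be sketched with two Spark configuration sets. The configuration keys below are standard Apache Spark settings; the helper `executor_ceiling` is hypothetical, and the maximum of 20 executors assumes one executor per node, which the question does not state.

```python
def executor_ceiling(conf: dict[str, str]) -> int:
    """Upper bound on executors a job can use under the given Spark settings."""
    if conf.get("spark.dynamicAllocation.enabled", "false") == "true":
        return int(conf["spark.dynamicAllocation.maxExecutors"])
    return int(conf["spark.executor.instances"])

# Configuration described in the question: fixed at 10 executors.
fixed = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "10",
}

# Alternative with dynamic allocation on (assumed: one executor per node, 20 nodes).
dynamic = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "10",
    "spark.dynamicAllocation.maxExecutors": "20",
}

print(executor_ceiling(fixed))    # 10
print(executor_ceiling(dynamic))  # 20
```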

Exam trap

The trap here is that candidates may overlook the explicit configuration detail (dynamic allocation disabled, fixed 10 executors) and instead focus on generic performance issues like memory or partitioning, missing the direct scaling limitation.

How to eliminate wrong answers

Option A is wrong because an incorrect file path would cause job failures or data read errors, not simply slower performance; the job runs successfully. Option C is wrong because a single partition would cause extreme underutilization and likely very slow processing, but the question states the job aggregates sales data and runs successfully, implying some parallelism exists; the fixed executor count is the more direct bottleneck.

Option D is wrong because while low executor memory can cause spilling to disk and slowdowns, the question specifically highlights disabled dynamic allocation and fixed executors as the observed configuration, making insufficient scaling the primary issue.

230

MCQ · medium

You are monitoring an Azure Data Factory pipeline that runs every hour. The pipeline uses a Copy activity to copy data from Azure SQL Database to Azure Blob Storage. Recently, the pipeline has been failing with a 'Timeout' error. The source SQL database has a large number of records. What should you do to resolve the timeout?

A. Enable staging and use PolyBase or COPY statement for the copy activity.

B. Decrease the 'writeBatchSize' to 1000.

C. Increase the 'timeout' value in the copy activity settings.

D. Use a query with 'queryTimeout' set to 7200 seconds.

Answer: A

Staging with PolyBase/COPY allows data to be copied in parallel and avoids timeouts.

Why this answer

Option A is correct because enabling staging with PolyBase or the COPY statement offloads the data transfer to Azure Data Lake or Blob Storage, bypassing the bottleneck of the Copy activity's default data movement. This approach is specifically designed for large-scale data loads from Azure SQL Database, as it uses the database's bulk export capabilities and avoids the timeout by not relying on the Copy activity's internal query execution.

Exam trap

The trap here is that candidates often assume increasing timeouts (options C or D) will fix the issue, but the real problem is the Copy activity's default command timeout limitation, which requires a fundamentally different data movement approach like staging with PolyBase or COPY statement.

How to eliminate wrong answers

Option B is wrong because decreasing 'writeBatchSize' to 1000 reduces the number of rows written per batch to the sink, which does not address the source-side timeout; the timeout occurs during the data read from Azure SQL Database, not during the write to Blob Storage. Option C is wrong because increasing the 'timeout' value in the copy activity settings only extends the overall activity duration but does not resolve the underlying issue of the source query exceeding the default command timeout (typically 30 seconds) when reading a large number of records. Option D is wrong because setting 'queryTimeout' to 7200 seconds in a query only applies to the query execution on the source database, but the Copy activity's default command timeout for the source dataset is separate and still limited; moreover, the real solution is to use a bulk export mechanism like PolyBase or COPY statement, not just extending the query timeout.

231

MCQ · hard

Refer to the exhibit. You are running the KQL query in Azure Data Explorer. The query returns no results, but you know there is data in the table T. What is the most likely issue?

A. The bin function is used incorrectly; should be bin(Timestamp, 1h) but placed in the wrong clause

B. The between operator syntax is incorrect; should be 'between (startTime..endTime)' without spaces around the dots

C. The datetime format is incorrect; use ISO 8601

D. The table T does not exist in the database

Answer: B

Spaces around the '..' can cause the range to be misparsed.

Why this answer

Option B is correct because the KQL `between` operator requires the range syntax `between(datetime1..datetime2)` with no spaces around the two dots. The query uses `between (startTime .. endTime)` with spaces, which is invalid syntax and causes the query to return no results even though data exists in table T.

Exam trap

Microsoft often tests the exact syntax of KQL operators like `between`, where candidates overlook the requirement for no spaces around the two dots, assuming whitespace is allowed as in other languages.

How to eliminate wrong answers

Option A is wrong because the `bin` function is used correctly in the `where` clause to round timestamps; the issue is not about `bin` placement. Option C is wrong because the datetime format in the query uses a valid format (e.g., '2023-01-01 00:00:00') and KQL accepts various datetime formats, including the one shown. Option D is wrong because the question explicitly states that table T exists and contains data, so the table not existing is not the issue.

232

Multi-Select · medium

Which TWO options are correct approaches to handle schema drift in Azure Data Factory Mapping Data Flows?

Select 2 answers

A. Use a conditional split to route rows with different schemas to separate sinks.

B. Define a rigid schema in the source dataset and reject rows that don't match.

C. Disable schema drift to improve performance.

D. Enable 'Allow schema drift' in the source transformation.

E. Use a derived column transformation to provide default values for missing columns.

Answers: D, E

This allows the data flow to handle changing columns.

Why this answer

Option D is correct because enabling 'Allow schema drift' in the source transformation is the primary mechanism in Mapping Data Flows to handle incoming columns that are not defined in the dataset schema. This setting allows the data flow to dynamically adapt to changes in the source data structure, such as new or missing columns, without requiring manual schema updates.

Exam trap

The trap here is that candidates often confuse handling schema drift with data routing or error handling, and they overlook that enabling schema drift is the foundational step that must be taken before any other transformations can work with the drifted columns.

233

MCQ · medium

You have an Azure Data Factory pipeline that executes a stored procedure in Azure SQL Database. The pipeline fails with an error indicating that the stored procedure ran out of memory. What change should you make to the pipeline to resolve this?

A. Add a retry policy to the stored procedure activity.

B. Increase the pipeline activity timeout.

C. Use a Self-Hosted Integration Runtime instead of Azure IR.

D. Scale up the Azure SQL Database to a higher service tier.

Answer: D

Higher service tiers provide more memory for the database.

Why this answer

The error indicates that the stored procedure ran out of memory, which is a resource limitation at the database level, not a transient failure or timeout issue. Scaling up the Azure SQL Database to a higher service tier (e.g., from Standard to Premium or increasing DTU/vCore count) provides more memory and compute resources, directly resolving the out-of-memory condition.

Exam trap

The trap here is that candidates confuse pipeline-level retries or timeouts with database-level resource constraints, assuming that retrying or waiting longer will fix a memory exhaustion error, which is a hard resource limit that requires scaling the database.

How to eliminate wrong answers

Option A is wrong because a retry policy only re-executes the activity on transient failures (e.g., network blips), but an out-of-memory error is a persistent resource constraint that will recur on retry. Option B is wrong because increasing the pipeline activity timeout extends the duration the pipeline waits for completion, but does not address the underlying memory shortage in the database. Option C is wrong because using a Self-Hosted Integration Runtime shifts data movement or activity execution to an on-premises or VM-based runtime, but does not affect the memory allocation of the Azure SQL Database where the stored procedure runs.

234

MCQ · hard

You are optimizing an Azure Synapse serverless SQL pool query that queries Parquet files in Azure Data Lake Storage. The query takes longer than expected. You notice that the query reads more data than necessary. What is the most effective way to reduce the amount of data scanned?

A. Split large Parquet files into smaller files of 100 MB each

B. Create external tables with explicit schema and partition by a frequently filtered column

C. Use SELECT with column pruning to only retrieve necessary columns

D. Increase the query's resource allocation by using a larger service level objective

Answer: B

External tables with partition elimination allow serverless SQL to skip entire partitions when filters are applied, reducing data scanned.

Why this answer

Option B is correct because partitioning external tables in Azure Synapse serverless SQL allows the query engine to perform partition elimination, reading only the subdirectories that match the filter criteria. This directly reduces the amount of data scanned from Parquet files in ADLS, addressing the core issue of reading unnecessary data.

Exam trap

The trap here is that candidates often confuse column pruning (reducing columns) with partition pruning (reducing rows), or assume that file size optimization alone reduces data volume, when in fact partition elimination is the key technique for minimizing scanned data in serverless SQL pools.

How to eliminate wrong answers

Option A is wrong because splitting large Parquet files into smaller files does not reduce the total data scanned; it may even increase overhead due to more file open operations. Option C is wrong because column pruning reduces the columns read, not the rows; the query still scans all partitions and files, so it does not address reading more data than necessary when filtering is involved. Option D is wrong because increasing the service level objective (SLO) allocates more resources but does not change the amount of data scanned; it only speeds up the scan, which is not the most effective way to reduce data volume.

235

MCQ · easy

You are developing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query Parquet files stored in Azure Data Lake Storage Gen2. Which authentication method should you use to ensure that the queries use the identity of the caller and adhere to Azure role-based access control (RBAC) permissions?

A. Microsoft Entra ID pass-through authentication.

B. Storage account key.

C. Shared access signature (SAS) token.

D. Service principal with a secret.

Answer: A

Microsoft Entra ID pass-through uses the caller's identity and enforces RBAC permissions.

Why this answer

Microsoft Entra ID pass-through authentication (option A) is correct because it allows the serverless SQL pool to use the caller's identity when accessing Azure Data Lake Storage Gen2. This ensures that Azure RBAC permissions (e.g., Storage Blob Data Reader) assigned to the user are evaluated for each query, providing fine-grained access control without exposing storage account keys or tokens.

Exam trap

The trap here is that candidates often confuse 'service principal' (a fixed identity) with 'user identity' and select option D, not realizing that pass-through authentication is the only method that preserves the caller's individual RBAC permissions.

How to eliminate wrong answers

Option B (Storage account key) is wrong because it uses a shared secret that grants full administrative access to the storage account, bypassing RBAC and the caller's identity entirely. Option C (Shared access signature token) is wrong because it delegates access based on a pre-signed URI with fixed permissions and expiry, not the caller's identity, and does not enforce RBAC. Option D (Service principal with a secret) is wrong because it authenticates as a fixed application identity rather than the individual caller, so RBAC permissions are evaluated against the service principal, not the user who submitted the query.

236

MCQ · medium

Your organization uses Azure Synapse Analytics dedicated SQL pool to store sales data. You need to design a data loading process for a nightly batch that inserts new rows and updates existing rows based on the business key. The table has a clustered columnstore index. Which approach minimizes table fragmentation?

A. Use UPDATE for existing rows and INSERT for new rows.

B. Use DELETE and INSERT statements in a single transaction.

C. Use a MERGE statement to perform upserts.

D. Create a staging table, load data, then use CTAS and partition switching to replace the target partition.

Answer: D

CTAS rebuilds the partition, minimizing fragmentation.

Why this answer

Option D is correct because loading a staging table, rebuilding the data with CREATE TABLE AS SELECT (CTAS), and then switching the partition swaps out the entire partition, avoiding the individual row modifications that fragment a columnstore index. Option B is wrong because DELETE and INSERT on large batches cause fragmentation. Option C is wrong because MERGE on a columnstore table causes fragmentation.

Option A is wrong because UPDATE on a columnstore table causes fragmentation.

237

MCQ · hard

You are troubleshooting a Synapse Pipeline that runs a Copy activity from an on-premises SQL Server to Azure Synapse Dedicated SQL Pool. The pipeline fails with the error: 'Failure happened on 'Source' side. ErrorCode=SqlOperationFailed.' The on-premises SQL Server has no firewall restrictions. What is the most likely cause?

A. Staging is not enabled for the Copy activity.

B. The destination table in Synapse has a different schema.

C. The SQL Server credentials in the linked service are incorrect.

D. The self-hosted integration runtime is not configured properly.

Answer: D

A self-hosted integration runtime is required to connect from Azure to on-premises networks; misconfiguration is a common cause of source-side failures.

Why this answer

The error indicates a failure on the source side. Since the on-premises SQL Server has no firewall restrictions, the most likely cause is that the self-hosted integration runtime is not installed or configured correctly, preventing connectivity from the cloud to the on-premises network. Option C (incorrect credentials) is plausible but less likely given the error message.

Option B is wrong because a destination schema mismatch is not a typical cause of a source-side error. Option A is wrong because staging is not mentioned in the scenario.

238

MCQ · easy

You are processing streaming data from IoT devices using Azure Stream Analytics. The data includes temperature readings and device IDs. You need to calculate the average temperature per device over a 5-minute window, sliding every 1 minute. Which window function should you use?

A. Hop window

B. Session window

C. Sliding window

D. Tumbling window

Answer: A

Hop windows overlap and advance every hop interval.

Why this answer

A Hop window in Azure Stream Analytics allows you to specify a window size (5 minutes) and a hop size (1 minute), creating overlapping windows that slide forward every minute. This matches the requirement to calculate the average temperature per device over a 5-minute period, recalculated every minute, as the hop window outputs results at each hop interval while retaining data across overlapping windows.
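
The overlap can be illustrated with a short Python sketch (a simulation, not Stream Analytics code). The function `hop_windows` is hypothetical, and it assumes windows that include their start and exclude their end purely for simplicity; the service's exact boundary handling may differ.

```python
def hop_windows(event_ts: int, size: int, hop: int) -> list[tuple[int, int]]:
    """Return the [start, end) windows of length `size`, advancing by `hop`,
    that contain the event timestamp (all values in seconds)."""
    windows = []
    # Latest hop-aligned window start that is not after the event.
    start = (event_ts // hop) * hop
    # Walk backwards while the window still covers the event.
    while start > event_ts - size:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)

MINUTE = 60
# An event at 00:07:30 with a 5-minute window hopping every 1 minute.
wins = hop_windows(event_ts=7 * MINUTE + 30, size=5 * MINUTE, hop=1 * MINUTE)
print(len(wins))  # 5
```

Each event lands in five overlapping windows here, which is why a result is refreshed every minute. Setting `hop` equal to `size` collapses this to a single window per event, which is the tumbling case.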

Exam trap

The trap here is that candidates confuse 'sliding' with 'hopping' — a Sliding window in Stream Analytics is event-driven and does not produce periodic outputs, whereas a Hop window is time-driven and explicitly supports overlapping fixed-size windows with a hop interval.

How to eliminate wrong answers

Option B is wrong because a Session window groups events based on inactivity gaps (session timeout), not fixed time intervals, and would not produce consistent 5-minute windows sliding every 1 minute. Option C is wrong because a Sliding window in Stream Analytics outputs results only when an event occurs (e.g., for each new event), not at fixed time intervals, and does not support a predefined hop size. Option D is wrong because a Tumbling window is a series of fixed-size, non-overlapping contiguous time windows (e.g., every 5 minutes), which cannot produce overlapping windows that slide every 1 minute.

239

MCQ · easy

You are developing a real-time data processing solution using Azure Stream Analytics. The input is an Azure Event Hubs stream with JSON data containing a 'timestamp' field. You need to output the average temperature per device every minute using a tumbling window. Which query should you use?

A. SELECT DeviceId, AVG(Temperature) AS AvgTemp FROM Input TIMESTAMP BY Timestamp GROUP BY DeviceId, SlidingWindow(minute, 1)

B. SELECT DeviceId, AVG(Temperature) AS AvgTemp FROM Input TIMESTAMP BY Timestamp GROUP BY DeviceId, TumblingWindow(minute, 1)

C. SELECT DeviceId, AVG(Temperature) AS AvgTemp FROM Input TIMESTAMP BY Timestamp GROUP BY DeviceId, SessionWindow(minute, 1, 1)

D. SELECT DeviceId, AVG(Temperature) AS AvgTemp FROM Input TIMESTAMP BY Timestamp GROUP BY DeviceId, HopWindow(minute, 1, 1)

Answer: B

Tumbling window of 1 minute produces non-overlapping windows.

Why this answer

Option B is correct because a tumbling window is a fixed, non-overlapping time window that groups events into distinct time segments. Using `TumblingWindow(minute, 1)` with `TIMESTAMP BY Timestamp` ensures that the average temperature per device is computed over each one-minute interval without overlap, which matches the requirement of 'every minute'.
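
As a rough illustration of what the tumbling-window query computes, the following Python sketch groups events into non-overlapping one-minute buckets per device. It is a simulation only; the function `tumbling_avg`, the sample events, and the start-inclusive bucket boundaries are assumptions made for the example.

```python
from collections import defaultdict

def tumbling_avg(events, window_seconds=60):
    """Average temperature per (window_start, device_id) over non-overlapping windows."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, device_id, temperature in events:
        # Every timestamp maps to exactly one window, so windows never overlap.
        window_start = (ts // window_seconds) * window_seconds
        acc = sums[(window_start, device_id)]
        acc[0] += temperature
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Hypothetical events: (timestamp in seconds, device id, temperature).
events = [
    (5, "dev1", 20.0),
    (40, "dev1", 22.0),
    (50, "dev2", 30.0),
    (65, "dev1", 25.0),
]
result = tumbling_avg(events)
print(result[(0, "dev1")])   # 21.0
print(result[(60, "dev1")])  # 25.0
```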

Exam trap

The trap here is that candidates confuse `SlidingWindow` or `HopWindow` with `TumblingWindow`, not realizing that only `TumblingWindow` produces non-overlapping, fixed-interval outputs required for a simple per-minute average.

How to eliminate wrong answers

Option A is wrong because `SlidingWindow` produces a continuous output for every event, not fixed intervals, and would not give a single average per minute. Option C is wrong because `SessionWindow` groups events based on inactivity gaps, not fixed time boundaries, and would not produce a consistent per-minute result. Option D is wrong because `HopWindow` creates overlapping windows with a hop size smaller than the window size, leading to multiple outputs per minute and not a single non-overlapping aggregation.

240

MCQ · hard

You are designing a data processing solution using Azure Databricks with Delta Lake. The data is ingested from multiple sources and needs to be deduplicated based on a composite key (source_id, record_id). New data may have duplicates within the same batch. Which write mode and table property should you use to handle this efficiently?

A. Use 'append' mode and perform a MERGE operation after write to deduplicate.

B. Use 'overwrite' mode and enable 'delta.autoOptimize.optimizeWrite' = true.

C. Use 'ignore' mode and set 'delta.autoCompact' = true.

D. Use 'error' mode and enable 'delta.merge.onSchemaMismatch' = true.

Answer: A

Append with a subsequent MERGE allows custom dedup logic on composite key.

Why this answer

Option A is correct because using 'append' mode writes all incoming data as new files, and then performing a MERGE operation (upsert) based on the composite key (source_id, record_id) allows you to efficiently deduplicate both within the batch and against existing data. This approach leverages Delta Lake's ACID transactions and avoids the cost of rewriting entire partitions, making it suitable for handling duplicates from multiple sources.
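
The deduplication logic behind a MERGE on the composite key can be sketched in plain Python. This is a simulation of the idea rather than Delta Lake code; `merge_dedup`, the sample rows, and the "last row in the batch wins" rule are assumptions made for illustration.

```python
def merge_dedup(target: dict, batch: list[dict]) -> dict:
    """Upsert batch rows into target, keyed by (source_id, record_id)."""
    # Step 1: collapse duplicates inside the batch (assumed rule: later rows win).
    latest = {}
    for row in batch:
        latest[(row["source_id"], row["record_id"])] = row
    # Step 2: upsert, mirroring WHEN MATCHED (update) / WHEN NOT MATCHED (insert).
    merged = dict(target)
    merged.update(latest)
    return merged

target = {("crm", 1): {"source_id": "crm", "record_id": 1, "value": "old"}}
batch = [
    {"source_id": "crm", "record_id": 1, "value": "new"},  # matches existing row
    {"source_id": "crm", "record_id": 2, "value": "a"},    # duplicate within batch
    {"source_id": "crm", "record_id": 2, "value": "b"},
    {"source_id": "erp", "record_id": 1, "value": "x"},    # same record_id, other source
]
result = merge_dedup(target, batch)
print(len(result))  # 3
```

Collapsing duplicates inside the batch first matters in practice, because a Delta Lake MERGE raises an error when several source rows match the same target row.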

Exam trap

Microsoft often tests the misconception that 'overwrite' mode or table properties like 'autoOptimize' can handle deduplication, but the correct approach requires explicit deduplication logic (like MERGE) because Delta Lake does not enforce unique constraints natively.

How to eliminate wrong answers

Option B is wrong because 'overwrite' mode replaces the entire table or partition, which is inefficient for deduplication and would lose existing data not in the current batch; enabling 'delta.autoOptimize.optimizeWrite' only improves file layout, not deduplication logic. Option C is wrong because 'ignore' mode skips the write entirely when the target table already exists, so it performs no row-level deduplication, and Delta Lake does not enforce unique constraints natively; 'delta.autoCompact' only merges small files and does not deduplicate. Option D is wrong because 'error' mode fails the write when the target table already exists, which is not a deduplication strategy; 'delta.merge.onSchemaMismatch' is not a valid Delta Lake table property. Schema evolution is instead controlled by the 'mergeSchema' DataFrame writer option or the 'spark.databricks.delta.schema.autoMerge.enabled' Spark configuration.

241

Multi-Select · hard

You are designing a real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs, and the output is to Azure Synapse Analytics. The solution must guarantee exactly-once delivery to Synapse. Which THREE configurations are required? (Choose three.)

Select 3 answers

A. Configure the output to use batch mode for writing.

B. Define a watermark strategy in the query to handle late-arriving events.

C. Set the late arrival tolerance window to zero.

D. Use a job with a unique identifier column in the output to enable deduplication.

E. Ensure the output table in Synapse has a primary key to support upsert operations.

Answers: B, D, E

Ensures correct windowing.

Why this answer

Options B, D, and E are correct. Exactly-once semantics require a unique identifier for deduplication, a watermark to handle late events, and output to a table with a primary key for upsert. Option C is incorrect because setting the late arrival tolerance to zero is not required for exactly-once.

Option A is incorrect because batch mode is not supported for exactly-once output to Synapse.

242

MCQ · medium

Refer to the exhibit. You have created an external table in Azure Synapse serverless SQL pool as shown. You run a query: SELECT ProductID, SUM(Amount) FROM dbo.ExternalSales WHERE SaleDate > '2024-01-01' GROUP BY ProductID. The query is slow and scans all files in the /sales/ folder, which contains data from 2023 and 2024. The files are partitioned by year and month in the folder structure, e.g., /sales/year=2023/month=01/. What should you do to improve query performance?

A. Recreate the external table with a partition definition on SaleDate column using the folder structure

B. Recreate the external table with a partition on ProductID

C. Create statistics on the SaleDate column

D. Change the file format to CSV to improve read performance

Answer: A

By defining partitions using the folder structure, serverless SQL can skip partitions that don't match the filter.

Why this answer

Option A is correct because the query performance is slow due to full file scanning. By recreating the external table with a partition definition on the SaleDate column that maps to the folder structure (e.g., /sales/year=2023/month=01/), Azure Synapse serverless SQL pool can perform partition elimination, reading only the relevant partitions for the WHERE clause filter (SaleDate > '2024-01-01'). This drastically reduces the amount of data scanned, improving query speed.
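
Conceptually, partition elimination amounts to discarding folders before any file is opened. The Python sketch below simulates that step for the year filter; `prune_folders` and the folder list are hypothetical, and the real engine's pruning logic is more involved.

```python
import re

def prune_folders(folders: list[str], min_year: int) -> list[str]:
    """Keep only folders whose year=YYYY segment is >= min_year."""
    kept = []
    for folder in folders:
        match = re.search(r"year=(\d{4})", folder)
        if match and int(match.group(1)) >= min_year:
            kept.append(folder)
    return kept

# Hypothetical folder listing following the /sales/year=/month=/ layout.
folders = [
    "/sales/year=2023/month=11/",
    "/sales/year=2023/month=12/",
    "/sales/year=2024/month=01/",
    "/sales/year=2024/month=02/",
]
print(prune_folders(folders, 2024))
# ['/sales/year=2024/month=01/', '/sales/year=2024/month=02/']
```

With the 2023 folders skipped, only half of the folders in this toy listing would be read for a filter such as SaleDate > '2024-01-01'.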

Exam trap

The trap here is that candidates often confuse creating statistics (which helps cardinality estimation but not data skipping) with partition elimination (which physically reduces data scanned), or they assume any column partition will work without matching the folder structure.

How to eliminate wrong answers

Option B is wrong because partitioning on ProductID does not align with the folder structure (which is partitioned by year and month), so it would not enable partition elimination for the date filter; it would still scan all files. Option C is wrong because creating statistics on SaleDate helps the query optimizer estimate cardinality but does not reduce the amount of data scanned; the query would still read all files without partition pruning. Option D is wrong because CSV files are typically slower to read than Parquet due to lack of compression and columnar storage; changing to CSV would worsen performance, not improve it.

243

MCQ · hard

Refer to the exhibit. You submit a Spark job in Azure Synapse Analytics using the Azure CLI. The job runs slowly during the shuffle phase. The input data is about 200 GB. Which configuration change would best improve performance for this shuffle-heavy workload?

A. Increase the number of executors to 4.

B. Change executor size to 'Large' to increase memory per executor.

C. Increase spark.sql.shuffle.partitions to 800.

D. Decrease spark.sql.shuffle.partitions to 200 to reduce overhead.

Answer: C

More partitions reduce the size of each partition, speeding up shuffle.

Why this answer

Option C is correct because for a 200 GB input, 400 shuffle partitions may be too few, causing large partitions and slow shuffles. Increasing the partition count (e.g., to 800 or more) can improve parallelism. Option B is wrong because, although the 'Small' executor size might be insufficient and a larger size could help, the partition count is more directly related to shuffle performance.

Option D is wrong because decreasing the partition count would make each partition larger. Option A is wrong because, although an executor count of 2 is low and increasing it would help, the partition count is the key issue for shuffle.
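
A quick back-of-the-envelope calculation shows how the partition count changes the average shuffle partition size. The sketch assumes the shuffled data is roughly the same size as the 200 GB input and is spread evenly, which real jobs rarely match exactly; `avg_partition_mb` is a hypothetical helper.

```python
def avg_partition_mb(input_gb: float, shuffle_partitions: int) -> float:
    """Average shuffle partition size in MB under an even-spread assumption."""
    return input_gb * 1024 / shuffle_partitions

for partitions in (200, 400, 800):
    print(partitions, avg_partition_mb(200, partitions))
# 200 1024.0
# 400 512.0
# 800 256.0
```

Doubling the partition count from 400 to 800 halves the average partition size in this estimate, while dropping to 200 doubles it.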

244

Multi-Select · easy

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a dedicated SQL pool to support both batch and near-real-time data ingestion. Which TWO of the following methods can you use to ingest data into a dedicated SQL pool? (Select TWO.)

Select 2 answers

A. CREATE TABLE AS SELECT (CTAS) from external tables.

B. PolyBase with T-SQL commands.

C. COPY INTO command.

D. Azure Logic Apps with SQL connector.

E. Azure Data Factory with a copy activity using native sink.

Answers: A, B

CTAS can load data into a dedicated SQL pool from external tables.

Why this answer

CREATE TABLE AS SELECT (CTAS) from external tables is correct because it allows you to load data from external storage (e.g., Azure Blob Storage or Azure Data Lake Storage) into a dedicated SQL pool in a single, parallelized operation. This method leverages the MPP (Massively Parallel Processing) architecture of Synapse SQL pools to efficiently ingest large volumes of data for batch processing.

Exam trap

The trap here is that candidates often confuse the COPY INTO command with PolyBase or CTAS, or mistakenly think Azure Data Factory's native sink can directly write to a dedicated SQL pool without PolyBase or staging.

245

MCQeasy

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query returns fewer rows than expected. What should you check first?

A.Ensure the external table has the correct schema definition.

B.Check that the Azure AD identity has read permissions on the storage account.

C.Check the compression codec used in the Parquet files.

D.Verify the file path and pattern in the OPENROWSET query.

AnswerD

Incorrect file path or pattern can cause missing files or partitions.

Why this answer

Option D is correct because when using OPENROWSET in Azure Synapse serverless SQL pool to query Parquet files, the most common reason for fewer rows than expected is an incorrect file path or pattern. If the path or pattern is too restrictive (e.g., missing a wildcard or pointing to a subfolder instead of the root), the query will only read a subset of the files, resulting in fewer rows. This is the first thing to verify before investigating schema or permissions issues.

Exam trap

The trap here is that candidates often jump to schema or permission issues first, but the most frequent cause of missing rows in serverless SQL pool queries is an overly restrictive file path or pattern in the OPENROWSET query.

How to eliminate wrong answers

Option A is wrong because an incorrect schema definition would typically cause data type conversion errors or NULL values, not a reduction in row count; the query would still read all rows but might fail to parse them. Option B is wrong because if the Azure AD identity lacked read permissions, the query would fail entirely with an authorization error, not return fewer rows. Option C is wrong because the compression codec (e.g., snappy, gzip) does not affect the number of rows returned; Parquet files are self-describing and the serverless SQL pool automatically handles decompression regardless of codec.
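
To make the path-pattern effect concrete, here is a minimal sketch using Python's `fnmatch` module against a hypothetical folder layout. The file names are invented for illustration, and `fnmatch` wildcard rules are only a stand-in: they are not identical to the wildcard rules OPENROWSET applies.

```python
from fnmatch import fnmatchcase

# Hypothetical file layout in the data lake (names are illustrative only)
files = [
    "sales/year=2023/part-0.parquet",
    "sales/year=2024/part-0.parquet",
    "sales/year=2024/part-1.parquet",
]


def matched(pattern: str) -> list:
    """Return the files selected by a wildcard pattern."""
    return [f for f in files if fnmatchcase(f, pattern)]


narrow = matched("sales/year=2024/*.parquet")  # points at one subfolder
broad = matched("sales/*/*.parquet")           # covers every year folder
print(len(narrow), len(broad))
```

The narrower pattern silently reads a subset of the files, so the query succeeds but returns fewer rows.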

246

MCQhard

You are running a data transformation pipeline in Azure Synapse Spark that writes output to Delta tables. You notice that the job eventually slows down and then fails with an out-of-memory error. The input data size is 1 TB, and the cluster has 10 nodes with 16 GB memory each. What is the most likely cause?

A.The driver node does not have enough memory to collect the results

B.The data is not partitioned properly, leading to large partitions that exceed executor memory

C.The Delta table is being written in non-optimized format causing memory pressure

D.The transformation involves a wide dependency causing excessive shuffle

AnswerB

Unpartitioned data can result in a few large partitions that cause OOM. Increasing parallelism or repartitioning can help.

Why this answer

The most likely cause is that the data is not partitioned properly, leading to large partitions that exceed executor memory. In Azure Synapse Spark, each executor has a limited memory (16 GB per node in this cluster), and if a single partition is too large to fit in memory, the task processing that partition will fail with an out-of-memory error. Proper partitioning ensures that data is evenly distributed across executors, preventing any single partition from overwhelming available memory.

Exam trap

The trap here is that candidates often confuse out-of-memory errors with driver-side collection (Option A) or shuffle-related issues (Option D), but the specific context of writing to Delta tables points to executor memory exhaustion from oversized partitions, not driver memory or shuffle overhead.

How to eliminate wrong answers

Option A is wrong because the driver node collects results only for actions like `collect()` or `show()`, but writing to Delta tables does not require collecting results to the driver; the failure is on executor tasks, not the driver. Option C is wrong because Delta tables are inherently optimized (using Parquet format with ACID transactions), and writing in non-optimized format is not a concept; memory pressure is caused by partition size, not the table format. Option D is wrong because while wide dependencies (e.g., groupBy, join) can cause excessive shuffle, the question specifically describes a slowdown and out-of-memory error during writing, which is more directly tied to partition size rather than shuffle overhead.

247

Multi-Selectmedium

Which TWO of the following are valid ways to handle late-arriving data in a streaming solution with Azure Stream Analytics? (Choose two.)

Select 2 answers

A.Reprocess the entire stream from the beginning when late data is detected.

B.Implement a custom Azure Function as a 'LateDataHandler' in the query.

C.Use a reference data input to store late-arriving events.

D.Configure the 'late arrival tolerance' window in the event ordering settings up to 21 days.

E.Use a temporal join to combine the late-arriving event with the historical window.

AnswersD, E

Stream Analytics allows setting a late arrival tolerance window to handle events that arrive after the event time.

Why this answer

Option D is correct because Azure Stream Analytics allows you to configure a 'late arrival tolerance' window in the event ordering settings, which can be set up to a maximum of 21 days. This window defines how long the service will wait to accommodate events that arrive after their timestamp, reordering them within that tolerance before processing. Option E is correct because a temporal join (e.g., a JOIN whose ON clause bounds the time difference with DATEDIFF) can combine a late-arriving event with historical data from a reference or stream window, enabling you to retroactively correct aggregations or state.

Exam trap

The trap here is that candidates confuse the 'late arrival tolerance' with a simple delay setting, not realizing it is a reordering buffer up to 21 days, and they overlook temporal joins as a valid pattern for handling late data, instead assuming only external functions or full reprocessing are options.

248

MCQmedium

You have an Azure Stream Analytics job that reads from an Event Hub and writes to Azure SQL Database. The job processes high-velocity IoT sensor data. You notice that the output to SQL Database is slower than expected and the job's watermark delay is increasing. What should you do to improve throughput?

A.Partition the output by a column like DeviceId.

B.Disable late arrival and out-of-order event handling.

C.Increase the Streaming Units (SU) of the job.

D.Decrease the window size in the query.

AnswerA

Partitioning allows parallel writes to SQL.

Why this answer

Partitioning the output by a column like DeviceId allows Azure Stream Analytics to write to multiple SQL Database tables or use partitioned tables, enabling parallel writes. This reduces contention and improves throughput because the job can distribute the load across multiple write operations, directly addressing the bottleneck caused by high-velocity IoT sensor data overwhelming a single output stream.

Exam trap

The trap here is that candidates often assume increasing compute resources (Streaming Units) always solves performance issues, but they overlook that the bottleneck is frequently at the output sink, requiring architectural changes like partitioning rather than scaling.

How to eliminate wrong answers

Option B is wrong because disabling late arrival and out-of-order event handling does not improve output throughput; it only changes how events are timestamped and may cause data loss or inaccuracies without addressing the write bottleneck. Option C is wrong because increasing Streaming Units (SU) allocates more compute resources to the job, but if the bottleneck is at the SQL Database output (e.g., write limits or lack of partitioning), adding SUs will not improve throughput and may even increase backpressure. Option D is wrong because decreasing the window size in the query reduces the amount of data aggregated per window, but it does not affect the rate at which output rows are written to SQL Database; the bottleneck remains at the output sink.

249

MCQeasy

You need to transform data in Azure Databricks using Apache Spark. The data is stored in Delta Lake format in Azure Data Lake Storage Gen2. Which method should you use to read the data into a Spark DataFrame?

A.spark.read.parquet('abfss://container@storage.dfs.core.windows.net/path')

B.spark.read.format('delta').load('abfss://container@storage.dfs.core.windows.net/path')

C.spark.read.csv('abfss://container@storage.dfs.core.windows.net/path')

D.spark.read.json('abfss://container@storage.dfs.core.windows.net/path')

AnswerB

Delta format correctly reads the table including transaction log.

Why this answer

Option B is correct because the data is stored in Delta Lake format, which requires using the 'delta' format reader in Spark to properly read the transaction log and schema. The `spark.read.format('delta').load()` method is the standard way to read Delta tables, leveraging the Delta Lake protocol for ACID transactions and time travel capabilities.

Exam trap

The trap here is that candidates may assume Delta Lake files are just Parquet files and use `spark.read.parquet()`, missing the critical role of the Delta transaction log for consistency and ACID compliance.

How to eliminate wrong answers

Option A is wrong because `spark.read.parquet()` reads only Parquet files and ignores Delta Lake's transaction log, leading to stale or inconsistent data. Option C is wrong because `spark.read.csv()` is for CSV files, not Delta Lake format. Option D is wrong because `spark.read.json()` is for JSON files, not Delta Lake format.
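
The difference between the two readers can be sketched with a toy model of a transaction log. This is not the real Delta log format, only the idea behind it: each commit records which data files were added or removed, and the file names here are hypothetical.

```python
# Toy model: each commit in the log adds and/or removes data files.
# This is NOT the real Delta log format, only the idea behind it.
log = [
    {"add": ["part-0.parquet", "part-1.parquet"], "remove": []},
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},  # rewrite of part-0
]

# Files physically present: removed files linger until they are cleaned up.
files_on_disk = {"part-0.parquet", "part-1.parquet", "part-2.parquet"}


def active_files(commits):
    """Replay the log to find the files in the current table version."""
    active = set()
    for commit in commits:
        active |= set(commit["add"])
        active -= set(commit["remove"])
    return active


current = active_files(log)          # what a log-aware reader would use
stale = files_on_disk - current      # what a plain directory scan also picks up
print(sorted(current))
print(sorted(stale))
```

A reader that scans the folder as plain Parquet would include the superseded file, which is how stale or duplicated rows can appear.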

250

Multi-Selectmedium

Which TWO actions can you take to optimize query performance in Azure Synapse Analytics dedicated SQL pool?

Select 2 answers

A.Use hash distribution on a low-cardinality column

B.Use round-robin distribution for fact tables

C.Use replicated tables for small dimension tables

D.Create materialized views for common aggregations

E.Increase the DWU setting after every query

AnswersC, D

Replicated tables eliminate data shuffling for joins with fact tables.

Why this answer

Correct: C and D. Using replicated tables avoids data movement for small dimension tables, and materialized views pre-compute aggregations. A is wrong because hash distribution works well on a high-cardinality column; a low-cardinality column concentrates rows in a few distributions.

E is wrong because increasing DWU improves performance but is not a design optimization; it's a scaling action. B is wrong because round-robin distribution is not optimal for fact tables in star schemas.
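
The skew caused by hashing a low-cardinality column can be seen in a small simulation. A dedicated SQL pool spreads each table across 60 distributions; the `crc32` hash below is only a stand-in for the service's own hash function, and the sample column values are hypothetical.

```python
import zlib
from collections import Counter

DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions


def distribution_counts(values):
    """Count rows per distribution; crc32 is a stand-in for the real hash."""
    return Counter(zlib.crc32(str(v).encode()) % DISTRIBUTIONS for v in values)


low_card = distribution_counts(["EU", "US", "APAC"] * 10_000)  # 3 distinct values
high_card = distribution_counts(range(30_000))                 # 30,000 distinct values

print("low cardinality: ", len(low_card), "distributions used, largest holds", max(low_card.values()))
print("high cardinality:", len(high_card), "distributions used, largest holds", max(high_card.values()))
```

With only three distinct values, at most three distributions receive any rows, so most of the pool's parallelism goes unused.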

251

MCQeasy

You are using Azure Data Factory to copy data from an Azure SQL Database to Azure Synapse dedicated SQL pool. The copy activity uses PolyBase as the copy method. The activity fails with the error 'Operation not supported: PolyBase cannot write to a table with clustered columnstore index'. What should you do to resolve this error?

A.Use an external table as the sink instead of a regular table

B.Create the target table as a heap or with clustered index

C.Enable staging with blob storage and use 'Allow PolyBase'

D.Change the copy method from PolyBase to Bulk Insert

AnswerB

PolyBase does not support writing to a clustered columnstore index directly. The table must be a heap or have a clustered index.

Why this answer

PolyBase in Azure Data Factory cannot write directly to a table that has a clustered columnstore index (CCI). The sink table must be a heap or have a clustered index for PolyBase to work. Option B correctly identifies this requirement, as creating the target table as a heap or with a clustered index resolves the error.

Exam trap

The trap here is that candidates often assume staging with blob storage (Option C) or switching to Bulk Insert (Option D) are the only workarounds, but the question specifically tests the PolyBase requirement that the sink table must not have a clustered columnstore index.

How to eliminate wrong answers

Option A is wrong because using an external table as the sink would require additional setup and is not a direct fix for the PolyBase CCI limitation; PolyBase can write to external tables, but the error specifically occurs when writing to a regular table with CCI. Option C is wrong because enabling staging with blob storage and 'Allow PolyBase' is used for staging-based PolyBase loads, but it does not bypass the requirement that the final sink table must not have a CCI. Option D is wrong because changing the copy method from PolyBase to Bulk Insert would work but is not the optimal or recommended fix; the question asks what should be done to resolve the error, and the correct approach is to adjust the table schema to support PolyBase, not to abandon PolyBase entirely.

252

MCQhard

You are designing a data processing solution using Azure Databricks with Delta Lake. You need to ensure ACID transactions and schema enforcement. Which feature should you enable?

A.Auto Loader

B.Delta Lake format

C.Photon engine

D.Unity Catalog

AnswerB

Delta Lake provides ACID transactions, schema enforcement, and time travel.

Why this answer

Delta Lake is the correct choice because it provides ACID transactions (atomicity, consistency, isolation, durability) and schema enforcement (schema-on-write) on top of cloud storage like Azure Data Lake Storage. These features are inherent to the Delta Lake format, which uses a transaction log to track changes and enforce data integrity, making it ideal for reliable data processing in Azure Databricks.

Exam trap

Microsoft often tests the distinction between features that provide data governance (Unity Catalog) versus features that provide data reliability at the storage layer (Delta Lake), leading candidates to confuse Unity Catalog's metadata management with Delta Lake's transactional guarantees.

How to eliminate wrong answers

Option A is wrong because Auto Loader is a feature for incrementally ingesting new files from cloud storage, not for providing ACID transactions or schema enforcement. Option C is wrong because the Photon engine is a high-performance vectorized query engine that accelerates query execution but does not manage ACID transactions or schema constraints. Option D is wrong because Unity Catalog is a centralized metadata and governance layer for managing data assets, permissions, and lineage, but it does not directly enforce ACID transactions or schema enforcement at the table level.

253

MCQmedium

Refer to the exhibit. You have an Azure Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL Database. The copy activity uses a preCopyScript to truncate the destination table before writing. During a recent run, the copy activity failed after the truncation, leaving the destination table empty. You need to prevent data loss in future failures. What should you modify?

A.Add a retry policy to the copy activity.

B.Enable fault tolerance in the copy activity source.

C.Use a staging table and then a stored procedure to swap tables.

D.Remove the preCopyScript and set writeBatchSize to 0.

AnswerC

Staging table ensures atomicity: copy to staging, then swap with destination.

Why this answer

Option C is correct because using a staging table prevents data loss: copy to staging, then swap. Option B is wrong because the fault tolerance setting only handles errors but does not roll back truncation. Option D is wrong because turning off truncation would not ensure atomicity.

Option A is wrong because retry would re-execute truncation and fail again.
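
The staging pattern can be sketched with Python's built-in `sqlite3` module standing in for Azure SQL Database. The table names `orders` and `orders_staging` and the helper `load_with_staging` are hypothetical, and a real pipeline would perform the swap in a stored procedure rather than in client code.

```python
import sqlite3


def load_with_staging(conn, rows):
    """Load into staging first, then swap into the destination in one transaction."""
    # Step 1: a failure here rolls back and never touches `orders`.
    with conn:
        conn.execute("DELETE FROM orders_staging")
        conn.executemany("INSERT INTO orders_staging (id, amount) VALUES (?, ?)", rows)
    # Step 2: replace the destination contents atomically.
    with conn:
        conn.execute("DELETE FROM orders")
        conn.execute("INSERT INTO orders SELECT id, amount FROM orders_staging")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
conn.execute("CREATE TABLE orders_staging (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
conn.execute("INSERT INTO orders VALUES (1, 10.0)")
conn.commit()

# A bad batch (NULL amount) fails during the staging load.
try:
    load_with_staging(conn, [(2, 20.0), (3, None)])
except sqlite3.IntegrityError:
    pass
rows_after_failure = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()

# A good batch replaces the destination contents.
load_with_staging(conn, [(2, 20.0), (3, 30.0)])
rows_after_success = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print(rows_after_failure, rows_after_success)
```

Because the destination is only modified after the staging load has fully succeeded, a failed copy leaves the previous data in place.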

254

MCQhard

You are working with a Delta Lake table in Azure Databricks. The table is updated frequently with new data and occasionally with updates to existing rows. You need to optimize read performance for queries that filter on a specific date column. The table is partitioned by date. Which optimization technique should you apply?

A.Run OPTIMIZE on the table.

B.Run ZORDER BY on the date column.

C.Run ANALYZE STATISTICS on the table.

D.Run VACUUM to clean up old versions.

AnswerB

Z-ordering co-locates column data, improving data skipping for filters on that column.

Why this answer

Option B is correct because Z-ordering reduces the amount of data scanned for filter columns. Option A is wrong because OPTIMIZE on its own only compacts small files and does not reorder data for filtering. Option D is wrong because VACUUM removes old files and does not optimize reads.

Option C is wrong because ANALYZE STATISTICS computes stats but does not physically reorganize data.
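
Data skipping can be illustrated with a one-dimensional toy: each "file" is a list of column values, and a reader may skip any file whose minimum and maximum exclude the filter value. Real Z-ordering interleaves several columns and works on file-level statistics; the layouts and helper below are simplified assumptions.

```python
def files_to_scan(files, value):
    """Count files whose [min, max] range could contain `value`."""
    return sum(1 for rows in files if min(rows) <= value <= max(rows))


values = list(range(100))

# Interleaved layout: file i holds i, i+10, i+20, ... so every range overlaps.
interleaved = [values[i::10] for i in range(10)]
# Clustered layout: file i holds 10*i .. 10*i+9, so ranges are disjoint.
clustered = [values[i * 10:(i + 1) * 10] for i in range(10)]

print(files_to_scan(interleaved, 42))
print(files_to_scan(clustered, 42))
```

When similar values are co-located, the min/max statistics become selective and most files can be skipped for a point filter.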

255

Multi-Selectmedium

Which TWO components are required to set up a streaming data pipeline using Azure Synapse Analytics? (Select two.)

Select 2 answers

A.Azure Data Factory

B.Azure Event Hubs

C.Azure Analysis Services

D.Azure Blob Storage

E.Azure Synapse Pipelines (or Spark)

AnswersB, E

Event Hubs ingests streaming events.

Why this answer

Option B is correct because Event Hubs is a common ingestion source for streaming. Option E is correct because Azure Synapse Pipelines (or the underlying Spark engine) can process streaming data. Option A is wrong because Data Factory is for batch.

Option C is wrong because Analysis Services is for OLAP. Option D is wrong because Blob Storage is a destination, not a required component.

256

Multi-Selecthard

You are optimizing an Azure Synapse Analytics pipeline that uses mapping data flows. The pipeline runs slowly when processing 100 GB of data. Which THREE settings should you adjust to improve performance?

Select 3 answers

A.Set 'partition option' to 'Round robin' with a higher number of partitions.

B.Set a tumbling window trigger to run the pipeline every 5 minutes.

C.Increase the 'Compute type' to 'Memory Optimized' and increase the number of cores.

D.Enable 'Data flow debug' to monitor execution details.

E.Use 'Optimize shuffle' in the data flow settings.

AnswersA, C, E

Round robin partitioning distributes data evenly across partitions, improving parallelism.

Why this answer

Options A, C, and E are correct. Round-robin partitioning with more partitions improves parallelism.

Optimize shuffle reduces data movement. Compute type and core count determine the resources available to the data flow. Option D (data flow debug) is for development and testing, not production performance.

Option B (tumbling window trigger) controls when the pipeline runs, not how quickly a run completes.

257

MCQeasy

You are using Azure Synapse Analytics to process streaming data from Azure Event Hubs. The data must be written to a Delta Lake table in ADLS Gen2 with exactly-once semantics. Which processing engine should you use?

A.Azure Databricks with Structured Streaming

B.Azure Synapse serverless SQL pool

C.Azure Synapse Pipeline with Mapping Data Flow

D.Azure Stream Analytics

AnswerC

Mapping Data Flows in Synapse can achieve exactly-once semantics when writing to Delta Lake.

Why this answer

Option C is correct because Azure Synapse Pipeline with Mapping Data Flow supports Delta Lake as a sink and can be configured to use Spark-based execution for streaming data from Event Hubs, enabling exactly-once semantics through checkpointing and idempotent writes. Mapping Data Flow runs on Spark clusters within Synapse, providing the necessary transactional guarantees for Delta Lake tables.

Exam trap

The trap here is that candidates often assume Azure Databricks is the only option for Delta Lake with exactly-once semantics, overlooking that Azure Synapse Pipeline with Mapping Data Flow provides equivalent functionality within the Synapse ecosystem, which is the focus of the DP-203 exam.

How to eliminate wrong answers

Option A is wrong because Azure Databricks with Structured Streaming can write to Delta Lake with exactly-once semantics, but it is not a native Azure Synapse Analytics component; the question specifies using Azure Synapse Analytics, making Databricks an external service. Option B is wrong because Azure Synapse serverless SQL pool is designed for on-demand querying of data in data lakes, not for processing streaming data or writing to Delta Lake tables with exactly-once semantics. Option D is wrong because Azure Stream Analytics does not natively support Delta Lake as an output sink; it writes to Azure Blob Storage, ADLS Gen2, or Event Hubs in formats like Parquet or Avro, but lacks the transactional capabilities required for exactly-once semantics in Delta Lake.

258

MCQmedium

Your organization has an Azure Data Factory pipeline that executes a series of activities to transform data. One of the activities is an Azure Databricks notebook that should run only if the previous activity succeeds. You need to configure the pipeline to handle failures gracefully and send an email alert if the Databricks activity fails. What should you do?

A.Add a failure output path from the Databricks activity to a Web activity that calls an email API.

B.Configure a retry policy and a timeout for the Databricks activity.

C.Use a Schedule trigger to run the pipeline and check for failures using Azure Monitor.

D.Set a dependency condition on the Databricks activity to 'Succeeded' and add a Send Email activity on the success path.

AnswerA

You can route failure output to a Web activity to send an email via Logic Apps or Azure Functions.

Why this answer

Option A is correct because you can add a failure path to the Databricks activity and attach a Web activity to send email. Option D is wrong because its dependency condition and email activity are on the success path, not the failure path. Option B is wrong because a retry policy and timeout on the activity do not send alerts.

Option C is wrong because a Schedule trigger does not handle failures.

259

Multi-Selecteasy

Which TWO features of Azure Databricks help manage data governance and security for sensitive data?

Select 2 answers

A.Structured Streaming

B.Secret Scopes

C.Auto Loader

D.Delta Live Tables

E.Unity Catalog

AnswersB, E

Secret Scopes securely store and manage access tokens and keys.

Why this answer

Secret Scopes (B) allow secure storage and referencing of sensitive credentials (e.g., API keys, database passwords) in Azure Databricks, preventing hardcoding in notebooks. Unity Catalog (E) provides fine-grained access control, data lineage, and centralized metadata management across workspaces, enabling governance of sensitive data through policies and auditing.

Exam trap

The trap here is that candidates confuse data processing features (Structured Streaming, Auto Loader, Delta Live Tables) with governance/security tools, because all are part of the Databricks ecosystem but serve fundamentally different purposes.

260

MCQhard

Refer to the exhibit. You are an Azure data engineer responsible for ensuring that all storage accounts used in data pipelines enforce HTTPS traffic. You apply the Azure Policy definition shown above. Later, a data engineer creates a new storage account with 'Enable secure transfer' set to Disabled. What will happen when the policy is evaluated?

A.The storage account will be created, but the policy will be evaluated later during a compliance scan.

B.The storage account will be created with HTTPS enabled automatically.

C.The storage account creation will be denied and the request will fail.

D.The storage account will be created, but an audit event will be logged.

AnswerC

The policy denies the creation if the condition is met.

Why this answer

Option C is correct because the policy denies creation of storage accounts that do not have HTTPS enforced (supportsHttpsTrafficOnly = false). The policy is evaluated at creation time and will deny the request. Option D is wrong because the policy has a deny effect, not audit.

Option B is wrong because the policy does not set the property automatically. Option A is wrong because the policy is evaluated during creation, not after.

261

MCQhard

You have a production pipeline in Azure Data Factory that copies data from an on-premises SQL Server to Azure Blob Storage using a self-hosted integration runtime. The pipeline fails intermittently with a 'Connection closed' error. The data volume is 50 GB per run. What should you first troubleshoot to resolve this issue?

A.Increase the memory and CPU resources on the self-hosted integration runtime machine and check network stability.

B.Increase the 'connection timeout' setting in the linked service to 30 minutes.

C.Change the copy activity to use staged copy with Azure Blob Storage as an intermediate store.

D.Disable fault tolerance in the copy activity to improve performance.

AnswerA

The self-hosted IR needs sufficient resources for large data transfers; 'Connection closed' often indicates resource exhaustion or network interruptions.

Why this answer

Option A is correct because the self-hosted IR's memory and network stability are common causes of 'Connection closed' errors with large data volumes. Option D (disable fault tolerance) would actually make the issue worse. Option C (change to staged copy) might help but is not the first step.

Option B (increase the connection timeout) addresses only establishing the connection, not active data transfers.

262

Multi-Selecthard

Which THREE factors should you consider when choosing between Azure Data Factory Mapping Data Flows and Azure Synapse Spark pools for data transformation?

Select 3 answers

A.Scheduling: Only Data Flows can be scheduled via triggers.

B.Ease of use: Mapping Data Flows provide a visual designer, while Spark requires code.

C.Data volume: Data Flows are limited to 100 GB, while Spark can handle petabytes.

D.Integration with other services: Data Flows can use integration runtimes, while Spark is limited to Synapse.

E.Debugging: Data Flows have a debug session limit of 8 hours, while Spark pools have no debug limit.

AnswersB, D, E

Data Flows are no-code, Spark requires coding.

Why this answer

Option B is correct because Azure Data Factory Mapping Data Flows offer a visual, no-code designer for building data transformations, which lowers the barrier for users who are not proficient in programming. In contrast, Azure Synapse Spark pools require writing code in languages like PySpark, Scala, or SQL, making them more suitable for developers comfortable with coding. This distinction directly addresses ease of use as a key factor in choosing between the two services.

Exam trap

The trap here is that candidates assume Mapping Data Flows have a hard data volume limit (like 100 GB) or that Spark pools cannot be scheduled, when in fact both services are highly scalable and can be orchestrated via triggers, and the key differentiator is the coding versus visual interface.

263

MCQeasy

You are designing a data processing solution using Azure Databricks. The data is stored in Delta Lake format. You need to ensure that when you read the latest version of the table, you only see committed data and not uncommitted transactions. Which isolation level should you use?

A.WriteSerializable

B.Serializable

C.ReadUncommitted

D.SnapshotIsolation

AnswerD

Delta Lake uses Snapshot isolation to read the latest committed version.

Why this answer

Snapshot isolation is the correct choice because it provides a consistent view of the table by reading only the latest committed data, ignoring any uncommitted transactions. In Delta Lake, snapshot isolation ensures that readers see a snapshot of the table at a specific version, which includes only committed changes, making it ideal for read consistency without blocking concurrent writes.

Exam trap

The trap here is that candidates confuse write isolation levels (like WriteSerializable) with read isolation levels, or assume that Serializable is always the safest choice for consistency, when in fact Snapshot isolation is the specific Delta Lake mechanism for reading only committed data without blocking.

How to eliminate wrong answers

Option A (WriteSerializable) is wrong because it is a write isolation level that ensures serializable isolation for write operations, but it does not control read behavior to exclude uncommitted data. Option B (Serializable) is wrong because it is a general isolation level that prevents dirty reads, non-repeatable reads, and phantom reads, but it is not specifically designed to guarantee that only committed data is visible in Delta Lake; it also imposes more overhead than needed. Option C (ReadUncommitted) is wrong because it allows reading uncommitted data (dirty reads), which directly violates the requirement to see only committed data.

264

MCQmedium

You are troubleshooting a failed Azure Synapse Pipeline execution. The pipeline uses a Copy activity to load data from an on-premises SQL Server to Azure Data Lake Storage Gen2. The error indicates a 'Connection timeout' to the on-premises source. The Integration Runtime is Self-Hosted and has been running successfully for months. What is the most likely cause?

A.The SQL Server authentication credentials have expired.

B.The Self-Hosted Integration Runtime is not installed.

C.The on-premises network configuration has changed.

D.The Azure Storage account firewall is blocking access.

AnswerC

Network changes can block connectivity to the SQL Server.

Why this answer

Option C is correct because a change in the on-premises network configuration (e.g., firewall rules, DNS) could block the Integration Runtime from reaching the SQL Server. Option B is wrong because the Integration Runtime was running, so installation is not the issue. Option D is wrong because the storage account is the destination, not the source.

Option A is wrong because the Integration Runtime was successful before, so credentials are likely valid.

265

MCQhard

You have a Data Factory pipeline that runs a U-SQL script in Azure Data Lake Analytics. The script processes terabytes of data and outputs to a CSV file. The pipeline is failing with the error: 'The job failed with UserError: Script execution failed.' You need to troubleshoot the issue. Which approach should you take first?

A.Change the output format to Parquet to reduce file size.

B.Review the job logs in Azure Data Lake Analytics to identify the specific script error.

C.Migrate the U-SQL script to Azure Synapse Spark pool.

D.Increase the degree of parallelism for the U-SQL job.

AnswerB

Job logs provide detailed error messages that pinpoint the issue.

Why this answer

The most effective first step is to examine the detailed job logs in Data Lake Analytics, which contain the actual script error. Increasing parallelism or changing output format may not address the underlying script error. Moving to Azure Synapse is a larger architectural change.

266

Multi-Selecteasy

You are developing a data pipeline in Azure Data Factory that ingests data from multiple on-premises SQL Server databases to Azure Data Lake Storage Gen2. The data volume is about 1 TB per day. You need to ensure the pipeline can handle the volume and provide monitoring and alerting. Which THREE components should you include?

Select 3 answers

A.Azure Synapse Analytics pipeline

B.Self-hosted integration runtime

C.Azure Monitor

D.Power BI

E.Data flow activity

AnswersB, C, E

Required to connect to on-premises SQL Server.

Why this answer

Correct answers: B, C, and E. A self-hosted integration runtime is required for on-premises connectivity. Azure Monitor provides monitoring and alerting.

A data flow activity can be used for data transformation at scale. Option A is wrong because Azure Synapse Analytics is not used in this pipeline. Option D is wrong because Power BI is for visualization, not data ingestion.

Practice this question →

267

MCQmedium

A data engineer is tasked with optimizing a Spark job in Azure Synapse Analytics that processes 10 TB of data daily. The job currently uses 50 executors with 4 cores each. The performance is bottlenecked by shuffle operations. The engineer wants to reduce shuffle data size. Which technique should be applied?

A.Coalesce the number of partitions before the shuffle.

B.Increase the number of shuffle partitions.

C.Broadcast all tables to avoid shuffles.

D.Use column pruning to select only required columns before shuffle.

AnswerD

Reduces data volume.

Why this answer

Option D is correct because column pruning reduces the amount of data shuffled by eliminating unnecessary columns. Option B is incorrect because increasing the number of shuffle partitions raises parallelism and may add shuffle overhead, but it does not shrink the data being shuffled. Option A is incorrect because coalescing reduces partitions but does not reduce shuffle data.

Option C is incorrect because broadcasting is for small tables, not 10 TB.
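As a rough illustration of why column pruning shrinks a shuffle, the following plain-Python sketch (not Spark code) compares the serialized size of full rows with rows projected down to the columns a hypothetical aggregation needs. The row layout, the `shuffle_bytes` helper, and the use of JSON length as a stand-in for shuffle size are all assumptions made for illustration.

```python
import json

# Hypothetical wide rows; only customer_id and amount are needed downstream.
rows = [
    {
        "order_id": i,
        "customer_id": i % 100,
        "amount": 9.99,
        "notes": "x" * 200,      # wide column the aggregation never reads
        "address": "y" * 100,    # another unused wide column
    }
    for i in range(1000)
]


def shuffle_bytes(records):
    """Approximate the bytes that would cross the network, using JSON length."""
    return sum(len(json.dumps(record).encode("utf-8")) for record in records)


needed = ("customer_id", "amount")

# Column pruning: keep only the required columns before the shuffle.
pruned = [{name: row[name] for name in needed} for row in rows]

full_size = shuffle_bytes(rows)
pruned_size = shuffle_bytes(pruned)
print(f"full rows: {full_size} bytes, pruned rows: {pruned_size} bytes")
```

In Spark the equivalent step is selecting the required columns before the join or aggregation, so that the wide columns are never serialized into shuffle files.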

Practice this question →

268

MCQhard

You are designing a data processing solution for a healthcare organization. The solution must process streaming data from IoT devices and store it in Azure Data Lake Storage Gen2. The data must be available for both real-time dashboards and historical analysis. You need to minimize operational overhead. What should you do?

A.Ingest data via Azure Functions and write to Data Lake Storage; use Power BI to query Data Lake

B.Use Azure Stream Analytics to output to both Power BI and Data Lake Storage

C.Use Azure Databricks with Structured Streaming to write to Data Lake Storage and use Power BI DirectQuery

D.Ingest data to Azure Event Hubs, then use Event Hubs Capture to store in Data Lake Storage; use Power BI with Event Hubs

AnswerB

Stream Analytics is serverless, supports real-time output to Power BI and batch writes to Data Lake.

Why this answer

Using Azure Stream Analytics with output to both Power BI (real-time) and Data Lake Storage (historical) is a common pattern with minimal overhead. Option A is wrong because Azure Functions would require custom code for batching. Option D is wrong because Event Hubs doesn't provide real-time dashboards.

Option C is wrong because Azure Databricks would require cluster management.

Practice this question →

269

MCQmedium

Refer to the exhibit. You run the KQL query in Azure Data Explorer. What is the output?

A.Name: Alice, Age: 30; Name: Charlie, Age: 35

B.Id: 1, Name: Alice, Age: 30; Id: 3, Name: Charlie, Age: 35

C.Name: Alice, Age: 30; Name: Bob, Age: 25; Name: Charlie, Age: 35

D.Name: Alice, Age: 30; Name: Bob, Age: 25; Name: Charlie, Age: 35

AnswerA

Only Alice and Charlie have Age >25.

Why this answer

The KQL query uses the `take` operator to return a specified number of rows from the table. Since the query does not include an `order by` clause, the rows returned are non-deterministic but will be the first rows encountered in the data shard. In this case, the query `take 2` returns two rows, which are Alice (Age 30) and Charlie (Age 35), as shown in the exhibit.

The `project` operator then selects only the Name and Age columns, so the output is exactly those two rows with those columns.

Exam trap

The trap here is that candidates often assume `take` returns the first N rows in the order they appear in the table (like a top-N query without ordering), but without an explicit `order by`, the rows are non-deterministic and depend on data shard layout, leading to confusion when the output does not match the expected sequence.

How to eliminate wrong answers

Option B is wrong because it includes the Id column, but the query uses `project Name, Age` which explicitly excludes the Id column. Option C is wrong because it shows three rows (Alice, Bob, Charlie), but the query uses `take 2` which limits the output to exactly two rows. Option D is wrong for the same reason as Option C — it shows three rows instead of two, and also includes Bob who is not in the result set.

Practice this question →

270

MCQmedium

Your organization uses Azure Synapse Analytics. You need to design a data transformation pipeline that processes streaming data from Azure Event Hubs, performs aggregations over a 5-minute tumbling window, and loads the results into a dedicated SQL pool table. Which Azure service should you use to implement the streaming transformation?

A.Azure Stream Analytics

B.Azure Data Factory

C.Apache Spark for Azure Synapse

D.Azure Functions

AnswerA

Azure Stream Analytics can ingest from Event Hubs, perform tumbling window aggregations, and output to Synapse SQL pool.

Why this answer

Option A is correct because Azure Stream Analytics is the appropriate service for real-time stream processing with windowed aggregations. Option B is wrong because Azure Data Factory is for batch orchestration. Option C is wrong because Spark Structured Streaming is for big data workloads but less integrated with SQL pools.

Option D is wrong because Azure Functions is not designed for streaming windowed aggregations.

Practice this question →

271

Multi-Selecteasy

A company uses Azure Data Lake Storage Gen2 as the data lake. The data engineering team needs to ensure that sensitive data such as credit card numbers are masked when queried by non-admin users. The solution must be implemented within the data lake without moving data to another store. Which TWO features should they use? (Choose two.)

Select 2 answers

A.Azure Policy to audit access

B.Azure SQL Database dynamic data masking

C.Azure Synapse Serverless SQL with dynamic data masking

D.Microsoft Purview data classification and labeling

E.Azure Storage blob-level access policies

AnswersC, D

Can mask data in queries over external tables.

Why this answer

Options C and D are correct. Azure Synapse Serverless SQL can create external tables over Data Lake storage and apply dynamic data masking. Microsoft Purview can classify sensitive columns for policy.

Option E is incorrect because ADLS does not have built-in masking. Option B is incorrect because Azure SQL Database is a different store. Option A is incorrect because Azure Policy does not mask data.

Practice this question →

272

MCQmedium

You are designing a data lakehouse architecture in Azure using Delta Lake. The solution needs to process batch and streaming data from multiple sources, including IoT devices and CRM systems. You need to ensure data quality by enforcing schema validation and handling schema evolution. You also need to provide a unified catalog for querying. Which service should you use?

A.Azure Purview

B.Azure Data Lake Storage Gen2

C.Azure Synapse Analytics serverless SQL pool

D.Azure Databricks Unity Catalog

AnswerD

Provides schema enforcement, evolution, and a unified catalog.

Why this answer

Option D is correct because Azure Databricks Unity Catalog provides a unified governance solution for data and AI, including schema enforcement and evolution for Delta Lake. Option A is wrong because Azure Purview is for data discovery and lineage, not for schema enforcement. Option C is wrong because the Azure Synapse Analytics serverless SQL pool is a query engine but does not provide the same schema management features as Unity Catalog.

Option B is wrong because Azure Data Lake Storage is storage only.

Practice this question →

273

MCQeasy

You are designing a data processing solution in Azure Databricks to transform streaming data from Azure Event Hubs. The data must be aggregated in 1-minute tumbling windows and written to Azure Synapse Analytics. Which Spark API should you use?

A.RDD API

B.Structured Streaming

C.Spark Streaming (DStreams)

D.DataFrame API with batch processing

AnswerB

Supports windowed aggregations and streaming sinks.

Why this answer

Structured Streaming is the correct choice because it provides native support for event-time-based aggregations, such as 1-minute tumbling windows, and integrates seamlessly with Azure Event Hubs as a streaming source and Azure Synapse Analytics as a streaming sink using the `foreachBatch` or `writeStream` API. It offers exactly-once semantics and automatic state management for windowed operations, which are essential for reliable streaming ETL.

Exam trap

The trap here is that candidates confuse the older Spark Streaming (DStreams) API with Structured Streaming, assuming both are equally capable for event-time windows, but DStreams lack native event-time support and are deprecated in favor of Structured Streaming.

How to eliminate wrong answers

Option A is wrong because the RDD API operates at a low level without built-in support for event-time windowing, stateful aggregation, or streaming sinks like Azure Synapse Analytics, requiring manual implementation of checkpointing and fault tolerance. Option C is wrong because Spark Streaming (DStreams) uses micro-batch processing with a DStream API that is now in maintenance mode and lacks native event-time handling, making tumbling window aggregations more complex and less efficient compared to Structured Streaming. Option D is wrong because the DataFrame API with batch processing is designed for static data, not continuous streaming; it cannot process unbounded data from Event Hubs in real time or maintain state for tumbling windows without additional custom orchestration.
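To make the idea of a tumbling window concrete, here is a minimal plain-Python sketch (not Spark code) that assigns events to fixed, non-overlapping 1-minute windows by event time and sums a value per window. The event tuples and the `window_start` and `tumbling_window_sums` helpers are hypothetical; Structured Streaming layers state management and fault tolerance on top of this basic bucketing.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=1)  # tumbling window length from the question
EPOCH = datetime(1970, 1, 1)


def window_start(event_time: datetime, window: timedelta = WINDOW) -> datetime:
    """Return the start of the fixed window that contains event_time."""
    whole_windows = (event_time - EPOCH) // window  # integer count of windows
    return EPOCH + whole_windows * window


def tumbling_window_sums(events):
    """Sum values per window; every event falls into exactly one window."""
    sums = defaultdict(float)
    for event_time, value in events:
        sums[window_start(event_time)] += value
    return dict(sums)


# Hypothetical (event time, reading) pairs, deliberately out of order:
# bucketing uses the event's own timestamp, not its arrival order.
events = [
    (datetime(2024, 1, 1, 12, 0, 5), 10.0),
    (datetime(2024, 1, 1, 12, 1, 10), 7.0),
    (datetime(2024, 1, 1, 12, 0, 59), 5.0),
    (datetime(2024, 1, 1, 12, 1, 0), 1.0),
]

result = tumbling_window_sums(events)
for start in sorted(result):
    print(start.isoformat(), result[start])
```

Because windows do not overlap, an event at exactly 12:01:00 belongs to the 12:01 window rather than the 12:00 window.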

Practice this question →

274

Multi-Selectmedium

Which TWO are valid ways to process data in Azure Synapse Analytics?

Select 2 answers

A.Use Logic Apps to run data transformations.

B.Use Azure Functions to process data in a serverless manner.

C.Use Synapse SQL pool to run T-SQL queries.

D.Use Power BI to transform data.

E.Use Synapse Spark notebooks to run Scala code.

AnswersC, E

Synapse SQL pool provides distributed query processing.

Why this answer

Option C is correct because Synapse SQL pool (formerly SQL DW) is a dedicated or serverless SQL engine within Azure Synapse Analytics that allows you to run T-SQL queries for data transformation, loading, and querying. It is a first-class compute resource designed for large-scale data warehousing workloads, making T-SQL queries a valid and primary method for processing data in Synapse. Option E is correct as well: Synapse Spark notebooks support Scala, so Scala code running on an Apache Spark pool is also a native way to process data in Synapse.

Exam trap

The trap here is that candidates confuse general Azure services (Logic Apps, Functions, Power BI) with native Synapse Analytics processing capabilities, forgetting that only Synapse SQL and Synapse Spark are first-class compute engines within the service.

Practice this question →

275

MCQmedium

You are developing a data processing pipeline in Azure Synapse Analytics. The pipeline uses a mapping data flow to transform data from Azure Data Lake Storage Gen2 to a dedicated SQL pool. The data flow includes a Derived Column transformation that uses the expression: `iif(isNull(Column1), 'Default', Column1)`. However, the transformation is not handling NULL values correctly. What is the most likely cause?

A.The Derived Column transformation does not support the iif function; use a Conditional Split instead.

B.The column data type is not string; convert Column1 to string first.

C.The function names in the expression are case-sensitive; use 'isNull' instead of 'isnull'.

D.The expression must use ternary operator syntax: `Column1 == null ? 'Default' : Column1`.

AnswerC

Mapping data flows are case-sensitive; the correct function name is 'isNull'.

Why this answer

Option C is the keyed answer because it treats function names in mapping data flow expressions as case-sensitive: the expected spelling is `isNull` (capital 'N'), and an all-lowercase `isnull` would fail to handle NULL values as intended.

The expression printed in the question already uses `isNull`, so the scenario only makes sense if the expression actually deployed in the data flow was typed as `isnull`. Under that reading, the most likely cause is incorrect casing of the function name.

Exam trap

The trap here is that candidates assume function names in Azure Synapse mapping data flows are case-insensitive like in SQL, but they are actually case-sensitive, causing a seemingly correct expression to fail due to a subtle casing error.

How to eliminate wrong answers

Option A is wrong because the Derived Column transformation fully supports the `iif` function for conditional logic; a Conditional Split is not required for simple null handling. Option B is wrong because the `iif` function can handle any data type, and converting to string is unnecessary; the issue is not about data type conversion. Option D is wrong because mapping data flows do not support ternary operator syntax (`? :`); they use the `iif` function for conditional expressions.

Practice this question →

276

MCQeasy

You need to perform incremental data loading from Azure SQL Database to Azure Data Lake Storage Gen2 using Azure Data Factory. Which approach is the most efficient?

A.Use a lookup activity to retrieve the last watermark value, then copy only new records with a filter.

B.Use a tumbling window trigger with a data flow that processes all data each time.

C.Use a mapping data flow with a full load and then use Azure Databricks to deduplicate.

D.Copy the entire table every time and use Azure Synapse serverless SQL to filter duplicates.

AnswerA

Watermark pattern is efficient and well-supported.

Why this answer

Option A is correct because it uses a lookup activity to retrieve the last watermark value (e.g., a timestamp or incrementing key), then copies only new or changed records via a filter in the Copy activity. This minimizes data movement and processing time, making it the most efficient incremental loading approach in Azure Data Factory.

Exam trap

Microsoft often tests the misconception that any trigger-based or full-load approach can be adapted for incremental loading, but the key is minimizing data movement; candidates may overlook the watermark pattern and choose a full-load option because they think deduplication later solves the problem.

How to eliminate wrong answers

Option B is wrong because a tumbling window trigger with a data flow that processes all data each time performs a full load on every run, ignoring incremental logic and wasting resources. Option C is wrong because performing a full load followed by deduplication in Azure Databricks is inefficient; it moves all data repeatedly and adds unnecessary compute overhead. Option D is wrong because copying the entire table every time and using Azure Synapse serverless SQL to filter duplicates still transfers all data, incurring high egress costs and defeating the purpose of incremental loading.

Practice this question →

277

MCQeasy

You are a data engineer at a retail company. You need to develop a data processing solution in Azure Synapse Analytics that reads sales transactions from Parquet files stored in Azure Data Lake Storage Gen2, transforms the data by aggregating daily sales per store, and writes the results to a dedicated SQL pool table for reporting. The transformation logic must be reusable and maintained in a source control system. You want to minimize administrative overhead and leverage serverless resources where possible. Which approach should you recommend?

A.Use Azure Data Factory with Mapping Data Flows to read Parquet files, perform aggregations, and write to the dedicated SQL pool.

B.Use a serverless SQL pool to query the Parquet files via OPENROWSET, then use CETAS to write the aggregated results to the dedicated SQL pool using an external table.

C.Create an Azure Synapse Spark notebook that reads Parquet files, performs aggregation using PySpark, and writes the results to the dedicated SQL pool using the Spark Synapse connector.

D.Use PolyBase in a dedicated SQL pool to create external tables over the Parquet files, then use INSERT...SELECT to load aggregated data into the target table.

AnswerB

Serverless SQL pool can read Parquet natively, and CETAS allows writing to dedicated SQL pool via external table.

Why this answer

Option B is correct because a serverless SQL pool can query Parquet files directly, and the CREATE EXTERNAL TABLE AS SELECT (CETAS) statement can transform and store results in a dedicated SQL pool table using PolyBase. This approach uses serverless resources for transformation and avoids managing Spark pools. Option D is wrong because PolyBase external tables run on the dedicated SQL pool's provisioned compute, so the transformation does not use serverless resources.

Option A is wrong because Azure Data Factory with Mapping Data Flows runs on Spark clusters, incurring more overhead. Option C is wrong because using a notebook in a Spark pool requires provisioning a Spark pool, which adds administrative overhead.

Practice this question →

278

MCQeasy

You need to process a large number of small files (each < 1 MB) from Azure Blob Storage in Azure Synapse Analytics. The processing is I/O-bound due to many small file operations. Which approach should you use to improve performance?

A.Use wildcard paths to read multiple files at once.

B.Enable optimized write on the Spark session.

C.Convert the files to a binary format like Avro before processing.

D.Use 'spark.sql.files.maxPartitionBytes' to coalesce small files into larger partitions.

AnswerD

This configuration merges small files into larger partitions, reducing overhead.

Why this answer

Option D is correct because `spark.sql.files.maxPartitionBytes` controls the maximum number of bytes packed into a single partition when reading files. By increasing this value, Spark coalesces many small files into fewer, larger partitions, reducing the overhead of task scheduling and I/O operations. This directly addresses the I/O-bound bottleneck caused by processing numerous small files in Azure Synapse Analytics.

Exam trap

The trap here is that candidates confuse file format conversion (Avro) or write optimization with read-side partition coalescing, failing to recognize that the core issue is the number of partitions created during file scanning, not the data format or write behavior.

How to eliminate wrong answers

Option A is wrong because wildcard paths only simplify file selection; they do not change how files are packed into partitions or reduce the number of file operations. Option B is wrong because 'optimized write' is a Delta Lake feature that improves write performance by reducing small file output, but it does not help when reading existing small files from Blob Storage. Option C is wrong because converting to Avro changes the serialization format but does not inherently reduce the number of file read operations; the small file problem persists regardless of format.
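The effect of the setting can be sketched with a simplified packing model in plain Python. This is not Spark's actual implementation: the `pack_files` helper is hypothetical, and it assumes each file is charged its size plus a fixed open cost (standing in for `spark.sql.files.openCostInBytes`) and that a partition is closed once the next file would push it past the limit. The file count and sizes are made up.

```python
MB = 1024 * 1024


def pack_files(file_sizes, max_partition_bytes, open_cost_bytes):
    """Greedily pack files into partitions without exceeding the byte limit."""
    partitions, current, current_bytes = [], [], 0
    for size in file_sizes:
        cost = size + open_cost_bytes  # each file also pays a fixed open cost
        if current and current_bytes + cost > max_partition_bytes:
            partitions.append(current)   # close the full partition
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += cost
    if current:
        partitions.append(current)
    return partitions


# Hypothetical workload: 1,000 files of 512 KB each.
files = [512 * 1024] * 1000

default_parts = pack_files(files, max_partition_bytes=128 * MB, open_cost_bytes=4 * MB)
larger_parts = pack_files(files, max_partition_bytes=512 * MB, open_cost_bytes=4 * MB)

print(len(default_parts), "partitions at 128 MB")
print(len(larger_parts), "partitions at 512 MB")
```

Under these assumptions the sketch produces 36 partitions at 128 MB and 9 at 512 MB, which shows the direction of the effect: a larger limit means fewer, larger read tasks for the same set of small files.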

Practice this question →

279

MCQmedium

You are designing a data pipeline to ingest streaming data from IoT devices into Azure Synapse Analytics. The data must be available for querying with minimal latency, but you also need to handle spikes in throughput without data loss. Which service should you use as the ingestion layer?

A.Azure Data Lake Storage Gen2

B.Azure Blob Storage

C.Azure IoT Hub

D.Azure Event Hubs

AnswerD

Event Hubs is optimized for high-throughput streaming data ingestion with buffering.

Why this answer

Azure Event Hubs is the correct choice because it is a fully managed, real-time data ingestion service optimized for high-throughput streaming data from millions of IoT devices. It provides at-least-once delivery, supports partitioning for massive scale, and integrates natively with Azure Synapse Analytics via the Synapse Pipeline or Event Hubs Capture to handle throughput spikes without data loss.

Exam trap

The trap here is that candidates often confuse Azure IoT Hub with Event Hubs, assuming IoT Hub is the default for all IoT streaming, but IoT Hub is designed for device management and lower-throughput telemetry, whereas Event Hubs is the dedicated high-throughput ingestion service for analytics pipelines.

How to eliminate wrong answers

Option A is wrong because Azure Data Lake Storage Gen2 is a hierarchical file storage service designed for batch and analytical workloads, not for real-time streaming ingestion; it lacks native event streaming and buffering capabilities. Option B is wrong because Azure Blob Storage is an object storage service for unstructured data, not built for low-latency, high-throughput event ingestion; it would require additional services like Event Hubs to capture streaming data. Option C is wrong because Azure IoT Hub is a managed service for bi-directional communication with IoT devices, but it is not optimized for high-volume streaming ingestion into Synapse; it is better suited for device management and command/control scenarios, and its default throughput is lower than Event Hubs for pure data ingestion.

Practice this question →

280

Multi-Selectmedium

Which TWO strategies reduce data movement in Azure Synapse Analytics pipelines? (Choose two.)

Select 2 answers

A.Use Data Flow to transform data before loading

B.Use Stored Procedure activity to insert data

C.Use serverless SQL pool to query data in place

D.Use Copy activity to move data from source to staging

E.Use PolyBase to load data from external tables

AnswersC, E

Queries without moving data.

Why this answer

Option C is correct because serverless SQL pools in Azure Synapse Analytics allow you to query data directly from files in Azure Data Lake Storage or other external sources without moving the data into a dedicated SQL pool. This eliminates data movement entirely by using the compute resources of the serverless pool to process queries in place, leveraging the OPENROWSET or CREATE EXTERNAL TABLE syntax to read data from its original location.

Exam trap

The trap here is that candidates confuse 'reducing data movement' with 'optimizing data movement'—they may think Data Flow or Copy activity with staging reduces movement when in fact they still move data, whereas serverless SQL and PolyBase query data in place without relocation.

Practice this question →

281

MCQeasy

You are designing a batch processing solution in Azure Data Factory. The source is an Azure Blob Storage container with CSV files. The target is an Azure SQL Database. The pipeline must run daily and incrementally load only new or changed rows. Which Data Factory feature should you use?

A.Mapping Data Flow with a watermark column

B.Stored Procedure activity to run a merge statement

C.Copy activity with a full load every time

D.Change data capture (CDC) resource

AnswerD

CDC enables incremental loading by tracking changes.

Why this answer

Option D is correct because Change Data Capture (CDC) in Azure Data Factory enables incremental loading by capturing insert, update, and delete operations from the source. For Azure Blob Storage CSV files, CDC can be implemented using the 'LastModifiedDate' or a watermark column to identify new or changed rows, avoiding full reloads and reducing latency.

Exam trap

The trap here is that candidates confuse CDC with a full load or assume a stored procedure can handle incremental logic without source-side change tracking, missing that CDC is the only native incremental load feature for Blob Storage in Azure Data Factory.

How to eliminate wrong answers

Option A is wrong because Mapping Data Flow with a watermark column is not a built-in feature for CDC; it requires custom logic to track changes and does not natively handle incremental loads from Blob Storage. Option B is wrong because a Stored Procedure activity to run a merge statement is a target-side operation that assumes the source already provides changed data, but it does not solve the problem of identifying new or changed rows in the source CSV files. Option C is wrong because a Copy activity with a full load every time contradicts the requirement for incremental loading, leading to unnecessary data transfer and performance degradation.

Practice this question →

282

MCQmedium

You are building a data processing solution that requires exactly-once semantics when writing to Azure Event Hubs from Azure Stream Analytics. Which output format should you configure?

A.JSON

B.CSV

C.Parquet

D.Avro

AnswerD

Avro format in Stream Analytics provides exactly-once semantics when writing to Event Hubs.

Why this answer

Azure Stream Analytics supports exactly-once semantics when writing to Event Hubs only when using Avro serialization. This is because Avro provides a compact binary format with embedded schema, enabling Stream Analytics to track and deduplicate events precisely during output, which is required for exactly-once delivery. JSON, CSV, and Parquet do not support the necessary metadata and checkpointing mechanisms for exactly-once guarantees in this specific integration.

Exam trap

The trap here is that candidates often assume JSON is the default or most compatible format for streaming outputs, but they overlook that exactly-once semantics require a binary format with embedded schema support, which only Avro provides in this specific Azure Stream Analytics to Event Hubs integration.

How to eliminate wrong answers

Option A is wrong because JSON is a text-based format that does not support the schema evolution and binary encoding required for Stream Analytics to enforce exactly-once semantics to Event Hubs. Option B is wrong because CSV lacks schema information and binary encoding, making it impossible for Stream Analytics to guarantee exactly-once delivery due to potential data loss or duplication during serialization. Option C is wrong because Parquet is a columnar storage format optimized for analytics and batch processing, not for real-time streaming output to Event Hubs, and Stream Analytics does not support exactly-once semantics with Parquet output to Event Hubs.

Practice this question →

283

MCQeasy

You are building a data pipeline in Azure Data Factory to copy data from an on-premises SQL Server database to Azure Blob Storage. The pipeline must run daily and handle incremental updates. The on-premises SQL Server table has a LastModifiedDate column that is updated when a row changes. What is the most efficient way to implement incremental loads?

A.Enable Change Data Capture (CDC) on the SQL Server database and use an ADF mapping data flow to read changes.

B.Use a Lookup activity to get the maximum LastModifiedDate from the destination, then use a Copy activity with a query that filters rows where LastModifiedDate > that value.

C.Use a tumbling window trigger with a window size of 1 day and copy all data from the source each time.

D.Perform a full load every day and use a Delete activity to remove duplicates.

AnswerB

This is the standard watermark pattern for incremental loads.

Why this answer

Option B is correct because it uses a Lookup activity to retrieve the maximum LastModifiedDate from the destination (Azure Blob Storage), then passes that value as a parameter to a Copy activity that queries only rows where LastModifiedDate exceeds it. This minimizes data transfer by reading only new or changed rows, and it avoids the overhead of enabling Change Data Capture or performing full loads, making it the most efficient approach for incremental loads in Azure Data Factory.

Exam trap

The trap here is that candidates often overcomplicate the solution by choosing CDC (Option A) or full loads (Options C and D), when a simple watermark-based query using the existing LastModifiedDate column is the most efficient and straightforward approach for incremental loads in Azure Data Factory.

How to eliminate wrong answers

Option A is wrong because enabling Change Data Capture (CDC) on SQL Server introduces additional overhead and complexity, and using an ADF mapping data flow for CDC is less efficient than a simple query-based incremental load when a LastModifiedDate column exists. Option C is wrong because using a tumbling window trigger with a 1-day window and copying all data each time performs a full load every run, which is inefficient and wasteful for incremental updates. Option D is wrong because performing a full load daily and then using a Delete activity to remove duplicates is extremely inefficient, as it transfers all data every day and requires additional processing to identify and delete duplicates, negating the benefits of incremental loading.
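The watermark pattern can be sketched outside Data Factory in a few lines of Python. In this illustration an in-memory SQLite database stands in for the on-premises SQL Server source, the `orders` table and the `incremental_copy` helper are hypothetical, and only the `LastModifiedDate` column comes from the scenario. The stored watermark plays the role of the Lookup activity's output, and the filtered query plays the role of the Copy activity's source query.

```python
import sqlite3

# In-memory SQLite stands in for the on-premises SQL Server source (assumption).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, LastModifiedDate TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T08:00:00"),
        (2, 20.0, "2024-01-02T09:30:00"),
        (3, 30.0, "2024-01-03T11:15:00"),
    ],
)


def incremental_copy(conn, watermark):
    """Return rows changed after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, LastModifiedDate FROM orders "
        "WHERE LastModifiedDate > ? ORDER BY LastModifiedDate",
        (watermark,),
    ).fetchall()
    # Advance the watermark only if something was copied.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


# First daily run: everything after the previously stored watermark.
first_batch, watermark = incremental_copy(source, "2024-01-01T23:59:59")

# A row changes at the source before the next run.
source.execute(
    "UPDATE orders SET amount = 11.0, LastModifiedDate = '2024-01-04T07:00:00' "
    "WHERE id = 1"
)

# Second run copies only the changed row.
second_batch, watermark = incremental_copy(source, watermark)
print(first_batch, second_batch, watermark)
```

ISO 8601 timestamps are stored as text here, so string comparison matches chronological order; a real source would compare native datetime values.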

Practice this question →

284

MCQmedium

Refer to the exhibit. You have an Azure Data Factory pipeline that copies trade data from Azure Blob Storage to Azure SQL Database. The pipeline runs every hour and truncates the destination table before each copy. However, users report that data is missing during the copy window. What is the most likely cause?

A.The output dataset is not correctly configured

B.The writeBatchSize is too low, causing timeouts

C.The preCopyScript truncates the table before the copy completes, causing a period with no data

D.The source dataset is set to recursive, which includes unwanted files

AnswerC

Table is empty during copy.

Why this answer

Option C is correct. The preCopyScript truncates the table before copying, so the table is empty while the copy runs and stays empty if the copy fails. Option B is wrong because writeBatchSize does not cause data loss.

Option D is wrong because the recursive setting does not affect data loss. Option A is wrong because the input and output datasets are configured correctly.

Practice this question →

285

Multi-Selecthard

You are tuning a Spark job in Azure Synapse Analytics that processes large Parquet files. The job currently takes too long due to data skew. Which three actions can improve performance? (Choose three.)

Select 3 answers

A.Use bucketing on the join key when writing intermediate data.

B.Add a salt key to the join column to distribute the load.

C.Use coalesce to reduce the number of partitions.

D.Increase executor memory to handle larger partitions.

E.Repartition the data on the skewed column.

AnswersA, B, E

Bucketing pre-partitions data to avoid shuffle and reduce skew.

Why this answer

Options A, B, and E are correct. Repartitioning redistributes data. Salting adds a random key to break skew.

Bucketing pre-partitions data. Option C is wrong because coalesce reduces partitions but does not address skew. Option D is wrong because increasing executor memory may help but does not directly solve skew.
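Salting can be illustrated with a small plain-Python sketch (not Spark code). It assumes a hypothetical fact table in which one customer owns most rows, appends a salt to each key so the hot key splits into several smaller groups, and replicates the small side of the join once per salt so every salted key still finds a match. The salt count, key names, and the round-robin salt assignment are assumptions; Spark jobs commonly use a random salt instead.

```python
from collections import Counter
from itertools import cycle

SALT_BUCKETS = 8  # hypothetical number of salt values

# Skewed side of the join: one hot key dominates.
rows = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]


def largest_group(keys):
    """Size of the biggest key group, i.e. the work of the slowest task."""
    return max(Counter(keys).values())


# Append a salt to every key (round-robin here for determinism).
salts = cycle(range(SALT_BUCKETS))
salted_rows = [(key, next(salts)) for key in rows]

# The small side must be replicated once per salt so the join still matches.
dimension = ["hot_customer"] + [f"customer_{i}" for i in range(1000)]
salted_dimension = [(key, salt) for key in dimension for salt in range(SALT_BUCKETS)]

before = largest_group(rows)
after = largest_group(salted_rows)
print(f"largest key group before salting: {before}, after: {after}")
```

The trade-off is visible in the sketch: the largest group shrinks by a factor of `SALT_BUCKETS`, while the small side of the join grows by the same factor.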

Practice this question →

286

MCQeasy

Your team is developing a data processing solution using Azure Databricks. The data is stored in Delta Lake format in Azure Data Lake Storage Gen2. You need to ensure that when multiple jobs concurrently write to the same Delta table, the operations are atomic and consistent. Which Delta Lake feature should you use?

A.Enable Optimized Write on the Delta table.

B.Enable Auto Optimize on the Delta table.

C.Rely on Delta Lake's built-in ACID transactions.

D.Use Dynamic Partition Pruning in your Spark jobs.

AnswerC

Delta Lake provides ACID transactions, ensuring atomic and consistent concurrent writes.

Why this answer

Delta Lake provides built-in ACID (Atomicity, Consistency, Isolation, Durability) transactions that guarantee atomic and consistent concurrent writes. When multiple jobs write to the same Delta table, Delta Lake uses a transaction log (stored as JSON files in the `_delta_log` directory) to serialize writes, ensuring that each write is either fully committed or rolled back, preventing partial updates or data corruption.

Exam trap

The trap here is that candidates confuse performance-tuning features (Optimized Write, Auto Optimize, Dynamic Partition Pruning) with transactional guarantees, assuming they provide atomicity or consistency when they only address file layout or query speed.

How to eliminate wrong answers

Option A is wrong because Optimized Write is a performance feature that reduces the number of small files written by coalescing partitions, but it does not provide atomicity or consistency guarantees for concurrent writes. Option B is wrong because Auto Optimize is a Delta Lake feature that automatically compacts small files and optimizes file layout, but it does not handle transactional concurrency or atomicity. Option D is wrong because Dynamic Partition Pruning is a Spark SQL optimization that improves query performance by skipping irrelevant partitions during joins, not a mechanism for ensuring atomic or consistent concurrent writes.
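
The idea of serializing writers through a log can be sketched in plain Python. This is a simplified, conceptual model and not Delta Lake's actual implementation: each commit tries to create the next numbered log file exclusively, and a writer that loses the race retries with the next version. The function names and file naming here are illustrative assumptions.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: dict) -> bool:
    """Try to commit `actions` as `version`; return False if that version exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return True


def commit_with_retry(log_dir: str, actions: dict, max_attempts: int = 10) -> int:
    """Commit at the next free version, retrying if another writer wins the race."""
    for _ in range(max_attempts):
        version = len(os.listdir(log_dir))  # next unused version number
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many concurrent commits")


log_dir = tempfile.mkdtemp()
v0 = commit_with_retry(log_dir, {"add": "part-0000.parquet"})
# A second writer that still believes version 0 is free loses the race...
lost_race = try_commit(log_dir, 0, {"add": "part-0001.parquet"})
# ...and succeeds after re-reading the log and retrying.
v1 = commit_with_retry(log_dir, {"add": "part-0001.parquet"})
print(v0, lost_race, v1)
```

Because a version file either exists in full or not at all from a reader's point of view, a reader never observes a half-applied commit in this model.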

287

MCQeasy

You are designing a data processing solution in Azure Databricks. The data is stored in Azure Data Lake Storage Gen2 and you need to perform transformations using Apache Spark. The security requirements mandate that all data in transit must be encrypted and that the storage account must not be accessible from the public internet. What should you configure?

A.Enable the storage account firewall and add a private endpoint for Azure Databricks to use.

B.Disable TLS on the storage account and use a shared access signature (SAS) token for authentication.

C.Use Azure Databricks with VNet injection and configure a service endpoint for the storage account.

D.Configure the storage account to use HTTPS only and enable firewall rules to allow only Azure services.

AnswerA

Private endpoint ensures private connectivity and encryption in transit.

Why this answer

Option A is correct because it satisfies both security requirements: encrypting data in transit and preventing public internet access. A private endpoint uses Azure Private Link to connect Azure Databricks to the storage account over the Microsoft backbone network, ensuring all traffic stays within the Azure network and is encrypted via TLS. The storage account firewall is then configured to deny all public traffic, so only the private endpoint can access the storage account, meeting the 'not accessible from the public internet' mandate.

Exam trap

The trap here is confusing service endpoints (which still use public endpoints) with private endpoints (which use private IPs and fully isolate the resource from the internet), leading candidates to pick Option C thinking VNet injection plus service endpoints provides complete public internet isolation.

How to eliminate wrong answers

Option B is wrong because disabling TLS removes encryption in transit, violating the requirement that all data in transit must be encrypted; SAS tokens provide authentication but do not encrypt the channel. Option C is wrong because a service endpoint still targets the storage account's public endpoint: it restricts access to selected virtual network subnets through the storage firewall but does not give the account a private IP address, so it does not meet the 'not accessible from the public internet' requirement; VNet injection alone does not enforce private connectivity. Option D is wrong because enabling HTTPS only and allowing only Azure services still leaves the storage account accessible from the public internet (Azure services can originate from public IPs), failing the requirement to block all public internet access.

288

MCQeasy

You need to execute a T-SQL stored procedure in Azure Synapse dedicated SQL pool that performs a large data load. The stored procedure takes approximately 45 minutes to run. You want to monitor the progress and see the statement text currently being executed. Which dynamic management view (DMV) should you query?

A.sys.dm_workload_management_workload_groups_stats

B.sys.dm_pdw_exec_requests (with status='running')

C.sys.dm_pdw_waits

D.sys.dm_pdw_exec_requests

AnswerB

Filtering by status='running' shows currently executing requests with the command text.

Why this answer

B is correct because `sys.dm_pdw_exec_requests` with the filter `WHERE status = 'running'` returns the currently executing requests in an Azure Synapse dedicated SQL pool, including the full statement text in the `command` column. This allows you to see the exact T-SQL being executed during the 45-minute data load, enabling progress monitoring.

Exam trap

The trap here is that candidates often choose D without the `status='running'` filter, thinking it shows current execution, but it returns all historical requests, requiring an additional filter to isolate the active one.

How to eliminate wrong answers

Option A is wrong because `sys.dm_workload_management_workload_groups_stats` provides statistics about workload group resource utilization (e.g., CPU, memory), not the currently executing statement text. Option C is wrong because `sys.dm_pdw_waits` shows wait information (e.g., locks, waits on resources) but does not include the statement text of running requests. Option D is wrong because `sys.dm_pdw_exec_requests` without the `status='running'` filter returns all requests (completed, failed, cancelled, etc.), not just the currently executing one, so you would not see the active statement text without additional filtering.

289

MCQhard

Refer to the exhibit. You have a Mapping Data Flow in Azure Data Factory that reads JSON files from a folder partitioned by year/month/day. The source setting includes a row limit of 10,000. The sink writes Parquet files with a file pattern and partition columns. You notice that the job processes only the first 10,000 rows from the entire dataset instead of 10,000 rows per partition. How should you modify the data flow to achieve row limit per partition?

A.Remove the rowLimit from source settings. In the Optimize tab of the source, set 'Partition option' to 'Set partitioning' with 'Value' as 'year,month'. Then add a 'Top N' transformation after the source to limit rows per partition.

B.Add a 'RowNumber' transformation after the source, then filter rows where row number <= 10000 per partition.

C.In the source settings, change rowLimit to a dynamic expression that uses partition variables to limit per partition.

D.Use a 'Distinct' transformation with 'All columns' to remove duplicate rows, which inherently limits rows.

AnswerA

This approach partitions the data first and then applies a per-partition row limit using Top N.

Why this answer

The row limit in the source settings applies to the entire dataset, not to each partition. To limit rows per partition, remove the row limit from the source and apply the limit after the data has been partitioned. Mapping Data Flows have no dedicated per-partition 'Top N' transformation, so in practice this is done by ranking rows within each partition with a 'Window' transformation and then filtering on the rank.

Option A is the closest match: it removes the source row limit, sets partitioning in the 'Optimize' tab, and then limits rows per partition. Option B is wrong because Data Flows do not have a 'RowNumber' transformation. Option C is wrong because the row limit in the source settings cannot be set per partition.

Option D is wrong because 'Distinct' removes duplicate rows; it does not limit rows.
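
The "rank within each partition, then filter" idea can be sketched in plain Python. This is a conceptual illustration rather than Data Flow expression syntax; the function name, sample rows, and limit are assumptions made for the example.

```python
from collections import defaultdict


def limit_per_partition(rows, partition_keys, limit):
    """Keep at most `limit` rows for each distinct combination of partition_keys."""
    rank = defaultdict(int)
    kept = []
    for row in rows:
        part = tuple(row[k] for k in partition_keys)
        rank[part] += 1          # rank of this row within its partition
        if rank[part] <= limit:  # filter on the rank
            kept.append(row)
    return kept


# Hypothetical sample: 3 rows in 2024/1, 1 row in 2024/2, 4 rows in 2023/12.
rows = (
    [{"year": 2024, "month": 1, "id": i} for i in range(3)]
    + [{"year": 2024, "month": 2, "id": 10}]
    + [{"year": 2023, "month": 12, "id": i} for i in range(20, 24)]
)

global_limit = rows[:2]  # what a dataset-wide limit of 2 would return
per_partition = limit_per_partition(rows, ["year", "month"], limit=2)

print(len(global_limit), len(per_partition))
```

A dataset-wide limit returns rows from the first partition only, while the per-partition version keeps up to two rows from each year/month combination.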

290

MCQhard

You are developing a data processing solution in Azure Synapse Analytics. The solution must support both batch and streaming data ingestion into a dedicated SQL pool. You need to ensure that data from streaming sources is available for queries within 5 seconds. Which approach should you use?

A.Use Azure Stream Analytics with a custom SQL function that writes directly to the dedicated SQL pool

B.Use Azure Databricks with Structured Streaming, write to Data Lake Storage, and then use PolyBase to load into SQL pool

C.Use Azure Data Factory with tumbling window triggers to load data from Event Hubs every 5 seconds

D.Use Event Hubs Capture to write to Data Lake Storage, then use PolyBase to load into the SQL pool every 5 seconds

AnswerA

Stream Analytics can achieve sub-second latency and write directly to SQL pool via a stored procedure.

Why this answer

Option A is correct because Azure Stream Analytics with a custom function that writes to the dedicated SQL pool via a stored procedure can achieve low latency. Option C is wrong because Azure Data Factory is batch-only. Option D is wrong because Event Hubs Capture writes in batches (minutes).

Option B is wrong because Spark Structured Streaming to Data Lake Storage followed by a PolyBase load introduces additional latency.

291

MCQeasy

You need to process a large number of CSV files stored in Azure Data Lake Storage Gen2 using Azure Databricks. The files are nested in multiple folders, and the schema varies slightly between files. You want to automatically infer the schema and handle schema evolution. Which read option should you use?

A.spark.read.option("mergeSchema","true").csv(path)

B.spark.read.format("delta").load(path)

C.spark.read.option("inferSchema","true").csv(path)

D.spark.read.csv(path)

AnswerA

Infers and merges schemas.

Why this answer

Option A (spark.read.option("mergeSchema","true").csv(path)) is correct because mergeSchema enables automatic schema inference and merging across files with different schemas. Option D loads without schema evolution. Option C infers the schema but does not merge.

Option B is for Delta Lake, not CSV.
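
To make "merging schemas across files" concrete, here is a plain-Python sketch of the concept. It does not call Spark; the helper name and the sample file contents are assumptions for illustration. Columns are collected as the union of all headers, and rows from files that lack a column get `None` for it.

```python
import csv
import io


def read_with_merged_schema(csv_texts):
    """Read several CSV documents and return (columns, rows) over the union of headers."""
    columns = []
    parsed = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)  # first-seen order is preserved
        parsed.extend(reader)
    # Fill columns that a file did not have with None.
    rows = [{c: r.get(c) for c in columns} for r in parsed]
    return columns, rows


# Hypothetical files whose schemas differ slightly.
file_a = "id,name\n1,Ann\n"
file_b = "id,name,country\n2,Bob,NO\n"

columns, rows = read_with_merged_schema([file_a, file_b])
print(columns)
print(rows)
```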

292

MCQmedium

You are designing a data processing solution for a media company that uses Azure Synapse Analytics. The solution must process video metadata stored in Azure Cosmos DB and combine it with user interaction data from Azure Data Lake Storage Gen2. The combined data must be stored in a dedicated SQL pool for reporting. The data volume is moderate, and the processing should be done using T-SQL. Which approach should you use?

A.Use Azure Data Factory with Copy activities to load data from Cosmos DB and ADLS Gen2 into the dedicated SQL pool, then use a stored procedure to merge.

B.Use Azure Databricks to read from Cosmos DB and ADLS Gen2, transform with Spark SQL, and write to the dedicated SQL pool using JDBC.

C.Use the serverless SQL pool to create external tables on Cosmos DB and ADLS Gen2, then use CREATE EXTERNAL TABLE AS SELECT (CETAS) to write the combined data to a table in the dedicated SQL pool.

D.Use PolyBase to create external tables on Cosmos DB and ADLS Gen2, then use INSERT...SELECT to load into dedicated SQL pool.

AnswerC

Leverages T-SQL and serverless pool for in-place querying.

Why this answer

Option C is correct because Azure Synapse Analytics serverless SQL pool can query Cosmos DB via Azure Cosmos DB analytical store and ADLS Gen2 via OPENROWSET, and then you can use CETAS to write results to the dedicated SQL pool. Option A is wrong because Azure Data Factory would require copy activities and may not be as efficient. Option B is wrong because Azure Databricks uses Spark, not T-SQL.

Option D is wrong because PolyBase cannot directly query Cosmos DB.

293

MCQeasy

You are using Azure Synapse Pipelines to perform an incremental load from Azure SQL Database to Azure Synapse Analytics. You need to identify rows that have changed since the last load. Which approach should you use?

A.Compare the current data with a snapshot using T-SQL MERGE.

B.Truncate and reload the entire table daily.

C.Use a watermark column such as LastModifiedDate.

D.Enable Change Data Capture (CDC) on the source table.

AnswerD

CDC captures all changes and supports incremental load.

Why this answer

Option D is correct because Change Data Capture (CDC) on the source Azure SQL Database captures insert, update, and delete operations in change tables, enabling Azure Synapse Pipelines to efficiently identify only the changed rows since the last load. This approach minimizes data movement and processing overhead compared to full or snapshot-based comparisons, making it the recommended pattern for incremental loads in Synapse Pipelines.

Exam trap

The trap here is that candidates often choose the watermark column approach (Option C) because it seems simpler, but they overlook that CDC is the only option that natively captures all DML changes (including deletes) without requiring schema modifications or custom logic to handle edge cases like out-of-order updates.

How to eliminate wrong answers

Option A is wrong because comparing current data with a snapshot using T-SQL MERGE requires storing a full snapshot and performing a row-by-row comparison, which is resource-intensive and does not leverage Synapse Pipelines' native incremental load capabilities. Option B is wrong because truncating and reloading the entire table daily defeats the purpose of incremental loading: it causes an unnecessary full data transfer and is not an incremental approach at all. Option C is wrong because a watermark column such as LastModifiedDate only reveals rows whose timestamp has advanced; it cannot detect deletes, it misses changes where the timestamp is not maintained, and it depends on the source reliably updating the column on every change.
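
The difference between a watermark query and a change log can be shown with a small plain-Python sketch. The in-memory list stands in for the source table, and `change_log` is a hypothetical stand-in for CDC change tables; none of this is the actual CDC API.

```python
from datetime import datetime


def changed_since(rows, watermark):
    """Watermark approach: rows whose last_modified is later than the last load."""
    return [r for r in rows if r["last_modified"] > watermark]


last_load = datetime(2024, 1, 1)
source = [
    {"id": 1, "amount": 10, "last_modified": datetime(2023, 12, 30)},
    {"id": 2, "amount": 20, "last_modified": datetime(2023, 12, 31)},
]
change_log = []  # hypothetical stand-in for CDC change tables

# After the last load: row 1 is updated and row 2 is deleted.
source[0].update(amount=15, last_modified=datetime(2024, 1, 2))
change_log.append(("update", 1))
deleted = source.pop(1)
change_log.append(("delete", deleted["id"]))

watermark_ids = [r["id"] for r in changed_since(source, last_load)]
print("watermark sees:", watermark_ids)
print("change log sees:", change_log)
```

The watermark query returns only the updated row, because the deleted row no longer exists to be selected; the change log records both operations.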

294

MCQhard

Refer to the exhibit. You created an external table in Azure Synapse Analytics serverless SQL pool to query Parquet files. Queries return no rows even though the files exist. What is the most likely issue?

A.The CREDENTIAL is missing

B.The FILE_FORMAT is incorrectly specified

C.The DATA_COMPRESSION is not supported for Parquet

D.The LOCATION path in the external table is relative to the data source, but the data source points to the wrong container or folder

AnswerD

The data source points to 'sales' container, table location adds 'parquet/sales', likely the files are not there.

Why this answer

Option D is correct. The external table LOCATION is 'parquet/sales/' but the external data source points to the root of the 'sales' container. The combined path is 'sales/parquet/sales/', which likely does not match where the files are stored.

Option C is wrong because compression is supported. Option A is wrong because the credential is defined. Option B is wrong because the file format is correct.
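
A quick way to reason about this kind of problem is to compose the two locations by hand. The sketch below does that in plain Python; the storage account name `example` is a hypothetical placeholder, since the exhibit is not reproduced here.

```python
import posixpath


def resolved_path(data_source_location: str, table_location: str) -> str:
    """Join the data source root and the table's relative LOCATION."""
    return posixpath.join(
        data_source_location.rstrip("/"), table_location.lstrip("/")
    )


# Hypothetical data source pointing at the root of the 'sales' container.
data_source = "https://example.dfs.core.windows.net/sales"
full_path = resolved_path(data_source, "parquet/sales/")
print(full_path)
```

If the files actually live somewhere other than the resolved path, the external table has nothing to read, which matches the empty result described in the question.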

295

MCQhard

You are using Azure Stream Analytics to process real-time temperature data from IoT devices. The output must be written to Azure SQL Database. The job has been running successfully for weeks, but recently you notice that the output data has duplicate rows. The input events are unique. The job uses a windowed aggregation (TumblingWindow). What is the most likely cause of duplicates?

A.The job is not handling late-arriving events.

B.The job is being restarted and reprocessing data.

C.The input event hub is receiving duplicate events.

D.The tumbling window size is too small.

AnswerB

Restart can cause reprocessing and duplicate output without idempotent writes.

Why this answer

Option B is correct because when a Stream Analytics job recovers from a failure, it reprocesses input from the last checkpoint, which can cause duplicate output if the output is not idempotent. Option D is wrong because window size affects aggregation but not duplicates. Option A is wrong because late events can cause out-of-order results but not duplicates if handled correctly.

Option C is wrong because the input events are unique, so the duplicates do not come from the input.
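
The effect of a replay on a non-idempotent sink can be sketched with Python's built-in `sqlite3` module. SQLite is only a local stand-in for Azure SQL Database here, and the table and column names are assumptions for the example. The same batch is written twice to simulate reprocessing after a restart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sink without a key: every replayed row is appended again.
conn.execute(
    "CREATE TABLE plain_out (device_id TEXT, window_end TEXT, avg_temp REAL)"
)
# Sink keyed on (device_id, window_end): a replayed row overwrites itself.
conn.execute(
    "CREATE TABLE keyed_out (device_id TEXT, window_end TEXT, avg_temp REAL, "
    "PRIMARY KEY (device_id, window_end))"
)


def write_batch(batch):
    for row in batch:
        conn.execute("INSERT INTO plain_out VALUES (?, ?, ?)", row)
        conn.execute("INSERT OR REPLACE INTO keyed_out VALUES (?, ?, ?)", row)


batch = [
    ("dev1", "2024-01-01T00:05:00", 21.5),
    ("dev2", "2024-01-01T00:05:00", 19.0),
]
write_batch(batch)
write_batch(batch)  # simulate the job replaying the same window after a restart

plain_count = conn.execute("SELECT COUNT(*) FROM plain_out").fetchone()[0]
keyed_count = conn.execute("SELECT COUNT(*) FROM keyed_out").fetchone()[0]
print(plain_count, keyed_count)
```

The unkeyed table ends up with four rows and the keyed table with two, which is why a unique key on the output table is a common way to make replayed writes harmless.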

296

MCQmedium

You are building a streaming pipeline in Azure Synapse Analytics to ingest real-time sensor data from IoT devices. The data must be processed with a 2-second latency and stored in a dedicated SQL pool for reporting. The source emits JSON messages with a nested structure. Which approach should you use to ingest and transform the data?

A.Use Azure Synapse Spark with structured streaming to read from Event Hubs, flatten the JSON using Spark SQL, and write to the dedicated SQL pool.

B.Use Azure Stream Analytics to ingest data from Azure Event Hubs, apply a query to flatten the JSON, and output directly to the dedicated SQL pool.

C.Use Azure Data Factory to run a tumbling window trigger that reads from Event Hubs every 2 seconds and copies data to the dedicated SQL pool.

D.Use Azure Databricks with Auto Loader to ingest data from Event Hubs and write to the dedicated SQL pool.

AnswerB

Stream Analytics is designed for real-time processing, supports nested JSON via WITH clause, and can output to Synapse SQL pool with low latency.

Why this answer

Option B is correct because Azure Stream Analytics can ingest from Event Hubs, flatten JSON in real time, and write to the Synapse SQL pool via the built-in output adapter. Option D (Databricks Auto Loader) is built for incremental file ingestion and is not optimal for sub-2-second latency and nested JSON flattening without additional code. Option A (Spark structured streaming) is possible but more complex.

Option C (Azure Data Factory) is for batch/orchestration, not real-time streaming.
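
What "flattening" a nested JSON message means can be shown with a short plain-Python sketch. This is not Stream Analytics query syntax; the message shape, field names, and the underscore naming convention are assumptions for the example.

```python
import json


def flatten(obj, prefix=""):
    """Turn nested objects into a single-level dict with underscore-joined names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Hypothetical sensor message with a nested structure.
message = (
    '{"deviceId": "dev1", "reading": {"temperature": 21.5, '
    '"location": {"lat": 59.9, "lon": 10.7}}}'
)
row = flatten(json.loads(message))
print(row)
```

Each key of the resulting dictionary corresponds to one column of a relational table, which is the shape a SQL pool expects.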

297

MCQeasy

You need to transform data in Azure Synapse Analytics using a language that supports procedural logic and error handling. Which option should you use?

A.T-SQL stored procedures

B.CREATE VIEW

C.PolyBase

D.CREATE EXTERNAL TABLE

AnswerA

Supports procedural logic and error handling.

Why this answer

T-SQL stored procedures are the correct choice because they support procedural logic (e.g., IF/ELSE, loops, TRY/CATCH) and error handling within Azure Synapse Analytics dedicated SQL pools. This allows you to encapsulate complex data transformation logic, handle runtime errors gracefully, and manage transactions, which is not possible with declarative objects like views or external tables.

Exam trap

The trap here is that candidates confuse PolyBase's ability to query external data with the ability to perform procedural transformations, overlooking that PolyBase is a query engine, not a programming construct for logic and error handling.

How to eliminate wrong answers

Option B is wrong because CREATE VIEW creates a read-only virtual table that cannot contain procedural logic or error handling; it is purely declarative. Option C is wrong because PolyBase is a data virtualization technology for querying external data sources (e.g., Azure Blob Storage) using T-SQL, but it does not support procedural logic or error handling itself. Option D is wrong because CREATE EXTERNAL TABLE defines a schema for external data but provides no procedural capabilities or error handling; it is a metadata object for PolyBase queries.
