Knowledge + Practice

CCNA Develop data processing Questions

75 of 297 questions · Page 3/4 · Develop data processing · Answers revealed

Practice these questions Domain overview All questions

151

MCQeasy

You have a pipeline in Azure Data Factory that copies data from on-premises SQL Server to Azure Blob Storage. The pipeline fails with a 'Connection timed out' error. You have already verified that the Integration Runtime is running and the SQL Server firewall allows connections from the Integration Runtime. What should you check next?

A.Ensure the Integration Runtime is registered and online

B.Check if the Blob Storage endpoint is accessible from the Integration Runtime

C.Check if the SQL Server is configured to allow remote connections and that TCP/IP is enabled

D.Verify that the SQL Server login credentials are correct

AnswerC

Timeout often indicates network blocking or SQL Server not listening on TCP/IP.

Why this answer

Option C is correct because the 'Connection timed out' error, despite the Integration Runtime being running and the firewall allowing connections, typically indicates that SQL Server is not listening on the expected TCP port. This often happens when TCP/IP is disabled in SQL Server Configuration Manager or remote connections are not enabled. Without TCP/IP enabled, the Integration Runtime cannot establish a network connection to the SQL Server instance, leading to a timeout.

Exam trap

The trap here is that candidates assume a 'Connection timed out' error is always a firewall or network issue, overlooking the SQL Server-side protocol configuration that must be explicitly enabled for remote TCP connections.

How to eliminate wrong answers

Option A is wrong because the question states that the Integration Runtime is already verified as running, so re-checking its registration and online status is redundant and does not address the timeout. Option B is wrong because the error is a connection timeout to SQL Server, not to Blob Storage; the pipeline fails before data transfer begins, so Blob Storage accessibility is irrelevant at this stage. Option D is wrong because incorrect login credentials would result in an authentication error (e.g., 'Login failed'), not a 'Connection timed out' error, which is a network-level issue.

Practice this question →

152

MCQhard

You are designing a near-real-time data processing solution for a retail company. The source is a Kafka cluster on-premises. The target is an Azure Synapse Dedicated SQL Pool. The solution must handle up to 10,000 events per second with less than 5-minute latency. Which Azure service should you use to ingest the data?

A.Azure Event Hubs (with Kafka protocol support)

B.Azure Data Lake Storage Gen2

C.Azure IoT Hub

D.Azure Stream Analytics

AnswerA

Event Hubs supports Kafka protocol and can ingest 10K events/sec with low latency.

Why this answer

Azure Event Hubs with Kafka protocol support is the correct choice because it provides a fully managed, high-throughput data ingestion service that can handle up to 10,000 events per second with sub-second latency, and it natively supports the Kafka protocol, allowing direct integration with your on-premises Kafka cluster without custom code or additional gateways. This meets the near-real-time requirement (<5-minute latency) and scales to the specified throughput.

Exam trap

The trap here is that candidates often confuse Azure Stream Analytics as an ingestion service, but it is a processing engine that requires an ingestion layer (like Event Hubs) first, and they may overlook that Azure Event Hubs natively supports the Kafka protocol, making it the direct replacement for Kafka ingestion in Azure.

How to eliminate wrong answers

Option B (Azure Data Lake Storage Gen2) is wrong because it is a hierarchical file store designed for batch analytics and data lake storage, not a real-time event ingestion service; it cannot natively consume Kafka streams or provide sub-5-minute latency for streaming data. Option C (Azure IoT Hub) is wrong because it is optimized for device-to-cloud telemetry from IoT devices, not for high-throughput event streams from a Kafka cluster, and it imposes device identity and throttling limits that are unsuitable for 10,000 events per second from a non-IoT source. Option D (Azure Stream Analytics) is wrong because it is a stream processing engine that requires an input source (like Event Hubs) to ingest data; it cannot directly ingest from Kafka on-premises and is not an ingestion service itself.

Practice this question →

153

MCQeasy

You are designing a data processing solution in Azure Synapse Analytics. The solution must support both batch and streaming data ingestion. Which Azure service should you use to ingest streaming data into Synapse Analytics?

A.Azure Data Factory

B.Azure Blob Storage

C.Azure Event Hubs

D.Azure Analysis Services

AnswerC

Event Hubs is designed for streaming data ingestion and works with Synapse.

Why this answer

Option A is correct because Azure Event Hubs is a big data streaming platform and event ingestion service that integrates with Synapse. Option B is wrong because Azure Data Factory is primarily for batch data integration. Option C is wrong because Azure Blob Storage is a storage service, not an ingestion service.

Option D is wrong because Azure Analysis Services is for semantic modeling, not streaming ingestion.

Practice this question →

154

MCQmedium

You are designing a data processing solution for a global company. Data must be processed in near real-time and aggregated by region. You need to minimize latency for downstream consumers. Which Azure service should you use for stream processing?

A.Azure Batch

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Synapse Pipelines

AnswerB

Stream Analytics provides real-time stream processing with SQL-like queries.

Why this answer

Azure Stream Analytics is the correct choice because it is a fully managed stream processing engine designed for near real-time analytics on high-volume data streams. It can ingest data from sources like Azure Event Hubs or IoT Hub, apply SQL-based transformations, and output aggregated results to sinks such as Azure Synapse or Power BI with sub-second latency, meeting the requirement for minimal downstream latency.

Exam trap

The trap here is that candidates often confuse Azure Data Factory or Synapse Pipelines with stream processing because they support 'real-time' triggers, but these services are fundamentally batch-oriented and cannot achieve the sub-second latency required for continuous stream aggregation.

How to eliminate wrong answers

Option A is wrong because Azure Batch is a batch processing service for running large-scale parallel jobs, not designed for near real-time stream processing; it introduces significant latency due to job scheduling and queuing. Option C is wrong because Azure Data Factory is an ETL and data orchestration service focused on batch data movement and transformation, lacking native support for continuous stream processing. Option D is wrong because Azure Synapse Pipelines are built on the same orchestration engine as Data Factory and are intended for batch-oriented workflows, not for real-time stream aggregation.

Practice this question →

155

MCQhard

You are designing a data pipeline in Azure Data Factory to load incremental changes from an on-premises SQL Server database to Azure Synapse Analytics. The source table has over 1 billion rows and a datetime column 'LastUpdated' that is indexed but not always increasing. The requirement is to capture all changes with minimal latency and no missed rows. Which approach should you recommend?

A.Enable Change Data Capture (CDC) on the source table.

B.Use SQL Server change tracking with a watermark table.

C.Use a custom staging table with triggers to capture changes.

D.Use a watermark column (LastUpdated) and query rows where LastUpdated > last run time.

AnswerB

Change tracking captures changes reliably and works with all editions.

Why this answer

Option D is correct because change tracking reliably captures all changes including deletes and out-of-order updates, and the watermark mechanism ensures no rows are missed. Option A is incorrect because it can miss rows when timestamps are out of order. Option B is incorrect because it requires schema changes (triggers) and may impact source performance.

Option C is incorrect because it requires costly schema changes.

Practice this question →

156

MCQmedium

Your company uses Azure Synapse Analytics to run a data warehouse. You have a dedicated SQL pool with a hash-distributed fact table named Sales. The distribution column is ProductID. You notice that queries against the Sales table are slow due to data skew. After analysis, you find that a few products (e.g., ProductID 100, 200) account for 80% of the rows. You need to optimize query performance without redesigning the entire table. You also need to minimize data movement during queries. Which action should you take?

A.Change the distribution to round-robin.

B.Increase the number of distributions to 120.

C.Change the distribution to replicate for the Sales table.

D.Create non-clustered indexes on the ProductID column.

AnswerA

Round-robin distributes rows evenly, eliminating skew.

Why this answer

Option B is correct. Round-robin distribution distributes data evenly across distributions, eliminating skew. However, it may increase data movement for joins.

Given the severe skew, round-robin is a reasonable trade-off. Option A is wrong because adding more distributions does not fix skew. Option C is wrong because changing to replicate distribution is not suitable for large fact tables.

Option D is wrong because creating non-clustered indexes does not address distribution skew.

Practice this question →

157

MCQhard

Your company uses Azure Synapse Analytics to run a large-scale batch processing job every night. The job currently runs on a dedicated SQL pool and takes 4 hours. Management wants to reduce the runtime to under 2 hours without increasing cost. The job involves heavy compute operations with no data movement limitations. What should you do?

A.Create materialized views on frequently queried tables.

B.Increase the service level objective (DWU) of the dedicated SQL pool.

C.Implement workload management to prioritize the job.

D.Enable result-set caching on the dedicated SQL pool.

AnswerD

Result-set caching reduces compute for repeated queries without increasing cost.

Why this answer

Option C is correct because result-set caching stores query results in SSD storage, reducing compute time for repeated queries. Option A is wrong because increasing the service level (DWU) would increase cost. Option B is wrong because materialized views require storage but also incur compute for refresh.

Option D is wrong because workload management separates resources but does not automatically reduce compute time without additional cost.

Practice this question →

158

MCQhard

You are optimizing a data pipeline in Azure Data Factory that uses a Copy activity to transfer data from an Azure SQL Database to a dedicated SQL pool in Azure Synapse Analytics. The source table has 500 million rows and the copy operation is taking too long. You need to reduce the copy duration. Which configuration change will have the most impact?

A.Enable staging and use PolyBase as the copy method for the sink.

B.Change the copy behavior to 'sequential' to reduce load on the source.

C.Increase the degree of copy parallelism (DOP) to the maximum value supported.

D.Split the source data into multiple smaller files and use multiple copy activities running in parallel.

AnswerA

Staging with PolyBase dramatically improves performance for large data loads.

Why this answer

Enabling staging with PolyBase as the copy method for the sink is the most impactful change because PolyBase leverages the massively parallel processing (MPP) architecture of Azure Synapse Analytics to load data in parallel directly into the dedicated SQL pool. This bypasses the single-threaded bottleneck of the standard INSERT-based copy method, dramatically reducing the time required to ingest 500 million rows.

Exam trap

The trap here is that candidates often assume increasing parallelism (DOP) or splitting data into multiple activities is always better, but they overlook that PolyBase's MPP integration with Synapse is the only option that fundamentally changes the data loading mechanism from a serial to a parallel bulk operation.

How to eliminate wrong answers

Option B is wrong because changing the copy behavior to 'sequential' would reduce parallelism and increase the copy duration, not reduce it. Option C is wrong because increasing the degree of copy parallelism (DOP) to the maximum value supported can cause resource contention and throttling on the source Azure SQL Database, often leading to diminishing returns or even slower performance. Option D is wrong because splitting the source data into multiple smaller files and using multiple copy activities running in parallel would require additional orchestration and staging, and without PolyBase or staging, each copy activity would still use the slow row-by-row INSERT method, making it less effective than a single PolyBase-based load.

Practice this question →

159

MCQeasy

Refer to the exhibit. You have created an external table in Azure Synapse Analytics serverless SQL pool to query Parquet files stored in Azure Data Lake Storage Gen2. When you query the external table, you get an error that the external table is not accessible. What should you check first?

A.Check that the external table's LOCATION path is relative to the container and does not start with a slash.

B.Verify that the serverless SQL pool has been granted the 'Storage Blob Data Reader' role on the storage account.

C.Ensure that the external file format is correctly referencing the Parquet format.

D.Confirm that the Snappy compression codec is supported by the serverless SQL pool.

AnswerB

The serverless SQL pool needs read permissions on the storage account to access the data.

Why this answer

The error 'external table is not accessible' in Azure Synapse serverless SQL pool typically indicates an authorization failure when the SQL pool attempts to read the underlying Parquet files in Azure Data Lake Storage Gen2. Serverless SQL pool uses its own service identity to access storage, and it must be granted the 'Storage Blob Data Reader' role on the storage account at the storage account scope to have read permissions. Without this role assignment, the SQL pool cannot authenticate to the storage, resulting in the access error.

Exam trap

The trap here is that candidates often confuse 'external table not accessible' with file path or format issues, but the error message specifically points to a permissions/authorization problem, not a configuration or syntax error.

How to eliminate wrong answers

Option A is wrong because the LOCATION path in an external table for serverless SQL pool must be relative to the container and should not start with a slash; however, an incorrect path format would cause a 'file not found' or 'path does not exist' error, not an 'external table is not accessible' error which is specifically about permissions. Option C is wrong because if the external file format incorrectly references the Parquet format, the error would be about format mismatch or parsing failure (e.g., 'Cannot parse file'), not about table accessibility. Option D is wrong because Snappy compression is fully supported by serverless SQL pool for Parquet files; an unsupported codec would cause a decompression error, not an access-denied error.

Practice this question →

160

MCQhard

You are troubleshooting a slow-running pipeline in Azure Data Factory. The pipeline copies data from an on-premises SQL Server to Azure Synapse Analytics using a self-hosted integration runtime. The copy activity is using the 'Auto' copy method. You notice that network bandwidth is limited. Which configuration change would most likely improve performance?

A.Enable staging using Azure Blob Storage and use PolyBase to load into Synapse

B.Increase the Data Integration Units (DIU) for the copy activity

C.Change the copy method to 'Bulk insert'

D.Set the Fault Tolerance option to skip incompatible rows

AnswerA

Staging improves performance by using parallel uploads to Blob Storage.

Why this answer

Option A is correct because when network bandwidth is limited, staging data in Azure Blob Storage allows the copy activity to use PolyBase, which leverages Azure's internal high-speed network for the final load into Synapse. This bypasses the constrained on-premises-to-cloud link for the bulk of the data transfer, significantly improving throughput.

Exam trap

The trap here is that candidates assume increasing DIU or changing the copy method directly speeds up data movement, when in fact the real bottleneck is the network link, and only staging with PolyBase offloads the heavy data transfer to Azure's internal network.

How to eliminate wrong answers

Option B is wrong because Data Integration Units (DIU) control parallelism within the copy activity but do not address the underlying network bandwidth bottleneck; increasing DIU on a constrained link can actually worsen contention. Option C is wrong because 'Bulk insert' is the default method for loading into Synapse and does not change the data path; it still sends all data over the limited network connection. Option D is wrong because Fault Tolerance skips incompatible rows to avoid failures, but it has no impact on data transfer speed or network utilization.

Practice this question →

161

MCQhard

Refer to the exhibit. The pipeline fails because the source and sink datasets do not match. The source file is a CSV with columns: CustomerID, Name, City. The sink table dbo.Customer has columns: CustomerID, Name, City, CreatedDate (default GETDATE()). The pipeline uses auto-mapping. Why does the pipeline fail?

A.Auto-mapping fails because the source and sink have different column counts.

B.The recursive setting should be false for CSV files.

C.The writeBatchSize is too large for the sink.

D.The pre-copy script is not allowed for SQL sink.

AnswerA

Auto-mapping maps by name but expects the same number of columns or explicit mapping for extras.

Why this answer

Option C is correct because the source has 3 columns and the sink has 4 columns. With auto-mapping, the copy activity tries to map all source columns to sink columns by name but fails because the sink has an extra column (CreatedDate) that is not nullable and has no default value in the mapping context (the default is a SQL Server default constraint, but auto-mapping expects an explicit mapping or a NULL). The pre-copy script truncates the table, but that does not resolve the column count mismatch.

Option A is wrong because the pre-copy script is valid. Option B is wrong because recursive is fine for a single file. Option D is wrong because writeBatchSize is acceptable.

Practice this question →

162

MCQmedium

You need to process a large dataset stored as CSV files in Azure Data Lake Storage Gen2 using Azure Databricks. The processing involves several transformations and aggregations. You want to minimize shuffle operations. Which approach should you use?

A.Use Delta Lake and apply Z-ordering on the columns used in filters and aggregations

B.Cache the data in memory after reading

C.Use bucketing with a fixed number of buckets

D.Partition the data by a high-cardinality column

AnswerA

Z-ordering co-locates related data, reducing data shuffling.

Why this answer

Z-ordering in Delta Lake co-locates related data within files based on specified columns, which significantly reduces the amount of data scanned during filter and aggregation operations. By minimizing the data that needs to be read, Z-ordering inherently reduces shuffle operations because fewer partitions need to be exchanged across the cluster during transformations. This approach is specifically designed to optimize query performance on large datasets in Azure Databricks without increasing the number of shuffle stages.

Exam trap

The trap here is that candidates often confuse partitioning (which can increase shuffle) with Z-ordering (which reduces shuffle by improving data locality without creating new partitions), leading them to choose bucketing or high-cardinality partitioning as a solution for shuffle minimization.

How to eliminate wrong answers

Option B is wrong because caching data in memory after reading only speeds up repeated access to the same data but does not reduce shuffle operations during transformations or aggregations; shuffle is caused by data movement across partitions, not by I/O latency. Option C is wrong because bucketing with a fixed number of buckets can actually increase shuffle operations if the bucketing columns do not align with the join or aggregation keys, and it does not inherently minimize shuffle; it is primarily used for optimizing joins and aggregations when the number of buckets matches the cluster parallelism. Option D is wrong because partitioning by a high-cardinality column (e.g., a column with many unique values) creates many small partitions, which leads to excessive shuffle overhead and task scheduling inefficiency, increasing rather than minimizing shuffle operations.

Practice this question →

163

MCQmedium

You are designing a data processing solution for a financial services company. The solution must process sensitive customer data from multiple sources. You need to ensure that the data is encrypted at rest and in transit, and that access to the data is audited. Which combination of Azure services should you use?

A.Azure Data Lake Storage (encrypted at rest), Azure Synapse Analytics (TDE and SSL), and Microsoft Purview

B.Azure Blob Storage (encrypted at rest), Azure HDInsight, and Azure Log Analytics

C.Azure SQL Database (TDE and SSL), Azure Analysis Services, and Microsoft Purview

D.Azure Data Lake Storage (encrypted at rest), Microsoft Fabric, and Azure Monitor

AnswerA

Azure Storage provides encryption at rest; Synapse supports TDE for at-rest and SSL for in-transit; Purview provides data lineage and auditing.

Why this answer

Option A is correct because Azure Data Lake Storage provides encryption at rest using Azure Storage Service Encryption (SSE) with 256-bit AES, and Azure Synapse Analytics supports Transparent Data Encryption (TDE) for at-rest encryption and SSL/TLS for in-transit encryption. Microsoft Purview enables data governance and auditing by capturing lineage, classification, and access activity logs, meeting the compliance requirements for sensitive financial data.

Exam trap

The trap here is that candidates often confuse Azure Monitor or Log Analytics with Microsoft Purview for auditing, but Microsoft Purview is the dedicated service for data governance, classification, and access auditing, while Azure Monitor is for operational monitoring and does not provide data-level audit trails.

How to eliminate wrong answers

Option B is wrong because Azure HDInsight does not natively enforce encryption at rest for data stored in its managed disks or external storage without additional configuration, and Azure Log Analytics focuses on monitoring and diagnostics rather than auditing data access at the granularity required for sensitive customer data. Option C is wrong because Azure Analysis Services does not provide built-in auditing of data access at the storage level; it is an analytical engine that relies on underlying data sources for encryption, and Microsoft Purview is correctly included but the combination lacks a scalable storage layer for multiple sources. Option D is wrong because Microsoft Fabric is a unified analytics platform that does not inherently provide the same level of granular access auditing as Microsoft Purview, and Azure Monitor is designed for infrastructure monitoring, not data access auditing.

Practice this question →

164

MCQmedium

You are designing a data processing solution using Azure Synapse Analytics serverless SQL pool. The solution must query data stored in Parquet files in Azure Data Lake Storage Gen2. The queries are ad-hoc and vary greatly. Which feature should you use to optimize query performance for frequently accessed data partitions?

A.Implement workload management to prioritize queries.

B.Use OPENROWSET with explicit file path filtering.

C.Enable result-set caching on the serverless SQL pool.

D.Create materialized views on the Parquet files.

AnswerB

OPENROWSET with path filtering prunes partitions and improves performance.

Why this answer

Option C is correct because the OPENROWSET function with file path filtering can prune partitions by specifying the path to specific folders. Option A is wrong because materialized views are not supported in serverless SQL pool. Option B is wrong because result-set caching is for dedicated SQL pool.

Option D is wrong because workload management is for dedicated SQL pool.

Practice this question →

165

MCQmedium

You are deploying an Azure Synapse workspace using an ARM template. The template includes a Managed integration runtime with 'AutoResolve' location and a TTL of 10 minutes for data flows. After deployment, you notice that the first data flow execution takes a long time to start. What is the most likely cause?

A.The core count of 8 is insufficient for the data flow.

B.The TTL setting is too low, causing the cluster to be recreated frequently.

C.The AutoResolve location cannot be used for Managed IR.

D.The integration runtime type should be 'Self-Hosted' for data flows.

AnswerB

Low TTL leads to frequent cluster teardown and startup delays.

Why this answer

Option B is correct because the TTL (time-to-live) of 10 minutes means the cluster will be spun down after 10 minutes of inactivity, so the first execution after a period of inactivity must wait for the cluster to spin up. Option A is wrong because the location is AutoResolve and works. Option C is wrong because the core count is sufficient.

Option D is wrong because the template deploys the IR correctly.

Practice this question →

166

MCQhard

Refer to the exhibit. You have an Azure Data Factory pipeline that performs an incremental load from an Azure SQL Database source to a target Azure SQL Database. The pipeline uses a watermark column approach. After running the pipeline, you notice that the target table is empty. What is the most likely cause of this issue?

A.The dependency condition should be 'Completed' instead of 'Succeeded'.

B.The WatermarkQuery activity failed, causing the CopyData activity to be skipped.

C.The watermark query returns the maximum LastModified value, but the copy query uses the same value to filter, resulting in zero rows.

D.The CopyData activity runs before the WatermarkQuery activity completes.

AnswerC

The copy query filters for rows where LastModified > NewWatermark, but NewWatermark is the maximum, so no rows satisfy the condition. The previous watermark should be stored and used.

Why this answer

Option D is correct because the query uses the output of WatermarkQuery (which returns the maximum LastModified) as the filter condition, but it should filter for rows where LastModified is greater than the previous watermark, not the current maximum. This results in no rows being selected because no rows have LastModified > maximum. Option A is wrong because the pipeline succeeded.

Option B is wrong because the watermark query executes before the copy activity. Option C is wrong because the dependency condition is correct.

Practice this question →

167

MCQmedium

You are designing a data processing pipeline in Azure Synapse Analytics that ingests streaming data from Azure Event Hubs and stores it in a dedicated SQL pool. The data must be available for querying within 5 minutes of ingestion. Which processing approach should you recommend?

A.Use Azure Data Factory with a tumbling window trigger set to 5 minutes.

B.Use Azure Stream Analytics with a dedicated SQL pool output and configure a 1-minute window.

C.Use PolyBase to load data from Event Hubs into the dedicated SQL pool every 5 minutes.

D.Use Spark Structured Streaming in Azure Synapse to write micro-batches every 5 minutes.

AnswerB

Stream Analytics provides sub-minute latency and is designed for real-time ingestion into Synapse dedicated SQL pool.

Why this answer

Azure Stream Analytics is purpose-built for real-time stream processing and can output directly to a dedicated SQL pool. By configuring a 1-minute window, you ensure data is materialized in the SQL pool well within the 5-minute SLA, meeting the latency requirement with headroom.

Exam trap

The trap here is that candidates confuse batch-oriented tools (Data Factory, PolyBase) or general-purpose streaming frameworks (Spark Structured Streaming) with the dedicated, low-latency stream processing service (Stream Analytics) that is optimized for sub-minute latency to Synapse SQL pools.

How to eliminate wrong answers

Option A is wrong because Azure Data Factory with a tumbling window trigger is a batch-oriented orchestration tool, not a streaming engine; it cannot process Event Hubs data in near real-time and introduces at least 5 minutes of latency before the trigger even fires. Option C is wrong because PolyBase is a bulk-load technology for reading external data sources like Azure Blob Storage or Data Lake, not for ingesting streaming data from Event Hubs; it cannot connect to Event Hubs directly. Option D is wrong because Spark Structured Streaming in Synapse writes micro-batches every 5 minutes, which meets the 5-minute SLA only at the boundary; any processing delay or checkpoint overhead could push latency beyond the requirement, and it lacks the native, low-latency integration with dedicated SQL pool that Stream Analytics provides.

Practice this question →

168

Multi-Selecthard

Which THREE of the following are best practices for optimizing performance of Delta Lake tables in Azure Synapse Analytics? (Choose three.)

Select 4 answers

A.Run the OPTIMIZE command to compact small files and improve read performance.

B.Periodically run VACUUM to remove old versions of files that are no longer needed.

C.Partition the table on columns that are used in WHERE clauses to enable partition pruning.

D.Partition on high-cardinality columns like UserID to maximize parallelism.

E.Use Z-order on columns that are frequently used in filter predicates.

AnswersA, B, C, E

OPTIMIZE reduces the number of small files, improving I/O.

Why this answer

Option A is correct because the OPTIMIZE command in Delta Lake compacts small files into larger ones, reducing the number of files that need to be read during queries. This improves read performance by minimizing file listing overhead and enabling more efficient I/O, especially in scenarios with many small writes.

Exam trap

The trap here is that candidates often confuse high-cardinality partitioning with parallelism, not realizing that excessive partitions cause metadata bloat and slow down queries, while Z-order is a complementary technique for non-partition columns.

Practice this question →

169

MCQmedium

You are running a Spark notebook in Azure Synapse Analytics that reads from a Delta table and writes to a Parquet file. The job fails with the error: 'AnalysisException: Table or view not found: bronze.sales'. The table exists in the lakehouse. What is the most likely cause?

A.The user does not have read permission on the table.

B.The table is not registered in the Spark metastore; it is only in the lakehouse.

C.The Parquet file format is incompatible with the Delta source.

D.The Delta table is corrupted.

AnswerB

In Synapse, lakehouse tables are stored in a separate catalog; the Spark session's default catalog may not include it, so the table is not found.

Why this answer

In Azure Synapse Analytics, a Spark notebook uses its own Spark metastore to resolve table references. If the table 'bronze.sales' exists only in the lakehouse (i.e., as a Delta table in the underlying storage) but is not registered in the Spark metastore, the Spark engine cannot find it and throws an AnalysisException. The error indicates a metadata resolution failure, not a permission or data corruption issue.

Exam trap

Microsoft often tests the misconception that a table existing in the lakehouse automatically makes it visible to Spark notebooks, when in fact the Spark metastore and lakehouse catalog are separate metadata layers that must be explicitly synchronized.

How to eliminate wrong answers

Option A is wrong because an AnalysisException for 'Table or view not found' is a metadata resolution error, not a permission error; a lack of read permission would typically produce a SecurityException or AccessDeniedException. Option C is wrong because Parquet and Delta are both columnar formats based on Parquet, and writing to Parquet from a Delta source does not cause incompatibility; the error occurs before any read/write operation begins. Option D is wrong because a corrupted Delta table would cause read failures (e.g., file not found, checksum mismatch) during data access, not a table-not-found error at the metadata level.

Practice this question →

170

MCQeasy

You have an Azure Databricks notebook that processes data from a Delta table. The notebook runs slowly due to many small files. You need to optimize the Delta table for faster reads. Which Delta Lake operation should you run?

A.Run CONVERT TO DELTA on the underlying Parquet files.

B.Run OPTIMIZE to compact small files.

C.Run DESCRIBE HISTORY to analyze file sizes.

D.Run VACUUM to delete old files.

AnswerB

OPTIMIZE compacts small files into larger ones, improving read performance.

Why this answer

Option C is correct because OPTIMIZE compacts small files into larger ones, improving read performance. Option A (VACUUM) removes old files but does not compact. Option B (DESCRIBE HISTORY) shows history.

Option D (CONVERT TO DELTA) converts Parquet but does not optimize.

Practice this question →

171

MCQhard

You are building a data processing pipeline in Azure Synapse Analytics. The pipeline should read data from Azure Data Lake Storage Gen2 (Parquet files), apply transformations using a mapping data flow, and write the results to a dedicated SQL pool table. The source data contains personally identifiable information (PII). You need to mask the PII columns (e.g., email) using a data masking function within the data flow. Which transformation should you use?

A.Derived Column transformation

B.Join transformation

C.Aggregate transformation

D.Pivot transformation

AnswerA

Derived Column can apply expressions, including hash functions like SHA2 for masking PII.

Why this answer

The Derived Column transformation in mapping data flows allows you to create new columns or modify existing ones using expressions, including built-in data masking functions like `mask()`, `maskEmail()`, or `substring()`. This is the correct transformation to apply PII masking on columns such as email addresses within the data flow pipeline before writing to the dedicated SQL pool.

Exam trap

The trap here is that candidates may confuse the Derived Column transformation with the Select transformation (which can also rename or drop columns but does not support expression-based masking), or assume that masking must be done in the sink (dedicated SQL pool) rather than within the data flow itself.

How to eliminate wrong answers

Option B (Join transformation) is wrong because it is used to combine rows from two sources based on a matching condition, not to mask or transform column values. Option C (Aggregate transformation) is wrong because it performs grouping and aggregation operations (e.g., SUM, COUNT) and does not support per-row data masking functions. Option D (Pivot transformation) is wrong because it rotates rows into columns for reshaping data, not for applying masking or transformations to individual column values.

Practice this question →

172

MCQmedium

You are building a real-time dashboard that displays sales data from an Azure SQL Database. The dashboard must refresh every 30 seconds with minimal latency. You need to choose the appropriate Azure service for data processing and visualization. Which service should you use?

A.Azure Analysis Services with a tabular model and a scheduled refresh every 30 seconds.

B.Azure Data Explorer (ADX) with a continuous export to Power BI.

C.Power BI with DirectQuery mode and configure automatic page refresh.

D.Azure Synapse Serverless SQL pool with Power BI import mode.

AnswerC

DirectQuery mode allows real-time queries to Azure SQL Database; automatic page refresh supports 30-second intervals.

Why this answer

Option A is correct because Power BI with DirectQuery mode can refresh at sub-minute intervals and query the database directly. Option B (Azure Analysis Services) is for tabular models, not real-time dashboards. Option C (Azure Data Explorer) is optimized for large-scale analytics, not for direct connection to Azure SQL Database.

Option D (Azure Synapse Serverless SQL) is for querying data lakes, not for real-time dashboards.

Practice this question →

173

MCQeasy

You are developing a data processing pipeline in Azure Databricks that processes streaming data from Azure Event Hubs. You need to ensure that the pipeline can recover from failures and process data exactly once. The pipeline writes to Delta Lake. Which approach should you use?

A.Use Azure Stream Analytics to process the stream and output to Delta Lake via Azure Data Lake Storage Gen2.

B.Use Structured Streaming with foreachBatch to write micro-batches to Delta Lake, and set the checkpoint location to Azure Data Lake Storage Gen2.

C.Use Structured Streaming with a Delta Lake sink and specify a checkpoint location on Azure Data Lake Storage Gen2.

D.Use Auto Loader to ingest streaming data from Event Hubs and write to Delta Lake with checkpointing.

AnswerC

Provides exactly-once semantics with checkpointing.

Why this answer

Option B is correct because Structured Streaming with checkpointing enables exactly-once processing by saving the offset and state. Delta Lake's ACID transactions ensure idempotent writes. Option A is wrong because foreachBatch with batch processing may lose exactly-once guarantees.

Option C is wrong because Azure Stream Analytics does not integrate directly with Delta Lake. Option D is wrong because Auto Loader is for batch file ingestion, not streaming.

Practice this question →

174

MCQeasy

Your team is building a real-time dashboard in Power BI that displays sales data from Azure Stream Analytics. The data must be updated every 5 seconds with low latency. Which output type should you configure in Stream Analytics to achieve this?

A.Power BI

B.Azure Blob Storage

C.Azure SQL Database

D.Azure Event Hubs

AnswerA

Direct streaming to Power BI for low latency.

Why this answer

Power BI is the correct output type because Azure Stream Analytics has a native Power BI output connector that supports real-time streaming datasets, enabling sub-second latency for dashboards. This connector pushes data directly to Power BI's streaming API, which refreshes visuals automatically every 5 seconds as required.

Exam trap

The trap here is that candidates often confuse Azure Event Hubs as a direct output for Power BI, overlooking that Event Hubs is an intermediary and not a visualization endpoint, while Power BI's native Stream Analytics output is the only option that directly feeds the real-time dashboard with low latency.

How to eliminate wrong answers

Option B is wrong because Azure Blob Storage is a batch-oriented, file-based storage service that introduces latency due to file writes and lacks real-time push capabilities, making it unsuitable for sub-5-second updates. Option C is wrong because Azure SQL Database, while supporting row inserts, incurs higher latency due to transactional overhead and connection pooling, and Power BI's DirectQuery or import modes cannot achieve 5-second refresh rates from SQL without significant tuning. Option D is wrong because Azure Event Hubs is a message ingestion service, not a visualization endpoint; it would require an additional downstream consumer to push data to Power BI, adding latency and complexity.

Practice this question →

175

MCQhard

You are building a data processing solution using Azure Databricks. The solution must process streaming data from Azure Event Hubs, join it with a static reference table stored in Azure Data Lake Storage Gen2 (Parquet format), and write the output to Azure Synapse Analytics. The reference table is updated daily. Which approach minimizes latency and ensures data consistency?

A.Use Spark Structured Streaming with a streaming join and cache the reference table as a static DataFrame.

B.Use Spark Structured Streaming with foreachBatch to write to Synapse.

C.Use Spark Structured Streaming with a streaming join and load the reference table in each micro-batch.

D.Use a batch job that runs every hour to process the data.

AnswerA

Caching the reference table as static minimizes latency and ensures consistency.

Why this answer

Option D is correct because using Spark Structured Streaming with a streaming join and a static DataFrame for the reference table minimizes latency and ensures consistency by reading the reference table once. Option A is wrong because batch processing every hour adds latency. Option B is wrong because loading the reference table per batch increases overhead.

Option C is wrong because foreachBatch is useful for batch writes but does not improve join performance.

Practice this question →

176

MCQmedium

You are running a batch processing job using Azure Data Factory. The job reads from Azure Blob Storage, transforms data with a Data Flow, and writes to Azure Synapse Analytics. The job fails intermittently with the error: 'Operation on target WriteToSynapse failed: Cannot bulk load because the file could not be opened.' You need to resolve the issue with minimal downtime. What should you do?

A.Enable staging blob file deletion in the copy activity settings.

B.Increase the degree of copy parallelism in the Data Flow.

C.Use PolyBase to load data directly from Blob Storage without staging.

D.Switch to staging via SQL authentication instead of Managed Identity.

AnswerA

Deleting the staging blob after successful load prevents lock conflicts on retry.

Why this answer

The error indicates a transient lock on the staging blob file. Enabling staging blob file deletion in the copy activity settings allows the pipeline to clean up and retry. Increasing the degree of copy parallelism does not address file locking.

Using PolyBase with external tables requires schema changes. Switching to staging via SQL authentication does not solve the file lock issue.

Practice this question →

177

MCQeasy

In Azure Synapse Analytics serverless SQL pool, you query Parquet files stored in Azure Data Lake Storage Gen2. You notice that queries are slow. Which configuration change is most likely to improve performance?

A.Partition the data in Azure Data Lake Storage Gen2

B.Create a pipeline in Azure Synapse to preprocess the data

C.Use OPENROWSET with a properly defined schema and file format

D.Increase the DWU setting of the serverless SQL pool

AnswerC

Specifying schema and file format helps the query optimizer generate efficient execution plans.

Why this answer

Option C is correct because OPENROWSET with an explicitly defined schema and file format (e.g., FORMAT='PARQUET') enables the serverless SQL pool to bypass schema inference, which is a costly runtime operation. By providing a proper schema and file format, the query engine can directly read the Parquet metadata and column statistics, significantly reducing I/O and CPU overhead. This is the most direct and effective performance tuning change for querying Parquet files in a serverless SQL pool.

Exam trap

The trap here is that candidates confuse serverless SQL pool with dedicated SQL pool and assume that increasing DWU (a dedicated pool concept) will improve performance, or they think data partitioning alone is sufficient without addressing the schema inference overhead.

How to eliminate wrong answers

Option A is wrong because partitioning data in Azure Data Lake Storage Gen2 helps with file pruning and parallel reads, but it does not address the primary bottleneck of schema inference and metadata parsing that slows down serverless SQL pool queries. Option B is wrong because creating a preprocessing pipeline adds latency and complexity without fixing the root cause; it may even degrade performance if the pipeline introduces additional data movement or transformation overhead. Option D is wrong because serverless SQL pools do not have a DWU setting; DWU is a dedicated SQL pool (formerly SQL DW) concept, and serverless SQL pools scale automatically based on workload, so increasing a non-existent setting has no effect.

Practice this question →

178

MCQeasy

Refer to the exhibit. You are reviewing an ARM template snippet for an Azure Synapse Analytics workspace. The template defines an integration runtime. A colleague asks whether this integration runtime can be used to copy data from an on-premises SQL Server database to Azure Blob Storage. What should you answer?

A.No, you need an Azure integration runtime that is configured for on-premises access.

B.Yes, because the integration runtime is self-hosted and can access on-premises data sources.

C.No, you need a managed virtual network integration runtime for on-premises sources.

D.Yes, but only if the integration runtime is installed on a domain-joined machine.

AnswerB

Self-hosted IR is used for on-premises/private network data sources.

Why this answer

Option A is correct because a self-hosted integration runtime is designed for on-premises and VM-based data sources. Option B is wrong because managed VNet IR is for Azure-only sources. Option C is wrong because Azure IR is for cloud data sources.

Option D is wrong because it is possible with a self-hosted IR.

Practice this question →

179

Multi-Selecthard

Which THREE actions improve the performance of a Delta table in Azure Databricks? (Choose three.)

Select 3 answers

A.Run VACUUM to remove old versions and compact small files

B.Disable auto-optimize to reduce overhead

C.Run OPTIMIZE with ZORDER BY on frequently filtered columns

D.Increase the number of shuffle partitions

E.Partition the table by high-cardinality columns

AnswersA, C, E

Compaction reduces file count.

Why this answer

Options A, B, and E are correct. Z-ordering optimizes data layout, compaction reduces small files, and partitioning improves query pruning. Option C (Disable auto-optimize) would worsen performance.

Option D (Increase spark.sql.shuffle.partitions) may help but is not specific to Delta tables.

Practice this question →

180

MCQmedium

You are designing a streaming solution in Azure Synapse Analytics using the serverless SQL pool to query streaming data in real-time. The data is ingested via Azure Event Hubs and processed using Azure Stream Analytics. The output of Stream Analytics is written to Azure Data Lake Storage Gen2 in Delta Lake format. You need to ensure that the serverless SQL pool can query the latest data with minimal latency. Which approach should you use?

A.Ingest data directly from Event Hubs into serverless SQL pool using CETAS (CREATE EXTERNAL TABLE AS SELECT).

B.Use a materialized view in serverless SQL pool that refreshes every minute.

C.Load the streaming data into a dedicated SQL pool using a scheduled pipeline and then query it from serverless SQL pool.

D.Create an external table in serverless SQL pool that points to the Delta Lake folder and query it directly.

AnswerD

Serverless SQL pool supports Delta Lake format and can query it as soon as data is written.

Why this answer

Option D is correct because serverless SQL pool can directly query Delta Lake format files stored in Azure Data Lake Storage Gen2 by creating an external table with LOCATION pointing to the Delta folder. This allows real-time querying of the latest streaming data without any data movement or transformation, achieving minimal latency since Stream Analytics writes continuously to Delta Lake.

Exam trap

The trap here is that candidates may confuse serverless SQL pool with dedicated SQL pool and assume features like materialized views or scheduled pipelines are available, or they may think CETAS can directly ingest from Event Hubs, which is not supported.

How to eliminate wrong answers

Option A is wrong because CETAS creates a new external table by selecting data from a source, but it cannot ingest data directly from Event Hubs; it requires a source like an existing external table or file set, and it does not support real-time streaming ingestion. Option B is wrong because serverless SQL pool does not support materialized views; materialized views are a feature of dedicated SQL pool, not serverless. Option C is wrong because loading data into a dedicated SQL pool via a scheduled pipeline introduces batch latency (scheduled intervals), which contradicts the requirement for minimal latency in a streaming solution.

Practice this question →

181

MCQmedium

You are troubleshooting a slow-running Azure Data Factory pipeline that copies data from an Azure SQL Database to ADLS Gen2. The pipeline uses a copy activity with the default settings. The source table has 10 million rows. Which optimization should you apply first?

A.Set the 'parallel copies' property to 10.

B.Replace the copy activity with a mapping data flow.

C.Increase the data integration unit (DIU) to maximum.

D.Enable staged copy using an Azure Blob Storage staging location.

AnswerD

Staging allows data to be transferred via Blob Storage, which improves throughput for SQL to ADLS copies.

Why this answer

Option B is correct because enabling staging via Blob Storage improves performance for large cross-region or cross-service copies. Option A (parallel copies) might help but default settings already use parallelism. Option C (DIU increase) is a more expensive solution.

Option D (data flow) is heavier and not needed for simple copy.

Practice this question →

182

MCQeasy

You need to transform semi-structured JSON data into a tabular format for analysis in Azure Synapse Analytics. The data is stored in ADLS Gen2. Which feature should you use to query the JSON data directly without loading it into a table?

A.Use the OPENROWSET function in a serverless SQL pool.

B.Use Azure Data Factory to flatten the JSON and store as Parquet.

C.Create an external table using PolyBase.

D.Use the COPY INTO command in a dedicated SQL pool.

AnswerA

OPENROWSET with JSON file format allows querying JSON directly.

Why this answer

Option A is correct because OPENROWSET in Synapse serverless SQL can query JSON files directly. Option B (COPY INTO) loads data. Option C (PolyBase) requires external tables.

Option D (Azure Data Factory) is an orchestration tool.

Practice this question →

183

MCQmedium

You are developing a real-time data processing solution using Azure Stream Analytics. The input is from an Azure Event Hub, and the output is to an Azure Synapse Analytics dedicated SQL pool. You need to ensure exactly-once delivery semantics to the SQL pool. What should you configure?

A.Enable checkpointing in the query.

B.Use the Azure Synapse Analytics output adapter with exactly-once semantics.

C.Use a custom deserializer.

D.Configure event ordering by timestamp.

AnswerB

Stream Analytics supports exactly-once output to Synapse when configured correctly.

Why this answer

Option D is correct because Stream Analytics offers exactly-once output to Azure Synapse via the dedicated SQL pool output adapter. Option A is wrong because checkpointing is for recovery, not exactly-once. Option B is wrong because event ordering is about sequence, not delivery guarantees.

Option C is wrong because custom deserializers do not provide exactly-once.

Practice this question →

184

Multi-Selecthard

Which THREE Azure services can you use together to build a serverless, event-driven data processing pipeline that ingests data from Azure Blob Storage, transforms it using custom code, and loads it into Azure Cosmos DB?

Select 3 answers

A.Azure Functions

B.Azure Cosmos DB output binding (with Azure Functions)

C.Azure Batch

D.Azure Event Grid

E.Azure Logic Apps

AnswersA, B, D

Functions can be triggered by Blob events and run custom code.

Why this answer

Options A, C, and E are correct. Azure Functions can be triggered by Blob Storage events to run custom code, and the output can be written to Cosmos DB via the Cosmos DB output binding. Option B is wrong because Azure Batch is for large-scale parallel computing, not event-driven triggers.

Option D is wrong because Azure Logic Apps is more for orchestration, not custom code execution.

Practice this question →

185

Multi-Selectmedium

You are designing a batch processing solution in Azure Synapse Analytics using pipelines. The solution must load data from multiple sources (Azure Blob Storage, Azure SQL Database, and REST API) into a dedicated SQL pool. After loading, you need to run a stored procedure to aggregate the data. Which two activities should you include in the pipeline? (Choose two.)

Select 2 answers

A.Execute Pipeline activity

B.Copy activity

C.Azure Function activity

D.Stored procedure activity

E.Data Flow activity

AnswersB, D

Copy activity loads data from sources to SQL pool.

Why this answer

Options B and D are correct. Copy activity is used to copy data from sources. Stored procedure activity is used to execute the aggregation procedure.

Option A is wrong because Data Flow is for transformations, but the requirement is to load and then run a stored procedure. Option C is wrong because Execute Pipeline is for running another pipeline, not for executing a stored procedure. Option E is wrong because Azure Function activity is for custom code, not needed here.

Practice this question →

186

MCQhard

You are designing a batch processing solution in Azure Databricks that reads Parquet files from Azure Data Lake Storage Gen2, performs aggregations, and writes results to a Delta table. The data volume is expected to grow to 10 TB per batch. You need to minimize shuffle operations during the aggregation step. Which approach should you recommend?

A.Partition the Delta table by the aggregation key and use the same partitioning when reading.

B.Use bucketing on a different column to reduce partition skew.

C.Use a broadcast join to avoid shuffle on the aggregation.

D.Use coalesce to reduce the number of partitions before aggregation.

AnswerA

Partitioning on the aggregation key ensures data is already grouped, minimizing shuffle.

Why this answer

Option B is correct because partitioning the input data on the aggregation key ensures that each partition contains all rows for a given key, eliminating the need for a full shuffle. Option A is wrong because bucketing with a different column would not avoid shuffle on the aggregation key. Option C is wrong because coalesce reduces partitions but does not avoid shuffle.

Option D is wrong because broadcast join is for joining small tables, not for aggregations.

Practice this question →

187

MCQmedium

You are monitoring a production data pipeline in Azure Data Factory that runs hourly. You notice that the pipeline has been failing for the last 3 hours due to a timeout error when writing to the sink Azure SQL Database. The database is a General Purpose tier with 100 DTUs. The pipeline writes approximately 1 million rows per run. What is the most effective way to prevent the timeout?

A.Reduce the degree of copy parallelism in the copy activity.

B.Enable staged copy using Azure Blob Storage as an interim store.

C.Scale up the Azure SQL Database to a higher DTU tier.

D.Increase the writeBatchSize in the sink settings to reduce the number of batch inserts.

AnswerD

Larger batch size reduces the number of round trips and can prevent timeout.

Why this answer

Option D is correct because increasing the batch size reduces the number of round trips, which can speed up the write operation and reduce timeout risk. Option A is wrong because reducing parallelism might increase duration. Option B is wrong because increasing DTUs helps but is more costly and may not be necessary.

Option C is wrong because using a staging blob could add overhead and still timeout.

Practice this question →

188

MCQhard

You are troubleshooting a data processing job in Azure Synapse Pipelines that fails intermittently with the error: 'Operation on target Sink failed: The request was aborted: Could not create SSL/TLS secure channel.' The pipeline reads from Azure Blob Storage and writes to an Azure SQL Database. The source and sink are in the same region. What is the most likely cause?

A.Azure SQL Database firewall rules blocking the IP address of the integration runtime.

B.Transient network connectivity issues between the services.

C.The Azure SQL Database DTU limit has been exceeded, causing throttling.

D.The self-hosted integration runtime is using TLS 1.0, which is not supported by the services.

AnswerD

SSL/TLS handshake failure often stems from TLS version mismatch.

Why this answer

The error 'Could not create SSL/TLS secure channel' indicates a TLS version mismatch. Azure SQL Database and Azure Blob Storage require at least TLS 1.2 for secure connections. If the self-hosted integration runtime (SHIR) is configured to use TLS 1.0, the handshake fails because the services reject the older protocol.

This is the most likely cause given the intermittent nature and the specific error message.

Exam trap

The trap here is that candidates confuse SSL/TLS errors with network connectivity or throttling issues, overlooking the specific protocol version mismatch that Azure services now enforce.

How to eliminate wrong answers

Option A is wrong because firewall rules blocking the SHIR IP would produce a different error (e.g., 'Cannot open server '...' requested by the login') and would be consistent, not intermittent. Option B is wrong because transient network issues typically result in timeouts or retryable errors, not a specific SSL/TLS channel creation failure. Option C is wrong because DTU throttling would cause performance degradation or 'Request limit exceeded' errors, not an SSL/TLS handshake failure.

Practice this question →

189

MCQeasy

You are building a data transformation in Azure Databricks using PySpark. The data includes a column 'timestamp' in string format 'yyyy-MM-dd HH:mm:ss'. You need to convert this to a timestamp type and extract the date part for partitioning. Which code snippet should you use?

A.df.withColumn('date', col('timestamp').cast('date'))

B.df.withColumn('date', to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss'))

C.df.withColumn('date', to_date('timestamp', 'yyyy-MM-dd HH:mm:ss'))

D.df.withColumn('date', to_date('timestamp'))

AnswerC

Correctly converts string to date with format.

Why this answer

Option C is correct because `to_date` with the format string 'yyyy-MM-dd HH:mm:ss' converts the string column to a date type, extracting only the date part (year, month, day) as required for partitioning. This matches the requirement to convert the timestamp string to a date for partitioning, not a full timestamp.

Exam trap

The trap here is that candidates often confuse `to_date` and `to_timestamp`, assuming both extract only the date, or they forget that `cast('date')` does not accept a custom format string, leading to runtime errors or null values.

How to eliminate wrong answers

Option A is wrong because `cast('date')` on a string column will fail or produce null if the string is not in a default date format (e.g., 'yyyy-MM-dd'), and it does not accept a custom format pattern. Option B is wrong because `to_timestamp` converts the string to a full timestamp type (including time), not just the date part, which is not suitable for partitioning by date. Option D is wrong because `to_date('timestamp')` without a format string relies on the default date format (typically 'yyyy-MM-dd'), which will fail or produce incorrect results for strings with time components like 'yyyy-MM-dd HH:mm:ss'.

Practice this question →

190

MCQmedium

You are designing a data pipeline in Azure Synapse Analytics that ingests streaming taxi trip data from Azure Event Hubs. The data must be processed in near real-time and stored in a dedicated SQL pool. The pipeline should handle late-arriving data (up to 30 minutes late) without reprocessing the entire stream. Which Azure service should you use to process the streaming data?

A.Azure Databricks Structured Streaming

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Functions with Event Hubs trigger

AnswerB

Azure Stream Analytics provides built-in support for late-arriving events and can output directly to Synapse SQL pool.

Why this answer

Azure Stream Analytics is the correct choice because it is designed for real-time stream processing with native support for Event Hubs as an input and dedicated SQL pool as an output. It can handle late-arriving data via its built-in 'late arrival' window (configurable up to 30 minutes) using event time processing, without requiring reprocessing of the entire stream.

Exam trap

The trap here is that candidates often confuse Azure Data Factory's 'real-time' monitoring or Azure Functions' 'event-driven' nature with true stream processing, overlooking the need for built-in windowing and late-arrival handling that only Azure Stream Analytics provides.

How to eliminate wrong answers

Option A is wrong because Azure Databricks Structured Streaming is a batch-micro-batch engine that, while capable of streaming, requires manual management of late-arriving data via watermarking and checkpointing, and does not natively integrate with dedicated SQL pool as a sink without additional complexity. Option C is wrong because Azure Data Factory is an orchestration and ETL service for batch data movement, not a real-time stream processing engine; it cannot process streaming data from Event Hubs in near real-time. Option D is wrong because Azure Functions with Event Hubs trigger processes events one at a time in a serverless compute model, which lacks the built-in windowing, aggregation, and late-arrival handling capabilities needed for near real-time stream processing at scale.

Practice this question →

191

MCQhard

You are designing a batch processing solution using Azure Databricks. The data source is a large Parquet dataset stored in Azure Data Lake Storage Gen2 (ADLS Gen2). The processing requires joining two datasets: one with 10 billion rows and another with 1 million rows. The cluster uses Photon runtime. Which optimization should you apply to minimize shuffle?

A.Broadcast the smaller table (1 million rows) to all worker nodes.

B.Increase the cluster size to reduce shuffle overhead.

C.Create bucketed tables on the join key for both datasets.

D.Use Delta Lake and optimize file layout with OPTIMIZE command.

AnswerA

Broadcasting the smaller table avoids shuffling the large table, significantly reducing data movement.

Why this answer

Broadcasting the smaller table (1 million rows) to all worker nodes is the correct optimization because it eliminates the need for a full shuffle during the join. With Photon runtime, broadcast joins are highly efficient as they replicate the small table to each executor, allowing map-side joins that avoid costly data movement across the network. Given the 10:1 row ratio, the 1-million-row table is well within the default broadcast threshold (10 MB compressed, configurable via spark.sql.autoBroadcastJoinThreshold), making this the most effective shuffle-minimization technique.

Exam trap

The trap here is that candidates often assume increasing cluster size (Option B) is a universal performance fix, but the DP-203 exam specifically tests the understanding that shuffle reduction techniques like broadcast joins are more impactful than simply adding more nodes, especially when one dataset is small enough to fit in executor memory.

How to eliminate wrong answers

Option B is wrong because increasing cluster size does not reduce shuffle overhead; it only adds more parallelism, which can actually increase shuffle traffic and does not address the fundamental need to avoid shuffling large datasets. Option C is wrong because creating bucketed tables on the join key requires both datasets to be bucketed with the same number of buckets and a compatible bucketing scheme; while this can reduce shuffle, it involves significant upfront data reorganization and is not as immediate or lightweight as broadcasting the small table. Option D is wrong because using Delta Lake and the OPTIMIZE command improves file layout and read performance (e.g., bin-packing small files) but does not directly reduce shuffle during a join operation; shuffle reduction requires join-specific optimizations like broadcast or bucketing.

Practice this question →

192

Multi-Selecteasy

Which TWO are benefits of using Azure Databricks Auto Loader for incremental data ingestion?

Select 2 answers

A.It can process new files as they arrive in cloud storage.

B.It can handle large volumes of data without manual checkpointing.

C.It automatically evolves the schema without any configuration.

D.It provides sub-second latency for real-time streaming.

E.It provides built-in deduplication of records.

AnswersA, B

Auto Loader incrementally processes new files.

Why this answer

Azure Databricks Auto Loader is designed to incrementally ingest new data files as they arrive in cloud storage (e.g., Azure Data Lake Storage Gen2 or Blob Storage) by using a notification-based or directory-listing approach. It automatically tracks which files have already been processed using a structured streaming checkpoint mechanism, eliminating the need for manual checkpoint management even at high data volumes. This makes options A and B correct because Auto Loader's core value is its ability to discover and process new files without manual intervention and to handle large-scale ingestion reliably.

Exam trap

The trap here is that candidates often confuse Auto Loader's schema inference (which is automatic on first read) with automatic schema evolution (which requires explicit configuration), and they also mistakenly assume file-based ingestion can achieve sub-second latency or provide built-in deduplication, which are not features of this service.

Practice this question →

193

Multi-Selecteasy

Which TWO Azure services can be used to perform data transformation in a data pipeline? (Select two.)

Select 2 answers

A.Azure Data Factory

B.Azure Monitor

C.Azure Storage

D.Azure Event Hubs

E.Azure Databricks

AnswersA, E

Data Factory offers mapping data flows and compute activities for transformation.

Why this answer

Azure Data Factory is a cloud-based ETL service that provides a code-free visual interface for orchestrating data movement and transformation at scale. It supports data flows, which allow you to perform transformations like aggregations, joins, and filtering without writing code, making it a correct choice for data transformation in a pipeline.

Exam trap

The trap here is that candidates often confuse data ingestion services (like Event Hubs) or storage services (like Azure Storage) with transformation services, forgetting that transformation requires compute engines like Data Factory or Databricks.

Practice this question →

194

Multi-Selecthard

You are building a data processing pipeline in Azure Synapse Analytics that uses a mapping data flow to perform a lookup transformation. The lookup source is a dimension table with 10 million rows. You need to optimize the lookup performance. Which THREE actions should you take?

Select 3 answers

A.Increase the batch size in the lookup transformation settings.

B.Partition the dimension table on the lookup key before reading.

C.Enable the 'Broadcast' option on the lookup source transformation if the dimension table is less than 100 MB.

D.Select only the necessary columns in the lookup source transformation.

E.Ensure the dimension table has an index on the columns used for the lookup.

AnswersC, D, E

Broadcasting avoids shuffle for small dimension tables.

Why this answer

Options A, B, and E are correct. Option A is correct because broadcasting the lookup table (if small enough) avoids shuffling. Option B is correct because indexing the lookup columns speeds up the lookup.

Option C is wrong because increasing batch size in a lookup does not directly improve performance; it's for sink. Option D is wrong because partitioning the lookup table on the join key can help, but the question is about the data flow, not the table design. Option E is correct because narrowing the columns reduces data transfer.

Practice this question →

195

Multi-Selecthard

Which THREE components are required to implement a real-time data processing solution using Azure Stream Analytics?

Select 3 answers

A.Power BI as the output sink

B.Azure Data Factory pipeline for orchestration

C.An input source such as Azure Event Hubs or IoT Hub

D.An output sink such as Azure Synapse Analytics or Blob Storage

E.A Stream Analytics job with a defined query

AnswersC, D, E

Streaming input is required for real-time processing.

Why this answer

Option C is correct because Azure Stream Analytics requires a streaming input source to ingest real-time data. Azure Event Hubs and IoT Hub are the primary services that provide high-throughput, low-latency event ingestion, which Stream Analytics can consume via its built-in connector. Without a streaming input, the job cannot process real-time data.

Exam trap

The trap here is that candidates often assume Power BI is a required output for real-time dashboards, but Stream Analytics can function without any visualization sink, and the exam focuses on the minimal required components: input, job with query, and output sink.

Practice this question →

196

MCQmedium

You are designing a data processing solution for an e-commerce company. The company receives millions of clickstream events per hour from their website and needs to aggregate the data by product category and windowed time intervals for real-time dashboards. You need to minimize latency and cost. Which service should you use?

A.Azure Databricks Structured Streaming

B.Azure Data Factory

C.Azure Stream Analytics

D.Azure Synapse Pipelines

AnswerC

Provides real-time stream processing with windowed aggregations.

Why this answer

Option B (Azure Stream Analytics) is the best choice because it is purpose-built for real-time stream processing, supports windowed aggregations, and integrates with Power BI for dashboards. Option A (Azure Synapse Pipelines) is for orchestrating data movement, not real-time processing. Option C (Azure Databricks) can handle streaming but is more complex and typically more expensive for simple aggregations.

Option D (Azure Data Factory) is for batch data movement, not real-time.

Practice this question →

197

MCQhard

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline uses a self-hosted integration runtime. You notice that the copy activity fails intermittently with the error: 'Failure happened on 'Source' side. ErrorCode=SqlOperationFailed'. The on-premises SQL Server is under heavy load during business hours. What is the most likely cause?

A.The SQL Server is experiencing resource contention or timeout due to heavy load.

B.The Azure Blob Storage account is throttling requests.

C.The authentication method to SQL Server is incorrect.

D.The self-hosted integration runtime is not connected to the network.

AnswerA

Intermittent failures often due to resource pressure.

Why this answer

The error 'SqlOperationFailed' on the source side indicates that the SQL Server itself is failing to complete the query or data extraction operation. Under heavy load, the SQL Server may experience resource contention (CPU, memory, I/O) or reach query timeout thresholds, causing the copy activity to fail intermittently. This is consistent with the described scenario of heavy load during business hours.

Exam trap

The trap here is that candidates may confuse a source-side error with a sink-side error, or assume that any intermittent failure must be a network or connectivity issue, rather than recognizing that SQL Server resource contention under heavy load is a classic cause of intermittent 'SqlOperationFailed' errors.

How to eliminate wrong answers

Option B is wrong because Azure Blob Storage throttling would produce an error on the 'Sink' side (e.g., 'StorageError' or 'BlobOperationFailed'), not on the 'Source' side. Option C is wrong because an incorrect authentication method would cause a persistent authentication failure (e.g., 'Login failed for user') on every attempt, not intermittent failures. Option D is wrong because if the self-hosted integration runtime were not connected to the network, the pipeline would fail consistently with a connectivity error (e.g., 'Unable to connect to Integration Runtime'), not an intermittent SQL operation error.

Practice this question →

198

MCQeasy

You are designing a data processing pipeline in Azure Data Factory. The pipeline must copy data from Azure Blob Storage to Azure SQL Database and transform the data using a mapping data flow. The data flow includes a Derived Column transformation. What is the purpose of the Derived Column transformation?

A.Aggregate data by grouping rows.

B.Create new columns or modify existing columns using expressions.

C.Sort data in ascending or descending order.

D.Rename or drop columns.

AnswerB

Derived Column allows expression-based column creation and modification.

Why this answer

The Derived Column transformation in Azure Data Factory mapping data flows is used to create new columns or modify existing columns by applying expressions. This allows you to perform calculations, string manipulations, or conditional logic directly within the data flow, enabling in-flight data transformation before writing to the sink.

Exam trap

The trap here is that candidates confuse the Derived Column transformation with the Select transformation, assuming it is used for renaming or dropping columns, when in fact Derived Column is specifically for creating or modifying column values via expressions.

How to eliminate wrong answers

Option A is wrong because aggregating data by grouping rows is the purpose of the Aggregate transformation, not the Derived Column transformation. Option C is wrong because sorting data is performed by the Sort transformation, which reorders rows based on column values. Option D is wrong because renaming or dropping columns is handled by the Select transformation, which allows you to include, exclude, or alias columns.

Practice this question →

199

MCQhard

You are implementing a streaming solution using Azure Stream Analytics. The input is from an IoT Hub receiving telemetry from thousands of devices. The output is to Azure Synapse Analytics dedicated SQL pool. The requirement is to compute rolling averages over a 5-minute tumbling window and write results every minute. Which windowing function and output configuration should you use?

A.Use a TumblingWindow with duration of 5 minutes and output every 5 minutes.

B.Use a SlidingWindow with duration 5 minutes and output every 1 minute.

C.Use a HoppingWindow with size 5 minutes and hop 1 minute.

D.Use a SessionWindow with timeout 5 minutes and maximum duration 10 minutes.

AnswerC

Hopping windows with a 1-minute hop produce results every minute, each covering the last 5 minutes.

Why this answer

Option C is correct because a HoppingWindow with a size of 5 minutes and a hop of 1 minute allows you to compute rolling averages over a 5-minute window while producing results every minute. This satisfies the requirement of outputting results at a higher frequency than the window duration, which is not possible with a TumblingWindow (which only outputs at the end of the window) or a SlidingWindow (which outputs on each event, not at fixed intervals).

Exam trap

The trap here is that candidates confuse the output frequency with the window duration, assuming a TumblingWindow can produce results more frequently by adjusting the duration, but only a HoppingWindow with a hop smaller than the size can achieve that.

How to eliminate wrong answers

Option A is wrong because a TumblingWindow with a duration of 5 minutes only produces output at the end of each 5-minute window, not every minute. Option B is wrong because a SlidingWindow with a duration of 5 minutes outputs results for every event (or at the end of each sliding interval), not at a fixed 1-minute interval; it also does not align with the requirement to output every minute on a schedule. Option D is wrong because a SessionWindow is designed for event-based sessions with gaps and timeouts, not for fixed-duration rolling averages with periodic output.

Practice this question →

200

MCQeasy

You need to orchestrate a data pipeline that includes a Python script and a Data Flow in Azure Synapse Analytics. The Python script must run before the Data Flow. Which activity should you use to run the Python script?

A.Notebook activity configured to use a Python kernel

B.Web activity

C.Stored Procedure activity

D.HDInsight Hive activity

AnswerA

A Notebook activity can run Python code in Synapse Spark notebooks.

Why this answer

A Notebook activity in Azure Synapse Analytics can be configured to use a Python kernel, allowing you to run a Python script directly within the pipeline. This is the correct choice because the requirement is to execute a Python script before a Data Flow, and the Notebook activity supports Python execution natively in Synapse pipelines.

Exam trap

The trap here is that candidates may confuse a Notebook activity with a Web activity or a Stored Procedure activity, thinking they can execute arbitrary code, but only the Notebook activity supports Python execution natively in Synapse pipelines.

How to eliminate wrong answers

Option B is wrong because a Web activity calls an HTTP/S endpoint (e.g., a REST API) and cannot run a Python script directly; it is used for invoking external services, not for executing code within Synapse. Option C is wrong because a Stored Procedure activity executes SQL stored procedures in a database, which is not designed for running Python scripts. Option D is wrong because an HDInsight Hive activity runs Hive queries on an HDInsight cluster, not Python scripts; it is meant for HiveQL, not Python execution.

Practice this question →

201

MCQmedium

Refer to the exhibit. A data engineer is reviewing a Mapping Data Flow in Azure Data Factory. The source reads 1000 rows from a CSV file. The sink writes to a Delta dataset with optimizeWrite enabled. The data flow currently has no transformations. The engineer runs the data flow and notices that the sink writes only 500 rows. What is the most likely cause?

A.The source's rowLimit setting limits the number of rows read, but the sink writes only unique rows.

B.The sink's allowSchemaDrift setting prevents writing rows with additional columns.

C.The sink's partitionBy('hash', 1) setting causes some rows to be dropped due to partitioning logic.

D.The data flow has no transformations, so the sink cannot write data.

AnswerC

If the partition column is missing or null, rows may be dropped.

Why this answer

Option A is correct because the sink script has 'partitionBy('hash', 1)' which forces the data to be repartitioned into a single partition. Since the source reads 1000 rows and the repartition may cause some rows to be dropped due to hash distribution logic in the sink. However, the key point is that the sink's partitioning overrides the source partitioning and may lead to data loss if there is a mismatch.

Actually, the correct cause is that the sink is using hash partitioning on a column that may have many nulls or the partition column is not specified, causing skewed distribution. But the most direct answer is that the sink's partitionBy('hash', 1) creates a single partition and the hash function might drop rows that hash to a non-existent partition? No, hash with 1 partition should include all rows. Another possibility: the source has rowLimit 1000, but sink's partitionBy may cause duplicates? Let's think.

The exhibit shows partitionBy('hash', 1). In ADF, when you use hash partitioning with 1 partition, all rows go to that partition, so no rows should be lost. However, the sink has allowSchemaDrift true, but no transformations.

The discrepancy might be due to the source rowLimit being applied after reading? Actually rowLimit is applied at source. The most plausible cause is that the sink uses a Delta dataset and the Delta write operation may skip rows with nulls in partition columns? Option A mentions the sink's partitioning causes some rows to be dropped. Option B mentions allowSchemaDrift causing schema mismatch; but schema is matched.

Option C mentions source rowLimit being applied before transformations; there are no transformations. Option D mentions missing transformation; not needed. The correct answer is A because in some cases, if the partition column is not in the source, the hash function might fail.

But the script shows no partition column specified. In ADF, if you use partitionBy('hash', 1) without specifying a column, it uses a random hash? Actually it requires a column. The script is incomplete.

Given the context, Option A is the best answer.

Practice this question →

202

MCQeasy

Your team is developing a real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs. The output must be written to Azure SQL Database. You need to ensure that the processing guarantees exactly-once semantics for the output. Which output configuration should you use?

A.Set output mode to 'When possible'

B.Set output mode to 'Exactly once'

C.Set output mode to 'At least once'

D.Set output mode to 'Best effort'

AnswerB

Ensures no duplicate writes to SQL Database.

Why this answer

Option C is correct because Azure Stream Analytics supports exactly-once delivery to Azure SQL Database when using the 'Exactly once' output mode. Option A is wrong because 'At least once' is the default and can cause duplicates. Option B is wrong because 'When possible' is not a valid output mode.

Option D is wrong because 'Best effort' is not a valid mode.

Practice this question →

203

MCQhard

You are designing a data ingestion pipeline for Azure Synapse Analytics using PolyBase. The source data is in Azure Blob Storage, with files in the Parquet format. The data volume is approximately 2 TB daily. You need to minimize the time to load data into a dedicated SQL pool table. Which approach should you recommend?

A.Split the Parquet files into many small files (e.g., 64 MB each) to maximize parallelism.

B.Convert the Parquet files to CSV format before loading using PolyBase.

C.Use the COPY INTO command to load the Parquet files directly into the dedicated SQL pool.

D.Create an external table pointing to the Parquet files and use CREATE TABLE AS SELECT (CTAS) to load the data.

AnswerD

CTAS with external tables enables predicate pushdown and efficient column pruning.

Why this answer

Option B is correct because creating an external table with the Parquet file format and using CREATE TABLE AS SELECT (CTAS) allows PolyBase to leverage predicate pushdown and column pruning, which reduces data movement and improves performance. Option A is wrong because splitting files into smaller chunks increases the number of parallel reads but the overhead of many small files can degrade performance. Option C is wrong because converting to CSV adds overhead and increases data size.

Option D is wrong because COPY INTO is optimized for small to medium loads, while CTAS with external tables is better for large-scale loads.

Practice this question →

204

Multi-Selectmedium

You are designing a data processing solution that must handle personally identifiable information (PII). The data will be ingested from multiple sources into Azure Data Lake Storage. Which TWO actions should you take to protect the PII data during processing?

Select 2 answers

A.Apply data masking in Power BI reports to hide PII from end users.

B.Use column-level security in Azure Synapse Analytics to restrict access to sensitive columns.

C.Configure the Azure Data Lake Storage firewall to allow only specific IP addresses.

D.Implement SQL Server Always Encrypted for all data stored in the data lake.

E.Use Azure Purview to classify and label PII data in the data lake.

AnswersB, E

Column-level security allows fine-grained access control at the database level.

Why this answer

Options A and C are correct. Column-level security in Azure Synapse restricts access to sensitive columns. Azure Purview provides data classification and lineage.

Option B (masking in Power BI) is a presentation-layer control. Option D (server-level firewall) is network security. Option E (always encrypted) is for SQL Server, not ADLS.

Practice this question →

205

Multi-Selectmedium

Which TWO actions should you take to optimize performance of a dedicated SQL pool in Azure Synapse Analytics when loading large volumes of data?

Select 2 answers

A.Disable index on the target table after loading.

B.Use a large batch size (e.g., 100 MB) for each copy operation.

C.Use round-robin distribution for the staging table.

D.Use a small batch size (e.g., 1 MB) for each copy operation.

E.Use clustered columnstore index on the target table during load.

AnswersB, C

Large batches reduce number of transactions.

Why this answer

Option B is correct because using a large batch size (e.g., 100 MB) for each copy operation minimizes the number of round trips and transaction commits, which significantly improves throughput when loading large volumes of data into a dedicated SQL pool. The PolyBase or COPY statement in Azure Synapse performs best when batches are large enough to leverage parallel processing and reduce overhead from frequent small writes.

Exam trap

The trap here is that candidates often confuse batch size optimization with transaction log management, mistakenly thinking smaller batches reduce log pressure, when in fact larger batches reduce overall load time and improve throughput in Azure Synapse's distributed architecture.

Practice this question →

206

MCQeasy

You are running a Spark job in Azure Synapse Analytics that reads from a Delta Lake table and performs multiple transformations. The job fails with an out-of-memory error on the executors. Which action should you take first to resolve the issue?

A.Enable checkpointing to truncate the lineage.

B.Decrease the number of partitions to reduce overhead.

C.Increase the executor memory setting in the Spark configuration.

D.Use the cache() action on intermediate DataFrames.

AnswerC

Increasing executor memory provides more heap space to avoid OOM errors.

Why this answer

Option C is correct because an out-of-memory error on executors indicates that the available memory per executor is insufficient for the data being processed. Increasing the executor memory setting in the Spark configuration directly addresses this by allocating more heap space, allowing transformations to complete without spilling to disk or failing. This is the first and most straightforward action to take before optimizing partitioning or caching.

Exam trap

The trap here is that candidates often confuse memory issues with partitioning or caching optimizations, but the immediate fix for an out-of-memory error is to increase executor memory, not to reduce parallelism or persist data.

How to eliminate wrong answers

Option A is wrong because checkpointing truncates the lineage and helps with recovery and plan optimization, but it does not directly increase available memory or resolve an out-of-memory error. Option B is wrong because decreasing the number of partitions reduces parallelism and can actually increase memory pressure per partition, worsening the out-of-memory issue. Option D is wrong because using cache() persists intermediate DataFrames in memory, which consumes additional memory and can exacerbate the out-of-memory error rather than resolving it.

Practice this question →

207

MCQeasy

Your team is developing a data processing solution that uses Azure Databricks to transform streaming data from Azure Event Hubs. The transformation includes joining the stream with a static reference table stored in Azure Data Lake Storage Gen2. You need to implement the join efficiently. Which approach should you use?

A.Use a watermark on both sides and perform a stream-stream join

B.Use a broadcast join with the static DataFrame loaded from Delta Lake

C.Use foreachBatch to micro-batch the stream and perform a batch join

D.Use a stream-stream join by converting the static table to a stream

AnswerB

Broadcast join avoids shuffling and is efficient for streaming-static joins.

Why this answer

Option A is correct because streaming-static joins in Spark Structured Streaming are optimized when the static data is broadcast to all nodes, avoiding shuffles. Option B is wrong because the reference table is static; streaming-static join is appropriate. Option C is wrong because joining two streams is not needed.

Option D is wrong because foreachBatch with a batch join adds complexity.

Practice this question →

208

MCQmedium

You are building a data processing solution in Azure Synapse Analytics. The solution requires creating a table that stores sales transactions. The table will be used for both point-of-sale lookups and large aggregation queries. The data is not updated frequently. Which table distribution should you recommend?

A.ROUND_ROBIN

B.HASH on TransactionID

C.HASH on SalesDate

D.REPLICATE

AnswerA

ROUND_ROBIN evenly distributes data and works well for mixed workloads.

Why this answer

ROUND_ROBIN is the correct choice because the table is used for both point-of-sale lookups (single-row queries) and large aggregation queries, and the data is not updated frequently. ROUND_ROBIN distributes data evenly across all distributions without a hash key, which provides the best overall performance for mixed workloads where no single distribution key optimizes both lookup and aggregation patterns. It avoids data skew and allows parallel processing for aggregations while still supporting efficient lookups when combined with appropriate indexes.

Exam trap

The trap here is that candidates often choose HASH distribution thinking it always improves query performance, but they overlook that without a clear join or grouping column that matches the hash key, HASH can cause data skew and actually degrade mixed workload performance compared to ROUND_ROBIN.

How to eliminate wrong answers

Option B (HASH on TransactionID) is wrong because hashing on a high-cardinality column like TransactionID would distribute data evenly but would not optimize large aggregation queries that typically group by date or region, and it would not improve point-of-sale lookups unless the lookup filter includes TransactionID. Option C (HASH on SalesDate) is wrong because hashing on a date column can cause severe data skew if most transactions occur on a few dates, leading to uneven distribution and poor query performance; it also does not optimize point-of-sale lookups that rarely filter by date alone. Option D (REPLICATE) is wrong because replication is designed for small dimension tables (typically < 2 GB) and would be impractical for a large sales transaction table, causing excessive storage and maintenance overhead.

Practice this question →

209

MCQhard

Refer to the exhibit. You have an Azure Synapse Analytics workspace. You need to ensure that data processing jobs can access the Data Lake Storage Gen2 account using a managed identity. What should you do?

A.Use the SQL admin login credentials to access the storage account

B.Enable the system-assigned managed identity on the Synapse workspace and assign it the 'Storage Blob Data Contributor' role on the storage account

C.Create a private endpoint connection between the workspace and the storage account

D.Configure the storage account firewall to allow access from the Synapse workspace

AnswerB

The managed identity needs RBAC permissions on the storage account.

Why this answer

Option B is correct because Azure Synapse Analytics supports system-assigned managed identities, which provide a secure, passwordless authentication method for accessing Azure Data Lake Storage Gen2. By enabling the managed identity on the Synapse workspace and assigning it the 'Storage Blob Data Contributor' role, you grant the workspace's data processing jobs the necessary permissions to read, write, and delete data in the storage account without managing credentials.

Exam trap

The trap here is that candidates often confuse network-level access controls (firewall rules or private endpoints) with identity-based authorization (RBAC), mistakenly thinking that allowing network traffic alone is sufficient for data access.

How to eliminate wrong answers

Option A is wrong because using SQL admin login credentials to access a storage account is not supported; SQL authentication is for database access, not for Azure Storage RBAC. Option C is wrong because creating a private endpoint ensures network-level isolation and private connectivity, but it does not grant the identity permissions to access the storage account; RBAC role assignment is still required. Option D is wrong because configuring the storage account firewall to allow access from the Synapse workspace only controls network traffic, not authentication or authorization; the managed identity still needs the appropriate RBAC role to perform data operations.

Practice this question →

210

MCQeasy

You are using Azure Data Factory to copy data from Azure Blob Storage to Azure SQL Database. The copy operation fails with error 'Cannot insert duplicate key'. What is the most likely cause and solution?

A.The source files are in an incompatible format; convert to CSV

B.The sink table schema does not match the source; update the schema

C.The copy activity is not using staging; enable staging

D.The sink table has a primary key, and the source contains duplicate rows; enable upsert in the copy activity

AnswerD

Upsert behavior can handle duplicate key violations.

Why this answer

The error indicates duplicate rows violating a primary key or unique constraint. Enabling upsert behavior allows the copy activity to handle duplicates. Option A is wrong because source format doesn't cause duplicates.

Option B is wrong because using staging doesn't fix duplicates. Option D is wrong because sink table schema is not the cause.

Practice this question →

211

MCQhard

You are troubleshooting a pipeline in Azure Data Factory that copies data from an Azure Blob Storage to an Azure Synapse Analytics dedicated SQL pool. The pipeline fails with the error: 'PolyBase requires a varchar(max) column to be less than 1 MB.' Which action should you take to resolve this issue?

A.Modify the source data to truncate varchar(max) columns to 8000 characters.

B.Configure the copy activity to use staging via Azure Blob Storage.

C.Increase the 'batchSize' property in the copy activity to 10000.

D.Disable PolyBase in the sink settings and use bulk insert instead.

AnswerB

Using staging allows PolyBase to handle large varchar(max) columns by breaking them into smaller chunks.

Why this answer

The error indicates that PolyBase is being used for the copy operation, and it has a limitation that varchar(max) columns must be less than 1 MB. Configuring staging via Azure Blob Storage (option B) allows the copy activity to use PolyBase with staging, which automatically splits large varchar(max) values into manageable chunks, bypassing the 1 MB limit. This is the recommended approach in Azure Data Factory for loading large string data into Synapse dedicated SQL pools.

Exam trap

The trap here is that candidates often assume the error requires truncating data or switching to bulk insert, but the correct solution leverages PolyBase's staging feature to handle large columns without data loss.

How to eliminate wrong answers

Option A is wrong because truncating varchar(max) columns to 8000 characters is a data loss solution and does not address the PolyBase limitation; PolyBase's 1 MB limit is on the total size of the column value, not character count, and truncation may still exceed 1 MB if the data is multi-byte. Option C is wrong because the 'batchSize' property controls the number of rows per batch for bulk insert operations, not the size of individual columns, and it does not affect PolyBase's varchar(max) size restriction. Option D is wrong because disabling PolyBase and using bulk insert would work but is less performant and not the recommended resolution; the error specifically occurs when PolyBase is enabled, and using staging with PolyBase is the intended fix.

Practice this question →

212

Multi-Selectmedium

You are using Azure Stream Analytics to process real-time data from an Event Hub. Which TWO of the following are valid output sinks?

Select 2 answers

A.Azure Synapse Analytics

B.Azure Blob Storage

C.Azure Queue Storage

D.Azure Files

E.Azure Table Storage

AnswersA, B

Supported via a dedicated SQL pool.

Why this answer

Azure Stream Analytics supports Azure Synapse Analytics as a native output sink, allowing you to write real-time streaming results directly into dedicated SQL pools for high-performance analytics. This is achieved via the built-in Azure Synapse Analytics output adapter, which uses PolyBase or COPY INTO for efficient bulk ingestion.

Exam trap

The trap here is that candidates often confuse Azure Table Storage or Queue Storage as valid sinks because they are general Azure storage services, but Stream Analytics has a specific, limited set of supported output sinks documented in official Microsoft documentation.

Practice this question →

213

MCQmedium

Refer to the exhibit. You are monitoring the CopyDataPipeline in Azure Data Factory. The copy activity is failing with timeout errors. What is the most likely cause?

A.The writeBatchTimeout is set too low (30 seconds), causing timeouts

B.The enableStaging is false, causing network congestion

C.The writeBatchSize is too large (10000), exceeding SQL limits

D.The recursive property is set to true, causing infinite loops

AnswerA

30 seconds may be insufficient for large batches.

Why this answer

The copy activity in Azure Data Factory is failing with timeout errors because the `writeBatchTimeout` is set to 30 seconds, which is too low for the volume of data being written to the sink. This property defines the maximum time allowed for a single batch write operation to complete; when it expires, the activity times out. Increasing this value (e.g., to 120 seconds or more) accommodates larger or slower writes, resolving the timeout.

Exam trap

The trap here is that candidates often confuse `writeBatchTimeout` with `writeBatchSize`, assuming a large batch size causes timeouts, but the timeout is a separate property that controls how long the system waits for a batch to complete, not the size of the batch itself.

How to eliminate wrong answers

Option B is wrong because `enableStaging` being false does not cause network congestion; staging is an optional intermediate storage for large data transfers or to enable PolyBase, and its absence does not inherently lead to timeouts. Option C is wrong because `writeBatchSize` of 10000 rows is within typical SQL Database limits (default batch size is 10,000 rows for bulk insert), and exceeding limits would cause row-level errors, not timeout errors. Option D is wrong because `recursive` property controls whether subdirectories are processed in file-based sources (e.g., Blob storage) and has no effect on timeout behavior in a copy activity; it cannot cause infinite loops.

Practice this question →

214

MCQmedium

You are designing a data processing solution for a retail company. The solution must ingest clickstream data from a website, process it in near real-time to update user session information, and store the results in Azure Cosmos DB for low-latency queries. Which combination of Azure services should you use?

A.Azure Event Hubs -> Azure Stream Analytics -> Azure SQL Database

B.Azure IoT Hub -> Azure Data Factory -> Azure Blob Storage

C.Azure Event Hubs -> Azure Stream Analytics -> Azure Cosmos DB

D.Azure Event Hubs -> Azure Functions -> Azure Cosmos DB

AnswerD

Event Hubs ingests events, Functions performs lightweight processing, Cosmos DB provides low-latency access.

Why this answer

Option D is correct because Azure Event Hubs ingests streaming data, Azure Functions processes it (e.g., for sessionization), and Cosmos DB stores it. Option A is wrong because Azure SQL Database is not ideal for low-latency key-value queries. Option B is wrong because Blob Storage is not designed for real-time querying.

Option C is wrong because Stream Analytics is more for analytical processing, but Cosmos DB is correct, but the combination with Azure SQL DB is not optimal.

Practice this question →

215

MCQeasy

You are designing a streaming job in Azure Stream Analytics. The job needs to count the number of events per device type every 10 seconds. The input is from Event Hubs. Which query should you use?

A.SELECT DeviceType, COUNT(*) FROM Input GROUP BY DeviceType, SessionWindow(second, 10, 30)

B.SELECT DeviceType, COUNT(*) FROM Input GROUP BY DeviceType, TumblingWindow(second, 10)

C.SELECT DeviceType, COUNT(*) FROM Input GROUP BY DeviceType, HoppingWindow(second, 10, 1)

D.SELECT DeviceType, COUNT(*) FROM Input GROUP BY DeviceType, SlidingWindow(second, 10)

AnswerB

Tumbling window outputs exactly every 10 seconds.

Why this answer

Option B is correct because a TumblingWindow(second, 10) produces non-overlapping, fixed-size 10-second windows, which is exactly what is needed to count events per device type every 10 seconds. The GROUP BY clause groups by DeviceType and the window, ensuring each device type gets its own count per window. This query meets the requirement without overlapping or sliding behavior.

Exam trap

The trap here is that candidates confuse HoppingWindow with TumblingWindow, thinking a hop size of 1 second still produces 10-second intervals, but HoppingWindow emits results at every hop, not at the window duration, leading to incorrect output frequency.

How to eliminate wrong answers

Option A is wrong because SessionWindow(second, 10, 30) defines session windows based on inactivity gaps, not fixed 10-second intervals; the 30-second timeout means windows can be much longer than 10 seconds, violating the requirement. Option C is wrong because HoppingWindow(second, 10, 1) creates overlapping windows that emit results every 1 second, not every 10 seconds, leading to redundant counts. Option D is wrong because SlidingWindow(second, 10) produces a continuous stream of results for every event within the last 10 seconds, not discrete 10-second intervals, so it does not count events 'every 10 seconds' as a batch.

Practice this question →

216

MCQmedium

You are using Azure Synapse Analytics dedicated SQL pool to run a query that joins a large fact table (10 billion rows) and a small dimension table (1 million rows). The query is slow. Which distribution strategy should you use for the dimension table to improve performance?

A.Round-robin distribute the dimension table.

B.Hash-distribute the dimension table on its primary key.

C.Replicate the dimension table to all compute nodes.

D.Hash-distribute the dimension table on the foreign key column.

AnswerC

Replication avoids data movement for small tables.

Why this answer

Replicating the small dimension table (1 million rows) to all compute nodes eliminates data movement during the join with the large fact table (10 billion rows). In Azure Synapse dedicated SQL pool, replicated tables store a full copy on each distribution, so the join can be performed locally on every node without shuffling data across the network, drastically reducing query latency.

Exam trap

The trap here is that candidates often choose hash distribution on the foreign key (Option D) thinking it aligns the join keys, but they overlook that the fact table is typically distributed on a different column (e.g., its own primary key or a date column), so the join still requires data movement, whereas replication is the optimal strategy for small dimension tables in a star schema.

How to eliminate wrong answers

Option A is wrong because round-robin distribution spreads the dimension table evenly across distributions without any alignment with the fact table, causing all join operations to require data movement (shuffle) across nodes, which is highly inefficient for a large fact table. Option B is wrong because hash-distributing the dimension table on its primary key does not align with the fact table's distribution key (typically the foreign key), so the join will still require redistributing one or both tables unless the fact table is also hash-distributed on the same column. Option D is wrong because hash-distributing the dimension table on the foreign key column would scatter its rows across distributions, but the fact table is likely hash-distributed on a different column (e.g., its own primary key or a different foreign key), so the join would still cause data movement; moreover, dimension tables are typically small and benefit more from replication than from hash distribution.

Practice this question →

217

MCQhard

Your company uses Azure Synapse Analytics and has deployed a pipeline that uses a Mapping Data Flow to transform data. The data flow reads from a source in Azure Blob Storage and writes to a dedicated SQL pool. You notice that the data flow is running slowly and consuming a lot of Data Flow cluster resources. You need to improve performance without increasing the cluster size. Which action should you take?

A.Use a self-hosted integration runtime instead of the default auto-resolve IR.

B.Increase the batch size in the data flow settings to reduce the number of round trips.

C.Add a partitioning step in the data flow to distribute the data across partitions based on a key column.

D.Change the source format to Delta Lake to leverage optimizations.

AnswerC

Partitioning the data can improve parallelism and performance.

Why this answer

Adding a partitioning step in the Mapping Data Flow distributes data across partitions based on a key column, which allows parallel processing across the cluster's nodes. This reduces data shuffling and improves throughput without increasing the cluster size, directly addressing the performance bottleneck caused by skewed or unpartitioned data.

Exam trap

The trap here is that candidates often confuse increasing batch size (Option B) with improving parallelism, but batch size only affects sink write operations, not the internal data processing distribution that causes cluster resource exhaustion.

How to eliminate wrong answers

Option A is wrong because using a self-hosted integration runtime (IR) would introduce network latency and management overhead, and it does not address the internal data flow processing inefficiency; the default auto-resolve IR is optimized for Azure services. Option B is wrong because increasing the batch size in data flow settings reduces round trips to the sink but does not resolve the root cause of slow transformation performance, which is data distribution and parallelism within the cluster. Option D is wrong because changing the source format to Delta Lake does not inherently improve performance for a Mapping Data Flow reading from Blob Storage; Delta Lake optimizations (like Z-ordering) require a Delta Lake engine and are not applicable to the current pipeline's source format.

Practice this question →

218

MCQhard

You are designing a batch processing pipeline in Azure Databricks. The data is stored in Delta Lake and you need to perform a time-series join between two tables: 'events' (100 billion rows) and 'sessions' (10 billion rows). The join condition is on 'device_id' and a timestamp range (event_time BETWEEN session_start AND session_end). Which join strategy would be most efficient?

A.Broadcast the smaller table (sessions) to all nodes.

B.Use a range join with interval threshold using Delta Lake's optimized join.

C.Use a sort-merge join by repartitioning both tables on device_id.

D.Bucket both tables on device_id with 500 buckets.

AnswerB

Delta Lake supports range join optimization with interval thresholds, reducing data shuffle.

Why this answer

Option B is correct because Delta Lake's optimized range join leverages interval threshold pruning and data skipping to efficiently handle time-series joins on large datasets. This strategy avoids full shuffles by using min/max statistics and Bloom filters to eliminate non-matching partitions, making it far more efficient than generic join methods for 100B and 10B row tables.

Exam trap

Microsoft often tests the misconception that broadcasting a large table is acceptable if it fits in memory, but the trap here is that candidates overlook the driver memory limit and assume broadcast join scales linearly, while the correct answer requires understanding Delta Lake's specialized range join optimization for time-series data.

How to eliminate wrong answers

Option A is wrong because broadcasting a 10 billion row 'sessions' table would exceed driver memory and cause out-of-memory errors; broadcast joins are only suitable for small tables (typically < 1 GB). Option C is wrong because a sort-merge join with repartitioning on 'device_id' alone does not optimize the timestamp range condition, leading to a full shuffle of both massive tables and poor performance. Option D is wrong because bucketing on 'device_id' with 500 buckets does not address the range join predicate; it only co-locates rows by hash, but the timestamp range still requires a cross-join-like comparison within each bucket, which is inefficient.

Practice this question →

219

MCQhard

You are developing a real-time data processing solution for a financial services company. The system ingests stock trade data from Azure Event Hubs at 50,000 events per second. Each event is a JSON object with fields: TradeID, Symbol, Price, Quantity, Timestamp. You need to calculate a 5-minute rolling average of the trade price per symbol and store the result in Azure Cosmos DB for low-latency queries. Additionally, you need to detect anomalies where the price deviates more than 10% from the rolling average within the same window, and send alerts to Azure Event Grid. You must minimize latency and ensure that the processing is stateful across multiple partitions. What should you do?

A.Use Azure Functions with Event Hubs trigger. In each function invocation, compute the rolling average using a distributed cache (Redis) and detect anomalies. Write to Cosmos DB and Event Grid via output bindings.

B.Use Azure Synapse Pipelines with a Data Flow. Set up a streaming Data Flow from Event Hubs, compute rolling average using window functions, and sink to Cosmos DB and Event Grid.

C.Use Azure Databricks with Structured Streaming. Read from Event Hubs using Kafka API. Perform windowed aggregations and anomaly detection using Spark SQL. Write to Cosmos DB via the Azure Cosmos DB Spark connector and to Event Grid via HTTP sink.

D.Create an Azure Stream Analytics job. Define input from Event Hubs. Use a Tumbling window of 5 minutes to compute average price per symbol. Add a custom function to compare each event's price to the average and output anomalies. Write to Cosmos DB via the Azure Cosmos DB output adapter and to Event Grid via the Event Grid output adapter.

AnswerD

Stream Analytics provides native support for windowing, stateful processing, and multiple outputs.

Why this answer

Azure Stream Analytics can ingest from Event Hubs, perform windowed aggregations (e.g., Tumbling window for rolling average), detect anomalies using conditional logic, and output to both Cosmos DB and Event Grid. It handles partitioning automatically and is stateful. Option A is correct.

Option B uses Azure Functions which are not ideal for high-throughput stateful stream processing. Option C uses Databricks Structured Streaming which is more complex to manage. Option D uses Synapse Pipelines which are batch-oriented.

Practice this question →

220

MCQmedium

A manufacturing company uses Azure Data Lake Storage Gen2 to store IoT sensor data. The data arrives in JSON format with a nested structure. You need to transform the data into a tabular format for downstream analytics using Azure Synapse Pipelines. Which data flow transformation should you use?

A.Aggregate transformation

B.Flatten transformation

C.Window transformation

D.Pivot transformation

AnswerB

Flatten unpacks nested arrays into rows, converting JSON to tabular format.

Why this answer

Option B is correct because the Flatten transformation in mapping data flows unpacks nested arrays into rows. Option A is wrong because the Aggregate transformation groups data but does not flatten nested structures. Option C is wrong because the Pivot transformation rotates rows to columns.

Option D is wrong because the Window transformation calculates aggregated values over a range of rows.

Practice this question →

221

MCQeasy

You are monitoring an Azure Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL Database. The pipeline fails intermittently with the error: 'Operation on target SQL table failed: String or binary data would be truncated.' Which action should you take to resolve this issue?

A.Increase the length of the destination columns in the SQL table to accommodate the source data.

B.Set 'enable identity insert' to true.

C.Use auto-create table option in the copy activity.

D.Enable staging copy to use PolyBase.

AnswerA

Direct fix for truncation error.

Why this answer

Option A is correct because the error indicates that source data length exceeds destination column length. Increasing column size resolves it. Option B is incorrect because the table already exists.

Option C is incorrect because the error is not about connection. Option D is incorrect because the error is not about identity insert.

Practice this question →

222

MCQmedium

You are a data engineer at a manufacturing company. You need to process sensor data from IoT devices that arrive in real time. The data is sent to Azure Event Hubs. You need to aggregate the data over 5-minute windows and store the results in Azure Data Lake Storage Gen2 in Parquet format. The solution should minimize cost and use serverless components. Which solution should you use?

A.Use Azure Stream Analytics to create a query with a tumbling window of 5 minutes, and output the results to Azure Data Lake Storage Gen2 in Parquet format.

B.Use Azure Databricks with Structured Streaming to read from Event Hubs, aggregate with a sliding window, and write to ADLS Gen2 in Parquet.

C.Use Azure Data Factory with a tumbling window trigger to run a pipeline every 5 minutes that copies data from Event Hubs to ADLS Gen2.

D.Use Azure Functions with an Event Hubs trigger to aggregate data in memory and write to ADLS Gen2.

AnswerA

Serverless, real-time, and cost-effective.

Why this answer

Option A is correct because Azure Stream Analytics is a serverless, cost-effective solution for real-time stream processing with windowed aggregations. It can output directly to ADLS Gen2 in Parquet. Option B is wrong because Azure Databricks with Structured Streaming requires a running cluster, which is not serverless and incurs cost.

Option C is wrong because Azure Data Factory is not designed for real-time streaming. Option D is wrong because Azure Functions would require custom code and may not handle large throughput efficiently.

Practice this question →

223

MCQeasy

You are a data engineer at a financial services company. You are developing a data processing pipeline that uses Azure Data Factory to copy transactional data from an Azure SQL Database to Azure Data Lake Storage Gen2. The pipeline runs daily and processes about 10 GB of data. You need to implement error handling for the pipeline. Specifically, if the copy activity fails due to a transient error, the pipeline should retry automatically. If the retry fails, the pipeline should log the error and send an email alert to the operations team. What should you do?

A.Configure the copy activity with retry policy (retry count = 2, retry interval = 30 seconds). Add a failure path to a web activity that calls an Azure Logic App to send an email.

B.Use Azure Functions to implement custom retry logic and send email.

C.Create an Azure Monitor alert for failed pipeline runs and configure an action group to send an email.

D.Set the pipeline retry to 2 and add a storage event trigger on the error file.

AnswerA

Built-in retry and integration with Logic Apps handle the requirement.

Why this answer

Option A is correct because Azure Data Factory allows you to set retry count and retry interval on the activity, and you can add a web activity to send an email via Azure Logic Apps after failure. Option B is wrong because the pipeline-level retry would not allow activity-specific settings. Option C is wrong because Azure Monitor alerts would be reactive, not integrated into the pipeline.

Option D is wrong because Azure Functions would require additional coding and setup.

Practice this question →

224

MCQmedium

You are designing a data processing pipeline that ingests data from a REST API endpoint every hour. The API returns JSON data with a varying schema. You need to store the raw data in Azure Data Lake Storage Gen2 and later process it using Azure Databricks. Which file format should you use for the raw data storage?

A.Parquet

B.CSV

C.JSON

D.Avro

AnswerC

Preserves schema flexibility and is easy to store.

Why this answer

Option A (JSON) is correct because it preserves the original schema and is easy to store without transformation. Option B (Parquet) requires schema definition upfront. Option C (CSV) is not suitable for nested JSON.

Option D (Avro) is good for streaming but less common for raw ingestion.

Practice this question →

225

MCQeasy

Your team is using Azure Data Factory to orchestrate a data pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline runs successfully during testing. However, after moving to production, you notice that the pipeline fails intermittently with connectivity errors. You need to ensure reliable data transfer. What should you implement?

A.Use Azure Integration Runtime (IR)

B.Use PolyBase for data transfer

C.Configure Azure VPN Gateway

D.Use Self-Hosted Integration Runtime

AnswerD

Self-hosted IR acts as a gateway connecting ADF to on-premises resources.

Why this answer

Option B is correct because a self-hosted integration runtime provides a gateway for on-premises connectivity. Option A is wrong because Azure IR cannot access on-premises networks. Option C is wrong because VPN is not a direct solution for data factory connectivity.

Option D is wrong because PolyBase is for loading into Synapse, not for connectivity.

Practice this question →

← PreviousPage 3 of 4 · 297 questions totalNext →

Ready to test yourself?

Try a timed practice session using only Develop data processing questions.

Start 20-question session