Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 151–225

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 3 of 12

151

MCQmedium

You are a data engineer for a financial services company. The company uses Azure Data Lake Storage Gen2 as its data lake. You have a directory structure where each customer has a folder containing transaction files in CSV format. The security team requires that each customer's data be accessible only to that customer's users. You need to implement fine-grained access control using Azure Data Lake Storage Gen2's POSIX-like ACLs. However, you have thousands of customers, and managing ACLs individually is not feasible. What should you do?

A.Create a shared access signature (SAS) token for each customer and distribute it securely

B.Use POSIX ACLs on each customer folder, assigning permissions to individual user identities

C.Use row-level security in Azure Data Lake Storage Gen2

D.Create an Azure AD group for each customer, add users to the group, and assign ACLs to the group on the customer folder

AnswerD

Group-based ACL assignment is scalable and manageable.

Why this answer

Option D is correct because Azure Data Lake Storage Gen2 supports POSIX-like ACLs that can be assigned to Azure AD security groups. By creating one Azure AD group per customer, adding the customer's users to that group, and then assigning the group the appropriate read/execute ACLs on the customer's folder, you achieve scalable, fine-grained access control without managing thousands of individual user ACLs. This approach aligns with the principle of least privilege and simplifies administration.

Exam trap

The trap here is that candidates often confuse row-level security (a SQL-based feature) with file-system access control in Azure Data Lake Storage Gen2, or they mistakenly believe that SAS tokens can provide granular directory-level isolation, when in fact SAS tokens operate at the container or storage account level and cannot enforce per-folder ACLs.

How to eliminate wrong answers

Option A is wrong because shared access signature (SAS) tokens provide delegated access at the storage account or container level, not at the directory or file level, and managing thousands of SAS tokens securely is operationally complex and does not integrate with Azure AD identity-based access control. Option B is wrong because assigning POSIX ACLs to individual user identities for thousands of customers is not feasible due to the Azure AD limit of 32 ACL entries per file or directory and the administrative overhead of managing individual user permissions at scale. Option C is wrong because row-level security is a feature of Azure SQL Database and Azure Synapse Analytics dedicated SQL pools, not of Azure Data Lake Storage Gen2, which uses POSIX ACLs and RBAC for access control.

Full explanation →

152

MCQmedium

You have an Azure Data Lake Storage Gen2 account that stores large volumes of parquet files. A reporting application frequently queries a specific subset of data filtered by a 'region' column. To minimize query latency and cost, which optimization should you implement?

A.Partition the data by region in the folder structure.

B.Create a clustered index on the region column.

C.Compress the parquet files using gzip.

D.Enable hierarchical namespace on the storage account.

AnswerA

Partition elimination reduces data scanned.

Why this answer

Partitioning the data by region in the folder structure (e.g., /region=NorthAmerica/...) enables Azure Data Lake Storage Gen2 and query engines like Azure Synapse or PolyBase to perform partition pruning. This skips scanning irrelevant files entirely, reducing I/O and query latency while lowering cost by minimizing data processed.

Exam trap

The trap here is that candidates confuse compression (Option C) with partitioning, thinking reducing file size alone minimizes I/O, but without partition pruning the engine still scans all files, negating the benefit.

How to eliminate wrong answers

Option B is wrong because clustered indexes are a SQL Server/PaaS feature and are not supported on Parquet files in Azure Data Lake Storage Gen2; they apply only to relational tables in a database. Option C is wrong because compressing Parquet files with gzip does not reduce the amount of data scanned for a filtered query—Parquet already uses column-level compression (e.g., Snappy, ZSTD), and gzip adds CPU overhead without improving partition pruning. Option D is wrong because enabling hierarchical namespace is a prerequisite for folder-based partitioning, not an optimization itself; it must already be enabled to create the partitioned folder structure.

Full explanation →

153

MCQmedium

You are a data engineer for a financial services company. You have an Azure Data Lake Storage Gen2 account containing historical trade data organized by date in the format 'yyyy/MM/dd'. Each day's data is stored as a collection of Parquet files. The data is used by a team of analysts who run ad-hoc queries using Azure Synapse Serverless SQL. Recently, the analysts have reported that queries scanning multiple months of data are slow. The storage account uses LRS with a general-purpose v2 tier. You have enabled hierarchical namespace. The data is not partitioned in any other way. You need to improve query performance without moving data or changing the storage tier. What should you do?

A.Create external tables with partition definition using the directory structure and ensure queries filter on the date column.

B.Increase the query timeout setting in Azure Synapse Studio.

C.Redistribute the data using hash distribution on the date column.

D.Increase the data warehouse units (DWU) for the serverless SQL endpoint.

AnswerA

Partition elimination reduces data scanned, improving performance.

Why this answer

Option A is correct because Azure Synapse Serverless SQL can leverage the directory structure of Azure Data Lake Storage Gen2 as virtual partitions. By creating external tables with a partition definition that maps to the 'yyyy/MM/dd' folder hierarchy and ensuring queries filter on the date column, the serverless SQL engine performs partition elimination. This reduces the amount of data scanned, directly addressing the slow performance when querying multiple months of data without moving data or changing the storage tier.

Exam trap

The trap here is that candidates may confuse serverless SQL with dedicated SQL pool concepts, such as hash distribution or DWU scaling, and fail to recognize that partition elimination via external table definitions is the only viable optimization for serverless SQL when data remains in the lake.

How to eliminate wrong answers

Option B is wrong because increasing the query timeout setting in Azure Synapse Studio does not improve query performance; it only allows the query to run longer before failing, which does not address the root cause of slow data scans. Option C is wrong because hash distribution is a concept for dedicated SQL pools (provisioned) in Azure Synapse, not for serverless SQL endpoints; serverless SQL does not support redistributing data with hash distribution, and the data remains in the lake. Option D is wrong because serverless SQL endpoints do not use data warehouse units (DWU); DWU is a scaling metric for dedicated SQL pools, and serverless SQL scales automatically based on the amount of data processed, so increasing DWU is not applicable.

Full explanation →

154

MCQhard

Refer to the exhibit. The pipeline executes a Spark job definition that reads from the input path and writes to the output path. The job fails with an error indicating that the output path already exists. The job is configured with default behavior. What configuration change should you make to allow the job to overwrite the existing output?

A.Modify the Spark job definition to include 'overwrite' save mode.

B.Change the output path in the arguments to a non-existing folder.

C.Set the pipeline property 'outputMode' to 'overwrite' on the SynapseSparkJob activity.

D.Add a Delete activity before the Spark job to delete the output folder if it exists.

AnswerD

Using a Delete activity in the pipeline to remove the existing output folder before running the Spark job ensures the job can write without conflict.

Why this answer

Spark jobs by default fail if the output directory exists. To overwrite, you need to set the Spark configuration 'spark.hadoop.parquet.enable.summary-metadata' is not relevant. The correct approach is to set the Spark configuration 'spark.sql.sources.partitionOverwriteMode' to 'STATIC' or use 'overwrite' save mode in the code.

However, in the exhibit, the job is a Spark job definition that calls a file; the simplest way is to set the argument to use overwrite mode. Option B (Delete activity) is a separate activity. Option C changes the job definition.

Option D (output mode) is not a pipeline property.

Full explanation →

155

Multi-Selectmedium

You are implementing a data lake using Azure Data Lake Storage Gen2. Which THREE actions should you take to secure the data at rest and in transit?

Select 3 answers

A.Enable TLS 1.0 for compatibility with legacy clients

B.Enable Azure Storage Service Encryption (SSE) for data at rest

C.Configure firewall rules to allow only trusted IPs

D.Use Azure RBAC and ACLs to control access to data

E.Require HTTPS for all data transfers

AnswersB, D, E

SSE encrypts data at rest by default.

Why this answer

Azure Storage Service Encryption (SSE) automatically encrypts data at rest using 256-bit AES encryption, which is transparent to applications and meets compliance requirements. This is a fundamental security control for protecting data stored in Azure Data Lake Storage Gen2.

Exam trap

The trap here is that candidates may confuse network security controls (firewalls) or legacy protocol compatibility (TLS 1.0) with actual data encryption mechanisms, leading them to select options that address access or connectivity rather than encryption of data at rest and in transit.

Full explanation →

156

MCQmedium

You are debugging an Azure Data Factory pipeline that copies CSV files from Azure Blob Storage to Azure SQL Database. The pipeline is defined as shown in the exhibit. The first row of each CSV file contains column headers. After a run, you notice that the first data row is missing from the target table. What is the most likely cause?

A.The 'writeBehavior' is set to 'insert', which does not overwrite existing rows

B.The 'skipLineCount' is set to 1, which skips the first data row after the header

C.The 'recursive' setting is true, causing the pipeline to read files in subdirectories and duplicate data

D.The 'enableStaging' is false, which prevents PolyBase from loading data

AnswerB

skipLineCount skips lines after header, removing first data row.

Why this answer

Option B is correct because 'skipLineCount': 1 skips the first line after the header line; since header is already parsed, the first data row is skipped. Option A is wrong because 'recursive': true only affects subdirectories, not line skipping. Option C is wrong because 'writeBehavior': 'insert' is default and does not skip rows.

Option D is wrong because 'enableStaging': false is fine for direct copy.

Full explanation →

157

MCQmedium

You are developing a data processing pipeline for a gaming company that uses Azure Databricks. The pipeline processes game event data from Azure Event Hubs. You need to detect cheating patterns by analyzing events in real time. The solution must be able to handle high throughput and low latency. The output should be written to Azure Cosmos DB for real-time dashboards. Which approach should you use?

A.Use Azure Databricks Structured Streaming to read from Event Hubs, use Spark SQL and machine learning to detect cheating patterns, and write to Cosmos DB using the Azure Cosmos DB Spark connector.

B.Use Azure Functions with Event Hubs trigger to process each event and write to Cosmos DB.

C.Use Azure Data Factory with continuous copy to load data into Cosmos DB and then use Azure Synapse Analytics to detect patterns.

D.Use Azure Stream Analytics to query the stream for cheating patterns and output to Cosmos DB.

AnswerA

Supports high throughput and complex analytics.

Why this answer

Option B is correct because Azure Databricks Structured Streaming can handle high throughput and low latency, and it can write to Cosmos DB using the Azure Cosmos DB Spark connector. Option A is wrong because Azure Stream Analytics outputs to Cosmos DB but may not provide the same flexibility for complex pattern detection. Option C is wrong because Azure Data Factory is not for real-time processing.

Option D is wrong because Azure Functions would be too complex for high throughput.

Full explanation →

158

Multi-Selecteasy

Which TWO Azure services can be used to audit data access and changes in Azure Data Lake Storage Gen2? (Choose two.)

Select 2 answers

A.Microsoft Entra ID sign-in logs.

B.Azure Backup reports.

C.Storage account diagnostic settings.

D.Azure Monitor and Microsoft Sentinel.

E.Azure Policy.

AnswersC, D

Diagnostic settings log read/write operations.

Why this answer

Options B and C are correct. Diagnostic settings in storage accounts capture logs, and Azure Monitor integrates with Microsoft Sentinel for security monitoring. Option A is wrong because Azure Policy is for compliance, not auditing data access.

Option D is wrong because Microsoft Entra ID is for identity, not logging. Option E is wrong because Azure Backup is for data protection, not auditing.

Full explanation →

159

Multi-Selectmedium

You are designing a data processing solution using Azure Databricks. You need to read data from Azure Data Lake Storage Gen2, transform it using Spark SQL, and write to a Delta table. Which TWO configurations are required to ensure optimal performance for large datasets?

Select 2 answers

A.Disable automatic schema detection to reduce overhead.

B.Use Delta Lake's OPTIMIZE command to compact small files.

C.Use Delta Lake Z-order optimization on frequently filtered columns.

D.Cache the entire DataFrame in memory after reading.

E.Enable auto-compaction in Spark configuration.

AnswersB, C

Compacting small files improves read performance.

Why this answer

Options A and D are correct. Option A is correct because optimizing the Delta table with Z-order improves data skipping for columns used in queries. Option D is correct because using Delta Lake's optimize command (bin-packing) compacts small files, improving read performance.

Option B is wrong because disabling automatic schema detection can be useful but is not required for performance. Option C is wrong because caching is optional and may not help for large datasets that don't fit in memory. Option E is wrong because enabling auto compaction helps but is not a requirement; it's a feature.

Full explanation →

160

MCQmedium

A data engineer needs to store semi-structured JSON logs for analysis using Azure Synapse Serverless SQL. Which file format should be used for optimal query performance?

A.Avro

B.Parquet

C.CSV

D.JSON

AnswerB

Columnar, compressed, optimized for query performance.

Why this answer

Parquet is correct because it is a columnar storage format that enables predicate pushdown and compression, significantly reducing the amount of data scanned by Azure Synapse Serverless SQL for analytical queries on semi-structured JSON logs. This format aligns with the engine's design for high-performance read operations on large datasets, unlike row-oriented formats that require full file scans.

Exam trap

The trap here is that candidates often assume semi-structured data must stay in its native JSON format for simplicity, overlooking that columnar formats like Parquet can natively store nested JSON structures via repeated fields and maps, while providing massive performance gains in serverless SQL engines.

How to eliminate wrong answers

Option A is wrong because Avro is a row-oriented format that, while efficient for write-heavy and schema-evolving scenarios, does not support column pruning or predicate pushdown as effectively as Parquet, leading to higher I/O and slower query performance in Synapse Serverless SQL. Option C is wrong because CSV is a plain-text, row-oriented format with no built-in compression or indexing, forcing full table scans and increasing data transfer costs, which degrades query performance. Option D is wrong because storing logs as raw JSON files results in verbose, uncompressed data that lacks schema enforcement and columnar optimization, causing Synapse Serverless SQL to parse entire files without the benefits of predicate pushdown or efficient compression.

Full explanation →

161

MCQeasy

Refer to the exhibit. You are deploying the ARM template to create an Azure Synapse Analytics workspace. The deployment fails because the managed resource group cannot be created. What is the most likely reason?

A.The managedResourceGroupName property is not allowed; Azure Synapse creates the resource group automatically

B.The apiVersion is invalid for this resource type

C.The location is missing the 'properties' section

D.The storage account URL must use the blob endpoint, not DFS

AnswerA

You cannot specify a custom managed resource group name.

Why this answer

Option A is correct because the `managedResourceGroupName` property is not allowed in an ARM template for Azure Synapse Analytics. Azure Synapse automatically creates and manages its own managed resource group (e.g., `managed-rg-<workspace-name>`) to hold provisioned resources like Spark pools and SQL pools. Specifying this property in the template causes a validation error, as the service requires full control over the naming and lifecycle of this group.

Exam trap

The trap here is that candidates often assume they can control the naming of all Azure resources in an ARM template, but Azure Synapse enforces automatic management of its internal resource group, and specifying it triggers a validation error rather than a runtime failure.

How to eliminate wrong answers

Option B is wrong because the `apiVersion` shown in the exhibit (e.g., `2021-06-01-preview` or similar) is valid for the `Microsoft.Synapse/workspaces` resource type; an invalid apiVersion would produce a different error, not a managed resource group creation failure. Option C is wrong because the `location` property is a required top-level property for the resource, and it does not need a nested `properties` section; the error is unrelated to location formatting. Option D is wrong because Azure Synapse supports both blob (`*.blob.core.windows.net`) and DFS (`*.dfs.core.windows.net`) endpoints for the default storage account; using DFS is actually recommended for optimal performance with Synapse pipelines and Spark, and it would not cause a managed resource group creation failure.

Full explanation →

162

MCQeasy

Your company uses Azure Data Lake Storage Gen2 as a data lake. You need to monitor for unauthorized access attempts and ensure compliance with regulatory requirements. You want to be alerted when someone attempts to access data without proper permissions. What should you configure?

A.Assign RBAC roles to limit access and rely on Azure Activity Log.

B.Configure Azure Policy to audit storage account access.

C.Enable diagnostic settings for the storage account, send logs to Log Analytics, and create an alert rule for 'StorageRead' failures.

D.Enable Azure Sentinel and connect storage logs.

AnswerC

Diagnostic logs capture detailed access, alerts can be set on specific conditions.

Why this answer

Option D is correct because diagnostic settings send logs to Log Analytics, and Azure Monitor alerts can be created to detect unauthorized access. Option A is wrong because Azure Policy is for governance, not real-time alerts. Option B is wrong because RBAC is for access control, not monitoring.

Option C is wrong because Azure Sentinel is for SIEM, but simpler solution is Log Analytics alerts.

Full explanation →

163

Multi-Selectmedium

Which TWO actions should you take to reduce costs associated with an Azure Synapse Dedicated SQL Pool that is used for reporting during business hours only?

Select 2 answers

A.Pause the pool during non-business hours.

B.Enable advanced data compression on all tables.

C.Scale down the pool during business hours.

D.Change the distribution of large tables to ROUND_ROBIN.

E.Implement result set caching for frequently run queries.

AnswersA, E

Stops compute billing when not in use.

Why this answer

Option A is correct because pausing a Dedicated SQL Pool stops billing for compute resources (DWU) while retaining storage costs. Since the pool is only needed for reporting during business hours, pausing it during non-business hours directly eliminates compute charges for that period, which is the most significant cost driver.

Exam trap

Microsoft often tests the distinction between compute cost reduction (pausing/scaling) and storage/performance optimizations (compression, distribution, caching), leading candidates to confuse storage-saving actions with compute-saving actions.

Full explanation →

164

Multi-Selecthard

Which TWO Azure services can be used to monitor data pipeline runs and set up alerts for failures in Azure Data Factory?

Select 2 answers

A.Azure Data Factory monitoring views

B.Azure Log Analytics

C.Azure Sentinel

D.Azure Monitor

E.Azure Automation

AnswersA, D

Azure Data Factory provides built-in monitoring views and can integrate with Azure Monitor for alerts.

Why this answer

Options A and D are correct. Azure Monitor is the primary service for monitoring and alerts; Azure Data Factory metrics and logs are sent to Azure Monitor. Option B is wrong because Azure Sentinel is for security, not pipeline monitoring.

Option C is wrong because Log Analytics is part of Azure Monitor but not the service name for alerting. Option E is wrong because Azure Automation is for automation, not monitoring.

Full explanation →

165

MCQeasy

You need to process streaming data from Azure Event Hubs and store the results in Azure Cosmos DB for a real-time dashboard. The solution must handle duplicate events and ensure exactly-once processing. Which Azure service should you use?

A.Azure Data Factory

B.Azure Functions with Event Hubs trigger

C.Azure Stream Analytics

D.Azure Databricks Structured Streaming

AnswerC

Supports exactly-once semantics with Event Hubs.

Why this answer

Option C (Azure Stream Analytics) is correct because it provides exactly-once processing when configured with Event Hubs and Cosmos DB output. Option A (Azure Data Factory) is batch-oriented. Option B (Azure Functions) may have at-least-once guarantees.

Option D (Azure Databricks) can achieve exactly-once but requires more configuration.

Full explanation →

166

MCQhard

A data engineer is monitoring Azure Data Lake Storage Gen2 costs and notices high transaction costs for a specific container. The container stores Parquet files used by Azure Databricks for read-heavy analytics. The files are accessed frequently by multiple jobs. What is the most cost-effective way to reduce transaction costs?

A.Move the data to Azure Blob Storage cool tier.

B.Increase the Parquet file size to maximize block size.

C.Convert the container to Azure Files.

D.Enable Azure CDN to cache the files.

AnswerD

Azure CDN caches data at edge locations, reducing the number of direct read transactions to the storage account.

Why this answer

Option D is correct because enabling Azure CDN caches the frequently accessed Parquet files at edge locations, reducing the number of direct read requests to Azure Data Lake Storage Gen2. This lowers transaction costs (both read and list operations) while maintaining low-latency access for read-heavy analytics workloads. The CDN serves cached content, so the storage account incurs fewer billable transactions.

Exam trap

The trap here is that candidates often assume increasing file size (Option B) reduces costs because fewer files mean fewer transactions, but they overlook that each read of a large file still incurs a single transaction per API call, and transaction costs are per operation, not per file size.

How to eliminate wrong answers

Option A is wrong because moving data to Azure Blob Storage cool tier reduces storage costs but does not reduce transaction costs; in fact, cool tier has higher per-transaction charges, which would increase costs for read-heavy workloads. Option B is wrong because increasing Parquet file size to maximize block size does not reduce transaction costs; Azure Data Lake Storage Gen2 uses hierarchical namespace and transactions are counted per API call (e.g., per file read), not per block size, so larger files reduce the number of files but each read still incurs a transaction. Option C is wrong because converting the container to Azure Files introduces SMB protocol overhead and is designed for file shares, not optimized for read-heavy analytics with Parquet files; it would increase latency and complexity without reducing transaction costs.

Full explanation →

167

Multi-Selecthard

Which THREE components are part of a defense-in-depth strategy for data security in Azure?

Select 3 answers

A.Azure Policy to enforce tagging

B.Network security groups (NSGs) on subnets

C.Data classification and labeling

D.Encryption at rest for storage accounts

E.Dynamic data masking for all databases

AnswersB, C, D

NSGs provide network-level security by filtering traffic.

Why this answer

Network security groups (NSGs) are a fundamental component of a defense-in-depth strategy because they provide network-layer segmentation and filtering. By applying NSGs to subnets, you can control inbound and outbound traffic based on source/destination IP addresses, ports, and protocols, creating a perimeter defense that limits lateral movement in case of a breach.

Exam trap

The trap here is that candidates often confuse governance controls (like Azure Policy tagging) with actual security controls, or they assume dynamic data masking is a core defense layer when it is merely a data obfuscation feature that does not prevent unauthorized access or encryption.

Full explanation →

168

MCQeasy

A data engineer monitors an Azure Stream Analytics job that processes real-time data. The job is falling behind, and the SU utilization is at 100%. Which action should be taken to improve performance?

A.Increase the number of Streaming Units (SU).

B.Reduce the number of Streaming Units.

C.Change the query compatibility level to 1.0.

D.Deploy a second Stream Analytics job and split the input.

AnswerA

More SU provides more processing power.

Why this answer

When SU utilization reaches 100%, the job is fully saturated and cannot process incoming data fast enough. Increasing the number of Streaming Units (SU) allocates more compute resources (CPU and memory) to the job, allowing it to handle higher throughput and reduce backlog. This is the direct and recommended action for resolving performance bottlenecks caused by insufficient SU capacity.

Exam trap

The trap here is that candidates may think reducing SU or splitting the job is a valid optimization, but the correct response is to increase SU when utilization is at 100%, as this directly addresses the resource bottleneck.

How to eliminate wrong answers

Option B is wrong because reducing the number of Streaming Units would further starve the job of resources, worsening the backlog and increasing latency. Option C is wrong because changing the query compatibility level to 1.0 does not affect resource allocation or throughput; it only alters query language features and behavior, which cannot resolve a 100% SU utilization issue. Option D is wrong because deploying a second Stream Analytics job and splitting the input does not address the root cause of resource saturation; it adds complexity and may cause ordering or partitioning issues without guaranteeing improved performance, and the original job would still be overloaded.

Full explanation →

169

MCQeasy

You are monitoring an Azure Data Factory pipeline that runs hourly. You notice that the pipeline occasionally fails due to transient errors. Which monitoring solution should you use to get alerts on failures and analyze trends over time?

A.Azure Event Grid subscription for pipeline failures

B.Azure Monitor with Log Analytics workspace

C.Azure Dashboard pinned with pipeline metrics

D.Azure Data Factory Monitor in the Azure portal

AnswerB

Provides alerting and long-term trend analysis via KQL queries.

Why this answer

Option B is correct because Azure Monitor with alerts and Log Analytics provides historical analysis and alerting. Option A (Data Factory Monitor) is for real-time monitoring but lacks long-term trend analysis. Option C (Azure Dashboard) is a visualization tool.

Option D (Event Grid) is for event-driven notifications, not analysis.

Full explanation →

170

Multi-Selecthard

You are designing a data processing solution for a retail company that uses Azure Databricks. The solution needs to process streaming sales data from Event Hubs and batch data from Azure Data Lake Storage Gen2. You need to ensure that the solution can handle late-arriving data and maintain exactly-once semantics. Which TWO technologies should you use?

Select 2 answers

A.Delta Lake

B.Azure Databricks Structured Streaming

C.PolyBase

D.Azure Stream Analytics

E.Azure Data Factory

AnswersA, B

Provides ACID transactions and supports exactly-once semantics.

Why this answer

Correct answers: A and D. Delta Lake provides ACID transactions and schema enforcement, enabling exactly-once semantics. Azure Databricks Structured Streaming can handle late-arriving data using watermarking and event-time processing.

Option B is wrong because Azure Stream Analytics does not integrate natively with Delta Lake for exactly-once semantics. Option C is wrong because Azure Data Factory is not suitable for streaming processing. Option E is wrong because PolyBase is for batch loading, not streaming.

Full explanation →

171

Multi-Selectmedium

Which TWO actions should you take to secure data at rest in Azure Synapse Analytics dedicated SQL pools?

Select 2 answers

A.Use Always Encrypted with secure enclaves for specific columns.

B.Implement column-level security to filter sensitive columns.

C.Enable Transparent Data Encryption (TDE) on the SQL pool.

D.Configure Dynamic Data Masking on tables containing sensitive data.

E.Assign Azure RBAC roles to restrict access to the storage account.

AnswersA, C

Always Encrypts protects data at rest and in use.

Why this answer

Option A is correct: Transparent Data Encryption (TDE) encrypts data at rest without application changes. Option D is correct: Always Encrypted protects sensitive columns with client-side encryption. Option B (Dynamic Data Masking) is for masking data in query results, not at rest.

Option C (Azure RBAC) is for access control, not encryption. Option E (Column-level security) is for access control, not encryption.

Full explanation →

172

MCQhard

A company uses Azure Synapse Analytics serverless SQL pool to query data in Azure Data Lake Storage Gen2. They notice that queries are slow and want to improve performance by reducing the amount of data read. What is the most effective strategy?

A.Partition the data by a frequently filtered column and use file elimination in queries.

B.Increase the number of compute nodes in the serverless pool.

C.Use OPENROWSET with CSV format instead of Parquet.

D.Create external tables using CETAS and query them.

AnswerA

Partitioning allows the serverless pool to skip irrelevant partitions, reducing data scanned.

Why this answer

Option A is correct because partitioning the data in the lake and using partition elimination reduces data read. Option B is wrong because OPENROWSET with CSV reads all files; it does not reduce data read. Option C is wrong because CETAS is for creating external tables, not for improving query performance directly.

Option D is wrong because increasing the number of compute nodes is not possible in serverless; it's auto-scaling.

Full explanation →

173

Multi-Selecteasy

Which TWO Azure features can be used to encrypt data at rest in Azure Blob Storage? (Choose two.)

Select 2 answers

A.Azure Disk Encryption

B.Azure Information Protection

C.Customer-managed keys in Azure Key Vault

D.Storage Service Encryption (SSE)

E.Transport Layer Security (TLS)

AnswersC, D

CMK allows customers to control encryption keys for SSE.

Why this answer

Options A and C are correct. A: Storage Service Encryption (SSE) encrypts data at rest by default. C: Customer-managed keys in Azure Key Vault provide additional control over encryption keys.

Option B is wrong because Azure Disk Encryption is for VMs. Option D is wrong because TLS is for data in transit. Option E is wrong because Azure Information Protection is for classification, not encryption at rest.

Full explanation →

174

Multi-Selecthard

You are optimizing the performance of an Azure Synapse Analytics dedicated SQL pool. Which THREE of the following actions will most likely improve query performance?

Select 3 answers

A.Create materialized views for frequently used aggregations

B.Use a smaller distribution column to improve data distribution

C.Partition large fact tables on a date column

D.Enable result-set caching for repetitive queries

E.Convert all tables to heap tables to avoid index maintenance

AnswersA, C, D

Materialized views store precomputed results, speeding up queries.

Why this answer

Option A is correct because partitioning large tables can reduce data scanned. Option B is correct because materialized views precompute aggregations. Option D is correct because result-set caching speeds up repeated queries.

Option C is wrong because smaller distribution columns can cause data skew. Option E is wrong because converting to heap table is for staging data, not performance optimization.

Full explanation →

175

MCQmedium

You are working with Azure Synapse Analytics serverless SQL pool. You need to query a set of Parquet files located in ADLS Gen2. The files have nested columns (structs and arrays). Which function should you use to flatten the nested data?

A.OPENJSON

B.PIVOT

C.UNNEST

D.CROSS APPLY

AnswerA

OPENJSON can parse nested JSON structures and flatten them into rows.

Why this answer

OPENJSON is the correct function because it parses JSON text and returns objects and properties from JSON input as rows and columns. In Azure Synapse serverless SQL pool, when Parquet files contain nested columns (structs and arrays), they are exposed as JSON strings, and OPENJSON can flatten these nested structures into a relational format. This allows you to query complex nested data directly without needing to pre-process the files.

Exam trap

The trap here is that candidates confuse CROSS APPLY with a flattening function, but CROSS APPLY only invokes a table-valued function (like OPENJSON) and does not flatten data by itself.

How to eliminate wrong answers

Option B (PIVOT) is wrong because PIVOT rotates table-valued expressions by turning unique values from one column into multiple columns in the output; it does not flatten nested data. Option C (UNNEST) is wrong because UNNEST is a PostgreSQL function for expanding arrays into rows; it is not supported in Azure Synapse serverless SQL pool. Option D (CROSS APPLY) is wrong because CROSS APPLY joins a table with a table-valued function, but it does not inherently flatten nested columns; it would require an additional function like OPENJSON to parse the nested data first.

Full explanation →

176

MCQeasy

You need to perform incremental data loading from Azure SQL Database to Azure Data Lake Storage Gen2. You want to minimize cost and complexity. Which Azure Data Factory feature should you use?

A.Use a Lookup activity to get the maximum timestamp from the sink and filter the source

B.Use the 'Incremental copy' capability with change tracking enabled on the source

C.Use a Stored Procedure activity to delete and reinsert data

D.Use a Mapping Data Flow to compare source and sink

AnswerB

This is the simplest and most cost-effective method.

Why this answer

Option B is correct because Azure Data Factory's 'Incremental copy' capability with change tracking on Azure SQL Database automatically identifies and transfers only the changed rows since the last run, using the built-in change tracking mechanism. This minimizes cost and complexity by avoiding custom logic for watermark columns or full reloads, as it handles the delta extraction natively.

Exam trap

The trap here is that candidates often confuse the 'Incremental copy' capability with manual watermark-based approaches (Option A) or assume that a Mapping Data Flow (Option D) is the only way to compare datasets, overlooking the native, cost-optimized change tracking integration.

How to eliminate wrong answers

Option A is wrong because using a Lookup activity to get the maximum timestamp from the sink and filter the source requires manual implementation of a watermark column, which adds complexity and does not leverage Azure SQL Database's native change tracking, potentially missing deletes or updates. Option C is wrong because using a Stored Procedure activity to delete and reinsert data performs a full reload of the target, which is costly and inefficient for incremental loading, and does not minimize cost or complexity. Option D is wrong because using a Mapping Data Flow to compare source and sink requires a full scan of both datasets to identify differences, which is resource-intensive and expensive, defeating the goal of minimizing cost and complexity.

Full explanation →

177

MCQeasy

Your company uses Azure Cosmos DB for NoSQL to store user profiles. The application frequently reads profiles by user ID (the partition key). Occasionally, the application needs to query by email address, which is not part of the partition key. What should you do to optimize the occasional queries by email?

A.Create a secondary (composite) index on the email field.

B.Change the partition key to the email field.

C.Denormalize the data by storing a copy of the email in the partition key.

D.Use the Azure Cosmos DB change feed to maintain a separate container keyed by email.

AnswerA

A secondary index allows efficient queries on non-partition key fields.

Why this answer

Creating a secondary index on the email field allows Azure Cosmos DB for NoSQL to efficiently serve queries filtering by email without scanning all partitions. Since email is not the partition key, a secondary index (specifically a composite index if needed for multi-field queries, or a single-field index) enables index-based lookup across all physical partitions, optimizing the occasional query without redesigning the data model.

Exam trap

The trap here is that candidates often assume a secondary index is unnecessary or that changing the partition key is the only way to optimize non-key queries, but Azure Cosmos DB supports secondary indexes for non-partition key fields, and altering the partition key would disrupt the primary access pattern.

How to eliminate wrong answers

Option B is wrong because changing the partition key to email would break the primary access pattern (reads by user ID), causing cross-partition queries for the frequent user ID lookups and likely exceeding request unit (RU) costs. Option C is wrong because denormalizing by storing a copy of the email in the partition key does not change the partition key itself; the partition key remains user ID, so queries by email would still require a cross-partition scan unless a secondary index is used. Option D is wrong because using the change feed to maintain a separate container keyed by email introduces operational complexity and eventual consistency, and is overkill for occasional queries; a secondary index is simpler and directly supported.

Full explanation →

178

MCQmedium

Match each Azure data storage service to its primary use case.

A.Azure Blob Storage

B.Azure Cosmos DB

C.Azure Data Lake Storage Gen2

D.Azure SQL Database

Why this answer

Azure Blob Storage is for unstructured object storage. Azure Cosmos DB is for globally distributed, multi-model NoSQL data. Azure Data Lake Storage Gen2 is for big data analytics.

Azure SQL Database is for relational OLTP.

Exam trap

Candidates might confuse Azure Data Lake Storage Gen2 with Blob Storage, but ADLS Gen2 has hierarchical namespace for analytics.

Full explanation →

179

MCQeasy

You need to ensure that an Azure Data Factory pipeline retries a failed activity up to three times with a 5-minute delay between retries. How should you configure the activity?

A.Configure the Retry policy on the pipeline activity as 'Exponential' with count 3

B.Set retry to 3 and retryIntervalInSeconds to 300 in the activity policy

C.Set the activity timeout to 15 minutes and enable retry

D.Set maxRetries to 3 and delay to 5 minutes in the pipeline JSON

AnswerB

This configures 3 retries with 300 seconds (5 minutes) interval.

Why this answer

The retry property in activity policy sets the number of retries, and retryIntervalInSeconds sets the delay. Option B is wrong because there is no 'maxRetries' property; it's 'retry'. Option C is wrong because 'retryPolicy' is not a property; retry is at the activity level.

Option D is wrong because timeout is separate from retry.

Full explanation →

180

MCQmedium

You are optimizing a Slowly Changing Dimension (SCD) Type 2 load in Azure Synapse Analytics. The dimension table has 10 million rows. Which table distribution strategy minimizes data movement during the merge operation?

A.Round-robin distribution

B.Hash distribution on the surrogate key

C.Replicated distribution

D.Hash distribution on the business key

AnswerD

Hash on business key ensures same key values land on same distribution, minimizing data movement.

Why this answer

Option D is correct because Hash distribution on the business key (surrogate key source) ensures that matching rows are co-located, reducing data movement during MERGE. Option A is wrong because Round-robin distributes randomly, causing data movement. Option B is wrong because Replicated works for small tables, not 10 million rows.

Option C is wrong because Hash on an unrelated column does not help.

Full explanation →

181

Multi-Selecthard

Your organization uses Azure Data Lake Storage Gen2 with hierarchical namespace enabled. You need to implement a monitoring strategy to detect and alert on unusual access patterns that could indicate a security breach. Which THREE services or features should you use? (Choose three.)

Select 3 answers

A.Enable Microsoft Defender for Storage to get security alerts about unusual access patterns.

B.Apply Azure Policy to enforce encryption and access policies.

C.Ingest the logs into Microsoft Sentinel and create analytics rules for anomalous patterns.

D.Enable diagnostic settings on the storage account to collect read, write, and delete logs.

E.Use Azure Monitor Metrics to track storage account transactions and latency.

AnswersA, C, D

Correct: Defender for Storage provides built-in threat detection for Azure Storage.

Why this answer

Options A, B, and D are correct. A: Diagnostic settings stream logs to Log Analytics. B: Microsoft Sentinel provides advanced security analytics.

D: Microsoft Defender for Storage detects anomalies. Option C is wrong because Azure Monitor Metrics do not include access logs. Option E is wrong because Azure Policy is for governance, not monitoring.

Full explanation →

182

Multi-Selectmedium

Which TWO actions should you take to ensure that only authorized users can access sensitive data in an Azure Synapse Analytics dedicated SQL pool?

Select 2 answers

A.Configure Azure Active Directory authentication

B.Enable dynamic data masking on all columns

C.Implement row-level security

D.Implement column-level security

E.Enable transparent data encryption

AnswersC, D

Row-level security filters rows based on user identity to prevent unauthorized access.

Why this answer

Row-level security (RLS) and column-level security are the two correct actions because they directly restrict data access at the row and column granularity within a dedicated SQL pool. RLS uses security predicates to filter which rows a user can query, while column-level security denies access to specific columns for unauthorized principals. Both are native features of Azure Synapse dedicated SQL pools that enforce authorization on the data plane.

Exam trap

The trap here is confusing data protection features (masking, encryption) with access control features (RLS, column-level security), leading candidates to select dynamic data masking or TDE instead of the correct granular authorization mechanisms.

Full explanation →

183

MCQeasy

You use Azure Data Lake Storage Gen2 with a hierarchical namespace. You need to delegate permissions to a group of data scientists so they can create folders and upload files only within a specific directory path. What is the best way to achieve this?

A.Use a stored access policy to grant permissions to the directory.

B.Set ACL entries on the specific directory path granting read, write, and execute permissions to the users.

C.Generate a shared access signature (SAS) with permissions scoped to the specific directory.

D.Assign the Storage Blob Data Contributor role to the users at the storage account level.

AnswerB

Correct: ACLs allow fine-grained permissions on directories and files.

Why this answer

Option B is correct because Azure Data Lake Storage Gen2 supports POSIX-like ACLs that can be set on directory paths to grant granular permissions. Option A is wrong because RBAC roles are scoped to the entire storage account or container, not subdirectories. Option C is wrong because SAS tokens are scoped to the entire storage account or container.

Option D is wrong because access policies are for shared access signatures, not granular directory permissions.

Full explanation →

184

MCQmedium

You are designing a data processing pipeline in Azure Synapse Analytics. The pipeline must ingest streaming data from Azure Event Hubs, perform real-time aggregations, and store the results in a dedicated SQL pool. Which component should you use to perform the real-time transformations?

A.Azure Databricks with Structured Streaming

B.Azure Data Factory

C.Azure Synapse Pipelines

D.Azure Stream Analytics

AnswerD

Azure Stream Analytics is optimized for real-time stream processing and can output to Synapse dedicated SQL pool.

Why this answer

Azure Stream Analytics is the correct choice because it is a fully managed, real-time analytics service designed specifically for processing streaming data from sources like Azure Event Hubs. It supports SQL-based query language for performing aggregations, windowing functions, and temporal joins, and can directly output results to a dedicated SQL pool in Azure Synapse Analytics. This makes it the optimal component for ingesting streaming data, performing real-time transformations, and storing aggregated results in a Synapse dedicated SQL pool.

Exam trap

Microsoft often tests the distinction between batch and real-time processing services, and the trap here is that candidates may confuse Azure Synapse Pipelines or Azure Data Factory as capable of real-time streaming, when in fact they are batch-oriented orchestration tools.

How to eliminate wrong answers

Option A is wrong because Azure Databricks with Structured Streaming, while capable of real-time processing, introduces additional complexity and cost, and is not the native or most straightforward choice for directly outputting to a dedicated SQL pool in Synapse; it typically requires a separate write path or connector. Option B is wrong because Azure Data Factory is an orchestration and ETL service for batch data movement and transformation, not designed for real-time streaming ingestion or continuous aggregation. Option C is wrong because Azure Synapse Pipelines are built on Azure Data Factory and share the same batch-oriented architecture, lacking native support for real-time stream processing and windowed aggregations.

Full explanation →

185

Multi-Selectmedium

Which TWO security features should be implemented to protect data in Azure Data Lake Storage Gen2?

Select 2 answers

A.Role-based access control (RBAC)

B.Azure Policy

C.Azure Active Directory (Microsoft Entra ID) integration

D.Firewall and virtual network rules

E.Azure Monitor alerts

AnswersA, D

RBAC controls access to storage account resources at the management plane and data plane for certain operations.

Why this answer

Option A and D are correct: RBAC provides coarse-grained access control, and firewall rules restrict network access. Option B is wrong because Azure AD (Microsoft Entra ID) is used for authentication, but it's not a security feature specific to data protection. Option C is wrong because Azure Policy is for governance, not direct data protection.

Option E is wrong because Azure Monitor is for monitoring, not security.

Full explanation →

186

MCQhard

You are examining a T-SQL script that creates an external table in Azure Synapse serverless SQL pool. The query SELECT * FROM dbo.Sales returns zero rows, but the folder /year=2024/ in ADLS Gen2 contains Parquet files. What is the most likely cause?

A.The credential used to access ADLS Gen2 does not have sufficient permissions.

B.The serverless SQL pool does not support reading Parquet files.

C.The external table definition is missing the SCHEMA_NAME parameter.

D.The DATA_COMPRESSION setting is incompatible with Parquet files.

AnswerA

Insufficient permissions (e.g., missing Storage Blob Data Reader role) cause zero rows.

Why this answer

Option A is correct because the most common reason for SELECT * FROM an external table returning zero rows despite data existing in the underlying ADLS Gen2 folder is that the serverless SQL pool lacks the necessary permissions to read the Parquet files. The credential used in the external data source must have at least 'Storage Blob Data Reader' role on the storage account or the container, and the identity (e.g., SAS token, service principal, or managed identity) must be correctly configured. Without this, the query executes successfully but returns no rows because the pool cannot access the data.

Exam trap

The trap here is that candidates assume a missing or misconfigured schema parameter (like SCHEMA_NAME) would cause zero rows, but in reality, permission issues are the primary cause of empty results when the data path is correct and the file format is supported.

How to eliminate wrong answers

Option B is wrong because Azure Synapse serverless SQL pool fully supports reading Parquet files, including partitioned data like /year=2024/. Option C is wrong because the SCHEMA_NAME parameter is optional in CREATE EXTERNAL TABLE and is used for schema binding, not for data access or row retrieval. Option D is wrong because DATA_COMPRESSION is not applicable to Parquet files; Parquet has its own internal compression (e.g., Snappy, gzip) and the setting is ignored or causes an error, not silent zero rows.

Full explanation →

187

MCQhard

You are a data engineer for a global e-commerce company. You need to design a data processing solution using Azure Databricks that processes real-time clickstream data from Azure Event Hubs. The solution must join the streaming data with a slowly changing dimension (SCD) table that stores product details. The SCD table is stored in Azure Data Lake Storage Gen2 as Delta format and is updated every few hours. The joined results must be written to a Delta table for near-real-time dashboards. The key requirement is to ensure that the join always uses the latest version of the SCD data without reprocessing the entire stream. The solution must minimize latency and cost. Which approach should you recommend?

A.Use Spark Structured Streaming with foreachBatch and read the SCD table inside the foreachBatch function.

B.Use Spark Structured Streaming with a static DataFrame for the SCD table and refresh it periodically using a trigger that reads the latest snapshot.

C.Use Spark Structured Streaming with a batch read of the SCD table in each micro-batch using spark.read.

D.Use Spark Structured Streaming with a streaming join on the SCD table by converting it to a stream using readStream.

AnswerB

Periodic refresh of a static DataFrame minimizes overhead and ensures latest data.

Why this answer

Option B is correct because it uses a static DataFrame for the SCD table and refreshes it periodically using a trigger (e.g., a time-based or file-based trigger). This approach ensures that the join always uses the latest version of the SCD data without reprocessing the entire stream, as the static DataFrame is re-read only when the SCD is updated. It minimizes latency and cost by avoiding the overhead of reading the SCD in every micro-batch (as in Option A) or using a streaming join (as in Option D), which is not suitable for batch-updated SCD tables.

Exam trap

The trap here is that candidates often assume that reading the SCD table in every micro-batch (Option A or C) is the simplest way to get the latest data, but they overlook the significant performance and cost penalties of full table scans in each micro-batch, especially for large SCD tables.

How to eliminate wrong answers

Option A is wrong because foreachBatch with a read of the SCD table inside the function would read the SCD in every micro-batch, causing high latency and cost due to repeated full scans of the Delta table, and it does not efficiently handle the periodic updates of the SCD. Option C is wrong because reading the SCD table in each micro-batch using spark.read would also perform a full scan of the Delta table every micro-batch, leading to excessive I/O and cost, and it does not leverage the periodic refresh pattern. Option D is wrong because converting the SCD table to a stream using readStream is inappropriate for a slowly changing dimension that is updated every few hours; streaming the SCD would require it to be a continuous stream of changes (e.g., from a change data capture feed), which is not the case here, and it would add unnecessary complexity and cost.

Full explanation →

188

MCQhard

A financial services company needs to store transaction data for audit purposes. The data must be immutable and cannot be modified or deleted for 7 years. Which Azure storage feature should be used?

A.Azure Blob Storage immutability policy (time-based retention).

B.Azure Blob Storage soft delete.

C.Azure Blob Storage versioning.

D.Azure Files share snapshots.

AnswerA

Immutability policies enforce WORM (Write Once, Read Many) for a specified duration.

Why this answer

Azure Blob Storage immutability policy with time-based retention ensures that blobs cannot be modified or deleted for a specified retention period (e.g., 7 years). This meets the audit requirement for immutable storage by locking the data at the storage level, preventing any writes or deletes until the retention interval expires. The policy is enforced at the container level and applies to all blobs within, making it the correct choice for regulatory compliance.

Exam trap

The trap here is that candidates confuse soft delete or versioning with immutability, not realizing that only a locked time-based retention policy provides the strict write-once, read-many guarantee required for audit data that cannot be modified or deleted for a fixed duration.

How to eliminate wrong answers

Option B is wrong because soft delete only protects against accidental deletion by retaining deleted blobs for a configurable period, but it does not prevent modification or provide true immutability; data can still be overwritten. Option C is wrong because versioning preserves previous versions of a blob when overwritten or deleted, but it does not block writes or deletes—new versions can be created, and the current version can be modified, violating immutability. Option D is wrong because Azure Files share snapshots are point-in-time read-only copies of a file share, but they do not enforce a write-once-read-many (WORM) state on the live share; the original files can still be modified or deleted.

Full explanation →

189

MCQeasy

You are designing a batch processing solution for a data lake. Source files arrive daily in Parquet format in Azure Data Lake Storage Gen2. The data must be cleaned, aggregated, and loaded into an Azure Synapse SQL pool. The solution should minimize compute costs and management overhead. Which technology should you use for the transformation?

A.Azure HDInsight with Spark jobs scheduled in Azure Data Factory.

B.Azure Synapse Pipelines with mapping data flows.

C.Azure Data Factory with a custom SSIS package.

D.Azure Databricks with an Auto Loader pipeline.

AnswerB

Mapping data flows in Synapse Pipelines provide serverless, code-free transformation with minimal management.

Why this answer

Option C is correct because Synapse Pipelines (built on ADF) can orchestrate mapping data flows that run on serverless Spark clusters, minimizing management overhead. Option A (HDInsight) requires cluster management. Option B (Databricks) incurs higher costs for simple batch.

Option D (SSIS) is legacy and not cloud-native.

Full explanation →

190

MCQhard

You are a data engineer for a healthcare company that processes patient data. You have an Azure Databricks workspace with a cluster configured for data processing. You need to implement a solution that processes streaming data from Azure Event Hubs, enriches it with reference data stored in Azure Cosmos DB, and writes the output to Delta Lake in Azure Data Lake Storage Gen2. The solution must ensure that the data processing is fault-tolerant and can handle schema evolution. The reference data is updated infrequently. You need to choose an approach that minimizes complexity and cost. What should you do?

A.Use Azure Databricks Auto Loader with Delta Live Tables to ingest streaming data, and use Change Data Capture from Cosmos DB to update the reference data inline.

B.Use Azure Data Factory to copy data from Event Hubs to Azure Data Lake Storage Gen2 in batches, then use Azure Databricks to process and enrich with Cosmos DB.

C.Use Azure Stream Analytics to ingest from Event Hubs, join with Cosmos DB reference data, and output to Azure Data Lake Storage Gen2 in Parquet format.

D.Use Azure Databricks Structured Streaming to read from Event Hubs, use a streaming static join to enrich with reference data from Cosmos DB, and write to Delta Lake. Enable schema evolution on the Delta table.

AnswerD

Simplifies processing and handles schema evolution.

Why this answer

Option B is correct because Auto Loader can incrementally process streaming data from Event Hubs (via Event Hubs connector), and it can automatically handle schema evolution. Broadening the reference data join to use a streaming static join in Structured Streaming is efficient for infrequently updated reference data. Option A is wrong because using Change Data Capture adds complexity.

Option C is wrong because using Azure Stream Analytics would require additional service cost and integration. Option D is wrong because using batch processing for streaming data would increase latency.

Full explanation →

191

MCQhard

You are a data engineer for a large e-commerce company. The company uses Azure Synapse Analytics dedicated SQL pool as its enterprise data warehouse. A new business requirement mandates that the Sales fact table, which contains 2 billion rows, must support real-time analytics with a maximum query latency of 1 second for aggregations on the most recent 24 hours of data. The table is currently hash-distributed on CustomerID and partitioned monthly by SaleDate. The current query performance for recent data is slow due to full partition scans. The data is ingested via Azure Event Hubs and processed by Azure Stream Analytics, which writes to staging tables every minute. You need to redesign the storage to meet the latency requirement while minimizing cost and maintaining data integrity. Which approach should you take?

A.Increase partition granularity to hourly partitions for the Sales table.

B.Change the Sales table to use a clustered index on SaleDate and a round-robin distribution.

C.Create a separate staging table for the last 24 hours of data with round-robin distribution and a clustered columnstore index. After each batch load, merge the staging table into the main partitioned Sales table. Queries for recent data should target the staging table.

D.Provision a second dedicated SQL pool optimized for real-time queries and replicate the recent data there.

AnswerC

The staging table is small, enabling fast queries for recent data; merging maintains the historical archive.

Why this answer

Option C is correct because it isolates the hot (recent 24-hour) data into a separate staging table with a clustered columnstore index, which is optimized for fast aggregations and high compression. This avoids full partition scans on the 2-billion-row main table, and merging into the partitioned table after each batch load maintains data integrity without requiring expensive repartitioning or additional dedicated SQL pools.

Exam trap

The trap here is that candidates assume finer partitioning (Option A) always improves query performance, but they overlook that partition elimination still requires scanning the entire partition, and that a separate hot table with columnstore indexing is more efficient for real-time aggregations on a sliding window of recent data.

How to eliminate wrong answers

Option A is wrong because increasing partition granularity to hourly partitions would increase metadata overhead and partition management complexity, and full partition scans would still occur on the most recent partition, failing to meet the 1-second latency requirement. Option B is wrong because changing to a clustered index on SaleDate with round-robin distribution eliminates data locality for joins on CustomerID and introduces full data movement for aggregations, while round-robin distribution is unsuitable for large fact tables due to lack of partition elimination. Option D is wrong because provisioning a second dedicated SQL pool duplicates storage and compute costs unnecessarily, and replicating data introduces latency and complexity in maintaining consistency, violating the cost-minimization requirement.

Full explanation →

192

MCQhard

A company uses Azure Data Factory to copy sensitive data from on-premises SQL Server to Azure Blob Storage. They must ensure that data is encrypted in transit and at rest. Which combination of features should they use?

A.Use Always Encrypted in SQL Server and customer-managed keys in Blob Storage.

B.Set up a VPN between on-premises and Azure, and use Azure Disk Encryption.

C.Configure the copy activity to use TLS and enable Azure Storage Service Encryption.

D.Use HTTPS for the copy activity and enable Azure Storage Service Encryption.

AnswerC

TLS encrypts data in transit; Storage Service Encryption encrypts at rest automatically.

Why this answer

Option C is correct because Azure Data Factory's copy activity uses TLS (Transport Layer Security) to encrypt data in transit between the on-premises SQL Server and Azure Blob Storage, and Azure Storage Service Encryption (SSE) automatically encrypts data at rest using 256-bit AES encryption. This combination satisfies both encryption requirements without additional complexity.

Exam trap

The trap here is that candidates often confuse HTTPS with TLS, thinking HTTPS is the encryption mechanism for Data Factory copy activities, when in fact TLS is the underlying protocol used by the self-hosted integration runtime for secure data transfer.

How to eliminate wrong answers

Option A is wrong because Always Encrypted in SQL Server encrypts data at the column level within the database, but it does not encrypt data in transit during the copy operation; customer-managed keys in Blob Storage are for at-rest encryption but do not address transit encryption. Option B is wrong because a VPN encrypts the network tunnel between on-premises and Azure, but it does not encrypt data at rest in Blob Storage; Azure Disk Encryption is for IaaS VMs, not PaaS Blob Storage. Option D is wrong because HTTPS is a protocol that encrypts data in transit, but it is not the default or recommended encryption method for Data Factory copy activities; TLS is the standard, and while SSE handles at-rest encryption, the option incorrectly specifies HTTPS instead of TLS.

Full explanation →

193

MCQhard

You are a data engineer for a healthcare company. You have a production Azure Synapse Analytics dedicated SQL pool (DW500c) that hosts patient data. The pool is used for both ETL and reporting. You need to ensure that reporting queries always get resources even during heavy ETL loads. You also need to monitor query performance and set up alerts when certain queries exceed a threshold. You have configured workload management using workload groups and classifiers. However, reporting queries are still waiting for resources when ETL is running. You check the sys.dm_pdw_exec_requests DMV and see that ETL queries are using the largest resource class. You need to modify the configuration to guarantee resources for reporting. What should you do?

A.Create a new workload group for reporting with min_percentage_resource set to 30%

B.Set importance to HIGH for the reporting workload group

C.Increase the DWU setting to DW1000c

D.Change the classifier for reporting queries to use the same workload group as ETL but with a different resource class

AnswerA

Guarantees a minimum resource allocation for reporting.

Why this answer

Workload isolation is achieved by creating separate workload groups with minimum resource allocation. By setting minimum resources for the reporting workload group, it ensures a baseline of resources even during heavy ETL. Option A (increase DWU) is costly and may not isolate.

Option B (importance) helps but without min resources, high concurrency can still starve. Option D (change classifier) does not guarantee resources.

Full explanation →

194

MCQeasy

Your company uses Azure Data Lake Storage Gen2. You need to ensure that data at rest is encrypted using a customer-managed key stored in Azure Key Vault. What should you configure?

A.Use Azure Policy to audit storage accounts without encryption.

B.Enable 'Azure Storage encryption' with customer-managed keys in the storage account's encryption blade.

C.Implement client-side encryption in the application code.

D.Enable 'Infrastructure encryption' for double encryption.

AnswerB

This configures server-side encryption with CMK.

Why this answer

Option A is correct because Azure Storage encryption with customer-managed keys is configured at the storage account level. Option B is wrong because infrastructure encryption is an additional layer. Option C is wrong because client-side encryption is done by the application, not at rest.

Option D is wrong because Azure Policy can enforce encryption but does not configure it.

Full explanation →

195

MCQeasy

You are analyzing the exhibit from an Azure Monitor metric query for a storage account. What is the primary purpose of this query?

A.To calculate the average number of block blobs in the hot tier.

B.To identify the time period with the highest blob count.

C.To measure the total size of all block blobs in the account.

D.To retrieve the average count of block blobs per hour.

AnswerD

Metric BlobCount with aggregation Average and filter on BlobType equals BlockBlob.

Why this answer

The query uses the 'avg' aggregation on the 'BlobCount' metric, which calculates the average number of blobs over the specified time granularity (e.g., per hour). The result shows the average count of block blobs per hour, not the total count or the count in a specific tier. This aligns with option D, as the query is designed to retrieve the average count of block blobs per hour.

Exam trap

The trap here is that candidates often confuse 'avg' with 'sum' or 'max', leading them to incorrectly think the query calculates total blob count or identifies peak periods, rather than recognizing that 'avg' specifically computes the average value over the time granularity.

How to eliminate wrong answers

Option A is wrong because the query does not filter by blob tier (hot, cool, or archive); it retrieves the average count of all block blobs, not just those in the hot tier. Option B is wrong because the query uses the 'avg' aggregation, which returns an average value over the time period, not the maximum or peak blob count; to identify the time period with the highest blob count, you would need to use the 'max' aggregation. Option C is wrong because the query measures 'BlobCount', which is the number of blobs, not their size; to measure total size, you would use the 'BlobCapacity' metric.

Full explanation →

196

MCQhard

You are designing a data storage solution for a global IoT application that ingests millions of events per second. The data is write-heavy with occasional reads for real-time dashboards. Which Azure storage option and configuration would provide the lowest latency writes with high throughput?

A.Azure Cosmos DB with multi-region writes and eventual consistency

B.Azure Cosmos DB with single-region writes and strong consistency

C.Azure Data Lake Storage Gen2 with hierarchical namespace

D.Azure Blob Storage with hot tier and append blobs

AnswerA

Why this answer

Azure Cosmos DB with multi-region writes and eventual consistency provides the lowest latency writes for a global IoT application because it allows each region to accept writes independently without cross-region coordination, and eventual consistency removes the need for quorum confirmations, reducing write latency. This configuration also offers high throughput by distributing write load across multiple regions, making it ideal for write-heavy, high-volume IoT scenarios where occasional reads for dashboards can tolerate stale data.

Exam trap

The trap here is that candidates often assume strong consistency is required for real-time dashboards, but eventual consistency is sufficient for write-heavy IoT scenarios where occasional stale reads are acceptable, and multi-region writes drastically reduce latency compared to single-region writes.

Why the other options are wrong

B

Strong consistency increases write latency and single-region limits throughput.

C

Data Lake Storage is optimized for analytics, not low-latency writes.

D

Blob storage has higher write latency and append blobs are not ideal for high-throughput ingestion.

Full explanation →

197

MCQmedium

Your company uses Azure Data Lake Storage Gen2 for a data lake. You need to implement a folder structure that separates data by sensitivity level. Which access control method should you use?

A.Use a storage account firewall and virtual network service endpoints

B.Use storage account keys for all access

C.Use Azure RBAC at the resource group level and ACLs on directories

D.Use shared access signatures (SAS) for each folder

AnswerC

C is correct because RBAC provides coarse control and ACLs provide fine-grained folder-level permissions.

Why this answer

Option C is correct because Azure Data Lake Storage Gen2 supports both Azure RBAC at the resource group level for coarse-grained control and POSIX-like ACLs on directories for fine-grained, sensitivity-based access. This combination allows you to assign read/write/execute permissions per directory, enabling a folder structure that separates data by sensitivity level without compromising security.

Exam trap

The trap here is that candidates often confuse network-level controls (firewall, VNet) or shared access signatures with the directory-level ACLs required for sensitivity-based folder separation, overlooking that only ACLs provide the POSIX-style granularity needed for hierarchical data lakes.

How to eliminate wrong answers

Option A is wrong because a storage account firewall and virtual network service endpoints control network-level access to the entire storage account, not granular folder-level permissions by sensitivity. Option B is wrong because storage account keys provide full administrative access to the entire account, bypassing any folder-level sensitivity controls and violating the principle of least privilege. Option D is wrong because shared access signatures (SAS) grant time-limited, delegated access to specific containers or blobs, but they cannot enforce POSIX ACLs on directories and are not designed for persistent, sensitivity-based folder structures.

Full explanation →

198

MCQmedium

A company is designing a data storage solution for streaming IoT telemetry data. The data is JSON-formatted, arrives at up to 10,000 events per second, and must be stored for at least 30 days for real-time dashboards and ad-hoc querying. The solution must minimize operational overhead and query latency. Which Azure service should they use?

A.Azure Blob Storage with Azure Data Lake Storage Gen2

B.Azure Data Explorer (ADX)

C.Azure Cosmos DB with analytical store

D.Azure SQL Database with elastic query

AnswerB

ADX is built for high-speed ingestion of streaming data, supports JSON, and provides sub-second query performance for dashboards.

Why this answer

Azure Data Explorer (ADX) is purpose-built for high-velocity telemetry and time-series data, ingesting up to 10,000 events per second with low latency. Its columnar storage and indexing enable sub-second queries on JSON data for real-time dashboards, while the 30-day retention is natively configurable via caching and soft-delete policies. This minimizes operational overhead by eliminating the need for manual partitioning or index tuning.

Exam trap

The trap here is that candidates confuse Azure Data Explorer with Azure Data Lake Storage, assuming that a data lake can serve real-time dashboards, but ADLS Gen2 lacks the indexing and query engine needed for sub-second latency on streaming data.

How to eliminate wrong answers

Option A is wrong because Azure Blob Storage with ADLS Gen2 is optimized for large-scale batch analytics and data lakes, not for real-time, sub-second queries on streaming data; querying JSON blobs directly incurs high latency and requires additional compute (e.g., Azure Synapse or Databricks). Option C is wrong because Azure Cosmos DB with analytical store is designed for transactional workloads with real-time analytics on operational data, but its ingestion throughput for 10,000 events/second of pure telemetry would be costly and over-provisioned, and the analytical store is better suited for hybrid transactional/analytical processing (HTAP) rather than pure streaming telemetry. Option D is wrong because Azure SQL Database with elastic query is a relational OLTP system not optimized for high-velocity JSON ingestion or time-series queries; it would require extensive schema design, indexing, and sharding to handle 10,000 events/second, and query latency would be higher due to row-based storage.

Full explanation →

199

MCQeasy

A retail company uses Azure Synapse Analytics dedicated SQL pool to store sales data. The data is loaded nightly from Azure Data Lake Storage Gen2 using PolyBase. Recently, the load process started failing with the error 'External table 'sales' is not accessible because the location does not exist or is used by another process.' You verify that the storage account, container, and file path are correct. The file is a CSV file named 'sales_20250301.csv' and it exists. Other files in the same container load successfully. What is the most likely cause of the error?

A.The network connectivity between Synapse and the storage account is intermittent.

B.The CSV file has an incorrect number of columns or contains a header row that mismatches the schema.

C.The storage account key used in the external data source has expired.

D.The file is being written to or is locked by another process during the PolyBase read.

AnswerD

The error 'used by another process' indicates a file lock, typically because the file is still being written or another reader has an exclusive lock.

Why this answer

The error 'location does not exist or is used by another process' specifically indicates that the file is locked by another process. In Azure Data Lake Storage Gen2, when a file is being written or modified, it can be locked by the writing process (e.g., a data ingestion pipeline or another ETL job). PolyBase attempts to read the file while it is still being written, causing the error.

Option D correctly identifies this concurrency issue.

Exam trap

The trap here is that candidates often confuse file-locking errors with authentication or schema issues, but the specific wording 'used by another process' directly points to a concurrency/lock conflict rather than connectivity or data format problems.

How to eliminate wrong answers

Option A is wrong because intermittent network connectivity would typically cause timeout or connection reset errors, not a 'location does not exist or is used by another process' error. Option B is wrong because schema mismatches (incorrect column count or header row) would produce data conversion or parsing errors, not a location accessibility error. Option C is wrong because an expired storage account key would result in an authentication failure (e.g., 403 Forbidden), not a location-not-found or file-locked error.

Full explanation →

200

MCQeasy

You need to store JSON files from an external partner in Azure Blob Storage. The files contain sensitive financial data. Which access method provides the highest security while allowing the partner to upload files?

A.Share the storage account access key with the partner

B.Configure a firewall to allow only the partner's IP address

C.Generate a shared access signature (SAS) with write-only permission and expiry time

D.Enable anonymous public access to a container

AnswerC

B is correct because SAS provides scoped, time-limited access.

Why this answer

Option C is correct because a Shared Access Signature (SAS) with write-only permission and an expiry time provides delegated, time-limited access to Azure Blob Storage without exposing the storage account key. This ensures the partner can upload JSON files but cannot read, modify, or list existing blobs, and access automatically revokes after the expiry, meeting the highest security requirement for sensitive financial data.

Exam trap

The trap here is that candidates often confuse IP-based firewalls (Option B) as a security method, but firewalls do not authenticate users or control data-plane permissions, and they can be bypassed if the partner's IP changes or if the partner uses a shared network.

How to eliminate wrong answers

Option A is wrong because sharing the storage account access key grants full administrative control (read, write, delete, list) over all blobs and containers, violating the principle of least privilege and exposing the key to potential compromise. Option B is wrong because a firewall restricting to the partner's IP address does not authenticate the partner's identity or control permissions; it only limits network-level access, and the partner's IP may change, leading to access failures or security gaps. Option D is wrong because enabling anonymous public access allows anyone on the internet to read and list blobs without any authentication, completely exposing sensitive financial data.

Full explanation →

201

MCQhard

You are designing a data lake in Azure Data Lake Storage Gen2 for a large enterprise. You need to ensure that only authorized users can access the data, and you must implement the principle of least privilege. Which security mechanism should you use to grant fine-grained access to specific directories and files without modifying the underlying storage account firewall settings?

A.Azure RBAC roles combined with POSIX-like ACLs

B.Managed identities for Azure resources

C.Storage account firewall rules

D.Shared access signatures (SAS)

AnswerA

RBAC roles grant coarse permissions (e.g., Storage Blob Data Contributor) while ACLs provide fine-grained permissions on directories and files, enabling least privilege.

Why this answer

Option D is correct because Azure RBAC with ACLs allows fine-grained permissions at the directory and file level. Option A is wrong because SAS tokens grant time-limited access but are not fine-grained. Option B is wrong because managed identities provide identity-based access but still require RBAC or ACLs for fine-grained control.

Option C is wrong because firewall rules apply at the account level, not directory/file level.

Full explanation →

202

MCQmedium

You are designing a change data capture (CDC) pipeline to ingest incremental changes from an on-premises SQL Server database into Azure Data Lake Storage Gen2. The pipeline must run every 5 minutes and handle high-volume DML changes. Which Azure service should you use to capture the changes with low latency?

A.Azure Data Share to share the SQL Server data and capture changes.

B.Azure Data Factory with a change data capture (CDC) source in the mapping data flow.

C.Azure Synapse Pipelines with a copy activity that uses a query to capture changes.

D.Azure Databricks with Auto Loader and Delta Live Tables to capture changes.

AnswerB

ADF supports CDC from SQL Server with low latency via mapping data flows.

Why this answer

Azure Data Factory's mapping data flow includes a native CDC source that can connect to SQL Server and capture incremental DML changes (inserts, updates, deletes) with low latency. This approach uses change tracking or change data capture features in SQL Server to identify changes, and the pipeline can run every 5 minutes to meet the high-volume requirement without custom coding.

Exam trap

The trap here is that candidates often confuse Azure Data Factory's copy activity (which requires manual change tracking) with the mapping data flow's native CDC source, leading them to choose option C or D, which are better suited for different ingestion patterns.

How to eliminate wrong answers

Option A is wrong because Azure Data Share is designed for sharing snapshots or incremental data between organizations, not for capturing low-latency CDC from an on-premises SQL Server into Azure Data Lake Storage Gen2. Option C is wrong because Azure Synapse Pipelines' copy activity with a query-based approach typically requires manual watermarking and cannot natively capture DML changes with low latency; it lacks the built-in CDC source that mapping data flow provides. Option D is wrong because Azure Databricks with Auto Loader and Delta Live Tables is optimized for streaming file ingestion (e.g., from cloud storage), not for directly capturing CDC from an on-premises SQL Server database.

Full explanation →

203

MCQmedium

You are designing a streaming data solution for IoT devices that generate 10,000 events per second. The data must be processed with sub-second latency and then stored in Azure Data Lake Storage Gen2 for archival. Which Azure service should you use for the stream processing?

A.Azure HDInsight including Spark Structured Streaming

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Event Hubs

AnswerB

Azure Stream Analytics provides sub-second latency stream processing and native output to Azure Data Lake Storage Gen2.

Why this answer

Azure Stream Analytics is designed for real-time stream processing with sub-second latency and can output to Azure Data Lake Storage Gen2. Option A (HDInsight including Spark) is batch-oriented and has higher latency. Option C (Data Factory) is for orchestration, not streaming.

Option D (Event Hubs) is an ingestion service, not a processing service.

Full explanation →

204

Multi-Selecteasy

You need to design a storage solution for IoT device telemetry data that will be queried by time range. The data is append-only and arrives at high velocity. Which TWO features should you use to optimize query performance and reduce costs?

Select 2 answers

A.Store data in columnar format (e.g., Parquet)

B.Create indexes on all columns

C.Enable row-level security

D.Partition the data by date

E.Enable geo-redundant storage

AnswersA, D

Columnar format reduces I/O and improves compression.

Why this answer

Columnar formats like Parquet store data by column rather than by row, which significantly reduces I/O when querying only a subset of columns (common in time-range queries). This compression and column pruning directly lowers storage costs and speeds up scan-heavy analytical queries on append-only IoT telemetry data.

Exam trap

The trap here is that candidates often confuse indexing (B) with partitioning, but for append-only analytical workloads, indexes add write overhead and cost without benefit, while date partitioning directly enables partition elimination for time-range queries.

Full explanation →

205

MCQmedium

You are designing a data processing solution for an e-commerce company that uses Azure Synapse Analytics. The solution must process clickstream data from a web application. The data arrives in JSON format through Azure Event Hubs. You need to load the data into a dedicated SQL pool every 5 minutes with minimal latency. The data volume is about 100 MB every 5 minutes. You want to use PolyBase for loading. Which approach should you use?

A.Use Azure Stream Analytics to transform the JSON data and output directly to the dedicated SQL pool.

B.Use Azure Data Factory with a Copy activity to copy data from Event Hubs to Azure Data Lake Storage Gen2 as JSON files, then use a PolyBase activity to load from ADLS Gen2 to the dedicated SQL pool.

C.Use Azure Databricks to read from Event Hubs, transform the data, and write to the dedicated SQL pool using JDBC.

D.Use PolyBase directly from Event Hubs to dedicated SQL pool by creating an external data source that points to Event Hubs.

AnswerB

Simplifies staging and leverages PolyBase for efficient loading.

Why this answer

Option A is correct because using Azure Data Factory to copy JSON data to ADLS Gen2 as a staging area, then using PolyBase to load into dedicated SQL pool, is efficient for small frequent loads. Option B is wrong because loading directly from Event Hubs using PolyBase is not supported. Option C is wrong because using Azure Stream Analytics adds unnecessary complexity.

Option D is wrong because using Azure Databricks would require a cluster running continuously, increasing cost.

Full explanation →

206

MCQeasy

You are configuring Azure Synapse Analytics dedicated SQL pool. To optimize query performance for a large fact table that is frequently filtered on date and region columns, which table distribution and indexing strategy should you recommend?

A.Hash distribution on date column with clustered index.

B.Replicated distribution with heap index.

C.Round-robin distribution with clustered index.

D.Hash distribution on region column with clustered columnstore index.

AnswerD

Hash distribution on a high-cardinality column improves joins and filtering; columnstore is efficient for large tables.

Why this answer

Option B is correct because hash distribution on region spreads data evenly if region has high cardinality, and clustered columnstore index is optimal for large fact tables. Option A is wrong because round-robin distribution is for staging tables. Option C is wrong because heap index is for temporary data.

Option D is wrong because replicated distribution is for small dimension tables.

Full explanation →

207

MCQhard

You are optimizing a pipeline in Azure Data Factory that copies data from Azure Blob Storage to Azure Synapse Analytics. The pipeline uses a copy activity with PolyBase. The data is partitioned by date in Blob Storage. You notice that the load is slow. What is the most likely cause?

A.The source files are stored in Azure Blob Storage instead of Data Lake Storage Gen2

B.The source files are in CSV format instead of Parquet

C.The source files are too many and too small (e.g., thousands of 1 MB files)

D.The sink table has a clustered columnstore index

AnswerC

Many small files cause overhead; PolyBase is optimized for fewer, larger files.

Why this answer

PolyBase in Azure Synapse Analytics performs best when reading large, contiguous files. When the source contains thousands of small files (e.g., 1 MB each), PolyBase must initiate a separate read operation for each file, causing excessive overhead from file open/close operations and metadata requests. This dramatically reduces throughput compared to reading fewer, larger files.

Exam trap

The trap here is that candidates often focus on file format (Parquet vs. CSV) or storage type (Blob vs. ADLS Gen2) as the primary performance factor, when in reality the number and size of files is a more common and impactful bottleneck in PolyBase loads.

How to eliminate wrong answers

Option A is wrong because Azure Blob Storage is fully supported as a PolyBase source; Data Lake Storage Gen2 offers hierarchical namespace benefits but does not inherently improve PolyBase load speed. Option B is wrong because while Parquet is more efficient for analytics due to columnar storage and compression, CSV is still a valid PolyBase source and the primary bottleneck here is file count, not format. Option D is wrong because a clustered columnstore index is actually recommended for Synapse Analytics tables to improve query performance and compression; it does not slow down the PolyBase load itself.

Full explanation →

208

MCQhard

You are tuning a dedicated SQL pool in Azure Synapse Analytics. A query that joins two large tables (fact_sales and dim_product) is slow. The fact_sales table is hash-distributed on product_id, and dim_product is replicated. You notice that the query plan shows a shuffle move. What is the most likely cause?

A.The fact_sales table uses clustered columnstore index.

B.The dim_product table is replicated, causing a broadcast join.

C.Statistics are out of date on both tables.

D.The join condition does not include the distribution key for fact_sales.

AnswerD

Joins on non-distribution keys require data movement.

Why this answer

Option D is correct because when the join condition does not include the distribution key (product_id) of the hash-distributed fact_sales table, the SQL engine cannot perform a collocated join. Instead, it must shuffle data across distributions to satisfy the join, which introduces expensive data movement. The shuffle move in the query plan directly indicates this redistribution.

Exam trap

The trap here is that candidates often confuse a shuffle move with a broadcast join or blame indexing, but the root cause is the mismatch between the join key and the distribution key, which forces data movement regardless of other optimizations.

How to eliminate wrong answers

Option A is wrong because a clustered columnstore index is optimized for large fact tables and typically improves query performance; it would not cause a shuffle move. Option B is wrong because a replicated table (dim_product) is designed to avoid shuffles by having a copy on each distribution, enabling a broadcast join without data movement. Option C is wrong because out-of-date statistics can lead to suboptimal plans but do not directly force a shuffle move; the shuffle is a structural requirement based on the join key not matching the distribution key.

Full explanation →

209

Multi-Selecthard

You are optimizing an Azure Synapse Analytics dedicated SQL pool that is experiencing high concurrency and frequent resource class contention. You need to improve query performance and reduce contention without changing the workload. Which two actions should you take? (Choose two.)

Select 2 answers

A.Increase the DWU (Data Warehouse Units) to allocate more resources.

B.Implement workload isolation to separate critical queries into dedicated resource groups.

C.Create materialized views to pre-aggregate data.

D.Use workload classification to assign importance and resource allocation to different queries.

E.Enable result set caching to reduce repeated query execution.

AnswersB, D

Workload isolation prevents resource contention by allocating dedicated resources to specific workloads.

Why this answer

Options B and D are correct. Workload isolation and workload classification help manage resource allocation and reduce contention. Option A is wrong because increasing DWU may help but does not address contention directly.

Option C is wrong because result set caching helps read workloads but not contention. Option E is wrong because materialized views improve performance but do not reduce contention.

Full explanation →

210

MCQmedium

You are configuring security for an Azure Synapse Analytics workspace that uses a serverless SQL pool. The workspace is connected to Azure Data Lake Storage Gen2 via a managed identity. You need to ensure that only the Synapse workspace can access the storage account, and no other Azure service or user can access it directly. The storage account should not be accessible from the public internet. What should you do?

A.Configure the storage account firewall to allow only the Synapse workspace's public IP address.

B.Enable 'Allow trusted Microsoft services to access this storage account' on the firewall.

C.Use Azure RBAC to assign the Storage Blob Data Contributor role to the Synapse workspace managed identity.

D.Configure a private endpoint for the storage account in the same virtual network as the Synapse workspace, and disable public network access.

AnswerD

Private endpoint ensures private connectivity; disabling public access restricts others.

Why this answer

Option A is correct because a private endpoint for the storage account, combined with disabling public network access, ensures only the Synapse workspace can access it via the managed identity. Option B is wrong because firewall rules alone do not restrict access to a specific service. Option C is wrong because allowing trusted Microsoft services is broader than just Synapse.

Option D is wrong because managed identity alone does not restrict network access.

Full explanation →

211

Multi-Selectmedium

Which TWO Azure services can be used to perform real-time data processing on streaming data?

Select 2 answers

A.Azure Synapse Analytics (dedicated SQL pool)

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Logic Apps

E.Azure Databricks

AnswersA, B

Synapse dedicated SQL pool can ingest and query streaming data.

Why this answer

Azure Synapse Analytics (dedicated SQL pool) can ingest and process streaming data using PolyBase or the COPY INTO command to load data from Azure Stream Analytics or Event Hubs into staging tables, then perform real-time analytics via materialized views or incremental statistics. This enables near-real-time processing on high-velocity data by combining batch and streaming ingestion patterns.

Exam trap

The trap here is that candidates often confuse Azure Data Factory or Logic Apps as real-time streaming services because they can handle event-driven triggers, but they lack the continuous, low-latency query engine required for true stream processing, unlike Stream Analytics and Synapse's dedicated SQL pool with streaming ingestion.

Full explanation →

212

Multi-Selecteasy

Which TWO techniques can you use to handle schema drift in Azure Data Factory mapping data flows?

Select 2 answers

A.Enable 'Allow schema drift' in the source transformation

B.Use derived column transformation to handle each new column manually

C.Use column pattern matching to automatically map columns with similar names

D.Use assertion rules to reject rows with unknown columns

E.Use a fixed schema mapping to ignore unknown columns

AnswersA, C

Allows columns to be added without breaking the pipeline.

Why this answer

Options A and D are correct. 'Allow schema drift' allows flexible column mapping based on incoming schema. Column pattern matching can map dynamic columns to a target. Option B is wrong because using a fixed schema mapping is the opposite of handling drift.

Option C is wrong because deriving columns is for static transformations. Option E is wrong because assertion rules are for validation, not adaptation.

Full explanation →

213

MCQhard

Refer to the exhibit. You have an Azure Data Factory pipeline that copies data from a CSV file in Blob Storage to a Synapse dedicated SQL pool table named dbo.Sales. The pipeline fails. The error message indicates that the 'Amount' column in the sink table does not allow NULLs but the source contains NULL values. What is the best way to resolve this issue without losing data?

A.Use a Mapping Data Flow with a Derived Column transformation to replace NULLs with 0

B.Add a filter in the copy activity to exclude rows with NULL Amount

C.Modify the sink table to have a default value for the Amount column

D.Change the sink table column to allow NULLs

AnswerA

Mapping Data Flow allows you to handle NULLs by providing a default value, ensuring data integrity.

Why this answer

Option D is correct: Adding a derived column transformation in a Mapping Data Flow allows you to replace NULLs with a default value. Option A is wrong because it changes the sink schema which may not be allowed. Option B is wrong because filtering out rows with NULLs would lose data.

Option C is wrong because setting the column to allow NULLs might not be permitted and could impact downstream processes.

Full explanation →

214

MCQhard

A company is running a Spark job on Azure Databricks that processes 500 GB of data daily. The job frequently fails with 'OutOfMemoryError' during shuffles. The cluster uses 10 workers of type Standard_DS3_v2 (14 GB memory each). Which configuration change should you make to improve stability without over-provisioning?

A.Set spark.sql.shuffle.partitions to a higher value, e.g., 500.

B.Increase the driver memory to 28 GB.

C.Increase the number of workers to 20.

D.Reduce spark.sql.shuffle.partitions to 100.

AnswerA

Reduces data per partition, easing memory.

Why this answer

The 'OutOfMemoryError' during shuffles indicates that individual partitions are too large for the executor memory. Increasing `spark.sql.shuffle.partitions` to 500 reduces the amount of data per partition, lowering memory pressure during shuffle operations. This directly addresses the error without adding more hardware.

Exam trap

The trap here is that candidates often assume adding more workers (Option C) is the only way to fix memory errors, but the question tests understanding that partition size, not just cluster size, is the root cause of shuffle OOM errors.

How to eliminate wrong answers

Option B is wrong because increasing driver memory does not help with executor-side shuffle memory issues; the driver is not involved in shuffle data processing. Option C is wrong because adding more workers increases parallelism but does not reduce the size of each partition unless the number of partitions is also increased; it would over-provision resources without fixing the root cause. Option D is wrong because reducing `spark.sql.shuffle.partitions` to 100 would make each partition larger, worsening the memory pressure and increasing the likelihood of OutOfMemoryError.

Full explanation →

215

MCQmedium

Your Azure Synapse Analytics dedicated SQL pool is experiencing performance degradation. You notice that some queries are being queued due to resource class conflicts. What should you implement to optimize performance and reduce queuing?

A.Scale the dedicated SQL pool to a higher DWU level

B.Configure workload management with workload groups and classifiers

C.Create materialized views for the most common aggregations

D.Enable result-set caching for frequently run queries

AnswerB

Workload management allows you to assign appropriate resources to queries based on their priority, reducing conflicts.

Why this answer

Option C is correct because workload management in Synapse allows you to assign queries to different workload groups with resource classes, preventing conflicts. Option A is wrong because scaling the pool adds cost and may not resolve the specific resource class conflict. Option B is wrong because result-set caching does not address concurrency issues.

Option D is wrong because materialized views improve query performance but do not resolve queuing.

Full explanation →

216

Multi-Selecthard

You are designing a data processing solution in Azure Synapse Analytics. You need to load data from multiple sources into a dedicated SQL pool. Which THREE of the following are best practices for loading data?

Select 3 answers

A.Use round-robin distribution for staging tables

B.Use a staging table to load data and then insert into the final table

C.Split data into many small files to increase parallelism

D.Use PolyBase or COPY statement for parallel loading

E.Enable clustered columnstore index during load to improve query performance

AnswersA, B, D

Round-robin distributes data evenly for fast load.

Why this answer

Round-robin distribution is recommended for staging tables because it distributes data evenly across all distributions without requiring a hash key, minimizing data movement during the load process. Since staging tables are temporary and used for intermediate storage, the even distribution ensures that the subsequent INSERT...SELECT operation into the final table benefits from maximum parallelism and reduced skew.

Exam trap

The trap here is that candidates often assume more files always increase parallelism, but Azure Synapse Analytics optimizes for fewer, larger files to minimize metadata operations and maximize throughput.

Full explanation →

217

MCQeasy

You have an Azure Data Factory pipeline that copies data from an FTP server to Azure Blob Storage. The pipeline runs successfully most of the time, but occasionally fails with a 'FTP server connection refused' error during peak hours. You need to minimize these failures with minimal cost. What should you do?

A.Add a retry policy to the copy activity with a backoff interval.

B.Set up Azure ExpressRoute to improve network reliability.

C.Migrate the FTP server to SFTP.

D.Increase the parallel copy count in the copy activity.

AnswerA

Retry with backoff handles transient connection failures.

Why this answer

Option D is correct because adding a retry policy with a short delay handles transient connection issues. Option A is wrong because increasing the number of parallel copies does not solve connection refused errors. Option B is wrong because Azure ExpressRoute is costly and unnecessary for FTP.

Option C is wrong because migrating to SFTP may not resolve connection refused if the server is overloaded.

Full explanation →

218

MCQhard

You are designing a data processing solution using Azure Synapse Analytics serverless SQL pool. The solution will query data stored in Parquet files in Azure Data Lake Storage Gen2. You need to ensure that the queries are optimized for performance. Which action should you take?

A.Increase the MAXDOP setting in the query.

B.Convert the Parquet files to CSV format for faster parsing.

C.Create materialized views on the external tables.

D.Partition the Parquet files by date and use partition pruning in the query.

AnswerD

Partition pruning limits the data scanned, improving performance.

Why this answer

Option B is correct because partitioning the data by a commonly filtered column, such as date, allows the serverless SQL pool to use partition elimination, reducing the amount of data scanned. Option A is wrong because increasing MAXDOP may not help and can cause resource contention. Option C is wrong because usingOPENROWSET with CSV is less efficient than Parquet.

Option D is wrong because materialized views are not supported in serverless SQL pool.

Full explanation →

219

Multi-Selectmedium

Which TWO actions should you take to ensure that data at rest is encrypted in Azure Synapse Analytics dedicated SQL pool?

Select 2 answers

A.Enable Always Encrypted with secure enclaves.

B.Enable infrastructure-level encryption using double encryption.

C.Apply column-level encryption using ENCRYPTBYPASSPHRASE.

D.Create a customer-managed key in Azure Key Vault.

E.Enable Transparent Data Encryption (TDE) using service-managed keys.

AnswersB, E

Azure Storage double encryption provides additional layer at rest.

Why this answer

Options A and D are correct. TDE encrypts data at rest for dedicated SQL pool, and enabling double encryption provides additional protection. Option B is wrong because Always Encrypted is for client-side encryption, not at rest.

Option C is wrong because column-level encryption is for specific columns, not the whole pool. Option E is wrong because creating a customer-managed key is only part of TDE configuration, not a separate action.

Full explanation →

220

MCQhard

Refer to the exhibit. You have an Azure Synapse pipeline that runs a Spark notebook daily. The notebook uses the inputDate parameter to filter data. The notebook successfully processes data for '2024-01-01' but fails for '2024-01-02' with an error that the 'sales' table does not exist. The 'sales' table is created daily by a preceding job. What is the most likely cause?

A.The notebook expects a table named 'sales_20240102' but the preceding job creates 'sales_20240101'

B.The notebook activity should have a dependency on the job that creates the table

C.The Spark pool does not have permissions to read the storage account where the table data is stored

D.The pipeline parameter 'inputDate' is not being passed to the notebook correctly

AnswerA

The notebook likely constructs table name from the date parameter, and the table for the new date hasn't been created.

Why this answer

The error indicates that the notebook is looking for a table named 'sales_20240102' (based on the inputDate parameter for '2024-01-02'), but the preceding job creates a table named 'sales_20240101' (the previous day's table). This mismatch occurs because the notebook dynamically constructs the table name using the inputDate parameter, and the preceding job likely creates the table with a date suffix that does not align with the current inputDate. The correct answer is A because the table naming convention is inconsistent between the two processes.

Exam trap

The trap here is that candidates may assume the error is due to missing dependencies or permissions, but the real issue is a logical mismatch in table naming conventions between the table creation job and the notebook's expected table name.

How to eliminate wrong answers

Option B is wrong because adding a dependency on the job that creates the table would only ensure the job runs before the notebook, but it would not fix the naming mismatch between the table created and the table expected. Option C is wrong because the error message explicitly states the 'sales' table does not exist, not a permission issue; a permissions error would typically manifest as an 'Access Denied' or 'AuthorizationFailure' exception. Option D is wrong because if the inputDate parameter were not passed correctly, the notebook would likely fail for all dates or use a default value, not fail specifically for '2024-01-02' while succeeding for '2024-01-01'.

Full explanation →

221

MCQeasy

You need to store semi-structured JSON data from a web application. The data schema may change over time. The solution must support low-latency queries and be globally distributed. Which Azure data service should you use?

A.Azure Table Storage

B.Azure Cosmos DB

C.Azure Data Lake Storage Gen2

D.Azure SQL Database

AnswerB

Cosmos DB natively supports JSON documents with automatic indexing and global distribution.

Why this answer

Azure Cosmos DB is the correct choice because it natively supports semi-structured JSON documents with a flexible schema, offers single-digit-millisecond latency for queries, and provides global distribution with turnkey multi-region replication. Its API for MongoDB or SQL API allows direct ingestion of JSON data, and its schema-agnostic indexing adapts automatically to schema changes over time.

Exam trap

The trap here is that candidates often confuse Azure Table Storage's key-value model with document storage, overlooking that it lacks native JSON support and global low-latency query capabilities, while Azure Cosmos DB is specifically designed for these requirements.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a key-value store that does not natively support JSON documents or complex nested structures; it requires manual serialization and lacks global distribution with low-latency guarantees. Option C is wrong because Azure Data Lake Storage Gen2 is optimized for large-scale batch analytics and data lakes, not for low-latency transactional queries on semi-structured data. Option D is wrong because Azure SQL Database requires a fixed relational schema and does not handle dynamic schema changes without manual migrations, nor does it offer the same turnkey global distribution as Cosmos DB.

Full explanation →

222

Drag & Dropmedium

Drag and drop the steps to implement incremental data loading using Azure Data Factory into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

Incremental loading requires a watermark column. The pipeline retrieves the last watermark, copies data changed since then, then updates the watermark for next run.

Full explanation →

223

MCQmedium

You are developing a data processing pipeline in Azure Synapse Analytics that uses a SQL script to transform data in a dedicated SQL pool. The pipeline currently runs in the built-in serverless pool, but you want to migrate it to a dedicated SQL pool for better performance. What must you change in the pipeline?

A.Change the linked service from serverless to dedicated SQL pool.

B.Replace the SQL script with a Mapping Data Flow activity.

C.Update the integration runtime to a self-hosted IR.

D.Modify the SQL script to use T-SQL compatible with dedicated SQL pool.

AnswerD

Dedicated SQL pool has a different T-SQL surface area; some functions may need to be rewritten.

Why this answer

Option A is correct because dedicated SQL pool uses T-SQL with specific syntax differences from serverless SQL pool. Option B (linked service) may also need updating but is not the primary change. Option C (integration runtime) is not necessarily required if both pools are in the same workspace.

Option D (data flow) is a different activity type.

Full explanation →

224

MCQhard

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query data in Azure Data Lake Storage Gen2. The data is stored in Delta Lake format. Which of the following statements is true regarding querying Delta Lake tables with serverless SQL pool?

A.Serverless SQL pool supports Delta Lake only if the files are in Parquet format.

B.You can use the OPENROWSET function with the BULK option and FORMAT='DELTA' to query Delta Lake tables.

C.Serverless SQL pool cannot query Delta Lake tables; you must convert them to Parquet first.

D.You must create an external table in serverless SQL pool using the CREATE EXTERNAL TABLE statement with the Delta format.

AnswerB

OPENROWSET with FORMAT='DELTA' allows querying Delta Lake tables directly.

Why this answer

Serverless SQL pool in Azure Synapse Analytics supports querying Delta Lake tables directly using the OPENROWSET function with the BULK option and FORMAT='DELTA'. This allows you to read Delta Lake data stored in Azure Data Lake Storage Gen2 without needing to convert it to Parquet or create external tables first. The Delta format is natively supported, enabling time travel and schema evolution features.

Exam trap

The trap here is that candidates assume Delta Lake requires special handling or conversion to Parquet, but serverless SQL pool natively supports Delta via OPENROWSET with FORMAT='DELTA', making options A and C incorrect, while option D misleads by suggesting external tables are the primary method.

How to eliminate wrong answers

Option A is wrong because serverless SQL pool supports Delta Lake natively, not only when files are in Parquet format; Delta Lake itself uses Parquet as the underlying storage format but adds transaction logs and metadata. Option C is wrong because serverless SQL pool can query Delta Lake tables directly using OPENROWSET with FORMAT='DELTA', so conversion to Parquet is unnecessary. Option D is wrong because while you can create external tables for Delta Lake, the correct and simplest method is to use OPENROWSET with FORMAT='DELTA', not a CREATE EXTERNAL TABLE statement with Delta format (which is not supported for external tables in serverless SQL pool).

Full explanation →

225

Multi-Selectmedium

Which THREE options are valid ways to transform data in Azure Synapse Analytics?

Select 3 answers

A.Use Power Query online in Synapse pipelines.

B.Use T-SQL scripts in a dedicated SQL pool.

C.Use Mapping Data Flows in Synapse pipelines.

D.Use Spark notebooks in Synapse Spark pools.

E.Use Azure Machine Learning pipelines for data wrangling.

AnswersB, C, D

T-SQL is a primary way to transform data in Synapse SQL pools.

Why this answer

Option B is correct because T-SQL scripts are a native and primary method for transforming data within a dedicated SQL pool in Azure Synapse Analytics. You can use CREATE TABLE AS SELECT (CTAS), INSERT...SELECT, and other T-SQL statements to perform complex transformations like aggregations, joins, and data cleansing directly on the distributed data, leveraging the MPP (Massively Parallel Processing) engine for high performance.

Exam trap

The trap here is that candidates often confuse Power Query Online (a Power BI/ADF feature) with Mapping Data Flows (a Synapse pipeline activity), or assume Azure Machine Learning pipelines are valid for data wrangling in Synapse, when in fact they are separate services for ML lifecycle management.

Full explanation →

Page 3 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →