Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 301–375

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 5 of 12

301

MCQmedium

You need to partition a large Azure SQL Database table by date to improve query performance and manageability. Which partitioning strategy should you use?

A.CREATE PARTITION FUNCTION myDateRange (datetime2) AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', ...)

B.CREATE PARTITION FUNCTION myDateRange (datetime2) AS RANGE LEFT FOR VALUES ('2023-01-01', '2023-02-01', ...)

C.Use Azure SQL Database automatic partitioning feature

D.Apply a clustered columnstore index with partitioning

AnswerA

Why this answer

Option A is correct because `RANGE RIGHT` ensures that each boundary value belongs to the right partition, which is the standard approach for date-based partitioning. This means values less than '2023-01-01' go into the first partition, values >= '2023-01-01' and < '2023-02-01' go into the second, and so on, aligning with typical date range queries and simplifying partition management (e.g., switching out old partitions).

Exam trap

The trap here is that candidates often confuse `RANGE LEFT` and `RANGE RIGHT`, mistakenly thinking `LEFT` is the default or more natural for date ranges, when in fact `RANGE RIGHT` is the standard for non-overlapping, sliding-window date partitions.

Why the other options are wrong

B

RANGE LEFT includes the boundary value in the left partition, which can lead to uneven data distribution.

C

Azure SQL Database does not have automatic partitioning; you must create it manually.

D

Columnstore indexes are for analytics, not the primary partitioning mechanism.

Full explanation →

302

Multi-Selecteasy

Which of the following are valid activities in an Azure Data Factory pipeline? (Choose two.)

Select 2 answers

A.Copy

B.Databricks Notebook

C.Assign

D.Execute Pipeline

AnswersA, B

Why this answer

A is correct because the Copy activity is a fundamental data movement activity in Azure Data Factory (ADF) that allows you to ingest data from a wide range of supported source stores (e.g., Azure Blob Storage, SQL Server) to sink stores (e.g., Azure Synapse Analytics, Azure Data Lake Storage). It uses the underlying integration runtime to perform the actual data transfer, supporting both built-in connectors and self-hosted runtimes for on-premises sources. This makes it a core, valid activity for building ETL/ELT pipelines.

Exam trap

The trap here is that candidates may confuse the 'Execute Pipeline Activity' (a valid control flow activity) with the incomplete 'Execute Pipeline' option, or mistakenly think 'Assign' is a real activity due to its similarity to variable assignment in programming languages, when in fact ADF uses 'Set Variable' for that purpose.

Why the other options are wrong

C

No such activity in ADF; the correct term is 'Set Variable'.

D

This is a valid activity but the question asks for two, and the first two are more common.

Full explanation →

303

MCQeasy

Your organization uses Microsoft Purview to catalog data assets. You need to ensure that sensitive data such as credit card numbers are automatically detected and labeled. Which Purview feature should you configure?

A.Create an Azure Policy to enforce tagging.

B.Configure a scan rule set with built-in classification rules for sensitive data types.

C.Enable the Data Catalog self-service search.

D.Enable Microsoft Information Protection for the data sources.

AnswerB

Scan rule sets enable automatic detection of sensitive data.

Why this answer

Option A is correct because Microsoft Purview Data Map includes automated scanning and classification of sensitive data types. Option B is wrong because Data Catalog is for searching and governance, not scanning. Option C is wrong because Information Protection is for labeling in Microsoft 365, not data catalog.

Option D is wrong because Azure Policy is for compliance rules, not scanning.

Full explanation →

304

MCQhard

You are responsible for securing an Azure Synapse Analytics workspace. The workspace contains dedicated SQL pools and serverless SQL pools. You need to ensure that only users with specific Microsoft Entra ID roles can query serverless SQL pools, while dedicated SQL pools use SQL authentication. What should you do?

A.Create SQL logins for serverless SQL pools and assign permissions to Microsoft Entra ID groups.

B.Configure serverless SQL pools to use Microsoft Entra ID authentication only, and dedicated SQL pools to allow SQL authentication.

C.Use a managed identity for serverless SQL pools and assign it to Microsoft Entra ID roles.

D.Disable SQL authentication for all pools and enforce Microsoft Entra ID authentication only.

AnswerB

Serverless SQL pools support only Entra ID; dedicated SQL pools can use both.

Why this answer

Option B is correct because serverless SQL pools in Azure Synapse Analytics natively support Microsoft Entra ID authentication, allowing you to restrict access to specific Entra ID roles by disabling SQL authentication for those pools. Dedicated SQL pools can independently be configured to allow SQL authentication, enabling coexistence of both authentication methods as required. This separation ensures that only users with the designated Entra ID roles can query serverless SQL pools, while dedicated SQL pools remain accessible via SQL logins.

Exam trap

The trap here is that candidates may assume SQL logins can be created for serverless SQL pools (Option A) or that managed identities can be used to control user access (Option C), when in fact serverless pools only support Microsoft Entra ID authentication and managed identities are for service principals, not user permissions.

How to eliminate wrong answers

Option A is wrong because serverless SQL pools do not support SQL logins; they rely exclusively on Microsoft Entra ID authentication, so creating SQL logins for them is not possible. Option C is wrong because managed identities are used for service-to-service authentication (e.g., connecting to Azure Storage), not for granting user access to serverless SQL pools via Entra ID roles. Option D is wrong because it would disable SQL authentication entirely, contradicting the requirement that dedicated SQL pools must use SQL authentication.

Full explanation →

305

MCQhard

You are designing a data processing solution for a financial services company. They need to process sensitive customer data in Azure Databricks while complying with GDPR. The data must be encrypted at rest and in transit, and access must be audited. You need to recommend a configuration that meets these requirements. Which combination of actions should you take?

A.Enable Azure Storage Service Encryption (SSE) with platform-managed keys and configure Azure Databricks to use a VNet injection.

B.Apply an Azure Policy to require encryption and assign a built-in GDPR blueprint.

C.Use Azure Security Center to monitor for data exposure and enable Azure Defender for Storage.

D.Enable double encryption using Azure Key Vault with customer-managed keys and enable Azure Databricks audit logs.

AnswerD

Double encryption provides encryption at rest and in transit; customer-managed keys give control; audit logs track access.

Why this answer

Option B is correct because enabling double encryption and using Azure Key Vault for encryption keys ensures encryption at rest and in transit; Azure Databricks audit logs capture access. Option A (Azure Storage Service Encryption) encrypts at rest but not necessarily with customer-managed keys. Option C (Azure Policy) enforces compliance but does not directly encrypt data.

Option D (Azure Security Center) provides monitoring but not encryption.

Full explanation →

306

MCQhard

Your organization uses Azure Data Lake Storage Gen2 with hierarchical namespace enabled. You need to grant a service principal read and write access to a specific directory without granting access to the parent directories. What should you use?

A.Assign the Storage Blob Data Contributor role at the directory level using RBAC.

B.Use a managed identity and assign it to the directory.

C.Create a stored access policy on the directory.

D.Set ACLs on the directory with default ACLs for the service principal.

AnswerA

RBAC roles can be scoped to directories in Azure Data Lake Storage Gen2.

Why this answer

Option A is correct because Azure RBAC with scope at the directory level can be assigned using the Storage Blob Data Contributor role. Option B is wrong because ACLs are used for fine-grained permissions but are scoped to the file system level for the default ACL. Option C is wrong because managed identity is an identity, not a permission mechanism.

Option D is wrong because access policies are used for shared access signatures, not for service principals.

Full explanation →

307

MCQmedium

You are building a data pipeline in Azure Synapse Analytics that ingests JSON files from Azure Data Lake Storage Gen2, transforms the data using a mapping data flow, and loads it into a dedicated SQL pool. The pipeline must support incremental loads based on a LastModified timestamp in the source files. Which configuration should you use in the mapping data flow source transformation?

A.Use a 'Column pattern' to detect new columns and filter rows.

B.Configure 'Row sampling' settings to take only new rows.

C.Set 'Source' to 'New rows only' in the sink transformation.

D.Enable 'Incremental extraction' and specify the timestamp column for filtering.

AnswerD

Incremental extraction filters rows based on a watermark column like LastModified.

Why this answer

Option C is correct because the 'Incremental extraction' option in a mapping data flow source transformation allows you to filter rows based on a last modified timestamp column, enabling incremental loads. Option A is wrong because 'Column pattern' is used for detecting changes in column names, not row-level timestamps. Option B is wrong because 'Row sampling' selects a random subset of rows, not based on timestamps.

Option D is wrong because 'Source metadata' in the sink transformation is unrelated to incremental loading.

Full explanation →

308

MCQeasy

You are a data engineer at a retail company. You have designed a near real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs, which receives clickstream events from the company's e-commerce website. The output is written to an Azure SQL Database table for reporting. Each event includes fields: UserId, ProductId, EventType (e.g., 'click', 'purchase'), and Timestamp. The requirement is to calculate the number of purchases per product in a 5-minute tumbling window and update a SQL table. The Stream Analytics job has been running for a week, but the reporting team notices that the purchase counts in SQL are consistently lower than expected compared to a direct count from Event Hubs. You suspect that late-arriving events are being dropped. The job's configuration includes a 5-minute tumbling window with no late arrival policy. What should you do to fix the issue without losing data?

A.Modify the query to use a larger tumbling window (e.g., 10 minutes) and add a late arrival policy with a 5-minute grace period to allow late events to be included.

B.Modify the query to use TIMESTAMP BY on the EventHubs enqueued time instead of the event's Timestamp field.

C.Change the tumbling window to a hopping window with a 1-minute hop size to increase the frequency of output updates.

D.Add a second Stream Analytics job to process late-arriving events separately and union the results.

AnswerA

A larger window with a late arrival policy captures late-arriving events.

Why this answer

Option A is correct because the current 5-minute tumbling window has no late arrival policy, so any event that arrives after the window ends is dropped. By increasing the window size to 10 minutes and adding a 5-minute late arrival grace period, you allow events that arrive up to 5 minutes late to still be included in the correct window, matching the actual purchase count from Event Hubs.

Exam trap

The trap here is that candidates may think increasing the window size or changing the window type (hopping) will fix the issue, but the core problem is the lack of a late arrival policy to handle events that arrive after the window closes.

How to eliminate wrong answers

Option B is wrong because using TIMESTAMP BY on the Event Hubs enqueued time does not solve the late arrival issue; it only changes the timestamp used for windowing, but late events still arrive after the window closes and would be dropped without a late arrival policy. Option C is wrong because changing to a hopping window with a 1-minute hop size increases output frequency but does not address late-arriving events; late events would still be dropped if they arrive after the window end time. Option D is wrong because adding a second Stream Analytics job to process late events separately is unnecessarily complex and introduces data duplication and reconciliation challenges; the correct approach is to use a late arrival policy within a single job.

Full explanation →

309

Multi-Selectmedium

Which TWO actions can you take to optimize the performance of an Azure Synapse Analytics dedicated SQL pool? (Choose two.)

Select 2 answers

A.Scale up the SQL pool to a higher DWU.

B.Replicate small dimension tables.

C.Use heap indexes for fact tables.

D.Use round-robin distribution for all large fact tables.

E.Use hash distribution on a column used in joins and aggregations.

AnswersB, E

Replication reduces data movement for joins with fact tables.

Why this answer

Options A and E are correct. A improves query performance by distributing data for parallelism. E reduces data movement by colocating joins on the same distribution.

B is wrong because round-robin distributes data evenly but does not reduce data movement. C is wrong because heaps are not optimal for data warehousing. D is wrong because increasing DWU may help but is not always the best optimization.

Full explanation →

310

Multi-Selecteasy

You are designing a data storage solution for a real-time dashboard that displays streaming data from Azure Event Hubs. The data must be stored in a format that supports both real-time and batch analytics with minimal latency. Which TWO technologies should you use?

Select 2 answers

A.Azure Stream Analytics

B.Azure Data Factory

C.Azure Synapse Analytics

D.Azure Analysis Services

E.Azure SQL Database

AnswersA, C

Stream Analytics processes streaming data in real-time.

Why this answer

Azure Stream Analytics is correct because it provides real-time stream processing with low latency, directly ingesting data from Event Hubs and outputting to storage or analytics services. It enables both real-time dashboard queries and batch analytics by writing to a staging store like Azure Data Lake Storage, which can then be queried by Azure Synapse Analytics for historical analysis.

Exam trap

The trap here is that candidates often confuse Azure Data Factory as a real-time processing tool, but it is strictly a batch orchestration service with no native stream processing capability.

Full explanation →

311

MCQhard

Your team uses Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query performance is inconsistent, and some queries take a long time to execute. You need to improve query performance. What should you do?

A.Increase the MAXDOP setting in the query

B.Create statistics on the columns used in joins and filters

C.Move the data to a dedicated SQL pool

D.Convert the Parquet files to CSV format

AnswerB

Statistics help the optimizer generate efficient plans.

Why this answer

Option B (Create statistics on the columns used in joins and filters) is correct because serverless SQL pool relies on statistics for optimal query plans. Option A (Increase the maximum degree of parallelism) is not directly applicable. Option C (Convert to CSV) would degrade performance.

Option D (Use a dedicated SQL pool) may be an option but not the best immediate step.

Full explanation →

312

MCQmedium

A data engineering team is building a batch processing solution for a financial services company. Data is ingested daily from multiple sources into Azure Data Lake Storage Gen2 in CSV format. The data must be transformed (filtered, aggregated, joined) and loaded into Azure Synapse Analytics dedicated SQL pool. The team must optimize for cost and performance. The total data volume is 2 TB per day. The team has the following options: Option A: Use Azure Data Factory pipelines with copy activity to load raw CSV files into Synapse staging tables, then use T-SQL stored procedures in Synapse to perform transformations. Option B: Use Azure Databricks with Auto Loader to incrementally ingest CSV files, perform transformations in Spark, and write the results to Synapse using the Spark Synapse connector. Option C: Use Azure Data Factory with mapping data flows to transform the data in a serverless environment and then write to Synapse. Option D: Use Azure Synapse Pipelines (built on ADF) with a notebook activity that runs a PySpark notebook in Synapse Spark pool to transform and load data. Which option should the team choose to minimize cost and management overhead while meeting performance requirements?

A.Option B

B.Option C

C.Option A

D.Option D

AnswerB

Serverless, cost-effective, low overhead.

Why this answer

Option C is correct because mapping data flows run on a serverless ADF integration runtime, which scales automatically and incurs cost only per execution, minimizing management overhead. Option A requires managing staging tables and stored procedures. Option B requires managing Spark clusters.

Option D requires managing Synapse Spark pools. Mapping data flows are cost-effective for batch transformations of this size.

Full explanation →

313

Multi-Selecthard

A company uses Azure Synapse Analytics with a dedicated SQL pool. Data engineers need to implement column-level security so that only users with the 'Manager' role can see salary columns. Which TWO actions should they take?

Select 2 answers

A.Create a stored procedure that checks the user's role and returns the appropriate columns.

B.Grant the 'Manager' role SELECT permission on the security policy.

C.Create a security policy with a filter predicate on the salary column using the function, and set the state to ON with BLOCK predicate.

D.Create an inline table-valued function that returns 1 if the user is a member of the 'Manager' role, else 0.

E.Use GRANT SELECT ON OBJECT::[dbo].[Employee](Salary) TO [Manager] to grant access to the salary column.

AnswersC, D

The security policy with BLOCK prevents updates/inserts that would expose the column.

Why this answer

Option C is correct because column-level security in Azure Synapse Analytics dedicated SQL pools is implemented via a security policy that uses a filter predicate (a function) to restrict access to specific columns. Setting the state to ON with a BLOCK predicate ensures that unauthorized users cannot see the salary column. Option D is correct because the inline table-valued function is the predicate function that checks role membership (e.g., using IS_MEMBER('Manager')) and returns 1 or 0, which the security policy uses to filter rows.

Exam trap

The trap here is that candidates often confuse column-level security with row-level security or assume that GRANT statements can be used to restrict column access, but Azure Synapse dedicated SQL pools require a security policy with a predicate function for column-level restrictions.

Full explanation →

314

MCQhard

Refer to the exhibit. You have created the custom RBAC role shown and assigned it to a security group. Members of the group report that they can read blobs in the storage account but cannot list the contents of the container. What is the most likely reason for this issue?

A.Custom roles are not supported for Azure Data Lake Storage Gen2.

B.The role is scoped to the storage account but not to the container.

C.The role does not include the permission to list blobs in a container.

D.The role lacks the 'read' data action for blobs.

AnswerC

To list blobs, the role needs 'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read' permission, but that only reads individual blobs. The 'list' action requires 'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read' with the 'list' permission, or the role must include 'Microsoft.Storage/storageAccounts/blobServices/containers/read' which allows listing container contents.

Why this answer

Option A is correct because the role lacks the 'list' action on containers. The 'read' action on containers only allows reading container properties and metadata, not listing blobs. Option B is wrong because the 'read' data action on blobs is included.

Option C is wrong because the scope is correct for the storage account. Option D is wrong because custom roles can be used for ADLS Gen2.

Full explanation →

315

MCQeasy

A data engineer needs to process a large dataset stored in Azure Blob Storage using Azure Databricks. The dataset consists of millions of small CSV files. The processing job is slow due to the overhead of reading many small files. Which technique should be used to improve performance?

A.Increase the number of worker nodes in the cluster

B.Convert the CSV files to Parquet format

C.Coalesce the small files into larger files using a Databricks notebook

D.Use Delta Lake caching to store the data in memory

AnswerC

Reduces file count and improves read performance.

Why this answer

Option C is correct because coalescing the millions of small CSV files into larger files reduces the metadata overhead and I/O operations when reading from Azure Blob Storage. Databricks can then process fewer, larger files more efficiently, as each task handles a substantial data chunk rather than incurring the cost of opening and closing many small files.

Exam trap

The trap here is that candidates often assume performance issues are always solved by scaling out (Option A) or by switching formats (Option B), but the DP-203 exam specifically tests the understanding that small file overhead is a distinct problem requiring file consolidation.

How to eliminate wrong answers

Option A is wrong because simply adding more worker nodes does not address the root cause of small file overhead; it may even exacerbate the problem by increasing the number of tasks that each try to read a small file, leading to more scheduler and I/O contention. Option B is wrong because converting CSV to Parquet improves compression and columnar read performance but does not reduce the number of files; the overhead of opening millions of small Parquet files remains similar to CSV. Option D is wrong because Delta Lake caching stores data in memory after it is read, but it does not reduce the initial read overhead of millions of small files; the first read still suffers from the same small file penalty.

Full explanation →

316

MCQhard

You are migrating an on-premises SQL Server database to Azure. The database has a large fact table (500 GB) and several dimension tables (10 GB total). Reporting queries join the fact table with dimension tables and aggregate by date. Which Azure service and table design should you recommend to minimize query latency?

A.Azure Synapse SQL Pool with replicated tables for both fact and dimension tables

B.Azure SQL Database Hyperscale with columnstore indexes

C.Azure Synapse SQL Pool with hash distribution on the fact table's foreign key and round-robin for dimension tables

D.Azure SQL Database with rowstore indexes and a single database

AnswerC

Hash distribution enables co-location joins, improving query performance.

Why this answer

Option C is correct because Azure Synapse SQL Pool with hash distribution on the fact table's foreign key ensures that related rows from the fact and dimension tables are co-located on the same compute node, minimizing data movement during joins. Round-robin distribution for the small dimension tables is appropriate since they are under 1 GB each and can be broadcast to all nodes, further reducing shuffle overhead. This design optimizes parallel query execution for large fact table aggregations by date.

Exam trap

The trap here is that candidates often confuse replicated tables as a universal performance booster, not realizing that replicating a large fact table is impractical and that hash distribution on the join key is the correct pattern for large fact tables in a distributed MPP environment.

How to eliminate wrong answers

Option A is wrong because replicated tables are designed for small dimension tables (typically under 1 GB), but replicating a 500 GB fact table would consume excessive storage and cause high replication overhead, defeating the purpose of minimizing query latency. Option B is wrong because Azure SQL Database Hyperscale is optimized for transactional workloads with high concurrency and large database sizes, not for large-scale analytical aggregations across a massive fact table; columnstore indexes help but the single-node architecture cannot match the distributed parallel processing of Synapse SQL Pool. Option D is wrong because rowstore indexes and a single database instance lack the distributed compute and storage needed to efficiently join and aggregate a 500 GB fact table, leading to high I/O and long query times.

Full explanation →

317

MCQhard

A financial services firm uses Azure Synapse Analytics to process daily trade data. The data is stored in a dedicated SQL pool as partitioned tables by date. Recently, queries that filter on a specific date range have become slow. You suspect that partition pruning is not working effectively. What should you do to improve query performance?

A.Rebuild the columnstore indexes on the table

B.Convert the table to a rowstore heap

C.Create statistics on the date column

D.Increase the number of partitions for the table

AnswerA

Rebuilding reduces fragmentation and improves partition pruning.

Why this answer

Option D (Rebuild the columnstore indexes) is correct because fragmented columnstore indexes can cause poor partition pruning and overall slow query performance. Option A (Increase the number of partitions) may worsen performance due to partition overhead. Option B (Switch to rowstore) would degrade performance for large tables.

Option C (Create statistics on the date column) helps but is not the primary issue here; the problem is likely index fragmentation.

Full explanation →

318

Multi-Selecthard

You are optimizing an Azure Synapse Analytics dedicated SQL pool that runs a mix of reporting and ETL workloads. The ETL jobs often encounter resource wait times due to concurrent reporting queries. You need to ensure that ETL jobs always get the resources they need. Which two actions should you take? (Choose two.)

Select 2 answers

A.Increase the DWU (Data Warehouse Units) to provide more overall resources.

B.Assign HIGH importance to the ETL workload classifier.

C.Enable result set caching for reporting queries.

D.Create materialized views for common reporting aggregations.

E.Create a workload group for ETL with a minimum resource percentage and assign it to a dedicated resource pool.

AnswersB, E

HIGH importance ensures ETL queries are prioritized over lower importance reporting queries.

Why this answer

Options A and C are correct. Workload isolation with separate resource pools ensures ETL gets dedicated resources. Assigning HIGH importance ensures ETL queries are prioritized.

Option B is wrong because increasing DWU adds resources but does not guarantee priority. Option D is wrong because result set caching benefits reporting, not ETL. Option E is wrong because materialized views improve query performance but do not guarantee resource allocation.

Full explanation →

319

MCQmedium

You are developing a data processing pipeline in Azure Data Factory. The pipeline must copy data from an on-premises SQL Server to Azure Blob Storage. The data volumes are large, and the network connection is unreliable. Which configuration should you use to ensure resilience and resume capability?

A.Use a staging copy with Azure Blob Storage as an intermediate store.

B.Configure the copy activity with fault tolerance and enable 'Skip incompatible rows'. Use a self-hosted integration runtime for the source.

C.Increase the 'parallelCopies' setting to 10 and use a staging table.

D.Use PolyBase to load data directly into Azure Synapse Analytics.

AnswerB

Fault tolerance allows the copy to continue despite errors, and self-hosted IR supports checkpointing for resume.

Why this answer

Option B is correct because using a self-hosted integration runtime enables connectivity to on-premises SQL Server, and enabling fault tolerance with 'Skip incompatible rows' allows the copy activity to continue despite transient failures or data inconsistencies. This configuration ensures resilience by automatically retrying failed operations and skipping rows that cause errors, which is critical for large data volumes over an unreliable network.

Exam trap

Microsoft often tests the misconception that staging or parallelization alone provides resilience, but the key is that fault tolerance with row-level skipping is the only option that directly handles data errors and network interruptions without failing the entire copy.

How to eliminate wrong answers

Option A is wrong because using Azure Blob Storage as an intermediate store (staging copy) adds unnecessary complexity and cost without addressing the core issue of network unreliability; it is typically used for cross-region or cross-account copies, not for resilience against transient failures. Option C is wrong because increasing 'parallelCopies' to 10 can improve throughput but does not provide fault tolerance or resume capability; it may even exacerbate failures by creating more concurrent connections that can fail independently. Option D is wrong because PolyBase is a technology for loading data into Azure Synapse Analytics, not for copying data to Azure Blob Storage, and it does not address the on-premises source or network resilience requirements.

Full explanation →

320

MCQeasy

You are monitoring an Azure Data Factory pipeline that runs hourly. The pipeline executes a stored procedure in an Azure SQL Database. Recently, you have observed that the pipeline occasionally fails with a 'Deadlock' error when the stored procedure runs. The Azure SQL Database is configured with the 'Read Committed Snapshot' isolation level enabled. You need to resolve the deadlock issue with minimal impact on performance. The stored procedure updates multiple tables in a single transaction and is critical for reporting. What should you do?

A.Change the stored procedure to use NOLOCK hints

B.Remove the transaction from the stored procedure

C.Add retry logic in the Data Factory pipeline for the stored procedure activity

D.Disable the 'Read Committed Snapshot' isolation level

AnswerC

Retries handle transient deadlocks gracefully

Why this answer

Option A is correct because implementing retry logic in the pipeline will handle transient deadlock errors by re-executing the activity. Option B is wrong because disabling Read Committed Snapshot might reduce concurrency and increase blocking. Option C is wrong because it removes the transaction, risking data inconsistency.

Option D is wrong because it does not address the deadlock directly and may not be allowed.

Full explanation →

321

Multi-Selectmedium

Which TWO Azure services can be used to implement a data lakehouse architecture with Delta Lake?

Select 2 answers

A.Azure Databricks

B.Azure Data Factory

C.Azure Cosmos DB

D.Azure Synapse Analytics serverless SQL pool

E.Azure SQL Database

AnswersA, D

Databricks is the primary platform for Delta Lake.

Why this answer

Azure Databricks is correct because it provides a unified analytics platform with native Delta Lake support, enabling ACID transactions, schema enforcement, and time travel on data lakes. Delta Lake is an open-source storage layer that brings reliability to data lakes, and Azure Databricks is the primary service for running Delta Lake workloads at scale.

Exam trap

The trap here is that candidates often assume Azure Data Factory can implement a data lakehouse because it can copy data to ADLS, but they miss that a data lakehouse requires a compute engine (like Spark or serverless SQL) and a transactional storage layer (Delta Lake), not just data movement.

Full explanation →

322

Multi-Selectmedium

You are optimizing an Azure Synapse Pipeline that processes large volumes of data using a Mapping Data Flow. The pipeline is taking too long due to data skew. Which THREE techniques can help mitigate data skew?

Select 3 answers

A.Use a broadcast hash join to distribute the smaller table to all nodes.

B.Increase the Data Integration Units (DIU) in the data flow.

C.Use the 'Optimize' tab in the data flow to enable skewed data optimization.

D.Apply a salting technique by adding a random salt to the join key.

E.Repartition the data on a different column using a random distribution.

AnswersA, C, D

Broadcast join avoids shuffling large data, reducing skew impact.

Why this answer

Option B (use skewed data optimization), Option D (broadcast hash join), and Option E (salting the key) are techniques to handle skew in data flows. Option A (increasing DIU) adds resources but doesn't address skew. Option C (repartition) can help but is not a direct skew mitigation; broadcast join is more effective.

Full explanation →

323

MCQeasy

You have an Azure Data Lake Storage Gen2 account that stores sensitive customer data. You need to implement security controls to prevent data exfiltration by a malicious insider who has Contributor role access. Which Azure feature should you use?

A.Enable diagnostic settings to log all access to the storage account.

B.Remove the Contributor role and assign a custom role with read-only permissions.

C.Configure network firewall rules to allow only trusted IP addresses.

D.Apply an Azure Policy that denies data access from unapproved locations.

AnswerD

Azure Policy can enforce network restrictions to prevent data exfiltration.

Why this answer

Option A is correct because diagnostic settings log activity to a Log Analytics workspace, which can be used for monitoring and alerting on suspicious data access, but the question asks to prevent exfiltration. Actually, the best prevention is to use Azure Policy to restrict data access. Wait, re-reading: The question is about preventing exfiltration.

The correct answer is to use Azure Policy with deny effect to restrict access to certain data. But among options, Option A (diagnostic settings) is monitoring, not prevention. Option B (Azure Policy with deny effect) is prevention.

Option C (RBAC with read-only) could help but insider has Contributor. Option D (firewall rules) can restrict network access. The best is to use Azure Policy to enforce data access rules.

Given the options, Option B is most relevant. However, I need to ensure correctness. Let me re-evaluate: The insider has Contributor role, which allows write/read.

To prevent exfiltration, you can use Azure Policy to deny access to specific data or use service endpoints. Option B is correct because Azure Policy can enforce that data cannot be accessed from outside the corporate network. Option A is wrong because it only monitors.

Option C is wrong because the user already has Contributor. Option D is wrong because firewall rules can be bypassed by the insider if they are inside the network. So correct is B.

Full explanation →

324

MCQeasy

A small business uses Azure SQL Database as their operational data store. They want to build a simple reporting solution that refreshes daily. The data volume is about 10 GB. They have limited budget and no dedicated data engineering team. They need to extract data from Azure SQL Database, perform basic transformations (e.g., aggregations, joins), and load into a reporting table in the same database. They want to minimize administrative effort and cost. Which approach should they take? Option A: Use Azure Data Factory with a copy activity to copy data from the source to a staging table, then use a stored procedure activity to run T-SQL transformations. Option B: Use Azure Logic Apps to periodically query the source and insert transformed data into the reporting table. Option C: Use Azure Databricks notebooks scheduled via Azure Data Factory to perform the ETL. Option D: Use Azure SQL Database elastic job to run a T-SQL script that performs the ETL within the database.

A.Option B

B.Option A

C.Option D

D.Option C

AnswerC

Simple, cost-effective, and fits requirements.

Why this answer

Option D is correct because using an Azure SQL Database elastic job to run a T-SQL script directly within the database is the simplest, lowest-cost, and lowest-administration approach for this scenario. It avoids external services, requires no data movement, and leverages existing SQL skills, making it ideal for a small business with limited budget and no dedicated data engineering team.

Exam trap

The trap here is that candidates often over-engineer the solution by choosing Azure Data Factory or Databricks, assuming that any ETL requires a dedicated orchestration or big-data tool, when in fact a simple in-database job is the most cost-effective and administratively lightweight option for small-scale, in-place transformations.

How to eliminate wrong answers

Option A is wrong because Azure Data Factory with a copy activity and stored procedure activity introduces unnecessary complexity, cost, and administrative overhead for a simple in-database ETL that could be done with a single T-SQL script. Option B is wrong because Azure Logic Apps is designed for lightweight orchestration and integration, not for performing complex T-SQL transformations or handling 10 GB of data efficiently; it would be slow, costly, and lack proper error handling for data processing. Option C is wrong because Azure Databricks is a powerful but overkill solution for a 10 GB workload; it requires cluster management, incurs significant cost, and demands specialized Spark knowledge, which contradicts the requirement to minimize administrative effort and cost.

Full explanation →

325

Drag & Dropmedium

Drag and drop the steps to implement Azure Data Lake Storage Gen2 lifecycle management to move data to cool and archive tiers into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

After creating the account, define a lifecycle rule with conditions and actions, then apply it.

Full explanation →

326

MCQhard

Refer to the exhibit. A data engineer notices that Spark jobs on this cluster are running slower than expected. The cluster is using spot instances with fallback. Which factor is most likely causing the performance degradation?

A.The spark.sql.adaptive settings are misconfigured

B.The cluster is using spot instances which may be frequently reclaimed

C.The node type Standard_DS3_v2 is too small

D.The autoscale configuration limits max workers to 8

AnswerB

Spot instances are cheaper but can be terminated at any time, causing job failures or delays due to recomputation.

Why this answer

Spot instances can be preempted, causing delays. The configuration sets 'first_on_demand' to 1, meaning only 1 node is on-demand, and the rest are spot. Spot instances can be reclaimed, leading to recomputation and slower performance.

The adaptive query execution settings are generally beneficial, not harmful.

Full explanation →

327

MCQmedium

You are designing a data solution in Azure that requires all data in transit between Azure Databricks and Azure Storage to be encrypted using a customer-managed key. Which configuration meets this requirement?

A.Enable 'Secure transfer required' on the storage account

B.Configure a service endpoint and a firewall rule to restrict access to Azure Databricks

C.Create a customer-managed key in Azure Key Vault and assign it to the storage account for encryption

D.Set the minimum TLS version to 1.2 on the storage account

AnswerA

This ensures data is encrypted in transit with Microsoft-managed keys, not customer-managed. Customer-managed keys for transit are not supported; client-side encryption would be needed.

Why this answer

Option A is correct because enabling 'Secure transfer required' on the storage account enforces HTTPS for all requests, which encrypts data in transit using TLS. This ensures that all traffic between Azure Databricks and Azure Storage is encrypted, and the encryption keys used for TLS are customer-managed when combined with a customer-managed key for the TLS session (though the question focuses on in-transit encryption, the key management for TLS itself is handled by the platform; the customer-managed key requirement is satisfied by using Azure Key Vault for the storage account's encryption key, but the in-transit encryption is enforced by HTTPS).

Exam trap

The trap here is that candidates confuse encryption at rest (customer-managed keys for Azure Storage Service Encryption) with encryption in transit, and mistakenly select Option C, thinking it covers both, when it only addresses data at rest.

How to eliminate wrong answers

Option B is wrong because configuring a service endpoint and firewall restricts network access to the storage account from Azure Databricks but does not encrypt data in transit; it only controls which source IPs or virtual networks can connect. Option C is wrong because creating a customer-managed key in Azure Key Vault and assigning it to the storage account encrypts data at rest (Azure Storage Service Encryption), not data in transit. Option D is wrong because setting the minimum TLS version to 1.2 only specifies the allowed TLS protocol version for connections but does not enforce encryption itself; encryption is already provided by HTTPS, and this setting does not involve customer-managed keys.

Full explanation →

328

MCQhard

You are designing a real-time analytics solution for IoT devices that emit telemetry data every second. The data must be aggregated every minute and stored in Azure SQL Database for historical analysis. You need to minimize latency and operational overhead. Which approach should you recommend?

A.Use Azure Databricks with Structured Streaming to aggregate and write to SQL Database

B.Use Event Hubs Capture to store raw data in blob storage, then use Azure Data Factory to load into SQL Database hourly

C.Use Azure Stream Analytics with a tumbling window of 1 minute and output to Azure SQL Database

D.Use Azure Functions to process events and write to SQL Database

AnswerC

Minimal latency and operational overhead.

Why this answer

Option C is correct because Azure Stream Analytics natively supports real-time stream processing with tumbling windows, allowing you to aggregate IoT telemetry data every minute and output directly to Azure SQL Database with minimal latency. This approach avoids the overhead of managing clusters (Databricks) or orchestrating batch loads (Data Factory), directly meeting the requirement for low latency and operational simplicity.

Exam trap

The trap here is that candidates often over-engineer the solution by choosing Databricks (Option A) for its flexibility, overlooking that Stream Analytics is purpose-built for low-latency, windowed aggregations with minimal operational overhead, while Databricks adds unnecessary complexity for simple time-based aggregations.

How to eliminate wrong answers

Option A is wrong because Azure Databricks with Structured Streaming introduces significant operational overhead for cluster management and is overkill for simple minute-level aggregation, plus it adds latency from Spark job initialization and checkpointing. Option B is wrong because Event Hubs Capture stores raw data in blob storage, and using Azure Data Factory to load hourly into SQL Database introduces at least 60 minutes of latency, failing the real-time requirement. Option D is wrong because Azure Functions are stateless and event-driven, lacking built-in windowing capabilities for time-based aggregation, so you would need to implement custom state management (e.g., using Durable Functions or external storage), increasing complexity and latency.

Full explanation →

329

Multi-Selectmedium

You are building a real-time processing solution using Azure Stream Analytics. The solution must handle out-of-order events and late arrivals. Which THREE mechanisms should you configure in the Stream Analytics job?

Select 3 answers

A.Set an 'Out-of-order tolerance' window in the event ordering settings.

B.Adjust the 'Streaming units' to handle higher throughput.

C.Configure a 'Late-arrival tolerance' window.

D.Enable 'Event hub capture' to store raw events for reprocessing.

E.Choose an output adapter that supports exactly-once delivery.

AnswersA, C, E

This defines how late events can be reordered.

Why this answer

Option A is correct because Azure Stream Analytics allows you to configure an 'Out-of-order tolerance' window in the event ordering settings. This window defines the maximum time difference that out-of-order events can be reordered before being considered late. By setting this tolerance, you ensure that events arriving slightly out of sequence are still processed correctly, which is critical for real-time analytics where event order matters.

Exam trap

The trap here is that candidates confuse scaling mechanisms (like Streaming units) or storage features (like Event Hubs Capture) with event ordering controls, which are specifically designed to manage temporal anomalies in streaming data.

Full explanation →

330

MCQmedium

You are designing a data pipeline in Azure Data Factory that processes sensitive customer data. The pipeline must use a copy activity to move data from Azure Blob Storage to Azure Synapse Analytics. You need to ensure that data is encrypted in transit and at rest, and that the pipeline uses the most secure authentication method. Which authentication method should you use for the sink dataset?

A.Managed Identity

B.Storage account key

C.Service principal

D.SQL authentication

AnswerA

Managed Identity eliminates the need for secrets and provides secure, seamless authentication.

Why this answer

Option C is correct because Managed Identity provides the most secure authentication with no secrets stored, and it supports Azure Synapse Analytics. Option A is wrong because SQL authentication passes credentials in connection strings. Option B is wrong because service principal requires secret management.

Option D is wrong because storage account key is a shared secret.

Full explanation →

331

MCQeasy

You are an administrator for an Azure Synapse Analytics dedicated SQL pool. You execute the T-SQL statements shown in the exhibit. The external table 'dbo.Orders' is created. Which statement about querying this external table is true?

A.Querying the external table automatically imports data into a round-robin distribution.

B.The table cannot be queried until the data is imported into the dedicated SQL pool.

C.You can query the external table using standard T-SQL SELECT statements.

D.You must first create a PolyBase external table before querying.

AnswerC

External tables support SELECT queries.

Why this answer

Option C is correct because an external table in Azure Synapse Analytics dedicated SQL pool is a read-only abstraction over data stored externally (e.g., in Azure Blob Storage or Azure Data Lake Store). You can query it directly using standard T-SQL SELECT statements without importing data into the pool, leveraging PolyBase to push down predicate filtering and read only the required data.

Exam trap

The trap here is that candidates often assume external tables require an explicit import step before querying, but in reality, PolyBase allows direct querying of external data without any data movement into the dedicated SQL pool.

How to eliminate wrong answers

Option A is wrong because querying an external table does not automatically import data into a round-robin distribution; external tables remain external and data is not stored in the pool unless you explicitly use CREATE TABLE AS SELECT (CTAS) to import it. Option B is wrong because the external table can be queried immediately after creation without importing data; the data stays in external storage and is accessed on-the-fly by PolyBase. Option D is wrong because the T-SQL statements in the exhibit already create a PolyBase external table (using CREATE EXTERNAL TABLE with an external data source and file format), so no additional PolyBase external table creation is needed before querying.

Full explanation →

332

Multi-Selectmedium

You are designing a data storage solution for a healthcare company that must comply with HIPAA. The solution needs to store structured patient records and unstructured medical images. Data must be encrypted at rest and in transit. Which TWO storage solutions meet these requirements?

Select 2 answers

A.Azure Redis Cache

B.Azure Blob Storage

C.Azure Table Storage

D.Azure Cosmos DB

E.Azure SQL Database

AnswersB, E

Blob Storage supports encryption at rest and in transit, and is suitable for images.

Why this answer

Azure Blob Storage supports storing unstructured data like medical images and offers encryption at rest via Storage Service Encryption (SSE) and in transit via HTTPS/TLS. It is HIPAA-eligible when configured with appropriate access controls and logging, making it suitable for the unstructured image component of the solution.

Exam trap

The trap here is that candidates may choose Cosmos DB or Table Storage for structured data, overlooking that Azure SQL Database is the preferred HIPAA-compliant relational store for structured patient records, while Blob Storage is the correct choice for large unstructured images.

Full explanation →

333

Multi-Selectmedium

Which TWO actions should you take to secure data at rest in Azure Data Lake Storage Gen2? (Choose TWO)

Select 2 answers

A.Use Azure RBAC to grant least-privilege access to the storage account.

B.Apply dynamic data masking to sensitive columns.

C.Enable Azure Storage Service Encryption (SSE) for data at rest.

D.Configure firewall rules to restrict network access.

E.Enable audit logging for the storage account.

AnswersA, C

RBAC controls access, a security measure for data at rest.

Why this answer

Correct answers: A and D. A: Enable encryption at rest using Azure Storage Service Encryption (SSE) which is enabled by default. D: Use Azure RBAC to control access to the storage account.

B is wrong because data masking is for databases, not storage. C is wrong because firewall rules secure network access, not data at rest. E is wrong because audit logging is for monitoring, not encryption.

Full explanation →

334

MCQmedium

Your organization is using Azure Synapse Analytics dedicated SQL pool. You notice that queries are running slower than expected. Upon reviewing the execution plans, you see that some queries are performing table scans instead of seeks on large fact tables. What is the most likely cause?

A.The statistics on the tables are outdated or missing.

B.The tables are distributed using round-robin distribution.

C.Result-set caching is disabled.

D.The resource class for the user is set to smallrc.

AnswerA

Outdated or missing statistics can cause the optimizer to choose suboptimal access methods like table scans.

Why this answer

Outdated or missing statistics prevent the Azure Synapse Analytics dedicated SQL pool query optimizer from accurately estimating row counts and data distribution. Without reliable statistics, the optimizer may incorrectly choose a table scan over a more efficient index seek or partition elimination, leading to slower query performance on large fact tables.

Exam trap

The trap here is that candidates often confuse performance issues caused by distribution type or resource class with the optimizer's reliance on statistics, overlooking that even with optimal distribution and sufficient resources, stale statistics force scans instead of seeks.

How to eliminate wrong answers

Option B is wrong because round-robin distribution evenly distributes data across distributions without considering join keys, which can cause data movement but does not directly cause table scans instead of seeks; scans are a symptom of missing statistics or poor index usage. Option C is wrong because result-set caching only affects repeated execution of the same query by storing results, not the initial query plan choice between scan and seek. Option D is wrong because the resource class (e.g., smallrc) controls memory and concurrency slots for the user, not the query optimizer's decision to use scans versus seeks; scans occur regardless of resource class if statistics are stale.

Full explanation →

335

Multi-Selecteasy

Which TWO methods can you use to optimize the cost of storing data in Azure Data Lake Storage Gen2?

Select 2 answers

A.Use customer-managed keys for encryption.

B.Configure lifecycle management policies to move older data to the cool or archive tier.

C.Enable soft delete for blobs.

D.Use Azure Blob Storage access tiers: hot, cool, and archive.

E.Enable geo-redundant storage (GRS) for disaster recovery.

AnswersB, D

Reduces storage cost by moving data to cheaper tiers.

Why this answer

Options A and C are correct. Option A: Lifecycle management policies can move data to cooler tiers. Option C: Using Azure Blob Storage access tiers (hot, cool, archive) directly reduces cost.

Option B is wrong because redundancy options like GRS increase cost. Option D is wrong because enabling soft delete adds storage overhead. Option E is wrong because encryption does not affect storage cost.

Full explanation →

336

Multi-Selecthard

You are a Data Engineer at Contoso Ltd. The company operates an e-commerce platform that generates streaming data from user interactions (clicks, page views, purchases) at a rate of 50,000 events per second. The data is ingested into Azure Event Hubs. You need to design a data processing solution that meets the following requirements: 1. Real-time dashboards must display aggregated metrics (e.g., total sales, active users) with a latency of less than 2 seconds. 2. Historical data must be stored in Azure Data Lake Storage Gen2 in Delta Lake format for batch analytics. 3. The solution must support exactly-once semantics for the streaming output to both the dashboard and the data lake. 4. You must use Azure Stream Analytics as the primary processing engine. 5. The output to the dashboard should use Azure Power BI, while the output to the data lake should use Azure Synapse Analytics serverless SQL pool for querying. You have configured an Azure Stream Analytics job with Event Hubs as input. For output, you added a Power BI dataset and an Azure Data Lake Storage Gen2 output. However, you discover that the Power BI dataset is being updated with duplicate records, and the data lake output sometimes misses small windows of data during job restarts. Which two actions should you take to meet the requirements? (Choose two.) A. Change the Power BI output to use the 'Exactly once' output mode. B. Change the Event Hubs compatibility level to 1.0 to guarantee exactly-once. C. Enable checkpointing in the Stream Analytics job and use the 'Exactly once' output mode for the Azure Data Lake Storage Gen2 output. D. Use a second Stream Analytics job: one for Power BI with 'At least once' mode, and another for the data lake with 'Exactly once' mode.

Select 2 answers

A.Enable checkpointing in the Stream Analytics job and use the 'Exactly once' output mode for the Azure Data Lake Storage Gen2 output.

B.Change the Power BI output to use the 'Exactly once' output mode.

C.Change the Event Hubs compatibility level to 1.0 to guarantee exactly-once.

D.Use a second Stream Analytics job: one for Power BI with 'At least once' mode, and another for the data lake with 'Exactly once' mode.

AnswersA, B

Checkpointing prevents data loss; exactly-once mode ensures no duplicates.

Why this answer

Option A is correct because Power BI output supports 'Exactly once' mode to avoid duplicates. Option C is correct because enabling checkpointing ensures no data loss during restarts, and 'Exactly once' mode for ADLS Gen2 ensures exactly-once delivery. Option B is wrong because compatibility level 1.0 is outdated and does not guarantee exactly-once.

Option D is wrong because it adds complexity and does not solve the issue of duplicates on Power BI.

Full explanation →

337

MCQmedium

A company uses Azure Databricks for data processing. They want to monitor the performance of Spark jobs and set up alerts for job failures. Which Azure service should they use?

A.Azure Advisor

B.Azure Sentinel

C.Azure Log Analytics

D.Azure Monitor

AnswerD

Azure Monitor collects metrics, logs, and enables alerts for Azure resources including Databricks.

Why this answer

Option C is correct because Azure Monitor can collect metrics and logs from Azure Databricks and set up alerts. Option A is wrong because Azure Log Analytics is part of Azure Monitor but is not the service name for alerting. Option B is wrong because Azure Sentinel is a SIEM, not for monitoring job performance.

Option D is wrong because Azure Advisor provides recommendations, not monitoring and alerts.

Full explanation →

338

MCQmedium

Your organization uses Azure Synapse Analytics with serverless SQL pools. You need to ensure that only users with specific Microsoft Entra ID roles can query external tables referencing Azure Data Lake Storage Gen2. What should you configure?

A.Assign a managed identity to the serverless SQL pool and grant it Storage Blob Data Reader on the storage account.

B.Use Azure RBAC to assign Storage Blob Data Reader role to the users on the storage account.

C.Configure a storage account firewall to allow only the Synapse workspace IP range.

D.Grant SELECT permission on the external table to specific Microsoft Entra ID users or groups.

AnswerD

This restricts query access based on user identity.

Why this answer

Option B is correct because Azure Synapse workspace firewall controls network access, but for data access, you need to grant specific permissions on the external data source. Option A is wrong because storage account firewall controls network access, not user identity. Option C is wrong because managed identity for the serverless pool is not user-specific.

Option D is wrong because RBAC on the storage account alone does not restrict query access through Synapse.

Full explanation →

339

MCQmedium

You are reviewing an Azure Resource Manager template for an Azure SQL Database auditing policy. Based on the exhibit, which of the following is true?

A.Audit logs will be retained indefinitely.

B.The audit policy will use default audit actions and groups.

C.Audit logs will be sent to Azure Log Analytics.

D.Audit logs will be written to Azure Blob Storage.

AnswerD

The storageEndpoint property specifies the blob storage account for audit logs.

Why this answer

Option A is correct because the template enables auditing (state: Enabled) and specifies storageEndpoint, which sends audit logs to Azure Blob Storage. Option B is wrong because retentionDays is set to 90, not infinity. Option C is wrong because auditActionsAndGroups is provided, so default actions are not used.

Option D is wrong because the property 'storageEndpoint' indicates blob storage, not Log Analytics.

Full explanation →

340

MCQhard

You are reviewing an Azure Policy assignment that uses the above JSON to define a role-based access control (RBAC) action. What is the primary purpose of this policy?

A.To assign RBAC roles to users for the storage account.

B.To enable delegation of access to a specific blob.

C.To allow users to set permissions on storage account containers.

D.To authorize generation of a shared access signature (SAS) token for the storage account.

AnswerD

The 'listAccountSas' action generates an account-level SAS token.

Why this answer

The policy JSON defines a role-based access control (RBAC) action that grants the 'Microsoft.Storage/storageAccounts/listAccountSas/action' permission. This specific action authorizes the generation of a shared access signature (SAS) token at the storage account level, not at the container or blob level. Therefore, the primary purpose is to allow the generation of an account SAS token, which provides delegated access to storage services.

Exam trap

The trap here is that candidates confuse the account-level SAS generation action with container or blob-level delegation, or mistakenly think the policy itself assigns roles rather than defining a permission that can be used in a custom role.

How to eliminate wrong answers

Option A is wrong because the policy does not assign RBAC roles to users; it defines a permission action that can be included in a role definition, not the assignment of roles themselves. Option B is wrong because delegation of access to a specific blob requires a service SAS or user delegation SAS, which uses different actions (e.g., 'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/...'), not the account-level listAccountSas action. Option C is wrong because setting permissions on storage account containers is managed via container-level RBAC actions (e.g., 'Microsoft.Storage/storageAccounts/blobServices/containers/write') or container ACLs, not the account-level SAS generation action.

Full explanation →

341

MCQeasy

You are a data engineer at a financial services company. The company uses Azure Cosmos DB for NoSQL to store customer transaction data. The data is partitioned by customerId. The application team needs to run analytical queries that aggregate transactions by date across all customers. These queries are currently slow and consume high RUs. You need to enable faster analytical queries without impacting the transactional workload. What should you do?

A.Increase the provisioned RU/s on the container to handle both transactional and analytical queries.

B.Change the partition key to /date to optimize for analytical queries.

C.Create a materialized view using the change feed and store aggregated data in a separate container.

D.Enable the Azure Cosmos DB analytical store (Synapse Link) and query the data using Azure Synapse Serverless SQL.

AnswerD

Analytical store provides a separate columnar store for efficient analytical queries.

Why this answer

Option D is correct because enabling the Azure Cosmos DB analytical store (Synapse Link) creates a separate column-oriented store optimized for large-scale analytical queries without consuming RUs from the transactional workload. By querying this analytical store using Azure Synapse Serverless SQL, you can run fast aggregations across all customers by date while the transactional container remains unaffected.

Exam trap

The trap here is that candidates may think increasing RU/s or changing the partition key is a simpler fix, but the DP-203 exam specifically tests the understanding that analytical workloads must be isolated from transactional workloads using a dedicated analytical store like Synapse Link.

How to eliminate wrong answers

Option A is wrong because simply increasing provisioned RU/s does not separate the analytical workload from the transactional workload; both would still compete for the same throughput, and analytical queries would continue to consume high RUs, potentially throttling transactions. Option B is wrong because changing the partition key to /date would destroy the existing container's partitioning strategy, causing hot partitions for high-volume dates and severely degrading transactional performance for customer-based lookups; partition keys cannot be changed after creation without data migration. Option C is wrong because creating a materialized view using the change feed requires custom code to maintain the view, adds operational complexity, and still consumes RUs on the source container for change feed processing, whereas the analytical store provides a fully managed, zero-ETL solution.

Full explanation →

342

MCQmedium

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. You notice that queries are slower than expected. The files are large (500 MB each) and not partitioned. You need to improve query performance without moving data. What should you do?

A.Convert the Parquet files to Delta format using Azure Databricks.

B.Change the files to CSV format with a header row.

C.Use OPENROWSET with BULK and specify ROWSET_OPTIONS for better performance.

D.Create external tables with explicit file format and partition elimination hints.

AnswerD

External tables with file format optimization can improve query performance by enabling metadata-based pruning.

Why this answer

Creating external tables with file format options allows the serverless SQL pool to use metadata for optimization. Changing the file format to CSV or using OPENROWSET without external tables does not improve performance as much. Converting to Delta format requires data movement.

Partitioning the files would be best, but that requires reorganizing files.

Full explanation →

343

MCQmedium

You are designing a data storage solution for real-time analytics on IoT telemetry. The system must ingest 10,000 events per second and support sub-second query latency. Which Azure data store should you use?

A.Azure Table Storage.

B.Azure SQL Database with in-memory OLTP.

C.Azure Cosmos DB with analytical store.

D.Azure Data Explorer (ADX).

AnswerD

ADX is designed for real-time analytics on large volumes of streaming data with sub-second query latency.

Why this answer

Azure Data Explorer (ADX) is purpose-built for high-velocity telemetry and log analytics, ingesting 10,000+ events per second with sub-second query latency via its columnar storage and distributed query engine. It supports real-time analytics on streaming IoT data without requiring pre-defined schemas or indexing, making it the optimal choice for this scenario.

Exam trap

The trap here is that candidates often confuse Azure Cosmos DB's analytical store with a real-time analytics solution, but it is designed for hybrid transactional/analytical processing (HTAP) on operational data, not for high-velocity streaming telemetry analytics where ADX excels.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a NoSQL key-value store designed for simple lookups and batch operations, not for real-time analytics with sub-second query latency on high-velocity streaming data. Option B is wrong because Azure SQL Database with in-memory OLTP is optimized for transactional workloads (OLTP) with high concurrency, not for analytical queries on high-volume telemetry; it cannot sustain 10,000 events per second with sub-second analytical latency. Option C is wrong because Azure Cosmos DB with analytical store is designed for globally distributed operational data with eventual consistency for analytics, but its analytical store uses a separate columnar format optimized for read-heavy workloads, not real-time sub-second queries on streaming IoT telemetry; it introduces higher latency for ingestion and query compared to ADX.

Full explanation →

344

MCQeasy

You need to monitor the health of your Azure Data Lake Storage Gen2 account. Which metric should you use to track the number of successful and failed requests?

A.Transactions.

B.Success E2E Latency.

C.Blob Capacity.

D.Ingress.

AnswerA

Transactions metric counts the number of requests to the storage service.

Why this answer

Option A is correct because Transactions metric tracks all requests. Option B is wrong because Ingress is about data incoming, not request count. Option C is wrong because SuccE2ELatency measures latency, not count.

Option D is wrong because Blob Capacity measures storage size.

Full explanation →

345

MCQeasy

Your team is building a real-time dashboard using Azure Stream Analytics. The data source is an Azure Event Hub that receives clickstream events. You need to output aggregated data (counts per page per minute) to an Azure SQL Database for reporting. The query must handle late-arriving events and ensure exactly-once semantics. Which Stream Analytics feature should you use?

A.Use a temporal window function with a 'late arrival' policy specified in the query.

B.Use the Input Order section in the Stream Analytics job configuration to set a late arrival tolerance window.

C.Define a watermark in the query to specify a maximum out-of-order tolerance.

D.Set the event ordering policy to 'Adjust' to reorder events within a certain time window.

AnswerB

Input Order policy allows handling late events, and Stream Analytics ensures exactly-once delivery to SQL Database.

Why this answer

Option C is correct because the temporal window functions (e.g., TumblingWindow) with a late arrival policy and exactly-once semantics are built into Stream Analytics. Option A is wrong because watermarks are a concept in Spark, not Stream Analytics. Option B is wrong because the Input Order policy in Stream Analytics allows handling late events, but exactly-once semantics are guaranteed by the combination of checkpointing and output adapters.

Option D is wrong because the event ordering policy is for out-of-order events, not for late arrival.

Full explanation →

346

MCQhard

You are designing a data processing pipeline in Azure Synapse Analytics that reads streaming data from Azure Event Hubs, performs aggregations in real time, and writes results to Azure Cosmos DB for a dashboard. The data volume is 10,000 events per second with 2 KB each. The latency requirement is under 5 seconds from event ingestion to dashboard visibility. Which technology should you use for the real-time aggregation?

A.Azure Synapse Spark with Structured Streaming

B.Azure Stream Analytics

C.Azure Data Factory mapping data flows

D.Azure Synapse dedicated SQL pool with T-SQL queries

AnswerB

Sub-second latency, native Event Hubs and Cosmos DB connectors.

Why this answer

Azure Stream Analytics is the correct choice because it is a fully managed, real-time analytics service designed specifically for low-latency stream processing. It can ingest data from Azure Event Hubs, perform windowed aggregations (e.g., tumbling, hopping, sliding windows) with sub-second latency, and output directly to Azure Cosmos DB, meeting the 5-second latency requirement for the dashboard.

Exam trap

The trap here is that candidates often confuse Azure Synapse Spark Structured Streaming (which is micro-batch, not true streaming) with a real-time engine, or they assume Azure Data Factory can handle streaming data because it supports 'real-time' triggers, but it cannot perform in-flight aggregations with sub-second latency.

How to eliminate wrong answers

Option A is wrong because Azure Synapse Spark with Structured Streaming is a batch-micro-batch engine with higher latency (typically seconds to minutes) and is not optimized for sub-5-second real-time aggregation; it is better suited for complex transformations on large datasets. Option C is wrong because Azure Data Factory mapping data flows are designed for batch ETL/ELT operations on data at rest, not for real-time streaming ingestion or low-latency aggregations. Option D is wrong because Azure Synapse dedicated SQL pool with T-SQL queries is a massively parallel processing (MPP) data warehouse for analytical queries on stored data, not for real-time stream processing; it cannot directly ingest streaming data from Event Hubs and perform sub-5-second aggregations.

Full explanation →

347

MCQmedium

You are implementing a data pipeline in Azure Synapse Analytics that uses PolyBase to load data from Azure Data Lake Storage Gen2 into a dedicated SQL pool. The pipeline runs nightly and processes approximately 500 GB of data. You notice that the load operation is slow and frequently times out. What should you do to improve performance?

A.Create a columnstore index on the target table.

B.Use a clustered index on the target table.

C.Increase the resource class for the loading user.

D.Change the distribution of the target table to round-robin.

AnswerC

Higher resource class allocates more resources to the load operation.

Why this answer

Option C is correct because increasing the resource class in dedicated SQL pool provides more resources (CPU, memory) to the PolyBase load operation, which directly addresses the timeout issue. Option A is wrong because columnstore indexes are relevant for query performance, not load speed. Option B is wrong because round-robin distribution can improve load speed, but the primary issue is resource allocation.

Option D is wrong because the data type of the clustered index does not impact PolyBase load performance.

Full explanation →

348

Multi-Selectmedium

Which THREE factors should you consider when designing a real-time streaming solution using Azure Stream Analytics to process IoT data from thousands of devices?

Select 3 answers

A.The need to join input data with reference data that changes every few seconds.

B.The batch size for output to Azure Synapse Analytics to minimize write transactions.

C.The windowing function and late arrival policy for handling out-of-order events.

D.The latency requirements for writing output to Azure Cosmos DB for NoSQL.

E.The number of streaming units and partitioning scheme for the input Event Hubs.

AnswersB, C, E

Batching reduces transaction costs and improves throughput.

Why this answer

Option A is correct because output batching reduces the number of write operations. Option B is correct because partitioning the input stream improves parallel processing. Option D is correct because late arrival policies handle out-of-order events.

Option C is wrong because reference data is typically loaded once and does not require real-time joins. Option E is wrong because Azure Stream Analytics cannot directly write to Azure Cosmos DB for NoSQL without a custom output, and even then, it is not the primary consideration for real-time processing.

Full explanation →

349

Multi-Selectmedium

You are designing a data processing solution in Azure that must handle both batch and streaming data. The solution should use a common storage layer for both and support schema evolution. Which TWO technologies should you recommend?

Select 2 answers

A.Azure Event Hubs

B.Azure SQL Database

C.Apache Kafka on HDInsight

D.Delta Lake (on Azure Databricks)

E.Azure Data Lake Storage Gen2

AnswersA, D

Common ingestion for batch and streaming.

Why this answer

Azure Event Hubs is correct because it is a fully managed, real-time data ingestion service that can capture streaming data and store it in Azure Data Lake Storage Gen2 or Blob Storage, enabling a unified storage layer for both batch and streaming pipelines. It supports schema evolution through Avro or JSON serialization, allowing downstream consumers to adapt to schema changes without breaking existing processes.

Exam trap

The trap here is that candidates often confuse Azure Data Lake Storage Gen2 as a processing technology rather than a storage layer, or they overlook that Event Hubs is the streaming ingestion service that pairs with Delta Lake for unified batch/streaming and schema evolution.

Full explanation →

350

MCQeasy

Your organization needs to ensure that all data stored in Azure Data Lake Storage Gen2 is encrypted at rest using Microsoft-managed keys. What is the default encryption method?

A.Storage Service Encryption (SSE) with Microsoft-managed keys.

B.Transparent Data Encryption (TDE) on the storage account.

C.Client-side encryption with keys stored in Azure Key Vault.

D.Azure Disk Encryption on the storage nodes.

AnswerA

SSE is enabled by default for all Azure Storage accounts, encrypting data at rest.

Why this answer

Option A is correct because Azure Storage automatically encrypts all data at rest using SSE with Microsoft-managed keys. Option B is wrong because CMK is optional. Option C is wrong because Azure Disk Encryption is for VMs.

Option D is wrong because TDE is for SQL databases.

Full explanation →

351

MCQeasy

Your team is developing a data processing solution in Azure Synapse Analytics. You need to ensure that the solution can automatically scale compute resources based on workload demand for serverless SQL pools. Which feature should you configure?

A.Set a cache size for the serverless SQL pool

B.Configure a dedicated SQL pool with auto-scaling

C.Use workload classification to assign resources

D.Enable auto-resume and auto-pause on the serverless SQL pool endpoint

AnswerD

Serverless SQL pools automatically scale compute resources.

Why this answer

Option A is correct because serverless SQL pools automatically scale based on query demand without manual configuration. Option B is wrong because it's for dedicated SQL pools. Option C is wrong because serverless pools don't use explicit cache.

Option D is wrong because serverless pools do not support workload classification.

Full explanation →

352

MCQmedium

You have an Azure Synapse Analytics dedicated SQL pool that stores sensitive customer data. You need to ensure that only users with a specific Microsoft Entra ID role can access the data, and all access must be logged for auditing. What should you implement?

A.Dynamic data masking

B.Azure RBAC at the SQL pool level

C.Row-level security (RLS) with a security policy

D.Column-level security

AnswerC

RLS restricts row access based on user context, and can be tied to Microsoft Entra ID roles.

Why this answer

Option B is correct because row-level security (RLS) controls access at the row level based on user context, and it can be integrated with Microsoft Entra ID roles. Option A (column-level security) restricts columns, not rows. Option C (dynamic data masking) obfuscates data but does not restrict access.

Option D (Azure RBAC) controls resource management, not data access inside the SQL pool.

Full explanation →

353

MCQmedium

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails intermittently with timeout errors. You need to improve reliability. What should you do?

A.Use a self-hosted integration runtime with high availability

B.Enable fault tolerance and use staging

C.Change the source to Azure SQL Database

D.Increase the degree of copy parallelism

AnswerB

Fault tolerance allows the copy activity to retry on transient errors, and staging improves performance and reliability.

Why this answer

Option B is correct because enabling fault tolerance with staging in Azure Data Factory allows the copy activity to automatically retry transient failures (such as timeout errors) by staging intermediate data in Azure Blob Storage. This mechanism uses a two-phase commit approach: data is first written to a staging location, and then committed to the final sink only after successful validation, which isolates the pipeline from intermittent source or sink failures.

Exam trap

The trap here is that candidates often confuse high availability of the integration runtime (Option A) with fault tolerance of the copy activity, not realizing that HA only protects the IR nodes, not the data transfer itself.

How to eliminate wrong answers

Option A is wrong because using a self-hosted integration runtime with high availability improves the reliability of the integration runtime itself (e.g., node failures), but does not address timeout errors caused by the copy activity's interaction with the source or sink; it does not provide retry logic for data transfer failures. Option C is wrong because changing the source to Azure SQL Database does not resolve timeout errors in the copy activity; it merely shifts the source location, and the pipeline could still experience timeouts due to network latency or throttling. Option D is wrong because increasing the degree of copy parallelism can actually exacerbate timeout issues by overwhelming the source or sink with concurrent requests, and it does not provide any fault tolerance or retry mechanism for transient failures.

Full explanation →

354

MCQmedium

Your company uses Azure Data Lake Storage Gen2 for a data lake. You need to implement a security strategy that meets the following requirements: 1) All data must be encrypted at rest using customer-managed keys stored in Azure Key Vault. 2) Access to the storage account must be restricted to specific virtual networks. 3) Users must authenticate using Microsoft Entra ID and be granted read-only access to the 'landing' container. You have configured the storage account with Azure Storage Service Encryption (SSE) using customer-managed keys. You have also configured firewall rules to allow access only from the required virtual network. However, users cannot access the 'landing' container even though they have the Storage Blob Data Reader role. What is the most likely issue?

A.The users have not been granted access to the Key Vault

B.The users do not have the Storage Blob Data Reader role assigned at the container scope

C.The firewall is blocking the users' IP addresses even though they are in the virtual network

D.The Key Vault firewall is blocking access from the storage account

AnswerD

Key Vault firewall must allow Azure services or specific storage account.

Why this answer

When using SSE with customer-managed keys, the storage account must have the 'Allow trusted Microsoft services to access this storage account' setting enabled, or the Key Vault must allow access from the storage account. Option A is wrong because RBAC is correctly assigned. Option B is wrong because firewall is set.

Option D is wrong because they have read role.

Full explanation →

355

MCQeasy

You need to incrementally load new and updated records from a source SQL Server database to Azure Synapse Dedicated SQL Pool. The source table has a LastModifiedDate column. Which Azure Data Factory feature should you use to implement incremental loading efficiently?

A.Alter Row transformation

B.Incremental copy (watermark) pattern using a Lookup activity and a Copy activity

C.Schedule trigger

D.Lookup activity alone

AnswerB

The watermark pattern uses a lookup to get the last watermark value and a copy activity to copy data changed after that watermark.

Why this answer

The incremental copy (watermark) pattern using a Lookup activity and a Copy activity is the correct approach because it allows you to query the source table for the maximum LastModifiedDate value (the watermark), store it in a control table or variable, and then use a Copy activity with a WHERE clause to load only rows where LastModifiedDate is greater than the last run's watermark. This pattern is purpose-built for efficiently handling new and updated records in Azure Data Factory without reprocessing the entire dataset.

Exam trap

The trap here is that candidates often confuse a scheduling mechanism (Schedule trigger) with the actual data processing logic required for incremental loads, or they mistakenly think a single activity like Lookup or Alter Row can handle the entire incremental copy workflow without understanding the need for a watermark pattern.

How to eliminate wrong answers

Option A is wrong because Alter Row transformation is a data flow transformation used to mark rows for insert, update, upsert, or delete in a sink, but it does not provide the incremental loading logic or watermark mechanism needed to identify new/updated rows from a source. Option C is wrong because a Schedule trigger only defines when a pipeline runs (e.g., every hour), but it does not implement the incremental copy logic itself; you still need the watermark pattern inside the pipeline. Option D is wrong because a Lookup activity alone can retrieve the watermark value but cannot copy data; it must be combined with a Copy activity to actually move the incremental rows.

Full explanation →

356

MCQmedium

You are designing a data pipeline in Azure Synapse Analytics to ingest data from Azure Blob Storage into a dedicated SQL pool. The source files are CSV with varying row lengths, and you need to ensure optimal performance for reads. Which file format and compression should you recommend?

A.Avro with Deflate compression

B.CSV with Gzip compression

C.Parquet with Snappy compression

D.ORC with Zlib compression

AnswerC

Parquet is columnar and Snappy provides fast compression/decompression, ideal for Synapse dedicated SQL pool.

Why this answer

Parquet with Snappy compression is optimal for dedicated SQL pools in Azure Synapse Analytics because Parquet is a columnar format that enables efficient predicate pushdown and column pruning, reducing I/O. Snappy provides fast compression/decompression with minimal CPU overhead, which is critical for high-throughput reads in a distributed MPP environment.

Exam trap

Microsoft often tests the misconception that row-based formats like Avro or CSV are suitable for analytical workloads, but the trap here is that columnar formats (Parquet/ORC) are required for optimal read performance in Synapse dedicated SQL pools, and Snappy is preferred over Zlib for speed-critical pipelines.

How to eliminate wrong answers

Option A is wrong because Avro is a row-based format that does not support columnar pruning, leading to higher I/O for analytical queries on dedicated SQL pools. Option B is wrong because CSV with Gzip compression is row-oriented and not splittable at the row level, causing poor parallelism and slower read performance in Synapse. Option D is wrong because ORC with Zlib compression offers higher compression ratios but significantly slower decompression compared to Snappy, which can bottleneck read performance in Synapse's MPP engine.

Full explanation →

357

MCQmedium

Your Azure Synapse Analytics dedicated SQL pool is experiencing performance degradation. Queries that previously completed in seconds now take minutes. You suspect memory pressure and concurrency issues. What should you first review to diagnose the problem?

A.sys.dm_pdw_resource_waits

B.sys.dm_pdw_waits

C.sys.dm_pdw_query_stats_xe

D.sys.dm_pdw_exec_requests

AnswerD

Shows currently running queries with resource consumption

Why this answer

Option C is correct because sys.dm_pdw_exec_requests shows currently running queries and their resource consumption, helping identify memory pressure. Option A is wrong because it shows only resource waits, not the active queries. Option B is wrong because it shows only top resource consumers over time, not current state.

Option D is wrong because it shows query execution details but not real-time resource waits.

Full explanation →

358

Multi-Selecthard

Which THREE are required to run a stored procedure in Azure SQL Database from Azure Data Factory?

Select 3 answers

A.A linked service to Azure SQL Database.

B.A dataset that references the stored procedure.

C.A self-hosted integration runtime.

D.A staging blob storage account.

E.A stored procedure activity in the pipeline.

AnswersA, B, E

The linked service provides connection details.

Why this answer

A linked service to Azure SQL Database is required because it defines the connection string and authentication method (e.g., SQL authentication or managed identity) that Azure Data Factory (ADF) uses to connect to the database. Without this connection configuration, ADF cannot authenticate or communicate with the database to execute the stored procedure.

Exam trap

The trap here is that candidates often think a dataset is mandatory for any activity, but for stored procedure execution, the dataset is only needed if you want to pass parameters or map output, not for the execution itself.

Full explanation →

359

MCQhard

You are designing a data processing pipeline in Azure Synapse Analytics that uses a mapping data flow with Azure Integration Runtime (IR). The pipeline runs slowly and you notice that the IR's data movement is limited by the number of cores. Which configuration should you adjust to improve performance?

A.Enable staging for the copy activity within the data flow.

B.Increase the 'Data Flow Compute Type' and 'Core Count' in the Azure IR settings.

C.Use a Self-Hosted IR instead of Azure IR for data flows.

D.Increase the 'Number of partitions' in the source transformation.

AnswerB

These settings directly allocate more resources to mapping data flows.

Why this answer

The Azure Integration Runtime (IR) for mapping data flows uses a Spark cluster, and its performance is directly tied to the compute resources allocated. By increasing the 'Data Flow Compute Type' (e.g., from General Purpose to Memory Optimized) and the 'Core Count' (e.g., from 4 to 8 or 16 cores), you provide more parallel processing power, which directly addresses the core-limited data movement bottleneck.

Exam trap

The trap here is that candidates confuse the 'Number of partitions' setting (which controls data parallelism within the flow) with the Azure IR's core count (which controls the Spark cluster's overall compute capacity), leading them to pick D instead of B.

How to eliminate wrong answers

Option A is wrong because enabling staging for the copy activity is used to offload data movement to a staging blob store for PolyBase or COPY statement scenarios, not to increase the core count of the Azure IR for mapping data flows. Option C is wrong because Self-Hosted IR is designed for on-premises or private network data sources and does not improve the Spark-based compute performance of a mapping data flow; in fact, it adds network latency. Option D is wrong because increasing the 'Number of partitions' in the source transformation can improve parallelism within the data flow, but it does not change the underlying Azure IR's core count or compute type, which is the root cause of the core-limited bottleneck.

Full explanation →

360

MCQeasy

You are developing a data processing solution that requires aggregating sales data from multiple CSV files stored in Azure Data Lake Storage Gen2. The data should be cleansed and transformed before loading into Azure Synapse Analytics. Which Azure service should you use to implement a code-free transformation pipeline?

A.Azure HDInsight with Hive

B.Azure Analysis Services

C.Azure Data Factory with Mapping Data Flows

D.Azure Databricks with PySpark

AnswerC

Mapping Data Flows provide code-free data transformation at scale.

Why this answer

Azure Data Factory with Mapping Data Flows allows code-free visual transformations. Azure Databricks and HDInsight require code. Azure Analysis Services is for tabular modeling, not data processing.

Full explanation →

361

MCQmedium

You need to ensure that an Azure Synapse Analytics dedicated SQL pool automatically pauses after 2 hours of inactivity to save costs. Which feature should you configure?

A.Maintenance window

B.Data Exfiltration Prevention

C.Auto-pause feature

D.Workload Management

AnswerC

Automatically pauses after specified inactivity.

Why this answer

Azure Synapse Analytics supports auto-pause for dedicated SQL pools (Gen2). You can set an auto-pause delay in minutes. Option A (Data Exfiltration Prevention) is security.

Option B (Workload Management) is for concurrency. Option D (Maintenance Window) is for updates.

Full explanation →

362

MCQmedium

You are reviewing an Azure Data Factory JSON definition for a linked service. The linked service uses a service principal to connect to Azure Data Lake Storage Gen1. What is a security concern with this configuration?

A.The subscription ID and resource group are specified.

B.The linked service uses a service principal instead of a managed identity.

C.Using Azure Data Lake Storage Gen1 instead of Gen2.

D.The service principal key is stored as a SecureString but is visible in the JSON definition.

AnswerD

Storing secrets in linked service JSON is insecure; should use Azure Key Vault.

Why this answer

Option B is correct because storing the service principal key as plain text in the JSON is insecure. Option A is wrong because using Data Lake Storage Gen1 is not a security concern. Option C is wrong because service principal authentication is acceptable.

Option D is wrong because subscription and resource group are necessary for resource management.

Full explanation →

363

MCQmedium

You are designing a data storage solution for a retail company that expects high volumes of small, time-series sensor data from thousands of IoT devices. The data must be stored cost-effectively and queried by time range with low latency. Which Azure data store should you recommend?

A.Azure Cosmos DB with analytical store

B.Azure SQL Database with columnstore indexes

C.Azure Blob Storage with Azure Data Lake Storage Gen2

D.Azure Data Explorer (ADX)

AnswerD

ADX is designed for high-speed ingestion and real-time analytics on time-series data with built-in optimization.

Why this answer

Azure Data Explorer (ADX) is optimized for high-volume, time-series data ingestion and low-latency queries over time ranges. It uses a columnar storage engine and automatic indexing, making it cost-effective for sensor data from thousands of IoT devices.

Exam trap

The trap here is that candidates often choose Azure Blob Storage with Data Lake Storage Gen2 because it is cheap for storage, but they overlook that it lacks native query capabilities for low-latency time-series queries, requiring additional compute layers.

How to eliminate wrong answers

Option A is wrong because Azure Cosmos DB with analytical store is designed for globally distributed, multi-model data with operational and analytical workloads, not specifically for high-throughput time-series ingestion and time-range queries at low cost. Option B is wrong because Azure SQL Database with columnstore indexes is a relational database optimized for transactional workloads and analytical queries on structured data, but it cannot efficiently handle the high ingestion rates and time-series query patterns of thousands of IoT devices without significant cost and performance overhead. Option C is wrong because Azure Blob Storage with Azure Data Lake Storage Gen2 is a hierarchical file storage for big data analytics, not a query engine; it requires additional compute services like Azure Synapse or Spark to query time-series data, adding latency and complexity.

Full explanation →

364

MCQhard

Your Azure Data Factory pipeline uses a Copy activity to load data from an on-premises SQL Server to Azure Blob Storage. You notice that the pipeline is running slower than expected. You need to identify the bottleneck. Which Data Factory monitoring metric should you analyze first?

A.Source queue length

B.Pipeline duration

C.Activity run count

D.Data Integration Unit (DIU) consumption

AnswerD

High DIU consumption indicates the Copy activity is resource-constrained.

Why this answer

Option B is correct because Data Integration Unit (DIU) consumption indicates whether the Copy activity is resource-bound. Option A is wrong because pipeline duration is a result, not a bottleneck indicator. Option C is wrong because activity run count is not relevant to performance.

Option D is wrong because source queue length is for integration runtime, not directly for Copy activity throughput.

Full explanation →

365

MCQmedium

Your organization uses Azure Synapse Analytics dedicated SQL pool. You need to ensure that all data at rest in the SQL pool is encrypted using a customer-managed key stored in Azure Key Vault. What should you configure?

A.Implement Always Encrypted with column encryption keys stored in Azure Key Vault.

B.Configure Dynamic Data Masking to obfuscate sensitive data.

C.Enable Azure Storage Service Encryption with a customer-managed key.

D.Enable Transparent Data Encryption (TDE) with a customer-managed key in Azure Key Vault.

AnswerD

TDE with customer-managed key provides encryption at rest for the entire database, meeting the requirement.

Why this answer

Option C is correct because Transparent Data Encryption (TDE) with customer-managed keys in Azure Key Vault provides the required encryption. Option A is wrong because Azure Storage Service Encryption is for storage accounts, not SQL pools. Option B is wrong because Always Encrypted protects data in transit and at rest in application logic, not at rest in the database.

Option D is wrong because Dynamic Data Masking does not encrypt data.

Full explanation →

366

MCQmedium

You have a Synapse Analytics dedicated SQL pool. You need to load 100 GB of CSV data from Azure Data Lake Storage Gen2 into a fact table. The table has a hash-distributed column. Which pattern is most efficient for loading with minimal impact on concurrent queries?

A.Use PolyBase INSERT...SELECT with rowstore table

B.Use CREATE TABLE AS SELECT (CTAS) with the hash-distributed column

C.Use COPY INTO command with a round-robin distribution

D.Use Azure Data Factory Copy activity with staging enabled

AnswerB

Why this answer

Option B is correct because CTAS with a hash-distributed column loads data directly into the target table with the same distribution scheme, avoiding data movement and minimizing resource contention. This pattern is optimized for bulk loading large datasets into a hash-distributed fact table, as it leverages the Synapse SQL pool's MPP architecture to parallelize the operation without blocking concurrent queries.

Exam trap

The trap here is that candidates often assume PolyBase or COPY INTO are always the fastest for bulk loading, but they overlook that CTAS with the correct distribution key avoids the costly data redistribution step required by other methods.

Why the other options are wrong

A

INSERT...SELECT logs each row and may cause concurrency issues; not as efficient as CTAS for large loads.

C

COPY INTO is efficient but round-robin distribution may cause data movement later; CTAS with hash distribution avoids extra steps.

D

Copy activity with staging uses PolyBase under the hood but adds overhead; CTAS is more direct.

Full explanation →

367

MCQmedium

Your team has deployed an Azure Stream Analytics job that reads from an Event Hubs input and writes to Azure Synapse Analytics. The job is falling behind, causing a growing backlog in Event Hubs. You have already scaled the Stream Analytics job to maximum streaming units. What should you do to improve throughput?

A.Increase the streaming units further

B.Configure a late arrival window to drop late events

C.Increase the throughput units of the Event Hubs namespace

D.Partition the input Event Hubs and the output Synapse table, and adjust the Stream Analytics query to use PARTITION BY

AnswerD

Partitioning allows Stream Analytics to process data in parallel, increasing throughput.

Why this answer

Option A is correct because partitioning the input and output can increase parallelism. Option B is wrong because late arrival events deal with out-of-order data, not throughput. Option C is wrong because the job is already at maximum streaming units.

Option D is wrong because increasing Event Hubs throughput units may not help if the bottleneck is the output sink.

Full explanation →

368

MCQmedium

Your Azure Synapse Analytics dedicated SQL pool is experiencing performance degradation. You suspect that the workload is generating excessive data movement due to suboptimal distribution. Which dynamic management view (DMV) should you query to identify queries that are causing significant data movement?

A.sys.dm_pdw_node_status

B.sys.dm_pdw_exec_requests

C.sys.dm_pdw_errors

D.sys.dm_pdw_waits

AnswerB

This DMV shows the execution steps of requests, including data movement operations (shuffle moves, broadcast moves) that can degrade performance.

Why this answer

Option B is correct because sys.dm_pdw_exec_requests shows query steps including data movement operations. Option A is wrong because sys.dm_pdw_node_status shows node health. Option C is wrong because sys.dm_pdw_errors shows error details.

Option D is wrong because sys.dm_pdw_waits shows wait states.

Full explanation →

369

MCQeasy

Your company uses Azure Data Lake Storage Gen2 as the central data lake. You need to process batch data using serverless Spark jobs that can be scheduled daily. Which Azure service should you use?

A.Azure Batch with custom Spark containers.

B.Azure Synapse Analytics serverless Spark pool with pipelines.

C.Azure Databricks with a job cluster.

D.Azure Machine Learning with Spark compute.

AnswerB

Synapse provides serverless Spark pools with automatic scaling and built-in scheduling via pipelines.

Why this answer

Option C is correct because Azure Synapse Analytics provides serverless Spark pools with a built-in scheduler. Option A (Azure Databricks) is not serverless (requires cluster management). Option B (Azure Batch) is for custom compute jobs.

Option D (Azure Machine Learning) is for ML workloads.

Full explanation →

370

MCQmedium

You have an Azure Synapse Analytics dedicated SQL pool that handles both high-priority real-time queries and low-priority batch jobs. You need to ensure that high-priority queries always get the resources they need, while batch jobs do not starve. What should you configure?

A.Enable result-set caching for the high-priority queries

B.Enable data compression on the tables used by batch jobs

C.Create workload groups for high-priority and low-priority queries, assigning appropriate importance and resource percentages

D.Create materialized views for the batch job queries

AnswerC

Workload groups allow you to control resource allocation and query importance.

Why this answer

Option C is correct because workload management with workload groups allows you to set importance and resource allocation. Option A is wrong because result-set caching does not prioritize queries. Option B is wrong because materialized views improve performance but do not prioritize.

Option D is wrong because data compression reduces storage but does not affect prioritization.

Full explanation →

371

Multi-Selecteasy

Which TWO of the following are required components to set up a data pipeline that uses Change Data Capture (CDC) to incrementally load data from SQL Server to Azure Synapse using Azure Data Factory?

Select 2 answers

A.CDC enabled on the source SQL Server database and tables

B.A staging Azure Blob Storage account

C.A Lookup activity to get the last watermark

D.A stored procedure in the source database to capture changes

E.A linked service to the Azure Synapse dedicated SQL pool

AnswersA, E

CDC must be enabled on the source to track changes.

Why this answer

Option A is correct because Change Data Capture (CDC) must be enabled on the source SQL Server database and the specific tables you intend to track. Without CDC enabled, SQL Server will not generate the change tracking tables (e.g., cdc.<capture_instance>_CT) that Azure Data Factory’s CDC connector reads to identify inserts, updates, and deletes. This is a prerequisite for any incremental load using the native CDC mechanism in ADF.

Exam trap

The trap here is that candidates often confuse CDC-based incremental loading with watermark-based incremental loading, leading them to incorrectly select a Lookup activity (Option C) or a staging storage account (Option B) as required components.

Full explanation →

372

MCQhard

You are designing a data pipeline using Azure Synapse Pipelines. The pipeline ingests data from multiple sources, performs transformations using a notebook, and loads the results into a dedicated SQL pool. You need to ensure that if the notebook fails, the entire pipeline stops and sends an alert. What is the most efficient way to configure this?

A.Set the notebook activity's error path to a webhook activity that sends an alert, and then set a wildcard error path for the pipeline.

B.Set the notebook activity's retry count to 0, and configure an alert on the pipeline run failure.

C.Add a 'Fail' activity after the notebook activity and connect the notebook's failure output to it. Configure the Fail activity to send an alert.

D.No configuration needed; by default, a failed activity stops the entire pipeline.

AnswerC

Correct: The Fail activity terminates the pipeline with an error, and you can trigger alerts based on pipeline failure.

Why this answer

Option B is correct because in Azure Synapse Pipelines (or Azure Data Factory), you can set the activity's 'Failure path' to go to a 'Fail' activity that terminates the pipeline and can trigger an alert via webhook or email. Option A is wrong because setting retry to 0 does not stop the pipeline; it just doesn't retry. Option C is wrong because a wildcard error path would still allow other activities to run if not explicitly failed.

Option D is wrong because the default behavior is to continue if there's no error path defined.

Full explanation →

373

MCQmedium

You are building a real-time dashboard to monitor user activity on a website. The data is ingested via Azure Event Hubs and must be aggregated every minute with a 30-second late-arrival tolerance. The aggregated results should be stored in Azure Cosmos DB for low-latency reads. Which Azure service should you use to perform the windowed aggregation?

A.Azure Stream Analytics with a tumbling window of 1 minute and a late-arrival policy of 30 seconds.

B.Azure Functions triggered by Event Hubs to aggregate data and write to Cosmos DB.

C.Azure Databricks with structured streaming and a sliding window.

D.Azure Analysis Services to process streaming data directly from Event Hubs.

AnswerA

Stream Analytics provides built-in windowing functions and late-arrival handling, perfect for this scenario.

Why this answer

Azure Stream Analytics is the correct choice because it natively supports windowed aggregations (tumbling, hopping, sliding, session) and allows you to define a late-arrival policy to handle out-of-order events. A tumbling window of 1 minute with a late-arrival tolerance of 30 seconds meets the requirement exactly, and the output can be directly written to Azure Cosmos DB for low-latency reads.

Exam trap

The trap here is that candidates often confuse tumbling windows (fixed, non-overlapping) with sliding windows (continuous, overlapping) or assume that any compute service (like Functions or Databricks) can easily replicate Stream Analytics' built-in windowing and late-arrival handling, ignoring the complexity of state management and exactly-once semantics.

How to eliminate wrong answers

Option B is wrong because Azure Functions triggered by Event Hubs do not provide built-in windowing or late-arrival policy support; you would have to manually implement stateful aggregation, which is complex and error-prone. Option C is wrong because Azure Databricks with structured streaming uses a sliding window, not a tumbling window, and does not offer a native late-arrival policy configuration as simple as Stream Analytics; it also introduces unnecessary overhead for this real-time dashboard scenario. Option D is wrong because Azure Analysis Services is an OLAP engine for analytical queries on pre-aggregated data, not a real-time stream processing service; it cannot directly process streaming data from Event Hubs.

Full explanation →

374

Multi-Selecthard

Which THREE factors should be considered when designing a partitioning strategy for a large fact table in Azure Synapse Analytics dedicated SQL pool?

Select 3 answers

A.Index type

B.Data compression

C.Distribution key

D.Partition column

E.Number of partitions

AnswersB, D, E

Partition size affects compression efficiency.

Why this answer

Data compression is a key factor in partitioning strategy because it directly impacts storage efficiency and query performance. In Azure Synapse Analytics dedicated SQL pool, columnstore compression is applied per partition, and choosing the right compression level (e.g., COLUMNSTORE vs. COLUMNSTORE_ARCHIVE) can significantly reduce I/O and improve scan speeds for large fact tables.

Exam trap

The trap here is that candidates confuse distribution key (which controls data placement across nodes) with partition column (which controls data organization within a node), leading them to incorrectly select distribution key as a partitioning factor.

Full explanation →

375

Multi-Selecteasy

You are designing a data processing solution in Azure Data Factory that uses mapping data flows. You need to perform type conversions on incoming data. Which three transformations can be used to change data types? (Choose three.)

Select 3 answers

A.Conditional Split

B.Derived Column

C.Assert

D.Sort

E.Select

AnswersB, C, E

Derived Column allows type conversion via expressions.

Why this answer

Options A, B, and C are correct. Derived Column can change types via expressions. Select can cast columns during projection.

Assert can check and convert types. Option D is wrong because Sort only orders data. Option E is wrong because Conditional Split routes rows based on conditions, not type conversion.

Full explanation →

Page 5 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →