Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 1–75

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 1 of 12

1

Multi-Selectmedium

You are securing an Azure Data Lake Storage Gen2 account that contains sensitive data. Which TWO of the following should you implement to protect data from unauthorized access?

Select 2 answers

A.Configure ACLs to grant least privilege to users and groups

B.Use private endpoints to restrict access to the storage account

C.Set the default ACL to allow read access for all authenticated users

D.Enable CORS rules to allow only specific origins

E.Enable large file shares on the storage account

AnswersA, B

ACLs allow granular permissions at the file and directory level.

Why this answer

Option A is correct because private endpoints ensure data traffic does not traverse the public internet. Option C is correct because ACLs provide fine-grained access control at the directory/file level. Option B is wrong because cross-origin resource sharing (CORS) is for web applications, not security.

Option D is wrong because a default ACL of 'read' for all authenticated users would grant broad access. Option E is wrong because large file shares are for Azure Files, not ADLS Gen2.

Full explanation →

2

MCQeasy

You are monitoring Azure Data Factory pipelines. You need to set up an alert when a pipeline run fails. What is the most efficient way to achieve this?

A.Add an activity in each pipeline to log failures to a custom table in Log Analytics.

B.Enable diagnostic settings to send pipeline runs to a Log Analytics workspace and create a log alert.

C.Create an Azure Monitor alert based on the 'Failed pipeline runs' metric.

D.Use Application Insights to monitor pipeline executions.

AnswerC

This is the simplest method using built-in metrics.

Why this answer

Option A is correct because Azure Monitor alerts can be configured on Data Factory metrics like 'Failed pipeline runs'. Option B is wrong because custom logging in a sink requires additional effort. Option C is wrong because diagnostic settings send logs to a workspace, but alerts are easier with metrics.

Option D is wrong because Application Insights is not natively integrated with Data Factory for pipeline runs.

Full explanation →

3

MCQhard

Your company runs a streaming job in Azure Stream Analytics that ingests data from Event Hubs and outputs to Azure Synapse Analytics. The job is failing with a 'Watermark delay' alert and the output to Synapse is delayed by over 30 minutes. The input rate is 5,000 events per second. The job uses a 1-minute tumbling window. What is the most likely cause of the delay?

A.The output schema in Synapse does not match the Stream Analytics output.

B.The Event Hubs has a large number of late-arriving events.

C.The tumbling window size is too large.

D.The Stream Analytics job is under-provisioned in terms of Streaming Units (SUs).

AnswerD

Insufficient SUs cause backpressure and watermark delay.

Why this answer

Option D is correct because a watermark delay alert in Azure Stream Analytics indicates that the job is falling behind in processing incoming data. With an input rate of 5,000 events per second and a 1-minute tumbling window, the job requires sufficient Streaming Units (SUs) to keep up. Under-provisioned SUs cause backpressure, leading to output delays exceeding 30 minutes.

Exam trap

The trap here is that candidates may confuse a watermark delay alert with late-arriving events (Option B), but the alert indicates the job is falling behind overall, not just handling late data, and the 30-minute delay points to insufficient compute resources rather than data timing issues.

How to eliminate wrong answers

Option A is wrong because a schema mismatch between Stream Analytics output and Synapse would cause data write errors or failures, not a watermark delay alert or a 30-minute output delay. Option B is wrong because a large number of late-arriving events would increase the watermark delay but the alert specifically indicates the job is falling behind overall processing, not just handling late data; late events are managed by the late arrival policy and do not inherently cause a 30-minute delay. Option C is wrong because a 1-minute tumbling window is small and appropriate for real-time analytics; a larger window would reduce processing frequency, not cause delay.

Full explanation →

4

Multi-Selecteasy

Which TWO methods can you use to authenticate an Azure Data Factory pipeline to access Azure Data Lake Storage Gen2? (Choose TWO)

Select 2 answers

A.Use a service principal with a client secret or certificate.

B.Use a shared access signature (SAS) token.

C.Use a managed identity assigned to the Data Factory.

D.Use a user account with password.

E.Use the storage account access key.

AnswersA, C

Service principal is a standard authentication method.

Why this answer

Correct answers: A and C. A: Managed identity is a recommended method for Azure services. C: Service principal is another common method.

B is wrong because storage account key is not recommended for production due to security risks. D is wrong because SAS tokens are less secure and have expiration. E is wrong because user password is not supported.

Full explanation →

5

MCQhard

You are designing a data processing solution for a financial services company. The solution must process sensitive customer data in Azure Databricks with encryption at rest and in transit, and must also support row-level security (RLS). Which combination of features should you use?

A.Use Azure SQL Database with row-level security and connect Databricks via JDBC

B.Use Azure Databricks with Azure Key Vault for encryption and implement RLS in application code

C.Use Azure Synapse Analytics dedicated SQL pool with dynamic data masking and always encrypted

D.Use Azure Databricks with Delta Lake and enable row-level security on the Delta table

AnswerD

Delta Lake in Databricks supports RLS natively, and encryption is handled automatically.

Why this answer

Option D is correct because Delta Lake in Azure Databricks now supports row-level security (RLS) natively through Delta Sharing and Spark SQL predicates, allowing you to enforce data access policies directly on the Delta table without external systems. Combined with Azure Databricks' built-in encryption at rest (using Azure Storage Service Encryption or customer-managed keys) and in transit (TLS 1.2+), this meets all requirements without introducing additional services.

Exam trap

The trap here is that candidates often assume row-level security is only available in traditional relational databases like SQL Server or Azure SQL Database, but Delta Lake in Azure Databricks has supported RLS since runtime 10.4 LTS, making it a fully native option for data lakehouse architectures.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database with row-level security requires JDBC connectivity, which introduces a separate database service and does not leverage Databricks' native Delta Lake capabilities; it also adds latency and complexity for encryption management. Option B is wrong because implementing RLS in application code is not a native feature of Azure Databricks and violates the principle of defense in depth—RLS must be enforced at the storage layer, not in application logic, to prevent bypass. Option C is wrong because Azure Synapse Analytics dedicated SQL pool with dynamic data masking and always encrypted does not provide row-level security; dynamic data masking only obfuscates data at query results, and always encrypted protects data at rest and in transit but does not filter rows based on user permissions.

Full explanation →

6

MCQhard

You are a data engineer for a large e-commerce company. The company uses Azure Data Lake Storage Gen2 (ADLS Gen2) as its data lake. A team of data scientists needs to process a massive dataset (approximately 5 TB) stored in Parquet format in the data lake. The dataset contains sales transactions from the past 10 years. The data scientists run a Spark job daily using Azure Synapse Analytics (serverless Spark pool) to compute aggregated sales metrics by product category and region. The job reads the entire dataset each day, performs transformations, and writes the aggregated results back to the data lake. Over the past few weeks, the job has been taking longer to complete, and the data scientists have reported that the job now takes over 6 hours, exceeding the acceptable SLA of 4 hours. They suspect the issue is related to data skew or suboptimal partitioning. You need to optimize the job to reduce execution time. Which approach should you take?

A.Increase the executor memory and cores in the Spark pool configuration to handle larger shuffles.

B.Repartition the data on the 'product_category' column with a higher number of partitions (e.g., 2000).

C.Implement incremental processing using Auto Loader with 'directoryListing' mode to process only new files since the last run.

D.Use a broadcast join hint on the fact table to reduce shuffle operations.

AnswerC

Auto Loader incrementally ingests new files, avoiding a full scan of the 5 TB dataset daily. This directly reduces the data processed and speeds up the job.

Why this answer

Option C is correct because the job reads the entire 5 TB dataset daily, which is inefficient when only new data needs processing. Auto Loader with 'directoryListing' mode incrementally identifies and processes only new files since the last run, drastically reducing the data volume and execution time. This directly addresses the root cause of the SLA breach—reading unchanged historical data repeatedly—rather than tuning resources or partitioning.

Exam trap

The trap here is that candidates focus on tuning Spark parameters (memory, partitions, joins) to handle the existing workload, but the real issue is the unnecessary reprocessing of unchanged data, which only incremental loading can solve.

How to eliminate wrong answers

Option A is wrong because increasing executor memory and cores only improves shuffle performance for existing data volumes; it does not reduce the amount of data read or processed, so the job would still read the full 5 TB daily and likely remain over 6 hours. Option B is wrong because repartitioning on 'product_category' with 2000 partitions may help with data skew but does not eliminate the need to read the entire dataset each day; it also introduces a costly shuffle operation that could worsen performance. Option D is wrong because a broadcast join hint is used to optimize joins by broadcasting a small table to all executors, but the fact table (sales transactions) is massive (5 TB) and cannot be broadcast; this approach would cause out-of-memory errors or be ignored by Spark.

Full explanation →

7

MCQhard

You are monitoring an Azure Data Lake Storage Gen2 account using Metrics and Audit logs. You notice that the 'Ingress' metric shows a sudden spike but the 'Egress' metric remains stable. There are no new storage events in the audit log. What is the most likely cause?

A.The storage account is configured with geo-redundant storage (GRS) and data is being replicated to the secondary region.

B.A Spark job is reading large amounts of data in parallel.

C.An Azure Data Factory pipeline is writing intermediate results to the storage account.

D.An Azure Function is triggered by blob creation events and writes logs to the same account.

AnswerC

Writes increase ingress, and if the pipeline is using staging or intermediate storage, it may not log each write as a separate storage event.

Why this answer

Option C is correct because an Azure Data Factory pipeline writing intermediate results to the storage account would cause a spike in 'Ingress' (data written into the account) without a corresponding increase in 'Egress' (data read from the account). The absence of new storage events in the audit log suggests the writes are not triggering blob-level events (e.g., BlobCreated events), which is consistent with Data Factory writing intermediate files using the Azure Blob Storage REST API or SDK without enabling event grid notifications for those specific operations.

Exam trap

The trap here is that candidates confuse 'Ingress' with 'Egress' or assume any write operation must generate a storage event, but Azure Storage events are opt-in and not all write operations (e.g., Data Factory intermediate writes) are configured to emit them.

How to eliminate wrong answers

Option A is wrong because geo-redundant storage (GRS) replication is asynchronous and occurs at the storage system level, not as user-visible 'Ingress' or 'Egress' metrics; replication traffic is internal and does not appear in the account's ingress/egress metrics. Option B is wrong because a Spark job reading data in parallel would increase 'Egress' (data read from storage), not 'Ingress', and the question states Egress remains stable. Option D is wrong because an Azure Function triggered by blob creation events would generate new storage events in the audit log (e.g., BlobCreated events), but the question explicitly states there are no new storage events.

Full explanation →

8

MCQeasy

You are implementing a data processing solution in Azure Databricks. The solution must read data from Azure Data Lake Storage Gen2, transform it using PySpark, and write the results back to a different location in the same storage account. You need to authenticate to the storage account securely without storing secrets in the notebook. What should you use?

A.Service Principal with a client secret stored in the notebook

B.Azure Key Vault-backed secret scope

C.Shared access signature (SAS) token

D.Storage account access key

AnswerB

Secret scopes securely reference secrets from Key Vault.

Why this answer

Option A is correct because Azure Key Vault-backed secret scopes allow you to securely reference secrets without storing them in the notebook. Option B is wrong because Service Principals require managing client secrets. Option C is wrong because Access keys are long-lived and insecure.

Option D is wrong because Shared access signatures can expose tokens.

Full explanation →

9

MCQmedium

Your company uses Azure Data Lake Storage Gen2 and needs to implement a data retention policy that automatically deletes files older than 90 days in a specific container. What should you use?

A.Azure Data Factory pipeline with a Delete activity scheduled to run daily.

B.Azure Policy with a deny effect for files older than 90 days.

C.Azure Storage lifecycle management rule with a filter for the container and a delete action after 90 days.

D.Azure Purview data lifecycle policy.

AnswerC

Lifecycle management can automatically delete blobs based on age, reducing manual effort.

Why this answer

Option B is correct because Azure Storage lifecycle management policies can automatically delete blobs based on age. Option A is wrong because Azure Purview is for governance, not deletion. Option C is wrong because Azure Data Factory can be scheduled to delete, but lifecycle management is simpler.

Option D is wrong because Azure Policy enforces rules, not automated deletion.

Full explanation →

10

MCQmedium

You are developing a streaming solution with Azure Stream Analytics that outputs results to Azure Synapse Analytics. The output requires exactly-once semantics to avoid duplicates. Which configuration should you use?

A.Use an Azure Function output to write data idempotently to Synapse.

B.Configure the output to Azure Synapse Analytics with 'Exactly once' output mode.

C.Set the event serialization format to Avro and enable 'Exactly once' event delivery.

D.Output to Azure Blob Storage and then use a COPY statement in Synapse.

AnswerB

Stream Analytics supports exactly-once output to Synapse dedicated SQL pool.

Why this answer

Azure Stream Analytics supports an 'Exactly once' output mode for Azure Synapse Analytics, which ensures that each event is written exactly once to the destination, preventing duplicates. This mode uses a combination of checkpointing and transactional writes to guarantee idempotent delivery without requiring custom logic.

Exam trap

The trap here is that candidates often confuse 'exactly-once' with 'at-least-once' and assume that idempotent writes via Azure Functions or staging in Blob Storage can achieve the same guarantee, but Stream Analytics' built-in output mode is the only supported way to guarantee exactly-once delivery to Synapse without custom complexity.

How to eliminate wrong answers

Option A is wrong because Azure Functions do not natively provide exactly-once semantics for Stream Analytics outputs; they are at-least-once by default, and implementing idempotency in a function would require custom state management and is not a built-in feature. Option C is wrong because Avro serialization and 'Exactly once event delivery' are not supported output configurations for Azure Synapse Analytics; Stream Analytics uses its own output mode for Synapse, not event delivery settings. Option D is wrong because writing to Blob Storage and then using a COPY statement introduces at-least-once semantics from Stream Analytics to Blob, and the COPY statement can still produce duplicates if the same file is loaded multiple times, failing to achieve end-to-end exactly-once semantics.

Full explanation →

11

MCQeasy

You need to secure data at rest for an Azure Data Lake Storage Gen2 account that contains sensitive financial data. Which configuration should you enable to ensure that data is encrypted using a customer-managed key stored in Azure Key Vault, and that access to the key is logged?

A.Enable Azure Storage encryption with Microsoft-managed keys

B.Implement client-side encryption using Azure Key Vault

C.Enable infrastructure encryption for double encryption

D.Configure Azure Storage encryption with customer-managed keys in Azure Key Vault and enable Key Vault logging

AnswerD

This allows you to control and rotate keys, and logging provides an audit trail.

Why this answer

Option D is correct because Azure Storage encryption with customer-managed keys in Key Vault provides control and logging. Option A is wrong because Microsoft-managed keys are the default but do not provide customer control. Option B is wrong because client-side encryption requires managing keys on the client side.

Option C is wrong because infrastructure encryption adds a second layer but does not use customer-managed keys.

Full explanation →

12

MCQmedium

A company runs a mission-critical Azure Data Factory pipeline that ingests data every hour from Azure Blob Storage into Azure Synapse Dedicated SQL Pool. Recently, the pipeline has been failing with timeout errors during the copy activity. The source blob files are around 500 MB each. Which configuration change would MOST effectively reduce the likelihood of timeout errors?

A.Decrease the 'Batch size' for the copy activity.

B.Change the sink to use PolyBase with staging enabled.

C.Increase the Data Integration Unit (DIU) to 8.

D.Enable 'Enable staging' and set 'Degree of copy parallelism' to a higher value.

AnswerD

Increases parallelism, reducing copy time and timeout likelihood.

Why this answer

Option D is correct because enabling staging allows the copy activity to use Azure Blob Storage as an intermediate staging area, which breaks the 500 MB files into manageable chunks and uses parallel staging writes to the Dedicated SQL Pool. This reduces the load on the single copy session and prevents timeout errors by leveraging the staging engine's retry and parallelization capabilities.

Exam trap

The trap here is that candidates often assume increasing DIUs or decreasing batch size will solve timeout issues, but they fail to recognize that staging is specifically designed to handle large file transfers by breaking them into parallel chunks and providing built-in retry logic.

How to eliminate wrong answers

Option A is wrong because decreasing 'Batch size' reduces the number of rows per batch, which can increase the number of round trips and actually worsen timeout issues for large files. Option B is wrong because PolyBase with staging enabled is a valid approach for bulk loading, but the question specifically asks for a configuration change to reduce timeout errors; PolyBase itself does not inherently address timeout errors during the copy activity—it is a different loading method that may still encounter timeouts if not combined with staging. Option C is wrong because increasing Data Integration Units (DIU) to 8 only improves parallelism within the copy activity for file splits, but for a single 500 MB file, the copy activity still processes it as one unit unless staging is enabled to split it; DIU alone does not mitigate timeout errors caused by long-running single-file transfers.

Full explanation →

13

MCQhard

You have a mission-critical pipeline that processes financial transactions in Azure Synapse Analytics. The pipeline uses Azure Data Factory with a mapping data flow to transform data. You need to ensure high availability and minimal data loss in case of a regional failure. What should you implement?

A.Store the source data in a GRS storage account and use Azure Data Factory to copy from the secondary endpoint.

B.Configure the pipeline to retry on failure and manually restore from backup.

C.Use Azure SQL Database active geo-replication as the source.

D.Use Azure Synapse Link for Cosmos DB to enable near real-time analytics with multi-region writes.

AnswerD

Synapse Link with Cosmos DB's multi-master supports high availability and minimal data loss.

Why this answer

Option D is correct because Azure Synapse Link for Cosmos DB provides near real-time sync with automatic failover. Option A (active geo-replication for SQL pool) is not native to Synapse. Option B (manual restore) has data loss.

Option C (backup to GRS) does not provide automatic failover.

Full explanation →

14

MCQmedium

Your Azure Stream Analytics job is failing to start because the job's storage account is configured to allow access from selected networks. The job uses managed identity for authentication. What should you do to resolve the issue?

A.Create a private endpoint for the Stream Analytics job to access the storage account.

B.Disable the firewall on the storage account to allow all networks.

C.Add the Stream Analytics job's managed identity as a trusted service in the storage account firewall.

D.Generate a SAS token for the storage account and configure the job to use it.

AnswerC

Trusted services bypass firewall rules when using managed identity.

Why this answer

Option C is correct because the Stream Analytics job's managed identity must be added as a trusted service to bypass the network restrictions. Option A is wrong because firewall rules apply to all services; managed identity must be explicitly allowed. Option B is wrong because private endpoints require additional configuration and are not required.

Option D is wrong because SAS tokens are not needed with managed identity.

Full explanation →

15

Multi-Selectmedium

Which THREE security features are available for Azure SQL Database? (Choose three.)

Select 3 answers

A.Always On availability groups

B.Auditing

C.Azure Active Directory authentication

D.Firewall rules

E.Transparent Data Encryption (TDE)

AnswersB, C, E

Correct. Auditing tracks database events.

Why this answer

Auditing is a built-in security feature for Azure SQL Database that tracks database events and writes them to an audit log in your Azure Storage account, Log Analytics workspace, or Event Hubs. It helps maintain regulatory compliance, understand database activity, and gain insights into discrepancies and anomalies that could indicate security concerns.

Exam trap

The trap here is that candidates often confuse high-availability features like Always On availability groups with security features, or they mistakenly think that firewall rules are a database-level security feature rather than a server-level network access control.

Full explanation →

16

MCQhard

Refer to the exhibit. You have an Azure Data Lake Storage Gen2 account with hierarchical namespace enabled. The exhibit shows an Azure CLI command output that lists access permissions. User user@contoso.com is a member of group-analysts@contoso.com. What level of access does user@contoso.com have to the /data/analytics/ directory?

A.Access denied because of the directory-level reader role

B.Read-only access

C.Read and write access

D.No access

AnswerC

The user's root-level contributor role provides read and write access to all directories.

Why this answer

Option C is correct because RBAC permissions are additive. The user has Storage Blob Data Contributor at the root scope, which grants read/write access to all directories. The reader role at the directory scope does not override the broader contributor role.

Option A is wrong because the user has write access via the root scope. Option B is wrong because the user has read access from the root scope already. Option D is wrong because the user has explicit permissions.

Full explanation →

17

MCQhard

You have a streaming pipeline using Azure Stream Analytics that ingests data from Event Hubs and outputs to Azure Synapse Analytics. The job has a high watermark delay and is falling behind. You need to reduce the latency. Which action should you take?

A.Add more partitions to the Event Hubs.

B.Increase the number of Streaming Units (SUs) for the Stream Analytics job.

C.Replace the output with Azure Functions for each event.

D.Change the input to a reference data input.

AnswerB

More SUs provide more compute power to process events faster.

Why this answer

Increasing the number of Streaming Units (SUs) for the Stream Analytics job allocates more compute resources, reducing latency. Adding more Event Hubs partitions may improve throughput but not directly reduce latency if the job is already bottlenecked. Switching to reference data input does not help.

Using Azure Functions for output may add overhead.

Full explanation →

18

Multi-Selecthard

Which TWO techniques should you use to secure sensitive data in Azure Synapse Analytics dedicated SQL pools when implementing column-level security?

Select 2 answers

A.Define dynamic data masking rules to obfuscate sensitive columns for unauthorized users.

B.Enable Transparent Data Encryption (TDE) on the database.

C.Create a security policy with filter predicates to restrict access to specific columns.

D.Use GRANT SELECT on specific columns to authorized users.

E.Implement Always Encrypted to encrypt columns at the application level.

AnswersA, D

Dynamic data masking hides sensitive data from non-privileged users.

Why this answer

Option B is correct because column-level security uses GRANT on specific columns. Option C is correct because dynamic data masking obfuscates data at query time. Option A is wrong because row-level security filters rows, not columns.

Option D is wrong because Always Encrypted is for client-side encryption, not column-level security in Synapse. Option E is wrong because Transparent Data Encryption encrypts data at rest, not at the column level.

Full explanation →

19

Multi-Selecthard

You are optimizing the performance of a large-scale batch processing job in Azure Databricks. The job reads data from Azure Data Lake Storage Gen2, performs transformations, and writes results back. You notice that the job is I/O bound. Which THREE strategies can improve performance? (Choose three.)

Select 3 answers

A.Use Delta Lake format and optimize the table with Z-ordering on frequently filtered columns.

B.Increase the number of partitions in the DataFrame to improve parallelism.

C.Cache the DataFrame in memory after reading to avoid re-reading from disk.

D.Reduce the number of shuffle partitions to minimize data movement.

E.Enable autoscaling on the cluster to add more nodes during processing.

AnswersA, B, C

Z-ordering reduces the amount of data read by pruning files.

Why this answer

Options A, B, and E are correct. Option A: Increasing the number of partitions can improve parallelism. Option B: Using Delta Lake and optimizing the file layout reduces I/O.

Option E: Caching intermediate results reduces repeated reads. Option C is wrong because reducing shuffle partitions can cause OOM errors. Option D is wrong because autoscaling may not help if the bottleneck is I/O.

Full explanation →

20

MCQeasy

You are designing a data processing solution for a real-time dashboard. The source data is streaming from IoT devices at a rate of 10,000 events per second. Each event is under 1 KB. You need to store the raw data in a staging area before processing, and you must minimize storage costs. Which storage solution should you use?

A.Azure Data Lake Storage Gen2

B.Azure Blob Storage

C.Azure SQL Database

D.Azure Cosmos DB

AnswerB

Blob Storage provides low-cost object storage for large volumes of streaming data, ideal for staging before processing.

Why this answer

Azure Blob Storage is cost-effective for large volumes of streaming data that will be processed later. Azure Data Lake Storage Gen2 is built on Blob Storage but adds hierarchical namespace, which is not required for simple staging. Azure SQL Database and Azure Cosmos DB are more expensive and suited for transactional or real-time query workloads.

Full explanation →

21

MCQeasy

You are processing CSV files in Azure Synapse Analytics serverless SQL pool. The files contain newline characters within quoted fields, causing parsing errors. Which file format option should you specify to handle this correctly?

A.Use ROWTERMINATOR = '0x0a'

B.Use FORMAT = 'CSV'

C.Use FIELDQUOTE = '"'

D.Use PARSER_VERSION = '2.0'

AnswerD

Parser version 2.0 supports quoted fields with embedded newlines.

Why this answer

Option D is correct because PARSER_VERSION = '2.0' in Azure Synapse serverless SQL pool supports quoted fields that contain embedded newline characters (CR/LF). The parser version 2.0 is designed to handle CSV files with complex quoting rules, including newlines within quoted fields, which the older parser version 1.0 cannot parse correctly.

Exam trap

The trap here is that candidates often assume FIELDQUOTE or ROWTERMINATOR can fix embedded newlines, but only the parser version upgrade (2.0) changes the underlying parsing logic to treat newlines inside quotes as part of the field value.

How to eliminate wrong answers

Option A is wrong because ROWTERMINATOR = '0x0a' only specifies the row delimiter (line feed) and does not address the issue of newline characters inside quoted fields; it would still cause parsing errors when newlines appear within quotes. Option B is wrong because FORMAT = 'CSV' is a general format specification that does not by itself enable handling of embedded newlines in quoted fields; it relies on the parser version for such behavior. Option C is wrong because FIELDQUOTE = '"' defines the quote character (default is double quote) but does not change how the parser treats newlines within quoted fields; the parser version determines that behavior.

Full explanation →

22

Multi-Selecteasy

Which TWO options are correct for configuring a copy activity in Azure Data Factory to load data from Azure Blob Storage to Azure SQL Database?

Select 2 answers

A.Use a staging table in Azure SQL Database before inserting into the final table.

B.Use staging via Azure Blob Storage when loading large volumes to improve performance.

C.Use Azure Data Lake Storage as the staging location for better throughput.

D.Use PolyBase to load directly from Blob Storage to Azure SQL Database.

E.Always skip staging to reduce latency.

AnswersA, B

Staging table allows data validation and transformation before final load.

Why this answer

Correct options: A (staging via Blob Storage for large loads) and D (using a staging table and then inserting into final). B is wrong because staging cannot be skipped for large data. C is wrong because Blob Storage is not an intermediate for Azure SQL.

E is wrong because PolyBase is not supported for Azure SQL Database.

Full explanation →

23

MCQeasy

You are monitoring an Azure Data Factory pipeline that runs daily to load data from an on-premises SQL Server to Azure Synapse Analytics. The pipeline has been failing intermittently with timeout errors. You need to configure alerts to notify the team when the pipeline fails. What is the most efficient way to achieve this?

A.Enable diagnostic settings to send pipeline logs to a Log Analytics workspace and create a log search alert.

B.Use Azure Service Health alerts to notify when Data Factory is unhealthy.

C.Create an alert rule in Azure Monitor using the 'Pipeline failed runs' metric.

D.Configure Microsoft Defender XDR to monitor pipeline activities.

AnswerC

Azure Data Factory exposes pipeline run metrics in Azure Monitor, allowing direct alert creation based on failures.

Why this answer

Option B is correct because Azure Monitor alerts on pipeline run metrics can trigger notifications based on failure conditions. Option A is wrong because Azure Log Analytics itself does not send alerts; it stores logs. Option C is wrong because Azure Service Health alerts are for Azure service issues, not pipeline failures.

Option D is wrong because Microsoft Defender XDR is for security, not pipeline monitoring.

Full explanation →

24

MCQmedium

Your organization uses Azure Data Lake Storage Gen2 for a data lake. You need to prevent accidental deletion of data by enabling a soft delete policy. Which configuration is required?

A.Apply an Azure Resource Manager lock.

B.Configure Azure Backup for the storage account.

C.Enable blob versioning.

D.Enable blob soft delete on the storage account.

AnswerD

Soft delete retains deleted blobs for a specified period.

Why this answer

Option A is correct because Azure Data Lake Storage Gen2 supports soft delete for blobs. Option B is wrong because Azure Backup is for VM and other sources, not for Data Lake. Option C is wrong because versioning is different from soft delete.

Option D is wrong because Azure Resource Manager locks prevent resource deletion, not data deletion.

Full explanation →

25

Multi-Selecthard

You are designing a delta lake architecture in Azure Synapse Analytics. Which TWO practices should you follow to ensure ACID transactions and data consistency?

Select 2 answers

A.Disable schema evolution to prevent accidental schema changes

B.Run OPTIMIZE commands frequently to compact small files

C.Use the Delta Lake transaction log for all write operations

D.Store data in CSV format to simplify schema enforcement

E.Enable write-ahead logging to support concurrent reads and writes

AnswersC, E

The transaction log ensures ACID compliance by recording all operations.

Why this answer

Option C is correct because the Delta Lake transaction log is the core mechanism that enables ACID transactions. Every write operation (insert, update, delete, merge) is recorded as an atomic commit in the transaction log, ensuring that concurrent readers see a consistent snapshot and that partial writes are never visible. Without this log, Delta Lake cannot guarantee atomicity or isolation.

Exam trap

The trap here is that candidates often confuse performance optimization (OPTIMIZE) or file format choice (CSV) with ACID transaction guarantees, but the exam specifically tests the understanding that the transaction log is the fundamental enabler of atomicity, consistency, isolation, and durability in Delta Lake.

Full explanation →

26

MCQeasy

You are tasked with transforming data in an Azure Synapse Analytics pipeline using a mapping data flow. The source data contains a column 'FullName' in the format 'LastName, FirstName'. You need to split this into two separate columns: 'LastName' and 'FirstName'. Which transformation should you use?

A.Pivot transformation

B.Aggregate transformation

C.Lookup transformation

D.Derived Column transformation

AnswerD

Derived column can use expressions to split strings.

Why this answer

Option B is correct because the Derived Column transformation allows you to create new columns using expressions, including string functions like split(). Option A is wrong because Aggregate is for grouping and aggregations, not string manipulation. Option C is wrong because Pivot rotates data from rows to columns.

Option D is wrong because Lookup is for joining data from another source.

Full explanation →

27

MCQhard

Your team uses Azure Databricks for data processing. You need to implement a cost-control strategy that automatically terminates idle clusters after 30 minutes of inactivity, but allows users to override this policy for specific workloads that require long-running clusters. What is the most efficient approach?

A.Instruct all users to set auto-termination to 30 minutes on each cluster they create.

B.Configure a global auto-termination setting in the Azure Databricks workspace that terminates all clusters after 30 minutes of inactivity.

C.Use Azure Policy to enforce a tag that triggers a function to terminate idle clusters.

D.Create a cluster policy that enforces auto-termination with a default of 30 minutes, but allows users to override the value for specific clusters.

AnswerD

Correct: Cluster policies provide a balance between governance and flexibility, allowing overrides as needed.

Why this answer

Option D is correct because cluster policies in Azure Databricks allow admins to define rules (e.g., auto-termination) while allowing users to choose different values within allowed limits. Option A is wrong because manual termination is not automatic. Option B is wrong because cluster policies are not enforced by Azure Policy.

Option C is wrong because auto-termination is a cluster-level setting, not a global policy.

Full explanation →

28

MCQhard

You are migrating a large on-premises SQL Server database to Azure Synapse Analytics. The database includes tables with up to 500 million rows and frequent updates. You need to minimize data movement during the migration while ensuring optimal query performance in the dedicated SQL pool. Which table design strategy should you use?

A.Use hash-distributed tables for all tables and clustered columnstore indexes.

B.Use replicated tables for all fact tables and hash-distributed tables for dimension tables.

C.Use round-robin tables for all tables to simplify the migration.

D.Use round-robin tables for staging tables and hash-distributed tables for large fact tables on a key column.

AnswerD

Round-robin minimizes data movement; hash distribution optimizes joins.

Why this answer

Option D is correct because it uses round-robin tables for staging to minimize data movement during the initial load, then hash-distributes large fact tables on a key column to optimize query performance by collocating rows with the same distribution key on the same compute node. This balances the need for fast ingestion with efficient parallel query execution in Azure Synapse Analytics dedicated SQL pools.

Exam trap

The trap here is that candidates often assume hash-distributed tables are always the best choice for all tables, overlooking the fact that round-robin tables reduce data movement during migration and that hash distribution should be reserved for large fact tables to avoid skew and unnecessary shuffling.

How to eliminate wrong answers

Option A is wrong because using hash-distributed tables for all tables, including small dimension tables, can cause unnecessary data shuffling and skew, and clustered columnstore indexes are not optimal for tables with frequent updates due to high overhead in maintaining columnstore segments. Option B is wrong because replicated tables are designed for small dimension tables (typically < 2 GB), not for large fact tables with up to 500 million rows, as replicating such large tables would cause excessive storage and data movement. Option C is wrong because round-robin tables distribute data randomly across distributions, leading to poor query performance due to data movement during joins and aggregations, and are not suitable for production fact tables in a dedicated SQL pool.

Full explanation →

29

MCQmedium

You have an Azure Databricks workspace that processes sensitive data. The security team requires that all access to the workspace be authenticated using Microsoft Entra ID and that all API calls be audited. Which configuration should you implement?

A.Configure workspace to use Microsoft Entra ID authentication and enable diagnostic settings for audit logs.

B.Enable VNet injection and configure network security groups.

C.Deploy Azure Private Link and disable public access.

D.Configure personal access tokens for API access and enable cluster logs.

AnswerA

Entra ID authentication ensures identity-based access, and diagnostic settings capture API call logs.

Why this answer

Option C is correct because Microsoft Entra ID authentication is required, and diagnostic settings audit API calls. Option A is wrong because VNet injection is for network isolation, not authentication or auditing. Option B is wrong because personal access tokens bypass Entra ID.

Option D is wrong because Azure Private Link is for private connectivity.

Full explanation →

30

MCQeasy

Your organization uses Azure Data Factory to orchestrate data pipelines. You need to ensure that sensitive data is not exposed in pipeline logs. What should you configure?

A.Store connection strings in Azure Key Vault.

B.Enable 'Secure output' on pipeline activities.

C.Set a retention policy for pipeline logs.

D.Use data flow debug logs with session logs.

AnswerB

Secure output masks sensitive data in activity logs.

Why this answer

Option A is correct because enabling secure output prevents sensitive data from being written to logs. Option B is wrong because data flow debug logs are for debugging, not for masking. Option C is wrong because Azure Key Vault stores secrets but does not mask pipeline logs.

Option D is wrong because setting a TTL does not prevent sensitive data exposure.

Full explanation →

31

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must process streaming data from Azure Event Hubs and store the results in a dedicated SQL pool. You need to choose the most appropriate service for near real-time ingestion with minimal latency. What should you use?

A.Azure Databricks with Structured Streaming

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Functions with Event Hub trigger

AnswerB

Azure Stream Analytics is optimized for low-latency streaming and can directly ingest from Event Hubs and write to SQL pools.

Why this answer

Azure Stream Analytics is the correct choice because it is purpose-built for near real-time stream processing with sub-second latency, directly integrates with Azure Event Hubs as an input source and dedicated SQL pool as an output sink, and provides a SQL-like query language for defining transformations. This minimizes architectural complexity and latency compared to other services.

Exam trap

The trap here is that candidates often confuse 'near real-time' with 'batch processing' and choose Azure Data Factory (option C) because it is a familiar data integration tool, overlooking that it lacks native streaming capabilities and introduces latency from scheduled pipeline runs.

How to eliminate wrong answers

Option A is wrong because Azure Databricks with Structured Streaming introduces additional overhead from cluster startup time and micro-batch processing, which typically results in higher latency (seconds to minutes) compared to Stream Analytics' continuous processing model. Option C is wrong because Azure Data Factory is a batch-oriented ETL/ELT orchestration service that does not support native streaming ingestion; it polls sources on a schedule, introducing minutes of latency. Option D is wrong because Azure Functions with Event Hub trigger processes events one at a time in a serverless compute model, which can lead to cold-start delays and lacks built-in windowing, aggregation, and exactly-once semantics for streaming workloads.

Full explanation →

32

Multi-Selecthard

Which THREE components are required to implement a change data capture (CDC) solution in Azure Data Factory for incremental loading from Azure SQL Database to Azure Synapse Analytics?

Select 3 answers

A.A Mapping Data Flow to transform the data before load.

B.A Lookup activity to retrieve the last processed watermark.

C.Azure Functions to process change data.

D.A Copy activity to transfer changed data to Synapse.

E.Change Data Capture (CDC) enabled on the source Azure SQL Database tables.

AnswersB, D, E

Needed to get the last high-water mark.

Why this answer

Option B is correct because a Lookup activity is required to retrieve the last processed watermark (e.g., a timestamp or LSN) from a control table or source query. This watermark defines the incremental boundary for CDC, ensuring only new or changed rows are loaded in each pipeline run. Without it, the pipeline would have no way to track progress and would either reload all data or miss changes.

Exam trap

The trap here is that candidates often assume CDC requires a Mapping Data Flow or Azure Functions for transformation or processing, when in fact the core pattern uses only Lookup and Copy activities with change tracking metadata.

Full explanation →

33

MCQmedium

Your company uses Azure Purview for data governance. You need to ensure that sensitive data in Azure Data Lake Storage Gen2 is automatically detected and classified. What should you configure in Purview?

A.Apply sensitivity labels to the storage account using Microsoft Purview Information Protection.

B.Enable Microsoft Defender for Cloud's data sensitivity discovery.

C.Use Azure Policy to enforce tagging of resources containing sensitive data.

D.Create a scan rule set that includes built-in classification rules for sensitive data types.

AnswerD

Purview scanning with classification rules automatically detects sensitive data patterns.

Why this answer

Option A is correct because Purview's scanning and classification rules automatically detect sensitive data. Option B is wrong because sensitivity labels in Microsoft Purview Information Protection are applied manually or via policy, not automatically by Purview scanning. Option C is wrong because Azure Policy enforces resource compliance, not data classification.

Option D is wrong because Defender for Cloud is for security posture, not data classification.

Full explanation →

34

Multi-Selecteasy

Which TWO data processing activities in Azure Data Factory allow you to run custom code?

Select 2 answers

A.Data Flow activity

B.Copy activity

C.Custom activity

D.Azure Function activity

E.Stored Procedure activity

AnswersC, D

Runs custom code on Azure Batch.

Why this answer

Option C is correct because the Custom activity in Azure Data Factory allows you to execute custom code by running a .NET or PowerShell script in an Azure Batch service. This enables you to implement complex transformation logic that is not natively supported by ADF's built-in activities, such as custom data validation or integration with third-party APIs.

Exam trap

The trap here is that candidates often confuse the Custom activity with the Azure Function activity, thinking both run custom code, but the question asks for 'custom code' in the context of ADF's native activities—Azure Function activity is also correct because it runs custom code via an Azure Function, making it a valid choice alongside Custom activity.

Full explanation →

35

Matchingmedium

Match each data storage format to its characteristic.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Columnar storage format optimized for analytics

Row-based format with schema embedded

Columnar format with high compression

ACID transactions on data lakes

Why these pairings

These formats are commonly used in Azure data engineering.

Full explanation →

36

MCQeasy

You are monitoring an Azure Synapse Analytics pipeline that runs daily. The pipeline uses a data flow to transform data. You notice that the data flow is slow and consumes a lot of compute resources. Which action can you take to optimize performance?

A.Use a staging table to load data before transformation

B.Increase the batch size in the data flow

C.Use PolyBase to load data into the dedicated SQL pool

D.Partition the data flow by a key column

AnswerD

Partitioning allows parallel execution across nodes, improving throughput.

Why this answer

Partitioning the data flow by a key column can improve performance by allowing parallel processing. Option A is wrong because using a staging table adds overhead. Option B is wrong because increasing batch size may cause memory issues.

Option D is wrong because data flows do not use PolyBase.

Full explanation →

37

Multi-Selectmedium

You are designing an ETL process in Azure Data Factory. You need to transform data using Mapping Data Flows. Which THREE of the following transformations are available in Mapping Data Flows?

Select 3 answers

A.Pivot

B.Derived Column

C.Union All

D.Aggregate

E.Merge Join

AnswersA, B, D

Used to rotate data from rows to columns.

Why this answer

Option A is correct because the Pivot transformation in Mapping Data Flows allows you to rotate data from rows into columns, enabling aggregation and restructuring of datasets. It is a native transformation in Azure Data Factory's Mapping Data Flows, supporting group-by columns, pivot keys, and aggregated column values.

Exam trap

The trap here is that candidates confuse the 'Union All' and 'Merge Join' names from other tools (like SSIS or T-SQL) with the actual transformation names in Azure Data Factory Mapping Data Flows, leading them to select options that sound familiar but are not available.

Full explanation →

38

MCQmedium

You need to design a near-real-time data processing solution that ingests IoT telemetry data from millions of devices. The data must be aggregated per minute and stored in Azure Cosmos DB for low-latency queries. Which Azure service combination should you use?

A.Azure Event Hubs -> Azure HDInsight (Kafka) -> Azure Cosmos DB

B.Azure Event Hubs -> Azure Stream Analytics -> Azure Cosmos DB

C.Azure IoT Hub -> Azure Databricks (Structured Streaming) -> Azure Cosmos DB

D.Azure Event Hubs -> Azure Data Factory -> Azure Cosmos DB

AnswerB

Stream Analytics provides near-real-time aggregation.

Why this answer

Option B is correct because Azure Stream Analytics provides native, low-latency windowed aggregation (e.g., TumblingWindow for per-minute aggregates) directly on data ingested from Event Hubs, and it has a built-in output sink to Azure Cosmos DB. This combination meets the near-real-time requirement without needing an intermediate compute or storage layer, minimizing end-to-end latency.

Exam trap

The trap here is that candidates often over-engineer the solution by adding a big-data processing layer (like HDInsight or Databricks) when a simpler, fully managed stream analytics service (Azure Stream Analytics) is the correct choice for fixed-window aggregation and direct Cosmos DB output.

How to eliminate wrong answers

Option A is wrong because Azure HDInsight (Kafka) introduces unnecessary complexity and latency for simple per-minute aggregation; it requires manual stream processing setup and is not optimized for direct, low-latency output to Cosmos DB. Option C is wrong because Azure Databricks Structured Streaming, while capable, adds startup and cluster management overhead that is not ideal for sub-minute latency, and IoT Hub is typically used for device management and bi-directional communication, not purely for high-throughput telemetry ingestion. Option D is wrong because Azure Data Factory is a batch-oriented orchestration service, not designed for near-real-time stream processing or windowed aggregation.

Full explanation →

39

Multi-Selecthard

Your Azure Synapse Analytics dedicated SQL pool is experiencing high resource utilization. You need to identify queries that are consuming excessive memory. Which THREE DMVs should you query?

Select 3 answers

A.sys.dm_pdw_sql_requests

B.sys.dm_pdw_nodes_resource_usage

C.sys.dm_pdw_request_steps

D.sys.dm_pdw_exec_requests

E.sys.dm_pdw_waits

AnswersA, C, D

Shows SQL operations on distributions.

Why this answer

sys.dm_pdw_exec_requests shows requests, sys.dm_pdw_request_steps shows steps, and sys.dm_pdw_sql_requests shows distributed SQL requests. Option D is for resource waits, Option E is for node resource usage.

Full explanation →

40

Multi-Selecteasy

Which TWO are valid output sinks for an Azure Stream Analytics job? (Choose two.)

Select 2 answers

A.Azure SQL Data Warehouse (dedicated SQL pool)

B.Azure Blob Storage

C.Azure Cosmos DB (SQL API)

D.Azure Table Storage

E.Azure Synapse Analytics

AnswersB, E

Supported as a sink.

Why this answer

Options A and D are correct. Azure Synapse Analytics and Azure Blob Storage are supported sinks. Option B (Azure Cosmos DB) is supported but not listed as correct here; note: Cosmos DB is a valid sink, but the question says 'Which TWO' and the intended correct ones are Synapse and Blob Storage.

Option C (Azure Table Storage) is not directly supported. Option E (Azure SQL Data Warehouse) is the same as Synapse.

Full explanation →

41

MCQhard

You are designing a data processing solution using Azure Databricks. The solution must use Delta Lake for ACID transactions and must optimize storage costs by automatically compacting small files. Which feature should you enable?

A.Run OPTIMIZE command in a scheduled job.

B.Set retention duration for vacuum to 7 days.

C.Enable Z-order indexing on the Delta table.

D.Enable auto-optimize on the Delta table.

AnswerD

Auto-optimize automatically compacts small files during write operations.

Why this answer

Option D is correct because Delta Lake's auto-optimize feature automatically compacts small files during writes. Option A is wrong because OPTIMIZE is a manual command. Option B is wrong because Z-order is for data skipping, not compaction.

Option C is wrong because vacuum removes old files, not compacts.

Full explanation →

42

MCQeasy

You need to monitor the performance of your Azure Synapse Analytics dedicated SQL pool. Which metric should you use to identify queued queries due to concurrency limits?

A.Queued queries

B.DWU percentage

C.Active queries

D.Memory percentage

AnswerA

This metric directly shows queries waiting for concurrency slots.

Why this answer

Option C is correct because 'Queued queries' metric directly shows the number of queries waiting due to concurrency limits. Option A is wrong because DWU percentage shows resource usage, not queuing. Option B is wrong because active queries show running queries.

Option D is wrong because memory percentage shows memory pressure.

Full explanation →

43

MCQmedium

You are troubleshooting a slow-running pipeline in Azure Data Factory that uses a Copy activity to transfer data from Azure Blob Storage to Azure Synapse Analytics. The pipeline processes about 100 GB of CSV files. The copy performance is poor even though the source and sink are in the same region. What is the most likely cause?

A.The copy activity is not using staging and PolyBase

B.The source and sink are in different Azure regions

C.The Data Integration Unit (DIU) setting is too low

D.The source files are compressed

AnswerA

PolyBase dramatically improves load performance.

Why this answer

Option D (The copy activity is not using staging and PolyBase) is correct because PolyBase can significantly improve load performance into Synapse. Option A (Data Integration Unit too low) is possible but less likely than using PolyBase. Option B (Source files are compressed) should not degrade performance if using compressed format.

Option C (Source and sink in different region) is not the case.

Full explanation →

44

MCQmedium

You are monitoring an Azure Synapse Analytics dedicated SQL pool and notice that queries are experiencing excessive wait time due to concurrency slots being exhausted. What is the recommended approach to improve concurrency without increasing cost?

A.Create additional workload groups and assign queries to them.

B.Classify queries using workload classification and assign lower importance to reduce concurrency slot usage.

C.Scale up the dedicated SQL pool to a higher service level.

D.Change the distribution type of tables to round-robin.

AnswerB

Lower importance queries use fewer slots, allowing more concurrent queries.

Why this answer

Option C is correct because using workload classification and assigning a lower importance to background queries can allow critical queries to run while reducing slot consumption for non-critical ones. Option A is wrong because scaling up increases cost and may not be necessary. Option B is wrong because workload groups can limit resources but do not directly increase concurrency slots.

Option D is wrong because changing distribution does not affect concurrency slots.

Full explanation →

45

MCQmedium

Your company uses Azure Synapse Analytics serverless SQL pool to query data in Azure Data Lake Storage Gen2. You need to monitor query performance and identify queries that are consuming excessive resources. Which Azure tool should you use?

A.Azure Monitor Metrics for the serverless SQL pool.

B.Synapse Studio monitoring hub.

C.Azure Log Analytics queries against diagnostic logs.

D.SQL Server Management Studio Query Performance Insight.

AnswerB

Synapse Studio provides detailed monitoring for serverless SQL pool.

Why this answer

Option A is correct because Synapse Studio provides monitoring dashboards for serverless SQL pool. Option B is wrong because Azure Monitor Metrics can show some metrics but not query-level details. Option C is wrong because Log Analytics can be used but requires diagnostic settings.

Option D is wrong because Query Performance Insight is for dedicated SQL pool.

Full explanation →

46

Multi-Selecthard

Which THREE of the following are valid ways to authenticate to Azure Data Lake Storage Gen2 from Azure Synapse serverless SQL pool?

Select 3 answers

A.Storage account key

B.Managed identity of the Synapse workspace

C.Service principal with certificate

D.Microsoft Entra ID (Azure AD) pass-through authentication

E.Shared access signature (SAS) token

AnswersB, D, E

Serverless SQL pool can use the workspace's managed identity to access storage.

Why this answer

Managed identity of the Synapse workspace (Option B) is correct because Azure Synapse serverless SQL pool can use the workspace's system-assigned managed identity to authenticate to Azure Data Lake Storage Gen2 without storing any credentials. This is enabled by granting the managed identity the appropriate RBAC role (e.g., Storage Blob Data Contributor) on the storage account, and then using the `IDENTITY='Managed Identity'` parameter in the `OPENROWSET` or `CREATE EXTERNAL TABLE AS SELECT` statements. It provides a secure, credential-free authentication method that is fully supported by serverless SQL pool.

Exam trap

The trap here is that candidates often assume storage account keys (Option A) are universally supported across all Azure Synapse components, but serverless SQL pool explicitly does not support key-based authentication, requiring token-based methods instead.

Full explanation →

47

MCQhard

Refer to the exhibit. You are reviewing an Azure Policy definition that was created to audit diagnostic settings for Azure Data Lake Storage Gen2. The policy is assigned to a management group containing several storage accounts. You notice that the policy is not flagging storage accounts that lack diagnostic settings. What is the most likely cause?

A.The existence condition incorrectly checks for a specific log category 'StorageRead' instead of any diagnostic setting.

B.The 'field' property in the 'if' condition should be 'type' equals 'Microsoft.Storage/storageAccounts/blobServices' for Data Lake.

C.The effect parameter is not assigned a value; the default 'AuditIfNotExists' is not applied.

D.The policy type is 'Custom', which requires special permissions to evaluate.

AnswerA

The policy should audit if no diagnostic setting exists, not check for a specific category.

Why this answer

Option C is correct because the existence condition checks if a diagnostic setting of category 'StorageRead' exists, but the policy should check for any diagnostic setting. Option A is wrong because the policy type is custom, but that doesn't cause the issue. Option B is wrong because the effect is parameterized and default is AuditIfNotExists, which should work.

Option D is wrong because the field check for type is correct.

Full explanation →

48

Multi-Selecthard

Which TWO options are correct about using Azure Stream Analytics for real-time data processing?

Select 3 answers

A.You can join streaming data with static reference data for enrichment.

B.Stream Analytics supports windowing functions like Tumbling, Hopping, and Sliding windows natively.

C.Stream Analytics can output to any Azure data store without custom code.

D.Stream Analytics guarantees exactly-once processing semantics.

E.You can use custom JavaScript or C# functions for complex transformations.

AnswersA, B, E

Reference data joins are a key feature of Stream Analytics.

Why this answer

Correct: B (reference data joins for enrichment) and D (windowing functions for time-based aggregations). A is wrong because Stream Analytics cannot write to all sinks; some require custom code. C is wrong because windowing is built-in but not for every scenario.

E is wrong because exactly-once semantics depend on the sink.

Full explanation →

49

MCQeasy

You are running a Python script in Azure Databricks that reads a CSV file from DBFS. The script runs successfully in an interactive notebook but fails when executed as a job with the error: 'Path does not exist: dbfs:/tmp/data.csv'. What is the most likely cause?

A.The job is using a different runtime that does not support Python.

B.The file is too large for DBFS.

C.The job cluster does not have permission to access DBFS.

D.The file was uploaded to the workspace filesystem, not to DBFS.

AnswerD

Workspace files are not automatically in DBFS.

Why this answer

Option C is correct because job clusters are ephemeral and may not have the file in DBFS if it was uploaded to the workspace files or requires mounting. Option A is incorrect because the cluster has access. Option B is incorrect because Python is supported.

Option D is incorrect because the file exists in interactive but not in job context.

Full explanation →

50

MCQhard

A healthcare organization stores patient data in Azure SQL Database. They need to encrypt sensitive columns (e.g., SSN) such that only authorized users can decrypt the data at query time. Which feature should they use?

A.Transparent Data Encryption (TDE)

B.Azure Information Protection labels

C.Always Encrypted (with secure enclaves)

D.Dynamic Data Masking

AnswerC

Column-level encryption, decrypted on the client.

Why this answer

Always Encrypted with secure enclaves is the correct choice because it allows sensitive columns (e.g., SSN) to be encrypted at rest and in memory, with decryption occurring only on the client side using keys that are never exposed to the database engine. This ensures that only authorized users with the proper encryption keys can decrypt the data at query time, meeting the requirement for column-level encryption with client-side key management.

Exam trap

The trap here is that candidates often confuse Transparent Data Encryption (TDE) with column-level encryption, mistakenly thinking TDE protects individual columns at query time, when in fact TDE only encrypts data at rest and does not control who can decrypt data during queries.

How to eliminate wrong answers

Option A is wrong because Transparent Data Encryption (TDE) encrypts the entire database at rest (pages and backups) but does not protect data in memory or during query execution, and it does not allow column-level granularity or client-side key control. Option B is wrong because Azure Information Protection labels are used for classifying and protecting documents and emails via sensitivity labels, not for encrypting specific database columns at query time. Option D is wrong because Dynamic Data Masking obfuscates data in query results for unauthorized users but does not encrypt the underlying data; it can be bypassed by users with direct database access or by using certain query patterns.

Full explanation →

51

MCQhard

You have an Azure Synapse Analytics dedicated SQL pool with a large fact table partitioned by date. As data grows, query performance on recent data degrades. You need to improve performance for queries filtering on the current month without affecting queries on older data. What should you do?

A.Implement partition switching to move older partitions to a different filegroup with slower storage

B.Increase the service level objective (SLO) of the dedicated SQL pool

C.Change the table distribution to round-robin

D.Update statistics on the table for the current month partition

AnswerA

D is correct because partition switching isolates recent data and allows targeted performance tuning.

Why this answer

Partition switching allows you to efficiently move older partitions to a different filegroup (e.g., slower or compressed storage) without affecting queries on recent data. This reduces the active data volume in the fact table, improving scan performance for queries filtering on the current month. The operation is metadata-only and does not require data movement, so it does not impact ongoing queries on the remaining partitions.

Exam trap

The trap here is that candidates often confuse partition switching with partition elimination or index maintenance, thinking that updating statistics or scaling resources will solve the performance issue, when the real bottleneck is the sheer volume of data in the table that must be scanned for queries on recent data.

How to eliminate wrong answers

Option B is wrong because increasing the SLO (service level objective) scales up resources for the entire dedicated SQL pool, which is costly and does not specifically target the performance degradation on recent data; it also affects queries on older data unnecessarily. Option C is wrong because changing the table distribution to round-robin would distribute data evenly across distributions, which can improve load performance but does not address the performance issue caused by scanning a large fact table partitioned by date; it may even worsen query performance for date-filtered queries by spreading data across all distributions. Option D is wrong because updating statistics on the current month partition alone does not reduce the amount of data scanned; while fresh statistics help the optimizer choose better plans, the fundamental problem is the volume of data, not the statistics quality.

Full explanation →

52

MCQmedium

You are designing a data pipeline in Azure Data Factory that reads data from an Azure SQL Database and writes to Azure Synapse Analytics dedicated SQL pool. The pipeline will run hourly and process incremental changes. You need to ensure that the pipeline handles retries for transient failures and logs all activities for audit purposes. What should you configure?

A.Enable auditing on the Azure SQL Database and Azure Synapse SQL pool.

B.Configure a tumbling window trigger to rerun the pipeline on failure.

C.Enable pipeline logging to an Azure Storage account and set retry policy on copy activity.

D.Use an Azure Monitor alert to notify on failures.

AnswerC

Logs capture all events; retry handles transient errors.

Why this answer

Option B is correct because enabling pipeline logging to a storage account captures all activity, and retry policy in activities handles transient failures. Option A is wrong because Azure Monitor can monitor but does not log pipeline activities. Option C is wrong because Azure SQL Database audit is for database access, not pipeline.

Option D is wrong because event-based trigger does not provide logging or retry.

Full explanation →

53

MCQhard

You are optimizing an Azure Synapse Analytics dedicated SQL pool. A frequent query scans a large fact table and filters on a date column. You notice that the query uses a full table scan. What is the most effective way to improve query performance?

A.Create a nonclustered index on the date column.

B.Create a clustered columnstore index on the table.

C.Change the distribution to round-robin.

D.Partition the table on the date column.

AnswerD

Enables partition pruning, reducing data scanned.

Why this answer

Option C is correct because a partitioned table on the date column allows partition elimination. Option A is wrong because a clustered columnstore index is good for scans but partition elimination is more effective for filtering. Option B is wrong because round-robin distribution does not help with filtering.

Option D is wrong because a nonclustered index on date may help but partition elimination is more effective for large scans.

Full explanation →

54

Multi-Selectmedium

Which TWO Azure services can be used to monitor and analyze query performance in Azure Synapse Analytics dedicated SQL pool?

Select 2 answers

A.SQL Data Sync

B.Azure Policy

C.Azure Advisor

D.Azure Monitor with Log Analytics

E.Dynamic Management Views (DMVs)

AnswersD, E

Can collect and analyze query performance logs.

Why this answer

Options B and D are correct. DMVs provide detailed query execution statistics, and Azure Monitor with Log Analytics can store and analyze performance data. Option A is wrong because Azure Advisor provides recommendations but not real-time monitoring.

Option C is wrong because SQL Data Sync is for synchronization. Option E is wrong because Azure Policy is for governance.

Full explanation →

55

MCQmedium

You are designing a data lake architecture using Azure Data Lake Storage Gen2. You need to implement a least-privilege security model. Which authorization mechanism should you use for granular control?

A.Use storage account keys for access.

B.Use Azure RBAC roles at the storage account level.

C.Use POSIX-like access control lists (ACLs).

D.Use shared access signatures (SAS) with stored access policies.

AnswerC

ACLs provide granular permissions on files and directories.

Why this answer

Option D is correct because ACLs provide granular permissions at the file and directory level. Option A is wrong because Azure RBAC is at the subscription/resource group level, not granular. Option B is wrong because shared keys provide full access.

Option C is wrong because SAS tokens with stored access policies provide limited granularity but are not as flexible as ACLs.

Full explanation →

56

MCQhard

You are designing a data processing solution for a retail company that uses Azure Synapse Analytics. The solution must process point-of-sale (POS) data from multiple stores. The data arrives in CSV files in Azure Data Lake Storage Gen2. Each store sends a file every hour. You need to process the files as they arrive and load the data into a dedicated SQL pool. The solution must handle late-arriving files (files that arrive after the scheduled processing time) and ensure that the data is consistent. Which approach should you use?

A.Use Azure Data Factory with a Copy activity to load data into a staging table, then use a Data Flow activity to perform upserts.

B.Use Azure Databricks to read the CSV files, perform upserts, and write to the dedicated SQL pool using JDBC.

C.Use PolyBase to create external tables on the CSV files and then use CREATE TABLE AS SELECT to load into the dedicated SQL pool.

D.Use Azure Data Factory with a Copy activity to load data into a staging table in the dedicated SQL pool, then use a Stored Procedure activity to merge the data into the final table.

AnswerD

Handles upserts and late-arriving data effectively.

Why this answer

Option D is correct because using a stored procedure with merge logic can handle upserts and late-arriving data. Azure Data Factory can execute the stored procedure after copying data to a staging table. Option A is wrong because PolyBase with external tables does not handle upserts well.

Option B is wrong because Azure Databricks would be overkill for this scenario. Option C is wrong because partitioning alone does not handle late-arriving data.

Full explanation →

57

MCQhard

You are a data engineer for a gaming company that uses Azure Data Lake Storage Gen2. The data lake stores player event data in JSON format. The data is organized by date and event type. The analytics team frequently runs queries that filter by player ID to analyze individual player behavior. These queries are slow because they scan entire daily partitions. You need to improve the performance of queries that filter by player ID without restructuring the entire data lake. The data is stored as JSON files. What should you do?

A.Compress the JSON files using gzip to reduce I/O

B.Convert the data from JSON to Parquet format and partition by player ID

C.Create indexes on the player ID field in the data lake

D.Repartition the data by hour to reduce the data scanned per partition

AnswerB

Parquet allows predicate pushdown and column pruning, speeding up player ID queries.

Why this answer

Option B is correct because converting JSON to Parquet enables columnar storage, which significantly reduces I/O by reading only the columns needed for queries. Partitioning by player ID further improves performance by allowing partition elimination, so queries filter only the relevant partitions instead of scanning entire daily partitions. This approach directly addresses the slow queries without restructuring the entire data lake.

Exam trap

The trap here is that candidates often think compression alone (Option A) or indexing (Option C) can solve performance issues in a data lake, but Azure Data Lake Storage Gen2 does not support file-level indexes, and compression does not change the fundamental row-scanning nature of JSON queries.

How to eliminate wrong answers

Option A is wrong because compressing JSON with gzip reduces file size but does not change the row-oriented storage format; queries still must read and decompress entire files, and filtering by player ID still requires scanning all rows in each partition. Option C is wrong because Azure Data Lake Storage Gen2 does not support indexing on data files; indexes are a database concept and cannot be applied to files in a data lake. Option D is wrong because repartitioning by hour would create more partitions but does not help queries that filter by player ID; the queries would still scan all partitions unless they also filter by hour, and the player ID is the primary filter, not the time granularity.

Full explanation →

58

MCQhard

Refer to the exhibit. A Stream Analytics job shows increasing watermark delay and input deserialization errors. Which action should be taken first to troubleshoot?

A.Check the input data schema and ensure it matches the query

B.Change the output to a different sink

C.Increase the number of Streaming Units (SUs)

D.Set the watermark delay threshold higher

AnswerA

Deserialization errors are often due to schema mismatch; fixing the data or query resolves the root cause.

Why this answer

The input deserialization errors indicate that some incoming data cannot be parsed correctly. This can cause backpressure and increase watermark delay. Checking the event schema or data format is the first step.

The watermark delay of 120 seconds (max 300) is high but the root cause is likely the deserialization errors.

Full explanation →

59

MCQhard

Your organization uses Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. You notice that queries are slow when filtering on a date column. You need to improve query performance without increasing costs. What should you do?

A.Increase the maximum query concurrency limit

B.Provision a dedicated SQL pool with more DTUs

C.Create a clustered columnstore index on the date column

D.Partition the data by date in the data lake (e.g., folder structure: /year=*/month=*/day=*)

AnswerD

Partition elimination reduces data scanned.

Why this answer

Option D is correct because partitioning the data by date in the data lake (e.g., /year=*/month=*/day=*) allows the serverless SQL pool to leverage partition elimination. When querying with a filter on the date column, the pool can read only the relevant partitions (folders) instead of scanning all Parquet files, drastically reducing I/O and improving query performance at no additional cost.

Exam trap

The trap here is that candidates often confuse serverless SQL pool with dedicated SQL pool and incorrectly choose to create indexes or scale resources, not realizing that serverless SQL pool relies on external data partitioning and file-skipping techniques rather than internal indexing or provisioning.

How to eliminate wrong answers

Option A is wrong because increasing the maximum query concurrency limit does not improve the performance of a single query; it only allows more concurrent queries to run, which could even degrade individual query performance due to resource contention. Option B is wrong because provisioning a dedicated SQL pool with more DTUs increases costs and is not a serverless SQL pool feature; serverless SQL pool scales automatically and does not use DTUs, so this would be an expensive and incorrect solution. Option C is wrong because clustered columnstore indexes are not supported in serverless SQL pool; they are a feature of dedicated SQL pools, and creating one on a date column in a serverless context is not possible.

Full explanation →

60

MCQhard

You are reviewing a Mapping Data Flow in Azure Data Factory that copies data from a CSV file to an Azure Synapse table. The data flow uses 'allowSchemaDrift: true' and 'validateSchema: false'. After running the pipeline, you notice that the target table does not have the expected columns. The CSV file sometimes has extra columns. What is the most likely reason?

A.Schema drift is not enabled on the source.

B.The sink mapping is explicitly defined and does not include auto-mapping for drifted columns.

C.The sink dataset has a fixed schema that does not allow drift.

D.The source dataset has a fixed schema that does not include extra columns.

AnswerB

The sink script defines input columns, so extra columns are not mapped.

Why this answer

Option B is correct because when 'allowSchemaDrift' is enabled on the source, drifted columns are detected but will only be written to the sink if auto-mapping is used. If the sink mapping is explicitly defined (e.g., column-by-column mappings), it overrides auto-mapping and drifted columns are ignored. Since the target table is missing expected columns, the explicit mapping likely excludes the drifted columns.

Exam trap

The trap here is that candidates assume enabling 'allowSchemaDrift' on the source automatically writes all columns to the sink, but they overlook that explicit sink mappings override auto-mapping and exclude drifted columns.

How to eliminate wrong answers

Option A is wrong because the question states 'allowSchemaDrift: true' on the source, so schema drift is enabled. Option C is wrong because the sink dataset's schema does not control drift behavior at runtime; the sink's 'allowSchemaDrift' and mapping settings in the data flow determine whether drifted columns are written. Option D is wrong because the source dataset's schema does not prevent drift detection; the 'allowSchemaDrift' setting on the source overrides the dataset schema, and the CSV file's extra columns are still read as drifted columns.

Full explanation →

61

Multi-Selecthard

You are developing a data processing pipeline in Azure Data Factory. The pipeline uses a mapping data flow to transform data. You need to ensure that the data flow can handle schema drift from the source. Which THREE of the following actions should you take? (Select THREE.)

Select 3 answers

A.Use a 'Select' transformation to explicitly choose only the known columns.

B.Enable 'Allow schema drift' in the source settings.

C.Set 'Ignore drift column types' in the sink settings to avoid type mismatch errors.

D.Use 'Column pattern' transformations to apply transformations to drifted columns.

E.Add a 'Derived Column' transformation for every possible column that might be added.

AnswersB, C, D

This allows the data flow to accept additional columns.

Why this answer

Option B is correct because enabling 'Allow schema drift' in the source settings of a mapping data flow tells Azure Data Factory to accept incoming columns that are not defined in the source schema. This is essential for handling schema drift, as it prevents the pipeline from failing when new columns appear in the source data.

Exam trap

The trap here is that candidates often think they must explicitly define every column or use static transformations to handle schema drift, when in fact Azure Data Factory's mapping data flow provides dynamic mechanisms like 'Allow schema drift', column patterns, and drift type handling to automatically accommodate unknown columns.

Full explanation →

62

MCQhard

You are designing a data processing solution using Azure Synapse Analytics serverless SQL pool. You have Parquet files in Azure Data Lake Storage Gen2 partitioned by date and hour. You need to query the data for the last 7 days efficiently, minimizing the amount of data scanned. Which approach should you use?

A.Use OPENROWSET with a wildcard path that includes all partitions and then apply a WHERE clause.

B.Create an external table with partition definition and query with a WHERE clause on the partition column.

C.Create a view using PolyBase external table and query with a date filter.

D.Use OPENROWSET with a file path that limits to the last 7 days' partitions and use a WHERE clause on the date column.

AnswerD

Limiting the file path reduces the files scanned, and the WHERE clause further filters.

Why this answer

Option D is correct because using OPENROWSET with a file path that restricts to the last 7 days' partitions (e.g., '/date=2025-03-*/hour=*/*.parquet') minimizes data scanned by only reading the relevant Parquet files. Applying a WHERE clause on the date column further filters rows within those files, leveraging partition pruning at the file system level. This approach avoids scanning all partitions, which is critical for cost and performance in serverless SQL pool.

Exam trap

The trap here is that candidates confuse serverless SQL pool's behavior with dedicated SQL pool's partition elimination, assuming that a WHERE clause on a partition column in an external table will automatically prune files, when in fact serverless SQL pool requires explicit path restriction in OPENROWSET to avoid scanning all partitions.

How to eliminate wrong answers

Option A is wrong because using a wildcard path that includes all partitions and then applying a WHERE clause does not prune partitions at the file level; serverless SQL pool still scans all files before filtering, resulting in unnecessary data read. Option B is wrong because creating an external table with partition definition in serverless SQL pool does not support automatic partition elimination; the WHERE clause on the partition column does not skip file scans as it would in a dedicated SQL pool. Option C is wrong because PolyBase external tables are not supported in serverless SQL pool; PolyBase is a feature of dedicated SQL pools, and using it here would be invalid.

Full explanation →

63

Multi-Selecthard

A company uses Azure Databricks to process streaming data from Event Hubs. The data is written to a Delta table. The job occasionally fails due to checkpoint corruption. Which THREE measures should you implement to improve reliability?

Select 3 answers

A.Configure checkpointing to a durable storage like Azure Data Lake Storage.

B.Increase the batch interval to reduce load.

C.Increase the cluster size to handle spikes.

D.Use Structured Streaming with `failOnDataLoss` set to false.

E.Implement a retry policy with exponential backoff for transient failures.

AnswersA, D, E

Durable storage prevents corruption.

Why this answer

Option A is correct because checkpointing to durable storage like Azure Data Lake Storage (ADLS) ensures that streaming progress metadata is persisted across cluster restarts and failures. ADLS provides high durability and availability, preventing checkpoint corruption that can occur with local or ephemeral storage, thereby enabling exactly-once processing guarantees in Structured Streaming.

Exam trap

The trap here is that candidates may confuse scaling solutions (increasing cluster size or batch interval) with reliability mechanisms, failing to recognize that checkpoint durability and data loss tolerance are the core mitigations for corruption and streaming failures.

Full explanation →

64

MCQhard

A company uses Azure Data Lake Storage Gen2 with Azure Databricks. They notice that the job to write data into Delta Lake tables takes too long. The data is coming from a streaming source with a high velocity of small writes. Which approach should be taken to optimize write performance?

A.Configure the streaming to write in micro-batches with a higher trigger interval.

B.Increase the cluster size to 16 nodes.

C.Enable 'auto optimize' and 'optimized writes' on the Delta table.

D.Change the output format from Delta to Parquet.

AnswerA

Batching reduces the number of small file writes.

Why this answer

Option A is correct because increasing the trigger interval for micro-batches reduces the frequency of writes, allowing more data to accumulate per batch. This minimizes the overhead of small file commits and metadata operations in Delta Lake, which is the primary bottleneck for high-velocity streaming writes. By batching more records together, the job writes fewer, larger files, improving overall throughput.

Exam trap

The trap here is that candidates often choose 'auto optimize' and 'optimized writes' (Option C) thinking they solve small file problems proactively, but these features are reactive compaction mechanisms that add overhead and do not reduce the frequency of log commits during streaming.

How to eliminate wrong answers

Option B is wrong because simply increasing cluster size to 16 nodes does not address the root cause of small file overhead; it may even exacerbate the problem by creating more concurrent writers producing even smaller files. Option C is wrong because 'auto optimize' and 'optimized writes' are designed to compact small files after they are written, but they do not prevent the initial overhead of many small writes during streaming; they add post-processing cost and latency. Option D is wrong because changing the output format from Delta to Parquet removes ACID transactions, schema enforcement, and time travel capabilities, and does not solve the small file problem—Parquet still suffers from the same small file overhead without the benefits of Delta Lake.

Full explanation →

65

MCQmedium

Your team is migrating an on-premises SQL Server data warehouse to Azure Synapse Analytics. The source has a fact table with 500 million rows and several dimension tables. You need to choose the best distribution strategy for the fact table to minimize data movement during joins. Which distribution type should you use?

A.Hash distribution on the foreign key column used in joins

B.No distribution (single distribution)

C.Replicated distribution

D.Round-robin distribution

AnswerA

Hash distribution on the join key ensures rows with the same key are on the same distribution.

Why this answer

Hash distribution on the foreign key column used in joins ensures that rows with the same join key are co-located on the same distribution node. This minimizes data movement because the join can be performed locally on each node without shuffling data across the compute nodes, which is critical for a 500-million-row fact table.

Exam trap

The trap here is that candidates often confuse replicated distribution as a general performance booster, but they fail to recognize that replicating a large fact table is impractical and that hash distribution on the join key is the correct strategy to minimize data movement for large fact tables.

How to eliminate wrong answers

Option B is wrong because single distribution (no distribution) places all data on one node, causing a bottleneck and eliminating the parallelism benefits of Azure Synapse Analytics, leading to poor performance for large fact tables. Option C is wrong because replicated distribution copies the entire table to each node, which is impractical for a 500-million-row fact table due to excessive storage and maintenance overhead; it is suitable only for smaller dimension tables. Option D is wrong because round-robin distribution distributes rows evenly without considering join keys, so joins require data to be shuffled across nodes, causing significant data movement and slower query performance.

Full explanation →

66

MCQmedium

You are designing a data pipeline that ingests JSON files from Azure Blob Storage into Azure Synapse Analytics using PolyBase. The files contain nested JSON arrays. What should you do to ensure that the data is loaded correctly?

A.Flatten the JSON arrays into a tabular format using Azure Data Factory or Databricks before loading.

B.Create an external table with the JSON file type and use a schema definition.

C.Use the OPENJSON function in T-SQL to parse the JSON during the load.

D.Use PolyBase with a JSON format file specifying the schema.

AnswerA

PolyBase requires tabular data, so flattening is necessary.

Why this answer

Option A is correct because PolyBase in Azure Synapse Analytics cannot directly handle nested JSON arrays; it requires a flat, tabular structure. Azure Data Factory or Databricks can flatten the nested arrays into rows and columns before loading, ensuring compatibility with PolyBase's external table format.

Exam trap

The trap here is that candidates assume PolyBase can handle any JSON structure natively, but it only supports flat JSON files, and they overlook the need for pre-processing nested arrays with tools like Data Factory or Databricks.

How to eliminate wrong answers

Option B is wrong because creating an external table with a JSON file type in PolyBase only works for simple, flat JSON files, not nested arrays; it will fail or produce incorrect results. Option C is wrong because OPENJSON is a T-SQL function used for parsing JSON within a query, but it cannot be used directly in a PolyBase load operation; it would require loading the entire file first, defeating PolyBase's purpose. Option D is wrong because PolyBase does not support a JSON format file for schema specification; it uses format files only for delimited text files (e.g., CSV), not JSON.

Full explanation →

67

MCQhard

Refer to the exhibit. You are deploying an Azure Synapse Analytics dedicated SQL pool using the provided ARM template snippet. After deployment, you need to adjust the performance level to DW200c to handle increased workload. Which parameter should you modify?

A.storageAccountType

B.maxSizeBytes

C.collation

D.sku.name

AnswerD

The sku name specifies the Data Warehouse Unit (DWU) level, e.g., DW100c or DW200c.

Why this answer

Option B is correct because the sku name defines the performance level (DW100c, DW200c, etc.). Option A is wrong because maxSizeBytes is for storage limit, not performance. Option C is wrong because storageAccountType is for data redundancy.

Option D is wrong because collation is for language support.

Full explanation →

68

Multi-Selecthard

You are designing a data processing solution that must handle both batch and stream processing in a unified manner. You need to choose a technology that supports a single code base for both processing modes. Which TWO technologies meet this requirement?

Select 2 answers

A.Azure Databricks with Delta Live Tables

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure SQL Database with Change Tracking

E.Apache Spark Structured Streaming in Azure Synapse

AnswersA, E

Delta Live Tables supports both batch and streaming with a unified declarative pipeline.

Why this answer

Option B (Azure Databricks with Delta Live Tables) and Option D (Apache Spark Structured Streaming in Azure Synapse) support a unified batch-stream programming model. Option A (Azure Stream Analytics) is streaming-only. Option C (Azure Data Factory) is batch-only.

Option E (Azure SQL Database) is not a processing engine.

Full explanation →

69

Multi-Selecthard

Which THREE of the following are best practices for designing tables in a dedicated SQL pool in Azure Synapse Analytics?

Select 3 answers

A.Avoid using clustered columnstore indexes on large tables.

B.Avoid data skew by choosing a good distribution key.

C.Use round-robin distribution for all large fact tables.

D.Use replicated tables for small dimension tables (less than 1 GB).

E.Use hash distribution on a column with high cardinality for large fact tables.

AnswersB, D, E

Data skew can degrade performance.

Why this answer

Option B is correct because a good distribution key minimizes data skew, ensuring that data is evenly distributed across all distributions. This prevents performance bottlenecks where some distributions handle a disproportionate amount of data or queries, which is critical for parallel processing in a dedicated SQL pool.

Exam trap

The trap here is that candidates often assume clustered columnstore indexes are unsuitable for large tables due to memory constraints, but they are actually the default and recommended index type for fact tables in Synapse dedicated SQL pools.

Full explanation →

70

MCQmedium

You are monitoring an Azure Synapse Analytics dedicated SQL pool using dynamic management views (DMVs). You want to identify queries that are experiencing excessive memory grants causing concurrency issues. Which DMV should you query?

A.sys.dm_pdw_waits

B.sys.dm_pdw_exec_requests

C.sys.dm_pdw_resource_waits

D.sys.dm_pdw_workload_management_workload_groups_details

AnswerB

Contains total_request_memory_granted column to identify large memory consumers

Why this answer

Option D is correct because sys.dm_pdw_exec_requests contains the total_request_memory_granted column that can be used to identify queries with large memory grants. Option A is wrong because it shows only waits, not memory grants. Option B is wrong because it shows resource class membership, not current memory usage.

Option C is wrong because it shows workload group settings, not actual grants.

Full explanation →

71

Multi-Selecteasy

Which THREE best practices should be followed when designing a data lake in Azure Data Lake Storage Gen2 for optimal performance?

Select 3 answers

A.Disable hierarchical namespace to improve performance.

B.Use a deep directory structure with many subfolders.

C.Use Parquet file format for analytics workloads.

D.Use a naming convention that avoids special characters and high cardinality.

E.Partition data by date to enable partition elimination.

AnswersC, D, E

Parquet is columnar and efficient for analytics.

Why this answer

Options A, C, and E are correct. Using Parquet improves compression and query performance; partitioning by date allows partition pruning; using a naming convention that avoids high cardinality directories prevents throttling. Option B is wrong because hierarchical namespace is required for ADLS Gen2 and is not a best practice to disable.

Option D is wrong because a deep directory structure can cause performance issues.

Full explanation →

72

MCQmedium

You are using Azure Data Factory to load data from an on-premises SQL Server to Azure Synapse Analytics. The data size is 500 GB, and you need to minimize the load time. The network bandwidth is limited. Which approach should you use?

A.Use a self-hosted integration runtime and copy directly to Synapse.

B.Use a self-hosted integration runtime and stage the data in Azure Blob Storage before loading to Synapse.

C.Use Azure ExpressRoute to improve network bandwidth.

D.Use an Azure integration runtime and copy to Azure Data Lake Storage, then to Synapse.

AnswerB

Staging improves performance by splitting the copy into two phases.

Why this answer

Option A is correct because using a self-hosted integration runtime with staging via Azure Blob Storage allows you to stage the data in Azure and then copy to Synapse, which is faster over limited bandwidth. Option B (copy activity directly) is slower. Option C (Azure Data Lake Storage) is similar but staging is more efficient.

Option D (Azure ExpressRoute) requires additional setup and cost.

Full explanation →

73

Multi-Selectmedium

A data engineering team is designing a batch processing solution using Azure Databricks. The data is stored in Azure Data Lake Storage Gen2 (ADLS Gen2) and must be processed daily with minimal cost. The team needs to choose between using a Delta Lake table or a Parquet file format for the processed output. Which TWO factors should the team consider when making this decision?

Select 2 answers

A.Delta Lake provides time travel capabilities for accessing historical data versions.

B.Parquet is easier to implement for schema evolution than Delta Lake.

C.Delta Lake reduces storage costs by automatically compressing data.

D.Delta Lake supports ACID transactions, ensuring data consistency during concurrent writes.

E.Parquet files are not natively supported by Azure Databricks.

AnswersA, D

Delta Lake's time travel feature allows querying previous versions of data.

Why this answer

Option A is correct because Delta Lake's time travel feature allows querying previous versions of data using a timestamp or version number, which is essential for auditing, rollback, and reproducing historical reports. This capability is built on the transaction log that tracks every change, making it a key differentiator from plain Parquet files.

Exam trap

The trap here is that candidates assume Parquet is not natively supported in Databricks, but in reality, Parquet is the default storage format for Delta Lake and is fully supported; the key differentiators are ACID transactions and time travel, not format compatibility.

Full explanation →

74

MCQeasy

You are running an Azure Stream Analytics job that reads from an Event Hub and writes to a Power BI dataset. The job is falling behind and processing latency is increasing. What should you do to improve performance?

A.Increase the number of Streaming Units (SUs) allocated to the job.

B.Use a reference data input to filter events.

C.Change the output to Azure Blob Storage instead of Power BI.

D.Decrease the size of events sent to the Event Hub.

AnswerA

More SUs increase processing capacity.

Why this answer

Option D is correct because increasing the Streaming Units (SUs) allocates more compute resources to the job, improving throughput. Option A (change output to blob) changes architecture. Option B (decrease event size) is not always possible.

Option C (use reference data) does not affect throughput.

Full explanation →

75

Multi-Selecteasy

Which TWO actions should you take to optimize a Spark job in Azure Synapse Analytics that is experiencing excessive shuffling and long execution times?

Select 2 answers

A.Coalesce the number of partitions to reduce overhead.

B.Enable checkpointing to persist intermediate results.

C.Broadcast one of the tables if it is small enough to fit in memory.

D.Disable dynamic resource allocation.

E.Increase the number of shuffle partitions using 'spark.sql.shuffle.partitions'.

AnswersC, E

Broadcast join eliminates shuffle by replicating the small table to all executors.

Why this answer

Option A is correct because broadcasting a small lookup table can avoid shuffles in joins. Option C is correct because increasing the number of partitions can spread data more evenly, reducing shuffle size. Option B is wrong because coalescing reduces partitions, which may worsen shuffling.

Option D is wrong because disabling dynamic allocation may lead to resource contention. Option E is wrong because enabling checkpointing adds overhead and does not directly reduce shuffling.

Full explanation →

Page 1 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →