Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 76–150

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 2 of 12

76

MCQmedium

A data engineer needs to store JSON documents that are frequently updated by multiple users concurrently. The solution must support optimistic concurrency control and have built-in indexing on all fields. Which Azure data store should be used?

A.Azure Cosmos DB (SQL API)

B.Azure Blob Storage

C.Azure SQL Database

D.Azure Table Storage

AnswerA

Cosmos DB supports JSON documents, optimistic concurrency, and automatic indexing.

Why this answer

Azure Cosmos DB (SQL API) is the correct choice because it natively supports optimistic concurrency control via ETags (HTTP entity tags) and provides automatic indexing of all fields without requiring manual index management. This makes it ideal for storing JSON documents that are frequently updated by multiple concurrent users, as it ensures conflict detection and resolution while maintaining high performance.

Exam trap

The trap here is that candidates often choose Azure SQL Database because they associate concurrency control with relational databases, overlooking that Cosmos DB is purpose-built for JSON documents with automatic indexing and native optimistic concurrency via ETags, which is more aligned with the requirements than a relational store.

How to eliminate wrong answers

Option B (Azure Blob Storage) is wrong because it does not support optimistic concurrency control; it uses lease-based locking for blobs, which is not designed for fine-grained concurrent updates on JSON documents and lacks built-in indexing on all fields. Option C (Azure SQL Database) is wrong because while it supports optimistic concurrency via snapshot isolation or row versioning, it requires manual index creation and is not optimized for storing and querying JSON documents natively; it is a relational store, not a document store. Option D (Azure Table Storage) is wrong because it does not support optimistic concurrency control (it uses ETags but only for individual entities, not for complex JSON documents) and its indexing is limited to partition and row keys, not all fields.

Full explanation →

77

Multi-Selecthard

You are designing a data storage solution for a financial services company. The solution must meet the following requirements: store transaction data for 7 years for regulatory compliance, support point-in-time restore (PITR) for the last 30 days, and minimize storage costs for historical data. Which THREE actions should you take?

Select 3 answers

A.Store historical data in Azure Data Lake Storage Gen2

B.Use Azure Blob Storage with cool access tier for data older than 30 days

C.Use Azure SQL Database with automated backups and PITR retention of 30 days

D.Implement a data export process using Azure Data Factory to move data older than 30 days to Blob Storage

E.Use Azure Table Storage for archival

AnswersB, C, D

Cool tier is cost-effective for infrequently accessed data.

Why this answer

Option B is correct because Azure Blob Storage's cool access tier is designed for infrequently accessed data with lower storage costs, making it ideal for historical transaction data older than 30 days. This tier provides cost-effective storage while still allowing retrieval when needed, aligning with the requirement to minimize storage costs for historical data.

Exam trap

The trap here is that candidates often confuse Azure Data Lake Storage Gen2 with a cost-effective archival solution, when in fact it is designed for high-throughput analytics and lacks the tiered pricing that Blob Storage offers for long-term retention.

Full explanation →

78

MCQeasy

You need to ensure that data in an Azure Data Lake Storage Gen2 account is encrypted at rest using a customer-managed key. Which feature should you configure?

A.Azure Key Vault integration with Storage Service Encryption

B.Azure Information Protection

C.Azure Storage Service Encryption with Microsoft-managed keys

D.Azure Disk Encryption

AnswerA

Azure Storage Service Encryption supports customer-managed keys stored in Azure Key Vault.

Why this answer

Option C is correct because ADLS Gen2 supports encryption at rest with customer-managed keys via Azure Key Vault. Option A is wrong because Azure Information Protection is for labeling, not encryption at rest. Option B is wrong because Azure Disk Encryption is for VMs.

Option D is wrong because Azure Storage Service Encryption uses Microsoft-managed keys by default; customer-managed keys require Key Vault.

Full explanation →

79

MCQeasy

Your organization uses Azure Data Lake Storage Gen2 and needs to prevent accidental deletion of data by enabling soft delete. You also need to ensure that deleted blobs are recoverable for 30 days. What should you configure?

A.Enable blob snapshots and set them to expire after 30 days.

B.Use Azure Backup to create daily backups of the storage account.

C.Enable container soft delete with a retention period of 30 days.

D.Enable blob soft delete and set retention period to 30 days.

AnswerD

Blob soft delete retains deleted blobs for the specified period, allowing recovery.

Why this answer

Option A is correct because blob soft delete allows recovery of deleted blobs within a specified retention period. Option B is wrong because container soft delete is for entire containers, not individual blobs. Option C is wrong because Azure Backup is for VM backups, not blob recovery.

Option D is wrong because snapshots are manual and not automatic.

Full explanation →

80

MCQmedium

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline runs successfully but you notice that the data transfer is taking longer than expected. You need to improve the transfer performance without increasing the cost significantly. Which configuration change should you make?

A.Enable staged copy with an Azure Blob Storage staging location.

B.Use PolyBase to load the data into Blob Storage.

C.Increase the number of concurrent connections on the SQL Server.

D.Increase the Data Integration Units (DIU) for the copy activity.

AnswerA

Staged copy uses staging storage to improve performance by enabling parallel uploads and automatic retries.

Why this answer

Option C is correct because enabling staged copy with staging storage can improve performance by allowing parallel uploads and retries. Option A is wrong because increasing DIUs may increase cost. Option B is wrong because increasing the number of concurrent connections on SQL Server might not be the bottleneck.

Option D is wrong because using PolyBase is for loading into Azure Synapse or SQL DW, not Blob Storage.

Full explanation →

81

MCQhard

You are designing a data processing solution using Azure Databricks with Delta Live Tables (DLT). The solution must handle late-arriving data and ensure that updates to the delta table are processed exactly once. Which DLT pipeline setting should you configure?

A.Set the pipeline mode to 'Trigger once'.

B.Set the pipeline mode to 'Continuous' and enable 'Exactly once' delivery.

C.Use Auto Loader with 'file arrival' mode.

D.Configure the DLT pipeline with 'Delta Live Tables' and use 'APPLY CHANGES INTO' with a watermark.

AnswerD

APPLY CHANGES INTO handles late-arriving data with exactly-once semantics.

Why this answer

Option D is correct because DLT pipelines support exactly-once processing and handle late data via watermarking and append-only or update modes. Option A is wrong because 'Trigger once' processes data one time only, not continuous. Option B is wrong because 'Continuous' mode processes data as it arrives but may not guarantee exactly-once without proper configuration.

Option C is wrong because 'File arrival' is not a DLT setting.

Full explanation →

82

MCQhard

You are monitoring an Azure Stream Analytics job that processes data from an IoT hub. The job's output to Azure Synapse Analytics is experiencing high latency. The job's SU% utilization is at 90%. Which action will most likely reduce the latency?

A.Increase the number of Streaming Units (SUs) allocated to the job.

B.Decrease the watermark delay interval.

C.Increase the late arrival tolerance window.

D.Increase the number of partitions in the output table.

AnswerA

More SUs provide more processing capacity, reducing latency.

Why this answer

The job's SU% utilization is at 90%, indicating that the current Streaming Units (SUs) are nearly saturated, causing a processing bottleneck. Increasing the number of SUs allocates more compute resources (CPU and memory) to the Stream Analytics job, allowing it to process incoming IoT data faster and reduce the latency to Azure Synapse Analytics. This directly addresses the high utilization issue, which is the most likely root cause of the latency.

Exam trap

The trap here is that candidates often confuse output-side tuning (like partitioning or sink configuration) with the actual processing bottleneck, overlooking that high SU% utilization directly indicates the Stream Analytics job itself is the limiting factor.

How to eliminate wrong answers

Option B is wrong because decreasing the watermark delay interval would make the job emit results more frequently, but it does not increase processing capacity; with SU utilization already at 90%, this could worsen backpressure and increase latency. Option C is wrong because increasing the late arrival tolerance window only allows the job to handle out-of-order events for a longer period; it does not improve throughput or reduce latency caused by resource saturation. Option D is wrong because increasing the number of partitions in the output table improves write parallelism on the Synapse side, but the bottleneck is the Stream Analytics job's processing capacity (90% SU utilization), not the output sink's partitioning.

Full explanation →

83

MCQmedium

You are designing a data processing solution that requires running custom Python scripts for data transformation. The scripts have dependencies on specific libraries that are not pre-installed in the environment. You need a fully managed, serverless compute option that allows you to install custom libraries. Which service should you use?

A.Azure Databricks with cluster-scoped libraries

B.Azure Functions with custom handlers

C.Azure Container Instances with a custom Docker image

D.Azure Synapse Analytics serverless SQL pool

AnswerA

Databricks allows installing libraries at cluster level and is fully managed.

Why this answer

Azure Databricks with cluster configuration allows installation of custom libraries. Azure Synapse Spark pools also support libraries but require cluster management. Azure Functions may have limitations.

Azure Container Instances require container images.

Full explanation →

84

Drag & Dropmedium

Drag and drop the steps to convert data from CSV to Parquet format using Azure Data Factory into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

Define source (CSV) and sink (Parquet) datasets, then a copy activity with mapping, run, and monitor.

Full explanation →

85

MCQmedium

You are building a data pipeline that uses Azure Data Factory to copy data from a REST API to Azure Blob Storage. The REST API returns JSON data in pages of 1000 records each. The total number of records is 50,000. Which activity or feature should you use to loop through the pages?

A.Use a ForEach activity to iterate over a fixed number of pages.

B.Use a Lookup activity to retrieve the total number of pages and then use a ForEach.

C.Use an Until activity to loop until the API returns no more pages.

D.Use a Copy activity with pagination rules enabled in the source.

AnswerD

The Copy activity supports pagination for REST APIs using pagination rules (e.g., next URL).

Why this answer

Option C is correct because the Until activity with a dynamic pagination logic is appropriate for looping until all pages are fetched. Option A is incorrect because the Copy activity alone does not handle pagination. Option B is incorrect because ForEach is for iterating a fixed set of items, not dynamic pagination.

Option D is incorrect because Lookup activity returns a single value, not suitable for pagination.

Full explanation →

86

MCQhard

You are troubleshooting a slow-running Azure Synapse Pipeline that loads data from Azure Blob Storage into a dedicated SQL pool using a Copy activity. The source is a set of CSV files totaling 500 GB. The sink is a staging table with a clustered columnstore index. The pipeline takes 4 hours to complete. You need to reduce the execution time to under 1 hour. What should you do?

A.Enable PolyBase in the Copy activity sink settings.

B.Increase the Data Integration Units (DIU) in the Copy activity to the maximum.

C.Increase the dedicated SQL pool's DWU setting to the highest tier.

D.Partition the staging table on a date column.

AnswerA

PolyBase provides the fastest way to load data into dedicated SQL pool by leveraging its parallel architecture.

Why this answer

Option D is correct because enabling PolyBase in the Copy activity uses the fast PolyBase loading path, which can significantly accelerate loading into dedicated SQL pool. Option A (increasing DIU) helps but is limited by the sink. Option B (increasing DWU) scales the warehouse but does not change the copy method.

Option C (partitioning the table) might help at query time but not during initial load.

Full explanation →

87

MCQeasy

Your company uses Azure SQL Database and needs to audit all data modifications, including SELECT operations on sensitive columns, for compliance. Which feature should you enable?

A.Enable SQL auditing and configure to log SELECT operations.

B.Enable Transparent Data Encryption (TDE).

C.Enable Azure Defender for SQL.

D.Enable SQL vulnerability assessment.

AnswerA

Auditing can log SELECT and other events for compliance.

Why this answer

Option A is correct because Azure SQL Database auditing captures database events including SELECT statements. Option B is wrong because vulnerability assessment scans for security issues but does not audit. Option C is wrong because TDE encrypts data at rest.

Option D is wrong because Azure Defender for SQL provides security alerts but not detailed audit logs.

Full explanation →

88

MCQhard

A data engineer runs the Azure CLI command shown in the exhibit. The blob is stored in Azure Blob Storage. The team previously set a lifecycle management rule to move blobs to the Archive tier after 30 days. The blob was created 45 days ago. What is the most likely reason the blob is still in the Cool tier?

A.The lifecycle management rule only applies to blobs in the Hot tier.

B.The lifecycle management rule was configured after the blob was created, and it can take up to 24 hours for the rule to be evaluated.

C.The blob is in a container that is excluded from the lifecycle rule.

D.The blob must be in the Hot tier for the rule to move it to Archive.

AnswerB

Correct. Lifecycle rules are evaluated once per day, so there may be a delay.

Why this answer

The lifecycle management rule is evaluated by Azure Storage once per day. If the rule was configured after the blob was created, it may not have been evaluated yet, and it can take up to 24 hours for the rule to apply. Since the blob is 45 days old and still in Cool tier, the most likely reason is that the rule has not yet been evaluated after its configuration.

Exam trap

The trap here is that candidates assume lifecycle rules are evaluated immediately or that blobs must be in Hot tier to be moved to Archive, but Azure's daily evaluation cycle and the ability to move from Cool to Archive are the key nuances tested.

How to eliminate wrong answers

Option A is wrong because lifecycle management rules can apply to blobs in any tier (Hot, Cool, or Archive) unless explicitly filtered by tier in the rule definition. Option C is wrong because there is no indication in the scenario that the container is excluded; the question states a rule was set, and exclusion would require explicit configuration. Option D is wrong because lifecycle rules can move blobs from Cool to Archive directly; blobs do not need to be in Hot tier first.

Full explanation →

89

MCQeasy

You are designing a data storage solution for a marketing analytics platform. The platform collects clickstream data from websites and needs to store it for both real-time dashboards and historical analysis. The data is semi-structured (JSON) and arrives at a rate of 10,000 events per second. You need to choose an Azure storage solution that can handle the ingestion rate, support schema-on-read, and integrate with Azure Databricks for advanced analytics. The solution must also be cost-effective for long-term storage. What should you use?

A.Azure Table Storage

B.Azure Data Lake Storage Gen2

C.Azure Cosmos DB

D.Azure SQL Database

AnswerB

ADLS Gen2 meets all requirements: high throughput, schema-on-read, Databricks integration, cost-effective.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical namespace with Azure Blob Storage, providing high-throughput ingestion (up to 60 GB/s per account) to handle 10,000 events per second of semi-structured JSON data. It supports schema-on-read natively, allowing Azure Databricks to query the data directly using Spark without prior schema definition, and its tiered storage (hot, cool, archive) makes it cost-effective for long-term historical analysis.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB for its real-time capabilities, overlooking that the question emphasizes cost-effective long-term storage and schema-on-read for historical analysis, which ADLS Gen2 handles far more efficiently and cheaply than Cosmos DB's per-request-unit pricing model.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a NoSQL key-value store designed for structured data with a fixed schema, not for semi-structured JSON clickstream data, and it lacks the hierarchical namespace and high-throughput ingestion needed for 10,000 events per second. Option C is wrong because Azure Cosmos DB is optimized for low-latency real-time access with its multi-model API, but it is significantly more expensive for long-term storage of high-volume historical data and does not natively integrate with Azure Databricks for schema-on-read analytics as efficiently as ADLS Gen2. Option D is wrong because Azure SQL Database is a relational database requiring a predefined schema (schema-on-write), which conflicts with the schema-on-read requirement, and its ingestion rate and cost model are not designed for high-velocity semi-structured data at 10,000 events per second.

Full explanation →

90

MCQmedium

Refer to the exhibit. You are deploying an Azure Synapse Analytics workspace using an ARM template. The template defines a managed virtual network integration runtime. You need to ensure that the integration runtime can run mapping data flows with a time-to-live (TTL) of 10 minutes. What is the purpose of the 'timeToLive' property in this configuration?

A.It defines how long the cluster will be kept alive after a data flow completes, allowing subsequent data flows to reuse the cluster.

B.It sets the timeout for the integration runtime to connect to the data sources.

C.It specifies the maximum duration a data flow activity can run before timing out.

D.It determines the maximum number of concurrent data flows that can run on the cluster.

AnswerA

TTL keeps the cluster warm for reuse, reducing startup time.

Why this answer

The 'timeToLive' property in an Azure Synapse Analytics managed virtual network integration runtime controls how long the cluster remains alive after a mapping data flow completes. By setting a TTL of 10 minutes, subsequent data flows can reuse the same warm cluster, avoiding the 5–10 minute cold start time for new clusters. This optimizes performance and reduces latency for consecutive data flow executions.

Exam trap

The trap here is that candidates confuse 'timeToLive' with activity timeout or concurrency limits, because all three involve time or capacity constraints, but TTL specifically governs cluster reuse after a data flow completes, not execution duration or parallelism.

How to eliminate wrong answers

Option B is wrong because the connection timeout to data sources is configured separately via linked service properties or the 'connectVia' runtime settings, not through the 'timeToLive' property. Option C is wrong because the maximum duration a data flow activity can run is set by the activity's 'timeout' property in the pipeline, not by the integration runtime's TTL. Option D is wrong because the maximum number of concurrent data flows is controlled by the 'concurrency' property on the integration runtime, not by 'timeToLive'.

Full explanation →

91

Multi-Selecthard

Which THREE security features are available in Azure Data Lake Storage Gen2 to protect data at rest and in transit? (Choose three.)

Select 3 answers

A.Azure Storage firewalls and virtual network rules

B.Azure Information Protection

C.Encryption at rest using Storage Service Encryption (SSE)

D.Azure ADLS Gen2 supports HTTPS for data in transit.

E.Azure Policy

AnswersA, C, D

Restrict access to specific networks.

Why this answer

Options A, B, and D are correct. A ensures data is encrypted at rest using AES-256. B provides data integrity for data in transit.

D provides network-level security. C is wrong because Azure Information Protection is for classification, not a built-in storage security feature. E is wrong because Azure Policy is for governance, not data protection.

Full explanation →

92

MCQhard

Refer to the exhibit. You deploy the ARM template to create a storage account. After deployment, a developer reports that they cannot access the storage account from an Azure Databricks notebook running in the same virtual network as specified in the subnet rule. What is the most likely reason?

A.The storage account uses customer-managed keys from Key Vault, which requires additional permissions.

B.The firewall default action is Deny, and there are no IP rules allowing the Databricks cluster's public IP.

C.The Databricks cluster is not using a managed virtual network with a service endpoint to the storage account.

D.The storage account has hierarchical namespace enabled, which blocks non-ADLS Gen2 tools.

AnswerC

For the subnet rule to work, the Databricks cluster must be deployed in a virtual network that has a service endpoint to Microsoft.Storage. If the cluster is using a public IP, access is denied.

Why this answer

Option D is correct because the default action is Deny, and only traffic from the specified subnet is allowed. The developer's Databricks notebook might be using a cluster that is not using a managed virtual network or is not in that subnet. Option A is wrong because HNS enabled does not block access.

Option B is wrong because firewall is correctly set to Deny with a rule for the subnet. Option C is wrong because the key vault encryption doesn't block access.

Full explanation →

93

MCQhard

Refer to the exhibit. An Azure Data Factory instance uses a self-hosted integration runtime. The exhibit shows the properties of the integration runtime. The data engineer notices that copy activities are failing with errors indicating that the integration runtime is not available. What is the most likely cause?

A.The integration runtime status is "Offline"

B.Auto-update is disabled, preventing the IR from updating

C.The integration runtime version is outdated and needs to be manually updated

D.Self-contained interactive authoring is disabled, causing connectivity issues

AnswerC

The version is behind the pushed version, indicating auto-update has not applied the latest update.

Why this answer

Option C is correct because the exhibit shows the integration runtime version as '5.24.8345.1' and the status as 'Online', but copy activities are failing. The most likely cause is that the self-hosted IR version is outdated and no longer compatible with the Azure Data Factory service endpoints, leading to connectivity failures. Auto-update being disabled (Option B) would prevent automatic updates, but the core issue is the outdated version itself, which requires manual intervention to update.

Exam trap

The trap here is that candidates see the status 'Online' and assume the IR is fully functional, overlooking that version incompatibility can cause operational failures even when the IR appears connected.

How to eliminate wrong answers

Option A is wrong because the exhibit clearly shows the integration runtime status as 'Online', not 'Offline', so the IR is technically reachable. Option B is wrong because while auto-update being disabled can lead to an outdated version, the question asks for the most likely cause of the copy activity failures, and the direct cause is the outdated version (Option C), not the disabled auto-update setting itself. Option D is wrong because self-contained interactive authoring is a feature for authoring and debugging in the self-hosted IR environment, and disabling it does not cause the IR to become unavailable for copy activities; it only affects authoring capabilities.

Full explanation →

94

Multi-Selectmedium

A company uses Azure Synapse Analytics dedicated SQL pool for a data warehouse. They notice that some queries are using more memory than expected, causing resource contention. Which TWO actions should they take to diagnose and optimize memory usage?

Select 2 answers

A.Enable result-set caching.

B.Increase the resource class for the users running the heavy queries.

C.Scale up the DWU setting.

D.Query the sys.dm_pdw_exec_requests DMV to identify queries with high memory grants.

E.Rebuild clustered columnstore indexes.

AnswersB, D

Larger resource classes provide more memory per query.

Why this answer

Option B is correct because increasing the resource class for users running heavy queries allocates more memory to those queries, reducing resource contention by ensuring they have sufficient memory to execute efficiently. Option D is correct because querying sys.dm_pdw_exec_requests DMV allows you to identify queries with high memory grants, which is the first step in diagnosing which queries are consuming excessive memory and need optimization.

Exam trap

The trap here is that candidates often confuse scaling up the DWU (Option C) as a diagnostic action, but it is a reactive scaling measure that does not help identify which queries are causing the memory issue, whereas querying the DMV and adjusting resource classes are targeted diagnostic and optimization steps.

Full explanation →

95

Multi-Selectmedium

Which TWO actions can you take to optimize the performance of a dedicated SQL pool in Azure Synapse Analytics when loading large volumes of data?

Select 2 answers

A.Create nonclustered indexes on all columns of the target table

B.Use ROUND_ROBIN distribution for the staging table

C.Set the row group size to 100,000 rows for optimal compression

D.Enable change tracking on the target table

E.Use CREATE TABLE AS SELECT (CTAS) with partition switching

AnswersB, E

Round-robin distributes data evenly, speeding up loads.

Why this answer

Options A and D are correct. Round-robin distribution ensures even data distribution during loads. Using CTAS with partition switching minimizes logging and fragmentation.

Option B is wrong because smaller row group sizes increase columnstore segment count, reducing compression and query performance. Option C is wrong because creating indexes on all columns increases write overhead. Option E is wrong because enabling change tracking adds overhead.

Full explanation →

96

MCQhard

You are designing a near-real-time data processing solution that ingests millions of events per second from IoT devices. The data must be aggregated on a per-minute basis and stored in Azure Data Lake Storage Gen2 for long-term analytics. The solution must also support alerting when certain thresholds are exceeded. Which combination of Azure services should you use?

A.Azure Event Hubs, Azure Data Factory, and Azure SQL Database.

B.Azure IoT Hub, Azure Stream Analytics, and Azure Functions.

C.Azure IoT Hub, Azure Databricks with Structured Streaming, and Azure Data Lake Storage Gen2.

D.Azure Event Hubs, Azure Data Explorer, and Power BI.

AnswerB

IoT Hub ingests device data, Stream Analytics performs per-minute aggregation and alerting, and Functions can output to Data Lake Storage.

Why this answer

Option D is correct because Azure IoT Hub ingests device data, Azure Stream Analytics performs real-time aggregation and alerting, and Azure Functions can write aggregated results to Data Lake Storage. Option A (Azure Event Hubs) is a generic event broker but lacks device management capabilities. Option B (Azure Databricks) is more suited for complex analytics than simple aggregation.

Option C (Azure Data Explorer) is for ad-hoc analytics, not for streaming aggregation with alerting.

Full explanation →

97

MCQmedium

Your team uses Azure Synapse Analytics serverless SQL pool to query data in Azure Data Lake Storage Gen2. You notice that queries are running slower than expected. You need to improve query performance by reducing the amount of data scanned. Which two features should you implement? (Select two.)

A.Partition the data in the data lake and use partition elimination in queries.

B.Enable result-set caching for the serverless SQL pool.

C.Enable auto-optimize on the Delta Lake tables.

D.Create materialized views on the serverless SQL pool.

E.Use file pruning by specifying file paths in the OPENROWSET query.

AnswerA, E

Partition elimination allows the query to skip irrelevant partitions, reducing data scanned.

Why this answer

Options A and C are correct. File pruning reduces data scanned by skipping irrelevant files. Partition elimination reduces data scanned by skipping irrelevant partitions.

Option B is wrong because materialized views store pre-computed data but do not reduce scan on source. Option D is wrong because result-set caching caches query results but does not reduce data scanned for new queries. Option E is wrong because enabling auto-optimize is for Delta Lake, not serverless SQL.

Full explanation →

98

MCQeasy

Refer to the exhibit. You have a mapping data flow in Azure Data Factory that aggregates sales data. The data flow runs successfully but the sink table contains only the total sum per run instead of per product. What is missing?

A.The source dataset is not filtering by date

B.The aggregate transformation does not have a groupBy column

C.The data flow is missing a filter transformation

D.The sink dataset is not configured to append

AnswerB

Without groupBy, it aggregates all rows.

Why this answer

Option C is correct. The aggregate transformation does not have a groupBy clause, so it sums all rows. Adding a groupBy on ProductID will produce per-product totals.

Option A is wrong because the source is correct. Option B is wrong because the sink is fine. Option D is wrong because there is no filter.

Full explanation →

99

MCQmedium

You are designing access control for Azure Data Lake Storage Gen2. You need to allow a group of data scientists to read and write files in the 'processed' directory but prevent them from deleting files. Which authorization method should you use?

A.Assign the Storage Blob Data Contributor role at the container level

B.Assign the Storage Blob Delegator role at the container level

C.Assign the Storage Blob Data Contributor role at the container level and use ACLs to deny delete on the directory

D.Assign the Storage Blob Data Reader role and use ACLs to grant write

AnswerC

ACLs can override RBAC permissions to deny delete.

Why this answer

Azure RBAC roles like Storage Blob Data Contributor grant read/write/delete permissions. To differentiate write and delete, you need ACLs that allow read/execute on the container and read/write on the directory, but not delete. Option A (Storage Blob Data Contributor) includes delete.

Option B (Storage Blob Data Reader) is read-only. Option D (Storage Blob Delegator) is not a built-in role.

Full explanation →

100

MCQmedium

A company uses Azure Synapse Analytics dedicated SQL pool. They notice that some queries are slow due to high data movement. What should you do to minimize data movement for queries that join large fact tables?

A.Use round-robin distribution for all tables.

B.Partition both tables on the join keys.

C.Hash-distribute the fact tables on the join keys.

D.Use replicated tables for all large fact tables.

AnswerC

Hash distribution on join keys colocates rows and minimizes data movement.

Why this answer

Hash-distributing the fact tables on the join keys ensures that rows with the same join key value are placed on the same distribution node. This eliminates the need to shuffle data across nodes during the join, minimizing data movement and improving query performance in Azure Synapse dedicated SQL pool.

Exam trap

The trap here is that candidates confuse partitioning with distribution, thinking that partitioning on join keys reduces data movement, when in fact only hash distribution on the join key ensures collocation across nodes.

How to eliminate wrong answers

Option A is wrong because round-robin distribution distributes rows evenly without considering join keys, which does not reduce data movement for joins and can actually increase it. Option B is wrong because partitioning on join keys organizes data within a distribution but does not control data placement across distributions; data movement still occurs when joining across partitions. Option D is wrong because replicated tables are suitable for small dimension tables, not large fact tables, as replicating large tables would consume excessive storage and negate the benefits of scale-out.

Full explanation →

101

Multi-Selecthard

Which THREE metrics should you monitor for an Azure Synapse Analytics dedicated SQL pool to ensure optimal performance?

Select 3 answers

A.tempdb usage

B.DWU usage

C.Queued queries

D.Login failures

E.Total storage size

AnswersA, B, C

High tempdb usage can slow queries.

Why this answer

Options A, C, and E are correct. Option A: DWU usage indicates how much of the allocated resources are used. Option C: Queued queries indicate concurrency bottlenecks.

Option E: tempdb usage can cause performance issues. Option B is wrong because storage size is capacity, not performance. Option D is wrong because login failures are security, not performance.

Full explanation →

102

MCQhard

Your organization uses Azure Data Lake Storage Gen2 with hierarchical namespace enabled. You need to implement a security strategy that allows users to read only specific folders within a container. Which authorization method should you use?

A.Storage account shared key

B.Azure RBAC roles (e.g., Storage Blob Data Contributor) at the container level

C.Shared access signatures (SAS) with folder-level permissions

D.Access control lists (ACLs) on the folder

AnswerD

ACLs allow granular permissions at directory level

Why this answer

Option C is correct because ACLs (Access Control Lists) can be set at the folder level in ADLS Gen2 to grant granular permissions to specific directories. Option A is wrong because RBAC roles at the container level apply to the entire container. Option B is wrong because SAS tokens grant access at the container or file level, not folder level.

Option D is wrong because shared key is account-level access.

Full explanation →

103

Multi-Selecthard

Which THREE factors should you consider when choosing between rowstore and columnstore indexes in Azure Synapse Analytics?

Select 3 answers

A.The table contains many NULL values in indexed columns.

B.The table will be partitioned frequently.

C.The table size is expected to be over 1 TB.

D.The table has a high number of singleton lookups by a primary key.

E.The workload is heavy on aggregations and large scans.

AnswersC, D, E

Columnstore compression is more effective on large tables.

Why this answer

Option C is correct because columnstore indexes in Azure Synapse Analytics are optimized for large-scale data warehousing workloads, where table sizes exceeding 1 TB benefit from high compression and columnar storage, significantly improving scan and aggregation performance. Rowstore indexes, in contrast, are less efficient for such large datasets due to higher I/O and storage overhead.

Exam trap

The trap here is that candidates may mistakenly think NULL handling or partitioning frequency are key differentiators, when in fact the core decision hinges on workload type—aggregations/scans (columnstore) versus singleton lookups (rowstore)—and table size thresholds like 1 TB where columnstore compression becomes critical.

Full explanation →

104

Drag & Dropmedium

Drag and drop the steps to configure Azure Stream Analytics job with event input and Power BI output into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

First, set up the event hub as the data source. Then create the Stream Analytics job, configure input and output, write the query, and start it.

Full explanation →

105

MCQhard

You are building a batch processing solution in Azure Synapse Analytics that reads data from a dedicated SQL pool, applies complex transformations using Synapse Spark, and writes the results back to the dedicated SQL pool. The pipeline must run on a schedule and handle transient failures with retries. Which approach should you use?

A.Use Azure Batch with a custom application to run Spark jobs

B.Use Azure Synapse Pipelines with a Notebook activity that runs Spark code

C.Use Azure Functions to trigger Spark jobs on demand

D.Use Azure Databricks with Auto Loader and Delta Live Tables

AnswerB

Synapse Pipelines support scheduling and retries for Spark notebooks.

Why this answer

Option B is correct because Azure Synapse Pipelines support scheduling, retry policies, and can orchestrate Notebook activities for Spark transformations. Option A is wrong because Azure Batch is not integrated with Synapse. Option C is wrong because Databricks is a separate service; using it would require additional integration.

Option D is wrong because Azure Functions are for event-driven, short-lived tasks, not complex transformations.

Full explanation →

106

MCQmedium

You are partitioning a large fact table in Azure Synapse Dedicated SQL Pool by date. The table is used for queries that filter on CustomerID and Date. You want to minimize data movement. Which distribution strategy should you use?

A.Round-robin distribution

B.Hash distribution on CustomerID

C.Replicate distribution

D.Hash distribution on Date

AnswerB

Hash distribution on CustomerID ensures collocation for joins on that key, reducing data movement.

Why this answer

Hash distribution on CustomerID is correct because queries filtering on CustomerID and Date will benefit from collocated joins and aggregations when CustomerID is the distribution key. Since the table is large and partitioned by Date, hash distribution on CustomerID minimizes data movement by ensuring that rows with the same CustomerID reside on the same distribution node, allowing filters on Date to be applied locally within each partition.

Exam trap

The trap here is that candidates often assume partitioning and distribution should be on the same column (Date) to optimize date-range queries, but this ignores that distribution on the join key (CustomerID) is what minimizes data movement for the most common query pattern involving both CustomerID and Date filters.

How to eliminate wrong answers

Option A is wrong because round-robin distribution distributes rows evenly without any logical grouping, causing every query that filters on CustomerID to require full data movement across all distributions. Option C is wrong because replicate distribution copies the entire table to each distribution node, which is impractical for a large fact table due to storage overhead and write performance degradation. Option D is wrong because hash distribution on Date would cause data movement for queries filtering on CustomerID, as the distribution key does not align with the join or filter column, and partitioning on Date already provides local pruning without needing distribution on the same column.

Full explanation →

107

MCQmedium

You are designing a data storage solution for a real-time analytics application that ingests IoT sensor data. The data must be stored in a format that supports both streaming ingestion and batch processing with low latency for queries. Which Azure storage solution should you use?

A.Azure Blob Storage with hot access tier

B.Azure SQL Database with change data capture

C.Azure Cosmos DB with analytical store

D.Azure Data Lake Storage Gen2

AnswerD

Supports hierarchical namespace, streaming and batch ingestion, and low-latency queries through Azure Synapse.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical namespace with Azure Blob Storage, enabling both streaming ingestion (via Event Hubs, Kafka, or Spark Structured Streaming) and batch processing (via PolyBase, Azure Synapse, or Databricks) while supporting low-latency queries through its POSIX-like file system and optimized columnar formats like Parquet. It directly addresses the requirement for a unified storage layer that handles real-time and batch workloads without data movement.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage (option A) with ADLS Gen2, assuming the hot access tier supports streaming ingestion and low-latency queries, but they overlook the critical need for a hierarchical namespace and native batch processing capabilities that only ADLS Gen2 provides.

How to eliminate wrong answers

Option A is wrong because Azure Blob Storage with hot access tier lacks a hierarchical namespace, making it inefficient for low-latency queries and batch processing patterns that rely on directory structures and atomic rename operations; it is designed for object storage, not for real-time analytics with streaming ingestion. Option B is wrong because Azure SQL Database with change data capture is optimized for transactional workloads and relational queries, not for high-volume streaming ingestion of IoT sensor data, and it introduces latency and cost overhead for large-scale batch processing. Option C is wrong because Azure Cosmos DB with analytical store is designed for globally distributed, multi-model data with automatic indexing, but it is not optimized for batch processing with columnar formats like Parquet and incurs higher latency for large-scale analytical queries compared to ADLS Gen2's native integration with Spark and Synapse.

Full explanation →

108

MCQhard

Your company uses Azure Databricks to process large datasets stored in Azure Data Lake Storage Gen2. You need to implement a solution that automatically rotates the storage account access keys every 90 days without interrupting running jobs. The solution should use managed identities where possible. What should you do?

A.Configure Azure Databricks to use a user-assigned managed identity for authentication to Azure Data Lake Storage Gen2.

B.Use Azure Policy to automatically rotate keys and update Databricks cluster configurations.

C.Use Azure Key Vault to store the storage account keys and configure a lifecycle policy to rotate them every 90 days. Update Databricks secrets accordingly.

D.Generate a SAS token with a 90-day expiry and update the Databricks secret before expiry.

AnswerA

Managed identity eliminates the need for keys, and rotation is handled automatically by Azure, avoiding job disruption.

Why this answer

Option A is correct because using a user-assigned managed identity for Databricks to access storage eliminates the need for access keys. Option B is wrong because key rotation would disrupt jobs that rely on keys. Option C is wrong because SAS tokens with expiry still require token management and rotation.

Option D is wrong because storing keys in Key Vault does not prevent job interruption during rotation.

Full explanation →

109

MCQeasy

You need to transform JSON data containing nested arrays into a tabular format for analysis in Azure Synapse Analytics. Which transformation in Azure Data Factory or Synapse Pipelines should you use?

A.Join transformation

B.Derived Column transformation

C.Aggregate transformation

D.Flatten transformation

AnswerD

Flatten is specifically designed to unroll nested array elements into separate rows in mapping data flows.

Why this answer

The Flatten transformation is specifically designed to denormalize nested JSON arrays into a tabular format by expanding array elements into multiple rows while preserving parent attributes. In Azure Data Factory and Synapse Pipelines, this transformation handles complex hierarchical structures like arrays of objects, making it the correct choice for converting JSON with nested arrays into a row-based dataset suitable for analysis in Azure Synapse Analytics.

Exam trap

The trap here is that candidates often confuse the Flatten transformation with the Unpivot transformation, which pivots columns into rows but does not handle nested JSON arrays, or they mistakenly think the Derived Column transformation can handle array expansion through expressions.

How to eliminate wrong answers

Option A is wrong because the Join transformation combines two streams based on matching keys, not for flattening nested arrays within a single JSON document. Option B is wrong because the Derived Column transformation creates or modifies columns using expressions but cannot expand nested arrays into multiple rows. Option C is wrong because the Aggregate transformation performs grouping and summarization operations (e.g., SUM, COUNT) and does not handle the denormalization of nested array structures.

Full explanation →

110

MCQhard

You are designing a data processing solution for a large e-commerce company. The data includes sensitive customer information that must be encrypted at rest and in transit. The solution uses Azure Data Lake Storage Gen2 and Azure Databricks. Compliance requires that all access to the data is logged and monitored. Which combination of security features should you implement?

A.Enable Azure RBAC for storage and use Azure AD authentication.

B.Use Azure Firewall to restrict access and enable Azure AD Authentication for Databricks.

C.Enable Azure Storage encryption at rest, enforce HTTPS, and configure diagnostic settings to log all read/write operations.

D.Enable transparent data encryption (TDE) on the storage account and use service endpoints.

AnswerC

Meets all requirements.

Why this answer

Option B is correct because it covers encryption at rest (Azure Storage encryption), in transit (HTTPS), and logging (diagnostic settings). Option A is incorrect because RBAC alone does not log access details. Option C is incorrect because Azure Firewall is not for storage logging.

Option D is incorrect because TDE is for SQL databases, not Data Lake.

Full explanation →

111

MCQeasy

You need to monitor the performance of an Azure Stream Analytics job in real time. Which Azure service should you use to track the job's resource utilization (e.g., SU % utilization) and set up alerts when the job is approaching its capacity?

A.Azure Monitor

B.Azure Advisor

C.Microsoft Sentinel

D.Azure Log Analytics

AnswerA

Azure Monitor collects metrics like SU % utilization and allows you to set alerts to notify when thresholds are exceeded.

Why this answer

Option C is correct because Azure Monitor provides metrics and alerts for Stream Analytics jobs. Option A is wrong because Azure Log Analytics ingests logs but is not primarily for real-time metric alerts. Option B is wrong because Microsoft Sentinel is a SIEM.

Option D is wrong because Azure Advisor provides recommendations but not real-time monitoring and alerts for resource utilization.

Full explanation →

112

MCQmedium

You are a data engineer at a retail company. You need to design a batch processing solution that ingests daily sales data from multiple stores. Each store uploads a CSV file to a dedicated folder in Azure Data Lake Storage Gen2. The files have the same schema but may have minor variations in column order and include null values. After ingestion, you must clean the data by removing rows with null values in the 'SalesAmount' column, convert the 'Date' column from string to date type, and aggregate sales by product category. The output should be stored as Parquet files partitioned by year and month in the same Data Lake. You need to choose a compute service and implement the transformation with minimal coding effort. The solution must be cost-effective and require no cluster management. What should you do?

A.Use Azure Synapse Serverless SQL pool. Create external tables over the CSV files, write a T-SQL query to filter, cast, aggregate, and use CETAS to write Parquet partitions.

B.Use Azure Synapse Dedicated SQL pool. Load CSV files via PolyBase, transform with T-SQL, and use CREATE TABLE AS SELECT to output partitioned Parquet.

C.Use Azure Data Factory with a Mapping Data Flow. Configure the source to read all CSV files from the ADLS Gen2 folder. Add a Filter transformation to remove null SalesAmount, a Derived Column to parse the Date, and an Aggregate to sum sales by category. Sink to ADLS Gen2 as Parquet with partition by year and month.

D.Use Azure Databricks with a PySpark notebook. Mount the ADLS Gen2 storage, read CSV files with schema inference, filter, cast, aggregate, and write partitioned Parquet.

AnswerC

Mapping Data Flow provides code-free transformations and handles schema drift.

Why this answer

Azure Data Factory Mapping Data Flows provide a code-free visual interface to perform transformations like filter, derived column, and aggregate. It can read from ADLS Gen2, handle schema drift, and write partitioned Parquet files. Option A is correct.

Option B requires coding in PySpark and cluster management. Option C uses serverless SQL which is not ideal for complex transformations and file partitioning. Option D uses SQL pool which requires provisioning and is more expensive.

Full explanation →

113

MCQhard

You have an Azure Synapse Analytics dedicated SQL pool with a table that uses hash distribution on CustomerID. You notice that queries joining this table with another table on OrderDate are slow. What is the most likely cause?

A.The table is not partitioned by OrderDate

B.Statistics on the join columns are outdated

C.The table should use round-robin distribution instead

D.The join columns are not aligned; data must be shuffled across distributions

AnswerD

A is correct because hash distribution on CustomerID means OrderDate is not the distribution key, causing unnecessary data movement during join.

Why this answer

D is correct because in Azure Synapse Analytics dedicated SQL pools, hash distribution distributes rows across distributions based on a hash of the distribution key (CustomerID). When joining on OrderDate, which is not the distribution key, the join columns are not aligned across distributions. This forces data movement (shuffling) where rows from one or both tables must be redistributed to match the join key, causing significant performance degradation.

Exam trap

The trap here is that candidates often confuse partitioning with distribution, thinking that partitioning on the join column solves the data movement issue, when in fact distribution alignment is the critical factor for collocated joins in a distributed MPP system.

How to eliminate wrong answers

Option A is wrong because partitioning by OrderDate would help with partition elimination for scans or maintenance, but it does not address the fundamental issue of data movement required when join columns are not aligned with the distribution key. Option B is wrong because outdated statistics can cause suboptimal query plans, but the primary performance bottleneck here is the physical data movement across distributions, not statistics. Option C is wrong because round-robin distribution distributes rows evenly without any key, which would still require full data movement for any join, making performance even worse than hash distribution on a non-join column.

Full explanation →

114

Multi-Selectmedium

Which TWO Azure services can be used to monitor Azure Data Factory pipeline runs and set up alerts?

Select 2 answers

A.Log Analytics

B.Microsoft Sentinel

C.Azure Policy

D.Azure Monitor

E.Azure Advisor

AnswersA, D

Log Analytics queries logs and can trigger alerts.

Why this answer

Azure Monitor provides metrics and logs for ADF, and Log Analytics allows querying and alerting based on logs. Option C is wrong because Microsoft Sentinel is a SIEM that ingests from Log Analytics, not directly. Option D is wrong because Azure Policy is for governance.

Option E is wrong because Azure Advisor provides recommendations, not monitoring.

Full explanation →

115

MCQhard

You are designing a solution to store semi-structured JSON logs from a web application in Azure Cosmos DB. The logs are written once and rarely read. The application writes up to 10,000 documents per second, and each document is about 2 KB. You need to minimize RU/s cost. Which API and indexing policy should you choose?

A.Azure Cosmos DB for Cassandra with default indexing policy.

B.Azure Cosmos DB for Table with default indexing policy.

C.Azure Cosmos DB for NoSQL with default indexing policy (automatic indexing of all fields).

D.Azure Cosmos DB for MongoDB with a custom indexing policy that only includes a wildcard index on _id and disables automatic indexing.

AnswerD

Disabling automatic indexing and indexing only _id minimizes RU consumption for writes.

Why this answer

Option D is correct because Azure Cosmos DB for MongoDB with a custom indexing policy that disables automatic indexing and only includes a wildcard index on _id minimizes RU/s cost for write-heavy, rarely read workloads. Each write operation consumes RUs proportional to the number of indexed paths; by eliminating automatic indexing of all fields, you drastically reduce the per-document write RU charge. The MongoDB API supports this fine-grained indexing control, making it ideal for high-throughput ingestion of semi-structured JSON logs.

Exam trap

The trap here is that candidates often assume the NoSQL API (Option C) is always the best choice for JSON data, overlooking that its default indexing policy incurs significant RU overhead for write-heavy workloads, whereas the MongoDB API allows disabling indexing to minimize cost.

How to eliminate wrong answers

Option A is wrong because Azure Cosmos DB for Cassandra uses a default indexing policy that indexes all columns, which would incur high RU/s costs for the 10,000 writes per second. Option B is wrong because Azure Cosmos DB for Table also indexes all properties by default, leading to unnecessary RU consumption for write-heavy, rarely read logs. Option C is wrong because Azure Cosmos DB for NoSQL with default automatic indexing of all fields would maximize RU/s cost per write, which directly contradicts the requirement to minimize cost.

Full explanation →

116

Drag & Dropmedium

Drag and drop the steps to set up Azure Data Factory pipeline with parameterization and dynamic expressions into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

Create the pipeline, define parameters, use them in activities and linked services, then trigger with values.

Full explanation →

117

MCQeasy

Refer to the exhibit. You are reviewing an Azure Stream Analytics job query. The job has a stream input and a reference data input. The job is failing with the error 'Reference data input must be of type Reference, not Stream'. What is the cause of the error?

A.The input alias is incorrect.

B.The JOIN syntax is incorrect for reference data.

C.The input type is Stream; it should be Reference.

D.The output type is set to ReferenceData; it should be a different type.

AnswerD

ReferenceData is only for input; output must be a sink type.

Why this answer

Option B is correct because the output type is incorrectly set to ReferenceData; it should be set to a sink type like SQL or Blob. Option A is wrong because the alias is not the issue. Option C is wrong because the query syntax is correct.

Option D is wrong because the input type is correct.

Full explanation →

118

Multi-Selectmedium

You are building a data processing pipeline in Azure Synapse Analytics that ingests data from Azure Blob Storage and writes to a dedicated SQL pool. You need to ensure the pipeline can handle schema changes in the source data without failing. Which TWO actions should you take?

Select 2 answers

A.Use staged copy with an intermediate storage.

B.Enable schema drift in a Mapping Data Flow activity.

C.Define a fixed schema in the source dataset.

D.Use PolyBase to load data into the dedicated SQL pool.

E.Use the Copy activity with AutoCreateTable enabled.

AnswersB, E

Mapping Data Flow can automatically handle schema drift by mapping new columns.

Why this answer

Mapping Data Flows in Azure Synapse Analytics support schema drift, which allows the pipeline to dynamically handle changes in source data structure (e.g., new columns, changed data types) without failing. By enabling schema drift, the data flow can automatically detect and propagate these changes downstream, ensuring resilience against schema evolution.

Exam trap

The trap here is that candidates often confuse PolyBase or staged copy with schema drift handling, but those features are designed for performance or staging, not for dynamic schema adaptation.

Full explanation →

119

MCQhard

Match each Azure storage service to its primary use case.

A.Big data analytics

B.Globally distributed, low-latency applications

C.Relational OLTP

D.NoSQL key-value store

E.Azure Data Lake Storage Gen2

F.Azure Cosmos DB

G.Azure SQL Database

H.Azure Table Storage

Why this answer

Azure Data Lake Storage Gen2 is for big data analytics. Azure Cosmos DB is for globally distributed, low-latency applications. Azure SQL Database is for relational OLTP.

Azure Table Storage is for NoSQL key-value store.

Exam trap

Candidates often confuse Azure Data Lake Storage with Blob Storage; Data Lake Storage is Blob Storage with hierarchical namespace, optimized for analytics.

Full explanation →

120

MCQhard

Refer to the exhibit. A data engineer runs a Synapse Spark job that fails with the error shown. Which configuration change is most likely to resolve the issue?

A.Change the executor cores to 1

B.Switch to a different Spark pool with the same configuration

C.Reduce the number of executors to 1 for Job1

D.Increase executor memory to 4g for Job1

AnswerD

Increasing executor memory provides more heap space, which directly addresses the OutOfMemoryError.

Why this answer

The error 'Java heap space' indicates the executor memory is insufficient. Job1 uses 2g memory with 2 executors, while Job2 succeeds with 4g and 4 executors. Increasing executor memory or adding executors can resolve the OOM error.

Full explanation →

121

MCQeasy

A company uses Azure Data Lake Storage Gen2 to store sensor data. They notice that queries on the data are slow. Which feature should they enable to optimize query performance without moving data?

A.Implement Change Data Capture (CDC).

B.Enable Azure Search on the storage account.

C.Use PolyBase to query the data.

D.Enable hierarchical namespace on the storage account.

AnswerD

Hierarchical namespace organizes data in directories, improving query performance.

Why this answer

Enabling hierarchical namespace on Azure Data Lake Storage Gen2 organizes blobs into a directory hierarchy, which allows query engines like Azure Synapse Analytics and Apache Spark to perform directory-level pruning and partition elimination. This reduces the amount of data scanned during queries, directly improving performance without requiring data movement or restructuring.

Exam trap

The trap here is that candidates often confuse PolyBase (a query engine) with a storage optimization feature, or assume that enabling a search service or CDC will improve query performance on static data, when in fact the hierarchical namespace is the only option that directly optimizes storage layout for faster queries.

How to eliminate wrong answers

Option A is wrong because Change Data Capture (CDC) is a pattern for tracking row-level changes in relational databases (e.g., Azure SQL Database) and does not optimize query performance on static data in Data Lake Storage. Option B is wrong because Azure Search is a cognitive search service for indexing and full-text search over unstructured content, not a query acceleration feature for analytical workloads on Data Lake Storage. Option C is wrong because PolyBase is a data virtualization technology for querying external data sources (e.g., Hadoop, Azure Blob Storage) from SQL Server or Azure Synapse, but it does not enable a performance optimization on the storage account itself; it is a query engine, not a storage-level feature.

Full explanation →

122

MCQmedium

You are designing a data processing solution in Azure using Azure Data Lake Storage Gen2 as the storage layer. You need to ensure that data ingested from various sources is immutable and can be used for both batch and streaming workloads. Which storage design pattern should you implement?

A.Store data in a normalized relational database structure.

B.Implement a medallion architecture with bronze, silver, and gold layers.

C.Use a data vault model with hubs, links, and satellites.

D.Design a star schema with fact and dimension tables.

AnswerB

The medallion architecture provides data immutability and supports both processing paradigms.

Why this answer

Option A is correct because a medallion architecture (bronze, silver, gold) is a common pattern for organizing data in a lakehouse, providing immutability in bronze, transformations in silver, and aggregations in gold. It supports both batch and streaming. Option B is wrong because a star schema is a dimensional modeling approach for data warehouses, not a storage pattern for immutability.

Option C is wrong because a data vault is an enterprise data modeling pattern, not specifically for immutability in a lake. Option D is wrong because a normalized relational model is not suited for big data lake scenarios.

Full explanation →

123

MCQhard

A multinational corporation uses Azure Data Lake Storage Gen2 to store petabytes of parquet files partitioned by date and hour. Data scientists report that queries on the last 7 days of data take over 30 minutes, while queries on older data are fast. The storage account uses the default Azure Blob Storage hierarchical namespace. Which action will MOST improve query performance on recent data?

A.Convert the parquet files to CSV format to reduce metadata overhead

B.Enable soft delete on the storage account to reduce read latency

C.Optimize the partition layout by partitioning by date first, then by hour, to reduce the number of partitions scanned for recent data

D.Apply Z-order clustering on the parquet files using Azure Databricks

AnswerC

Recent data queries scan fewer partitions, improving performance.

Why this answer

Option C is correct because partitioning by date first, then by hour, ensures that queries filtering on the last 7 days scan only the relevant date partitions, drastically reducing the amount of data read. In Azure Data Lake Storage Gen2, the hierarchical namespace allows partition pruning at the directory level, so a date-first layout minimizes the number of partitions scanned for recent data, directly addressing the performance bottleneck.

Exam trap

The trap here is that candidates often confuse partition layout optimization with data format or clustering techniques, overlooking that the hierarchical namespace in ADLS Gen2 makes directory-level partition pruning the most impactful lever for time-range queries.

How to eliminate wrong answers

Option A is wrong because converting parquet to CSV would increase file size and read overhead, as CSV lacks columnar compression and predicate pushdown capabilities, making queries slower, not faster. Option B is wrong because soft delete is a data protection feature that adds metadata overhead for deleted objects and does not reduce read latency; it actually increases storage costs and can degrade performance due to additional index lookups. Option D is wrong because Z-order clustering in Azure Databricks optimizes data layout within a partition for multi-dimensional queries, but it does not reduce the number of partitions scanned; the primary issue is scanning too many partitions, not intra-partition data skew.

Full explanation →

124

MCQhard

You are a data engineer for a financial services company. The company stores sensitive transaction data in Azure Data Lake Storage Gen2. The data is partitioned by date and loaded daily via Azure Data Factory. Recently, an audit found that the storage account allows public network access, and some containers have anonymous read access enabled. You need to secure the storage account according to the principle of least privilege while ensuring that Azure Data Factory can still load data. You must also ensure that data can be accessed by Azure Databricks for analytics. The solution must minimize administrative overhead. Which course of action should you take?

A.Keep public network access enabled but restrict it to specific IP addresses. Use storage account keys for Azure Data Factory and Azure Databricks.

B.Disable public network access. Create private endpoints for the storage account and configure Azure Data Factory and Azure Databricks to use the private endpoints. Use RBAC to assign 'Storage Blob Data Contributor' to the managed identities.

C.Enable public network access with a firewall rule to allow only the Azure Data Factory and Azure Databricks IP ranges. Keep anonymous access enabled but set the containers to private.

D.Disable public network access. Set the storage account firewall to allow access only from Azure services. Configure Azure Data Factory and Azure Databricks to use managed identities. Grant the managed identities the 'Storage Blob Data Contributor' role at the container level. Remove any anonymous access.

AnswerD

This meets security requirements, uses managed identities for authentication, and applies least privilege via RBAC.

Why this answer

Option D is correct because disabling public network access and using managed identities with RBAC (Storage Blob Data Contributor) aligns with the principle of least privilege while minimizing administrative overhead. Azure Data Factory and Azure Databricks can authenticate via managed identities without managing keys or IP ranges, and removing anonymous access eliminates the security gap. The firewall rule allowing access only from Azure services ensures that only Azure-internal traffic can reach the storage account, which is sufficient for these services when they are in the same region.

Exam trap

The trap here is that candidates often overcomplicate the solution by choosing private endpoints (Option B) or IP-based firewalls (Option A/C), not realizing that the 'Allow Azure services' firewall rule combined with managed identities provides a simpler, least-privilege-compliant path for first-party Azure services.

How to eliminate wrong answers

Option A is wrong because keeping public network access enabled and using storage account keys violates least privilege—keys are shared secrets that are hard to rotate and audit, and IP restriction alone does not prevent access from other Azure services or compromised endpoints. Option B is wrong because creating private endpoints adds significant administrative overhead (VNet configuration, DNS resolution, peering) and is not necessary when the services can use managed identities over the Azure backbone network; it also does not address the anonymous access issue directly. Option C is wrong because enabling public network access with firewall rules for IP ranges is brittle—Azure Data Factory and Azure Databricks IP ranges can change without notice, leading to failures, and keeping anonymous access enabled (even with private containers) still allows enumeration of container names and potential misconfiguration.

Full explanation →

125

MCQmedium

Your organization uses Azure Synapse Analytics to run large-scale queries. You need to implement a solution that automatically pauses the dedicated SQL pool when no activity is detected for a specified period, and resumes it when a connection is attempted. What should you configure?

A.You cannot configure auto-pause for a dedicated SQL pool.

B.Use Azure Automation to run a PowerShell script that pauses and resumes the pool.

C.Create an Azure Function that pauses the pool on a schedule.

D.Set the auto-pause delay in the SQL pool properties.

AnswerA

Auto-pause is only for serverless SQL pools.

Why this answer

Option D is correct because Azure Synapse Analytics dedicated SQL pool does not support auto-pause; auto-pause is only for serverless SQL pools. Option A (auto-pause setting) is for serverless. Option B (Azure Automation runbook) could work but is not a built-in feature.

Option C (Azure Functions) is custom. The correct answer is that you cannot configure auto-pause for dedicated SQL pool.

Full explanation →

126

MCQmedium

You are designing a data ingestion pipeline for Azure Synapse Analytics. The pipeline will load sensitive financial data from an on-premises SQL Server to a dedicated SQL pool. The data must be encrypted at rest and in transit. Which combination of features should you use?

A.Use Always Encrypted in SQL Server and Azure Key Vault for column encryption.

B.Enable Azure Disk Encryption on the source server and use a VPN gateway.

C.Implement client-side encryption using Azure Storage client library and store keys in Microsoft Purview.

D.Use Azure Synapse Transparent Data Encryption (TDE) and enforce TLS 1.2 for connections.

AnswerD

TDE encrypts data at rest in Synapse, and TLS secures data in transit.

Why this answer

Option C is correct because Transparent Data Encryption (TDE) encrypts data at rest in the dedicated SQL pool, and TLS ensures encryption in transit. Option A is wrong because Always Encrypted is for column-level encryption, not for the entire pipeline. Option B is wrong because Azure Disk Encryption is for IaaS VMs, not for Azure Synapse.

Option D is wrong because client-side encryption is not built-in for Synapse ingestion.

Full explanation →

127

MCQmedium

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query performance is slow. Which action would most likely improve performance?

A.Partition the Parquet files by date

B.Create a materialized view on the external data

C.Create a PolyBase external table and query that

D.Create statistics on the columns used in WHERE clauses

AnswerD

Statistics enable the optimizer to generate better execution plans.

Why this answer

In Azure Synapse Analytics serverless SQL pool, query performance on external data like Parquet files heavily relies on statistics to enable the optimizer to generate efficient query plans. Without statistics, the engine makes default cardinality estimates, often leading to suboptimal joins and filters. Creating statistics on columns used in WHERE clauses allows the serverless SQL pool to accurately estimate row counts and choose better execution strategies, directly improving query performance.

Exam trap

The trap here is that candidates often assume physical data organization (partitioning) or materialized views are the primary levers for performance in serverless SQL pool, overlooking that statistics are the critical metadata for the query optimizer in this stateless, compute-on-demand environment.

How to eliminate wrong answers

Option A is wrong because partitioning Parquet files by date is a storage optimization that primarily benefits file pruning and partition elimination in batch processing or dedicated SQL pools, but serverless SQL pool does not automatically leverage file partitioning for query optimization in the same way; it relies on file metadata and statistics instead. Option B is wrong because materialized views are not supported in serverless SQL pool; they are a feature of dedicated SQL pool in Azure Synapse Analytics. Option C is wrong because PolyBase external tables are the default mechanism for querying external data in serverless SQL pool, so creating one does not change the query execution path or improve performance; the issue is lack of statistics, not the table type.

Full explanation →

128

MCQmedium

You are designing a solution to monitor the performance of Azure Synapse Analytics dedicated SQL pools. You need to identify queries that are consuming more than 100 GB of memory and have been running for more than 30 minutes. Which DMV should you query?

A.sys.dm_pdw_resource_waits

B.sys.dm_pdw_nodes

C.sys.dm_pdw_exec_requests

D.sys.dm_pdw_waits

AnswerC

Contains memory and duration columns for queries.

Why this answer

Option B is correct because sys.dm_pdw_exec_requests provides query execution details including memory and duration. Option A is wrong because sys.dm_pdw_nodes is for node-level info. Option C is wrong because sys.dm_pdw_resource_waits shows waits, not memory usage.

Option D is wrong because sys.dm_pdw_waits shows wait types, not memory consumption.

Full explanation →

129

MCQmedium

You are designing a data storage solution for a healthcare organization that stores patient records. The solution must comply with HIPAA and support point-in-time restore (PITR) for the last 35 days. The data is frequently accessed for reporting. Which Azure data service should you use?

A.Azure Cosmos DB

B.Azure Blob Storage with point-in-time restore

C.Azure SQL Database

D.Azure Data Lake Storage Gen2

AnswerC

Azure SQL Database supports PITR up to 35 days and is HIPAA-eligible when configured properly.

Why this answer

Azure SQL Database is the correct choice because it natively supports point-in-time restore (PITR) for up to 35 days (configurable from 7 to 35 days) and is a HIPAA-eligible service when configured with encryption, auditing, and network security. It provides transactional consistency required for patient records and supports frequent reporting workloads with features like columnstore indexes and read replicas.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage's point-in-time restore (which only applies to container-level recovery of blobs) with the transactional point-in-time restore needed for relational patient records, or they assume Cosmos DB's multi-model capabilities make it suitable for structured healthcare data despite its lack of ACID compliance and limited backup retention.

How to eliminate wrong answers

Option A is wrong because Azure Cosmos DB does not support point-in-time restore for the last 35 days; its backup retention is limited to 8 hours for continuous backups or 30 days for periodic backups, and it is not a relational database, making it unsuitable for structured patient records requiring ACID transactions. Option B is wrong because Azure Blob Storage with point-in-time restore is designed for object storage, not transactional data; it lacks relational query capabilities, foreign key constraints, and native support for HIPAA-compliant auditing and row-level security required for patient records. Option D is wrong because Azure Data Lake Storage Gen2 is optimized for big data analytics and hierarchical namespace storage, not for transactional workloads with point-in-time restore; it does not provide built-in PITR for individual records and is not designed for frequent reporting on structured patient data.

Full explanation →

130

MCQmedium

You are optimizing an Azure Data Factory pipeline that moves data from Azure Blob Storage to Azure SQL Database. The pipeline currently uses a Copy activity with a staging setting enabled. You notice high DTU consumption on the Azure SQL Database during the copy operation. How can you reduce the impact on the source database?

A.Configure the Copy activity to use staging with Azure Blob Storage and then use a stored procedure to bulk insert.

B.Use PolyBase to load data directly from Blob Storage to Azure SQL Database.

C.Enable parallel copy in the Copy activity settings.

D.Increase the DTU of the Azure SQL Database to handle the load.

AnswerA

Staging allows the copy to offload transformation to Blob Storage, and bulk insert reduces DTU consumption.

Why this answer

Option B is correct because using PolyBase in Azure Synapse Analytics (not SQL Database) is for bulk loads, but the target is Azure SQL Database. Actually, the correct approach is to use staging copy with Azure Blob Storage as staging and then use PolyBase? Wait, the target is Azure SQL Database, not Synapse. For SQL Database, using a staging blob and then bulk insert can reduce DTU consumption.

But among options, Option A (increase DTU) is scaling, not reduction. Option B (use PolyBase) is for Synapse. Option C (use staging copy with Azure Blob Storage and then use bulk insert) is correct.

Option D (use parallel copy) increases throughput but also DTU. So correct is C.

Full explanation →

131

Multi-Selectmedium

A data engineer is building a pipeline in Azure Data Factory to load data from Azure Blob Storage to Azure Synapse Analytics using PolyBase. The pipeline must handle large files efficiently and minimize load time. Which TWO actions should the engineer take? (Choose two.)

Select 2 answers

A.Split large files into multiple smaller files (e.g., 256 MB each).

B.Use a staging table to stage the data before loading.

C.Convert files to Parquet format for better compression.

D.Use the copy activity with staging enabled.

E.Configure PolyBase settings to use 'RejectValue' for handling bad records.

AnswersA, E

Enables parallelism.

Why this answer

Options B and D are correct. Splitting large files into multiple smaller files enables parallel loading with PolyBase. Using the 'RejectValue' option handles errors without failing the entire load.

Option A is incorrect because Parquet is not supported by PolyBase to Synapse dedicated SQL pool. Option C is incorrect because PolyBase does not use staging tables. Option E is incorrect because copy activity with staging uses PolyBase but the question asks for PolyBase-specific optimizations.

Full explanation →

132

Multi-Selecteasy

Which TWO configuration options are required to enable change data capture (CDC) in Azure Data Factory for incremental loading from Azure SQL Database?

Select 2 answers

A.Create a dedicated folder in Azure Blob Storage for CDC logs.

B.Enable CDC on the source Azure SQL Database.

C.Create a 'Change data capture' resource in Azure Data Factory.

D.Provision a staging Azure SQL Database.

E.Configure a full load of the entire table first.

AnswersB, C

CDC must be enabled at the database level.

Why this answer

Option B is correct because enabling CDC on the source Azure SQL Database is a prerequisite for Azure Data Factory (ADF) to capture incremental changes. ADF uses the native CDC mechanism of the source database, which requires the database to have CDC enabled at the database level and on the specific tables being tracked. Without this, ADF cannot read the change data from the system tables.

Exam trap

The trap here is that candidates confuse the 'Change Data Capture' resource in ADF (which is a configuration option within a dataset) with a separate resource type that must be explicitly created, leading them to select option C as a required configuration when it is actually part of the dataset setup and not a standalone resource.

Full explanation →

133

Multi-Selecthard

Which TWO actions should you take to secure access to Azure Data Lake Storage Gen2 when using Azure Synapse Analytics serverless SQL pools? (Choose two.)

Select 2 answers

A.Enable firewall rules and virtual network service endpoints for the storage account.

B.Use managed identity authentication from the serverless SQL pool to the storage account.

C.Use a service principal with a client secret to access the storage account.

D.Generate a shared access signature (SAS) token for the storage account and store it securely.

E.Use Azure Active Directory (Azure AD) passthrough authentication for the serverless pool.

AnswersA, B

Network restrictions reduce the attack surface.

Why this answer

Options A and D are correct. A: Enabling firewall and virtual network rules restricts network access. D: Using managed identity for authentication avoids storing credentials.

Option B is wrong because Azure AD passthrough is only for dedicated SQL pools, not serverless. Option C is wrong because service principal is less secure than managed identity. Option E is wrong because shared access signatures should be avoided for security.

Full explanation →

134

MCQhard

You are designing a data storage solution for a global e-commerce company. The company's analytics team uses Azure Synapse Serverless SQL to query Parquet files in ADLS Gen2. The data is partitioned by year, month, and day. The team frequently runs queries that aggregate sales by product category across the last 30 days. However, the queries are slow and scanning too much data. What should you do to improve performance?

A.Increase the number of files per partition to improve parallelism.

B.Convert the Parquet files to CSV format to reduce file size.

C.Create a view that uses explicit partition pruning by filtering on year, month, and day in the WHERE clause.

D.Add a new partition by product category to the folder structure.

AnswerC

Serverless SQL can eliminate partitions when filters are on partition columns.

Why this answer

Option C is correct because creating a view with explicit partition pruning by filtering on year, month, and day in the WHERE clause forces Synapse Serverless SQL to read only the relevant partitions (the last 30 days) instead of scanning the entire dataset. This reduces the amount of data scanned, directly improving query performance and lowering cost.

Exam trap

The trap here is that candidates often think adding more partitions (Option D) or changing file formats (Option B) will help, but they fail to recognize that the real bottleneck is the lack of partition pruning in the query itself, not the storage layout.

How to eliminate wrong answers

Option A is wrong because increasing the number of files per partition does not reduce the amount of data scanned; it can actually increase metadata overhead and slow down queries due to more file open/close operations. Option B is wrong because converting Parquet to CSV would increase file size (Parquet is columnar and compressed, CSV is row-based and uncompressed), leading to more data scanned and slower performance. Option D is wrong because adding a partition by product category would create a deep folder hierarchy that does not help with the existing partition pruning on date columns; the queries already filter by date, so the date-based partition structure is sufficient, and adding another partition level would not reduce the scan for the last 30 days.

Full explanation →

135

MCQmedium

Your company uses Azure Data Factory to load data from an on-premises SQL Server database to Azure Synapse Analytics. The pipeline runs daily and the load volume is about 50 GB per run. Recently, the pipeline started failing with timeout errors during the copy activity. You need to ensure reliable data transfer. What should you do?

A.Use a self-hosted integration runtime and increase the copy activity timeout

B.Use PolyBase to load data in parallel

C.Split the copy into multiple smaller activities

D.Switch to Azure Integration Runtime

AnswerA

Self-hosted IR connects to on-premises; increasing timeout prevents failures.

Why this answer

Option B (Use a self-hosted integration runtime and increase the timeout) is correct because a self-hosted IR is required for on-premises sources, and increasing the timeout can handle larger volumes. Option A (Use Azure Integration Runtime) does not connect to on-premises. Option C (Use PolyBase) is for loading into Synapse but does not address the connectivity or timeout.

Option D (Split the data into smaller chunks) might help but is not the primary solution; increasing timeout is more direct.

Full explanation →

136

MCQhard

Refer to the exhibit. You have an Azure Data Factory dataset definition for a Parquet file stored in Azure Data Lake Storage Gen2. You attempt to use this dataset as a source in a copy activity, but the copy activity fails with an error indicating that the file is not found. The file 'sales_orders.parquet' exists at the specified path. What is the most likely cause of the error?

A.The dataset is configured with a specific file name, but the data is partitioned across multiple folders; you should use a wildcard pattern in the folderPath.

B.The compression codec 'snappy' is not supported for Parquet files.

C.The dataset type should be 'AzureBlobStorageLocation' instead of 'AzureBlobFSLocation'.

D.The linked service 'ADLSGen2' is not properly configured with the storage account key.

AnswerA

While the exhibit shows a single file, the error suggests the file is not found, likely because the dataset expects a different path structure. Using wildcards would allow the copy activity to find the file within the partition structure.

Why this answer

Option A is correct because the dataset definition specifies a single file name ('sales_orders.parquet') in the 'fileName' property, but the actual data is likely stored across multiple Parquet files in a partitioned folder structure (e.g., 'sales_orders/year=2024/month=01/'). When a copy activity uses this dataset as a source, Azure Data Factory looks for the exact file name at the specified path, not finding any file because the data is spread across subfolders. Using a wildcard pattern (e.g., '*.parquet') in the 'folderPath' or setting 'fileName' to '*' allows the copy activity to read all Parquet files in the folder, resolving the 'file not found' error.

Exam trap

The trap here is that candidates often assume the error is due to authentication or configuration issues (like linked service keys or location types) rather than recognizing that the dataset's explicit file name prevents ADF from scanning subdirectories for partitioned files.

How to eliminate wrong answers

Option B is wrong because Snappy is a widely supported compression codec for Parquet files in Azure Data Factory; it is not the cause of a 'file not found' error. Option C is wrong because 'AzureBlobFSLocation' is the correct location type for Azure Data Lake Storage Gen2 (which uses the Blob Filesystem API), while 'AzureBlobStorageLocation' is for Azure Blob Storage (which uses the Blob REST API); using the wrong type would cause a different error, not a 'file not found' error. Option D is wrong because if the linked service were misconfigured (e.g., wrong storage account key), the error would typically indicate authentication or authorization failure, not a 'file not found' error.

Full explanation →

137

MCQmedium

You are configuring a data pipeline in Azure Data Factory that uses a Mapping Data Flow. The source is a SQL Server table with 50 million rows. The sink is a Delta table in ADLS Gen2. The pipeline runs slowly. You need to improve performance by reducing the number of partitions in the data flow. Which setting should you adjust?

A.In the source transformation, set the 'Optimize' tab to 'Use current partitioning'.

B.Decrease the cluster size to 4 cores.

C.In the sink transformation, set the 'Optimize' tab to 'Single partition'.

D.In the pipeline activity, set the 'Data flow' property 'partitionOption' to 'Single partition'.

AnswerC

Sink optimization controls how data is written; setting to single partition reduces file count.

Why this answer

Option C is correct because setting the sink transformation's 'Optimize' tab to 'Single partition' forces all data to be written into a single partition, which reduces the number of output files and the associated metadata overhead. This can improve performance when the downstream processing or storage benefits from fewer, larger files, especially for small-to-medium datasets or when the sink is a Delta table that must be compacted later.

Exam trap

The trap here is that candidates often confuse the 'Optimize' tab settings on the sink with the 'Partition option' in the source or pipeline activity, or mistakenly think that reducing cluster size (Option B) is a valid performance tuning technique for partition reduction.

How to eliminate wrong answers

Option A is wrong because 'Use current partitioning' in the source transformation does not reduce partitions; it preserves the existing partition count from the source, which may still be high and cause slow performance. Option B is wrong because decreasing the cluster size to 4 cores reduces compute resources, which typically worsens performance rather than improving it, and does not directly control the number of partitions in the data flow. Option D is wrong because the 'partitionOption' property is not available at the pipeline activity level for Mapping Data Flows; partition settings are configured within the data flow transformations themselves, not in the pipeline activity properties.

Full explanation →

138

MCQhard

You are a data engineer for a multinational e-commerce company. The company uses Azure Synapse Analytics as its data warehouse. The current fact table, SalesFact, is distributed using hash distribution on the CustomerID column. It has 2 billion rows and is 2 TB in size. Recently, the business team has been running many queries that aggregate sales by product category and date, and these queries are experiencing high data movement and long execution times. The product dimension table (ProductDim) has 100,000 rows and is 100 MB. The date dimension table (DateDim) has 5,000 rows and is 5 MB. You need to redesign the storage to minimize data movement for these aggregation queries. You cannot change the fact table distribution key to ProductID because of other critical queries that rely on CustomerID. What should you do?

A.Create materialized views on the fact table that aggregate by product category and date

B.Replicate the ProductDim and DateDim tables to all compute nodes

C.Partition the fact table by date and keep the same distribution

D.Change the fact table distribution to round-robin and create non-clustered indexes on ProductID and DateID

AnswerB

Replication eliminates data movement for small dimension tables.

Why this answer

Option B is correct because replicating small dimension tables (ProductDim at 100 MB and DateDim at 5 MB) to all compute nodes eliminates the need to shuffle these tables across nodes during joins. In Azure Synapse Analytics, replicated tables are copied to each distribution, so when the fact table (hash-distributed on CustomerID) joins with ProductDim and DateDim on ProductID and DateID, no data movement occurs for the dimension tables. This directly reduces the high data movement and long execution times for aggregation queries by product category and date.

Exam trap

The trap here is that candidates often choose materialized views (Option A) thinking they solve all aggregation performance issues, but they overlook that data movement from joins with non-replicated dimension tables remains the bottleneck, whereas table replication directly addresses the shuffle cost for small dimension tables.

How to eliminate wrong answers

Option A is wrong because materialized views in Azure Synapse Analytics pre-aggregate data but still require the underlying fact table's distribution; they do not eliminate data movement when joining with non-replicated dimension tables, and the queries would still suffer from shuffling ProductDim and DateDim. Option C is wrong because partitioning the fact table by date improves partition elimination for date-range filters but does not reduce data movement during joins; the hash distribution on CustomerID remains, so joins on ProductID and DateID still require redistributing the fact table or dimension tables. Option D is wrong because changing to round-robin distribution would distribute fact table rows randomly, causing even more data movement for all joins and aggregations, and non-clustered indexes do not address the fundamental distribution issue for large-scale aggregation queries.

Full explanation →

139

MCQeasy

Your organization uses Azure Data Lake Storage Gen2 (ADLS Gen2) and wants to transform data using Azure Databricks. The data is stored in Parquet format. You need to read the data into a Spark DataFrame. Which DataFrame reader method should you use?

A.spark.read.avro()

B.spark.read.json()

C.spark.read.parquet()

D.spark.read.csv()

AnswerC

The parquet() method reads Parquet files directly.

Why this answer

Option C is correct because the data is stored in Parquet format, and the Spark DataFrame reader method `spark.read.parquet()` is specifically designed to read Parquet files, which is a columnar storage format optimized for big data processing in Azure Databricks.

Exam trap

The trap here is that candidates may confuse file format reader methods (e.g., using `spark.read.avro()` for Parquet data) due to assuming all binary formats are interchangeable, but each reader method is strictly tied to its specific file format.

How to eliminate wrong answers

Option A is wrong because `spark.read.avro()` is used for Avro format, not Parquet. Option B is wrong because `spark.read.json()` is for JSON files, which are text-based and not columnar like Parquet. Option D is wrong because `spark.read.csv()` is for CSV files, which are row-based and lack the compression and schema efficiency of Parquet.

Full explanation →

140

MCQhard

Your company uses Azure Data Lake Storage Gen2 and Azure Databricks for data processing. Some Parquet files in the lake are written with a schema that includes a column 'address' of struct type. A downstream process expects 'address' to be a string. You need to transform the data in a way that minimizes read overhead and does not rewrite the entire dataset. Which approach should you use?

A.Use schema evolution in Delta Lake to change the column type.

B.Apply a schema-on-read transformation that casts the column to string when reading.

C.Create a view that selects the column as a string and save it as a new table.

D.Rewrite the Parquet files with the correct schema using a Spark job.

AnswerB

Efficient, no rewrite needed.

Why this answer

Option C is correct because schema-on-read allows you to read the struct and cast it to string without rewriting files. Option A is incorrect because rewriting is costly. Option B is incorrect because schema evolution does not change data types.

Option D is incorrect because it would create a new copy.

Full explanation →

141

Multi-Selectmedium

Which TWO of the following are built-in components of Azure Data Factory that can be used to perform data transformation without writing custom code?

Select 2 answers

A.Mapping Data Flow

B.Copy Activity

C.Data Flow Script

D.Azure Function Activity

E.Stored Procedure Activity

AnswersA, C

Mapping Data Flows allow visual data transformation without code.

Why this answer

Mapping Data Flow (A) is a built-in Azure Data Factory component that allows you to design data transformations visually using a graphical interface, without writing any code. It executes at scale on Azure Databricks clusters and supports operations like joins, aggregations, and pivots, making it a code-free transformation tool.

Exam trap

The trap here is that candidates often confuse the Copy Activity with a transformation tool because it can perform simple column mapping and data type conversions, but it lacks the ability to handle complex, multi-step transformations like those in Mapping Data Flows.

Full explanation →

142

MCQhard

You are designing a batch processing solution for a financial services company that processes transactions from multiple sources. The data is stored in Azure Data Lake Storage Gen2 in Parquet format. You need to perform complex transformations including joins, aggregations, and window functions, and then load the results into an Azure Synapse Analytics dedicated SQL pool. The transformations must be written in SQL and executed on a serverless Spark cluster to minimize costs. You want to manage the code in a Git repository and automate the deployment using Azure DevOps. Which approach should you recommend?

A.Use Azure Databricks with Auto Loader to read from Data Lake Storage, perform transformations using Spark SQL, and write to the dedicated SQL pool via JDBC.

B.Use Azure Data Factory with Mapping Data Flows to visually design transformations and write to the dedicated SQL pool.

C.Create a Spark job definition in Azure Synapse that reads Parquet files, performs transformations using PySpark, and writes to the dedicated SQL pool using the Spark Synapse connector.

D.Use a serverless SQL pool in Azure Synapse to query the Parquet files using T-SQL, then use CETAS to write the results to the dedicated SQL pool.

AnswerD

Serverless SQL pool is cost-effective for ad-hoc querying; CETAS allows moving data to dedicated pool.

Why this answer

Option B is correct because Synapse Serverless SQL pool can query Parquet files directly using T-SQL, and you can use CREATE EXTERNAL TABLE AS SELECT (CETAS) to write results to the dedicated SQL pool. This avoids Spark costs and uses serverless compute. Option A uses Spark, which costs more.

Option C uses Azure Databricks, which is external and adds complexity. Option D uses Azure Data Factory with Data Flows that run on Spark, also incurring costs.

Full explanation →

143

MCQeasy

You have an Azure Synapse Analytics serverless SQL pool. You need to monitor the number of queries that are currently executing. Which dynamic management view should you query?

A.sys.dm_resource_governor_workload_groups

B.sys.dm_exec_query_stats

C.sys.dm_exec_requests

D.sys.dm_exec_sessions

AnswerC

This DMV returns information about each request currently executing.

Why this answer

Option A is correct because sys.dm_exec_requests shows currently executing requests in serverless SQL pool. Option B is wrong because sys.dm_exec_sessions shows sessions, not executing queries. Option C is wrong because sys.dm_exec_query_stats shows cumulative statistics.

Option D is wrong because sys.dm_resource_governor_workload_groups shows workload group configuration.

Full explanation →

144

MCQeasy

A data engineer needs to store semi-structured JSON log files from a web application. Each log entry is about 1 KB. The logs are rarely queried (once a month) and must be retained for 7 years for compliance. The solution must minimize storage cost. Which storage option should be used?

A.Store the logs in Azure SQL Database as a table.

B.Store the logs in Azure Files share.

C.Store the logs in Azure Blob Storage with cool access tier.

D.Store the logs in Azure Cosmos DB with a JSON container.

AnswerC

Blob Storage cool tier is low-cost for infrequent access, suitable for logs.

Why this answer

Azure Blob Storage with the cool access tier is the correct choice because it is optimized for storing large amounts of semi-structured data (like JSON logs) at low cost, with infrequent access (once a month) and long retention (7 years). The cool tier offers lower storage costs than hot or premium tiers, while still providing high durability and the ability to query logs using tools like Azure Data Lake Storage or serverless SQL. This meets the compliance requirement without the high compute or transaction costs of a database solution.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB (D) because it natively supports JSON, but they overlook the extreme cost of storing and rarely querying 7 years of data in a globally distributed, high-throughput NoSQL database, which is optimized for frequent, low-latency access, not archival.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational database designed for structured, transactional workloads with frequent queries, and it incurs high storage and compute costs for 7 years of 1 KB log entries, making it far more expensive than blob storage for rarely accessed data. Option B is wrong because Azure Files provides SMB/NFS file shares primarily for shared file access in VMs or on-premises apps, not for cost-effective, long-term archival of semi-structured logs, and it lacks the tiered pricing and lifecycle management of blob storage. Option D is wrong because Azure Cosmos DB is a NoSQL database optimized for low-latency, globally distributed, and frequently queried data; storing 7 years of rarely accessed logs in Cosmos DB would incur prohibitive costs due to its per-request unit (RU) pricing and storage charges, far exceeding blob storage costs.

Full explanation →

145

Multi-Selecthard

You are designing a data storage solution that must support both operational and analytical workloads using a single copy of data. Which THREE technologies should you consider?

Select 3 answers

A.Azure SQL Database with materialized views

B.Azure SQL Database with change data capture (CDC)

C.Azure SQL Database with PolyBase

D.Azure Synapse Link for Cosmos DB

E.Azure Cosmos DB with analytical store (HTAP)

AnswersC, D, E

PolyBase allows querying external data sources, enabling hybrid workloads.

Why this answer

Azure Synapse Link for Cosmos DB and Azure Cosmos DB with analytical store (HTAP) enable hybrid transactional/analytical processing by automatically replicating operational data into a column-oriented analytical store, allowing you to run near-real-time analytical queries on the same copy of data without impacting transactional performance. PolyBase in Azure SQL Database allows querying external data sources (like Azure Blob Storage or Hadoop) using T-SQL, but it does not support both operational and analytical workloads on a single copy of data—it is primarily for data virtualization and loading.

Exam trap

The trap here is that candidates often confuse PolyBase's ability to query external data with the concept of supporting both operational and analytical workloads on a single copy of data, but PolyBase is a data virtualization tool, not an HTAP solution that maintains a single copy for both workloads.

Full explanation →

146

MCQeasy

You need to monitor resource usage for an Azure Synapse Analytics dedicated SQL pool. Which dynamic management view (DMV) should you query to see current query execution status?

A.sys.dm_pdw_nodes_db_size

B.sys.dm_pdw_os_performance_counters

C.sys.dm_pdw_exec_requests

D.sys.dm_pdw_nodes_resource_usage

AnswerC

Shows current and recent requests with status.

Why this answer

sys.dm_pdw_exec_requests provides information about all requests currently executing or recently executed in the dedicated SQL pool. Option A is for resource usage, Option B is for database sizes, Option D is for distribution states.

Full explanation →

147

MCQhard

You are configuring Microsoft Purview to scan an Azure Data Explorer (ADX) cluster. You define the JSON shown in the exhibit. However, the scan fails with an authentication error. What is the most likely cause?

A.The ADX cluster is behind a firewall and Purview cannot connect.

B.The database name in the scan configuration is incorrect.

C.The ADX cluster does not have a managed identity enabled for Purview.

D.The region of the ADX cluster does not match the Purview account region.

AnswerC

Purview uses managed identity for authentication; it must be enabled.

Why this answer

Option C is correct because Purview requires a system-assigned managed identity (or user-assigned) to be enabled on the ADX cluster and granted permissions. Option A (firewall) could be a cause, but the error is authentication specifically. Option B (region mismatch) is not a common issue.

Option D (incorrect database) would cause a different error.

Full explanation →

148

MCQmedium

You are a data engineer for a financial services company. You manage an Azure Data Lake Storage Gen2 account that stores real-time stock trade data ingested from Azure Event Hubs via Azure Stream Analytics. The data is partitioned by date and symbol. Each day, a downstream Azure Databricks job runs an ETL process to aggregate trades into 5-minute bars and writes the results to a separate container. The Databricks job runs on a cluster with 10 worker nodes (Standard_DS3_v2) using Auto-Scaling enabled (2-10 workers). Recently, the job has been taking longer than expected, and you observe that the cluster is often at 10 workers but still the job duration increased by 30%. The storage account shows high transaction costs. You suspect the issue is related to how data is read. What should you do to optimize the job's performance and reduce costs?

A.Convert the data to Avro format to reduce file size.

B.Increase the maximum number of workers to 20 and use a larger instance type.

C.Modify the Stream Analytics job to output larger files (e.g., set the minimum file size to 100 MB) and use coalesce in Databricks to reduce the number of output partitions.

D.Move the data to Azure Blob Storage Premium tier to reduce latency.

AnswerC

Larger input files reduce metadata overhead, and coalescing reduces output files, improving performance and reducing costs.

Why this answer

Option C is correct because the performance issue stems from reading many small files (small file problem) in Azure Data Lake Storage Gen2, which increases transaction costs and slows down Spark jobs. By configuring Stream Analytics to output larger files (e.g., minimum 100 MB) and using coalesce in Databricks to reduce output partitions, you minimize the number of files read/written, reducing overhead and transaction costs. This directly addresses the root cause—high transaction costs and cluster saturation at 10 workers—without unnecessary scaling or tier changes.

Exam trap

The trap here is that candidates often assume performance issues require scaling up (more workers or larger instances) or changing storage tiers, when the real problem is inefficient data layout (small files) causing excessive I/O and transaction costs.

How to eliminate wrong answers

Option A is wrong because converting to Avro reduces file size but does not solve the small file problem; it may even increase the number of small files if the output is not coalesced, and Avro's compression benefits are marginal for already-compressed data. Option B is wrong because increasing workers to 20 and using larger instances would increase costs without fixing the underlying issue of many small files; the cluster is already at max workers (10) and still slow, indicating a bottleneck in file I/O, not compute capacity. Option D is wrong because moving to Azure Blob Storage Premium tier improves latency but does not reduce the number of transactions or small files; it would increase costs without addressing the root cause of high transaction costs from reading many small files.

Full explanation →

149

MCQeasy

You are designing a data pipeline that uses Azure Data Factory to copy data from an Azure SQL database to Azure Data Lake Storage Gen2. The data contains personally identifiable information (PII) that must be masked. Which Data Factory feature should you use?

A.Use a copy activity with a query to select only non-PII columns.

B.Use a stored procedure activity to mask data in the source before copy.

C.Enable staging on the copy activity to use PolyBase.

D.Use a mapping data flow to apply a mask transformation on PII columns.

AnswerD

Data flows support mask transformations.

Why this answer

Option B is correct because Data Factory data flows allow column-level transformations including data masking. Option A is wrong because copy activity only copies data without transformation. Option C is wrong because stored procedures run on the source, not during copy.

Option D is wrong because staging is for bulk copy performance, not masking.

Full explanation →

150

MCQeasy

You need to process a large dataset that contains personally identifiable information (PII). The data must be anonymized before being used for analytics. Which Azure service should you use to apply column-level masking dynamically?

A.Azure API Management policies

B.Azure Synapse Analytics dynamic data masking

C.Azure Data Lake Storage access control lists (ACLs)

D.Azure Purview classification and labeling

AnswerB

Dynamic data masking hides sensitive data from non-privileged users in query results.

Why this answer

Azure Synapse Analytics provides dynamic data masking at the column level. Azure Purview is for data governance. Azure API Management is for APIs.

Azure Data Lake Storage does not provide masking.

Full explanation →

Page 2 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →