Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 601–675

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 9 of 12

601

MCQhard

Refer to the exhibit. You have an Azure Synapse Analytics workspace. You need to ensure that data processing jobs can access the Data Lake Storage Gen2 account using a managed identity. What should you do?

A.Use the SQL admin login credentials to access the storage account

B.Enable the system-assigned managed identity on the Synapse workspace and assign it the 'Storage Blob Data Contributor' role on the storage account

C.Create a private endpoint connection between the workspace and the storage account

D.Configure the storage account firewall to allow access from the Synapse workspace

AnswerB

The managed identity needs RBAC permissions on the storage account.

Why this answer

Option B is correct because Azure Synapse Analytics supports system-assigned managed identities, which provide a secure, passwordless authentication method for accessing Azure Data Lake Storage Gen2. By enabling the managed identity on the Synapse workspace and assigning it the 'Storage Blob Data Contributor' role, you grant the workspace's data processing jobs the necessary permissions to read, write, and delete data in the storage account without managing credentials.

Exam trap

The trap here is that candidates often confuse network-level access controls (firewall rules or private endpoints) with identity-based authorization (RBAC), mistakenly thinking that allowing network traffic alone is sufficient for data access.

How to eliminate wrong answers

Option A is wrong because using SQL admin login credentials to access a storage account is not supported; SQL authentication is for database access, not for Azure Storage RBAC. Option C is wrong because creating a private endpoint ensures network-level isolation and private connectivity, but it does not grant the identity permissions to access the storage account; RBAC role assignment is still required. Option D is wrong because configuring the storage account firewall to allow access from the Synapse workspace only controls network traffic, not authentication or authorization; the managed identity still needs the appropriate RBAC role to perform data operations.

Full explanation →

602

MCQeasy

Your Azure Data Lake Storage Gen2 account stores sensitive data. You need to audit who accesses the data and when, and you want to send the audit logs to a Log Analytics workspace for analysis. What should you configure?

A.Azure Activity Logs

B.Microsoft Sentinel

C.Azure Monitor alerts

D.Diagnostic settings on the storage account

AnswerD

Diagnostic settings enable streaming of data plane audit logs to Log Analytics.

Why this answer

Diagnostic settings on the storage account can stream audit logs (like read, write, delete) to Log Analytics. Option B is wrong because Azure Activity Logs capture control plane operations, not data plane access. Option C is wrong because Azure Monitor alerts are for notifications, not log collection.

Option D is wrong because Microsoft Sentinel requires data ingestion from diagnostic settings first.

Full explanation →

603

Matchingmedium

Match each Azure service to its primary purpose in a data engineering pipeline.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Scalable data lake for analytics workloads

Unified analytics platform with SQL and Spark

Cloud-based ETL and data integration service

Real-time stream processing service

Apache Spark-based analytics platform

Why these pairings

These are core Azure data services commonly used in DP-203.

Full explanation →

604

MCQeasy

You are using Azure Data Factory to copy data from Azure Blob Storage to Azure SQL Database. The copy operation fails with error 'Cannot insert duplicate key'. What is the most likely cause and solution?

A.The source files are in an incompatible format; convert to CSV

B.The sink table schema does not match the source; update the schema

C.The copy activity is not using staging; enable staging

D.The sink table has a primary key, and the source contains duplicate rows; enable upsert in the copy activity

AnswerD

Upsert behavior can handle duplicate key violations.

Why this answer

The error indicates duplicate rows violating a primary key or unique constraint. Enabling upsert behavior allows the copy activity to handle duplicates. Option A is wrong because source format doesn't cause duplicates.

Option B is wrong because using staging doesn't fix duplicates. Option D is wrong because sink table schema is not the cause.

Full explanation →

605

Multi-Selecthard

Which THREE of the following are valid methods to load data into Azure Synapse Analytics?

Select 3 answers

A.Azure Data Factory pipeline

B.COPY INTO T-SQL command

C.BULK INSERT with a data source from Azure Blob Storage

D.SQL Server Integration Services (SSIS) package

E.PolyBase from Azure Blob Storage

AnswersA, B, E

Data Factory can copy data into Synapse using the Copy activity.

Why this answer

Azure Data Factory (ADF) is a fully managed cloud-based ETL service that provides native connectors to Azure Synapse Analytics. It supports high-throughput data movement using the PolyBase engine or staged copy, making it a valid and recommended method for loading data at scale into Synapse dedicated SQL pools.

Exam trap

The trap here is that candidates often confuse BULK INSERT (which works only for SQL Server on-premises or IaaS VMs) with the COPY INTO command (which is the Synapse-specific equivalent for Azure Blob Storage), leading them to incorrectly select Option C as valid.

Full explanation →

606

MCQmedium

Refer to the exhibit. You have an Azure Data Factory with two triggers defined as shown. The DailyTrigger runs the CopyPipeline every day at midnight UTC. The BlobTrigger runs the ProcessPipeline when a blob is created in the /input/ folder. You notice that the ProcessPipeline is not executing even though blobs are being created. What is the most likely cause?

A.The blobPathBeginsWith property is missing the container name.

B.The BlobEventsTrigger is configured to listen to the wrong event type.

C.The ProcessPipeline expects parameters that are not provided by the trigger.

D.The storage account does not have an event subscription configured for blob creation.

AnswerD

BlobEventsTrigger requires an event subscription to route events to Data Factory.

Why this answer

Option B is correct because BlobEventsTrigger requires a storage event subscription, which must be configured separately. Option A is wrong because the trigger is properly defined with the event type. Option C is wrong because the container is not specified; but the path begins with /input/ which implies a container.

Option D is wrong because pipeline parameters are not required.

Full explanation →

607

MCQhard

You are troubleshooting a pipeline in Azure Data Factory that copies data from an Azure Blob Storage to an Azure Synapse Analytics dedicated SQL pool. The pipeline fails with the error: 'PolyBase requires a varchar(max) column to be less than 1 MB.' Which action should you take to resolve this issue?

A.Modify the source data to truncate varchar(max) columns to 8000 characters.

B.Configure the copy activity to use staging via Azure Blob Storage.

C.Increase the 'batchSize' property in the copy activity to 10000.

D.Disable PolyBase in the sink settings and use bulk insert instead.

AnswerB

Using staging allows PolyBase to handle large varchar(max) columns by breaking them into smaller chunks.

Why this answer

The error indicates that PolyBase is being used for the copy operation, and it has a limitation that varchar(max) columns must be less than 1 MB. Configuring staging via Azure Blob Storage (option B) allows the copy activity to use PolyBase with staging, which automatically splits large varchar(max) values into manageable chunks, bypassing the 1 MB limit. This is the recommended approach in Azure Data Factory for loading large string data into Synapse dedicated SQL pools.

Exam trap

The trap here is that candidates often assume the error requires truncating data or switching to bulk insert, but the correct solution leverages PolyBase's staging feature to handle large columns without data loss.

How to eliminate wrong answers

Option A is wrong because truncating varchar(max) columns to 8000 characters is a data loss solution and does not address the PolyBase limitation; PolyBase's 1 MB limit is on the total size of the column value, not character count, and truncation may still exceed 1 MB if the data is multi-byte. Option C is wrong because the 'batchSize' property controls the number of rows per batch for bulk insert operations, not the size of individual columns, and it does not affect PolyBase's varchar(max) size restriction. Option D is wrong because disabling PolyBase and using bulk insert would work but is less performant and not the recommended resolution; the error specifically occurs when PolyBase is enabled, and using staging with PolyBase is the intended fix.

Full explanation →

608

MCQmedium

A company uses Azure SQL Database for an OLTP application. They need to run complex analytical queries without impacting OLTP performance. Which solution should they implement?

A.Enable automatic tuning

B.Enable geo-replication

C.Create a readable secondary replica

D.Create a failover group

AnswerC

Correct. A readable secondary replica allows read-only queries without affecting the primary.

Why this answer

Creating a readable secondary replica (Option C) offloads read-only analytical queries to a synchronized copy of the database, isolating them from the primary OLTP workload. Azure SQL Database supports this via Active Geo-Replication or Hyperscale named replicas, ensuring the primary remains unaffected by heavy analytical processing.

Exam trap

The trap here is that candidates confuse high-availability features (failover groups, geo-replication) with workload isolation, assuming any replication solves the performance impact, whereas only a dedicated readable secondary explicitly separates read-only analytical traffic from the primary OLTP workload.

How to eliminate wrong answers

Option A is wrong because automatic tuning focuses on index and query plan optimization for the primary database, not on isolating analytical workloads. Option B is wrong because geo-replication provides disaster recovery and read-scale capabilities, but its primary purpose is not to offload complex analytical queries without impacting OLTP; it can be used for read-only workloads but is not the optimal solution for analytical isolation. Option D is wrong because a failover group manages geo-replication and failover for high availability, not for distributing analytical queries away from the primary.

Full explanation →

609

Multi-Selectmedium

You are using Azure Stream Analytics to process real-time data from an Event Hub. Which TWO of the following are valid output sinks?

Select 2 answers

A.Azure Synapse Analytics

B.Azure Blob Storage

C.Azure Queue Storage

D.Azure Files

E.Azure Table Storage

AnswersA, B

Supported via a dedicated SQL pool.

Why this answer

Azure Stream Analytics supports Azure Synapse Analytics as a native output sink, allowing you to write real-time streaming results directly into dedicated SQL pools for high-performance analytics. This is achieved via the built-in Azure Synapse Analytics output adapter, which uses PolyBase or COPY INTO for efficient bulk ingestion.

Exam trap

The trap here is that candidates often confuse Azure Table Storage or Queue Storage as valid sinks because they are general Azure storage services, but Stream Analytics has a specific, limited set of supported output sinks documented in official Microsoft documentation.

Full explanation →

610

Multi-Selecthard

Which TWO actions should be taken to secure data at rest in Azure Data Lake Storage Gen2?

Select 2 answers

A.Enable soft delete for blobs

B.Configure firewall and virtual network rules

C.Enable customer-managed keys in Azure Key Vault

D.Assign storage blob data contributor roles to users

E.Use client-side encryption with Azure Key Vault

AnswersB, C

Restricts network access to storage.

Why this answer

Option B is correct because configuring firewall and virtual network rules restricts network-level access to the storage account, preventing unauthorized traffic from reaching the data at rest. Option C is correct because enabling customer-managed keys (CMK) in Azure Key Vault provides an additional encryption layer for data at rest, allowing you to control and rotate encryption keys independently of Azure-managed keys.

Exam trap

The trap here is confusing data protection mechanisms (soft delete, RBAC, client-side encryption) with the specific Azure-native controls for securing data at rest (network isolation and encryption key management).

Full explanation →

611

MCQeasy

Your company runs an Azure Data Factory pipeline that copies data from an FTP server to Azure Blob Storage daily. Recently, the pipeline has been failing with the error: 'Failure happened on 'Source' side. ErrorCode=UserErrorFailedFileOperation, Error details: The remote server returned an error: (550) File unavailable (e.g., file not found, no access).' The FTP server administrator confirms that the file exists and the credentials are correct. You need to resolve the issue with minimal administrative effort. What should you do?

A.Use an SFTP connector instead of FTP

B.Reset the FTP server credentials in the linked service

C.Check the file path and correct the case sensitivity in the dataset

D.Ask the FTP administrator to re-upload the file

AnswerC

FTP servers often use case-sensitive paths.

Why this answer

The error indicates the file is not found or access denied. The most common cause is that the file path is case-sensitive on the FTP server. Verify the correct path with case sensitivity.

Option A is wrong because credentials are confirmed correct. Option B is wrong because the file exists. Option D is wrong because changing the connector may not help if path is wrong.

Full explanation →

612

Multi-Selecthard

Which THREE factors should you consider when choosing a shard key for Azure Cosmos DB to ensure even distribution and optimal performance?

Select 3 answers

A.Stable key values that do not change

B.Frequent updates to the key value

C.High cardinality of the key

D.Low cardinality of the key

E.Even distribution of request volume across partitions

AnswersA, C, E

Stable keys avoid partition splits and data movement.

Why this answer

Stable key values that do not change (Option A) are correct because Azure Cosmos DB uses the shard key to determine the physical partition for each document. If the key value changes after insertion, the document cannot be moved to a different partition, leading to cross-partition queries and degraded performance. A stable key ensures that the partition assignment remains consistent throughout the document's lifetime, which is essential for optimal query routing and throughput management.

Exam trap

The trap here is that candidates often confuse 'low cardinality' with 'high cardinality' or assume that frequently updated keys are acceptable because they think Cosmos DB can repartition dynamically, but in reality, the shard key is immutable after document creation.

Full explanation →

613

Multi-Selecthard

Which THREE of the following are required to implement column-level security in Azure Synapse Analytics dedicated SQL pool?

Select 2 answers

A.A GRANT statement on specific columns to users or roles

B.A VIEW that selects only the allowed columns

C.A DENY statement on specific columns to users or roles

D.A row-level security policy must be in place

E.The database user must have a default schema

AnswersA, E

GRANT allows access to specified columns.

Why this answer

Option A is correct because column-level security in Azure Synapse Analytics dedicated SQL pool is implemented using GRANT statements on specific columns. By granting SELECT on only certain columns to a user or role, you restrict access to sensitive data at the column level without needing to create views or modify schemas. This is the native mechanism provided by SQL Server and Azure Synapse for column-level security.

Exam trap

The trap here is that candidates often confuse column-level security with row-level security or assume that DENY statements can be used at the column level, but Azure Synapse only supports GRANT for column-level permissions and does not support DENY on individual columns.

Full explanation →

614

MCQeasy

You are implementing a data pipeline using Azure Data Factory. The source is an on-premises SQL Server database. Which Azure Data Factory component is required to connect to the on-premises data source?

A.Azure Integration Runtime

B.Self-hosted Integration Runtime

C.Managed Virtual Network Integration Runtime

D.Azure Data Factory Gateway

AnswerB

Why this answer

A self-hosted integration runtime (IR) is required to connect Azure Data Factory to on-premises SQL Server because it provides the compute environment for data movement between on-premises networks and Azure. It must be installed on a machine inside the corporate firewall, enabling secure communication via outbound HTTPS (port 443) to Azure. This is the only IR type that can access private, on-premises data sources directly.

Exam trap

The trap here is that candidates often confuse the Self-hosted Integration Runtime with the Azure Integration Runtime, not realizing that only the self-hosted variant can bridge on-premises and cloud networks, while the Azure IR is restricted to cloud-to-cloud scenarios.

Why the other options are wrong

A

Azure IR runs in the cloud and cannot access on-premises networks directly.

C

Managed VNet IR is for secure access to Azure resources, not on-premises.

D

While historically called Gateway, the correct term is Self-hosted Integration Runtime.

Full explanation →

615

MCQmedium

Refer to the exhibit. You are monitoring the CopyDataPipeline in Azure Data Factory. The copy activity is failing with timeout errors. What is the most likely cause?

A.The writeBatchTimeout is set too low (30 seconds), causing timeouts

B.The enableStaging is false, causing network congestion

C.The writeBatchSize is too large (10000), exceeding SQL limits

D.The recursive property is set to true, causing infinite loops

AnswerA

30 seconds may be insufficient for large batches.

Why this answer

The copy activity in Azure Data Factory is failing with timeout errors because the `writeBatchTimeout` is set to 30 seconds, which is too low for the volume of data being written to the sink. This property defines the maximum time allowed for a single batch write operation to complete; when it expires, the activity times out. Increasing this value (e.g., to 120 seconds or more) accommodates larger or slower writes, resolving the timeout.

Exam trap

The trap here is that candidates often confuse `writeBatchTimeout` with `writeBatchSize`, assuming a large batch size causes timeouts, but the timeout is a separate property that controls how long the system waits for a batch to complete, not the size of the batch itself.

How to eliminate wrong answers

Option B is wrong because `enableStaging` being false does not cause network congestion; staging is an optional intermediate storage for large data transfers or to enable PolyBase, and its absence does not inherently lead to timeouts. Option C is wrong because `writeBatchSize` of 10000 rows is within typical SQL Database limits (default batch size is 10,000 rows for bulk insert), and exceeding limits would cause row-level errors, not timeout errors. Option D is wrong because `recursive` property controls whether subdirectories are processed in file-based sources (e.g., Blob storage) and has no effect on timeout behavior in a copy activity; it cannot cause infinite loops.

Full explanation →

616

MCQmedium

A data engineer is designing a batch processing pipeline that reads data from Azure Blob Storage, transforms it using Azure Databricks, and writes the output to Azure Synapse Analytics. The source files are in CSV format and arrive daily at 02:00 UTC. The transformation must be idempotent and the pipeline should handle late-arriving data (up to 2 hours). What is the best approach to trigger the pipeline?

A.Storage event trigger using Azure Event Grid

B.Schedule trigger set to 02:00 UTC daily

C.Tumbling window trigger with window size of 1 day and a late arrival window of 2 hours

D.Event trigger on blob creation in the container

AnswerC

Ensures idempotency and handles late data by allowing up to 2 hours delay.

Why this answer

Option C is correct because a tumbling window trigger in Azure Data Factory allows you to define a fixed-size window (1 day) and a late arrival window (2 hours), which ensures idempotent processing by automatically rerunning the window for late-arriving data within the specified delay. This matches the requirement for daily batch processing at 02:00 UTC while handling data arriving up to 2 hours late.

Exam trap

Microsoft often tests the distinction between schedule triggers (fixed time) and tumbling window triggers (window-based with late arrival handling), where candidates mistakenly choose a simple schedule trigger because they overlook the late-arriving data requirement.

How to eliminate wrong answers

Option A is wrong because a Storage event trigger using Azure Event Grid fires on every blob creation event, which would cause duplicate processing for late-arriving data and does not guarantee idempotency without custom deduplication logic. Option B is wrong because a Schedule trigger set to 02:00 UTC daily cannot handle late-arriving data; it runs only at the scheduled time and misses files that arrive after the trigger execution. Option D is wrong because an Event trigger on blob creation in the container is event-driven and will process each blob individually, leading to non-idempotent behavior and potential out-of-order processing for late-arriving files.

Full explanation →

617

Matchingmedium

Match each Azure monitoring service to its function.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Collect and analyze telemetry from Azure resources

Query and analyze log data

Numerical data from Azure resources

Interactive analytics on large telemetry datasets

Why these pairings

These services are used for monitoring and diagnostics.

Full explanation →

618

MCQmedium

You are designing a data processing solution for a retail company. The solution must ingest clickstream data from a website, process it in near real-time to update user session information, and store the results in Azure Cosmos DB for low-latency queries. Which combination of Azure services should you use?

A.Azure Event Hubs -> Azure Stream Analytics -> Azure SQL Database

B.Azure IoT Hub -> Azure Data Factory -> Azure Blob Storage

C.Azure Event Hubs -> Azure Stream Analytics -> Azure Cosmos DB

D.Azure Event Hubs -> Azure Functions -> Azure Cosmos DB

AnswerD

Event Hubs ingests events, Functions performs lightweight processing, Cosmos DB provides low-latency access.

Why this answer

Option D is correct because Azure Event Hubs ingests streaming data, Azure Functions processes it (e.g., for sessionization), and Cosmos DB stores it. Option A is wrong because Azure SQL Database is not ideal for low-latency key-value queries. Option B is wrong because Blob Storage is not designed for real-time querying.

Option C is wrong because Stream Analytics is more for analytical processing, but Cosmos DB is correct, but the combination with Azure SQL DB is not optimal.

Full explanation →

619

MCQeasy

You need to store log data from multiple Azure services in a single location for long-term retention and cost-effective querying. The data is append-only and rarely modified. Which storage solution should you use?

A.Azure SQL Database

B.Azure Data Lake Storage Gen2

C.Azure Table Storage

D.Azure Cosmos DB

AnswerB

ADLS Gen2 provides scalable, cost-effective storage for log data with analytics capabilities.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is optimized for append-only, rarely modified data like logs, offering hierarchical namespace, POSIX-like ACLs, and cost-effective tiered storage (hot/cool/archive). It supports high-throughput querying via Azure Synapse, Athena, or Spark, making it ideal for long-term retention and analytics on immutable log data.

Exam trap

The trap here is that candidates confuse 'append-only' with a database requirement and pick Azure SQL Database or Cosmos DB, overlooking that log storage prioritizes cost and schema-on-read over transactional consistency or low-latency writes.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational OLTP store with high transaction costs and schema rigidity, not designed for append-only log blobs or cost-effective long-term retention. Option C is wrong because Azure Table Storage is a NoSQL key-value store for semi-structured data with limited query capabilities (only on partition/row key) and no native support for hierarchical namespaces or large-scale analytics. Option D is wrong because Azure Cosmos DB is a globally distributed, multi-model database optimized for low-latency reads/writes and real-time applications, not for cost-effective archival of append-only logs; its RU-based pricing makes long-term storage expensive.

Full explanation →

620

Multi-Selecthard

You are monitoring an Azure Data Lake Storage Gen2 account that stores streaming data from IoT devices. You notice that query performance on the data in Parquet format is degrading over time. You need to improve query performance for both current and future data. Which TWO actions should you take?

Select 2 answers

A.Move frequently accessed data to Azure SQL Database.

B.Partition the data by a column commonly used in filter conditions.

C.Convert the Parquet files to Delta Lake format and enable file compaction.

D.Enable soft delete on the storage account to optimize read performance.

E.Migrate the data to Azure NetApp Files for lower latency.

AnswersB, C

Partitioning reduces the amount of data scanned per query.

Why this answer

Partitioning the data by a column commonly used in filter conditions (e.g., date, device ID) enables predicate pushdown in query engines like Azure Synapse or Spark, allowing them to skip irrelevant partitions and scan only the necessary files. This directly addresses the performance degradation by reducing the amount of data read during queries, and it benefits both current and future data when applied consistently.

Exam trap

The trap here is that candidates often confuse data protection features (like soft delete) or storage migration options (like Azure SQL or NetApp Files) with performance optimization techniques, failing to recognize that partitioning and file format optimization are the standard solutions for improving query performance on large-scale Parquet data in a data lake.

Full explanation →

621

MCQmedium

You are designing a data storage solution for a retail company that needs to store transaction data that is frequently updated and requires strong consistency. The solution must support complex queries and joins across multiple tables. Which Azure data service should you recommend?

A.Azure Cosmos DB

B.Azure SQL Database

C.Azure Synapse Analytics

D.Azure Table Storage

AnswerB

Why this answer

Azure SQL Database is a fully managed relational database service that provides strong consistency, supports complex queries and joins across multiple tables, and is optimized for frequently updated transaction data. It offers ACID compliance and built-in high availability, making it the ideal choice for this retail scenario.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB for its low-latency and global distribution capabilities, overlooking that it does not provide native relational joins or the strong consistency required for transactional workloads, which Azure SQL Database is specifically designed for.

Why the other options are wrong

A

Cosmos DB is NoSQL and while it can be configured for strong consistency, it does not natively support complex joins across multiple tables as efficiently as a relational database.

C

Synapse is a data warehouse for analytics, not designed for transactional workloads with frequent updates.

D

Table Storage is a NoSQL key-value store with limited query capabilities and no support for complex joins.

Full explanation →

622

MCQeasy

You are designing a streaming job in Azure Stream Analytics. The job needs to count the number of events per device type every 10 seconds. The input is from Event Hubs. Which query should you use?

A.SELECT DeviceType, COUNT(*) FROM Input GROUP BY DeviceType, SessionWindow(second, 10, 30)

B.SELECT DeviceType, COUNT(*) FROM Input GROUP BY DeviceType, TumblingWindow(second, 10)

C.SELECT DeviceType, COUNT(*) FROM Input GROUP BY DeviceType, HoppingWindow(second, 10, 1)

D.SELECT DeviceType, COUNT(*) FROM Input GROUP BY DeviceType, SlidingWindow(second, 10)

AnswerB

Tumbling window outputs exactly every 10 seconds.

Why this answer

Option B is correct because a TumblingWindow(second, 10) produces non-overlapping, fixed-size 10-second windows, which is exactly what is needed to count events per device type every 10 seconds. The GROUP BY clause groups by DeviceType and the window, ensuring each device type gets its own count per window. This query meets the requirement without overlapping or sliding behavior.

Exam trap

The trap here is that candidates confuse HoppingWindow with TumblingWindow, thinking a hop size of 1 second still produces 10-second intervals, but HoppingWindow emits results at every hop, not at the window duration, leading to incorrect output frequency.

How to eliminate wrong answers

Option A is wrong because SessionWindow(second, 10, 30) defines session windows based on inactivity gaps, not fixed 10-second intervals; the 30-second timeout means windows can be much longer than 10 seconds, violating the requirement. Option C is wrong because HoppingWindow(second, 10, 1) creates overlapping windows that emit results every 1 second, not every 10 seconds, leading to redundant counts. Option D is wrong because SlidingWindow(second, 10) produces a continuous stream of results for every event within the last 10 seconds, not discrete 10-second intervals, so it does not count events 'every 10 seconds' as a batch.

Full explanation →

623

Multi-Selectmedium

Which TWO features can be used to audit access to data in Azure Storage? (Choose two.)

Select 2 answers

A.Azure Monitor diagnostic settings

B.Azure Storage analytics logs

C.Azure RBAC role assignments

D.Azure Policy

E.Microsoft Defender for Cloud

AnswersA, B

Sends logs to Log Analytics for querying

Why this answer

Options A and D are correct. Option A: Storage analytics logs capture successful and failed requests. Option D: Azure Monitor diagnostic settings can send logs to Log Analytics for auditing.

Option B is wrong because Azure Policy enforces compliance, not auditing. Option C is wrong because Microsoft Defender for Cloud provides security alerts, not detailed access logs. Option E is wrong because Azure RBAC is for access control, not auditing.

Full explanation →

624

MCQhard

You are a data engineer at a financial services company. The company uses Azure Synapse Analytics with a dedicated SQL pool for its data warehouse. The current table 'FactTransactions' is 2 TB and uses round-robin distribution. Query performance is poor for queries that frequently filter on 'CustomerID' and join with a 'DimCustomer' table (10 GB, replicated). You need to redesign the table to improve query performance while minimizing data movement during queries. The solution must also support incremental data loading with minimal overhead. You cannot change the storage size limit or add more DWU. What should you do?

A.Replicate the FactTransactions table to all distributions.

B.Keep round-robin distribution but add indexes on CustomerID and TransactionDate.

C.Change distribution to hash on TransactionDate and partition by month.

D.Recreate the table using hash distribution on CustomerID and use CTAS for incremental loads.

AnswerD

Collocates data on CustomerID, reducing data movement; CTAS handles incremental loads.

Why this answer

Option D is correct because hash distribution on CustomerID ensures that rows with the same CustomerID are co-located on the same distribution, which eliminates data movement during joins with the replicated DimCustomer table. Using CTAS (CREATE TABLE AS SELECT) for incremental loads allows you to efficiently rebuild the table with minimal overhead by loading only new data into a staging table, then swapping partitions or using CTAS to replace the target table without blocking reads.

Exam trap

The trap here is that candidates often choose hash distribution on a date column (Option C) thinking it helps with time-based queries, but the question specifically requires improving join performance on CustomerID, so the distribution key must match the join key to avoid data movement.

How to eliminate wrong answers

Option A is wrong because replicating a 2 TB FactTransactions table to all distributions would exceed the storage capacity (each distribution would hold a full copy, multiplying storage by 60 distributions) and is not feasible given the storage size limit. Option B is wrong because round-robin distribution distributes rows randomly across distributions, so queries filtering on CustomerID still require data shuffling across nodes to gather matching rows, and indexes on a columnstore table are not effective for point lookups in a distributed environment. Option C is wrong because hash distribution on TransactionDate does not align with the join key (CustomerID), so joins with DimCustomer would still cause data movement; partitioning by month adds maintenance overhead and does not reduce data shuffling for the join.

Full explanation →

625

MCQmedium

You are using Azure Synapse Analytics dedicated SQL pool to run a query that joins a large fact table (10 billion rows) and a small dimension table (1 million rows). The query is slow. Which distribution strategy should you use for the dimension table to improve performance?

A.Round-robin distribute the dimension table.

B.Hash-distribute the dimension table on its primary key.

C.Replicate the dimension table to all compute nodes.

D.Hash-distribute the dimension table on the foreign key column.

AnswerC

Replication avoids data movement for small tables.

Why this answer

Replicating the small dimension table (1 million rows) to all compute nodes eliminates data movement during the join with the large fact table (10 billion rows). In Azure Synapse dedicated SQL pool, replicated tables store a full copy on each distribution, so the join can be performed locally on every node without shuffling data across the network, drastically reducing query latency.

Exam trap

The trap here is that candidates often choose hash distribution on the foreign key (Option D) thinking it aligns the join keys, but they overlook that the fact table is typically distributed on a different column (e.g., its own primary key or a date column), so the join still requires data movement, whereas replication is the optimal strategy for small dimension tables in a star schema.

How to eliminate wrong answers

Option A is wrong because round-robin distribution spreads the dimension table evenly across distributions without any alignment with the fact table, causing all join operations to require data movement (shuffle) across nodes, which is highly inefficient for a large fact table. Option B is wrong because hash-distributing the dimension table on its primary key does not align with the fact table's distribution key (typically the foreign key), so the join will still require redistributing one or both tables unless the fact table is also hash-distributed on the same column. Option D is wrong because hash-distributing the dimension table on the foreign key column would scatter its rows across distributions, but the fact table is likely hash-distributed on a different column (e.g., its own primary key or a different foreign key), so the join would still cause data movement; moreover, dimension tables are typically small and benefit more from replication than from hash distribution.

Full explanation →

626

MCQhard

Your company uses Azure Synapse Analytics and has deployed a pipeline that uses a Mapping Data Flow to transform data. The data flow reads from a source in Azure Blob Storage and writes to a dedicated SQL pool. You notice that the data flow is running slowly and consuming a lot of Data Flow cluster resources. You need to improve performance without increasing the cluster size. Which action should you take?

A.Use a self-hosted integration runtime instead of the default auto-resolve IR.

B.Increase the batch size in the data flow settings to reduce the number of round trips.

C.Add a partitioning step in the data flow to distribute the data across partitions based on a key column.

D.Change the source format to Delta Lake to leverage optimizations.

AnswerC

Partitioning the data can improve parallelism and performance.

Why this answer

Adding a partitioning step in the Mapping Data Flow distributes data across partitions based on a key column, which allows parallel processing across the cluster's nodes. This reduces data shuffling and improves throughput without increasing the cluster size, directly addressing the performance bottleneck caused by skewed or unpartitioned data.

Exam trap

The trap here is that candidates often confuse increasing batch size (Option B) with improving parallelism, but batch size only affects sink write operations, not the internal data processing distribution that causes cluster resource exhaustion.

How to eliminate wrong answers

Option A is wrong because using a self-hosted integration runtime (IR) would introduce network latency and management overhead, and it does not address the internal data flow processing inefficiency; the default auto-resolve IR is optimized for Azure services. Option B is wrong because increasing the batch size in data flow settings reduces round trips to the sink but does not resolve the root cause of slow transformation performance, which is data distribution and parallelism within the cluster. Option D is wrong because changing the source format to Delta Lake does not inherently improve performance for a Mapping Data Flow reading from Blob Storage; Delta Lake optimizations (like Z-ordering) require a Delta Lake engine and are not applicable to the current pipeline's source format.

Full explanation →

627

MCQhard

You are designing a batch processing pipeline in Azure Databricks. The data is stored in Delta Lake and you need to perform a time-series join between two tables: 'events' (100 billion rows) and 'sessions' (10 billion rows). The join condition is on 'device_id' and a timestamp range (event_time BETWEEN session_start AND session_end). Which join strategy would be most efficient?

A.Broadcast the smaller table (sessions) to all nodes.

B.Use a range join with interval threshold using Delta Lake's optimized join.

C.Use a sort-merge join by repartitioning both tables on device_id.

D.Bucket both tables on device_id with 500 buckets.

AnswerB

Delta Lake supports range join optimization with interval thresholds, reducing data shuffle.

Why this answer

Option B is correct because Delta Lake's optimized range join leverages interval threshold pruning and data skipping to efficiently handle time-series joins on large datasets. This strategy avoids full shuffles by using min/max statistics and Bloom filters to eliminate non-matching partitions, making it far more efficient than generic join methods for 100B and 10B row tables.

Exam trap

Microsoft often tests the misconception that broadcasting a large table is acceptable if it fits in memory, but the trap here is that candidates overlook the driver memory limit and assume broadcast join scales linearly, while the correct answer requires understanding Delta Lake's specialized range join optimization for time-series data.

How to eliminate wrong answers

Option A is wrong because broadcasting a 10 billion row 'sessions' table would exceed driver memory and cause out-of-memory errors; broadcast joins are only suitable for small tables (typically < 1 GB). Option C is wrong because a sort-merge join with repartitioning on 'device_id' alone does not optimize the timestamp range condition, leading to a full shuffle of both massive tables and poor performance. Option D is wrong because bucketing on 'device_id' with 500 buckets does not address the range join predicate; it only co-locates rows by hash, but the timestamp range still requires a cross-join-like comparison within each bucket, which is inefficient.

Full explanation →

628

MCQmedium

You have an Azure Data Factory (ADF) pipeline that runs hourly to ingest data from an on-premises SQL Server into Azure Data Lake Storage Gen2. The pipeline includes a Copy activity that transfers all rows from a source table 'Sales' (approximately 10 million rows) to a Parquet file in the data lake. Recently, you notice that the pipeline runtime has increased from 15 minutes to over an hour. The source database CPU utilization is normal, and the network bandwidth is not saturated. You check ADF monitoring and see high 'Data integration unit' consumption and frequent 'BlobWrite' throttling errors. The storage account is in the same region as the ADF. You need to reduce the pipeline runtime. What should you do?

A.Change the storage account to Premium tier to increase throughput limits.

B.Modify the pipeline to use incremental loads instead of full loads each time.

C.Replace the Copy activity with an Azure Databricks notebook to process the data.

D.Use PolyBase in the Copy activity to load data directly into Azure Synapse Analytics.

AnswerB

Reduces data volume per run, decreasing storage throttling and runtime.

Why this answer

The pipeline runtime has increased due to frequent BlobWrite throttling errors, indicating that the storage account is hitting its write request limits. By modifying the pipeline to use incremental loads instead of full loads each hour, you reduce the volume of data written per execution, which lowers the number of write operations and avoids throttling. This directly addresses the root cause without requiring a storage tier upgrade or a complete architectural change.

Exam trap

The trap here is that candidates often assume throttling errors require a storage tier upgrade (Option A) or a compute change (Option C), when the real solution is to reduce the volume of data written per execution by implementing incremental loading.

How to eliminate wrong answers

Option A is wrong because upgrading to Premium tier increases throughput for block blobs but does not eliminate the fundamental issue of writing 10 million rows every hour; throttling can still occur if the write request rate exceeds the account limits, and the cost increase may not be justified. Option C is wrong because replacing the Copy activity with an Azure Databricks notebook adds complexity and overhead without addressing the storage throttling; the bottleneck is at the sink (BlobWrite), not the compute, and Databricks would still write to the same storage account. Option D is wrong because PolyBase is used for loading data into Azure Synapse Analytics, not for writing to Azure Data Lake Storage Gen2; it does not apply to the current sink and would not resolve the BlobWrite throttling errors.

Full explanation →

629

MCQhard

You are developing a real-time data processing solution for a financial services company. The system ingests stock trade data from Azure Event Hubs at 50,000 events per second. Each event is a JSON object with fields: TradeID, Symbol, Price, Quantity, Timestamp. You need to calculate a 5-minute rolling average of the trade price per symbol and store the result in Azure Cosmos DB for low-latency queries. Additionally, you need to detect anomalies where the price deviates more than 10% from the rolling average within the same window, and send alerts to Azure Event Grid. You must minimize latency and ensure that the processing is stateful across multiple partitions. What should you do?

A.Use Azure Functions with Event Hubs trigger. In each function invocation, compute the rolling average using a distributed cache (Redis) and detect anomalies. Write to Cosmos DB and Event Grid via output bindings.

B.Use Azure Synapse Pipelines with a Data Flow. Set up a streaming Data Flow from Event Hubs, compute rolling average using window functions, and sink to Cosmos DB and Event Grid.

C.Use Azure Databricks with Structured Streaming. Read from Event Hubs using Kafka API. Perform windowed aggregations and anomaly detection using Spark SQL. Write to Cosmos DB via the Azure Cosmos DB Spark connector and to Event Grid via HTTP sink.

D.Create an Azure Stream Analytics job. Define input from Event Hubs. Use a Tumbling window of 5 minutes to compute average price per symbol. Add a custom function to compare each event's price to the average and output anomalies. Write to Cosmos DB via the Azure Cosmos DB output adapter and to Event Grid via the Event Grid output adapter.

AnswerD

Stream Analytics provides native support for windowing, stateful processing, and multiple outputs.

Why this answer

Azure Stream Analytics can ingest from Event Hubs, perform windowed aggregations (e.g., Tumbling window for rolling average), detect anomalies using conditional logic, and output to both Cosmos DB and Event Grid. It handles partitioning automatically and is stateful. Option A is correct.

Option B uses Azure Functions which are not ideal for high-throughput stateful stream processing. Option C uses Databricks Structured Streaming which is more complex to manage. Option D uses Synapse Pipelines which are batch-oriented.

Full explanation →

630

MCQmedium

Your company uses Azure Data Factory to orchestrate data movement. You need to monitor pipeline runs across multiple factories and create a dashboard that shows success and failure rates over the past 30 days. What is the most efficient approach?

A.Use the Data Factory monitoring UI to view runs for each factory individually.

B.Enable Azure Storage Analytics and query the logs stored in a storage account.

C.Configure diagnostic settings for each Data Factory to send logs to a Log Analytics workspace, then create a workbook using KQL queries.

D.Create alert rules in Azure Monitor for each pipeline failure and aggregate manually.

AnswerC

Log Analytics centralizes logs from multiple factories, and workbooks provide rich visualizations.

Why this answer

Option B is correct because sending diagnostic logs to a Log Analytics workspace allows querying and visualizing data from multiple factories in a single dashboard. Option A is wrong because Azure Monitor alerts are for notifications, not historical dashboards. Option C is wrong because Azure Data Factory monitoring views are per factory.

Option D is wrong because Azure Storage Analytics is for storage metrics, not pipeline runs.

Full explanation →

631

MCQmedium

A manufacturing company uses Azure Data Lake Storage Gen2 to store IoT sensor data. The data arrives in JSON format with a nested structure. You need to transform the data into a tabular format for downstream analytics using Azure Synapse Pipelines. Which data flow transformation should you use?

A.Aggregate transformation

B.Flatten transformation

C.Window transformation

D.Pivot transformation

AnswerB

Flatten unpacks nested arrays into rows, converting JSON to tabular format.

Why this answer

Option B is correct because the Flatten transformation in mapping data flows unpacks nested arrays into rows. Option A is wrong because the Aggregate transformation groups data but does not flatten nested structures. Option C is wrong because the Pivot transformation rotates rows to columns.

Option D is wrong because the Window transformation calculates aggregated values over a range of rows.

Full explanation →

632

Multi-Selecteasy

You are designing a data storage solution for a retail company that needs to store semi-structured IoT sensor data from thousands of devices. The data is ingested in near real-time, and queries will involve filtering by device ID and timestamp. The solution must minimize storage costs while supporting interactive queries. Which TWO Azure data storage options are most appropriate?

Select 2 answers

A.Azure Table Storage

B.Azure Cosmos DB

C.Azure Blob Storage

D.Azure Data Lake Storage Gen2

E.Azure SQL Database

AnswersC, D

Cost-effective for large volumes of semi-structured data.

Why this answer

Azure Blob Storage (C) is correct because it provides a cost-effective, scalable object store for semi-structured IoT data, supporting near real-time ingestion via REST APIs or SDKs and enabling interactive queries through Azure Data Lake Storage Gen2's hierarchical namespace and integration with query engines like Azure Synapse Serverless SQL or PolyBase. Azure Data Lake Storage Gen2 (D) is correct as it builds on Blob Storage with a hierarchical namespace, optimized for analytics workloads and interactive queries using tools like Azure Synapse or Databricks, while minimizing costs through tiered storage and lifecycle management.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB for its low-latency querying capabilities, overlooking that the question emphasizes minimizing storage costs for large volumes of semi-structured IoT data, where object storage (Blob/Data Lake) is far cheaper and still supports interactive queries via analytics engines.

Full explanation →

633

MCQmedium

Your company stores sensitive customer data in Azure Data Lake Storage Gen2. You need to ensure that only authorized users can access the data, and that access is audited. Which approach should you use to control access to the data lake?

A.Use Azure managed identities for all user access

B.Use Azure RBAC roles and ACLs on directories/files, and enable storage analytics logging

C.Use shared access signatures (SAS) with stored access policies

D.Configure a virtual network service endpoint and firewall rules

AnswerB

RBAC provides coarse access, ACLs provide fine-grained, and logging audits access.

Why this answer

Option A is correct because RBAC combined with ACLs provides fine-grained access control and Azure Storage analytics logs enable auditing. Option B is wrong because SAS tokens are less granular and harder to manage for many users. Option C is wrong because firewall rules control network access, not user authorization.

Option D is wrong because managed identities are for service-to-service authentication, not user access.

Full explanation →

634

MCQeasy

You are monitoring an Azure Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL Database. The pipeline fails intermittently with the error: 'Operation on target SQL table failed: String or binary data would be truncated.' Which action should you take to resolve this issue?

A.Increase the length of the destination columns in the SQL table to accommodate the source data.

B.Set 'enable identity insert' to true.

C.Use auto-create table option in the copy activity.

D.Enable staging copy to use PolyBase.

AnswerA

Direct fix for truncation error.

Why this answer

Option A is correct because the error indicates that source data length exceeds destination column length. Increasing column size resolves it. Option B is incorrect because the table already exists.

Option C is incorrect because the error is not about connection. Option D is incorrect because the error is not about identity insert.

Full explanation →

635

MCQhard

You are designing a data processing solution in Azure Databricks that uses Unity Catalog. The security team requires that all users authenticate using Microsoft Entra ID and that access to tables is governed by attribute-based access control (ABAC) using table tags. Which feature should you enable?

A.Column-level security masks.

B.Dynamic views with user context functions.

C.Row-level security filters.

D.Table tags with access control lists (ACLs) in Unity Catalog.

AnswerD

Unity Catalog supports ABAC using tags and ACLs, and integrates with Microsoft Entra ID for authentication.

Why this answer

Option B is correct because Unity Catalog supports ABAC through tags and Azure Databricks can use Microsoft Entra ID for authentication. Option A is wrong because row-level security is for filtering rows, not ABAC with tags. Option C is wrong because column-level security is for columns.

Option D is wrong because dynamic views are for custom logic, not attribute-based access control with tags.

Full explanation →

636

MCQmedium

You are a data engineer at a manufacturing company. You need to process sensor data from IoT devices that arrive in real time. The data is sent to Azure Event Hubs. You need to aggregate the data over 5-minute windows and store the results in Azure Data Lake Storage Gen2 in Parquet format. The solution should minimize cost and use serverless components. Which solution should you use?

A.Use Azure Stream Analytics to create a query with a tumbling window of 5 minutes, and output the results to Azure Data Lake Storage Gen2 in Parquet format.

B.Use Azure Databricks with Structured Streaming to read from Event Hubs, aggregate with a sliding window, and write to ADLS Gen2 in Parquet.

C.Use Azure Data Factory with a tumbling window trigger to run a pipeline every 5 minutes that copies data from Event Hubs to ADLS Gen2.

D.Use Azure Functions with an Event Hubs trigger to aggregate data in memory and write to ADLS Gen2.

AnswerA

Serverless, real-time, and cost-effective.

Why this answer

Option A is correct because Azure Stream Analytics is a serverless, cost-effective solution for real-time stream processing with windowed aggregations. It can output directly to ADLS Gen2 in Parquet. Option B is wrong because Azure Databricks with Structured Streaming requires a running cluster, which is not serverless and incurs cost.

Option C is wrong because Azure Data Factory is not designed for real-time streaming. Option D is wrong because Azure Functions would require custom code and may not handle large throughput efficiently.

Full explanation →

637

MCQeasy

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline is failing with a 'Gateway is offline' error. What is the most likely cause?

A.The Azure Integration Runtime is being used instead of a Self-Hosted Integration Runtime.

B.The Azure Integration Runtime is not configured to use the correct region.

C.The source SQL Server is not configured to allow remote connections from Azure.

D.The Self-Hosted Integration Runtime is not running or cannot connect to the Azure Data Factory service.

AnswerD

Correct: The SHIR is the bridge between on-premises and cloud; if it's offline, the pipeline cannot access the on-premises SQL Server.

Why this answer

Option B is correct because the Self-Hosted Integration Runtime (SHIR) is required for on-premises data sources. If the SHIR is not running or unreachable, the pipeline fails. Option A is wrong because the SHIR is used, not Azure IR.

Option C is wrong because the Azure IR cannot connect to on-premises networks. Option D is wrong because a VNet is optional for SHIR.

Full explanation →

638

MCQmedium

A company uses Azure Synapse Analytics dedicated SQL pool. They notice that queries against a large fact table are slow. They have already created statistics on all columns used in WHERE clauses and JOIN predicates. What should they do next to improve query performance?

A.Enable result-set caching.

B.Increase the DWU setting for the dedicated SQL pool.

C.Create additional statistics on all columns.

D.Partition the table on a frequently filtered column.

AnswerD

Partitioning enables partition elimination, reducing the amount of data scanned.

Why this answer

Option B is correct because partitioning can help with partition elimination, reducing data scanned. Option A is wrong because they already created statistics. Option C is wrong because increasing DWU might help but is not the best first step.

Option D is wrong because result-set caching is for repeated queries, not for improving scan efficiency.

Full explanation →

639

MCQmedium

Your organization uses Azure Synapse Analytics serverless SQL pools to query data in Azure Data Lake Storage Gen2. You need to ensure that only authorized users can access the data via the serverless SQL endpoint, while minimizing administrative overhead. What should you use?

A.Enable Microsoft Entra ID authentication and grant users permissions via Azure RBAC on the storage account.

B.Use managed identities for the serverless SQL pool.

C.Use storage account access keys for authentication.

D.Use shared access signatures (SAS) tokens generated for each user.

AnswerA

Microsoft Entra ID pass-through authentication allows users to authenticate with their Azure AD identities, and RBAC controls access to storage, minimizing overhead.

Why this answer

Option D is correct because Microsoft Entra ID (formerly Azure AD) pass-through authentication allows users to authenticate using their existing Azure AD identities without managing separate SQL logins, reducing overhead. Option A is wrong because storage account keys provide broad access and are not tied to user identities. Option B is wrong because SAS tokens are time-limited and require token management.

Option C is wrong because managed identities are for service-to-service authentication, not individual user access.

Full explanation →

640

Multi-Selecthard

Which TWO options are valid ways to load data into Azure Synapse SQL Pool? (Choose two.)

Select 2 answers

A.Using INSERT INTO VALUES

B.Using BCP utility

C.Using the COPY statement

D.Using SQL Server Integration Services (SSIS)

E.Using PolyBase to load from Azure Blob Storage

AnswersC, E

Correct. COPY is the preferred bulk load method.

Why this answer

The COPY statement (Option C) is a valid and recommended method for loading data into Azure Synapse SQL Pool because it provides a high-throughput, fully managed ingestion path from Azure Data Lake Storage or Azure Blob Storage. It supports parallel loading, automatic schema inference, and built-in error handling, making it the preferred choice over older methods like PolyBase for many scenarios.

Exam trap

The trap here is that candidates may incorrectly assume INSERT INTO VALUES or BCP are valid for Synapse SQL Pool due to their familiarity with traditional SQL Server, but Synapse's MPP architecture requires bulk loading methods like COPY or PolyBase to achieve acceptable performance.

Full explanation →

641

MCQhard

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query is slow and you suspect that the file layout is not optimized. You examine the files and find that each file is 50 MB. What should you do to improve query performance?

A.Partition the data into folders by date

B.Compress the files with Gzip

C.Convert the files to CSV format to reduce overhead

D.Merge the small files into larger files of at least 100 MB each

AnswerD

Larger files reduce metadata overhead and improve query performance.

Why this answer

Option D is correct because Azure Synapse serverless SQL pool performs best when reading files of at least 100 MB each. Small files (50 MB) cause excessive metadata operations, partition discovery, and I/O overhead, leading to slower queries. Merging them into larger files reduces the number of file open/close operations and improves parallelism efficiency.

Exam trap

The trap here is that candidates often focus on compression or format conversion to improve performance, but the real bottleneck in serverless SQL pool with many small files is the metadata and scheduling overhead, not the data size itself.

How to eliminate wrong answers

Option A is wrong because partitioning by date does not address the core issue of small file size; while partitioning can help with predicate pushdown, it does not reduce the overhead of reading many small files. Option B is wrong because compressing with Gzip can actually degrade performance in serverless SQL pool, as Gzip is not splittable and forces sequential reads, whereas the pool prefers splittable formats like Snappy or uncompressed Parquet. Option C is wrong because converting to CSV format increases overhead due to lack of columnar storage, schema inference, and predicate pushdown, making queries slower than with optimized Parquet files.

Full explanation →

642

MCQhard

You are designing a data lake architecture for a healthcare company. The solution must support fine-grained access control at the file level, encryption at rest and in transit, and integration with Microsoft Purview for data lineage. Which storage solution should you recommend?

A.Azure NetApp Files.

B.Azure Files.

C.Azure Data Lake Storage Gen2 (ADLS Gen2).

D.Azure Blob Storage.

AnswerC

ADLS Gen2 provides ACLs, encryption, and Purview integration.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical namespace with POSIX-like access control lists (ACLs) for fine-grained file-level permissions, supports encryption at rest (Azure Storage Service Encryption) and in transit (TLS 1.2+), and natively integrates with Microsoft Purview for automated data lineage and cataloging. This makes it ideal for healthcare scenarios requiring strict compliance and auditability.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage with ADLS Gen2, assuming blob storage's container-level permissions are sufficient for file-level control, but the hierarchical namespace and POSIX ACLs are exclusive to ADLS Gen2 and required for the fine-grained access described.

How to eliminate wrong answers

Option A is wrong because Azure NetApp Files provides NFS/SMB file shares with ACLs but lacks native integration with Microsoft Purview for data lineage and is not optimized for large-scale analytics workloads like data lakes. Option B is wrong because Azure Files offers SMB file shares with ACLs but does not support the hierarchical namespace or POSIX ACLs needed for fine-grained file-level control in a data lake, and its Purview integration is limited compared to ADLS Gen2. Option D is wrong because Azure Blob Storage provides encryption and Purview integration but lacks a hierarchical namespace and POSIX ACLs, making it impossible to enforce fine-grained access control at the individual file level.

Full explanation →

643

MCQmedium

A company uses Azure Synapse Analytics dedicated SQL pool. They need to ensure that only users with a specific Azure AD group can query a particular schema. Which approach should they use?

A.Configure a server-level firewall rule to block other users.

B.Use the GRANT statement to grant SELECT on the schema to the Azure AD group.

C.Create a row-level security policy on all tables in the schema.

D.Apply dynamic data masking to the schema.

AnswerB

GRANT schema permission controls access at schema level.

Why this answer

The GRANT statement in Azure Synapse dedicated SQL pool allows you to assign permissions directly to Azure AD groups. By granting SELECT on the schema to the specific Azure AD group, only members of that group can query objects within that schema, meeting the requirement precisely.

Exam trap

The trap here is that candidates often confuse network-level controls (firewall rules) or data obfuscation techniques (masking, RLS) with access control, when the correct solution is a straightforward permission grant using T-SQL's GRANT statement.

How to eliminate wrong answers

Option A is wrong because server-level firewall rules control network access to the entire Azure SQL logical server, not granular schema-level access for specific Azure AD groups. Option C is wrong because row-level security (RLS) restricts access to specific rows within tables based on a predicate function, not entire schemas or tables at the schema level. Option D is wrong because dynamic data masking obfuscates sensitive data in query results but does not prevent users from querying the schema or seeing the underlying data with appropriate permissions.

Full explanation →

644

MCQeasy

You are a data engineer at a financial services company. You are developing a data processing pipeline that uses Azure Data Factory to copy transactional data from an Azure SQL Database to Azure Data Lake Storage Gen2. The pipeline runs daily and processes about 10 GB of data. You need to implement error handling for the pipeline. Specifically, if the copy activity fails due to a transient error, the pipeline should retry automatically. If the retry fails, the pipeline should log the error and send an email alert to the operations team. What should you do?

A.Configure the copy activity with retry policy (retry count = 2, retry interval = 30 seconds). Add a failure path to a web activity that calls an Azure Logic App to send an email.

B.Use Azure Functions to implement custom retry logic and send email.

C.Create an Azure Monitor alert for failed pipeline runs and configure an action group to send an email.

D.Set the pipeline retry to 2 and add a storage event trigger on the error file.

AnswerA

Built-in retry and integration with Logic Apps handle the requirement.

Why this answer

Option A is correct because Azure Data Factory allows you to set retry count and retry interval on the activity, and you can add a web activity to send an email via Azure Logic Apps after failure. Option B is wrong because the pipeline-level retry would not allow activity-specific settings. Option C is wrong because Azure Monitor alerts would be reactive, not integrated into the pipeline.

Option D is wrong because Azure Functions would require additional coding and setup.

Full explanation →

645

MCQhard

A company has an Azure Data Lake Storage Gen2 account. They want to ensure that only users with the 'Data Reader' role can access files in a specific container, while other users cannot list or read files. The storage account has hierarchical namespace enabled. What is the most secure and manageable approach?

A.Assign the Storage Blob Data Reader role at the storage account level and use row-level security

B.Generate a shared access signature (SAS) token for each user

C.Configure a storage firewall to allow only the Data Reader role's IP addresses

D.Set POSIX-like access control lists (ACLs) on the container folder for the Data Reader role

AnswerD

ACLs provide fine-grained permissions at the file/directory level for specific users/groups.

Why this answer

Option D is correct because Azure Data Lake Storage Gen2 with hierarchical namespace enabled supports POSIX-like access control lists (ACLs) at the container and folder level. By setting ACLs on the specific container folder to grant 'Read' and 'Execute' permissions only to the 'Data Reader' role (or its associated security group), you enforce least-privilege access without affecting other containers. This approach is both secure and manageable, as ACLs are inherited by default and can be centrally managed via Azure RBAC integration.

Exam trap

The trap here is that candidates often confuse row-level security (a database concept) with file-level security in Data Lake Storage, or they assume that a storage firewall can filter by user role, when in fact it only filters by network source IP.

How to eliminate wrong answers

Option A is wrong because row-level security (RLS) is a feature of Azure SQL Database and Azure Synapse SQL, not Azure Data Lake Storage Gen2; it cannot be applied to files in a storage container. Option B is wrong because generating a SAS token for each user is not manageable at scale, introduces token management overhead, and does not leverage Azure AD-based role assignments for centralized access control. Option C is wrong because a storage firewall restricts access based on network IP addresses, not user roles; it cannot differentiate between users who have the 'Data Reader' role and those who do not, and it would block all traffic from non-whitelisted IPs regardless of role membership.

Full explanation →

646

MCQeasy

You need to monitor the health of your Azure Data Factory pipelines and set up alerts for failures. Which Azure service should you use to collect and analyze pipeline run logs?

A.Azure Purview

B.Azure Log Analytics

C.Azure Monitor

D.Azure Sentinel

AnswerC

Collects metrics and logs for Azure Data Factory.

Why this answer

Option B is correct because Azure Monitor collects and analyzes pipeline run logs and metrics. Option A is wrong because Azure Purview is for data governance. Option C is wrong because Azure Sentinel is a SIEM.

Option D is wrong because Azure Log Analytics is part of Azure Monitor but the question asks for the service that collects and analyzes logs.

Full explanation →

647

MCQmedium

You are designing a data processing pipeline that ingests data from a REST API endpoint every hour. The API returns JSON data with a varying schema. You need to store the raw data in Azure Data Lake Storage Gen2 and later process it using Azure Databricks. Which file format should you use for the raw data storage?

A.Parquet

B.CSV

C.JSON

D.Avro

AnswerC

Preserves schema flexibility and is easy to store.

Why this answer

Option A (JSON) is correct because it preserves the original schema and is easy to store without transformation. Option B (Parquet) requires schema definition upfront. Option C (CSV) is not suitable for nested JSON.

Option D (Avro) is good for streaming but less common for raw ingestion.

Full explanation →

648

MCQhard

You are troubleshooting a slow-running query in Azure Synapse Analytics dedicated SQL pool. The query joins a large fact table (hash-distributed on ProductID) with a small dimension table (replicated). Upon reviewing the query plan, you see a 'ShuffleMove' operation. What is the most likely cause of the slow performance?

A.The dimension table is not actually replicated due to its size exceeding the replication threshold

B.The query is running with a low resource class

C.Result set caching is enabled

D.Statistics are outdated on the dimension table

AnswerA

If the dimension table is too large to be replicated, it will be distributed and cause shuffle.

Why this answer

The 'ShuffleMove' operation in a Synapse dedicated SQL pool query plan indicates that data is being moved between distributions to complete the join. If the small dimension table is supposed to be replicated but exceeds the replication threshold (default 60 GB compressed), it will not be replicated and instead remains hash-distributed. This forces a shuffle of the large fact table's data to align with the dimension table's distribution, causing significant data movement and slow performance.

Exam trap

The trap here is that candidates often assume a 'ShuffleMove' is always caused by a join key mismatch or poor statistics, but the specific scenario of a dimension table failing to replicate due to size is a common and subtle cause that directly triggers data movement.

How to eliminate wrong answers

Option B is wrong because a low resource class affects concurrency and memory allocation, but it would not introduce a 'ShuffleMove' operation; it would manifest as spilling to tempdb or timeouts. Option C is wrong because result set caching improves performance for repeated queries by storing results, and it does not cause a 'ShuffleMove' operation; in fact, it would reduce the need for shuffles. Option D is wrong because outdated statistics can lead to suboptimal cardinality estimates and poor join choices, but they do not directly cause a 'ShuffleMove' operation; the shuffle is a physical data movement decision based on distribution type, not statistics.

Full explanation →

649

MCQeasy

Your company uses Azure Synapse Analytics dedicated SQL pool to store a fact table with 2 billion rows. You need to improve query performance for a workload that frequently aggregates sales by date and product category. Which distribution and index type should you use?

A.Hash-distribute on product_category and use a clustered columnstore index.

B.Replicate the table and use a clustered index.

C.Round-robin distribution and a clustered columnstore index.

D.Hash-distribute on date and use a clustered index.

AnswerA

Hash distribution on the join/aggregation key improves performance; columnstore is ideal for large data volumes.

Why this answer

Hash-distributing on product_category ensures that rows with the same product category are co-located on the same distribution, enabling local aggregation without data movement. A clustered columnstore index provides high compression and batch-mode processing, which is ideal for large fact tables and analytical workloads that aggregate millions of rows by columns like date and product_category.

Exam trap

The trap here is that candidates often choose round-robin distribution (Option C) thinking it balances data evenly, but they overlook that it causes data shuffling for any aggregation on a non-distribution column, while hash distribution on the grouping column avoids that overhead entirely.

How to eliminate wrong answers

Option B is wrong because replicating a 2-billion-row table is impractical due to storage overhead and the 60-distribution replication limit, and a clustered index lacks the columnar compression and batch-mode execution needed for large-scale aggregation. Option C is wrong because round-robin distribution distributes rows randomly, causing data movement during joins and aggregations, which degrades performance for frequent grouping by product_category. Option D is wrong because hash-distributing on date scatters rows with the same product_category across distributions, forcing expensive shuffle operations during aggregation, and a clustered index is less efficient than columnstore for analytical queries.

Full explanation →

650

MCQeasy

Your team is using Azure Data Factory to orchestrate a data pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline runs successfully during testing. However, after moving to production, you notice that the pipeline fails intermittently with connectivity errors. You need to ensure reliable data transfer. What should you implement?

A.Use Azure Integration Runtime (IR)

B.Use PolyBase for data transfer

C.Configure Azure VPN Gateway

D.Use Self-Hosted Integration Runtime

AnswerD

Self-hosted IR acts as a gateway connecting ADF to on-premises resources.

Why this answer

Option B is correct because a self-hosted integration runtime provides a gateway for on-premises connectivity. Option A is wrong because Azure IR cannot access on-premises networks. Option C is wrong because VPN is not a direct solution for data factory connectivity.

Option D is wrong because PolyBase is for loading into Synapse, not for connectivity.

Full explanation →

651

MCQmedium

Your company uses Azure Data Lake Storage Gen2 as a data lake. You need to process CSV files that arrive in a 'raw' container, transform them into Parquet format, and write them to a 'curated' container. The transformation includes filtering out rows with null values in the 'customer_id' column and adding a partition column 'year' based on the 'order_date'. You use Azure Synapse Pipelines. Which activity should you use for the transformation?

A.Stored procedure activity

B.Notebook activity with PySpark

C.Copy data activity

D.Data flow activity

AnswerD

Data flows provide visual transformation with built-in mapping.

Why this answer

Option C is correct because a data flow activity in Azure Synapse Pipelines can perform transformations like filtering and adding computed columns, and can write to ADLS Gen2 in Parquet format. Option A is wrong because Copy activity only copies data without transformation. Option B is wrong because Notebook activity requires Spark code; data flow is simpler for this scenario.

Option D is wrong because Stored Procedure activity runs SQL, not file transformations.

Full explanation →

652

MCQhard

You are designing a data storage solution for an Azure Data Lake Storage Gen2 account that will store sensitive customer data. The solution must enforce that all data is encrypted at rest using customer-managed keys (CMK) stored in Azure Key Vault. Additionally, you need to prevent data from being accessed by any Azure service except Azure Synapse Analytics. Which combination of configurations should you implement?

A.Enable Azure Storage encryption with customer-managed keys stored in Azure Key Vault and configure a firewall with a service endpoint for Azure Synapse Analytics

B.Enable Azure Storage encryption with customer-managed keys and use a shared access signature (SAS) token for Azure Synapse Analytics

C.Enable Azure Storage encryption with customer-managed keys and assign an Azure Policy denying public network access

D.Enable Azure Storage encryption with Microsoft-managed keys and configure a private endpoint

AnswerA

This combination provides CMK encryption and restricts network access to Azure Synapse Analytics only.

Why this answer

Option A is correct because it combines two essential controls: Azure Storage encryption with customer-managed keys (CMK) stored in Azure Key Vault ensures data is encrypted at rest using keys you control, and a firewall with a service endpoint for Azure Synapse Analytics restricts network access so that only traffic from Azure Synapse Analytics can reach the storage account. This meets both the encryption and access restriction requirements.

Exam trap

The trap here is that candidates often confuse service endpoints with private endpoints or SAS tokens, thinking any network restriction or key management approach will suffice, but the question specifically requires both CMK and service-specific access control, which only a service endpoint for Azure Synapse Analytics combined with a firewall provides.

How to eliminate wrong answers

Option B is wrong because a shared access signature (SAS) token grants time-limited, delegated access to specific resources but does not restrict access to only Azure Synapse Analytics; it can be used by any client with the token, and it does not enforce network-level isolation. Option C is wrong because assigning an Azure Policy denying public network access only blocks public endpoints but does not explicitly allow only Azure Synapse Analytics; it would block all access unless a private endpoint or service endpoint is configured, and it does not address the encryption requirement (CMK is already enabled, but the policy alone does not enforce the service-specific restriction). Option D is wrong because it uses Microsoft-managed keys instead of customer-managed keys, which fails the explicit requirement for CMK stored in Azure Key Vault, even though a private endpoint provides network isolation.

Full explanation →

653

Multi-Selectmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query data in Azure Data Lake Storage Gen2. The data is stored as Parquet files partitioned by date. Which TWO of the following statements are true regarding querying this data? (Select TWO.)

Select 2 answers

A.You can use the filepath() function in the query to retrieve the partition column values.

B.Partition elimination is automatically applied when filtering on the partition column in the WHERE clause.

C.You must create an external table to query Parquet files; OPENROWSET is not supported.

D.You can create indexes on the serverless SQL pool to improve query performance.

E.You can only query a single file at a time; wildcards are not supported.

AnswersA, B

The filepath function returns the partition path values.

Why this answer

Option A is correct because the `filepath()` function in a serverless SQL pool query returns the file path of the row being read. When data is partitioned by date in Azure Data Lake Storage Gen2, the partition column values are embedded in the folder structure (e.g., `/year=2023/month=01/day=15/`). Using `filepath(1)`, `filepath(2)`, etc., you can extract these values directly in the query without needing to parse the path manually.

Exam trap

The trap here is that candidates often assume serverless SQL pools behave like dedicated SQL pools, leading them to think indexes are needed or that external tables are mandatory, when in fact serverless pools are schema-on-read and rely on file metadata and statistics for performance.

Full explanation →

654

MCQmedium

You are designing a data storage solution for a retail company that needs to store semi-structured JSON data from IoT sensors. The data is ingested continuously and must support both real-time analytics and batch processing. Which Azure data store should you recommend?

A.Azure Data Lake Storage Gen2

B.Azure Blob Storage

C.Azure SQL Database

D.Azure Cosmos DB

AnswerA

ADLS Gen2 combines Blob Storage with a hierarchical namespace and is designed for big data analytics.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical file system with the scalability and low cost of Azure Blob Storage, making it ideal for storing semi-structured JSON data from IoT sensors. It supports both real-time analytics (via services like Azure Stream Analytics or Apache Spark) and batch processing (via tools like Azure Data Factory or PolyBase) without data movement, and it natively handles JSON files with schema-on-read capabilities.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB (Option D) because it natively supports JSON and real-time access, but they overlook the requirement for batch processing and cost-effective storage at scale, which ADLS Gen2 is designed for as a data lake solution.

How to eliminate wrong answers

Option B (Azure Blob Storage) is wrong because while it can store JSON data, it lacks a hierarchical namespace, making it less efficient for directory-based operations and batch processing patterns that require folder-level security or rename operations, which ADLS Gen2 provides. Option C (Azure SQL Database) is wrong because it is a relational database optimized for structured data and transactional workloads, not for storing semi-structured JSON at scale with schema-on-read; it would require schema definition and indexing, adding latency and cost for high-volume IoT ingestion. Option D (Azure Cosmos DB) is wrong because it is a NoSQL database designed for low-latency, transactional access to semi-structured data, but it is not optimized for batch processing or large-scale analytical queries on raw JSON files; it is better suited for operational workloads rather than the combined real-time and batch analytics scenario described.

Full explanation →

655

MCQmedium

Refer to the exhibit. You deploy this ARM template to create a managed integration runtime in Azure Synapse Analytics. You notice that the integration runtime shows as 'Running' but copy activities using it are slow. The data volume is 500 GB per run. What is the most likely cause of the poor performance?

A.The integration runtime type should be 'Self-hosted' for better performance.

B.The node size (Standard_D3_v2) is too small for the data volume.

C.The number of nodes (2) is insufficient for 500 GB of data.

D.The location is set to 'AutoResolve' which may cause data movement across regions.

AnswerB

Standard_D3_v2 has limited resources for large data processing.

Why this answer

Option B is correct because the node size Standard_D3_v2 has limited memory and CPU for large data volumes. A larger node size (e.g., Standard_D8_v3) would improve performance. Option A is wrong because 2 nodes is reasonable; increasing nodes could help but the node size is more critical.

Option C is wrong because the location 'AutoResolve' is fine. Option D is wrong because the integration runtime type 'Managed' is appropriate for Azure-based copy.

Full explanation →

656

Multi-Selectmedium

Which THREE considerations should be evaluated when designing a partitioning strategy for a large fact table in Azure Synapse Dedicated SQL Pool?

Select 3 answers

A.Align partition boundaries with distribution key

B.Use hash distribution for all fact tables

C.Ensure partition elimination is possible in queries

D.Limit the number of partitions to fewer than 60

E.Choose a partition column with high cardinality

AnswersA, C, D

Prevents data movement.

Why this answer

Option A is correct because aligning partition boundaries with the distribution key in Azure Synapse Dedicated SQL Pool ensures that data movement during partition switching or merging is minimized, as each distribution contains its own set of partitions. This alignment avoids cross-distribution data transfers, which can significantly improve performance and maintainability of large fact tables.

Exam trap

The trap here is that candidates often confuse partition column cardinality with distribution key cardinality, assuming high cardinality is always beneficial, but for partitioning, low cardinality is required to avoid creating too many small partitions that hurt performance.

Full explanation →

657

MCQhard

A healthcare company stores patient records in Azure Blob Storage. The compliance team requires that all data be encrypted at rest using customer-managed keys (CMK) stored in Azure Key Vault. Additionally, the storage account must be accessible only from a specific virtual network (VNet) and must support versioning to protect against accidental deletion. The storage account is currently using Microsoft-managed keys and has public network access enabled. You need to implement the required changes with minimal downtime. Which course of action should you take?

A.Create a new storage account in the same region with infrastructure encryption enabled and customer-managed keys configured. Enable versioning and VNet access. Use AzCopy to copy data from the old account to the new one. Update applications to use the new storage account. Delete the old account.

B.Modify the existing storage account to use customer-managed keys by updating the encryption settings in the Azure portal. Then, enable versioning and configure firewall rules to allow access from the VNet.

C.Delete the existing storage account and recreate it with the same name, enabling customer-managed keys, versioning, and VNet access. Restore data from backups.

D.Create a new storage account with the same name in a different region. Enable infrastructure encryption and customer-managed keys. Use Azure Data Factory to copy data from the old account to the new one. Then, delete the old account and update applications to point to the new account.

AnswerA

This meets all requirements. The new account is created with CMK support, versioning, and VNet access. Data is migrated with AzCopy, and applications are updated.

Why this answer

Option A is correct because creating a new storage account with infrastructure encryption and customer-managed keys (CMK) is the only way to enable infrastructure encryption, which cannot be enabled on an existing account. Using AzCopy ensures minimal downtime by copying data in the background while applications continue to use the old account until the cutover. Enabling versioning and VNet access on the new account meets all compliance requirements without disrupting the existing environment.

Exam trap

The trap here is that candidates assume infrastructure encryption can be enabled on an existing storage account, but it is a creation-time-only setting, making a new account the only viable path to meet the compliance requirement with minimal downtime.

How to eliminate wrong answers

Option B is wrong because you cannot enable infrastructure encryption on an existing storage account; it must be set at creation time. Additionally, changing encryption settings from Microsoft-managed to customer-managed keys on an existing account does not enable infrastructure encryption, which is required by the compliance team. Option C is wrong because deleting and recreating the account with the same name causes significant downtime, and restoring from backups is not a minimal-downtime approach; also, infrastructure encryption cannot be enabled on a recreated account with the same name if the original was created without it.

Option D is wrong because creating the new account in a different region violates the requirement to keep data in the same region, and Azure Data Factory is not the optimal tool for a one-time bulk copy; AzCopy is more efficient for this scenario.

Full explanation →

658

MCQmedium

A company uses Azure Synapse Analytics with a dedicated SQL pool. They need to ensure that a team of data scientists can query all tables in the 'sales' schema but cannot modify any data or schema objects. Which role should the team be assigned?

A.db_owner

B.db_datareader

C.db_ddladmin

D.db_datawriter

AnswerB

db_datareader grants read access to all tables.

Why this answer

The `db_datareader` role grants read-only access to all user tables in a database, allowing the team to query all tables in the 'sales' schema without the ability to modify data or schema objects. This aligns perfectly with the requirement for data scientists to perform SELECT queries only.

Exam trap

The trap here is that candidates often confuse `db_datareader` with `db_datawriter` or assume `db_ddladmin` is required for querying, not realizing that read-only access is specifically granted by `db_datareader` without any write or schema modification capabilities.

How to eliminate wrong answers

Option A is wrong because `db_owner` provides full control over the database, including the ability to modify data and schema, which violates the requirement. Option C is wrong because `db_ddladmin` allows execution of Data Definition Language (DDL) commands like CREATE, ALTER, and DROP, enabling schema modifications. Option D is wrong because `db_datawriter` grants INSERT, UPDATE, and DELETE permissions, allowing data modification.

Full explanation →

659

MCQhard

You have an Azure Synapse Analytics dedicated SQL pool. You notice that some queries are taking longer than expected due to excessive data movement operations. You need to minimize data movement without changing the distribution columns. Which table design approach should you recommend?

A.Use replicated tables for small dimension tables

B.Use round-robin distribution for dimension tables

C.Use hash distribution for all tables

D.Use partitioning on join columns

AnswerA

Replicated tables store a full copy on each node, eliminating data shuffling for joins.

Why this answer

Using replicated tables for small dimension tables reduces data movement because the table is copied to all nodes, avoiding shuffles. Option A is wrong because hash-distributing large fact tables is already common; replicating large tables is not recommended due to storage overhead. Option C is wrong because round-robin distribution is for staging or temporary tables.

Option D is wrong because partitioning does not directly reduce data movement for joins.

Full explanation →

660

MCQmedium

Your company is building a real-time analytics solution for monitoring manufacturing equipment. Sensors send JSON data every second to an Azure Event Hubs instance. The data must be stored in Azure Data Lake Storage Gen2 in Parquet format, partitioned by date and hour. You use Azure Stream Analytics to read from Event Hubs and write to ADLS Gen2. Currently, the output is writing many small Parquet files (under 1 MB each), which is causing performance issues when reading the data. You need to optimize the output to produce fewer, larger files while maintaining low latency. What should you do?

A.Partition the output by minute instead of hour to distribute data more

B.Change the output format to Avro to improve compression

C.Increase the 'Maximum events per batch' setting in the Stream Analytics output to ADLS Gen2

D.Decrease the 'Maximum events per batch' setting to reduce latency

AnswerC

Buffering more events per batch produces larger files.

Why this answer

Option C is correct because increasing the 'Maximum events per batch' setting in the Stream Analytics output to ADLS Gen2 allows more events to be accumulated before writing a file, resulting in fewer, larger Parquet files. This directly addresses the small-file problem while maintaining low latency, as the batching is time-bound and does not introduce excessive delay.

Exam trap

The trap here is that candidates may confuse partitioning granularity with file sizing, incorrectly assuming finer partitioning (Option A) or format changes (Option B) will solve the small-file problem, when the actual solution is to adjust the batching threshold in the output sink.

How to eliminate wrong answers

Option A is wrong because partitioning by minute instead of hour would create even more partitions, leading to more small files and worsening the performance issue. Option B is wrong because changing the output format to Avro does not inherently reduce the number of files; it only changes the serialization format, and Avro may not provide better compression than Parquet for this scenario. Option D is wrong because decreasing the 'Maximum events per batch' setting would cause more frequent writes with fewer events, increasing the number of small files and degrading read performance.

Full explanation →

661

MCQmedium

You are reviewing a Spark job definition in Azure Synapse Analytics. The job aggregates sales data. The job runs successfully but takes longer than expected. You notice that dynamic allocation is disabled and the executor instances are fixed at 10. The cluster has a maximum of 20 nodes. What is the most likely reason for the slow performance?

A.The file path is incorrect, causing data read errors.

B.The job cannot scale out beyond 10 executors because dynamic allocation is disabled.

C.The job is not parallelized because of a single partition.

D.The executor memory is too low for the aggregation.

AnswerB

With dynamic allocation off, the job is limited to 10 executors.

Why this answer

With dynamic allocation disabled and executor instances fixed at 10, the Spark job cannot utilize additional cluster resources even though the cluster supports up to 20 nodes. This means the job is artificially constrained to 10 executors, limiting parallelism and causing slower performance despite available compute capacity.

Exam trap

The trap here is that candidates may overlook the explicit configuration detail (dynamic allocation disabled, fixed 10 executors) and instead focus on generic performance issues like memory or partitioning, missing the direct scaling limitation.

How to eliminate wrong answers

Option A is wrong because an incorrect file path would cause job failures or data read errors, not simply slower performance; the job runs successfully. Option B is wrong because it is actually the correct answer. Option C is wrong because a single partition would cause extreme underutilization and likely very slow processing, but the question states the job aggregates sales data and runs successfully, implying some parallelism exists; the fixed executor count is the more direct bottleneck.

Option D is wrong because while low executor memory can cause spilling to disk and slowdowns, the question specifically highlights disabled dynamic allocation and fixed executors as the observed configuration, making insufficient scaling the primary issue.

Full explanation →

662

Multi-Selectmedium

You are designing a data transformation pipeline using Azure Databricks. The pipeline reads from Azure Data Lake Storage Gen2, performs aggregations, and writes to a Synapse dedicated SQL pool. Which three configurations should you implement to optimize performance and minimize cost? (Choose three.)

Select 3 answers

A.Enable Delta Lake on the storage account

B.Use Auto Optimize and Optimized Writes on Delta tables

C.Enable Photon engine for Spark SQL operations

D.Use a single-node cluster to reduce cost

E.Disable autoscaling to avoid cost variability

F.Use default Spark shuffle partitions (200)

AnswersA, B, C

Why this answer

Option A is correct because enabling Delta Lake on the storage account allows you to use Delta tables, which provide ACID transactions, scalable metadata handling, and unified batch/streaming capabilities. This is essential for reliable and performant data transformations in Azure Databricks, especially when reading from ADLS Gen2 and writing to Synapse.

Exam trap

The trap here is that candidates often assume cost savings come from reducing cluster size (single-node) or disabling autoscaling, but in practice these choices hurt performance and can increase total cost due to longer runtimes and resource contention.

Why the other options are wrong

D

Single-node cluster cannot handle large data volumes efficiently.

E

Autoscaling helps minimize cost by scaling down during idle periods.

F

Default may not be optimal; tuning shuffle partitions is recommended.

Full explanation →

663

MCQhard

A data engineering team uses Azure Data Factory to load data from Azure SQL Database to Azure Data Lake Storage Gen2. They notice that the pipeline runs fail intermittently due to transient errors. They need to implement a retry policy with exponential backoff. What is the most efficient way to achieve this?

A.Use a 'Validation' activity before the copy to check source availability

B.Create a custom .NET activity to handle retries

C.Add a 'Until' loop with a wait activity in the pipeline

D.Configure the 'Retry' property on the copy activity with a count and exponential backoff interval

AnswerD

Built-in retry with exponential backoff.

Why this answer

Option D is correct because Azure Data Factory natively supports configuring a 'Retry' property on activities, including Copy activities, with an exponential backoff interval. This built-in mechanism automatically retries the activity upon transient failures without requiring custom logic, making it the most efficient and maintainable approach for handling intermittent errors.

Exam trap

The trap here is that candidates may overcomplicate the solution by choosing a custom loop or validation activity, overlooking that Azure Data Factory's native 'Retry' property with exponential backoff is the simplest and most efficient built-in mechanism for handling transient errors.

How to eliminate wrong answers

Option A is wrong because a 'Validation' activity only checks source availability before the copy starts; it does not retry the copy operation itself if a transient error occurs during data transfer. Option B is wrong because creating a custom .NET activity introduces unnecessary complexity, development overhead, and maintenance burden when Azure Data Factory already provides a native retry feature. Option C is wrong because an 'Until' loop with a wait activity requires manual implementation of retry logic and exponential backoff, which is less efficient and more error-prone than using the built-in 'Retry' property.

Full explanation →

664

MCQmedium

You are using Azure Purview to scan an Azure Data Lake Storage Gen2 account. After scanning, you notice that some files are not classified. What is the most likely reason?

A.The storage account is not registered in Purview

B.The files are in Parquet format

C.The classification rules are disabled

D.The file types are not included in the scan rule set

AnswerD

Default rule sets may not include all file types.

Why this answer

Purview uses scan rule sets to classify data. If the file type is not included in the default scan rule set, it will be skipped. Option A is wrong because Purview can scan Parquet/CSV.

Option B is wrong because registration is required. Option D is wrong because classification rules are defined in rule sets.

Full explanation →

665

Multi-Selecteasy

Which TWO actions help optimize data storage costs in Azure Data Lake Storage Gen2?

Select 2 answers

A.Enable soft delete for blobs.

B.Enable geo-redundant storage (GRS) for the storage account.

C.Configure lifecycle management policies to move data to cool or archive tiers.

D.Enable encryption at rest using customer-managed keys.

E.Use locally redundant storage (LRS) for temporary data.

AnswersC, E

Lifecycle policies reduce cost by tiering infrequently accessed data.

Why this answer

Option C is correct because Azure Data Lake Storage Gen2 supports lifecycle management policies that automatically transition data to cooler tiers (cool or archive) based on age or usage patterns. Moving infrequently accessed data to lower-cost tiers directly reduces storage costs without manual intervention.

Exam trap

The trap here is that candidates often confuse cost-optimization features (like tiering) with data protection or security features (like soft delete, GRS, or encryption), which serve different purposes and may actually increase costs.

Full explanation →

666

MCQmedium

You are monitoring an Azure Data Factory pipeline that runs every hour. The pipeline uses a Copy activity to copy data from Azure SQL Database to Azure Blob Storage. Recently, the pipeline has been failing with a 'Timeout' error. The source SQL database has a large number of records. What should you do to resolve the timeout?

A.Enable staging and use PolyBase or COPY statement for the copy activity.

B.Decrease the 'writeBatchSize' to 1000.

C.Increase the 'timeout' value in the copy activity settings.

D.Use a query with 'queryTimeout' set to 7200 seconds.

AnswerA

Staging with PolyBase/COPY allows data to be copied in parallel and avoids timeouts.

Why this answer

Option A is correct because enabling staging with PolyBase or the COPY statement offloads the data transfer to Azure Data Lake or Blob Storage, bypassing the bottleneck of the Copy activity's default data movement. This approach is specifically designed for large-scale data loads from Azure SQL Database, as it uses the database's bulk export capabilities and avoids the timeout by not relying on the Copy activity's internal query execution.

Exam trap

The trap here is that candidates often assume increasing timeouts (options C or D) will fix the issue, but the real problem is the Copy activity's default command timeout limitation, which requires a fundamentally different data movement approach like staging with PolyBase or COPY statement.

How to eliminate wrong answers

Option B is wrong because decreasing 'writeBatchSize' to 1000 reduces the number of rows written per batch to the sink, which does not address the source-side timeout; the timeout occurs during the data read from Azure SQL Database, not during the write to Blob Storage. Option C is wrong because increasing the 'timeout' value in the copy activity settings only extends the overall activity duration but does not resolve the underlying issue of the source query exceeding the default command timeout (typically 30 seconds) when reading a large number of records. Option D is wrong because setting 'queryTimeout' to 7200 seconds in a query only applies to the query execution on the source database, but the Copy activity's default command timeout for the source dataset is separate and still limited; moreover, the real solution is to use a bulk export mechanism like PolyBase or COPY statement, not just extending the query timeout.

Full explanation →

667

MCQmedium

You are designing a data lake architecture using Azure Data Lake Storage Gen2. The data will be ingested from multiple sources with varying schemas. You need to organize the data in a way that supports both batch and streaming analytics while maintaining data lineage. Which folder structure convention should you use?

A.Organize by ingestion date only, with subfolders for each source.

B.Organize by source system, then by date.

C.Use a medallion architecture with three layers: bronze (raw), silver (cleaned), gold (aggregated).

D.Organize by file format (CSV, Parquet, JSON) and date.

AnswerC

Medallion architecture provides clear separation and lineage.

Why this answer

Option C is correct because the medallion architecture (bronze, silver, gold) is the recommended pattern for Azure Data Lake Storage Gen2 when handling multiple sources with varying schemas. It supports both batch and streaming by storing raw data in bronze, applying incremental transformations in silver, and serving aggregated views in gold, while maintaining data lineage through clear layer boundaries and audit columns.

Exam trap

The trap here is that candidates often choose Option B (source then date) because it seems logical for organization, but they overlook the requirement to support both batch and streaming analytics while maintaining data lineage, which the medallion architecture explicitly addresses through layered transformations.

How to eliminate wrong answers

Option A is wrong because organizing by ingestion date only, with subfolders for each source, lacks schema evolution support and makes it difficult to trace data lineage across transformations. Option B is wrong because organizing by source system then by date does not provide a standardized processing pipeline for both batch and streaming, and it fails to separate raw, cleaned, and aggregated states. Option D is wrong because organizing by file format and date ignores the need for schema management and lineage tracking, and it does not facilitate incremental processing or data quality checks across layers.

Full explanation →

668

MCQhard

Refer to the exhibit. You are running the KQL query in Azure Data Explorer. The query returns no results, but you know there is data in the table T. What is the most likely issue?

A.The bin function is used incorrectly; should be bin(Timestamp, 1h) but placed in the wrong clause

B.The between operator syntax is incorrect; should be 'between (startTime..endTime)' without spaces around the dots

C.The datetime format is incorrect; use ISO 8601

D.The table T does not exist in the database

AnswerB

Spaces around the '..' can cause the range to be misparsed.

Why this answer

Option B is correct because the KQL `between` operator requires the range syntax `between(datetime1..datetime2)` with no spaces around the two dots. The query uses `between (startTime .. endTime)` with spaces, which is invalid syntax and causes the query to return no results even though data exists in table T.

Exam trap

Microsoft often tests the exact syntax of KQL operators like `between`, where candidates overlook the requirement for no spaces around the two dots, assuming whitespace is allowed as in other languages.

How to eliminate wrong answers

Option A is wrong because the `bin` function is used correctly in the `where` clause to round timestamps; the issue is not about `bin` placement. Option C is wrong because the datetime format in the query uses a valid format (e.g., '2023-01-01 00:00:00') and KQL accepts various datetime formats, including the one shown. Option D is wrong because the question explicitly states that table T exists and contains data, so the table not existing is not the issue.

Full explanation →

669

MCQhard

A healthcare company stores sensitive patient data in Azure Data Lake Storage Gen2. They need to ensure that only authorized users can access data and that all access is audited. They also need to prevent data from being accessed by unauthorized Azure services. Which combination of security features should be used?

A.Use a private endpoint and Azure AD authentication, disable public access.

B.Use Azure RBAC and ACLs for authorization, enable firewall and virtual network service endpoints, and enable diagnostic settings for auditing.

C.Use managed identity for service access and disable public access.

D.Use Azure AD authentication and SAS tokens for access, enabling diagnostic logs for auditing.

AnswerB

Combination provides layered security and full audit.

Why this answer

Option B is correct because it combines Azure RBAC and ACLs for fine-grained authorization, a firewall with virtual network service endpoints to restrict access to authorized networks, and diagnostic settings to capture audit logs. This layered approach ensures that only authorized users can access the data, all access is audited, and unauthorized Azure services are blocked by the firewall and service endpoints.

Exam trap

Microsoft often tests the misconception that disabling public access alone is sufficient to block unauthorized Azure services, when in fact service endpoints or private endpoints are required to prevent access from other Azure services within the same region.

How to eliminate wrong answers

Option A is wrong because while a private endpoint and Azure AD authentication secure access and disable public access, they do not provide the ability to block unauthorized Azure services (e.g., other Azure services outside the virtual network) unless combined with network rules like service endpoints or firewall rules. Option C is wrong because managed identity alone does not prevent unauthorized Azure services from accessing the data; it only authenticates the service, and disabling public access without network-level controls still allows other Azure services within the same region to access the storage account via the Azure backbone. Option D is wrong because SAS tokens can be leaked or misused, and they do not provide the same level of fine-grained authorization as RBAC and ACLs; additionally, enabling diagnostic logs alone does not prevent unauthorized Azure services from accessing the data.

Full explanation →

670

Multi-Selecteasy

Which TWO Azure services can you use to monitor data processing pipelines in Azure Data Factory?

Select 2 answers

A.Log Analytics workspace

B.Azure Sentinel

C.Azure Dashboard

D.Azure Service Health

E.Azure Monitor

AnswersA, E

Log Analytics stores and queries pipeline logs.

Why this answer

Options B and C are correct. Azure Monitor provides metrics and alerts; Log Analytics workspace enables querying pipeline run logs. Option A (Azure Sentinel) is a SIEM, not typically for pipeline monitoring.

Option D (Azure Dashboard) is a visualization tool, not a monitoring service. Option E (Azure Service Health) monitors Azure service health, not pipelines.

Full explanation →

671

MCQhard

You are designing a data processing solution using Azure Databricks with Delta Lake. The data is partitioned by date and ingested daily. You notice that the Delta table has many small files, causing slow read performance. Which strategy should you recommend to optimize the table for faster queries?

A.Run OPTIMIZE on the table to compact small files.

B.Run ZORDER BY on the date column.

C.Run VACUUM to delete old files.

D.Increase the number of partitions by adding a new partition column.

AnswerA

OPTIMIZE merges small files into larger ones.

Why this answer

Option A is correct because running OPTIMIZE on a Delta Lake table compacts many small files into larger ones, reducing the number of files that need to be read during queries. This directly addresses the slow read performance caused by the small file problem, which is common in daily partitioned ingestion. OPTIMIZE uses bin-packing to merge files up to a target size (default 256 MB), improving scan efficiency without changing the data.

Exam trap

The trap here is that candidates may confuse ZORDER BY (which improves data skipping but not file count) with OPTIMIZE (which reduces file count), or mistakenly think VACUUM or adding partitions solves the small file problem, when in fact they either don't address it or make it worse.

How to eliminate wrong answers

Option B is wrong because ZORDER BY is used to colocate related information within files to improve data skipping, but it does not reduce the number of small files; it only reorganizes data within existing files. Option C is wrong because VACUUM removes old, unreferenced files for storage cleanup and compliance, but it does not compact small files or improve read performance. Option D is wrong because increasing the number of partitions (e.g., by adding a new partition column) would create even more small files, worsening the small file problem and degrading read performance further.

Full explanation →

672

MCQhard

You are monitoring an Azure Synapse Analytics dedicated SQL pool that is experiencing performance degradation during peak hours. You notice that some queries are being queued due to resource contention. You need to optimize query performance without scaling the Data Warehouse Units (DWUs). Which action should you take?

A.Increase the DWU setting to allocate more resources.

B.Create materialized views for frequently joined tables.

C.Configure result-set caching for the dedicated SQL pool.

D.Implement workload classification and assign the queries to a higher importance level.

AnswerC

Result-set caching stores query results in SSD, reducing resource usage for repeated queries and alleviating contention.

Why this answer

Option D is correct because result-set caching can significantly reduce query time for repeated queries by storing results in SSD, reducing resource contention. Option A is wrong because increasing DWUs changes the scale, which is not allowed per the requirement. Option B is wrong because materialized views help but do not directly address contention from repeated queries.

Option C is wrong because workload classification manages concurrency but does not reduce resource usage for repeated queries.

Full explanation →

673

Matchingmedium

Match each Azure security feature to its description.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Role-based access control for Azure resources

Cloud-based identity and access management service

Manage cryptographic keys and secrets

Private connectivity to Azure services over VNet

Why these pairings

These security features are essential for data protection.

Full explanation →

674

Multi-Selectmedium

Which TWO options are correct approaches to handle schema drift in Azure Data Factory Mapping Data Flows?

Select 2 answers

A.Use a conditional split to route rows with different schemas to separate sinks.

B.Define a rigid schema in the source dataset and reject rows that don't match.

C.Disable schema drift to improve performance.

D.Enable 'Allow schema drift' in the source transformation.

E.Use a derived column transformation to provide default values for missing columns.

AnswersD, E

This allows the data flow to handle changing columns.

Why this answer

Option D is correct because enabling 'Allow schema drift' in the source transformation is the primary mechanism in Mapping Data Flows to handle incoming columns that are not defined in the dataset schema. This setting allows the data flow to dynamically adapt to changes in the source data structure, such as new or missing columns, without requiring manual schema updates.

Exam trap

The trap here is that candidates often confuse handling schema drift with data routing or error handling, and they overlook that enabling schema drift is the foundational step that must be taken before any other transformations can work with the drifted columns.

Full explanation →

675

Multi-Selectmedium

A company is designing a data storage solution for a global application that requires low-latency reads and writes for user session data. The solution must support automatic failover across multiple Azure regions. Which TWO Azure services meet these requirements?

Select 2 answers

A.Azure Table Storage

B.Azure Cache for Redis

C.Azure Blob Storage

D.Azure Cosmos DB

E.Azure SQL Database

AnswersB, D

Supports geo-replication and automatic failover.

Why this answer

Azure Cache for Redis is correct because it provides an in-memory data store with sub-millisecond latency for both reads and writes, making it ideal for user session data. It supports automatic failover across Azure regions through geo-replication, where data from a primary cache is asynchronously replicated to a secondary cache in a paired region, ensuring high availability and disaster recovery.

Exam trap

The trap here is that candidates often confuse Azure Cosmos DB's multi-region writes with the specific requirement for low-latency session data, but Cosmos DB, while supporting automatic failover, has higher latency than an in-memory cache like Redis for frequent, small reads and writes typical of session state.

Full explanation →

Page 9 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →