Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 226–300

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 4 of 12

226

MCQeasy

You are tasked with designing a data storage solution for a social media analytics company. They need to store user profile data (JSON) and social media posts (text and images). The data is used for machine learning models that require fast random access to individual user profiles and the ability to run analytical queries over posts. The solution must provide low-latency reads for user profiles (milliseconds) and support for large-scale analytics on posts. Which combination of Azure data services should you recommend?

A.Azure Cosmos DB for user profiles and Azure Data Lake Storage Gen2 for posts

B.Azure Cosmos DB for both user profiles and posts

C.Azure SQL Database for both user profiles and posts

D.Azure Table Storage for user profiles and Azure Blob Storage for posts

AnswerA

Cosmos DB gives low-latency reads; ADLS Gen2 supports analytics.

Why this answer

Azure Cosmos DB provides low-latency (millisecond) reads for user profiles via its indexing and partitioning capabilities, ideal for fast random access. Azure Data Lake Storage Gen2 (ADLS Gen2) combines a hierarchical namespace with Blob Storage, enabling large-scale analytical queries on posts using tools like Azure Synapse Analytics or Spark, while efficiently storing text and images.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB for both workloads (Option B) because they assume its multi-model support handles analytics, but they overlook that Cosmos DB is a transactional database not designed for large-scale analytical queries, while ADLS Gen2 is purpose-built for data lakes and analytics.

How to eliminate wrong answers

Option B is wrong because using Azure Cosmos DB for both profiles and posts would be cost-prohibitive for large-scale analytics on posts, as Cosmos DB is optimized for transactional workloads, not petabyte-scale analytical queries, and lacks native support for hierarchical file storage. Option C is wrong because Azure SQL Database is a relational store that struggles with semi-structured JSON profiles and large binary images, and it cannot scale to handle massive analytical workloads on posts without significant performance degradation and cost. Option D is wrong because Azure Table Storage is a NoSQL key-value store that does not support complex queries or indexing for fast random access to JSON profiles, and Azure Blob Storage lacks a hierarchical namespace and native analytical integration, making large-scale analytics inefficient.

Full explanation →

227

Multi-Selecthard

Which THREE considerations are important when designing a data processing solution using Azure Databricks for ETL workloads? (Select three.)

Select 3 answers

A.Auto-scaling of compute resources

B.Use of Delta Lake for data reliability

C.Fixed pricing model

D.Minimum network bandwidth

E.Cluster configuration and autoscaling

AnswersA, B, E

Auto-scaling helps handle variable workloads efficiently.

Why this answer

Option A is correct because cluster configuration affects performance and cost. Option C is correct because Delta Lake provides ACID transactions and optimization. Option E is correct because auto-scaling adjusts resources based on workload.

Option B is wrong because fixed pricing is not a design consideration. Option D is wrong because minimum bandwidth is not a typical consideration.

Full explanation →

228

MCQhard

You are designing a data lake on Azure Data Lake Storage Gen2. The data will be used by both batch processing (Spark) and interactive querying (Azure Synapse Serverless SQL). The data is partitioned by date and stored as Parquet. What is the optimal folder structure to minimize cross-partition scans for both workloads?

A.All files in a single folder

B./year/month/day/ (e.g., /2023/12/25/)

C./yyyy-mm-dd/ (e.g., /2023-12-25/)

D.Files named by date (e.g., data_20231225.parquet)

AnswerB

Why this answer

Option B (/year/month/day/) is optimal because it aligns with Hive-style partitioning, which both Spark and Azure Synapse Serverless SQL can leverage for partition pruning. Spark uses partition discovery to read only relevant directories, and Synapse Serverless SQL uses the file path metadata to filter partitions, minimizing cross-partition scans and reducing data read overhead.

Exam trap

The trap here is that candidates often assume a flat date-based folder or filename pattern is sufficient for partitioning, but both Spark and Synapse Serverless SQL require hierarchical folder structures to enable automatic partition pruning and avoid full scans.

Why the other options are wrong

A

No partitioning at all, causing full scans.

C

Single-level partitioning does not allow efficient pruning for yearly or monthly queries.

D

Partition pruning requires folder hierarchy, not file names.

Full explanation →

229

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must process streaming data from Azure Event Hubs and store the results in a dedicated SQL pool. The solution must support exactly-once semantics and handle late-arriving data. Which Azure service should you use to implement this solution?

A.Azure Data Factory with tumbling window trigger.

B.Azure Stream Analytics.

C.Azure Functions with Event Hubs trigger.

D.Azure HDInsight Spark Structured Streaming.

AnswerB

Azure Stream Analytics provides exactly-once semantics and handles late-arriving data.

Why this answer

Azure Stream Analytics is the correct choice because it natively integrates with Azure Event Hubs and dedicated SQL pools, supports exactly-once semantics through checkpointing and output deduplication, and provides built-in handling for late-arriving data via configurable late arrival tolerance windows and out-of-order event policies.

Exam trap

The trap here is that candidates often confuse batch-oriented services like Azure Data Factory with streaming solutions, or assume that any event-driven compute (like Azure Functions) can provide exactly-once semantics and late-arriving data handling without understanding the specialized streaming engine requirements.

How to eliminate wrong answers

Option A is wrong because Azure Data Factory with a tumbling window trigger is a batch-oriented orchestration service that cannot process streaming data in real time; it lacks native support for exactly-once semantics in streaming contexts and cannot handle late-arriving data with event-time ordering. Option C is wrong because Azure Functions with an Event Hubs trigger processes events individually or in small batches, does not provide built-in exactly-once output guarantees to a dedicated SQL pool, and lacks native support for late-arriving data handling such as watermarking or out-of-order policies. Option D is wrong because Azure HDInsight Spark Structured Streaming requires significant manual configuration for exactly-once semantics (e.g., managing checkpoint locations and idempotent sinks) and does not offer the same level of integrated, low-latency output to dedicated SQL pools as Azure Stream Analytics; it also adds operational overhead for cluster management.

Full explanation →

230

MCQeasy

Your company is migrating an on-premises SQL Server database to Azure SQL Database. The database includes a large fact table with hourly updates. You need to minimize downtime during migration. Which Azure service should you use to replicate data continuously?

A.Use Azure SQL Managed Instance as a target

B.Use Azure Data Factory with a copy activity

C.Use Azure Database Migration Service with online migration mode

D.Use Azure Synapse Link for SQL Server

AnswerC

B is correct because DMS online migration uses CDC to replicate changes continuously.

Why this answer

Azure Database Migration Service (DMS) with online migration mode uses continuous change data capture (CDC) to replicate ongoing changes from the source SQL Server to Azure SQL Database with minimal downtime. This is the only option that supports near-zero downtime migration by synchronizing data continuously until the final cutover, which is critical for a large fact table with hourly updates.

Exam trap

The trap here is that candidates confuse Azure Data Factory's copy activity (a batch tool) with continuous replication, or assume Azure SQL Managed Instance inherently supports migration, when in fact DMS online mode is the specific service designed for minimal-downtime migrations.

How to eliminate wrong answers

Option A is wrong because Azure SQL Managed Instance is a target platform, not a migration service; it does not provide continuous data replication from an on-premises source. Option B is wrong because Azure Data Factory with a copy activity is a batch-oriented ETL tool that performs periodic snapshots, not continuous real-time replication, and would require downtime for the final data sync. Option D is wrong because Azure Synapse Link for SQL Server is designed for real-time analytics on operational data in Azure Synapse, not for migrating databases to Azure SQL Database with continuous replication.

Full explanation →

231

Multi-Selecthard

Which THREE factors should be considered when choosing between Azure Stream Analytics and Azure Databricks for a real-time data processing solution?

Select 3 answers

A.Integration with Power BI for real-time dashboards

B.Need for complex transformations and machine learning model integration

C.Volume of data per second (throughput)

D.Requirement for exactly-once semantics

E.Maximum allowed latency for late-arriving data

AnswersB, C, D

Databricks supports complex ML pipelines natively.

Why this answer

Option B is correct because Azure Databricks provides native support for complex transformations (e.g., windowed aggregations, multi-step ETL) and seamless integration with machine learning libraries (e.g., MLflow, Spark MLlib), which are not natively available in Azure Stream Analytics. Stream Analytics uses a SQL-like query language and is optimized for simpler, declarative transformations, making Databricks the better choice when advanced analytics or ML model scoring is required in real-time pipelines.

Exam trap

The trap here is that candidates often assume Power BI integration or late-arriving data handling are unique to one service, when in fact both services support these features, and the key differentiators are throughput scalability, exactly-once semantics, and the ability to perform complex transformations with ML integration.

Full explanation →

232

MCQeasy

A company is designing a data storage solution for IoT device telemetry data. The data is append-only, needs to be stored cost-effectively for long-term analytics, and must support querying by device ID and timestamp. Which Azure storage solution should they use?

A.Azure Data Lake Storage Gen2

B.Azure Cosmos DB

C.Azure SQL Database

D.Azure Blob Storage with hot access tier

AnswerA

ADLS Gen2 is designed for big data analytics, supports hierarchical namespace, and is cost-effective for long-term storage.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines the cost-effective, append-only blob storage of Azure Blob Storage with a hierarchical namespace that enables directory-level operations and POSIX-like access control. This makes it ideal for storing large volumes of IoT telemetry data at low cost while supporting efficient querying by device ID and timestamp through partition pruning in tools like Azure Synapse Analytics or Apache Spark.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage with ADLS Gen2, assuming that Blob Storage alone supports hierarchical namespace and efficient querying, when in fact ADLS Gen2 is required for the hierarchical namespace and POSIX-like directory structure that enables partition pruning and cost-effective analytics on append-only data.

How to eliminate wrong answers

Option B (Azure Cosmos DB) is wrong because it is a NoSQL database optimized for low-latency, transactional workloads with flexible schemas, not for cost-effective long-term storage of append-only telemetry data; its RU-based pricing model becomes prohibitively expensive for high-volume, append-only IoT data. Option C (Azure SQL Database) is wrong because it is a relational database designed for OLTP workloads with strong consistency and indexing, but its per-core licensing and storage costs are too high for storing petabytes of append-only telemetry data, and it does not natively support the hierarchical namespace needed for efficient partition pruning by device ID and timestamp. Option D (Azure Blob Storage with hot access tier) is wrong because while it provides cost-effective storage, the hot access tier incurs higher storage costs than the cool or archive tiers, and without the hierarchical namespace of ADLS Gen2, querying by device ID and timestamp requires full scans or external indexing, making it less efficient for analytics workloads.

Full explanation →

233

MCQmedium

When designing a data processing solution using Azure Databricks, what is the recommended approach to handle schema evolution when reading data from Delta Lake tables?

A.Set the option 'mergeSchema' to 'true' on write

B.Set the option 'overwriteSchema' to 'true' on write

C.Manually alter the table schema using ALTER TABLE

D.Ignore schema changes and use 'failOnDataLoss' flag

AnswerA

Why this answer

Option A is correct because in Delta Lake, schema evolution is automatically handled by setting the 'mergeSchema' option to 'true' on write operations. This allows new columns to be added or existing column types to be safely widened without manual intervention, preserving existing data and metadata integrity.

Exam trap

The trap here is that candidates often confuse 'mergeSchema' with 'overwriteSchema', mistakenly thinking both handle schema evolution similarly, but 'overwriteSchema' replaces the entire schema and can cause data loss, while 'mergeSchema' safely merges new columns or type changes.

Why the other options are wrong

B

overwriteSchema replaces the entire schema and can cause data loss.

C

Manual approach is error-prone and not recommended for automated pipelines.

D

failOnDataLoss is for streaming jobs, not for schema evolution.

Full explanation →

234

MCQmedium

A data engineering team is designing a batch processing pipeline that reads from Azure Data Lake Storage Gen2, transforms data using Azure Databricks, and writes to Azure Synapse Analytics. The pipeline must process data incrementally and handle late-arriving data up to 2 hours. Which approach should they use to track processed files?

A.Use Blob Storage event triggers to invoke Azure Functions

B.Use Azure Synapse Pipelines with a schedule and full load each time

C.Use Azure Data Factory with watermark columns in the source

D.Store processed file names in a Delta table and compare with source folder listing

AnswerD

Delta table provides a reliable way to track processed files and can be updated incrementally.

Why this answer

Storing processed file names in a Delta table allows the pipeline to track which files have been processed and compare against the source folder to identify new or late-arriving files. This is a common pattern for incremental processing.

Full explanation →

235

Multi-Selecthard

Your organization uses Azure Purview for data governance. You need to ensure that sensitive data is properly classified and that access to it is monitored. Which THREE actions should you take? (Choose three.)

Select 3 answers

A.Define Azure Policy initiatives to enforce classification on all storage accounts.

B.Use Azure Sentinel to classify data as it is ingested.

C.Create custom sensitivity labels in Microsoft Purview Information Protection and apply them to data sources.

D.Integrate Azure Purview with Microsoft Defender for Cloud Apps to monitor access to sensitive data.

E.Set up automated scanning in Azure Purview to discover and classify sensitive data.

AnswersC, D, E

Correct: Sensitivity labels help enforce protection policies and are used in monitoring.

Why this answer

Options A, B, and D are correct. A: Automated scanning classifies data. B: Labeling helps in applying sensitivity tags.

D: Microsoft Defender for Cloud Apps can monitor access and generate alerts. Option C is wrong because Azure Policy does not directly classify or monitor data access. Option E is wrong because Azure Sentinel is for security incident detection, not data classification.

Full explanation →

236

MCQeasy

You are designing a batch processing pipeline that reads CSV files from Azure Blob Storage, performs aggregations using Azure Databricks, and writes results to Azure Synapse Analytics. The pipeline must handle schema drift (new columns appearing in source files). Which approach should you recommend?

A.Use Azure Data Factory mapping data flows with schema drift enabled, mapping to a fixed sink schema.

B.Define a fixed schema in the source and ignore any new columns.

C.Use Spark with mergeSchema option when reading, and write using a Delta table to evolve schema automatically.

D.Use Azure Stream Analytics to pre-process and enforce schema.

AnswerC

Handles schema drift automatically.

Why this answer

Option C is correct because Spark's `mergeSchema` option, when used with Delta Lake, automatically evolves the schema to accommodate new columns in CSV files. This allows the batch pipeline to handle schema drift without manual intervention, and writing to a Delta table ensures the schema evolution is persisted and compatible with downstream writes to Azure Synapse Analytics.

Exam trap

The trap here is that candidates often confuse schema drift handling with schema enforcement, assuming that a fixed sink schema or streaming pre-processing can accommodate dynamic schema changes, when in fact only a schema-on-read approach like Spark's `mergeSchema` with Delta Lake provides the necessary flexibility for batch pipelines.

How to eliminate wrong answers

Option A is wrong because Azure Data Factory mapping data flows with schema drift enabled can handle new columns, but mapping to a fixed sink schema would discard or fail on those new columns, defeating the purpose of handling drift. Option B is wrong because defining a fixed schema and ignoring new columns directly contradicts the requirement to handle schema drift, leading to data loss or pipeline failures. Option D is wrong because Azure Stream Analytics is designed for real-time streaming data, not batch processing, and it enforces a fixed schema rather than evolving it dynamically.

Full explanation →

237

MCQeasy

You are designing a data pipeline in Azure Data Factory that processes streaming data from Azure Event Hubs and stores it in Azure Data Lake Storage Gen2. The data must be encrypted at rest and in transit. Which configuration ensures encryption in transit?

A.Enable encryption at rest using Azure Storage Service Encryption.

B.Use HTTPS endpoint for Azure Data Lake Storage Gen2.

C.Configure the Azure Data Factory integration runtime to use TLS 1.2.

D.Deploy Azure Firewall to inspect traffic between Event Hubs and Data Lake.

AnswerB

HTTPS encrypts data in transit.

Why this answer

Option C is correct because HTTPS ensures encryption in transit. Option A is wrong because encryption at rest does not cover transit. Option B is wrong because TLS is used for SQL connections, not for Data Lake.

Option D is wrong because Azure Firewall does not provide encryption.

Full explanation →

238

MCQeasy

You need to monitor the performance of an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline runs on a self-hosted integration runtime. Which metric is most important to monitor to ensure the self-hosted IR is not a bottleneck?

A.Pipeline duration metric

B.Queue depth for the self-hosted IR

C.Number of active connections to the IR

D.Data read and data written metrics for the pipeline

AnswerB

High queue depth indicates the IR is unable to process activities quickly enough.

Why this answer

Option A is correct because the queue depth indicates how many activities are waiting to be processed, directly showing if the IR is overloaded. Option B is wrong because data read/write measures throughput, not queuing. Option C is wrong because pipeline duration includes all activities.

Option D is wrong because connection count does not directly indicate bottleneck.

Full explanation →

239

MCQeasy

A data engineer needs to store CSV files containing customer data in Azure Blob Storage. The files must be encrypted at rest using a customer-managed key stored in Azure Key Vault. What should they configure?

A.Azure Disk Encryption

B.Azure Storage Firewall

C.Azure Storage Service Encryption (SSE) with customer-managed keys

D.Azure Information Protection

AnswerC

Correct. SSE encrypts data at rest and can use CMK from Key Vault.

Why this answer

Azure Storage Service Encryption (SSE) for Blob Storage encrypts data at rest automatically. By choosing customer-managed keys (CMK) stored in Azure Key Vault, the customer retains control over the encryption keys, meeting the requirement for customer-managed key encryption at rest. SSE with CMK is the correct service for encrypting blobs with a key the customer manages.

Exam trap

The trap here is confusing Azure Disk Encryption (which encrypts VM disks) with Azure Storage Service Encryption (which encrypts blob data), leading candidates to select Option A when the requirement is for blob-level encryption with customer-managed keys.

How to eliminate wrong answers

Option A is wrong because Azure Disk Encryption uses BitLocker or DM-Crypt to encrypt OS and data disks of virtual machines, not the data stored in Azure Blob Storage. Option B is wrong because Azure Storage Firewall controls network access to the storage account via IP rules and virtual network rules, it does not provide encryption at rest. Option D is wrong because Azure Information Protection is a classification and labeling solution for documents and emails, not an encryption mechanism for data at rest in Azure Blob Storage.

Full explanation →

240

MCQeasy

An organization is using Azure Synapse Analytics and wants to implement column-level security to restrict access to sensitive columns. Which feature should they use?

A.Dynamic data masking

B.Azure Purview

C.Column-level security using GRANT

D.Row-level security

AnswerC

Column-level security allows you to restrict access to specific columns via GRANT statements.

Why this answer

Option D is correct because column-level security in Azure Synapse Analytics uses GRANT statements on specific columns. Option A is wrong because row-level security filters rows, not columns. Option B is wrong because dynamic data masking masks data at query time but does not restrict access.

Option C is wrong because Azure Purview is for governance, not access control.

Full explanation →

241

MCQeasy

You have an Azure Data Factory pipeline that uses a Copy activity to move data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails intermittently with a timeout error. You need to improve the reliability of the data transfer. Which configuration change should you make?

A.Use staged copy with an intermediate Azure Blob Storage.

B.Use PolyBase as the sink.

C.Enable fault tolerance and configure skip incompatible rows.

D.Increase the retry count in the pipeline activity.

AnswerC

This allows the copy to continue even if some rows fail, improving reliability.

Why this answer

Option C is correct because enabling fault tolerance and configuring 'skip incompatible rows' allows the Copy activity to continue processing even when some rows cause errors (e.g., type conversion failures), which can manifest as timeouts when the activity repeatedly retries the same problematic rows. This setting improves reliability by skipping rows that cannot be copied, preventing the entire pipeline from failing on intermittent data issues.

Exam trap

The trap here is that candidates confuse 'fault tolerance' with 'retry policy,' assuming that increasing retries is the only way to handle failures, whereas fault tolerance addresses row-level errors that cause timeouts without requiring a full activity restart.

How to eliminate wrong answers

Option A is wrong because using staged copy with an intermediate Azure Blob Storage is designed to improve performance for large data transfers or to enable PolyBase, not to address timeout errors caused by incompatible rows or transient failures. Option B is wrong because PolyBase is a sink for loading data into Azure Synapse Analytics (SQL Data Warehouse), not for Azure Blob Storage, and it does not resolve timeout errors in a Copy activity. Option D is wrong because increasing the retry count only re-executes the entire activity on failure, which can exacerbate timeout issues if the root cause is incompatible rows or data skew, leading to longer execution times without addressing the underlying problem.

Full explanation →

242

MCQmedium

You are designing a solution for a social media company that needs to store user profile data with strong consistency and low latency (under 10 ms) for reads and writes. The data model is simple key-value with occasional queries on secondary attributes. Which Azure data store meets these requirements?

A.Azure Table Storage

B.Azure SQL Database with clustered index on user ID

C.Azure Cache for Redis with persistence

D.Azure Cosmos DB with session consistency and secondary indexes

AnswerD

Cosmos DB provides <10 ms latency, strong consistency, and indexing on any attribute.

Why this answer

Azure Cosmos DB with session consistency and secondary indexes meets the requirements because it provides single-digit-millisecond latency for both reads and writes, supports strong consistency (session consistency offers monotonic reads and writes), and allows efficient queries on secondary attributes via indexing. This makes it ideal for a key-value store with occasional secondary attribute queries, unlike simpler stores that lack indexing or consistency guarantees.

Exam trap

The trap here is that candidates often choose Azure Cache for Redis (Option C) for low latency, overlooking that it lacks secondary indexes for attribute queries and does not provide strong consistency for writes, which are critical for profile data with occasional secondary lookups.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage does not support secondary indexes, making queries on non-key attributes inefficient (full table scans), and its consistency model is only eventual by default, not strong. Option B is wrong because Azure SQL Database, while supporting strong consistency and secondary indexes, typically has higher latency (often 10-30 ms) for simple key-value operations due to relational overhead and network round-trips, failing the under-10-ms requirement. Option C is wrong because Azure Cache for Redis with persistence is an in-memory cache designed for low latency but lacks built-in secondary indexes for querying non-key attributes, and its persistence model (RDB/AOF) does not guarantee strong consistency for writes (e.g., data loss on failover).

Full explanation →

243

MCQhard

A company is migrating an on-premises Hadoop cluster to Azure. The cluster uses Hive tables stored as Parquet files on HDFS. They want to minimize changes to existing Hive queries and continue using HiveQL. Which Azure storage solution should they choose?

A.Azure HDInsight with Hive and Azure Data Lake Storage Gen2

B.Azure SQL Database with PolyBase

C.Azure Databricks with Delta Lake

D.Azure Synapse Analytics with external tables

AnswerA

HDInsight provides Hive-compatible environment; ADLS Gen2 replaces HDFS seamlessly.

Why this answer

Azure HDInsight with Hive and Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it provides a fully managed Hadoop service that supports HiveQL with minimal changes. ADLS Gen2 offers a hierarchical namespace and is optimized for Hadoop workloads, allowing the existing Parquet files on HDFS to be directly mounted and queried without data movement or schema changes.

Exam trap

The trap here is that candidates often confuse 'supporting HiveQL' with 'running Hive on any Azure service,' but only HDInsight provides the native Hive runtime and HDFS-compatible storage (ADLS Gen2) needed to run existing HiveQL queries unchanged.

How to eliminate wrong answers

Option B is wrong because Azure SQL Database with PolyBase is a relational database engine that does not natively support HiveQL or the Hadoop file system; it would require rewriting queries and converting Parquet files to a relational format. Option C is wrong because Azure Databricks with Delta Lake is an Apache Spark-based platform that, while supporting HiveQL via Spark SQL, introduces Delta Lake's transactional layer, which changes the storage format and requires query modifications; it is not a direct Hive-on-HDFS replacement. Option D is wrong because Azure Synapse Analytics with external tables uses PolyBase to query external data, but it does not run HiveQL natively and requires creating external table definitions, altering the query interface and adding complexity.

Full explanation →

244

MCQhard

You are optimizing an Azure Synapse Analytics dedicated SQL pool. The pool experiences high concurrency and frequent data skew. Which indexing strategy should you recommend to improve query performance for large fact tables?

A.Use hash-distributed tables with clustered columnstore indexes.

B.Use hash-distributed tables with round-robin distribution and clustered columnstore indexes.

C.Use replicated tables with clustered indexes.

D.Use heap tables with non-clustered indexes.

AnswerB

Round-robin distribution evenly distributes data, and columnstore indexes provide compression and performance.

Why this answer

Option B is correct because hash-distributed round-robin tables distribute data evenly, reducing skew. Option A is wrong because hash-distributed clustered columnstore may still have skew if the distribution column is poorly chosen. Option C is wrong because replicated tables are for small dimension tables.

Option D is wrong because heap tables are for staging, not optimized for large fact tables.

Full explanation →

245

MCQmedium

You are using Azure Data Lake Storage Gen2 as the data lake for your organization. You need to process files in the 'incoming' folder using a scheduled Azure Databricks notebook. After processing, the files should be moved to the 'processed' folder. The files are large (up to 10 GB) and you want to minimize the time to move them. Which approach should you use?

A.Use the Azure Databricks dbutils.fs.mv() to move the file.

B.Use Azure Data Factory with a Copy activity to move the file, then delete the source.

C.Change the file's metadata to update its directory path.

D.Use the Azure Databricks dbutils.fs.cp() to copy the file to the processed folder, then delete the original.

AnswerA

The move operation in ADLS Gen2 is a metadata rename, which is instantaneous.

Why this answer

Option A is correct because `dbutils.fs.mv()` performs a metadata-only rename operation on Azure Data Lake Storage Gen2, which is instantaneous regardless of file size. This avoids any data movement, making it the fastest approach for moving large files (up to 10 GB) between folders within the same storage account.

Exam trap

The trap here is that candidates often assume moving large files requires copying, but Azure Data Lake Storage Gen2's hierarchical namespace enables instant metadata-only renames, making `dbutils.fs.mv()` the optimal choice.

How to eliminate wrong answers

Option B is wrong because Azure Data Factory Copy activity physically copies the file data across folders, which is slower and incurs additional read/write costs, even though it can delete the source afterward. Option C is wrong because Azure Data Lake Storage Gen2 does not support moving files by changing metadata; directory paths are part of the file's hierarchical namespace and cannot be updated via metadata alone. Option D is wrong because `dbutils.fs.cp()` performs a full data copy, which is time-consuming for large files, and then requires an explicit delete, adding unnecessary overhead compared to a rename.

Full explanation →

246

MCQhard

Your organization uses Azure Synapse Analytics dedicated SQL pool. You need to implement a solution that reduces storage costs for historical data that is rarely accessed but must be available for querying within minutes. The solution should not require application changes. What should you do?

A.Create external tables pointing to data stored in Azure Data Lake Storage Gen2 with appropriate tiering

B.Use Azure Blob Storage with PolyBase and configure lifecycle management to archive data

C.Drop older partitions and reload data when needed

D.Change the distribution to round-robin for the fact table

AnswerA

External tables allow querying data in Data Lake Storage without moving it, and tiering reduces cost.

Why this answer

Option A is correct because creating external tables in Azure Synapse Analytics dedicated SQL pool that point to data stored in Azure Data Lake Storage Gen2 (ADLS Gen2) allows you to query historical data directly from low-cost storage tiers (e.g., cool or archive) without moving it into the pool. This reduces storage costs for rarely accessed data while keeping it queryable within minutes, and requires no application changes since the external tables are accessed via standard T-SQL queries.

Exam trap

The trap here is that candidates often confuse PolyBase with external tables, assuming PolyBase requires data to be in hot tier or that lifecycle management alone solves the query latency requirement, but they overlook the rehydration delay of archive tier and the need for zero application changes.

How to eliminate wrong answers

Option B is wrong because Azure Blob Storage with PolyBase and lifecycle management to archive data would require application changes to switch query targets and does not natively support querying archived data within minutes (archive tier has a rehydration delay of up to 15 hours). Option C is wrong because dropping older partitions and reloading data when needed is a manual, time-consuming process that violates the requirement of being available for querying within minutes and would require application changes to manage the reload logic. Option D is wrong because changing the distribution to round-robin for the fact table improves query performance for certain workloads but does not reduce storage costs for historical data or address the requirement of tiering rarely accessed data.

Full explanation →

247

MCQmedium

Your company has an Azure Synapse Analytics dedicated SQL pool. You need to implement a solution that automatically moves data between the 'PRIMARY' filegroup and a secondary filegroup based on data age, without manual intervention. Which feature should you use?

A.Materialized views

B.Azure Data Lake Storage tiering

C.PolyBase

D.Partition switching

AnswerD

Partition switching enables efficient data movement between partitions, which can be automated with partition management.

Why this answer

Partition switching in Azure Synapse Analytics dedicated SQL pool allows you to efficiently move data between filegroups by switching partitions between tables. By aligning partitions with data age, you can automatically transfer older data to a secondary filegroup without manual intervention, using a scheduled process like a stored procedure or Azure Data Factory.

Exam trap

The trap here is that candidates confuse data movement between filegroups with storage tiering or external data access, leading them to choose PolyBase or Azure Data Lake Storage tiering instead of recognizing that partition switching is the native mechanism for intra-database data relocation in Synapse dedicated SQL pools.

How to eliminate wrong answers

Option A is wrong because materialized views improve query performance by pre-computing and storing results, but they do not move data between filegroups or manage data lifecycle. Option B is wrong because Azure Data Lake Storage tiering manages data across hot, cool, and archive storage tiers at the storage account level, not within a dedicated SQL pool's filegroups. Option C is wrong because PolyBase is used for querying external data sources (e.g., Azure Blob Storage) using T-SQL, not for moving data between filegroups within the same SQL pool.

Full explanation →

248

MCQmedium

You are optimizing cost for an Azure Data Lake Storage Gen2 account that stores historical data. The data is accessed infrequently after 30 days and must be retained for 7 years. Which lifecycle management rule should you apply?

A.Move blobs to archive tier immediately after 30 days.

B.Delete blobs after 30 days.

C.Move blobs to premium tier after 30 days.

D.Move blobs to cool tier after 30 days, then to archive tier after 1 year.

AnswerD

Optimizes cost based on access patterns.

Why this answer

Option B is correct because moving blobs to cool tier after 30 days and then to archive after a longer period is cost-effective. Option A is wrong because deleting after 30 days loses data. Option C is wrong because premium tier is expensive.

Option D is wrong because archive after 30 days would incur high retrieval costs.

Full explanation →

249

MCQmedium

Your team needs to provide near-real-time analytics on IoT sensor data streaming into Azure Event Hubs. The data must be stored in Azure Data Lake Storage Gen2 in Parquet format, partitioned by date and device ID. Which architecture should you implement?

A.Use Azure Stream Analytics with output to Data Lake Storage Gen2, using partitioning by date and device ID.

B.Use Azure Data Factory with a tumbling window trigger to copy data from Event Hubs to Data Lake Storage.

C.Use Azure Functions to read from Event Hubs and write to Data Lake Storage.

D.Use Azure Databricks with Structured Streaming to read from Event Hubs and write to Data Lake Storage.

AnswerA

Stream Analytics provides native partitioning and Parquet output.

Why this answer

Azure Stream Analytics provides native, low-latency processing of streaming data from Event Hubs with direct output to Azure Data Lake Storage Gen2. It supports automatic partitioning by specifying a partition key (e.g., date and device ID) in the output configuration, enabling efficient, near-real-time writes in Parquet format without additional orchestration.

Exam trap

Microsoft often tests the misconception that any service capable of reading from Event Hubs is suitable for near-real-time analytics, ignoring the critical requirements for native partitioning, low latency, and managed checkpointing that Stream Analytics uniquely provides.

How to eliminate wrong answers

Option B is wrong because Azure Data Factory with a tumbling window trigger operates on a batch schedule (minimum 1 minute), not near-real-time, and cannot natively read from Event Hubs as a streaming source. Option C is wrong because Azure Functions, while capable of event-driven processing, lack native support for checkpointing and exactly-once semantics for high-throughput streaming, leading to potential data loss or duplication. Option D is wrong because Azure Databricks Structured Streaming can achieve near-real-time processing but introduces significant overhead (cluster startup, cost, complexity) compared to the simpler, fully managed Stream Analytics solution for this specific use case.

Full explanation →

250

Drag & Dropmedium

Drag and drop the steps to set up Azure Purview for data cataloging and lineage tracking into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

First create the Purview account, then register sources, scan them, set classifications, and finally explore the catalog.

Full explanation →

251

MCQhard

You are troubleshooting slow COPY INTO performance in Azure Synapse Analytics dedicated SQL pool when loading Parquet files from Azure Data Lake Storage Gen2. The files are 1 GB each. What should you do to improve performance?

A.Reduce the file size to 100 MB to increase parallelism

B.Use PolyBase instead of COPY INTO

C.Increase the number of files to match the number of distributions

D.Disable parallel processing in the COPY command

AnswerC

D is correct because each distribution can read a file in parallel, maximizing throughput.

Why this answer

Option C is correct because COPY INTO in Azure Synapse dedicated SQL pool distributes data across 60 distributions. To maximize parallelism, the number of input files should match or exceed the number of distributions. With 1 GB files, you have too few files to fully utilize all distributions, causing some distributions to remain idle.

Increasing the number of files to at least 60 ensures each distribution gets work, improving throughput.

Exam trap

The trap here is that candidates focus on file size reduction (Option A) as a general optimization, but the specific requirement in Synapse dedicated SQL pool is to match the number of files to the number of distributions (60) to avoid distribution skew and maximize parallelism.

How to eliminate wrong answers

Option A is wrong because reducing file size to 100 MB increases the number of files but does not guarantee they match the distribution count; the key is file count, not size, and 100 MB files may still result in fewer than 60 files. Option B is wrong because PolyBase is an older technology that uses external tables and has additional overhead; COPY INTO is the recommended, optimized method for loading Parquet files and is generally faster. Option D is wrong because disabling parallel processing would force sequential loading, drastically reducing performance; COPY INTO inherently uses parallel processing to leverage all distributions.

Full explanation →

252

MCQmedium

A data engineer is designing a solution that uses Azure Data Factory to copy data from an on-premises SQL Server to Azure Synapse Analytics. The data transfer must be encrypted in transit. Which property should be configured in the linked service?

A.ConnectionString with Integrated Security

B.AuthenticateVia

C.EncryptedConnection

D.UseSystemTrustStore

AnswerC

Enables TLS encryption for data transfer.

Why this answer

Option C is correct because the EncryptedConnection property in an Azure Data Factory linked service enforces encryption for data in transit between the on-premises SQL Server and Azure Synapse Analytics. When set to true, it uses TLS/SSL to encrypt the connection, ensuring that data transferred over the network is protected from interception or tampering.

Exam trap

The trap here is that candidates often confuse 'encryption in transit' with authentication methods (like Integrated Security) or certificate validation settings (like UseSystemTrustStore), leading them to select options that address identity or trust rather than the actual encryption of the data channel.

How to eliminate wrong answers

Option A is wrong because ConnectionString with Integrated Security specifies Windows authentication credentials but does not control encryption of the data in transit; it is unrelated to TLS/SSL enforcement. Option B is wrong because AuthenticateVia defines the authentication method (e.g., Managed Identity, Service Principal) for the linked service, not the encryption of the data channel. Option D is wrong because UseSystemTrustStore determines whether to use the system's certificate trust store for validating the server's TLS certificate, but it does not enable or disable encryption itself; encryption must be explicitly set via EncryptedConnection.

Full explanation →

253

MCQhard

You are optimizing an Azure Synapse Analytics dedicated SQL pool. A fact table named Sales is partitioned by date and has a hash-distributed column ProductKey. You notice that queries filtering on OrderDate are performing poorly. You need to improve query performance for date range filters. What should you do?

A.Ensure the table is partitioned on the same column used in WHERE clause (OrderDate) and enable partition elimination.

B.Create a non-clustered index on OrderDate.

C.Add ProductKey as a distribution column alongside the existing hash.

D.Change the distribution to round-robin to spread data evenly.

AnswerA

Partition elimination reduces the data scanned by skipping partitions not matching the filter.

Why this answer

Option A is correct because Partition Elimination allows the engine to skip irrelevant partitions. Option B is wrong because round-robin distribution does not improve partition elimination. Option C is wrong because increasing distribution columns may not help with date filtering.

Option D is wrong because indexes are not the primary solution for partition pruning.

Full explanation →

254

MCQhard

You are reviewing an Azure Data Factory pipeline JSON. Based on the exhibit, what will be the behavior of the Copy activity when copying files from a source folder that contains subfolders?

A.The copy will use staging to improve performance.

B.Only files in the root folder will be copied.

C.Files will be copied preserving the source folder structure.

D.All files from all subfolders will be copied into a single folder in the sink.

AnswerD

FlattenHierarchy merges all files into one folder.

Why this answer

Option B is correct because recursive: true ensures all files from subfolders are copied, and FlattenHierarchy writes them to the sink without preserving the folder structure. Option A is wrong because recursive is true. Option C is wrong because FlattenHierarchy does not preserve structure.

Option D is wrong because staging is disabled.

Full explanation →

255

MCQmedium

You are reviewing a script to create an external data source in Azure Synapse Analytics serverless SQL pool. Based on the exhibit, what is the purpose of the SAS token?

A.To provide read access to the container for querying data.

B.To provide write access to the container for storing query results.

C.To encrypt the connection between the serverless pool and storage.

D.To authenticate the user to the serverless SQL pool.

AnswerA

The SAS includes 'sp=rl' which grants read and list permissions.

Why this answer

Option B is correct because the SAS token provides delegated access to the storage account for reading (sp=r) and listing (sp=l). Option A is wrong because the SAS is for a container, not a specific file. Option C is wrong because the SAS does not have write permission (sp=rl).

Option D is wrong because the SAS is used in the credential for the external data source.

Full explanation →

256

MCQmedium

You are a data engineer at a healthcare analytics company. The company uses Azure Data Factory (ADF) to orchestrate data pipelines that ingest patient data from on-premises SQL Server databases into Azure Synapse Analytics. Recently, the pipeline has been failing intermittently with the following error: 'Failure happened on 'Sink' side. ErrorCode=SqlFailedToConnect, Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Cannot connect to SQL Server Database. The TCP connection to the host <server_name>, port 1433 has failed. Error: 'Connection timed out.'.' The on-premises SQL Server is behind a corporate firewall. The ADF self-hosted integration runtime (SHIR) is installed on a VM inside the corporate network. You have verified that the SHIR is running and that the SQL Server is accessible from the SHIR VM using SQL Server Management Studio (SSMS). The error occurs sporadically, not consistently. What is the most likely cause of the intermittent connection timeout?

A.The data being transferred is skewed, causing the sink to be overwhelmed.

B.The corporate firewall or network device is closing idle TCP connections to the SQL Server database.

C.The SQL Server database is experiencing high CPU utilization during the pipeline execution window.

D.The self-hosted integration runtime is running out of memory during peak loads.

AnswerB

Firewalls often drop idle connections after a timeout period. When the pipeline uses a connection from the pool that has been idle, the connection is no longer valid, causing a timeout. This explains the intermittent nature.

Why this answer

The intermittent nature of the timeout, combined with the fact that the SHIR VM can connect to SQL Server via SSMS, strongly suggests that the corporate firewall or a network intermediary (such as a load balancer or NAT device) is closing idle TCP connections. ADF pipelines may hold connections open between activities or during long-running data transfers, and if no keep-alive packets are sent within the firewall's idle timeout window (commonly 4–30 minutes), the firewall drops the TCP session. When ADF attempts to reuse that connection, it receives a 'Connection timed out' error because the socket is no longer valid.

Exam trap

The trap here is that candidates assume the error is due to resource exhaustion (CPU, memory, or data skew) because those are common causes of intermittent failures, but the specific 'Connection timed out' error points to a network-layer issue, not a server-side performance bottleneck.

How to eliminate wrong answers

Option A is wrong because data skew would cause performance issues like slow writes or out-of-memory errors on the sink, not a TCP connection timeout to the source SQL Server. Option C is wrong because high CPU utilization on SQL Server would manifest as query timeouts or slow performance, not a TCP-level connection timeout (which occurs before any query is sent). Option D is wrong because SHIR running out of memory would produce out-of-memory exceptions or pipeline failures with different error codes, not a TCP connection timeout to the database.

Full explanation →

257

Multi-Selectmedium

You are monitoring the performance of an Azure Data Factory pipeline that uses a Copy activity to load data into Azure Synapse Analytics. Which THREE metrics should you monitor to identify potential performance bottlenecks?

Select 3 answers

A.Throughput (data read/written per second).

B.Integration runtime CPU utilization.

C.Pipeline run duration.

D.Copy activity duration.

E.Data read and written metrics.

AnswersA, D, E

Throughput indicates the speed of data transfer.

Why this answer

Options A, C, and D are correct. Data read and written, copy activity duration, and throughput are direct indicators. Option B is wrong because pipeline run duration includes orchestration overhead.

Option E is wrong because integration runtime CPU is not a standard metric in Azure Data Factory.

Full explanation →

258

MCQhard

Contoso Ltd. runs a real-time analytics solution on Azure Databricks with data streaming from Event Hubs. They need to ensure that all data in transit between Event Hubs and Databricks is encrypted using TLS 1.2 or higher. Currently, the Event Hubs namespace is configured with the default TLS version (1.0). The Databricks cluster uses a public endpoint. Compliance requires that only TLS 1.2 is accepted. You need to configure the environment to enforce TLS 1.2 without disrupting ongoing streaming. What should you do?

A.Update the Event Hubs namespace to require TLS 1.2, then modify the Databricks streaming job's connection string to include 'TransportType=AmqpTls' and restart the streaming job.

B.Change the Event Hubs namespace minimum TLS version to 1.2 in the Azure portal, then reboot the Databricks cluster.

C.In the Event Hubs namespace, set 'Minimum TLS version' to 1.2 and redeploy the Databricks cluster with a new init script that forces TLS 1.2.

D.Use Azure CLI to set the Event Hubs namespace TLS version to 1.2 and update the Databricks cluster's Spark configuration to use TLS 1.2.

AnswerA

Enforces TLS 1.2 with minimal disruption.

Why this answer

Step 1: Update the Event Hubs namespace to require TLS 1.2 via 'Minimum TLS version' setting. Step 2: Configure the Databricks cluster to connect using TLS 1.2 by setting spark.conf to 'spark.eventhubs.connectionString' with TLS parameter. Step 3: Restart the streaming job to apply new connection settings.

Option D includes all steps in order. Option A reboots without config change. Option B uses Azure CLI but doesn't cover Databricks.

Option C only updates Event Hubs but not Databricks.

Full explanation →

259

Multi-Selecthard

Which TWO strategies can be used to optimize storage costs for historical data in Azure Data Lake Storage Gen2?

Select 2 answers

A.Enable soft delete to recover data

B.Use geo-redundant storage (GRS) for durability

C.Implement lifecycle management policies to move data to archive tier

D.Store data in compressed columnar format like Parquet

E.Encrypt data with Azure Storage Service Encryption

AnswersC, D

Archive tier is cheapest.

Why this answer

Option C is correct because Azure Blob Storage lifecycle management policies allow you to automatically transition data from hot to cool to archive tiers based on age or last modification time. Moving historical data to the archive tier significantly reduces storage costs, as archive is the lowest-cost storage tier, though it incurs higher retrieval latency and costs.

Exam trap

The trap here is that candidates confuse data protection features (soft delete, encryption, replication) with cost optimization strategies, but only tiering and compression directly reduce the amount or cost of stored data.

Full explanation →

260

Multi-Selecthard

Which THREE methods can you use to monitor and optimize the performance of an Azure Data Lake Storage Gen2 account?

Select 3 answers

A.Use Azure Advisor to get performance recommendations.

B.Enable Azure Monitor metrics for the storage account.

C.Implement lifecycle management policies to move data to cooler tiers.

D.Use Azure SQL Analytics to query storage logs.

E.Configure Storage Analytics logs for read and write requests.

AnswersB, C, E

Metrics like latency and throughput help monitor performance.

Why this answer

Options A, C, and D are correct. Azure Monitor metrics provide performance data; Storage Analytics logs (classic) give request details; lifecycle management optimizes costs by tiering. Option B (Azure SQL Analytics) is for SQL databases.

Option E (Azure Advisor) provides recommendations but is not a monitoring method per se; it's an advisory tool.

Full explanation →

261

Multi-Selecthard

Which THREE are best practices for optimizing query performance in Azure Synapse Analytics dedicated SQL pool?

Select 3 answers

A.Use materialized views for complex aggregations

B.Use the largest resource class for all queries

C.Create clustered columnstore indexes

D.Use hash distribution on columns used in JOINs

E.Use round-robin distribution for large fact tables

AnswersA, C, D

Pre-computes results.

Why this answer

Materialized views precompute and store the results of complex aggregations, such as SUM, COUNT, AVG, or GROUP BY operations, in Azure Synapse dedicated SQL pool. When a query references the same aggregation pattern, the optimizer can automatically substitute the materialized view, significantly reducing compute and I/O overhead by avoiding full table scans and recomputation. This is a best practice for improving performance on repetitive analytical workloads.

Exam trap

The trap here is that candidates often confuse resource class with performance optimization, assuming larger resource classes always speed up queries, when in fact they reduce concurrency and can cause resource contention, making them a poor general-purpose best practice.

Full explanation →

262

MCQhard

You are analyzing a Kusto query in Azure Data Explorer that calculates total sales per product for January 2024 and filters for products with sales over 10,000. The query uses the materialize() function. You notice that the query runs slower than expected. What is the primary reason the materialize() function may not be providing the expected performance benefit in this query?

A.The join with the Products table forces a shuffle that bypasses the materialized result

B.The query uses summarize, which already materializes results internally

C.The datetime range filter is not sargable, causing full table scan

D.The materialize() result is referenced only once in the query, so materialization adds unnecessary overhead

AnswerD

materialize() caches the result; if used once, caching is wasted.

Why this answer

Option B is correct because materialize() only helps if the result is referenced multiple times; in this query, TotalSales is used only once, so materialize() adds overhead without benefit. Option A is wrong because the query is already using summarize. Option C is wrong because datetime filtering is fine.

Option D is wrong because order by does not conflict with materialize.

Full explanation →

263

Multi-Selecteasy

You are optimizing a Spark DataFrame transformation in Azure Synapse Analytics. The DataFrame has 20 columns and 100 million rows. You notice that the job is slow due to many small files being written to the output. Which two actions can you take to reduce the number of output files? (Choose two.)

Select 2 answers

A.Use coalesce() to reduce the number of partitions without a shuffle.

B.Enable caching on the DataFrame before writing.

C.Apply bucketing on a column to group data.

D.Increase the number of partitions using repartition() with a larger number.

E.Use repartition() with a smaller number of partitions.

AnswersA, E

Coalesce reduces partitions and thus output files, minimizing shuffle.

Why this answer

Option A is correct because `coalesce()` reduces the number of partitions without triggering a full shuffle, which minimizes the number of output files while preserving performance. Since the DataFrame already has 100 million rows and 20 columns, coalescing to fewer partitions directly reduces the number of files written, addressing the small-file problem efficiently.

Exam trap

The trap here is that candidates often confuse `coalesce()` with `repartition()`, assuming both cause a shuffle, or they mistakenly think increasing partitions (Option D) will improve performance when it actually exacerbates the small-file issue.

Full explanation →

264

MCQmedium

You are designing a solution to store streaming data from multiple sources into Azure Data Lake Storage Gen2. The data must be organized by ingestion time and source system. Each source system produces data in a different format: CSV, JSON, and Parquet. The solution must allow efficient querying using Azure Synapse Serverless SQL and must support partitioning on ingestion date. What is the recommended folder structure?

A./data/{date}/{source_system}/ (e.g., /data/2023-01-01/SourceA/)

B./data/{source_system}/{date}/ (e.g., /data/SourceA/2023-01-01/)

C./data/{source_system}/ with files named {timestamp}.csv/.json/.parquet

D./data/{source_system}/{year}/{month}/{day}/ (e.g., /data/SourceA/2023/01/01/)

AnswerB

This structure separates sources and dates, enabling efficient query pruning.

Why this answer

Option B is correct because it places the source system partition first, which aligns with Azure Synapse Serverless SQL's partition elimination behavior when querying by source system. The date partition at the end allows efficient pruning on ingestion date, and the hierarchical folder structure maps directly to Hive-style partitioning, which Synapse Serverless SQL natively supports for CSV, JSON, and Parquet formats.

Exam trap

The trap here is that candidates often choose Option D because they assume year/month/day granularity is required for performance, but Azure Synapse Serverless SQL treats each folder level as a separate partition key, and a single date folder is sufficient for efficient pruning without unnecessary complexity.

How to eliminate wrong answers

Option A is wrong because placing the date before the source system forces Synapse Serverless SQL to scan all source system folders for a given date, preventing efficient partition elimination when filtering by source system. Option C is wrong because flat file naming without folder-level partitioning prevents Synapse Serverless SQL from using partition pruning at all, leading to full scans even for date- or source-filtered queries. Option D is wrong because while it uses Hive-style partitioning, the year/month/day hierarchy is redundant when a single date partition is sufficient; it adds unnecessary folder depth without performance benefit and complicates partition management.

Full explanation →

265

MCQhard

You are a data engineer at a healthcare analytics company. The company stores patient records in an Azure Data Lake Storage Gen2 account organized by /patient/{patientId}/year={yyyy}/month={MM}/day={dd}/*.parquet. There are 10,000 patients, and each patient has about 1 GB of data per year. The data is used by data scientists who run ad-hoc queries using Azure Synapse Serverless SQL. They complain that queries scanning multiple patients over the last year take too long and consume too much data. They often need to filter by patientId and a date range. You need to improve query performance and reduce the amount of data scanned. You cannot change the folder structure because it is used by other processes. What should you do?

A.Reorganize the folder structure to /year={yyyy}/month={MM}/day={dd}/patientId={patientId}/*.parquet.

B.Convert the Parquet files to CSV format to improve compression and reduce file size.

C.Create views that aggregate data by patient and date, and instruct data scientists to query the views.

D.Create external tables in Synapse Serverless SQL that use the folder structure as partitions, and ensure queries filter on year, month, and day.

AnswerD

External tables with partition elimination reduce scanned data.

Why this answer

Option D is correct because creating external tables in Synapse Serverless SQL with the existing folder structure as partitions allows the query engine to perform partition elimination. When queries filter on year, month, and day, Synapse Serverless SQL will only scan the relevant folders, drastically reducing data scanned and improving performance. This approach does not require changing the folder structure, which is used by other processes.

Exam trap

The trap here is that candidates may think views can improve performance, but in Synapse Serverless SQL, views are non-materialized and do not reduce data scanned unless the underlying data is partitioned and queries filter on partition columns.

How to eliminate wrong answers

Option A is wrong because it suggests reorganizing the folder structure, which is explicitly prohibited by the requirement that the folder structure cannot be changed. Option B is wrong because converting Parquet to CSV would increase file size (CSV is not compressed by default and lacks columnar storage benefits), worsening performance and data scanned. Option C is wrong because creating views does not change the underlying data layout or partition elimination; views are just saved queries and do not reduce the amount of data scanned unless they are materialized, which Synapse Serverless SQL does not support.

Full explanation →

266

MCQmedium

You are designing a data transformation solution for a retail company. The company receives daily CSV files from 200 stores via SFTP. The files must be cleaned, validated, and aggregated before loading into Azure Synapse dedicated SQL pool. The solution must minimize administrative overhead and support easy monitoring. Which approach do you recommend?

A.Use Azure Functions to process each file and write to Synapse via REST API

B.Use PolyBase external tables to load raw data and then use T-SQL stored procedures for transformation

C.Use Azure Databricks with Python notebooks to process the files and write to Synapse

D.Use Azure Data Factory with Mapping Data Flows to clean, validate, and aggregate the data, then load into Synapse SQL pool

AnswerD

Mapping Data Flows provide a visual interface for transformations, are serverless, and have rich monitoring via ADF.

Why this answer

Option D is correct because Azure Data Factory (ADF) with Mapping Data Flows provides a fully managed, code-free ETL service that can read CSV files from SFTP, perform cleaning, validation, and aggregation at scale using Spark clusters, and load the results directly into Azure Synapse dedicated SQL pool via the PolyBase sink. This minimizes administrative overhead by eliminating infrastructure management and supports easy monitoring through ADF’s built-in integration with Azure Monitor and pipeline run views.

Exam trap

The trap here is that candidates often overestimate the simplicity of Azure Functions for batch ETL or assume PolyBase alone handles transformations, when in fact ADF Mapping Data Flows are purpose-built for visual, scalable, and monitorable ETL with minimal overhead.

How to eliminate wrong answers

Option A is wrong because Azure Functions are stateless, event-driven compute units that lack native connectors for SFTP and Synapse, requiring custom code for file parsing, state management, and batch loading, which increases administrative overhead and complexity. Option B is wrong because PolyBase external tables can only load raw data into staging tables, but the transformation logic (cleaning, validation, aggregation) would need to be implemented in T-SQL stored procedures, which are harder to monitor, scale, and maintain compared to a visual ETL tool like ADF. Option C is wrong because Azure Databricks with Python notebooks introduces significant administrative overhead for cluster management, notebook orchestration, and monitoring, and requires more specialized skills than ADF’s low-code Mapping Data Flows, making it less suitable for minimizing overhead.

Full explanation →

267

Multi-Selecteasy

Which TWO features are available in Azure Data Lake Storage Gen2 but not in Azure Blob Storage? (Choose two.)

Select 2 answers

A.Hierarchical namespace

B.Immutable storage

C.Soft delete for blobs

D.Lifecycle management policies

E.POSIX-compliant access control lists

AnswersA, E

B is correct because hierarchical namespace enables directory structures.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) extends Azure Blob Storage by adding a hierarchical namespace, which organizes objects into a directory structure similar to a file system. This enables efficient directory-level operations (e.g., renaming or deleting a directory in O(1) time) and supports POSIX-compliant access control lists (ACLs) for fine-grained permissions. These two features are not available in standard Azure Blob Storage, which uses a flat namespace and only supports container-level access policies.

Exam trap

The trap here is that candidates often assume features like soft delete or lifecycle management are exclusive to ADLS Gen2, when in fact they are shared with Blob Storage, while the hierarchical namespace and POSIX ACLs are the true differentiators.

Full explanation →

268

MCQhard

You are troubleshooting a Synapse Spark notebook that fails when reading Parquet files from Azure Data Lake Storage Gen2. The error message indicates 'Permission denied'. The notebook uses a managed identity (System-assigned) for authentication. The Data Lake Storage account has a firewall enabled with 'Allow Azure services on the trusted services list' turned on. The storage account's RBAC role assignments include 'Storage Blob Data Contributor' for the managed identity. What is the most likely cause of the failure?

A.Parquet files require special permissions that are not granted by RBAC roles

B.The managed identity has not been granted the 'Storage Blob Data Reader' role in addition to 'Storage Blob Data Contributor'

C.The storage account firewall does not have a 'Resource instances' exception for the managed identity

D.The notebook is using an incorrect connection string with account key

AnswerC

Firewall rules require explicit addition of the managed identity as a resource instance to allow access when the firewall is enabled.

Why this answer

Option C is correct because when a storage account firewall is enabled with 'Allow Azure services on the trusted services list' turned on, it only allows trusted Azure platform services to access the storage account, but it does not automatically grant access to a specific managed identity. To allow a managed identity to bypass the firewall, you must add a 'Resource instances' exception for that managed identity's resource (e.g., the Synapse workspace). Without this explicit exception, the managed identity's request is blocked by the firewall, resulting in a 'Permission denied' error even though the RBAC role assignment is correct.

Exam trap

The trap here is that candidates assume 'Allow Azure services on the trusted services list' automatically includes all Azure resources with managed identities, but it only covers specific Azure platform services, not custom managed identities from services like Synapse.

How to eliminate wrong answers

Option A is wrong because Parquet files do not require special permissions beyond what RBAC roles provide; RBAC roles like 'Storage Blob Data Contributor' grant sufficient permissions to read and write Parquet files. Option B is wrong because 'Storage Blob Data Contributor' already includes all permissions of 'Storage Blob Data Reader' (read, write, delete), so adding the reader role is redundant and not the cause of the failure. Option D is wrong because the notebook uses a managed identity for authentication, not a connection string with an account key; the error is about permission denied, not an incorrect connection string or key.

Full explanation →

269

Multi-Selectmedium

Which TWO of the following are valid methods to load data into a dedicated SQL pool in Azure Synapse Analytics?

Select 2 answers

A.BULK INSERT

B.Azure Data Factory Copy Activity

C.COPY statement

D.Azure Import/Export service

E.PolyBase

AnswersC, E

COPY is a highly recommended loading method.

Why this answer

The COPY statement is a first-class, high-throughput ingestion method for dedicated SQL pools in Azure Synapse Analytics. It provides a simple and flexible way to load data from Azure Data Lake Storage Gen2 or Azure Blob Storage with built-in error handling, automatic schema inference, and support for various file formats (CSV, Parquet, ORC). Unlike PolyBase, the COPY statement does not require external tables and is optimized for performance with parallel loading.

Exam trap

The trap here is that candidates often confuse the COPY statement with BULK INSERT, assuming both are valid for dedicated SQL pools, but BULK INSERT is only supported in SQL Server and Azure SQL Database, not in Synapse dedicated SQL pools.

Full explanation →

270

MCQeasy

You need to monitor the performance of an Azure Stream Analytics job that processes real-time IoT data. Which metric indicates the number of events that are being dropped or delayed due to insufficient processing capacity?

A.Watermark delay.

B.Output events.

C.Backlogged input events.

D.Input events.

AnswerC

Backlogged input events shows the number of events that are queued but not yet processed, indicating capacity issues.

Why this answer

Option B is correct because 'Backlogged input events' indicates events that are waiting to be processed. Option A is wrong because 'Input events' is the total received. Option C is wrong because 'Output events' is the total sent.

Option D is wrong because 'Watermark delay' measures latency.

Full explanation →

271

Multi-Selecthard

Which THREE of the following are required to configure a managed private endpoint for Azure Data Factory when connecting to an Azure SQL Database that has a private endpoint?

Select 3 answers

A.The managed identity of Azure Data Factory must be granted access to the private endpoint.

B.The private endpoint must be in the same subscription as the Azure Data Factory.

C.A private DNS zone for Azure SQL Database must be linked to the virtual network.

D.The Azure SQL Database firewall must have 'Allow Azure services and resources to access this server' enabled.

E.The private endpoint must be in the same region as the Azure Data Factory.

AnswersC, D, E

Required for name resolution.

Why this answer

Option C is correct because when using a managed private endpoint for Azure Data Factory to connect to an Azure SQL Database that has a private endpoint, a private DNS zone (privatelink.database.windows.net) must be linked to the virtual network. This ensures that DNS resolution for the Azure SQL Database private endpoint resolves to the private IP address within the virtual network, enabling connectivity over the private link without relying on public endpoints.

Exam trap

The trap here is that candidates often confuse granting the managed identity access to the private endpoint (which is unnecessary) with granting it access to the Azure SQL Database itself, or assume that the private endpoint must be in the same subscription or region as the Data Factory, when in fact only the virtual network region matters and cross-subscription private endpoints are supported with proper approvals.

Full explanation →

272

Multi-Selecthard

Your company uses Azure Synapse Analytics for a data warehouse. The fact table is 500 GB and distributed by hash on CustomerID. You notice that queries joining the fact table with the Customer dimension table are slow due to data movement. The Customer dimension table is 10 GB. Which THREE actions should you take to improve query performance?

Select 3 answers

A.Use heap index on the fact table

B.Change the fact table distribution to round-robin

C.Change the fact table distribution to hash on CustomerID

D.Implement partitioning on the fact table by date

E.Replicate the Customer dimension table to all compute nodes

AnswersC, D, E

Hash distribution on the join key colocates data with the dimension.

Why this answer

Option C is correct because the fact table is already distributed by hash on CustomerID, which is the join key with the Customer dimension table. This co-locates matching rows on the same compute node, eliminating data movement during joins. Keeping this distribution is essential for performance; changing it would break co-location and increase shuffle overhead.

Exam trap

The trap here is that candidates might think changing the distribution method (e.g., to round-robin) would help, but they overlook that the fact table is already correctly hash-distributed on the join key, and the real issue is the dimension table not being replicated, causing unnecessary data movement.

Full explanation →

273

MCQhard

You are designing a near-real-time analytics pipeline for a retail company. Transaction data is generated in Azure SQL Database and must be replicated to Azure Synapse Analytics (dedicated SQL pool) with less than 5 minutes latency. The source table has 50 million rows and 200 columns, but only 30 columns are needed for analytics. Which approach should you recommend?

A.Use Azure SQL Database Change Tracking and push changes to Azure Event Hubs, then use Azure Stream Analytics to write to Synapse.

B.Enable Change Data Capture (CDC) on the source table and use Azure Data Factory with a 1-minute tumbling window to copy changes into Synapse.

C.Use Azure Synapse PolyBase to directly query the source SQL database every 5 minutes.

D.Schedule a full copy of the entire table every 5 minutes using Azure Data Factory.

AnswerB

CDC captures only changed rows, and ADF can run frequently to meet latency target.

Why this answer

Option B is correct because Azure Data Factory (ADF) with Change Data Capture (CDC) on the source SQL database can incrementally copy only changed rows (inserts, updates, deletes) into Azure Synapse Analytics using a 1-minute tumbling window, meeting the sub-5-minute latency requirement while minimizing data volume. This approach efficiently handles 50 million rows by transferring only the 30 needed columns, avoiding full table scans and reducing network load.

Exam trap

The trap here is that candidates often confuse Change Tracking (which only tracks that a row changed, not the actual changes) with Change Data Capture (which captures the before-and-after values), leading them to choose Option A without realizing the missing push mechanism and the need for additional services to achieve near-real-time replication.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database Change Tracking does not natively push changes to Event Hubs; it requires custom logic or additional services (e.g., Azure Functions) to bridge the gap, adding complexity and potential latency that may not guarantee sub-5-minute replication. Option C is wrong because PolyBase in Synapse is designed for batch querying of external data sources, not for near-real-time incremental replication; querying the source SQL database every 5 minutes would perform full table scans on 50 million rows, causing high source database load and failing to meet latency requirements. Option D is wrong because scheduling a full copy of the entire 50-million-row table every 5 minutes is extremely inefficient, consumes excessive bandwidth and Synapse storage resources, and would likely exceed the latency window due to the time required for a full data transfer.

Full explanation →

274

MCQeasy

A manufacturing company uses Azure Data Lake Storage Gen2 with hierarchical namespace enabled and Azure Databricks for analytics. The security team requires that all data stored in the 'raw' container be encrypted at rest using customer-managed keys. The data is ingested via Azure Data Factory. What should the data engineer configure to meet the requirement?

A.Assign an Azure Policy that requires encryption at rest.

B.Enable Azure Information Protection on the storage account.

C.Configure the storage account to use Azure Key Vault for customer-managed key encryption.

D.Enable the 'require secure transfer' setting on the storage account.

AnswerC

This enables encryption at rest with a customer-managed key.

Why this answer

Option C is correct because Azure Data Lake Storage Gen2 with hierarchical namespace supports encryption at rest using customer-managed keys (CMK) via Azure Key Vault. To meet the security requirement, the data engineer must configure the storage account's encryption settings to use a key from Azure Key Vault, which allows the organization to control and rotate the encryption keys independently of Azure.

Exam trap

The trap here is that candidates may confuse encryption at rest (which is always enabled by default) with the specific requirement for customer-managed keys, leading them to pick Azure Policy or 'require secure transfer' as a catch-all security measure.

How to eliminate wrong answers

Option A is wrong because an Azure Policy can enforce encryption at rest, but it does not specify the use of customer-managed keys; it only ensures that encryption is enabled (which is already default with Microsoft-managed keys). Option B is wrong because Azure Information Protection is a classification and labeling service for data, not an encryption-at-rest mechanism for storage accounts. Option D is wrong because 'require secure transfer' enforces HTTPS for data in transit, not encryption at rest, and does not involve customer-managed keys.

Full explanation →

275

Multi-Selectmedium

You are designing a data storage solution for a manufacturing company that collects sensor data from machines. The data is stored in Azure Data Lake Storage Gen2. You need to ensure that the solution can handle large volumes of streaming data (up to 100 MB/s) and provide real-time dashboards. Which TWO services should you include?

Select 2 answers

A.Azure Analysis Services

B.Azure Data Factory

C.Azure Databricks

D.Azure Stream Analytics

E.Azure Event Hubs

AnswersD, E

Stream Analytics processes streaming data and can output to real-time dashboards.

Why this answer

Azure Stream Analytics is correct because it is a real-time analytics service designed to process high-velocity streaming data (up to 100 MB/s) from sources like Event Hubs and output to dashboards and storage. It provides low-latency, SQL-based querying for real-time dashboards, making it ideal for manufacturing sensor data scenarios.

Exam trap

Microsoft often tests the misconception that Azure Databricks is a real-time dashboard service, but it is primarily a processing engine that requires additional integration for dashboard output, whereas Stream Analytics is purpose-built for direct, low-latency dashboarding.

Full explanation →

276

MCQmedium

You have an Azure Synapse Analytics dedicated SQL pool. You notice that some queries are taking longer than expected. After reviewing the query plans, you see that some queries are spilling to tempdb. What should you do to reduce tempdb spills?

A.Increase the resource class for the user executing the queries.

B.Redistribute the tables using hash distribution.

C.Rebuild all columnstore indexes.

D.Add partitioning to the tables.

AnswerA

Larger resource class allocates more memory, reducing tempdb spills.

Why this answer

Tempdb spills occur when a query requires more memory than is allocated to it, forcing intermediate results to be written to disk. Increasing the resource class for the user executing the queries allocates more memory to that user's queries, reducing the likelihood of spills. This directly addresses the memory constraint that causes spills in a dedicated SQL pool.

Exam trap

The trap here is that candidates often confuse performance tuning techniques like indexing or partitioning with memory management, assuming any optimization will fix spills, when only increasing memory allocation (via resource class) directly addresses the root cause.

How to eliminate wrong answers

Option B is wrong because redistributing tables using hash distribution improves data movement and join performance but does not directly increase per-query memory allocation to prevent tempdb spills. Option C is wrong because rebuilding columnstore indexes improves compression and scan performance but does not address the memory grant issue that causes spills. Option D is wrong because adding partitioning can improve partition elimination and manageability but does not increase the memory available to individual queries, so it will not reduce tempdb spills.

Full explanation →

277

Multi-Selectmedium

You are implementing a data processing solution using Azure Data Factory. You have a pipeline that copies data from Azure Blob Storage to Azure Data Lake Storage Gen2. You need to ensure that the copy activity uses managed identity for authentication and that the data is transferred securely. Which TWO configurations should you apply?

Select 2 answers

A.Set the copy activity's 'enableDataIntegrityValidation' to true.

B.Set the copy activity's 'UseManagedIdentity' property to true.

C.Configure the source and sink linked services to use managed identity authentication.

D.Set the integration runtime to 'AutoResolveIntegrationRuntime' to ensure data stays within Azure.

E.Enable staged copy with a temporary blob store to improve security.

AnswersC, D

Linked services support managed identity for authentication.

Why this answer

Options B and D are correct. Option B is correct because managed identity authentication is configured in the linked service. Option D is correct because setting integration runtime to 'AutoResolveIntegrationRuntime' uses Azure's network, ensuring data transfer remains within Azure and does not go over public internet (if using managed VNet).

Option A is wrong because the copy activity does not have a 'UseManagedIdentity' setting; it's in the linked service. Option C is wrong because HTTPS is already used by default; setting it explicitly is not required. Option E is wrong because staging is not needed for security.

Full explanation →

278

MCQmedium

A data engineer is designing a solution to store historical sales data for a retail company. The data is append-only and accessed infrequently for compliance reports. The solution must minimize storage costs while allowing retrieval within 24 hours. Which storage tier should be used for the data?

A.Hot tier

B.Archive tier

C.Cool tier

D.Premium tier

AnswerC

Cost-effective for infrequently accessed data with retrieval within hours.

Why this answer

The Cool tier is the correct choice because it is optimized for data that is infrequently accessed and stored for at least 30 days, offering low storage costs with retrieval times in the range of seconds to hours, which meets the 24-hour retrieval requirement. The data is append-only and used for compliance, so the Cool tier balances cost and accessibility without the high retrieval costs or long rehydration delays of the Archive tier.

Exam trap

The trap here is that candidates often choose the Archive tier because it has the lowest storage cost, overlooking the rehydration latency and the fact that retrieval within 24 hours is not guaranteed with standard priority rehydration, especially under heavy demand.

How to eliminate wrong answers

Option A is wrong because the Hot tier is designed for frequently accessed data and has higher storage costs, which would unnecessarily increase expenses for infrequently accessed compliance data. Option B is wrong because the Archive tier has the lowest storage cost but requires a rehydration process that can take up to 15 hours (and often longer), which may not guarantee retrieval within 24 hours and incurs significant read and data retrieval costs. Option D is wrong because the Premium tier is for high-performance, low-latency access (e.g., for transactional or real-time workloads) and is the most expensive option, making it unsuitable for cost-minimized, infrequently accessed historical data.

Full explanation →

279

Multi-Selecthard

Which THREE metrics should you monitor to evaluate the performance of an Azure Stream Analytics job?

Select 3 answers

A.Input Events Backlogged

B.Output Events

C.Conversion Errors

D.SU (Memory) Utilization

E.Watermark Delay (seconds)

AnswersA, B, E

Shows backlog of unprocessed events.

Why this answer

WatermarkDelay (indicates latency), InputEventsBacklog (backlog of unprocessed events), and OutputEvents (throughput) are key performance metrics. Option D is wrong because SU (Memory) utilization is a resource metric, not a performance metric. Option E is wrong because ConversionErrors is an error metric, not a performance indicator.

Full explanation →

280

MCQmedium

You are designing a data processing solution using Azure Stream Analytics. You need to ensure that the output to Azure SQL Database is optimized to minimize the number of write operations. Which output configuration should you use?

A.Use a partitioned output to distribute writes across multiple tables.

B.Set the compatibility level to 1.2.

C.Increase the event serialization format batch size.

D.Use a windowed aggregation to batch writes.

AnswerD

Windowed aggregations (e.g., tumbling, hopping windows) collect events over a time window and output a single result, reducing the number of write operations.

Why this answer

Option D is correct because using a windowed aggregation (e.g., TumblingWindow) reduces the number of writes by batching results. Option A is wrong because increasing batch size alone is not a Stream Analytics configuration. Option B is wrong because partitioning increases parallelism but not necessarily reduces writes.

Option C is wrong because the 'All' compatibility level does not optimize write operations.

Full explanation →

281

Multi-Selecteasy

Which TWO of the following are supported sources for Azure Data Factory Copy activity? (Choose two.)

Select 2 answers

A.Power BI Dataset

B.Azure DevOps

C.Amazon S3

D.Azure Blob Storage

E.Azure Analysis Services

AnswersC, D

Amazon S3 is supported via the Amazon S3 connector.

Why this answer

Option A is correct because Azure Blob Storage is a supported source. Option C is correct because Amazon S3 is a supported source via a connector. Option B is wrong because Power BI is not a data source for Copy activity.

Option D is wrong because Azure Analysis Services is not a supported source. Option E is wrong because Azure DevOps is not a data store.

Full explanation →

282

MCQmedium

You have a dedicated SQL pool in Azure Synapse that stores a fact table with over 100 billion rows. Query performance is degrading over time. You notice that the table is hash-distributed on a column with many duplicate values. What is the most likely impact?

A.Statistics on the table are outdated.

B.The table is not properly partitioned.

C.Data compression is not working efficiently.

D.Data is unevenly distributed across distributions, causing some distributions to be overloaded.

AnswerD

Duplicate values in hash column cause skew.

Why this answer

D is correct because a hash-distributed table with a column that has many duplicate values leads to data skew. When the hash function maps many rows to the same distribution, some distributions become overloaded with data while others are underutilized. This imbalance causes query performance to degrade as the overloaded distributions become bottlenecks for processing.

Exam trap

The trap here is that candidates often confuse distribution skew with partitioning or statistics issues, but the key clue is the mention of 'many duplicate values' in the hash-distributed column, which directly points to data skew as the root cause.

How to eliminate wrong answers

Option A is wrong because outdated statistics can cause suboptimal query plans, but the primary issue described is data skew due to hash distribution on a column with many duplicates, not statistics freshness. Option B is wrong because partitioning is a separate concept from distribution; while partitioning can help with partition elimination, it does not address the fundamental data skew caused by hash distribution on a high-duplicate column. Option C is wrong because data compression efficiency is affected by data patterns and storage, not directly by distribution skew; compression works at the page level and is not the root cause of query performance degradation from uneven distribution.

Full explanation →

283

MCQhard

You have an Azure Synapse Analytics dedicated SQL pool that stores sensitive financial data. You need to ensure that all queries accessing the data are audited and that any use of unapproved client tools is blocked. What should you implement?

A.Enable Microsoft Defender for Cloud on the server.

B.Configure Azure Private Link for the dedicated SQL pool.

C.Enable auditing and set an IP firewall rule that allows only approved IP ranges.

D.Assign Azure RBAC roles to users for the SQL pool.

AnswerC

Auditing logs queries, and IP firewall restricts access to approved clients.

Why this answer

Option C is correct because Azure Synapse Analytics supports IP firewall rules and auditing. By enabling auditing and creating a firewall rule that only allows approved client IP ranges, you can both audit queries and block unapproved tools. Option A is wrong because Azure Private Link only provides network-level isolation, not auditing or blocking.

Option B is wrong because Microsoft Defender for Cloud provides threat detection but not granular control over client tools. Option D is wrong because Azure role-based access control (RBAC) alone does not enforce client tool restrictions.

Full explanation →

284

MCQhard

You manage an Azure Synapse Analytics dedicated SQL pool that contains a large fact table 'Orders' with 500 million rows. The table is hash-distributed on 'OrderDate' and uses a clustered columnstore index. Query performance has degraded over time. You check the system DMVs and find that the columnstore segments have poor quality, with many deleted rows and compressed rowgroups below 1 million rows. You need to improve query performance without blocking writes to the table. What should you do?

A.Run ALTER INDEX REORGANIZE with COMPRESS_ALL_ROW_GROUPS = ON.

B.Drop and recreate the clustered columnstore index.

C.Re-cluster the table using a different distribution key.

D.Run ALTER INDEX REBUILD on the clustered columnstore index.

AnswerA

Online operation that improves columnstore quality.

Why this answer

Option B is correct because ALTER INDEX REORGANIZE with COMPRESS_ALL_ROW_GROUPS option rebuilds columnstore segments online without blocking. Option A is wrong because REBUILD is offline and blocks writes. Option C is wrong because dropping and recreating the index is offline.

Option D is wrong because reclustering reorganizes data within a distribution but does not fix columnstore quality.

Full explanation →

285

MCQhard

You have a pipeline in Azure Data Factory that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails with a 'Connection closed' error after 1 hour. The data volume is 500 GB over a slow network. What is the most likely cause and solution?

A.The source SQL Server query command timeout is set to 1 hour; increase the command timeout or use staging copy.

B.The network bandwidth is insufficient; use a faster network connection.

C.The copy activity is using a single parallel copy; increase the degree of copy parallelism.

D.The Self-Hosted Integration Runtime (SHIR) has a time-to-live (TTL) setting that is too short; extend the TTL.

AnswerA

The default command timeout for SQL source is 30 minutes, but if set to 1 hour, the connection closes after that. Increasing timeout or using staging copy resolves it.

Why this answer

The 'Connection closed' error after exactly 1 hour indicates the SQL Server source query command timeout (default 60 minutes) is being hit. When copying 500 GB over a slow network, the query execution exceeds this timeout, causing the connection to drop. Increasing the command timeout or using staging copy (which breaks the query into chunks) resolves this by allowing the query to run longer or reducing the per-query duration.

Exam trap

The trap here is that candidates confuse a network bandwidth issue (Option B) with a command timeout, but the exact 1-hour failure is a dead giveaway of the default SQL command timeout, not a generic connectivity problem.

How to eliminate wrong answers

Option B is wrong because insufficient bandwidth would cause slow transfer or eventual timeout, but not a precise 1-hour 'Connection closed' error; the error is tied to a timeout setting, not bandwidth. Option C is wrong because increasing parallelism improves throughput but does not address the root cause of the source query timing out after a fixed duration. Option D is wrong because the Self-Hosted Integration Runtime TTL controls how long the SHIR stays active between activities, not the duration of a single copy operation; the error occurs during the copy, not after idle time.

Full explanation →

286

MCQeasy

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline uses a self-hosted integration runtime and runs successfully during business hours. However, after a recent network security update, the pipeline fails with a connection error to the on-premises SQL Server. What is the most likely cause?

A.A firewall rule on the on-premises SQL Server is blocking the self-hosted integration runtime.

B.The Azure Blob Storage account has been moved to a different subscription.

C.The self-hosted integration runtime node has been de-registered from Azure Data Factory.

D.The blob container has reached its maximum capacity.

AnswerA

Network security updates often change firewall rules, blocking the IR.

Why this answer

The self-hosted integration runtime (SHIR) connects to on-premises SQL Server via TCP port 1433 by default. A recent network security update likely added a firewall rule on the SQL Server or the on-premises network that blocks outbound or inbound traffic on this port, preventing the SHIR from establishing a connection. Since the pipeline ran successfully before the update, the most probable cause is a new firewall restriction targeting the SHIR's IP address or subnet.

Exam trap

The trap here is that candidates may confuse a source-side connectivity failure with a sink-side or authentication issue, but the question explicitly states a 'connection error to the on-premises SQL Server,' which points directly to network or firewall blocking at the source, not to storage capacity or SHIR registration status.

How to eliminate wrong answers

Option B is wrong because moving an Azure Blob Storage account to a different subscription does not affect the connectivity between the on-premises SHIR and the on-premises SQL Server; it would only change the storage account's resource ID and require updating linked service credentials, not cause a connection error to SQL Server. Option C is wrong because if the SHIR node were de-registered, the pipeline would fail with an integration runtime not found error, not a connection error to the on-premises SQL Server; the error message would reference the SHIR status, not a network-level timeout or refused connection. Option D is wrong because a blob container reaching maximum capacity (5 TB per container) would result in a storage write error (e.g., 403 or 409) when copying data, not a connection error to the on-premises SQL Server; the error would occur at the sink, not the source.

Full explanation →

287

MCQmedium

You are reviewing a copy job configuration in Azure Data Factory that copies Parquet files from Azure Data Lake Storage Gen2 to Azure Synapse Analytics. The exhibit shows the job settings. If the source folder contains a file that is not in Parquet format (e.g., a CSV file), what will happen?

A.The copy job will skip the CSV file and stop.

B.The copy job will fail with an error.

C.The copy job will skip the CSV file and continue copying other Parquet files.

D.The copy job will attempt to read the CSV file as Parquet and may produce corrupt data.

AnswerC

skipIncompatibleFiles=true causes skipping non-Parquet files.

Why this answer

When using Azure Data Factory's Copy Activity with a wildcard file path or a dataset that filters for Parquet files (e.g., *.parquet), the service evaluates the file pattern before attempting to read the file. If a CSV file is present in the same folder but does not match the Parquet filter, ADF simply ignores it and continues processing only the matching Parquet files. This behavior is by design to allow flexible file selection without causing failures.

Exam trap

The trap here is that candidates assume ADF will attempt to read all files in a folder regardless of extension, leading them to choose Option D (corrupt data) or Option B (failure), when in fact ADF respects the file pattern filter and silently skips non-matching files.

How to eliminate wrong answers

Option A is wrong because the copy job does not stop after skipping a non-matching file; it continues processing remaining files that match the filter. Option B is wrong because the copy job does not fail with an error when encountering a non-Parquet file; it only fails if the file matches the pattern but cannot be parsed as Parquet. Option D is wrong because ADF does not attempt to read a CSV file as Parquet when the file pattern explicitly excludes it; the file is simply not processed.

Full explanation →

288

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must ensure that sensitive columns containing personally identifiable information (PII) are masked at query time for users without explicit permissions. Which Azure Synapse Analytics feature should you use?

A.Row-Level Security

B.Transparent Data Encryption

C.Dynamic Data Masking

D.Always Encrypted

AnswerC

DDM masks sensitive columns in query results for users without UNMASK permission.

Why this answer

Dynamic Data Masking (DDM) is the correct feature because it obfuscates sensitive data in query results for users without the UNMASK permission. Option A is wrong because Always Encrypted protects data at rest and in transit but requires client-side changes. Option C is wrong because Row-Level Security restricts rows, not columns.

Option D is wrong because Transparent Data Encryption encrypts the entire database at rest.

Full explanation →

289

MCQmedium

Your company has an Azure Data Factory pipeline that ingests data from multiple sources into Azure Data Lake Storage Gen2. The pipeline uses a self-hosted integration runtime (IR) running on an on-premises Windows server. Recently, the pipeline started failing with 'Connection timed out' errors during peak hours. You suspect network congestion. You need to resolve this issue with minimal cost and without modifying the pipeline activities. What should you do?

A.Implement Azure ExpressRoute to provide dedicated bandwidth.

B.Increase the 'Polling Interval' setting in the copy activity.

C.Scale out the self-hosted IR by adding more nodes to the cluster.

D.Migrate the self-hosted IR to Azure-SSIS IR.

AnswerC

Distributes load and improves throughput.

Why this answer

Option C is correct because scaling out the self-hosted IR by adding more nodes distributes the load and reduces timeout issues. Option A is wrong because moving to Azure-SSIS IR is expensive and unnecessary. Option B is wrong because increasing polling interval does not fix timeouts.

Option D is wrong because Azure ExpressRoute is a costly network upgrade.

Full explanation →

290

Multi-Selecteasy

You are monitoring an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. You notice frequent failures due to transient network errors. Which TWO actions should you take to improve reliability?

Select 2 answers

A.Deploy a self-hosted integration runtime on a VM in Azure.

B.Use staged copy with Azure Data Lake as intermediate storage.

C.Enable fault tolerance in the copy activity to skip incompatible rows.

D.Configure a retry policy on the copy activity.

E.Increase the degree of copy parallelism.

AnswersC, D

Fault tolerance allows pipeline to continue despite errors.

Why this answer

Options B and C are correct. Retry policy and fault tolerance handle transient errors. Option A is wrong because staging is for bulk copies, not for transient errors.

Option D is wrong because self-hosted IR is for connectivity, not for retry. Option E is wrong because ParallelCopy increases throughput but not reliability.

Full explanation →

291

MCQmedium

Your company is building a real-time dashboard for monitoring website traffic. The data is ingested from web servers into Azure Event Hubs. You need to design a stream processing solution using Azure Stream Analytics that computes the number of unique visitors per minute, per country, and outputs the results to Azure Synapse Analytics for reporting. The solution must handle out-of-order events with a maximum late arrival of 10 seconds. You also need to ensure exactly-once semantics for the output. Which combination of settings should you use?

A.Use a Sliding window of length 1 minute, set out-of-order policy to 10 seconds, and use 'At least once' output mode.

B.Use a Session window with timeout of 1 minute and maximum duration of 10 seconds, set late arrival to 10 seconds, and use 'Exactly once' output mode.

C.Use a Hopping window of size 1 minute and hop 1 minute, set late arrival policy to 10 seconds, and use 'Exactly Once' output mode.

D.Use a Tumbling window of size 1 minute, set late arrival policy to 10 seconds, and configure output to Azure Synapse Analytics with `Exactly once` semantics.

AnswerD

Tumbling window aligns with fixed intervals; late arrival policy handles out-of-order events; Azure Synapse Analytics supports exactly-once with proper configuration.

Why this answer

Option D is correct because a Tumbling window counts events per minute, and the late arrival policy with a 10-second window handles out-of-order events. Exactly-once semantics require that output format is Delta Lake or similar with Idempotent writes; however, Azure Synapse Analytics supports exactly-once with appropriate configuration. Option A is wrong because Hopping windows are for overlapping windows.

Option B is wrong because Sliding windows are for point-in-time counts. Option C is wrong because Session windows are for periods of inactivity.

Full explanation →

292

MCQmedium

Your organization uses Azure Synapse Analytics to run large-scale ETL jobs. A pipeline that loads data from Azure Data Lake Storage Gen2 into a dedicated SQL pool is failing with 'Out of memory' errors during the staging step. The source data is 500 GB in size, and the SQL pool is currently set to DW500c. What should you do to resolve the issue without changing the pipeline logic?

A.Use PolyBase instead of the Copy activity.

B.Switch the source to Azure Data Lake Storage Gen1.

C.Increase the DWU of the dedicated SQL pool to a higher tier (e.g., DW1000c).

D.Partition the source data into smaller files.

AnswerC

Higher DWU provides more memory per distribution, resolving OOM errors.

Why this answer

The 'Out of memory' error during the staging step indicates that the dedicated SQL pool lacks sufficient memory resources to handle the 500 GB data load within the current DWU setting. Increasing the DWU (Data Warehouse Units) from DW500c to a higher tier, such as DW1000c, scales up the memory and compute resources available to the pool, resolving the memory pressure without altering the pipeline logic or source data.

Exam trap

The trap here is that candidates may assume partitioning the source data is a 'logical' fix without realizing it requires pipeline logic changes, or they may confuse PolyBase as a memory-saving alternative rather than a loading protocol that still relies on the pool's resources.

How to eliminate wrong answers

Option A is wrong because PolyBase is not a separate activity but a technology used within the Copy activity or other loading methods; switching to PolyBase does not inherently increase memory and would still fail under the same resource constraints. Option B is wrong because switching the source to Azure Data Lake Storage Gen1 does not affect the memory allocation of the dedicated SQL pool and introduces no performance benefit for the staging step. Option D is wrong because partitioning the source data into smaller files would require changing the pipeline logic (e.g., adding loops or multiple activities) to process files sequentially, which violates the constraint of not changing pipeline logic.

Full explanation →

293

MCQmedium

A company uses Azure Key Vault to store secrets for data pipelines. They need to rotate the secrets automatically every 90 days. What should they implement?

A.Use Azure Policy to enforce secret expiration.

B.Assign RBAC roles to a service principal to update the secret.

C.Create a manual process to update the secret in Key Vault.

D.Configure Key Vault secret rotation with an expiration date of 90 days.

AnswerD

Key Vault can automatically rotate secrets based on expiration.

Why this answer

Option C is correct because Key Vault supports automatic rotation with a specified expiration period. Option A is wrong because manual rotation is not automatic. Option B is wrong because Azure Policy does not rotate secrets.

Option D is wrong because RBAC controls access, not rotation.

Full explanation →

294

MCQhard

Refer to the exhibit. The pipeline fails with a 'Type mismatch' error. The source file has a column 'Name' of type String, and the destination table expects 'FullName' of type String. What is the most likely cause of the failure?

A.The recursive setting on the source store is causing duplicate reads.

B.The source file does not contain a column named 'Name'.

C.The source file has leading/trailing whitespace in the column names.

D.The sink column 'FullName' is expecting a different data type than String.

AnswerB

The column mapping references a source column 'Name', but if the actual file has a different column header, the copy activity fails with type mismatch.

Why this answer

The pipeline fails with a 'Type mismatch' error because the source file does not contain a column named 'Name'. When the source dataset is configured to read a 'Name' column but the actual file lacks that column, Azure Data Factory or Synapse Pipelines cannot map it to the sink column 'FullName', resulting in a type mismatch error. The error message is misleading because the mismatch is not about data types but about missing source columns, which the service treats as a type incompatibility.

Exam trap

The trap here is that candidates assume 'Type mismatch' always refers to incompatible data types (e.g., String vs. Int), when in fact it can also be triggered by a missing source column that the pipeline expects to map, causing a schema-level mismatch.

How to eliminate wrong answers

Option A is wrong because the recursive setting on the source store controls whether subfolders are included in the read operation; it does not cause duplicate reads that would lead to a type mismatch error. Option C is wrong because leading/trailing whitespace in column names would cause a different error (e.g., column not found) but not a 'Type mismatch' error, and the pipeline would still attempt to map the column if the name matched after trimming. Option D is wrong because the sink column 'FullName' is of type String, matching the source column 'Name' type String, so there is no data type mismatch between the two; the error arises from the missing source column, not from incompatible types.

Full explanation →

295

MCQhard

You are designing a solution to store telemetry data from millions of devices. Each device sends a JSON payload every 5 seconds. The data must be partitioned by device ID and time for efficient querying and must support real-time streaming ingestion. Which Azure storage solution should you recommend?

A.Azure SQL Database

B.Azure Blob Storage with Azure Event Hubs

C.Azure Cosmos DB

D.Azure Table Storage

AnswerC

Correct. Cosmos DB supports real-time ingestion, automatic partitioning, and low-latency queries.

Why this answer

Azure Cosmos DB is the correct choice because it offers a multi-model, globally distributed database service with native support for real-time streaming ingestion via the Change Feed and automatic indexing. Its partition key design (device ID + time) enables efficient querying across millions of devices, and it guarantees single-digit millisecond read/write latencies essential for telemetry data arriving every 5 seconds.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage with Event Hubs as a streaming solution, but Blob Storage is not designed for real-time, low-latency writes from millions of devices, and the combination adds unnecessary complexity and latency compared to Cosmos DB's native streaming support.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational database that does not natively support real-time streaming ingestion at the scale of millions of devices every 5 seconds, and its partitioning capabilities are limited compared to Cosmos DB's horizontal scaling. Option B is wrong because Azure Blob Storage is an object store optimized for large, unstructured data, not for low-latency, high-frequency writes from millions of devices, and while Event Hubs can ingest streams, the combination requires additional processing to write to Blob Storage, adding latency and complexity. Option D is wrong because Azure Table Storage is a NoSQL key-value store that lacks native support for real-time streaming ingestion, automatic indexing, and the flexible querying capabilities needed for time-based and device-ID-based queries at this scale.

Full explanation →

296

MCQhard

You are optimizing a data pipeline in Azure Synapse Analytics that loads data from a CSV file in ADLS Gen2 into a dedicated SQL pool using PolyBase. The load is slow and you need to improve performance. Which action would be MOST effective?

A.Increase the service level (DWU) of the dedicated SQL pool.

B.Change the file format from CSV to Avro.

C.Combine the CSV files into fewer, larger files before loading.

D.Use Azure Data Factory to stage the data in Azure Blob Storage before loading.

AnswerC

PolyBase performs more efficiently with fewer, larger files due to reduced metadata operations.

Why this answer

Combining many small CSV files into fewer, larger files reduces the number of file open/close operations and minimizes the overhead of PolyBase's external file enumeration and parallel split logic. PolyBase performs best when each file is at least 256 MB, as it can then assign a full file to each distribution, avoiding the overhead of splitting tiny files across multiple threads.

Exam trap

The trap here is that candidates assume scaling up (DWU) or changing file formats always improves performance, but the exam specifically tests the understanding that PolyBase's parallel processing is most efficient when file sizes align with distribution boundaries, making file consolidation the most effective optimization.

How to eliminate wrong answers

Option A is wrong because increasing DWU scales resources but does not address the root cause of slow PolyBase loads—file fragmentation and metadata overhead—and may incur unnecessary cost. Option B is wrong because changing to Avro improves compression and schema evolution but does not reduce the number of file operations; the performance gain from file count reduction is more direct. Option D is wrong because staging in Azure Blob Storage adds an extra copy step without reducing the number of files PolyBase must process; the bottleneck remains the file enumeration and split overhead.

Full explanation →

297

Multi-Selectmedium

Your organization uses Azure Data Lake Storage Gen2 to store parquet files. You need to secure the data at rest and control access. Which THREE methods should you implement?

Select 3 answers

A.Set POSIX-like ACLs on directories and files.

B.Configure RBAC roles to control access to storage accounts.

C.Configure Azure Storage Firewall to allow only trusted IPs.

D.Enable Azure Storage Service Encryption (SSE) for data at rest.

E.Enable soft delete for blobs.

AnswersA, B, D

ACLs provide fine-grained access control.

Why this answer

Options A, B, and D are correct. Encryption at rest is done by Azure Storage Service Encryption. Access control is via RBAC and ACLs.

Option C is wrong because firewall restricts network access, not data at rest. Option E is wrong because soft delete is for data recovery, not security.

Full explanation →

298

Multi-Selectmedium

Which TWO actions should you take to secure data in transit between an Azure Synapse Analytics serverless SQL pool and a client application?

Select 2 answers

A.Use Azure RBAC to restrict access to the SQL pool.

B.Configure the serverless SQL pool to enforce TLS 1.2 connections.

C.Use Azure Virtual Network service endpoints for the SQL pool.

D.Disable SSL encryption to reduce latency.

E.Use Azure ExpressRoute to connect to the SQL pool.

AnswersB, C

TLS 1.2 is the minimum recommended protocol.

Why this answer

Options A and C are correct. Enforcing TLS 1.2 ensures a modern secure protocol; using VNet service endpoints keeps traffic within Azure. Option B (disable SSL) is insecure.

Option D (use Azure RBAC) is for auth, not encryption. Option E (use ExpressRoute) provides private connectivity but does not encrypt data by itself; it's more for network security.

Full explanation →

299

MCQeasy

Your team has deployed an Azure Stream Analytics job that writes output to Azure Cosmos DB. You need to monitor the job for data latency and ensure it meets a service-level agreement (SLA) of under 10 seconds from input to output. Which metric should you track in Azure Monitor?

A.Output events.

B.Runtime errors.

C.Watermark delay.

D.Input events.

AnswerC

Watermark delay measures the maximum time difference between the input and output, indicating end-to-end latency.

Why this answer

Option B is correct because 'Watermark delay' indicates the maximum time between the input event being received and the output being produced. Option A is wrong because 'Input events' shows volume, not latency. Option C is wrong because 'Output events' shows throughput.

Option D is wrong because 'Runtime errors' indicates failures.

Full explanation →

300

Multi-Selectmedium

Which THREE metrics should you monitor to optimize the performance of an Azure Synapse Analytics dedicated SQL pool? (Choose three.)

Select 3 answers

A.Storage space used

B.Queued queries

C.DWU (Data Warehouse Unit) usage

D.Login failures

E.TempDB usage

AnswersB, C, E

Queries waiting for resources indicate concurrency issues.

Why this answer

Options A, C, and D are correct. A: DWU usage indicates overall resource utilization. C: Queued queries indicate concurrency throttling.

D: TempDB usage can cause performance degradation if high. Option B is wrong because storage space is not a performance metric (though important for capacity). Option E is wrong because login failures are security-related, not performance.

Full explanation →

Page 4 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →