Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 526–600

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 8 of 12

526

Multi-Selecteasy

Which TWO Azure services can be used to ingest streaming data into Azure Synapse Analytics?

Select 2 answers

A.Azure Databricks Auto Loader.

B.Azure Stream Analytics.

C.Azure Data Factory.

D.Azure Event Hubs.

E.Azure Logic Apps.

AnswersB, D

Stream Analytics can output to Synapse SQL pool or ADLS Gen2.

Why this answer

Azure Stream Analytics is correct because it is a fully managed stream processing engine that can ingest streaming data from sources like Azure Event Hubs and output directly to Azure Synapse Analytics (via SQL pool or dedicated SQL pool). It enables real-time analytics on streaming data before landing it in Synapse for further analysis.

Exam trap

The trap here is that candidates often confuse batch ingestion tools (like Azure Data Factory or Auto Loader) with real-time streaming services, or mistakenly think Event Hubs alone can ingest into Synapse without a processing layer like Stream Analytics.

Full explanation →

527

Multi-Selecthard

Which THREE Azure services can you use together to build a serverless, event-driven data processing pipeline that ingests data from Azure Blob Storage, transforms it using custom code, and loads it into Azure Cosmos DB?

Select 3 answers

A.Azure Functions

B.Azure Cosmos DB output binding (with Azure Functions)

C.Azure Batch

D.Azure Event Grid

E.Azure Logic Apps

AnswersA, B, D

Functions can be triggered by Blob events and run custom code.

Why this answer

Options A, C, and E are correct. Azure Functions can be triggered by Blob Storage events to run custom code, and the output can be written to Cosmos DB via the Cosmos DB output binding. Option B is wrong because Azure Batch is for large-scale parallel computing, not event-driven triggers.

Option D is wrong because Azure Logic Apps is more for orchestration, not custom code execution.

Full explanation →

528

MCQhard

Refer to the exhibit. You are reviewing the workload classifier configuration for an Azure Synapse Analytics dedicated SQL pool. You notice that the 'HeavyLoader' classifier has a queryExecutionTimeoutSeconds of 0. What is the implication of this setting?

A.Queries classified as 'HeavyLoader' will wait indefinitely for resources.

B.The configuration is invalid; queryExecutionTimeoutSeconds must be greater than 0.

C.Queries classified as 'HeavyLoader' will not have a timeout.

D.Queries classified as 'HeavyLoader' will timeout immediately.

AnswerC

A value of 0 disables the query execution timeout.

Why this answer

Option B is correct because a timeout of 0 means no timeout is applied. Option A is wrong because 0 does not mean immediate timeout. Option C is wrong because 0 does not mean infinite wait; it means no timeout.

Option D is wrong because 0 is not invalid.

Full explanation →

529

MCQmedium

Your company runs a critical data pipeline using Azure Data Factory (ADF) that ingests data from multiple sources into an Azure Synapse Analytics dedicated SQL pool. Recently, you have observed that the pipeline frequently fails with the error: 'Operation for target table failed: 'Cannot insert duplicate key row in object 'dbo.FactSales' with unique index 'PK_FactSales'. The duplicate key value is (20241001, 12345).'' The pipeline uses a Copy activity with a stored procedure sink that merges data into the fact table. The fact table has a clustered columnstore index and a unique constraint on (DateKey, ProductKey). You need to modify the pipeline to handle duplicates without losing data and without impacting performance significantly. What should you do?

A.Configure the Copy activity sink to use 'upsert' behavior with the unique key columns.

B.Change the distribution of the fact table to round-robin and remove the unique constraint.

C.Use a staging table and then execute a T-SQL MERGE statement to update or insert.

D.Add a pre-copy script to delete existing rows that match the incoming data before the copy.

AnswerA

ADF's upsert uses the source to update matching rows and insert new ones, avoiding duplicate key violations.

Why this answer

Option A is correct because Azure Data Factory's Copy activity supports native upsert behavior when using a stored procedure sink, allowing it to handle duplicate key violations by updating existing rows instead of failing. By specifying the unique key columns (DateKey, ProductKey) in the upsert configuration, the pipeline can merge incoming data into the fact table without requiring manual staging or pre-cleanup, minimizing performance impact by leveraging the existing clustered columnstore index and unique constraint.

Exam trap

The trap here is that candidates often overcomplicate the solution by choosing a manual staging table approach (Option C) or a destructive pre-copy script (Option D), not realizing that ADF's native upsert feature is designed specifically to handle duplicate key violations in a performant and atomic manner.

How to eliminate wrong answers

Option B is wrong because changing the distribution to round-robin and removing the unique constraint would eliminate the duplicate detection mechanism, potentially allowing data integrity issues and degrading query performance due to data movement during joins. Option C is wrong because using a staging table and a T-SQL MERGE statement introduces additional latency and complexity, and while it can handle duplicates, it is less efficient than the native upsert feature in ADF, which is optimized for such scenarios. Option D is wrong because adding a pre-copy script to delete existing rows before the copy is a workaround that can cause data loss (deleting legitimate rows) and does not handle concurrent inserts or updates gracefully, leading to potential race conditions and performance overhead.

Full explanation →

530

Multi-Selectmedium

Which TWO actions are appropriate when designing a data processing solution that must meet strict SLAs for latency and throughput?

Select 2 answers

A.Partition data by date and hour to improve query performance

B.Implement Auto-Tune for Spark workloads in Azure Synapse Analytics

C.Process all data synchronously to ensure consistency

D.Use a single large cluster for all workloads to simplify management

E.Use a single node for orchestration to reduce complexity

AnswersA, B

Partitioning reduces data scanned and improves throughput.

Why this answer

Partitioning data by date and hour (Option A) is appropriate because it enables partition elimination, where queries only scan relevant partitions rather than the entire dataset. This directly reduces latency and improves throughput by minimizing I/O and compute resources needed for time-range queries, which is critical for meeting strict SLAs in data processing solutions.

Exam trap

The trap here is that candidates confuse synchronous processing with data consistency guarantees, overlooking that distributed systems can achieve consistency via idempotent writes or checkpointing without sacrificing latency and throughput.

Full explanation →

531

Multi-Selecthard

Which THREE considerations are important when designing a table distribution strategy for an Azure Synapse Analytics dedicated SQL pool? (Choose three.)

Select 3 answers

A.Align distribution keys on tables that are frequently joined together

B.Minimize data skew by choosing a distribution key with many unique values

C.Use round-robin distribution for large fact tables to distribute data evenly

D.Use replicated tables for large tables to avoid data movement

E.Consider the size of the table and the frequency of joins

AnswersA, B, E

B is correct because colocated joins avoid data movement.

Why this answer

Option A is correct because aligning distribution keys on tables that are frequently joined together ensures that the join columns are hash-distributed on the same key, enabling collocated joins. This avoids data movement across distributions during query execution, which significantly improves performance in Azure Synapse Analytics dedicated SQL pools.

Exam trap

The trap here is that candidates often confuse round-robin distribution as a good choice for large fact tables because it distributes data evenly, but they overlook the severe performance penalty from data movement during joins and aggregations.

Full explanation →

532

Matchingmedium

Match each performance optimization technique to its description.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Dividing data into smaller manageable segments

Creating structures to speed up data retrieval

Pre-computed and stored query results

Column-based storage for analytics queries

Why these pairings

These techniques improve query performance in Azure Synapse.

Full explanation →

533

Multi-Selectmedium

You are designing a batch processing solution in Azure Synapse Analytics using pipelines. The solution must load data from multiple sources (Azure Blob Storage, Azure SQL Database, and REST API) into a dedicated SQL pool. After loading, you need to run a stored procedure to aggregate the data. Which two activities should you include in the pipeline? (Choose two.)

Select 2 answers

A.Execute Pipeline activity

B.Copy activity

C.Azure Function activity

D.Stored procedure activity

E.Data Flow activity

AnswersB, D

Copy activity loads data from sources to SQL pool.

Why this answer

Options B and D are correct. Copy activity is used to copy data from sources. Stored procedure activity is used to execute the aggregation procedure.

Option A is wrong because Data Flow is for transformations, but the requirement is to load and then run a stored procedure. Option C is wrong because Execute Pipeline is for running another pipeline, not for executing a stored procedure. Option E is wrong because Azure Function activity is for custom code, not needed here.

Full explanation →

534

MCQeasy

A data engineer needs to store log data from multiple applications in Azure. The data is append-only, heavily compressed, and queried infrequently. Cost minimization is critical. Which storage solution is best?

A.Azure Table Storage

B.Azure Cosmos DB with analytical store

C.Azure Blob Storage with cool or archive access tier

D.Azure Data Lake Storage Gen2 with hot tier

AnswerC

Low cost for infrequent access.

Why this answer

Azure Blob Storage with cool or archive access tier is the best choice because the data is append-only, heavily compressed, and infrequently queried, making cost minimization the top priority. The cool tier offers low storage costs with higher access charges, while the archive tier provides the lowest storage cost for data that is rarely accessed and can tolerate hours of retrieval latency. This aligns perfectly with the append-only, infrequently queried nature of the log data.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage's access tiers with Data Lake Storage Gen2's tiers, assuming the hot tier is always the default for log data, but the question's emphasis on 'cost minimization' and 'infrequently queried' explicitly points to cool or archive tiers, not the hot tier.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a NoSQL key-value store designed for structured, semi-structured data with frequent point queries, not for large, append-only, compressed log blobs; it lacks the cost-optimized access tiers needed for infrequent access. Option B is wrong because Azure Cosmos DB with analytical store is a globally distributed, multi-model database optimized for low-latency transactional and analytical workloads, which is over-engineered and costly for append-only, infrequently queried log data; its analytical store is designed for near-real-time analytics, not cold storage. Option D is wrong because Azure Data Lake Storage Gen2 with hot tier is optimized for high-frequency access and big data analytics, with higher storage costs than cool or archive tiers, making it unsuitable for cost minimization when data is infrequently queried.

Full explanation →

535

MCQmedium

You are using Azure Synapse Analytics serverless SQL pool to query data in Parquet files stored in Azure Data Lake Storage Gen2. The queries are slow when filtering on a date column. You need to improve query performance without changing the data structure. What should you do?

A.Create materialized views in the serverless SQL pool

B.Increase the service level objective (SLO) of the serverless SQL pool

C.Convert the Parquet files to CSV format

D.Partition the Parquet files into folders by date

AnswerD

Partition elimination allows the serverless SQL pool to read only relevant folders, improving performance.

Why this answer

D is correct because partitioning Parquet files into folders by date enables partition elimination in Azure Synapse serverless SQL pool. When queries filter on the date column, the engine can prune entire folders from the scan, reading only the relevant Parquet files. This reduces I/O and improves performance without altering the data structure or format.

Exam trap

The trap here is that candidates may assume serverless SQL pool supports materialized views or SLO adjustments like dedicated SQL pool, but serverless SQL pool lacks these features and relies on data layout optimizations such as partitioning for performance.

How to eliminate wrong answers

Option A is wrong because materialized views in serverless SQL pool are not supported; they are only available in dedicated SQL pool. Option B is wrong because serverless SQL pool does not have a configurable service level objective (SLO); it scales automatically based on workload and cannot be manually increased. Option C is wrong because converting Parquet to CSV would increase file size and degrade performance due to the lack of columnar compression and predicate pushdown capabilities.

Full explanation →

536

MCQhard

You are a data engineer for a retail company. The company uses Azure Data Lake Storage Gen2 to store raw transaction data partitioned by date. Each day, a folder is created with the format 'YYYY/MM/DD' containing thousands of small JSON files (each ~10 KB). An Azure Databricks job runs daily to read the previous day's folder, transform the data, and write to a Delta table for reporting. Over time, the job's execution time has increased from 15 minutes to over 2 hours. The job uses a cluster with 4 nodes (each 16 GB memory). Monitoring shows that the job spends most of its time in the 'listing files' stage. Which optimization should you implement to reduce the job duration?

A.Increase the number of nodes in the cluster to 16.

B.Change the output format from JSON to Delta and enable Delta caching.

C.Pre-process the raw data to coalesce small JSON files into larger parquet files (e.g., 256 MB each).

D.Use Azure Data Factory instead of Databricks to copy the raw data.

AnswerC

Reduces the number of files, drastically cutting listing time.

Why this answer

The job spends most of its time in the 'listing files' stage because reading thousands of small JSON files (each ~10 KB) from Azure Data Lake Storage Gen2 incurs high metadata operation overhead. Coalescing these small files into larger Parquet files (e.g., 256 MB each) reduces the number of files that Spark must list and process, dramatically cutting down the listing stage time and improving overall throughput.

Exam trap

The trap here is that candidates often assume scaling the cluster (Option A) will solve any performance issue, but they fail to recognize that metadata operations like file listing are not parallelized across nodes and are limited by the storage account's API limits, not compute resources.

How to eliminate wrong answers

Option A is wrong because increasing the number of nodes to 16 does not address the root cause of high metadata overhead from listing thousands of small files; it would only add more parallelism to a bottleneck that is I/O and metadata-bound, not CPU-bound. Option B is wrong because changing the output format to Delta and enabling Delta caching optimizes the write/read side of the Delta table, but the bottleneck is in the input stage (listing and reading raw JSON files), not in the output stage. Option D is wrong because using Azure Data Factory to copy the raw data does not solve the file listing problem; it would still need to list the same small files and would not transform the data, and it introduces an unnecessary extra service without addressing the core issue of small file overhead.

Full explanation →

537

MCQeasy

You are designing a data storage solution for a retail company. The data includes transactional data that requires low-latency queries (under 10 milliseconds) and large historical data for analytics. The solution must minimize storage costs. Which approach should you recommend?

A.Use Azure Data Lake Storage Gen2 for both transactional and historical data

B.Use Azure Cache for Redis for transactional data and Azure SQL Database for historical data

C.Use Azure Cosmos DB for transactional data and Azure Blob Storage for historical data

D.Use Azure SQL Database with Hyperscale tier for both transactional and historical data

AnswerC

Cosmos DB offers low-latency reads/writes, and Blob Storage is cheap for bulk historical data.

Why this answer

Azure Cosmos DB provides single-digit millisecond latency for transactional workloads, meeting the under-10ms requirement, while Azure Blob Storage offers low-cost storage for large historical data. This combination minimizes storage costs by using the most cost-effective service for each workload type.

Exam trap

The trap here is that candidates may assume a single service like Azure SQL Database or Data Lake Storage can handle both transactional and analytical workloads efficiently, overlooking the cost and performance trade-offs that make a hybrid approach optimal.

How to eliminate wrong answers

Option A is wrong because Azure Data Lake Storage Gen2 is optimized for big data analytics, not low-latency transactional queries, and cannot guarantee under 10ms response times. Option B is wrong because Azure Cache for Redis is an in-memory cache, not a durable transactional store, and Azure SQL Database for historical data incurs higher storage costs compared to Blob Storage. Option D is wrong because Azure SQL Database Hyperscale, while scalable, is more expensive for large historical data storage and does not minimize costs as effectively as Blob Storage.

Full explanation →

538

Multi-Selecthard

You are designing a stream processing solution using Azure Stream Analytics. The job must reference a static lookup table (product catalog) stored in Azure Blob Storage. The catalog is updated once daily. The job should automatically pick up the latest version without restarting. Which two configurations are required? (Choose two.)

Select 2 answers

A.Configure the reference input with a static blob path

B.Set the reference input's 'Path pattern' to include date and time placeholders

C.Enable 'Automatic refresh' and set the refresh rate to 1 day

D.Use Azure Event Grid to trigger job restart on blob update

E.Store the reference data in Azure SQL Database instead of Blob Storage

AnswersB, C

Why this answer

Option B is correct because Azure Stream Analytics reference data inputs support path pattern placeholders like {date} and {time} to dynamically resolve the latest blob file. This allows the job to automatically load a new version of the static lookup table when the blob is updated, without requiring a job restart. The path pattern must be structured to match the naming convention of the uploaded file, such as 'catalog/{date}/{time}/products.csv'.

Exam trap

The trap here is that candidates often think a static path or Event Grid restart is needed, but the question specifically tests the combination of dynamic path patterns and automatic refresh to achieve zero-downtime updates in Azure Stream Analytics.

Why the other options are wrong

A

Static path does not trigger auto-refresh; you need a pattern to detect new blobs.

D

Restarting the job is not required; auto-refresh avoids restart.

E

While SQL Database is an option, the question specifies Blob Storage; auto-refresh works with Blob Storage.

Full explanation →

539

MCQmedium

You have an Azure Databricks workspace with a cluster that uses a Standard_LRS managed disk. You need to ensure that data at rest is encrypted using a customer-managed key (CMK). What should you configure?

A.Configure Azure Storage Service Encryption with a customer-managed key

B.Enable double encryption with a customer-managed key in Azure Disk Encryption

C.Enable Transparent Data Encryption (TDE) in Azure SQL Database

D.Use Azure Purview to classify and encrypt data

AnswerB

Azure Databricks clusters can use Azure Disk Encryption with CMK.

Why this answer

Option A is correct because Azure Databricks supports CMK encryption via Azure Disk Encryption or by using a key vault key for the managed disk. Option B (Azure Storage Service Encryption) is for storage accounts, not managed disks. Option C (Azure SQL Database TDE) is irrelevant.

Option D (Azure Purview) is for data governance, not encryption.

Full explanation →

540

MCQhard

You are designing a batch processing solution in Azure Databricks that reads Parquet files from Azure Data Lake Storage Gen2, performs aggregations, and writes results to a Delta table. The data volume is expected to grow to 10 TB per batch. You need to minimize shuffle operations during the aggregation step. Which approach should you recommend?

A.Partition the Delta table by the aggregation key and use the same partitioning when reading.

B.Use bucketing on a different column to reduce partition skew.

C.Use a broadcast join to avoid shuffle on the aggregation.

D.Use coalesce to reduce the number of partitions before aggregation.

AnswerA

Partitioning on the aggregation key ensures data is already grouped, minimizing shuffle.

Why this answer

Option B is correct because partitioning the input data on the aggregation key ensures that each partition contains all rows for a given key, eliminating the need for a full shuffle. Option A is wrong because bucketing with a different column would not avoid shuffle on the aggregation key. Option C is wrong because coalesce reduces partitions but does not avoid shuffle.

Option D is wrong because broadcast join is for joining small tables, not for aggregations.

Full explanation →

541

MCQhard

You are a data engineer at a financial services company. Your Azure Synapse Analytics dedicated SQL pool contains a fact table named 'Transactions' with 10 billion rows. The table is hash-distributed on 'AccountID' and partitioned by month. You notice that queries filtering on 'TransactionDate' (a date column) are performing slowly despite partition elimination. You also observe that the 'Transactions' table is frequently joined with a 'DimAccount' dimension table on 'AccountID'. You need to optimize query performance for the most common workload: monthly reports that aggregate transaction amounts by account for the last 12 months. Additionally, you need to ensure that the solution minimizes maintenance overhead. What should you do?

A.Create a clustered columnstore index on the table

B.Redistribute the table on TransactionDate using hash distribution

C.Change distribution to round-robin to evenly distribute data

D.Use table replication for the Transactions table

AnswerA

Improves compression and scan performance for aggregations

Why this answer

Option C is correct because implementing a columnstore index on the fact table will compress data and improve scan performance for aggregations. Option A is wrong because hash distribution on TransactionDate would cause data skew (many rows per date) and is not suitable for high-cardinality columns. Option B is wrong because round-robin distribution would eliminate collocation benefits for joins with DimAccount.

Option D is wrong because table replication is for small dimension tables, not large fact tables.

Full explanation →

542

Multi-Selecteasy

Which TWO of the following Azure services can be used to orchestrate data pipelines that include data transformation?

Select 2 answers

A.Azure Synapse Pipelines

B.Azure Data Factory

C.Azure Logic Apps

D.Azure Databricks

E.Azure Functions

AnswersA, B

Synapse Pipelines are built on ADF and provide similar capabilities.

Why this answer

Azure Synapse Pipelines (option A) is correct because it provides a native orchestration service within Azure Synapse Analytics for building, scheduling, and monitoring data pipelines that include data transformation activities, such as Data Flow or Stored Procedure tasks. It leverages the same integration runtime as Azure Data Factory and supports both code-free and code-based transformation steps.

Exam trap

The trap here is that candidates often confuse compute services (like Databricks or Functions) with orchestration services, mistakenly thinking they can replace Azure Data Factory or Synapse Pipelines for end-to-end pipeline management, when in fact they are typically used as activities within an orchestrated pipeline.

Full explanation →

543

Matchingmedium

Match each Azure data integration tool to its typical use case.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Query external data in Azure Storage using T-SQL

High-throughput data ingestion into Synapse SQL

Orchestrate data movement and transformation

Complex data engineering with notebooks

Why these pairings

These tools serve different data integration needs.

Full explanation →

544

MCQmedium

You are monitoring a production data pipeline in Azure Data Factory that runs hourly. You notice that the pipeline has been failing for the last 3 hours due to a timeout error when writing to the sink Azure SQL Database. The database is a General Purpose tier with 100 DTUs. The pipeline writes approximately 1 million rows per run. What is the most effective way to prevent the timeout?

A.Reduce the degree of copy parallelism in the copy activity.

B.Enable staged copy using Azure Blob Storage as an interim store.

C.Scale up the Azure SQL Database to a higher DTU tier.

D.Increase the writeBatchSize in the sink settings to reduce the number of batch inserts.

AnswerD

Larger batch size reduces the number of round trips and can prevent timeout.

Why this answer

Option D is correct because increasing the batch size reduces the number of round trips, which can speed up the write operation and reduce timeout risk. Option A is wrong because reducing parallelism might increase duration. Option B is wrong because increasing DTUs helps but is more costly and may not be necessary.

Option C is wrong because using a staging blob could add overhead and still timeout.

Full explanation →

545

Multi-Selecthard

You are optimizing the performance of an Azure Synapse Analytics dedicated SQL pool. Which TWO actions can help reduce data movement during query execution?

Select 2 answers

A.Use hash distribution on a column that is not used in joins.

B.Use replicated tables for small dimension tables.

C.Use round-robin distribution for large fact tables.

D.Increase the resource class for the loading user.

E.Distribute fact tables on the join key columns.

AnswersB, E

Replicated tables copy data to all nodes, avoiding movement for joins.

Why this answer

Options A and B are correct. Distributing tables on join keys reduces shuffling. Using replicated tables for small dimension tables avoids broadcasting.

Option C is wrong because round-robin distribution increases data movement. Option D is wrong because hash distribution on a non-join column may cause unnecessary shuffling. Option E is wrong because increasing resource class does not directly reduce data movement.

Full explanation →

546

MCQeasy

You need to ensure that data stored in Azure Data Lake Storage Gen2 is encrypted at rest using customer-managed keys. Which Azure service should you use to manage the keys?

A.Azure Key Vault

B.Microsoft Purview

C.Azure Confidential Computing

D.Microsoft Entra ID

AnswerA

Azure Key Vault stores customer-managed encryption keys.

Why this answer

Option A is correct because Azure Key Vault is used to store customer-managed keys for Azure Storage encryption. Option B is wrong because Microsoft Purview is for data governance. Option C is wrong because Azure Confidential Computing is for compute.

Option D is wrong because Microsoft Entra ID is for identity.

Full explanation →

547

MCQmedium

Your organization has an Azure Synapse Analytics dedicated SQL pool that stores sensitive customer data. You need to ensure that only authorized users can access the data, and auditing must be enabled to track all access attempts. What should you do first?

A.Implement column-level security to restrict sensitive columns.

B.Enable auditing on the SQL pool and configure a storage account for audit logs.

C.Configure Microsoft Entra ID authentication and use RBAC to grant only necessary permissions.

D.Apply dynamic data masking to the sensitive columns.

AnswerC

RBAC with Microsoft Entra ID is the primary method to control access.

Why this answer

Option B is correct because enabling Microsoft Entra ID authentication and assigning roles via RBAC is the first step to control access. Option A is wrong because enabling auditing does not restrict access. Option C is wrong because column-level security only restricts specific columns, not overall access.

Option D is wrong because dynamic data masking obfuscates data but does not prevent access.

Full explanation →

548

MCQhard

You are troubleshooting a data processing job in Azure Synapse Pipelines that fails intermittently with the error: 'Operation on target Sink failed: The request was aborted: Could not create SSL/TLS secure channel.' The pipeline reads from Azure Blob Storage and writes to an Azure SQL Database. The source and sink are in the same region. What is the most likely cause?

A.Azure SQL Database firewall rules blocking the IP address of the integration runtime.

B.Transient network connectivity issues between the services.

C.The Azure SQL Database DTU limit has been exceeded, causing throttling.

D.The self-hosted integration runtime is using TLS 1.0, which is not supported by the services.

AnswerD

SSL/TLS handshake failure often stems from TLS version mismatch.

Why this answer

The error 'Could not create SSL/TLS secure channel' indicates a TLS version mismatch. Azure SQL Database and Azure Blob Storage require at least TLS 1.2 for secure connections. If the self-hosted integration runtime (SHIR) is configured to use TLS 1.0, the handshake fails because the services reject the older protocol.

This is the most likely cause given the intermittent nature and the specific error message.

Exam trap

The trap here is that candidates confuse SSL/TLS errors with network connectivity or throttling issues, overlooking the specific protocol version mismatch that Azure services now enforce.

How to eliminate wrong answers

Option A is wrong because firewall rules blocking the SHIR IP would produce a different error (e.g., 'Cannot open server '...' requested by the login') and would be consistent, not intermittent. Option B is wrong because transient network issues typically result in timeouts or retryable errors, not a specific SSL/TLS channel creation failure. Option C is wrong because DTU throttling would cause performance degradation or 'Request limit exceeded' errors, not an SSL/TLS handshake failure.

Full explanation →

549

MCQeasy

You are monitoring Azure Stream Analytics job performance. The job is falling behind in processing real-time data. You notice that the SU (Streaming Unit) utilization is consistently at 90% or higher. What is the most appropriate action to improve throughput?

A.Change the output to use a partition scheme

B.Reduce the window duration in the query

C.Increase the number of Streaming Units (SUs)

D.Decrease the event ordering tolerance

AnswerC

Scaling out increases processing capacity

Why this answer

Option A is correct because increasing the number of SUs (scale out) is the direct way to increase throughput when SU utilization is high. Option B is wrong because it can cause data loss. Option C is wrong because it reduces throughput.

Option D is wrong because partitioning might help but the immediate need is to scale.

Full explanation →

550

MCQeasy

You need to monitor the performance of Azure Synapse Analytics dedicated SQL pool queries. Which Azure service should you use to identify long-running queries and resource bottlenecks?

A.Microsoft Purview Data Map.

B.Azure Synapse Studio monitoring hub and dynamic management views (DMVs).

C.Azure Log Analytics queries.

D.Azure Monitor Workbooks.

AnswerB

Monitoring hub and DMVs are designed for real-time query performance analysis.

Why this answer

Option A is correct because SQL pool monitoring provides DMVs and metrics for query performance. Option B is wrong because Azure Monitor Logs can be used but is not the primary tool for live query monitoring. Option C is wrong because Log Analytics requires setting up data collection.

Option D is wrong because Microsoft Purview is for data governance, not performance monitoring.

Full explanation →

551

MCQeasy

You need to store semi-structured JSON data from a web application that requires low-latency reads and writes at a global scale. The data must be indexed automatically and support SQL-like queries. Which Azure data store should you use?

A.Azure SQL Database

B.Azure Blob Storage

C.Azure Cosmos DB (NoSQL API)

D.Azure Table Storage

AnswerC

Cosmos DB provides global distribution, auto-indexing, and SQL API for JSON.

Why this answer

Azure Cosmos DB with the NoSQL API is the correct choice because it natively stores semi-structured JSON documents, provides automatic indexing of all properties, supports SQL-like queries via its query engine, and offers low-latency reads and writes at global scale through multi-region replication and configurable consistency levels. This combination directly matches the requirements for a globally distributed web application needing fast, queryable JSON storage.

Exam trap

The trap here is that candidates often confuse Azure Table Storage's key-value capabilities with Cosmos DB's document model, mistakenly thinking Table Storage supports SQL-like queries and automatic indexing, when in fact it only supports OData queries and requires explicit partition and row keys for efficient access.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational database that requires a fixed schema and is not optimized for semi-structured JSON data without manual schema management or JSON functions, nor does it provide automatic indexing of all JSON properties. Option B is wrong because Azure Blob Storage is an object store for unstructured data that does not support SQL-like queries or automatic indexing; it requires a separate compute layer (e.g., Azure Data Lake Analytics) for querying. Option D is wrong because Azure Table Storage is a NoSQL key-value store that does not support SQL-like queries, automatic indexing of all fields, or native JSON document storage; it uses OData queries and a flat schema.

Full explanation →

552

MCQeasy

Refer to the exhibit. An Azure Policy is defined to enforce network security on storage accounts. What does this policy do?

A.Denies storage accounts that do not have any IP rules defined

B.Denies storage accounts that have firewall rules configured

C.Denies storage accounts that have public network access disabled

D.Denies storage accounts that allow public network access from all networks

AnswerD

defaultAction=Allow means public access from all networks.

Why this answer

The Azure Policy in the exhibit uses the 'Deny' effect with a condition that checks if the 'networkAcls.defaultAction' property is set to 'Allow'. When 'defaultAction' is 'Allow', the storage account permits traffic from all networks, including the internet. The policy denies such configurations to enforce network security by requiring that public network access be restricted.

Exam trap

The trap here is that candidates confuse the 'defaultAction' property with the presence of IP rules or firewall settings, leading them to think the policy denies accounts with any firewall rules rather than those that allow all networks.

How to eliminate wrong answers

Option A is wrong because the policy does not evaluate the presence or absence of IP rules; it only checks the 'defaultAction' property. Option B is wrong because the policy denies accounts that allow all networks, not those with firewall rules configured; firewall rules are a separate mechanism. Option C is wrong because the policy denies accounts where public network access is enabled (defaultAction = 'Allow'), not disabled; disabling public access would set defaultAction to 'Deny', which the policy does not target.

Full explanation →

553

MCQeasy

You are designing data security for Azure Data Lake Storage Gen2. The requirement is to prevent data from being accessed by anyone outside the corporate network. Which feature should you enable?

A.Use Azure Private Endpoint or service endpoint with a VNet.

B.Assign RBAC roles to deny access to all except corporate users.

C.Configure IP firewall rules to allow only corporate IP ranges.

D.Enable encryption at rest using customer-managed keys.

AnswerA

Private endpoint ensures data is accessed only from within the VNet.

Why this answer

Azure Private Endpoint or service endpoint with a VNet ensures that all traffic to the storage account stays within the corporate network and never traverses the public internet. Private Endpoint assigns a private IP from the VNet to the storage account, effectively isolating it from public access. This meets the requirement to prevent access from outside the corporate network by enforcing network-level isolation.

Exam trap

The trap here is that candidates often confuse network-level security (Private Endpoint) with access control (RBAC) or data protection (encryption), thinking that denying RBAC roles or enabling encryption alone can prevent external access, when only network isolation truly blocks traffic from outside the corporate network.

How to eliminate wrong answers

Option B is wrong because RBAC roles control authorization (who can access data) but do not enforce network boundaries; a user with the correct role could still access data from outside the corporate network. Option C is wrong because IP firewall rules can be bypassed if an attacker spoofs an allowed IP address or if the corporate network uses dynamic public IPs, and they do not provide the same level of isolation as Private Endpoint. Option D is wrong because encryption at rest protects data at the storage layer but does not control network access; data could still be accessed from outside the corporate network if other security measures are not in place.

Full explanation →

554

MCQhard

You are building a streaming pipeline in Azure Stream Analytics that reads from an Azure Event Hubs input with 10 partitions. The query performs a GROUP BY on a column that is not the partition key. To ensure consistency, which partitioning scheme should you use?

A.Use 'Passthrough' partitioning

B.Use 'PartitionBy' with the GROUP BY column

C.Increase the number of SUs to handle skew

D.Use 'INTO' with a 'PARTITION BY' clause

AnswerB

Why this answer

Option B is correct because when performing a GROUP BY on a column that is not the partition key, you must use the PARTITION BY clause in the query to ensure that all rows with the same grouping value are processed by the same Stream Analytics node. This guarantees consistency and correctness of the aggregation, as it avoids data being split across multiple nodes without proper alignment.

Exam trap

The trap here is that candidates often confuse 'Passthrough' partitioning with automatic handling of GROUP BY, or they think increasing SUs can fix data skew, but the core requirement is explicit repartitioning via PARTITION BY to align the data with the grouping key.

Why the other options are wrong

A

Passthrough keeps the original partition scheme, which may not align with the GROUP BY column.

C

Scaling SUs does not fix partitioning alignment issues.

D

INTO is for output, not for repartitioning within the query.

Full explanation →

555

MCQhard

Your company uses Azure Data Factory to orchestrate data pipelines that ingest data from on-premises SQL Server to Azure Data Lake Storage Gen2. The network team has implemented a firewall that only allows outbound traffic on port 443. The on-premises SQL Server is not accessible via public endpoint. You need to configure a secure connection that complies with the firewall rules and uses managed identity for authentication. What should you use?

A.Use Azure ExpressRoute to connect the on-premises network to Azure, then use an Azure Integration Runtime with a VNet injection.

B.Set up a point-to-site VPN from Azure to on-premises and use an Azure Integration Runtime with a VNet integration.

C.Install a Self-hosted Integration Runtime on an on-premises VM, register it with Azure Data Factory using managed identity, and configure the pipeline to use this IR for the SQL Server connection.

D.Use an Azure Integration Runtime with a public endpoint and configure a firewall rule to allow the Azure IR IP addresses.

AnswerC

Uses private network and port 443 for communication.

Why this answer

Option B is correct: A self-hosted integration runtime (IR) installed on-premises with managed identity can connect to SQL Server using private network, then communicate with Azure Data Factory over port 443. Option A is incorrect: Azure IR cannot connect to on-premises SQL Server without a public endpoint. Option C is incorrect: Azure VPN Gateway requires additional network configuration and doesn't use managed identity.

Option D is incorrect: ExpressRoute is a private connection but doesn't address authentication.

Full explanation →

556

MCQeasy

You are building a data transformation in Azure Databricks using PySpark. The data includes a column 'timestamp' in string format 'yyyy-MM-dd HH:mm:ss'. You need to convert this to a timestamp type and extract the date part for partitioning. Which code snippet should you use?

A.df.withColumn('date', col('timestamp').cast('date'))

B.df.withColumn('date', to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss'))

C.df.withColumn('date', to_date('timestamp', 'yyyy-MM-dd HH:mm:ss'))

D.df.withColumn('date', to_date('timestamp'))

AnswerC

Correctly converts string to date with format.

Why this answer

Option C is correct because `to_date` with the format string 'yyyy-MM-dd HH:mm:ss' converts the string column to a date type, extracting only the date part (year, month, day) as required for partitioning. This matches the requirement to convert the timestamp string to a date for partitioning, not a full timestamp.

Exam trap

The trap here is that candidates often confuse `to_date` and `to_timestamp`, assuming both extract only the date, or they forget that `cast('date')` does not accept a custom format string, leading to runtime errors or null values.

How to eliminate wrong answers

Option A is wrong because `cast('date')` on a string column will fail or produce null if the string is not in a default date format (e.g., 'yyyy-MM-dd'), and it does not accept a custom format pattern. Option B is wrong because `to_timestamp` converts the string to a full timestamp type (including time), not just the date part, which is not suitable for partitioning by date. Option D is wrong because `to_date('timestamp')` without a format string relies on the default date format (typically 'yyyy-MM-dd'), which will fail or produce incorrect results for strings with time components like 'yyyy-MM-dd HH:mm:ss'.

Full explanation →

557

MCQhard

Your Azure Data Factory pipeline uses Copy Activity to ingest data from Azure Blob Storage to Azure Synapse Analytics. You need to minimize network latency and data transfer costs. Which data integration approach should you choose?

A.Use a staging copy with Azure Blob as intermediate

B.Use PolyBase to load data directly from Blob Storage

C.Use a self-hosted integration runtime in the same region

D.Use an Azure integration runtime in the same region as the data stores

AnswerD

Minimizes data transfer costs and latency.

Why this answer

Using Azure IR in the same region as the source and sink ensures data movement stays within Azure backbone, minimizing egress costs. Option A (staging) adds cost. Option B (self-hosted IR) is for on-premises.

Option C (polybase) is for loading into SQL DW but still uses IR.

Full explanation →

558

MCQmedium

You are designing a data pipeline in Azure Synapse Analytics that ingests streaming taxi trip data from Azure Event Hubs. The data must be processed in near real-time and stored in a dedicated SQL pool. The pipeline should handle late-arriving data (up to 30 minutes late) without reprocessing the entire stream. Which Azure service should you use to process the streaming data?

A.Azure Databricks Structured Streaming

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Functions with Event Hubs trigger

AnswerB

Azure Stream Analytics provides built-in support for late-arriving events and can output directly to Synapse SQL pool.

Why this answer

Azure Stream Analytics is the correct choice because it is designed for real-time stream processing with native support for Event Hubs as an input and dedicated SQL pool as an output. It can handle late-arriving data via its built-in 'late arrival' window (configurable up to 30 minutes) using event time processing, without requiring reprocessing of the entire stream.

Exam trap

The trap here is that candidates often confuse Azure Data Factory's 'real-time' monitoring or Azure Functions' 'event-driven' nature with true stream processing, overlooking the need for built-in windowing and late-arrival handling that only Azure Stream Analytics provides.

How to eliminate wrong answers

Option A is wrong because Azure Databricks Structured Streaming is a batch-micro-batch engine that, while capable of streaming, requires manual management of late-arriving data via watermarking and checkpointing, and does not natively integrate with dedicated SQL pool as a sink without additional complexity. Option C is wrong because Azure Data Factory is an orchestration and ETL service for batch data movement, not a real-time stream processing engine; it cannot process streaming data from Event Hubs in near real-time. Option D is wrong because Azure Functions with Event Hubs trigger processes events one at a time in a serverless compute model, which lacks the built-in windowing, aggregation, and late-arrival handling capabilities needed for near real-time stream processing at scale.

Full explanation →

559

MCQhard

A company uses Azure Synapse Analytics dedicated SQL pool for data warehousing. They notice that queries against a large fact table are slow. The table is hash-distributed on ProductID, but many queries filter on OrderDate. What should the data engineer do to improve query performance?

A.Change the distribution to round-robin

B.Replicate the table to all distributions

C.Create a columnstore index on OrderDate

D.Change the distribution to hash on OrderDate

AnswerD

Aligns distribution with filter column, minimizing data movement.

Why this answer

Option D is correct because changing the distribution key to OrderDate aligns the physical data layout with the most common query filter predicate. In a dedicated SQL pool, hash distribution distributes rows across distributions based on the hash of the distribution column. When queries filter on OrderDate, a hash on OrderDate enables partition elimination and distribution-level pruning, reducing data movement and improving scan performance.

Exam trap

The trap here is that candidates often confuse indexing (columnstore) with distribution strategy, assuming a non-clustered index on the filter column is sufficient, when in fact the distribution key must match the most frequent filter predicate to avoid full distribution scans.

How to eliminate wrong answers

Option A is wrong because round-robin distribution distributes rows evenly without any logical grouping, which forces full table scans and data shuffling for all queries, making performance worse for filtered queries. Option B is wrong because replicating a large fact table to all distributions would consume excessive storage and cause significant overhead during data loading, and is typically reserved for small dimension tables. Option C is wrong because a columnstore index on OrderDate improves compression and scan efficiency but does not address the distribution mismatch; queries would still need to scan all distributions, missing the benefit of distribution elimination.

Full explanation →

560

MCQhard

You are designing a batch processing solution using Azure Databricks. The data source is a large Parquet dataset stored in Azure Data Lake Storage Gen2 (ADLS Gen2). The processing requires joining two datasets: one with 10 billion rows and another with 1 million rows. The cluster uses Photon runtime. Which optimization should you apply to minimize shuffle?

A.Broadcast the smaller table (1 million rows) to all worker nodes.

B.Increase the cluster size to reduce shuffle overhead.

C.Create bucketed tables on the join key for both datasets.

D.Use Delta Lake and optimize file layout with OPTIMIZE command.

AnswerA

Broadcasting the smaller table avoids shuffling the large table, significantly reducing data movement.

Why this answer

Broadcasting the smaller table (1 million rows) to all worker nodes is the correct optimization because it eliminates the need for a full shuffle during the join. With Photon runtime, broadcast joins are highly efficient as they replicate the small table to each executor, allowing map-side joins that avoid costly data movement across the network. Given the 10:1 row ratio, the 1-million-row table is well within the default broadcast threshold (10 MB compressed, configurable via spark.sql.autoBroadcastJoinThreshold), making this the most effective shuffle-minimization technique.

Exam trap

The trap here is that candidates often assume increasing cluster size (Option B) is a universal performance fix, but the DP-203 exam specifically tests the understanding that shuffle reduction techniques like broadcast joins are more impactful than simply adding more nodes, especially when one dataset is small enough to fit in executor memory.

How to eliminate wrong answers

Option B is wrong because increasing cluster size does not reduce shuffle overhead; it only adds more parallelism, which can actually increase shuffle traffic and does not address the fundamental need to avoid shuffling large datasets. Option C is wrong because creating bucketed tables on the join key requires both datasets to be bucketed with the same number of buckets and a compatible bucketing scheme; while this can reduce shuffle, it involves significant upfront data reorganization and is not as immediate or lightweight as broadcasting the small table. Option D is wrong because using Delta Lake and the OPTIMIZE command improves file layout and read performance (e.g., bin-packing small files) but does not directly reduce shuffle during a join operation; shuffle reduction requires join-specific optimizations like broadcast or bucketing.

Full explanation →

561

MCQmedium

You have an Azure Databricks workspace that uses a managed resource group. The security team requires that all cluster nodes use no public IP addresses and that all outbound traffic goes through a firewall. What should you configure?

A.Configure service endpoints for Azure Storage and Azure Data Lake Storage.

B.Deploy the workspace in a VNet with forced tunneling enabled and a firewall.

C.Apply network security groups (NSGs) to the subnet that restrict outbound traffic.

D.Enable Azure Private Link for the Databricks workspace.

AnswerB

VNet injection with forced tunneling ensures cluster nodes have no public IPs and all outbound traffic goes through the firewall.

Why this answer

Option D is correct because Azure Databricks can be deployed in a VNet injected configuration with no public IPs and route all traffic through a firewall. Option A is wrong because private endpoints are for data plane services, not for cluster nodes. Option B is wrong because service endpoints do not eliminate public IPs.

Option C is wrong because network security groups control inbound/outbound rules but do not force traffic through a firewall.

Full explanation →

562

Multi-Selecthard

A data engineer is optimizing an Azure Data Lake Storage Gen2 account used for big data analytics. The account contains billions of small files (under 1 MB). The analytics jobs are slow and cost more than expected. Which THREE actions should the engineer take to improve performance and reduce costs?

Select 3 answers

A.Convert data to columnar file formats such as Parquet.

B.Move data to the cool tier to reduce storage costs.

C.Enable soft delete to protect against accidental deletion.

D.Use blob index tags to partition data logically.

E.Consolidate small files into larger files (e.g., 100 MB or more).

AnswersA, D, E

Columnar formats compress data and allow predicate pushdown, reducing I/O.

Why this answer

Option A is correct because converting data to columnar formats like Parquet reduces the amount of data read during analytics queries, as only the necessary columns are scanned. This significantly improves query performance and lowers I/O costs, especially for big data workloads on Azure Data Lake Storage Gen2.

Exam trap

The trap here is that candidates may think moving data to a cooler tier or enabling soft delete directly improves performance, when in reality these actions address cost or protection, not the root cause of slow analytics jobs caused by small file overhead.

Full explanation →

563

Multi-Selectmedium

You are designing an Azure Stream Analytics job to process real-time IoT data from thousands of devices. The job must handle late-arriving events (up to 1 hour late) and out-of-order events (up to 5 minutes). Which two temporal policies should you configure?

Select 2 answers

A.Out of order tolerance window: 5 minutes; Late arrival tolerance window: 1 hour

B.Out of order tolerance window: 1 hour; Late arrival tolerance window: 5 minutes

C.Watermark delay: 1 hour; Out of order tolerance: 5 minutes

D.Use Event Hubs capture to handle late events; no additional configuration needed

AnswersA, C

Why this answer

Azure Stream Analytics uses two distinct temporal policies to handle event timing issues. The 'Late arrival tolerance window' defines how long the system waits for events that arrive after their timestamp (up to 1 hour in this scenario), while the 'Out of order tolerance window' specifies the maximum time difference allowed for events that arrive out of sequence (up to 5 minutes). Option A correctly configures both policies to match the requirements.

Exam trap

The trap here is confusing the two tolerance windows (late arrival vs. out-of-order) or mistaking Spark-specific terminology like 'watermark delay' for Azure Stream Analytics policies.

Why the other options are wrong

B

This swaps the policies; late arrival should be larger than out-of-order.

D

Event Hubs capture is for storing raw events, not for handling out-of-order or late arrival in Stream Analytics.

Full explanation →

564

MCQhard

Your Azure Databricks workspace contains sensitive customer data. You need to ensure that only users with a specific Microsoft Entra ID role can access the workspace, and all access must be logged and monitored. You also need to audit data access at the table level. What should you implement?

A.Configure SCIM provisioning to sync the Entra ID group to Databricks and assign the group to the workspace

B.Set up IP access lists to restrict workspace access to the corporate network and enable diagnostic logs

C.Enable Unity Catalog and assign the Databricks workspace to use Entra ID as the identity provider. Configure audit logs for the workspace

D.Use Microsoft Defender XDR to monitor access to the Databricks workspace

AnswerC

Unity Catalog supports fine-grained access control and audit logging, integrated with Entra ID.

Why this answer

Option C is correct because Azure Databricks with Unity Catalog provides fine-grained access control at the table level and integrates with Entra ID for authentication, and audit logs capture access events. Option A is wrong because SCIM provisioning only syncs users, not access control. Option B is wrong because IP access lists control network access, not data access.

Option D is wrong because Microsoft Defender XDR is for security monitoring across Microsoft 365, not specifically for Databricks table-level auditing.

Full explanation →

565

MCQmedium

You are a data engineer at Northwind Traders. You have an Azure Synapse Analytics workspace with dedicated SQL pools. You need to monitor query performance to identify slow-running queries and understand resource consumption. The solution must provide historical data for the last 30 days and allow alerting when queries exceed a certain duration. You also need to export the data to a Log Analytics workspace for correlation with other metrics. What should you use?

A.Enable Diagnostic Settings for the dedicated SQL pool to send logs to Azure Storage, and query DMVs for historical data.

B.Use Azure Storage Analytics to analyze logs from the storage account backing the SQL pool.

C.Enable Diagnostic Settings to stream SQL pool metrics and logs to a Log Analytics workspace, then create alert rules and use KQL queries for historical analysis.

D.Configure SQL Server Query Store and export data to Azure Blob Storage using elastic query.

AnswerC

Provides historical data and alerting.

Why this answer

Option B is correct: Azure Monitor diagnostic settings can send Synapse SQL pool logs and metrics to Log Analytics. Then you can create alerts and run queries over historical data. Option A is incorrect: Dynamic Management Views (DMVs) only provide current state, not historical 30-day data.

Option C is incorrect: Query Store retains data for a limited time and doesn't integrate natively with Log Analytics. Option D is incorrect: Azure Storage Analytics is for storage accounts, not Synapse SQL.

Full explanation →

566

Multi-Selecteasy

Which TWO are benefits of using Azure Databricks Auto Loader for incremental data ingestion?

Select 2 answers

A.It can process new files as they arrive in cloud storage.

B.It can handle large volumes of data without manual checkpointing.

C.It automatically evolves the schema without any configuration.

D.It provides sub-second latency for real-time streaming.

E.It provides built-in deduplication of records.

AnswersA, B

Auto Loader incrementally processes new files.

Why this answer

Azure Databricks Auto Loader is designed to incrementally ingest new data files as they arrive in cloud storage (e.g., Azure Data Lake Storage Gen2 or Blob Storage) by using a notification-based or directory-listing approach. It automatically tracks which files have already been processed using a structured streaming checkpoint mechanism, eliminating the need for manual checkpoint management even at high data volumes. This makes options A and B correct because Auto Loader's core value is its ability to discover and process new files without manual intervention and to handle large-scale ingestion reliably.

Exam trap

The trap here is that candidates often confuse Auto Loader's schema inference (which is automatic on first read) with automatic schema evolution (which requires explicit configuration), and they also mistakenly assume file-based ingestion can achieve sub-second latency or provide built-in deduplication, which are not features of this service.

Full explanation →

567

Multi-Selecteasy

Which TWO Azure services can be used to perform data transformation in a data pipeline? (Select two.)

Select 2 answers

A.Azure Data Factory

B.Azure Monitor

C.Azure Storage

D.Azure Event Hubs

E.Azure Databricks

AnswersA, E

Data Factory offers mapping data flows and compute activities for transformation.

Why this answer

Azure Data Factory is a cloud-based ETL service that provides a code-free visual interface for orchestrating data movement and transformation at scale. It supports data flows, which allow you to perform transformations like aggregations, joins, and filtering without writing code, making it a correct choice for data transformation in a pipeline.

Exam trap

The trap here is that candidates often confuse data ingestion services (like Event Hubs) or storage services (like Azure Storage) with transformation services, forgetting that transformation requires compute engines like Data Factory or Databricks.

Full explanation →

568

MCQmedium

You are designing a near real-time analytics solution for a retail company. The company has a transactional database in Azure SQL Database that records sales transactions. The data must be available in Azure Synapse Analytics dedicated SQL pool for reporting with less than 15 minutes of latency. The data volume is about 10 GB per day. You need to design the data ingestion pipeline. You also need to ensure that the pipeline can handle schema changes (e.g., new columns added to the source table) without manual intervention. Which approach should you use?

A.Use Azure Databricks with Auto Loader to read from SQL Database and write to Synapse using Delta Lake.

B.Use Azure Data Factory with a change data capture (CDC) mapping data flow to read changes from SQL Database and write to Synapse dedicated SQL pool. Enable schema drift to handle new columns.

C.Use Azure Data Share to share the SQL Database tables with Synapse and schedule snapshots every 15 minutes.

D.Use Azure Synapse Pipelines with a copy activity to perform a full load every 15 minutes.

AnswerB

CDC provides incremental changes; schema drift allows automatic handling of new columns.

Why this answer

Option B is correct because Azure Data Factory's mapping data flows support Change Data Capture (CDC) to incrementally load only changed rows from Azure SQL Database, meeting the <15-minute latency requirement for 10 GB/day. The 'Enable schema drift' option in mapping data flows automatically handles new columns added to the source table without manual intervention, which is essential for schema evolution.

Exam trap

The trap here is that candidates may choose Azure Databricks with Auto Loader (Option A) because it is associated with handling schema evolution, but they overlook that Auto Loader is file-based and not designed for direct CDC from Azure SQL Database, making it unsuitable for this transactional source.

How to eliminate wrong answers

Option A is wrong because Azure Databricks Auto Loader is designed for ingesting files (e.g., from cloud storage) and does not natively connect to Azure SQL Database for CDC; it would require additional connectors and manual schema handling, adding complexity without meeting the near-real-time requirement efficiently. Option C is wrong because Azure Data Share provides snapshot-based sharing with scheduled refreshes, but it does not support CDC or schema drift; it would require full or incremental snapshots that may not handle new columns automatically and could exceed latency if schema changes occur. Option D is wrong because using a full load every 15 minutes for 10 GB/day is inefficient and could cause high resource consumption and potential timeouts; it does not handle schema changes automatically and is not a best practice for near-real-time ingestion.

Full explanation →

569

Multi-Selecthard

You are building a data processing pipeline in Azure Synapse Analytics that uses a mapping data flow to perform a lookup transformation. The lookup source is a dimension table with 10 million rows. You need to optimize the lookup performance. Which THREE actions should you take?

Select 3 answers

A.Increase the batch size in the lookup transformation settings.

B.Partition the dimension table on the lookup key before reading.

C.Enable the 'Broadcast' option on the lookup source transformation if the dimension table is less than 100 MB.

D.Select only the necessary columns in the lookup source transformation.

E.Ensure the dimension table has an index on the columns used for the lookup.

AnswersC, D, E

Broadcasting avoids shuffle for small dimension tables.

Why this answer

Options A, B, and E are correct. Option A is correct because broadcasting the lookup table (if small enough) avoids shuffling. Option B is correct because indexing the lookup columns speeds up the lookup.

Option C is wrong because increasing batch size in a lookup does not directly improve performance; it's for sink. Option D is wrong because partitioning the lookup table on the join key can help, but the question is about the data flow, not the table design. Option E is correct because narrowing the columns reduces data transfer.

Full explanation →

570

Multi-Selecthard

Which THREE components are required to implement a real-time data processing solution using Azure Stream Analytics?

Select 3 answers

A.Power BI as the output sink

B.Azure Data Factory pipeline for orchestration

C.An input source such as Azure Event Hubs or IoT Hub

D.An output sink such as Azure Synapse Analytics or Blob Storage

E.A Stream Analytics job with a defined query

AnswersC, D, E

Streaming input is required for real-time processing.

Why this answer

Option C is correct because Azure Stream Analytics requires a streaming input source to ingest real-time data. Azure Event Hubs and IoT Hub are the primary services that provide high-throughput, low-latency event ingestion, which Stream Analytics can consume via its built-in connector. Without a streaming input, the job cannot process real-time data.

Exam trap

The trap here is that candidates often assume Power BI is a required output for real-time dashboards, but Stream Analytics can function without any visualization sink, and the exam focuses on the minimal required components: input, job with query, and output sink.

Full explanation →

571

Multi-Selectmedium

Which of the following are valid methods to secure data at rest in Azure Data Lake Storage Gen2? (Choose two.)

Select 2 answers

A.Azure Storage Service Encryption (SSE) with Microsoft-managed keys

B.Azure Active Directory (Azure AD) authentication for storage accounts

C.Customer-managed keys stored in Azure Key Vault

D.Configure firewall rules to restrict IP access

AnswersA, C

Why this answer

Azure Storage Service Encryption (SSE) with Microsoft-managed keys encrypts data at rest automatically for Azure Data Lake Storage Gen2 using 256-bit AES encryption. This is enabled by default for all storage accounts, ensuring data written to disk is encrypted before being persisted, with no additional configuration required.

Exam trap

The trap here is confusing network security controls (like firewalls or Azure AD authentication) with data-at-rest encryption methods, leading candidates to select options that protect access rather than the stored data itself.

Why the other options are wrong

B

Azure AD authentication controls access, not encryption at rest.

D

Firewall rules control network access, not encryption at rest.

Full explanation →

572

MCQmedium

You are designing a data processing solution for an e-commerce company. The company receives millions of clickstream events per hour from their website and needs to aggregate the data by product category and windowed time intervals for real-time dashboards. You need to minimize latency and cost. Which service should you use?

A.Azure Databricks Structured Streaming

B.Azure Data Factory

C.Azure Stream Analytics

D.Azure Synapse Pipelines

AnswerC

Provides real-time stream processing with windowed aggregations.

Why this answer

Option B (Azure Stream Analytics) is the best choice because it is purpose-built for real-time stream processing, supports windowed aggregations, and integrates with Power BI for dashboards. Option A (Azure Synapse Pipelines) is for orchestrating data movement, not real-time processing. Option C (Azure Databricks) can handle streaming but is more complex and typically more expensive for simple aggregations.

Option D (Azure Data Factory) is for batch data movement, not real-time.

Full explanation →

573

MCQhard

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline uses a self-hosted integration runtime. You notice that the copy activity fails intermittently with the error: 'Failure happened on 'Source' side. ErrorCode=SqlOperationFailed'. The on-premises SQL Server is under heavy load during business hours. What is the most likely cause?

A.The SQL Server is experiencing resource contention or timeout due to heavy load.

B.The Azure Blob Storage account is throttling requests.

C.The authentication method to SQL Server is incorrect.

D.The self-hosted integration runtime is not connected to the network.

AnswerA

Intermittent failures often due to resource pressure.

Why this answer

The error 'SqlOperationFailed' on the source side indicates that the SQL Server itself is failing to complete the query or data extraction operation. Under heavy load, the SQL Server may experience resource contention (CPU, memory, I/O) or reach query timeout thresholds, causing the copy activity to fail intermittently. This is consistent with the described scenario of heavy load during business hours.

Exam trap

The trap here is that candidates may confuse a source-side error with a sink-side error, or assume that any intermittent failure must be a network or connectivity issue, rather than recognizing that SQL Server resource contention under heavy load is a classic cause of intermittent 'SqlOperationFailed' errors.

How to eliminate wrong answers

Option B is wrong because Azure Blob Storage throttling would produce an error on the 'Sink' side (e.g., 'StorageError' or 'BlobOperationFailed'), not on the 'Source' side. Option C is wrong because an incorrect authentication method would cause a persistent authentication failure (e.g., 'Login failed for user') on every attempt, not intermittent failures. Option D is wrong because if the self-hosted integration runtime were not connected to the network, the pipeline would fail consistently with a connectivity error (e.g., 'Unable to connect to Integration Runtime'), not an intermittent SQL operation error.

Full explanation →

574

Drag & Dropmedium

Drag and drop the steps to configure Azure Synapse Analytics serverless SQL pool to query data in Azure Data Lake Storage Gen2 into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

First, set up the storage account and data. Then create the serverless SQL pool, a database, external data source referencing the storage, and finally external file format and external table to query.

Full explanation →

575

MCQeasy

You are designing a data processing pipeline in Azure Data Factory. The pipeline must copy data from Azure Blob Storage to Azure SQL Database and transform the data using a mapping data flow. The data flow includes a Derived Column transformation. What is the purpose of the Derived Column transformation?

A.Aggregate data by grouping rows.

B.Create new columns or modify existing columns using expressions.

C.Sort data in ascending or descending order.

D.Rename or drop columns.

AnswerB

Derived Column allows expression-based column creation and modification.

Why this answer

The Derived Column transformation in Azure Data Factory mapping data flows is used to create new columns or modify existing columns by applying expressions. This allows you to perform calculations, string manipulations, or conditional logic directly within the data flow, enabling in-flight data transformation before writing to the sink.

Exam trap

The trap here is that candidates confuse the Derived Column transformation with the Select transformation, assuming it is used for renaming or dropping columns, when in fact Derived Column is specifically for creating or modifying column values via expressions.

How to eliminate wrong answers

Option A is wrong because aggregating data by grouping rows is the purpose of the Aggregate transformation, not the Derived Column transformation. Option C is wrong because sorting data is performed by the Sort transformation, which reorders rows based on column values. Option D is wrong because renaming or dropping columns is handled by the Select transformation, which allows you to include, exclude, or alias columns.

Full explanation →

576

MCQhard

You are implementing a streaming solution using Azure Stream Analytics. The input is from an IoT Hub receiving telemetry from thousands of devices. The output is to Azure Synapse Analytics dedicated SQL pool. The requirement is to compute rolling averages over a 5-minute tumbling window and write results every minute. Which windowing function and output configuration should you use?

A.Use a TumblingWindow with duration of 5 minutes and output every 5 minutes.

B.Use a SlidingWindow with duration 5 minutes and output every 1 minute.

C.Use a HoppingWindow with size 5 minutes and hop 1 minute.

D.Use a SessionWindow with timeout 5 minutes and maximum duration 10 minutes.

AnswerC

Hopping windows with a 1-minute hop produce results every minute, each covering the last 5 minutes.

Why this answer

Option C is correct because a HoppingWindow with a size of 5 minutes and a hop of 1 minute allows you to compute rolling averages over a 5-minute window while producing results every minute. This satisfies the requirement of outputting results at a higher frequency than the window duration, which is not possible with a TumblingWindow (which only outputs at the end of the window) or a SlidingWindow (which outputs on each event, not at fixed intervals).

Exam trap

The trap here is that candidates confuse the output frequency with the window duration, assuming a TumblingWindow can produce results more frequently by adjusting the duration, but only a HoppingWindow with a hop smaller than the size can achieve that.

How to eliminate wrong answers

Option A is wrong because a TumblingWindow with a duration of 5 minutes only produces output at the end of each 5-minute window, not every minute. Option B is wrong because a SlidingWindow with a duration of 5 minutes outputs results for every event (or at the end of each sliding interval), not at a fixed 1-minute interval; it also does not align with the requirement to output every minute on a schedule. Option D is wrong because a SessionWindow is designed for event-based sessions with gaps and timeouts, not for fixed-duration rolling averages with periodic output.

Full explanation →

577

MCQhard

You are optimizing a batch processing job in Azure Databricks that reads data from Azure Data Lake Storage Gen2 and writes aggregated results back. The job currently runs slowly due to high shuffle writes. You plan to use Delta Lake and optimize the table layout. Which two actions should you take to reduce shuffle writes? (Select two.)

A.Enable Delta Lake auto-optimize to coalesce small files.

B.Partition the Delta table by the most frequently used filter column.

C.Use a broadcast hash join hint for all joins.

D.Increase the number of shuffle partitions to 400.

E.Use the OPTIMIZE command with Z-Ordering on join keys.

AnswerB, E

Partitioning reduces the amount of data shuffled during queries that filter on that column.

Why this answer

Options A and D are correct. Optimizing the data layout with Z-Ordering and compaction reduces shuffle size. Option B is wrong because increasing the number of shuffle partitions can increase shuffle writes.

Option C is wrong because enabling auto-optimize helps but does not directly reduce shuffle writes. Option E is wrong because using broadcast hash join reduces shuffles if one table is small, but not generally for large tables.

Full explanation →

578

MCQeasy

You need to orchestrate a data pipeline that includes a Python script and a Data Flow in Azure Synapse Analytics. The Python script must run before the Data Flow. Which activity should you use to run the Python script?

A.Notebook activity configured to use a Python kernel

B.Web activity

C.Stored Procedure activity

D.HDInsight Hive activity

AnswerA

A Notebook activity can run Python code in Synapse Spark notebooks.

Why this answer

A Notebook activity in Azure Synapse Analytics can be configured to use a Python kernel, allowing you to run a Python script directly within the pipeline. This is the correct choice because the requirement is to execute a Python script before a Data Flow, and the Notebook activity supports Python execution natively in Synapse pipelines.

Exam trap

The trap here is that candidates may confuse a Notebook activity with a Web activity or a Stored Procedure activity, thinking they can execute arbitrary code, but only the Notebook activity supports Python execution natively in Synapse pipelines.

How to eliminate wrong answers

Option B is wrong because a Web activity calls an HTTP/S endpoint (e.g., a REST API) and cannot run a Python script directly; it is used for invoking external services, not for executing code within Synapse. Option C is wrong because a Stored Procedure activity executes SQL stored procedures in a database, which is not designed for running Python scripts. Option D is wrong because an HDInsight Hive activity runs Hive queries on an HDInsight cluster, not Python scripts; it is meant for HiveQL, not Python execution.

Full explanation →

579

MCQeasy

You are reviewing an ARM template that assigns a role. What role is being assigned, and at what scope?

A.Storage Blob Data Owner at the subscription scope

B.Storage Blob Data Reader at the resource group scope

C.Storage Blob Data Contributor at the resource group scope

D.Storage Blob Data Contributor at the storage account scope

AnswerC

The roleDefinitionId is for 'Storage Blob Data Contributor' and scope is resourceGroup().id.

Why this answer

Option A is correct because the roleDefinitionId corresponds to 'Storage Blob Data Contributor', and the scope is the resource group. Option B is wrong because the scope is not the storage account. Option C is wrong because the role is not 'Storage Blob Data Reader' (the GUID is different).

Option D is wrong because the scope is the resource group, not subscription.

Full explanation →

580

MCQhard

You are reviewing an Azure PowerShell script that sets permissions on a directory in Azure Data Lake Storage Gen2. The script sets a default ACL for a user on the path 'sales/2024/01/'. What is the effect of the -DefaultScope parameter?

A.The ACL replaces the existing access ACL on the directory.

B.The ACL is inherited by all new child items created under this directory.

C.The ACL is applied to all existing files and subdirectories recursively.

D.The ACL is applied only to files, not subdirectories.

AnswerB

Default ACLs set permissions that are inherited by new items.

Why this answer

Option C is correct because default ACLs are inherited by new child items created under that directory. Option A is wrong because default ACLs do not affect existing items. Option B is wrong because default ACLs apply to both files and directories.

Option D is wrong because default ACLs are separate from access ACLs.

Full explanation →

581

MCQeasy

You are troubleshooting an Azure Databricks job that writes data to Azure Data Lake Storage Gen2. The job fails with '403 Forbidden' error. The Databricks workspace uses a managed identity (system-assigned) for authentication. What should you verify?

A.The storage account name is correct

B.The storage account firewall is configured to allow Azure services

C.A private endpoint is configured between Databricks and the storage account

D.The managed identity has 'Storage Blob Data Contributor' role assigned to the storage account

AnswerD

RBAC role is required for write access

Why this answer

Option A is correct because the managed identity must have the 'Storage Blob Data Contributor' RBAC role on the storage account to write data. Option B is wrong because the error is 403, not 404. Option C is wrong because no service endpoint or private endpoint is required.

Option D is wrong because firewall rules would cause a different error.

Full explanation →

582

MCQmedium

Refer to the exhibit. A data engineer is reviewing a Mapping Data Flow in Azure Data Factory. The source reads 1000 rows from a CSV file. The sink writes to a Delta dataset with optimizeWrite enabled. The data flow currently has no transformations. The engineer runs the data flow and notices that the sink writes only 500 rows. What is the most likely cause?

A.The source's rowLimit setting limits the number of rows read, but the sink writes only unique rows.

B.The sink's allowSchemaDrift setting prevents writing rows with additional columns.

C.The sink's partitionBy('hash', 1) setting causes some rows to be dropped due to partitioning logic.

D.The data flow has no transformations, so the sink cannot write data.

AnswerC

If the partition column is missing or null, rows may be dropped.

Why this answer

Option A is correct because the sink script has 'partitionBy('hash', 1)' which forces the data to be repartitioned into a single partition. Since the source reads 1000 rows and the repartition may cause some rows to be dropped due to hash distribution logic in the sink. However, the key point is that the sink's partitioning overrides the source partitioning and may lead to data loss if there is a mismatch.

Actually, the correct cause is that the sink is using hash partitioning on a column that may have many nulls or the partition column is not specified, causing skewed distribution. But the most direct answer is that the sink's partitionBy('hash', 1) creates a single partition and the hash function might drop rows that hash to a non-existent partition? No, hash with 1 partition should include all rows. Another possibility: the source has rowLimit 1000, but sink's partitionBy may cause duplicates? Let's think.

The exhibit shows partitionBy('hash', 1). In ADF, when you use hash partitioning with 1 partition, all rows go to that partition, so no rows should be lost. However, the sink has allowSchemaDrift true, but no transformations.

The discrepancy might be due to the source rowLimit being applied after reading? Actually rowLimit is applied at source. The most plausible cause is that the sink uses a Delta dataset and the Delta write operation may skip rows with nulls in partition columns? Option A mentions the sink's partitioning causes some rows to be dropped. Option B mentions allowSchemaDrift causing schema mismatch; but schema is matched.

Option C mentions source rowLimit being applied before transformations; there are no transformations. Option D mentions missing transformation; not needed. The correct answer is A because in some cases, if the partition column is not in the source, the hash function might fail.

But the script shows no partition column specified. In ADF, if you use partitionBy('hash', 1) without specifying a column, it uses a random hash? Actually it requires a column. The script is incomplete.

Given the context, Option A is the best answer.

Full explanation →

583

MCQeasy

Your team is developing a real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs. The output must be written to Azure SQL Database. You need to ensure that the processing guarantees exactly-once semantics for the output. Which output configuration should you use?

A.Set output mode to 'When possible'

B.Set output mode to 'Exactly once'

C.Set output mode to 'At least once'

D.Set output mode to 'Best effort'

AnswerB

Ensures no duplicate writes to SQL Database.

Why this answer

Option C is correct because Azure Stream Analytics supports exactly-once delivery to Azure SQL Database when using the 'Exactly once' output mode. Option A is wrong because 'At least once' is the default and can cause duplicates. Option B is wrong because 'When possible' is not a valid output mode.

Option D is wrong because 'Best effort' is not a valid mode.

Full explanation →

584

MCQmedium

A data engineer is designing a monitoring solution for Azure Data Factory pipelines. They need to be alerted when a pipeline run fails or when the duration exceeds a threshold. The solution must minimize cost and operational overhead. Which approach should they use?

A.Configure Azure Event Grid to send pipeline run events to Azure Functions for alerting.

B.Use Azure Monitor metrics and activity logs to create alert rules for pipeline failures and duration.

C.Send all pipeline run logs to Log Analytics and create alert rules based on custom log searches.

D.Create an Azure Logic App that runs every minute to check pipeline run status via REST API.

AnswerB

Azure Monitor provides built-in metrics and alerts for Azure Data Factory with minimal cost.

Why this answer

Option B is correct because Azure Monitor provides native, cost-effective alerting for Azure Data Factory pipelines using metrics (e.g., pipeline run duration) and activity logs (e.g., pipeline run failures). This approach requires no additional compute or log ingestion costs, as alerts are configured directly on the resource's monitoring data, minimizing both cost and operational overhead.

Exam trap

The trap here is that candidates over-engineer the solution by choosing event-driven or log-based approaches (A, C, D) when the simplest, most cost-effective native monitoring (Azure Monitor alerts) is available, often forgetting that Data Factory emits metrics and activity logs by default without additional setup.

How to eliminate wrong answers

Option A is wrong because Azure Event Grid with Azure Functions introduces unnecessary complexity and cost (function execution time) for a scenario that can be handled natively by Azure Monitor alerts without custom code. Option C is wrong because sending all pipeline run logs to Log Analytics incurs ingestion and retention costs, and custom log search alerts are more expensive and operationally heavier than using built-in metrics and activity log alerts. Option D is wrong because running a Logic App every minute to poll the REST API creates recurring execution costs and latency, and is an inefficient polling pattern compared to the event-driven, push-based alerting provided by Azure Monitor.

Full explanation →

585

MCQhard

You are designing a data ingestion pipeline for Azure Synapse Analytics using PolyBase. The source data is in Azure Blob Storage, with files in the Parquet format. The data volume is approximately 2 TB daily. You need to minimize the time to load data into a dedicated SQL pool table. Which approach should you recommend?

A.Split the Parquet files into many small files (e.g., 64 MB each) to maximize parallelism.

B.Convert the Parquet files to CSV format before loading using PolyBase.

C.Use the COPY INTO command to load the Parquet files directly into the dedicated SQL pool.

D.Create an external table pointing to the Parquet files and use CREATE TABLE AS SELECT (CTAS) to load the data.

AnswerD

CTAS with external tables enables predicate pushdown and efficient column pruning.

Why this answer

Option B is correct because creating an external table with the Parquet file format and using CREATE TABLE AS SELECT (CTAS) allows PolyBase to leverage predicate pushdown and column pruning, which reduces data movement and improves performance. Option A is wrong because splitting files into smaller chunks increases the number of parallel reads but the overhead of many small files can degrade performance. Option C is wrong because converting to CSV adds overhead and increases data size.

Option D is wrong because COPY INTO is optimized for small to medium loads, while CTAS with external tables is better for large-scale loads.

Full explanation →

586

Multi-Selectmedium

Which TWO actions should you take to optimize query performance in Azure Synapse Analytics dedicated SQL pool when working with large fact tables?

Select 2 answers

A.Use replicated distribution for the fact table.

B.Use round-robin distribution to evenly distribute data.

C.Create statistics on columns used in WHERE clauses.

D.Use clustered index instead of columnstore index.

E.Implement table partitioning on a date column.

AnswersC, E

Statistics help the optimizer choose efficient query plans.

Why this answer

Option C is correct because creating statistics on columns used in WHERE clauses enables the Azure Synapse Analytics dedicated SQL pool query optimizer to generate more accurate cardinality estimates, leading to better join strategies and index selections. Without up-to-date statistics, the optimizer may choose suboptimal plans, especially for large fact tables where data distribution skew is common.

Exam trap

The trap here is that candidates often confuse distribution methods (replicated, round-robin, hash) with performance tuning for large fact tables, overlooking that statistics maintenance is a critical and separate optimization step that directly impacts query plan quality.

Full explanation →

587

Multi-Selectmedium

You are designing a data processing solution that must handle personally identifiable information (PII). The data will be ingested from multiple sources into Azure Data Lake Storage. Which TWO actions should you take to protect the PII data during processing?

Select 2 answers

A.Apply data masking in Power BI reports to hide PII from end users.

B.Use column-level security in Azure Synapse Analytics to restrict access to sensitive columns.

C.Configure the Azure Data Lake Storage firewall to allow only specific IP addresses.

D.Implement SQL Server Always Encrypted for all data stored in the data lake.

E.Use Azure Purview to classify and label PII data in the data lake.

AnswersB, E

Column-level security allows fine-grained access control at the database level.

Why this answer

Options A and C are correct. Column-level security in Azure Synapse restricts access to sensitive columns. Azure Purview provides data classification and lineage.

Option B (masking in Power BI) is a presentation-layer control. Option D (server-level firewall) is network security. Option E (always encrypted) is for SQL Server, not ADLS.

Full explanation →

588

Multi-Selecteasy

You need to secure access to an Azure Storage account that hosts sensitive data. Which TWO methods provide encryption in transit?

Select 2 answers

A.Enable Azure Storage Service Encryption (SSE)

B.Enforce HTTPS for REST API calls

C.Use Azure Files with SMB 3.0 and encryption

D.Enable Azure Disk Encryption on client VMs

E.Configure an IPsec policy between clients and storage

AnswersB, C

HTTPS encrypts data in transit.

Why this answer

HTTPS and SMB over encryption ensure data is encrypted in transit. Option C (SSE) is at rest. Option D (Azure Disk Encryption) is for VMs.

Option E (IPSec) is not supported by Azure Storage directly.

Full explanation →

589

MCQhard

An organization is using Azure Data Factory to ingest data from multiple on-premises SQL Server databases into Azure Synapse Analytics. They need to ensure that sensitive data is masked during ingestion before landing in the staging area. What is the best approach?

A.Apply an Azure Policy that masks sensitive data in Azure Synapse Analytics.

B.Use Azure SQL Database dynamic data masking on the source databases.

C.Use a Mapping Data Flow with derived column transformations to mask sensitive columns.

D.Use Azure Purview to classify and mask sensitive data automatically.

AnswerC

Mapping Data Flow allows you to apply transformations like mask using derived columns before writing to staging.

Why this answer

Option D is correct because Data Flow allows transformation steps including column masking. Option A is wrong because Azure SQL Database dynamic data masking is applied at query time, not during copy. Option B is wrong because Azure Purview is for governance, not data masking.

Option C is wrong because Azure Policy is for compliance, not data masking at source.

Full explanation →

590

Multi-Selectmedium

Which TWO actions should you take to optimize performance of a dedicated SQL pool in Azure Synapse Analytics when loading large volumes of data?

Select 2 answers

A.Disable index on the target table after loading.

B.Use a large batch size (e.g., 100 MB) for each copy operation.

C.Use round-robin distribution for the staging table.

D.Use a small batch size (e.g., 1 MB) for each copy operation.

E.Use clustered columnstore index on the target table during load.

AnswersB, C

Large batches reduce number of transactions.

Why this answer

Option B is correct because using a large batch size (e.g., 100 MB) for each copy operation minimizes the number of round trips and transaction commits, which significantly improves throughput when loading large volumes of data into a dedicated SQL pool. The PolyBase or COPY statement in Azure Synapse performs best when batches are large enough to leverage parallel processing and reduce overhead from frequent small writes.

Exam trap

The trap here is that candidates often confuse batch size optimization with transaction log management, mistakenly thinking smaller batches reduce log pressure, when in fact larger batches reduce overall load time and improve throughput in Azure Synapse's distributed architecture.

Full explanation →

591

Multi-Selectmedium

Which THREE actions can improve the performance of a dedicated SQL pool in Azure Synapse Analytics?

Select 3 answers

A.Use rowstore indexes instead of columnstore indexes.

B.Partition large fact tables on a date column.

C.Use round-robin distribution for all tables.

D.Use replicated tables for small dimension tables.

E.Enable result-set caching.

AnswersB, D, E

Partitioning enables partition elimination, reducing data scanned.

Why this answer

Options A, B, and E are correct. Using replicated tables avoids shuffling for small tables; partitioning large tables enables partition elimination; using result-set caching reduces load. Option C is wrong because rowstore indexes are slower for analytics; columnstore is preferred.

Option D is wrong because round-robin distribution is for large tables with no natural key, but it may cause shuffles.

Full explanation →

592

Multi-Selectmedium

You are designing a hybrid data storage architecture for a global e-commerce platform. Which two Azure services should you combine to achieve low-latency read access for users worldwide and durable archival storage for compliance?

Select 2 answers

A.Azure Cosmos DB

B.Azure Table Storage

C.Azure Blob Storage

D.Azure SQL Database

E.Azure Files

AnswersA, C

Why this answer

Azure Cosmos DB is correct because it provides globally distributed, multi-region write and read capabilities with turnkey global distribution, enabling low-latency access for users worldwide via its multi-homing API. Azure Blob Storage is correct because it offers durable, tiered archival storage (e.g., Cool, Archive access tiers) at low cost, meeting compliance requirements for long-term retention of data such as transaction logs or invoices.

Exam trap

The trap here is that candidates often confuse Azure Table Storage with Cosmos DB's Table API, assuming Table Storage supports global distribution, when in fact only Cosmos DB's Table API (a different service) provides multi-region replication and low-latency reads.

Why the other options are wrong

B

Table Storage is a single-region service with higher latency.

D

Global distribution requires complex sharding and is not as seamless as Cosmos DB.

E

Not designed for archival or global low-latency reads.

Full explanation →

593

MCQmedium

You are monitoring an Azure Data Lake Storage Gen2 account using Azure Monitor. You need to be alerted when the number of storage account requests exceeds 20,000 per hour. What is the most efficient way to set up this alert?

A.Create a Log Analytics workspace and write a KQL query to count requests.

B.Create an Activity Log alert for 'List Storage Account Keys' events.

C.Create a metric alert on the 'Transactions' metric with a threshold of 20,000 and aggregation granularity of 1 hour.

D.Use Azure Advisor to recommend scaling.

AnswerC

Metric alerts are efficient and built-in.

Why this answer

Option B is correct because the metric 'Transactions' is available at the storage account level and can be aggregated per hour. Option A is wrong because a Log Analytics query would be more complex and less efficient. Option C is wrong because Activity Log does not contain transaction counts.

Option D is wrong because Azure Advisor does not provide custom alerting on metrics.

Full explanation →

594

MCQeasy

You need to store streaming data from Azure Event Hubs into Azure Data Lake Storage Gen2 in near real-time. The data should be stored in Avro format with a folder structure: /raw/{eventhub}/{yyyy}/{MM}/{dd}/{HH}/{mm}. Which Azure service should you use to ingest the data?

A.Event Hubs Capture feature to automatically capture events to ADLS Gen2.

B.Azure Stream Analytics with a job that reads from Event Hubs and writes to ADLS Gen2.

C.Azure Data Factory with a tumbling window trigger to copy data from Event Hubs every 5 minutes.

D.Azure Databricks with Auto Loader to read from Event Hubs and write to ADLS Gen2.

AnswerB

Stream Analytics supports Avro output and custom partition path patterns.

Why this answer

Azure Stream Analytics is the correct choice because it natively supports reading from Event Hubs and writing to ADLS Gen2 with built-in time-based partitioning into the exact folder structure /raw/{eventhub}/{yyyy}/{MM}/{dd}/{HH}/{mm}. It provides near real-time processing with sub-minute latency and can output data in Avro format directly, meeting all requirements without additional code or orchestration.

Exam trap

The trap here is that candidates often choose Event Hubs Capture because it seems like a simple 'capture to storage' feature, but they overlook the requirement for near real-time per-minute partitioning, which Capture cannot achieve due to its fixed 5-minute minimum window.

How to eliminate wrong answers

Option A is wrong because Event Hubs Capture writes data in fixed 5-minute or 300 MB windows, not in near real-time per minute, and its folder structure is /{EventHub}/{Namespace}/{YYYY}/{MM}/{DD}/{HH}/{mm} but cannot dynamically include the event hub name as a folder variable in the path. Option C is wrong because Azure Data Factory with a tumbling window trigger introduces at least 5 minutes of latency and is designed for batch processing, not near real-time streaming, and cannot read from Event Hubs directly without a staging layer. Option D is wrong because Azure Databricks with Auto Loader is optimized for incremental file ingestion from cloud storage, not for streaming from Event Hubs; it would require additional structured streaming code and incurs cluster startup and runtime overhead, making it less suitable for simple near real-time ingestion.

Full explanation →

595

MCQeasy

You are running a Spark job in Azure Synapse Analytics that reads from a Delta Lake table and performs multiple transformations. The job fails with an out-of-memory error on the executors. Which action should you take first to resolve the issue?

A.Enable checkpointing to truncate the lineage.

B.Decrease the number of partitions to reduce overhead.

C.Increase the executor memory setting in the Spark configuration.

D.Use the cache() action on intermediate DataFrames.

AnswerC

Increasing executor memory provides more heap space to avoid OOM errors.

Why this answer

Option C is correct because an out-of-memory error on executors indicates that the available memory per executor is insufficient for the data being processed. Increasing the executor memory setting in the Spark configuration directly addresses this by allocating more heap space, allowing transformations to complete without spilling to disk or failing. This is the first and most straightforward action to take before optimizing partitioning or caching.

Exam trap

The trap here is that candidates often confuse memory issues with partitioning or caching optimizations, but the immediate fix for an out-of-memory error is to increase executor memory, not to reduce parallelism or persist data.

How to eliminate wrong answers

Option A is wrong because checkpointing truncates the lineage and helps with recovery and plan optimization, but it does not directly increase available memory or resolve an out-of-memory error. Option B is wrong because decreasing the number of partitions reduces parallelism and can actually increase memory pressure per partition, worsening the out-of-memory issue. Option D is wrong because using cache() persists intermediate DataFrames in memory, which consumes additional memory and can exacerbate the out-of-memory error rather than resolving it.

Full explanation →

596

MCQhard

A data engineering team uses Azure Stream Analytics to process real-time IoT data. They notice that the job's watermark delay is increasing over time, and the output is falling behind. The input is from Event Hubs with 10 partitions. The job uses a 5-minute hopping window with a 1-minute hop. What is the most likely cause?

A.The hopping window size is too large.

B.The late arrival tolerance is set too high.

C.The job is under-provisioned in terms of Streaming Units (SUs).

D.The Event Hubs partition count does not match the Stream Analytics job's parallelism.

AnswerC

Low SUs cause backpressure, increasing watermark delay.

Why this answer

The increasing watermark delay and falling behind output indicate that the Stream Analytics job cannot keep up with the input throughput. With a 5-minute hopping window (1-minute hop) processing 10 Event Hubs partitions, the job requires sufficient Streaming Units (SUs) to handle the compute load. Under-provisioned SUs cause backpressure, leading to rising watermark delay as the job struggles to process events within the window boundaries.

Exam trap

The trap here is that candidates often confuse watermark delay with configuration issues like window size or late arrival tolerance, but the progressive nature of the delay points directly to resource starvation (SU under-provisioning) rather than a static configuration problem.

How to eliminate wrong answers

Option A is wrong because the hopping window size (5 minutes with 1-minute hop) is a standard temporal window configuration and does not inherently cause watermark delay; larger windows actually reduce computational frequency. Option B is wrong because setting the late arrival tolerance too high would allow more late events to be included, potentially increasing watermark delay, but the question states the delay is increasing over time, which is a symptom of insufficient processing capacity, not a configuration that would cause progressive delay. Option D is wrong because Stream Analytics automatically handles partition alignment with Event Hubs partitions when the job's parallelism is set to 1 (default) or when using the same partition count; mismatched partition counts do not cause increasing watermark delay but may cause uneven data distribution or idle partitions.

Full explanation →

597

MCQeasy

Your team is developing a data processing solution that uses Azure Databricks to transform streaming data from Azure Event Hubs. The transformation includes joining the stream with a static reference table stored in Azure Data Lake Storage Gen2. You need to implement the join efficiently. Which approach should you use?

A.Use a watermark on both sides and perform a stream-stream join

B.Use a broadcast join with the static DataFrame loaded from Delta Lake

C.Use foreachBatch to micro-batch the stream and perform a batch join

D.Use a stream-stream join by converting the static table to a stream

AnswerB

Broadcast join avoids shuffling and is efficient for streaming-static joins.

Why this answer

Option A is correct because streaming-static joins in Spark Structured Streaming are optimized when the static data is broadcast to all nodes, avoiding shuffles. Option B is wrong because the reference table is static; streaming-static join is appropriate. Option C is wrong because joining two streams is not needed.

Option D is wrong because foreachBatch with a batch join adds complexity.

Full explanation →

598

MCQmedium

You are building a data processing solution in Azure Synapse Analytics. The solution requires creating a table that stores sales transactions. The table will be used for both point-of-sale lookups and large aggregation queries. The data is not updated frequently. Which table distribution should you recommend?

A.ROUND_ROBIN

B.HASH on TransactionID

C.HASH on SalesDate

D.REPLICATE

AnswerA

ROUND_ROBIN evenly distributes data and works well for mixed workloads.

Why this answer

ROUND_ROBIN is the correct choice because the table is used for both point-of-sale lookups (single-row queries) and large aggregation queries, and the data is not updated frequently. ROUND_ROBIN distributes data evenly across all distributions without a hash key, which provides the best overall performance for mixed workloads where no single distribution key optimizes both lookup and aggregation patterns. It avoids data skew and allows parallel processing for aggregations while still supporting efficient lookups when combined with appropriate indexes.

Exam trap

The trap here is that candidates often choose HASH distribution thinking it always improves query performance, but they overlook that without a clear join or grouping column that matches the hash key, HASH can cause data skew and actually degrade mixed workload performance compared to ROUND_ROBIN.

How to eliminate wrong answers

Option B (HASH on TransactionID) is wrong because hashing on a high-cardinality column like TransactionID would distribute data evenly but would not optimize large aggregation queries that typically group by date or region, and it would not improve point-of-sale lookups unless the lookup filter includes TransactionID. Option C (HASH on SalesDate) is wrong because hashing on a date column can cause severe data skew if most transactions occur on a few dates, leading to uneven distribution and poor query performance; it also does not optimize point-of-sale lookups that rarely filter by date alone. Option D (REPLICATE) is wrong because replication is designed for small dimension tables (typically < 2 GB) and would be impractical for a large sales transaction table, causing excessive storage and maintenance overhead.

Full explanation →

599

MCQeasy

You need to ensure that an Azure Data Factory pipeline can copy data from an Azure SQL Database that is behind a private endpoint. The Data Factory should use a managed virtual network. What should you configure?

A.Install a self-hosted integration runtime on a VM in the same virtual network.

B.Use the default Azure integration runtime.

C.Enable managed virtual network for the Data Factory and create a managed private endpoint for the SQL Database.

D.Use Azure Bastion to connect the Data Factory to the SQL Database.

AnswerC

Managed private endpoints enable secure connectivity over private network.

Why this answer

Option C is correct because a managed private endpoint in the Data Factory's managed virtual network allows secure access to the SQL Database's private endpoint. Option A is wrong because the integration runtime must be in the same virtual network. Option B is wrong because self-hosted IR is for on-premises data sources.

Option D is wrong because Azure Bastion is for VM access, not data factory.

Full explanation →

600

MCQmedium

A company uses Azure Synapse Analytics to process large datasets. They need to transform JSON data stored in Azure Data Lake Storage Gen2 into a star schema. Which data processing approach minimizes data movement and leverages the compute closest to the data?

A.Use Azure Data Factory to copy the JSON data into Azure SQL Database, then use T-SQL to transform.

B.Use Azure Data Factory with SSIS to transform and load into dedicated SQL pool.

C.Load data into a Spark DataFrame in Synapse notebooks, transform, and write back.

D.Create external tables on the JSON files using PolyBase, then use CREATE EXTERNAL TABLE AS SELECT (CETAS) to write transformed Parquet files.

AnswerD

Minimizes movement by querying in place.

Why this answer

Option D is correct because it uses PolyBase external tables and CETAS to transform JSON data directly in Azure Data Lake Storage Gen2, minimizing data movement by leveraging the compute power of the dedicated SQL pool or serverless SQL pool closest to the data. This approach reads JSON in place, transforms it into Parquet format, and writes the star schema tables back to the data lake without copying data to an intermediate store.

Exam trap

The trap here is that candidates often assume Spark notebooks (Option C) are always the best for JSON transformation, but PolyBase with CETAS is more efficient for minimizing data movement because it processes data in-place using SQL compute without loading entire datasets into memory.

How to eliminate wrong answers

Option A is wrong because it copies JSON data into Azure SQL Database first, incurring unnecessary data movement and network transfer, and uses T-SQL in a separate compute environment rather than leveraging compute closest to the data lake. Option B is wrong because it uses SSIS, which requires an Azure-SSIS Integration Runtime and moves data through a separate orchestration layer, adding latency and cost without utilizing Synapse-native processing. Option C is wrong because while Spark DataFrames in Synapse notebooks can process JSON, they require spinning up a Spark pool and loading data into memory, which involves more data movement and overhead compared to the serverless or dedicated SQL pool PolyBase approach that processes data directly in the storage layer.

Full explanation →

Page 8 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →