Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 676–750

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 10 of 12

676

MCQmedium

You have an Azure Data Factory pipeline that executes a stored procedure in Azure SQL Database. The pipeline fails with an error indicating that the stored procedure ran out of memory. What change should you make to the pipeline to resolve this?

A.Add a retry policy to the stored procedure activity.

B.Increase the pipeline activity timeout.

C.Use a Self-Hosted Integration Runtime instead of Azure IR.

D.Scale up the Azure SQL Database to a higher service tier.

AnswerD

Higher service tiers provide more memory for the database.

Why this answer

The error indicates that the stored procedure ran out of memory, which is a resource limitation at the database level, not a transient failure or timeout issue. Scaling up the Azure SQL Database to a higher service tier (e.g., from Standard to Premium or increasing DTU/vCore count) provides more memory and compute resources, directly resolving the out-of-memory condition.

Exam trap

The trap here is that candidates confuse pipeline-level retries or timeouts with database-level resource constraints, assuming that retrying or waiting longer will fix a memory exhaustion error, which is a hard resource limit that requires scaling the database.

How to eliminate wrong answers

Option A is wrong because a retry policy only re-executes the activity on transient failures (e.g., network blips), but an out-of-memory error is a persistent resource constraint that will recur on retry. Option B is wrong because increasing the pipeline activity timeout extends the duration the pipeline waits for completion, but does not address the underlying memory shortage in the database. Option C is wrong because using a Self-Hosted Integration Runtime shifts data movement or activity execution to an on-premises or VM-based runtime, but does not affect the memory allocation of the Azure SQL Database where the stored procedure runs.

Full explanation →

677

MCQhard

You are optimizing an Azure Synapse serverless SQL pool query that queries Parquet files in Azure Data Lake Storage. The query takes longer than expected. You notice that the query reads more data than necessary. What is the most effective way to reduce the amount of data scanned?

A.Split large Parquet files into smaller files of 100 MB each

B.Create external tables with explicit schema and partition by a frequently filtered column

C.Use SELECT with column pruning to only retrieve necessary columns

D.Increase the query's resource allocation by using a larger service level objective

AnswerB

External tables with partition elimination allow serverless SQL to skip entire partitions when filters are applied, reducing data scanned.

Why this answer

Option B is correct because partitioning external tables in Azure Synapse serverless SQL allows the query engine to perform partition elimination, reading only the subdirectories that match the filter criteria. This directly reduces the amount of data scanned from Parquet files in ADLS, addressing the core issue of reading unnecessary data.

Exam trap

The trap here is that candidates often confuse column pruning (reducing columns) with partition pruning (reducing rows), or assume that file size optimization alone reduces data volume, when in fact partition elimination is the key technique for minimizing scanned data in serverless SQL pools.

How to eliminate wrong answers

Option A is wrong because splitting large Parquet files into smaller files does not reduce the total data scanned; it may even increase overhead due to more file open operations. Option C is wrong because column pruning reduces the columns read, not the rows; the query still scans all partitions and files, so it does not address reading more data than necessary when filtering is involved. Option D is wrong because increasing the service level objective (SLO) allocates more resources but does not change the amount of data scanned; it only speeds up the scan, which is not the most effective way to reduce data volume.

Full explanation →

678

MCQeasy

You are developing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query Parquet files stored in Azure Data Lake Storage Gen2. Which authentication method should you use to ensure that the queries use the identity of the caller and adhere to Azure role-based access control (RBAC) permissions?

A.Microsoft Entra ID pass-through authentication.

B.Storage account key.

C.Shared access signature (SAS) token.

D.Service principal with a secret.

AnswerA

Microsoft Entra ID pass-through uses the caller's identity and enforces RBAC permissions.

Why this answer

Microsoft Entra ID pass-through authentication (option A) is correct because it allows the serverless SQL pool to use the caller's identity when accessing Azure Data Lake Storage Gen2. This ensures that Azure RBAC permissions (e.g., Storage Blob Data Reader) assigned to the user are evaluated for each query, providing fine-grained access control without exposing storage account keys or tokens.

Exam trap

The trap here is that candidates often confuse 'service principal' (a fixed identity) with 'user identity' and select option D, not realizing that pass-through authentication is the only method that preserves the caller's individual RBAC permissions.

How to eliminate wrong answers

Option B (Storage account key) is wrong because it uses a shared secret that grants full administrative access to the storage account, bypassing RBAC and the caller's identity entirely. Option C (Shared access signature token) is wrong because it delegates access based on a pre-signed URI with fixed permissions and expiry, not the caller's identity, and does not enforce RBAC. Option D (Service principal with a secret) is wrong because it authenticates as a fixed application identity rather than the individual caller, so RBAC permissions are evaluated against the service principal, not the user who submitted the query.

Full explanation →

679

MCQmedium

Your organization uses Azure Synapse Analytics dedicated SQL pool to store sales data. You need to design a data loading process for a nightly batch that inserts new rows and updates existing rows based on the business key. The table has a clustered columnstore index. Which approach minimizes table fragmentation?

A.Use UPDATE for existing rows and INSERT for new rows.

B.Use DELETE and INSERT statements in a single transaction.

C.Use a MERGE statement to perform upserts.

D.Create a staging table, load data, then use CTAS and partition switching to replace the target partition.

AnswerD

CTAS rebuilds the partition, minimizing fragmentation.

Why this answer

Option C is correct because using staging table with CREATE TABLE AS SELECT (CTAS) and then switching the partition swaps out the entire partition, avoiding individual row modifications that fragment columnstore. Option A is wrong because DELETE/INSERT on large batches causes fragmentation. Option B is wrong because MERGE on columnstore causes fragmentation.

Option D is wrong because UPDATE on columnstore causes fragmentation.

Full explanation →

680

MCQhard

You are troubleshooting a Synapse Pipeline that runs a Copy activity from an on-premises SQL Server to Azure Synapse Dedicated SQL Pool. The pipeline fails with the error: 'Failure happened on 'Source' side. ErrorCode=SqlOperationFailed.' The on-premises SQL Server has no firewall restrictions. What is the most likely cause?

A.Staging is not enabled for the Copy activity.

B.The destination table in Synapse has a different schema.

C.The SQL Server credentials in the linked service are incorrect.

D.The self-hosted integration runtime is not configured properly.

AnswerD

A self-hosted integration runtime is required to connect from Azure to on-premises networks; misconfiguration is a common cause of source-side failures.

Why this answer

The error indicates a failure on the source side. Since the on-premises SQL Server has no firewall restrictions, the most likely cause is that the self-hosted integration runtime is not installed or configured correctly, preventing connectivity from the cloud to the on-premises network. Option A is plausible but less likely given the error message.

Option C is not a typical cause for a source-side error. Option D is about staging, which is not mentioned.

Full explanation →

681

Multi-Selecteasy

Which TWO options are valid methods to load data from on-premises SQL Server into Azure Synapse Analytics?

Select 2 answers

A.SQL Server Integration Services (SSIS) package

B.Azure Data Factory with incremental copy

C.PolyBase from external table

AnswersB, C

Supports incremental loading.

Why this answer

Azure Data Factory with incremental copy is a valid method because it can connect to on-premises SQL Server via self-hosted integration runtime, and then use change tracking or watermark columns to copy only new or updated rows into Azure Synapse Analytics. This approach minimizes data transfer and supports scheduled, reliable loading.

Exam trap

The trap here is that candidates may think SSIS is the only traditional ETL tool, but Azure Data Factory's incremental copy and PolyBase are both first-class, scalable methods for loading into Synapse, and SSIS is also valid but not listed as correct in this specific question's answer set.

Full explanation →

682

Multi-Selectmedium

You are a data engineer for a company that uses Azure Synapse Analytics dedicated SQL pool. You need to implement security best practices to protect sensitive data. Which TWO actions should you take? (Choose two.)

Select 2 answers

A.Configure a firewall rule to allow only specific IP addresses.

B.Enable Azure Storage encryption for the underlying storage.

C.Enable Transparent Data Encryption (TDE) on the dedicated SQL pool.

D.Use Dynamic Data Masking to obfuscate sensitive data from all users.

E.Implement column-level security to restrict access to sensitive columns.

AnswersC, E

Correct: TDE encrypts the database at rest, protecting data files from unauthorized access.

Why this answer

Options A and C are correct. A: Column-level security allows restricting access to sensitive columns. C: Transparent Data Encryption (TDE) encrypts data at rest.

Option B is wrong because Dynamic Data Masking obfuscates data but does not prevent access at the storage level. Option D is wrong because firewall rules are not a best practice for protecting sensitive data within the database. Option E is wrong because Azure Storage encryption is not applicable to Synapse SQL pool data.

Full explanation →

683

Multi-Selectmedium

Which TWO actions should you take when monitoring Azure Data Lake Storage Gen2 to detect security threats?

Select 2 answers

A.Use Azure Security Center and Azure Defender for Storage.

B.Enable diagnostic settings for the storage account and send logs to Azure Sentinel.

C.Enable soft delete for blobs to recover from accidental deletions.

D.Configure firewall and virtual network service endpoints.

E.Set up alerting on the 'Transactions' metric.

AnswersA, B

Azure Defender provides threat detection and alerts for storage accounts.

Why this answer

Azure Security Center (now Microsoft Defender for Cloud) with Azure Defender for Storage provides built-in threat detection for Azure Data Lake Storage Gen2, including anomaly detection, malware scanning, and alerts for suspicious activities like unauthorized access or data exfiltration. This is a primary action for detecting security threats because it continuously monitors storage telemetry and applies machine learning to identify potential security incidents.

Exam trap

The trap here is that candidates often confuse data protection features (like soft delete) or network controls (like firewalls) with active threat detection, overlooking that only dedicated security monitoring tools (Azure Security Center/Defender and Sentinel) can identify and alert on security threats in real time.

Full explanation →

684

MCQeasy

You are processing streaming data from IoT devices using Azure Stream Analytics. The data includes temperature readings and device IDs. You need to calculate the average temperature per device over a 5-minute window, sliding every 1 minute. Which window function should you use?

A.Hop window

B.Session window

C.Sliding window

D.Tumbling window

AnswerA

Hop windows overlap and advance every hop interval.

Why this answer

A Hop window in Azure Stream Analytics allows you to specify a window size (5 minutes) and a hop size (1 minute), creating overlapping windows that slide forward every minute. This matches the requirement to calculate the average temperature per device over a 5-minute period, recalculated every minute, as the hop window outputs results at each hop interval while retaining data across overlapping windows.

Exam trap

The trap here is that candidates confuse 'sliding' with 'hopping' — a Sliding window in Stream Analytics is event-driven and does not produce periodic outputs, whereas a Hop window is time-driven and explicitly supports overlapping fixed-size windows with a hop interval.

How to eliminate wrong answers

Option B is wrong because a Session window groups events based on inactivity gaps (session timeout), not fixed time intervals, and would not produce consistent 5-minute windows sliding every 1 minute. Option C is wrong because a Sliding window in Stream Analytics outputs results only when an event occurs (e.g., for each new event), not at fixed time intervals, and does not support a predefined hop size. Option D is wrong because a Tumbling window is a series of fixed-size, non-overlapping contiguous time windows (e.g., every 5 minutes), which cannot produce overlapping windows that slide every 1 minute.

Full explanation →

685

MCQeasy

You are developing a real-time data processing solution using Azure Stream Analytics. The input is an Azure Event Hubs stream with JSON data containing a 'timestamp' field. You need to output the average temperature per device every minute using a tumbling window. Which query should you use?

A.SELECT DeviceId, AVG(Temperature) AS AvgTemp FROM Input TIMESTAMP BY Timestamp GROUP BY DeviceId, SlidingWindow(minute, 1)

B.SELECT DeviceId, AVG(Temperature) AS AvgTemp FROM Input TIMESTAMP BY Timestamp GROUP BY DeviceId, TumblingWindow(minute, 1)

C.SELECT DeviceId, AVG(Temperature) AS AvgTemp FROM Input TIMESTAMP BY Timestamp GROUP BY DeviceId, SessionWindow(minute, 1, 1)

D.SELECT DeviceId, AVG(Temperature) AS AvgTemp FROM Input TIMESTAMP BY Timestamp GROUP BY DeviceId, HopWindow(minute, 1, 1)

AnswerB

Tumbling window of 1 minute produces non-overlapping windows.

Why this answer

Option B is correct because a tumbling window is a fixed, non-overlapping time window that groups events into distinct time segments. Using `TumblingWindow(minute, 1)` with `TIMESTAMP BY Timestamp` ensures that the average temperature per device is computed over each one-minute interval without overlap, which matches the requirement of 'every minute'.

Exam trap

The trap here is that candidates confuse `SlidingWindow` or `HopWindow` with `TumblingWindow`, not realizing that only `TumblingWindow` produces non-overlapping, fixed-interval outputs required for a simple per-minute average.

How to eliminate wrong answers

Option A is wrong because `SlidingWindow` produces a continuous output for every event, not fixed intervals, and would not give a single average per minute. Option C is wrong because `SessionWindow` groups events based on inactivity gaps, not fixed time boundaries, and would not produce a consistent per-minute result. Option D is wrong because `HopWindow` creates overlapping windows with a hop size smaller than the window size, leading to multiple outputs per minute and not a single non-overlapping aggregation.

Full explanation →

686

MCQhard

You are designing a data processing solution using Azure Databricks with Delta Lake. The data is ingested from multiple sources and needs to be deduplicated based on a composite key (source_id, record_id). New data may have duplicates within the same batch. Which write mode and table property should you use to handle this efficiently?

A.Use 'append' mode and perform a MERGE operation after write to deduplicate.

B.Use 'overwrite' mode and enable 'delta.autoOptimize.optimizeWrite' = true.

C.Use 'ignore' mode and set 'delta.autoCompact' = true.

D.Use 'error' mode and enable 'delta.merge.onSchemaMismatch' = true.

AnswerA

Append with a subsequent MERGE allows custom dedup logic on composite key.

Why this answer

Option A is correct because using 'append' mode writes all incoming data as new files, and then performing a MERGE operation (upsert) based on the composite key (source_id, record_id) allows you to efficiently deduplicate both within the batch and against existing data. This approach leverages Delta Lake's ACID transactions and avoids the cost of rewriting entire partitions, making it suitable for handling duplicates from multiple sources.

Exam trap

Microsoft often tests the misconception that 'overwrite' mode or table properties like 'autoOptimize' can handle deduplication, but the correct approach requires explicit deduplication logic (like MERGE) because Delta Lake does not enforce unique constraints natively.

How to eliminate wrong answers

Option B is wrong because 'overwrite' mode replaces the entire table or partition, which is inefficient for deduplication and would lose existing data not in the current batch; enabling 'delta.autoOptimize.optimizeWrite' only improves file layout, not deduplication logic. Option C is wrong because 'ignore' mode silently skips writes that would cause a duplicate based on the Delta table's schema or constraints, but Delta Lake does not enforce unique constraints natively, so duplicates would still be written; 'delta.autoCompact' only merges small files, not deduplicates. Option D is wrong because 'error' mode fails the write if any data conflicts (e.g., schema mismatch), which is not a deduplication strategy; 'delta.merge.onSchemaMismatch' is not a valid Delta Lake table property—the correct property for schema evolution is 'delta.autoMerge.enabled' or 'mergeSchema' in the DataFrame writer option.

Full explanation →

687

Multi-Selecthard

You are designing a real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs, and the output is to Azure Synapse Analytics. The solution must guarantee exactly-once delivery to Synapse. Which THREE configurations are required? (Choose three.)

Select 3 answers

A.Configure the output to use batch mode for writing.

B.Define a watermark strategy in the query to handle late-arriving events.

C.Set the late arrival tolerance window to zero.

D.Use a job with a unique identifier column in the output to enable deduplication.

E.Ensure the output table in Synapse has a primary key to support upsert operations.

AnswersB, D, E

Ensures correct windowing.

Why this answer

Options A, C, and E are correct. Exactly-once semantics require a unique identifier for deduplication, a watermark to handle late events, and output to a table with a primary key for upsert. Option B is incorrect because late arrival tolerance is not required for exactly-once.

Option D is incorrect because batch mode is not supported for exactly-once output to Synapse.

Full explanation →

688

MCQmedium

Refer to the exhibit. You have created an external table in Azure Synapse serverless SQL pool as shown. You run a query: SELECT ProductID, SUM(Amount) FROM dbo.ExternalSales WHERE SaleDate > '2024-01-01' GROUP BY ProductID. The query is slow and scans all files in the /sales/ folder, which contains data from 2023 and 2024. The files are partitioned by year and month in the folder structure, e.g., /sales/year=2023/month=01/. What should you do to improve query performance?

A.Recreate the external table with a partition definition on SaleDate column using the folder structure

B.Recreate the external table with a partition on ProductID

C.Create statistics on the SaleDate column

D.Change the file format to CSV to improve read performance

AnswerA

By defining partitions using the folder structure, serverless SQL can skip partitions that don't match the filter.

Why this answer

Option A is correct because the query performance is slow due to full file scanning. By recreating the external table with a partition definition on the SaleDate column that maps to the folder structure (e.g., /sales/year=2023/month=01/), Azure Synapse serverless SQL pool can perform partition elimination, reading only the relevant partitions for the WHERE clause filter (SaleDate > '2024-01-01'). This drastically reduces the amount of data scanned, improving query speed.

Exam trap

The trap here is that candidates often confuse creating statistics (which helps cardinality estimation but not data skipping) with partition elimination (which physically reduces data scanned), or they assume any column partition will work without matching the folder structure.

How to eliminate wrong answers

Option B is wrong because partitioning on ProductID does not align with the folder structure (which is partitioned by year and month), so it would not enable partition elimination for the date filter; it would still scan all files. Option C is wrong because creating statistics on SaleDate helps the query optimizer estimate cardinality but does not reduce the amount of data scanned; the query would still read all files without partition pruning. Option D is wrong because CSV files are typically slower to read than Parquet due to lack of compression and columnar storage; changing to CSV would worsen performance, not improve it.

Full explanation →

689

MCQeasy

You need to configure encryption for an Azure SQL Database to protect data at rest. Which Azure service or feature should you enable?

A.Dynamic Data Masking

B.Always Encrypted

C.Azure Information Protection

D.Transparent Data Encryption (TDE)

AnswerD

TDE encrypts the database at rest automatically.

Why this answer

Transparent Data Encryption (TDE) is the correct choice because it performs real-time I/O encryption and decryption of the data and log files at rest, protecting against unauthorized access to the physical storage media. TDE uses an AES-256 encryption algorithm and is fully transparent to the application, requiring no changes to the database schema or queries.

Exam trap

The trap here is that candidates often confuse Dynamic Data Masking (DDM) with encryption, thinking it protects data at rest, when in fact it only masks output and does not encrypt the underlying storage.

How to eliminate wrong answers

Option A is wrong because Dynamic Data Masking (DDM) is a data masking feature that obfuscates sensitive data in query results to unauthorized users, but it does not encrypt data at rest. Option B is wrong because Always Encrypted is a client-side encryption technology that protects sensitive data in transit and at rest by encrypting columns with keys stored on the client, but it is not a database-level encryption for all data at rest and requires application changes. Option C is wrong because Azure Information Protection (AIP) is a classification and labeling service for documents and emails, not a database encryption feature for Azure SQL Database.

Full explanation →

690

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must prevent unauthorized access to data at rest and in transit. Which combination of features should you implement?

A.Enable Transparent Data Encryption (TDE) and enforce TLS 1.2.

B.Use Azure RBAC and firewall rules.

C.Use Always Encrypted and column-level security.

D.Store encryption keys in Azure Key Vault and enable double encryption.

AnswerA

TDE encrypts data at rest, and TLS 1.2 encrypts data in transit.

Why this answer

Option B is correct because TDE encrypts data at rest and TLS 1.2 encrypts data in transit. Option A is wrong because Azure RBAC controls access but does not encrypt data. Option C is wrong because Always Encrypted is for client-side encryption, not for all data at rest.

Option D is wrong because Azure Key Vault stores keys but does not encrypt data itself.

Full explanation →

691

MCQhard

Refer to the exhibit. You submit a Spark job in Azure Synapse Analytics using the Azure CLI. The job runs slowly during the shuffle phase. The input data is about 200 GB. Which configuration change would best improve performance for this shuffle-heavy workload?

A.Increase the number of executors to 4.

B.Change executor size to 'Large' to increase memory per executor.

C.Increase spark.sql.shuffle.partitions to 800.

D.Decrease spark.sql.shuffle.partitions to 200 to reduce overhead.

AnswerC

More partitions reduce the size of each partition, speeding up shuffle.

Why this answer

Option B is correct because for a 200 GB input, 400 shuffle partitions may be too few, causing large partitions and slow shuffles. Increasing partitions to a higher number (e.g., 800 or more) can improve parallelism. Option A is wrong because executor size 'Small' might be insufficient; increasing executor size could help but the question asks about configuration change, and partition count is more directly related to shuffle performance.

Option C is wrong because decreasing partitions would make partitions larger. Option D is wrong because executor count of 2 is low; increasing would help but the partition count is the key issue for shuffle.

Full explanation →

692

MCQhard

You deploy the Azure Security Center automation shown in the exhibit. What is the purpose of this automation?

A.It configures Azure Monitor to log high-severity alerts.

B.It applies an Azure Policy to remediate high-severity alerts.

C.It sends high-severity security alerts to an Event Hub for further processing.

D.It creates incidents in Azure Sentinel for high-severity alerts.

AnswerC

The action type is EventHub, and source severity is High.

Why this answer

Option B is correct because Security Center automations trigger actions (like sending to Event Hub) when specific security alerts or recommendations occur. Option A (Azure Sentinel) is incorrect because the automation is not creating incidents; it's forwarding alerts. Option C (Azure Monitor) is not involved.

Option D (Azure Policy) is for compliance, not alerting.

Full explanation →

693

MCQhard

Your company uses Azure Synapse Analytics dedicated SQL pool for a data warehouse. You notice that queries on a large fact table are slow. The table is hash-distributed on CustomerID and has 60 distributions. After reviewing the query plan, you see that many queries filter on OrderDate. You want to improve query performance without redesigning the table. What should you do?

A.Change the distribution key to OrderDate

B.Recreate the table with round-robin distribution

C.Update statistics only on the OrderDate column

D.Create a clustered columnstore index on the table and order by OrderDate

AnswerD

Clustered columnstore index with ordering by OrderDate accelerates range queries.

Why this answer

Option D is correct because creating a clustered columnstore index with ORDER BY OrderDate on the fact table improves query performance by physically ordering the data by OrderDate within each columnstore segment. This allows segment elimination during queries that filter on OrderDate, reducing the amount of data scanned. Since the table is already hash-distributed on CustomerID, this approach does not require redesigning the table and directly addresses the slow queries without changing the distribution strategy.

Exam trap

The trap here is that candidates often assume changing the distribution key (Option A) is the only way to optimize filter-heavy queries, but they overlook that a clustered columnstore index with ORDER BY can provide similar I/O reduction without the risks of data skew or redesign.

How to eliminate wrong answers

Option A is wrong because changing the distribution key to OrderDate would cause data skew if OrderDate has many duplicate values (e.g., all rows for a single date), leading to uneven distribution and poor parallel query performance. Option B is wrong because round-robin distribution distributes rows evenly but eliminates data locality for joins and aggregations, which would degrade performance for queries that join on CustomerID or other keys. Option C is wrong because updating statistics on OrderDate alone does not reduce I/O; statistics help the optimizer choose better plans but do not physically reorganize data to minimize data scanned for filter predicates.

Full explanation →

694

Matchingmedium

Match each Azure Synapse Analytics component to its function.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Distributed query engine for relational data

Apache Spark runtime for big data processing

Data integration and orchestration

Web-based IDE for developing analytics solutions

Why these pairings

These components are central to Synapse Analytics.

Full explanation →

695

MCQmedium

Your team uses Azure Databricks to process data stored in Azure Data Lake Storage Gen2. You need to ensure that only authorized users can access the data and that access is audited. What should you implement?

A.Use Azure RBAC with Microsoft Entra ID and enable diagnostic settings for Azure Storage

B.Use Azure managed identities to access the storage account

C.Use shared access signatures (SAS) with stored access policies

D.Configure the storage account firewall to allow only Azure Databricks IP addresses

AnswerA

D is correct because RBAC provides fine-grained access control and diagnostic logs enable auditing.

Why this answer

Azure RBAC with Microsoft Entra ID (formerly Azure AD) provides fine-grained access control to Azure Data Lake Storage Gen2, ensuring only authorized users can access data. Enabling diagnostic settings for Azure Storage captures read, write, and delete operations, which are sent to a Log Analytics workspace, storage account, or event hub for auditing. This combination meets both the authorization and auditing requirements specified in the question.

Exam trap

The trap here is that candidates often confuse authentication (managed identities) with authorization and auditing, or they assume network-level controls (firewall) are sufficient for access control and auditing, when in fact Azure RBAC with diagnostic settings is required for user-level authorization and audit trails.

How to eliminate wrong answers

Option B is wrong because Azure managed identities authenticate Azure services (like Databricks) to Azure resources without storing credentials, but they do not provide granular user-level authorization or auditing of individual user actions. Option C is wrong because shared access signatures (SAS) with stored access policies grant time-limited, delegated access to storage resources but do not provide per-user auditing; SAS tokens are typically used for service-to-service access, not for auditing individual user activity. Option D is wrong because configuring the storage account firewall to allow only Azure Databricks IP addresses restricts network access but does not authorize specific users or provide audit logs of data access events.

Full explanation →

696

Multi-Selecteasy

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a dedicated SQL pool to support both batch and near-real-time data ingestion. Which TWO of the following methods can you use to ingest data into a dedicated SQL pool? (Select TWO.)

Select 2 answers

A.CREATE TABLE AS SELECT (CTAS) from external tables.

B.PolyBase with T-SQL commands.

C.COPY INTO command.

D.Azure Logic Apps with SQL connector.

E.Azure Data Factory with a copy activity using native sink.

AnswersA, B

CTAS can load data into a dedicated SQL pool from external tables.

Why this answer

CREATE TABLE AS SELECT (CTAS) from external tables is correct because it allows you to load data from external storage (e.g., Azure Blob Storage or Azure Data Lake Storage) into a dedicated SQL pool in a single, parallelized operation. This method leverages the MPP (Massively Parallel Processing) architecture of Synapse SQL pools to efficiently ingest large volumes of data for batch processing.

Exam trap

The trap here is that candidates often confuse the COPY INTO command (valid only for serverless SQL pools) with PolyBase or CTAS, or mistakenly think Azure Data Factory's native sink can directly write to a dedicated SQL pool without PolyBase or staging.

Full explanation →

697

MCQeasy

Refer to the exhibit. You run the Kusto query in Azure Monitor Logs to analyze Data Factory pipeline runs. What is the purpose of this query?

A.List all pipeline runs regardless of status

B.Identify pipelines with the most failed activity runs per hour

C.Calculate the average duration of failed pipeline runs

D.Show the number of failed trigger runs per hour

AnswerB

The query counts failed runs per pipeline per hour and sorts descending.

Why this answer

The query filters failed activity runs, groups by pipeline and hourly bins, then sorts by count descending to show pipelines with most failures. Option B is wrong because it's not about duration. Option C is wrong because it's not about all runs.

Option D is wrong because it's not about triggers.

Full explanation →

698

MCQeasy

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query returns fewer rows than expected. What should you check first?

A.Ensure the external table has the correct schema definition.

B.Check that the Azure AD identity has read permissions on the storage account.

C.Check the compression codec used in the Parquet files.

D.Verify the file path and pattern in the OPENROWSET query.

AnswerD

Incorrect file path or pattern can cause missing files or partitions.

Why this answer

Option D is correct because when using OPENROWSET in Azure Synapse serverless SQL pool to query Parquet files, the most common reason for fewer rows than expected is an incorrect file path or pattern. If the path or pattern is too restrictive (e.g., missing a wildcard or pointing to a subfolder instead of the root), the query will only read a subset of the files, resulting in fewer rows. This is the first thing to verify before investigating schema or permissions issues.

Exam trap

The trap here is that candidates often jump to schema or permission issues first, but the most frequent cause of missing rows in serverless SQL pool queries is an overly restrictive file path or pattern in the OPENROWSET query.

How to eliminate wrong answers

Option A is wrong because an incorrect schema definition would typically cause data type conversion errors or NULL values, not a reduction in row count; the query would still read all rows but might fail to parse them. Option B is wrong because if the Azure AD identity lacked read permissions, the query would fail entirely with an authorization error, not return fewer rows. Option C is wrong because the compression codec (e.g., snappy, gzip) does not affect the number of rows returned; Parquet files are self-describing and the serverless SQL pool automatically handles decompression regardless of codec.

Full explanation →

699

MCQmedium

Match the Azure service to its primary data processing use case. Drag each service on the left to the correct use case on the right. Services: Azure Databricks, Azure Stream Analytics, Azure Data Factory, Azure Synapse Analytics Use Cases: - Real-time event processing - Orchestration of ETL pipelines - Big data analytics with Spark - Enterprise data warehousing

A.Azure Databricks - Big data analytics with Spark

B.Azure Stream Analytics - Real-time event processing

C.Azure Data Factory - Orchestration of ETL pipelines

D.Azure Synapse Analytics - Enterprise data warehousing

Why this answer

Azure Databricks is used for big data analytics with Spark. Azure Stream Analytics is for real-time event processing. Azure Data Factory is for orchestrating ETL pipelines.

Azure Synapse Analytics is for enterprise data warehousing.

Exam trap

Candidates might confuse Azure Databricks with Azure Synapse Analytics because both can run Spark. However, Databricks is primarily a collaborative Spark environment, while Synapse is a data warehouse with integrated analytics.

Why the other options are wrong

A

Correct pair

B

Correct pair

C

Correct pair

D

Correct pair

Full explanation →

700

MCQhard

You are running a data transformation pipeline in Azure Synapse Spark that writes output to Delta tables. You notice that the job eventually slows down and then fails with an out-of-memory error. The input data size is 1 TB, and the cluster has 10 nodes with 16 GB memory each. What is the most likely cause?

A.The driver node does not have enough memory to collect the results

B.The data is not partitioned properly, leading to large partitions that exceed executor memory

C.The Delta table is being written in non-optimized format causing memory pressure

D.The transformation involves a wide dependency causing excessive shuffle

AnswerB

Unpartitioned data can result in a few large partitions that cause OOM. Increasing parallelism or repartitioning can help.

Why this answer

The most likely cause is that the data is not partitioned properly, leading to large partitions that exceed executor memory. In Azure Synapse Spark, each executor has a limited memory (16 GB per node in this cluster), and if a single partition is too large to fit in memory, the task processing that partition will fail with an out-of-memory error. Proper partitioning ensures that data is evenly distributed across executors, preventing any single partition from overwhelming available memory.

Exam trap

The trap here is that candidates often confuse out-of-memory errors with driver-side collection (Option A) or shuffle-related issues (Option D), but the specific context of writing to Delta tables points to executor memory exhaustion from oversized partitions, not driver memory or shuffle overhead.

How to eliminate wrong answers

Option A is wrong because the driver node collects results only for actions like `collect()` or `show()`, but writing to Delta tables does not require collecting results to the driver; the failure is on executor tasks, not the driver. Option C is wrong because Delta tables are inherently optimized (using Parquet format with ACID transactions), and writing in non-optimized format is not a concept; memory pressure is caused by partition size, not the table format. Option D is wrong because while wide dependencies (e.g., groupBy, join) can cause excessive shuffle, the question specifically describes a slowdown and out-of-memory error during writing, which is more directly tied to partition size rather than shuffle overhead.

Full explanation →

701

Multi-Selectmedium

Which TWO of the following are valid ways to handle late-arriving data in a streaming solution with Azure Stream Analytics? (Choose two.)

Select 2 answers

A.Reprocess the entire stream from the beginning when late data is detected.

B.Implement a custom Azure Function as a 'LateDataHandler' in the query.

C.Use a reference data input to store late-arriving events.

D.Configure the 'late arrival tolerance' window in the event ordering settings up to 21 days.

E.Use a temporal join to combine the late-arriving event with the historical window.

AnswersD, E

Stream Analytics allows setting a late arrival tolerance window to handle events that arrive after the event time.

Why this answer

Option D is correct because Azure Stream Analytics allows you to configure a 'late arrival tolerance' window in the event ordering settings, which can be set up to a maximum of 21 days. This window defines how long the service will wait to accommodate events that arrive after their timestamp, reordering them within that tolerance before processing. Option E is correct because a temporal join (e.g., using LATERAL or JOIN with DATEDIFF) can combine a late-arriving event with historical data from a reference or stream window, enabling you to retroactively correct aggregations or state.

Exam trap

The trap here is that candidates confuse the 'late arrival tolerance' with a simple delay setting, not realizing it is a reordering buffer up to 21 days, and they overlook temporal joins as a valid pattern for handling late data, instead assuming only external functions or full reprocessing are options.

Full explanation →

702

MCQmedium

You have an Azure Stream Analytics job that reads from an Event Hub and writes to Azure SQL Database. The job processes high-velocity IoT sensor data. You notice that the output to SQL Database is slower than expected and the job's watermark delay is increasing. What should you do to improve throughput?

A.Partition the output by a column like DeviceId.

B.Disable late arrival and out-of-order event handling.

C.Increase the Streaming Units (SU) of the job.

D.Decrease the window size in the query.

AnswerA

Partitioning allows parallel writes to SQL.

Why this answer

Partitioning the output by a column like DeviceId allows Azure Stream Analytics to write to multiple SQL Database tables or use partitioned tables, enabling parallel writes. This reduces contention and improves throughput because the job can distribute the load across multiple write operations, directly addressing the bottleneck caused by high-velocity IoT sensor data overwhelming a single output stream.

Exam trap

The trap here is that candidates often assume increasing compute resources (Streaming Units) always solves performance issues, but they overlook that the bottleneck is frequently at the output sink, requiring architectural changes like partitioning rather than scaling.

How to eliminate wrong answers

Option B is wrong because disabling late arrival and out-of-order event handling does not improve output throughput; it only changes how events are timestamped and may cause data loss or inaccuracies without addressing the write bottleneck. Option C is wrong because increasing Streaming Units (SU) allocates more compute resources to the job, but if the bottleneck is at the SQL Database output (e.g., write limits or lack of partitioning), adding SUs will not improve throughput and may even increase backpressure. Option D is wrong because decreasing the window size in the query reduces the amount of data aggregated per window, but it does not affect the rate at which output rows are written to SQL Database; the bottleneck remains at the output sink.

Full explanation →

703

MCQmedium

You are configuring security for an Azure Synapse Analytics workspace. You need to ensure that only users in the 'DataScientists' Microsoft Entra group can read data from the 'sales' schema in the serverless SQL pool. What should you configure?

A.Create a server-level login for the group and assign it to the 'public' role

B.Create a database user mapped to the Microsoft Entra group and grant SELECT ON SCHEMA::sales to the group

C.Assign the 'Synapse SQL Administrator' role to the group at workspace level

D.Create a contained database user with password and assign it to the 'db_datareader' role

AnswerB

Granular permissions at schema level

Why this answer

Option C is correct because you create a database user mapped to the Microsoft Entra group, then grant SELECT on the schema to that user. Option A is wrong because RBAC at workspace level is too broad. Option B is wrong because it grants permissions to all users.

Option D is wrong because user mapping alone does not grant permissions.

Full explanation →

704

MCQhard

Your company uses Azure Data Lake Storage Gen2 with hierarchical namespace enabled. You need to optimize costs for a large dataset that is accessed only once a month for reporting. The data must be retained for 7 years. Which storage tier and lifecycle management rule should you configure?

A.Hot tier with no lifecycle policy

B.Cool tier with lifecycle policy to Archive after 30 days

C.Premium tier with lifecycle policy to Cool after 30 days

D.Archive tier with no lifecycle policy

AnswerB

Cool tier balances cost and access; Archive after 30 days reduces cost further.

Why this answer

Option C is correct because Cool storage is cost-effective for infrequent access, and a lifecycle policy moving to Archive after 30 days minimizes costs while retaining data for 7 years. Option A (Hot) is expensive for infrequent access. Option B (Premium) is for high throughput.

Option D (Archive immediately) incurs high retrieval costs for monthly access.

Full explanation →

705

MCQhard

You are responsible for securing an Azure Synapse Analytics workspace. The workspace is integrated with a Git repository for source control. You need to ensure that only authorized users can publish changes from the Git branch to the live Synapse service. What should you configure?

A.Assign the 'Synapse Artifact Publisher' role to specific users in the Synapse RBAC.

B.Use Microsoft Entra ID Conditional Access policies to require multi-factor authentication for publishing.

C.Create an Azure Policy to deny publishing if the user is not in an approved group.

D.Configure branch permissions in the Git repository to restrict who can merge to the collaboration branch.

AnswerA

The Synapse Artifact Publisher role explicitly grants permission to publish artifacts to the live service.

Why this answer

Option B is correct because Synapse RBAC roles like 'Synapse Artifact Publisher' control who can publish changes. Option A is wrong because Git permissions control access to the repository, not publishing to Synapse. Option C is wrong because Azure Policy enforces compliance rules, not access control.

Option D is wrong because Microsoft Entra ID is for identity, not specific publish permissions.

Full explanation →

706

MCQeasy

You need to transform data in Azure Databricks using Apache Spark. The data is stored in Delta Lake format in Azure Data Lake Storage Gen2. Which method should you use to read the data into a Spark DataFrame?

A.spark.read.parquet('abfss://container@storage.dfs.core.windows.net/path')

B.spark.read.format('delta').load('abfss://container@storage.dfs.core.windows.net/path')

C.spark.read.csv('abfss://container@storage.dfs.core.windows.net/path')

D.spark.read.json('abfss://container@storage.dfs.core.windows.net/path')

AnswerB

Delta format correctly reads the table including transaction log.

Why this answer

Option B is correct because the data is stored in Delta Lake format, which requires using the 'delta' format reader in Spark to properly read the transaction log and schema. The `spark.read.format('delta').load()` method is the standard way to read Delta tables, leveraging the Delta Lake protocol for ACID transactions and time travel capabilities.

Exam trap

The trap here is that candidates may assume Delta Lake files are just Parquet files and use `spark.read.parquet()`, missing the critical role of the Delta transaction log for consistency and ACID compliance.

How to eliminate wrong answers

Option A is wrong because `spark.read.parquet()` reads only Parquet files and ignores Delta Lake's transaction log, leading to stale or inconsistent data. Option C is wrong because `spark.read.csv()` is for CSV files, not Delta Lake format. Option D is wrong because `spark.read.json()` is for JSON files, not Delta Lake format.

Full explanation →

707

Multi-Selectmedium

Which TWO actions can you take to optimize query performance in Azure Synapse Analytics dedicated SQL pool?

Select 2 answers

A.Use hash distribution on a low-cardinality column

B.Use round-robin distribution for fact tables

C.Use replicated tables for small dimension tables

D.Create materialized views for common aggregations

E.Increase the DWU setting after every query

AnswersC, D

Replicated tables eliminate data shuffling for joins with fact tables.

Why this answer

Correct: A and D. Using replicated tables avoids data movement for small dimension tables, and materialized views pre-compute aggregations. B is wrong because hash distribution on a high-cardinality column is generally good.

C is wrong because increasing DWU improves performance but is not a design optimization; it's a scaling action. E is wrong because round-robin distribution is not optimal for star schemas.

Full explanation →

708

Drag & Dropmedium

Drag and drop the steps to set up Azure Data Lake Storage Gen2 hierarchical namespace for a data lake into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

The storage account must have hierarchical namespace enabled. Then create a container, directories, set permissions, and upload data.

Full explanation →

709

MCQeasy

You are using Azure Data Factory to copy data from an Azure SQL Database to Azure Synapse dedicated SQL pool. The copy activity uses PolyBase as the copy method. The activity fails with the error 'Operation not supported: PolyBase cannot write to a table with clustered columnstore index'. What should you do to resolve this error?

A.Use an external table as the sink instead of a regular table

B.Create the target table as a heap or with clustered index

C.Enable staging with blob storage and use 'Allow PolyBase'

D.Change the copy method from PolyBase to Bulk Insert

AnswerB

PolyBase does not support writing to a clustered columnstore index directly. The table must be a heap or have a clustered index.

Why this answer

PolyBase in Azure Data Factory cannot write directly to a table that has a clustered columnstore index (CCI). The sink table must be a heap or have a clustered index for PolyBase to work. Option B correctly identifies this requirement, as creating the target table as a heap or with a clustered index resolves the error.

Exam trap

The trap here is that candidates often assume staging with blob storage (Option C) or switching to Bulk Insert (Option D) are the only workarounds, but the question specifically tests the PolyBase requirement that the sink table must not have a clustered columnstore index.

How to eliminate wrong answers

Option A is wrong because using an external table as the sink would require additional setup and is not a direct fix for the PolyBase CCI limitation; PolyBase can write to external tables, but the error specifically occurs when writing to a regular table with CCI. Option C is wrong because enabling staging with blob storage and 'Allow PolyBase' is used for staging-based PolyBase loads, but it does not bypass the requirement that the final sink table must not have a CCI. Option D is wrong because changing the copy method from PolyBase to Bulk Insert would work but is not the optimal or recommended fix; the question asks what should be done to resolve the error, and the correct approach is to adjust the table schema to support PolyBase, not to abandon PolyBase entirely.

Full explanation →

710

MCQhard

You are designing a data processing solution using Azure Databricks with Delta Lake. You need to ensure ACID transactions and schema enforcement. Which feature should you enable?

A.Auto Loader

B.Delta Lake format

C.Photon engine

D.Unity Catalog

AnswerB

Delta Lake provides ACID transactions, schema enforcement, and time travel.

Why this answer

Delta Lake is the correct choice because it provides ACID transactions (atomicity, consistency, isolation, durability) and schema enforcement (schema-on-write) on top of cloud storage like Azure Data Lake Storage. These features are inherent to the Delta Lake format, which uses a transaction log to track changes and enforce data integrity, making it ideal for reliable data processing in Azure Databricks.

Exam trap

Microsoft often tests the distinction between features that provide data governance (Unity Catalog) versus features that provide data reliability at the storage layer (Delta Lake), leading candidates to confuse Unity Catalog's metadata management with Delta Lake's transactional guarantees.

How to eliminate wrong answers

Option A is wrong because Auto Loader is a feature for incrementally ingesting new files from cloud storage, not for providing ACID transactions or schema enforcement. Option C is wrong because the Photon engine is a high-performance vectorized query engine that accelerates query execution but does not manage ACID transactions or schema constraints. Option D is wrong because Unity Catalog is a centralized metadata and governance layer for managing data assets, permissions, and lineage, but it does not directly enforce ACID transactions or schema enforcement at the table level.

Full explanation →

711

MCQmedium

Refer to the exhibit. You have an Azure Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL Database. The copy activity uses a preCopyScript to truncate the destination table before writing. During a recent run, the copy activity failed after the truncation, leaving the destination table empty. You need to prevent data loss in future failures. What should you modify?

A.Add a retry policy to the copy activity.

B.Enable fault tolerance in the copy activity source.

C.Use a staging table and then a stored procedure to swap tables.

D.Remove the preCopyScript and set writeBatchSize to 0.

AnswerC

Staging table ensures atomicity: copy to staging, then swap with destination.

Why this answer

Option C is correct because using a staging table prevents data loss: copy to staging, then swap. Option A is wrong because the fault tolerance setting only handles errors but does not roll back truncation. Option B is wrong because turning off truncation would not ensure atomicity.

Option D is wrong because retry would re-execute truncation and fail again.

Full explanation →

712

MCQhard

A company uses Azure Stream Analytics to process real-time data from IoT devices. They need to ensure that the output to Azure Synapse Analytics is optimized for high throughput and low latency. What should they configure in the Stream Analytics job?

A.Use Azure SQL Database output instead of Azure Synapse Analytics.

B.Partition the output by a key and use a columnstore index in the target table.

C.Use a single partition for the output to simplify processing.

D.Disable batching to reduce latency.

AnswerB

Partitioning parallelizes writes and columnstore indexes are optimized for analytics.

Why this answer

Option C is correct because partitioning the output and using a compatible table schema (e.g., columnstore index) improves write performance. Option A is wrong because the input partition scheme may not align with the output. Option B is wrong because batching can increase latency.

Option D is wrong because Azure SQL Database output has different tuning options; the question specifies Azure Synapse Analytics.

Full explanation →

713

MCQhard

You are working with a Delta Lake table in Azure Databricks. The table is updated frequently with new data and occasionally with updates to existing rows. You need to optimize read performance for queries that filter on a specific date column. The table is partitioned by date. Which optimization technique should you apply?

A.Run OPTIMIZE on the table.

B.Run ZORDER BY on the date column.

C.Run ANALYZE STATISTICS on the table.

D.Run VACUUM to clean up old versions.

AnswerB

Z-ordering co-locates column data, improving data skipping for filters on that column.

Why this answer

Option B is correct because Z-ordering reduces the amount of data scanned for filter columns. Option A is wrong because OPTIMIZE only compacts small files, not reorder data for filtering. Option C is wrong because VACUUM removes old files, not optimizes read.

Option D is wrong because ANALYZE STATISTICS computes stats but does not physically reorganize data.

Full explanation →

714

Multi-Selectmedium

Which TWO components are required to set up a streaming data pipeline using Azure Synapse Analytics? (Select two.)

Select 2 answers

A.Azure Data Factory

B.Azure Event Hubs

C.Azure Analysis Services

D.Azure Blob Storage

E.Azure Synapse Pipelines (or Spark)

AnswersB, E

Event Hubs ingests streaming events.

Why this answer

Option A is correct because Event Hubs is a common ingestion source for streaming. Option C is correct because Azure Synapse Pipelines (or the underlying Spark engine) can process streaming data. Option B is wrong because Data Factory is for batch.

Option D is wrong because Analysis Services is for OLAP. Option E is wrong because Blob Storage is a destination, not a required component.

Full explanation →

715

MCQhard

You are reviewing an ARM template for Azure SQL Database security alert policy. Based on the exhibit, which threats will trigger an alert?

A.All alerts except SQL Injection and Access Anomaly

B.No alerts will be triggered because the policy is disabled

C.SQL Injection and Access Anomaly

D.SQL Injection Vulnerability and Data Exfiltration

AnswerD

These alerts are not listed as disabled, so they are enabled.

Why this answer

Option B is correct because the disabledAlerts list includes 'Sql_Injection' and 'Access_Anomaly', so those are suppressed. The remaining enabled alerts (e.g., SQL Injection Vulnerability, Data Exfiltration, Unsafe Action) will trigger alerts. Option A is wrong because SQL Injection is disabled.

Option C is wrong because Access Anomaly is disabled. Option D is wrong because the policy is enabled, so some alerts are active.

Full explanation →

716

MCQeasy

A data processing job in Azure Synapse Analytics writes results to a table in the dedicated SQL pool. After a failure, the job restarts from the beginning, causing duplicates. Which design pattern should you implement to ensure idempotent writes?

A.Use a TRUNCATE statement before each insert.

B.Use a MERGE statement with a unique key to upsert data.

C.Use a staging table and then swap partitions with the target table.

D.Use CREATE TABLE AS SELECT (CTAS) with a unique constraint.

AnswerC

Atomic swap ensures idempotency.

Why this answer

Option C is correct because using a staging table with partition swapping ensures idempotent writes by atomically replacing the target partition with a fully loaded staging partition. This avoids duplicates even if the job restarts, as the swap operation is transactional and the staging table can be truncated before each run. In Azure Synapse dedicated SQL pool, partition switching is a metadata-only operation that provides consistency without data movement.

Exam trap

The trap here is that candidates often choose MERGE (Option B) thinking it inherently provides idempotency, but they overlook that MERGE in Synapse dedicated SQL pool is not atomic across retries and can still cause duplicates if the job fails after partial execution, whereas partition switching provides true atomic replacement.

How to eliminate wrong answers

Option A is wrong because TRUNCATE before each insert would remove all existing data in the table, which is destructive and not suitable for incremental or partial loads; it also does not handle concurrent access or partial failures gracefully. Option B is wrong because MERGE with a unique key can still produce duplicates if the job restarts mid-operation (e.g., after inserts but before updates), and MERGE in Synapse dedicated SQL pool is not fully atomic for large-scale upserts due to potential deadlocks and transaction log overhead. Option D is wrong because CREATE TABLE AS SELECT (CTAS) with a unique constraint does not prevent duplicates on restart—CTAS creates a new table each time, and the unique constraint only enforces uniqueness within that single execution, not across retries; additionally, CTAS does not provide a mechanism to swap or replace existing data atomically.

Full explanation →

717

Multi-Selecthard

A company ingests streaming data from multiple sources into Azure Event Hubs. The data must be stored in Azure Data Lake Storage Gen2 in Parquet format, partitioned by date and hour. The solution must minimize cost and processing latency. Which THREE actions should be taken?

Select 2 answers

A.Enable Event Hubs Capture to automatically write data to Data Lake Storage in Avro format.

B.Use Azure Data Factory to copy data from Event Hubs to Data Lake Storage every 5 minutes.

C.Use Azure Stream Analytics to read from Event Hubs and write to Data Lake Storage in Parquet format.

D.Configure Stream Analytics output to partition by date and hour.

E.Use Azure Databricks to process the stream and write to Data Lake Storage.

AnswersC, D

Stream Analytics can transform and output Parquet to Data Lake.

Why this answer

Option C is correct because Azure Stream Analytics can read directly from Event Hubs and write to Azure Data Lake Storage Gen2 in Parquet format, which is a columnar format optimized for analytics and compression. This approach minimizes cost and processing latency by enabling real-time, continuous processing without the need for intermediate storage or batch orchestration, meeting the requirements for low-latency ingestion.

Exam trap

The trap here is that candidates often confuse Event Hubs Capture (which writes Avro) with the ability to write Parquet directly, or they overcomplicate the solution by choosing batch tools like Data Factory or heavy compute like Databricks when a simpler, cost-effective streaming service like Stream Analytics meets all requirements.

Full explanation →

718

Multi-Selecthard

You are optimizing an Azure Synapse Analytics pipeline that uses mapping data flows. The pipeline runs slowly when processing 100 GB of data. Which THREE settings should you adjust to improve performance?

Select 3 answers

A.Set 'partition option' to 'Round robin' with a higher number of partitions.

B.Set a tumbling window trigger to run the pipeline every 5 minutes.

C.Increase the 'Compute type' to 'Memory Optimized' and increase the number of cores.

D.Enable 'Data flow debug' to monitor execution details.

E.Use 'Optimize shuffle' in the data flow settings.

AnswersA, C, E

Round robin partitioning distributes data evenly across partitions, improving parallelism.

Why this answer

Options A, C, and D are correct. Partitioning improves parallelism. Data flow debug is for development only.

Optimize shuffle reduces data movement. Compute type and core count affect resource allocation. Option B (debug) is for testing.

Option E (tumbling window) is for streaming.

Full explanation →

719

Multi-Selecthard

Which THREE metrics from Azure Monitor should you use to evaluate the performance of an Azure Data Lake Storage Gen2 account?

Select 3 answers

A.Ingress

B.CPU Usage

C.Success E2E Latency

D.Available Storage Capacity

E.Blob Count

AnswersA, C, E

Measures incoming throughput.

Why this answer

Options B, C, and D are correct. Success E2E Latency measures end-to-end latency, Blob Count tracks object count, and Ingress measures throughput. Option A is wrong because Available Storage Capacity is not a metric for storage accounts.

Option E is wrong because CPU usage is for compute, not storage.

Full explanation →

720

MCQmedium

A company uses Azure Synapse Analytics dedicated SQL pool. They notice that queries against a large fact table are running slower over time. The table is hash-distributed on a date key and has a clustered columnstore index. Which action should you take to improve query performance?

A.Add a non-clustered index on frequently filtered columns.

B.Change the distribution column to a column with higher cardinality.

C.Change the distribution to round-robin.

D.Rebuild the clustered columnstore index.

AnswerD

Rebuilding the columnstore index improves compression, removes deleted rows, and reorganizes rowgroups, enhancing scan performance.

Why this answer

Over time, columnstore indexes can become fragmented due to insert, update, and delete operations, leading to compressed row groups that are not optimally sized or have deleted records. Rebuilding the clustered columnstore index reorganizes the data into fully compressed row groups, removes deleted rows, and restores the high compression and segment elimination that columnstore indexes rely on for fast query performance.

Exam trap

The trap here is that candidates may assume performance degradation is always due to data skew or distribution choice, overlooking the common real-world issue of columnstore index fragmentation from ongoing DML operations.

How to eliminate wrong answers

Option A is wrong because adding a non-clustered index on frequently filtered columns would introduce additional index maintenance overhead and is unlikely to outperform the existing columnstore index for large fact tables; columnstore indexes already excel at scanning and filtering large datasets. Option B is wrong because changing the distribution column to one with higher cardinality does not address the root cause of performance degradation over time, which is index fragmentation, not data skew or distribution inefficiency. Option C is wrong because changing the distribution to round-robin would eliminate data locality for joins and aggregations, likely worsening query performance, and does not resolve the fragmentation issue.

Full explanation →

721

MCQhard

A company uses Azure Synapse Analytics serverless SQL pool to query data in ADLS Gen2. Users report that queries against Parquet files are slow. What should you recommend to improve query performance?

A.Create external tables with statistics on relevant columns.

B.Create clustered columnstore indexes on the external tables.

C.Convert the Parquet files to CSV format for faster reads.

D.Partition the data into many small files.

AnswerA

Statistics help the optimizer prune data and improve query performance.

Why this answer

In Azure Synapse serverless SQL pool, external tables do not automatically have statistics. Without statistics, the query optimizer cannot generate efficient execution plans, leading to poor performance on Parquet files. Creating statistics on relevant columns enables the optimizer to estimate cardinality and choose better join and filter strategies, significantly improving query speed.

Exam trap

The trap here is that candidates confuse external table capabilities with dedicated SQL pool features, assuming that indexes like columnstore can be applied to external tables, or that file format changes (CSV) or file count adjustments are the primary performance levers, when in fact statistics are the critical missing piece for serverless SQL pool optimization.

How to eliminate wrong answers

Option B is wrong because clustered columnstore indexes are not supported on external tables in serverless SQL pool; they are only applicable to tables in dedicated SQL pools. Option C is wrong because CSV format is slower than Parquet for analytical queries due to lack of compression, columnar storage, and predicate pushdown; converting to CSV would degrade performance. Option D is wrong because partitioning data into many small files increases metadata overhead and file open operations, which slows down queries in serverless SQL pool; optimal performance is achieved with a moderate number of reasonably sized files.

Full explanation →

722

MCQeasy

You have an Azure Stream Analytics job that writes output to Azure Synapse Analytics. You need to ensure that the job can authenticate to Synapse Analytics using a managed identity. What should you do?

A.Enable system-assigned managed identity on the Stream Analytics job and configure the output to use it.

B.Generate a shared access signature (SAS) token for the Synapse Analytics workspace.

C.Create a user-assigned managed identity and assign it to the Stream Analytics job.

D.Configure the output to use SQL Server authentication with a username and password.

AnswerA

This is the correct method to use managed identity for authentication.

Why this answer

Option C is correct because Stream Analytics jobs can use managed identity to authenticate to Azure Synapse Analytics. Option A is wrong because connection strings with SQL authentication are not managed identity. Option B is wrong because system-assigned managed identity is enabled automatically, but the output configuration must be set to use it.

Option D is wrong because a user-assigned managed identity is not required; system-assigned can be used.

Full explanation →

723

Multi-Selectmedium

Which TWO Azure Blob Storage access tiers are suitable for data that must be available within milliseconds but is accessed infrequently (e.g., once per month)?

Select 2 answers

A.Cool

B.Hot

C.Premium

D.Cold

E.Archive

AnswersA, B

Low latency, lower storage cost, designed for infrequent access.

Why this answer

The Cool tier is designed for data that is accessed infrequently (e.g., once per month) but still requires low latency (milliseconds) on access. It offers a lower storage cost than Hot but with a higher access cost, making it ideal for infrequent reads where immediate availability is needed.

Exam trap

The trap here is that candidates often confuse 'infrequent access' with 'cold storage' and select Cold or Archive, forgetting that Cool is the correct tier for infrequent access with millisecond latency requirements.

Full explanation →

724

Matchingmedium

Match each data transformation concept to its definition.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Handling flexible columns that change over time

Timestamp to track incremental data processing

Optimization to read only relevant partitions

Merge insert and update operations into a single action

Why these pairings

These concepts are important for data transformation in Azure.

Full explanation →

725

MCQmedium

You have an Azure Data Lake Storage Gen2 account that stores parquet files. You need to ensure that files containing personally identifiable information (PII) are automatically classified and tagged. Which Azure service should you integrate?

A.Azure Policy

B.Microsoft Sentinel

C.Microsoft Defender for Cloud

D.Microsoft Purview

AnswerD

Purview provides automated scanning and classification of PII.

Why this answer

Option B is correct because Microsoft Purview provides automated data classification and labeling for Azure Storage. Option A is wrong because Microsoft Sentinel is a SIEM, not for classification. Option C is wrong because Microsoft Defender for Cloud is for security posture, not data classification.

Option D is wrong because Azure Policy enforces rules but does not classify content.

Full explanation →

726

MCQmedium

You are designing a solution to store large amounts of log data that is written once and accessed rarely. The data must be retained for 7 years for compliance. After 30 days, the data should be moved to a lower-cost storage tier. After 1 year, the data should be archived. Which Azure Storage lifecycle management policy should you implement for an Azure Data Lake Storage Gen2 account?

A.Transition to cool tier after 30 days; delete after 7 years.

B.Transition to cool tier after 30 days; transition to archive tier after 365 days; delete after 2555 days (7 years).

C.Transition to archive tier after 30 days; delete after 7 years.

D.Transition to cool tier after 30 days; transition to cool tier again after 365 days.

AnswerB

Matches the requirements exactly.

Why this answer

Option B is correct because it aligns with the specified lifecycle requirements: transition to cool tier after 30 days for cost savings, transition to archive tier after 365 days for long-term retention, and delete after 2555 days (7 years) for compliance. Azure Data Lake Storage Gen2 supports lifecycle management policies that automate tier transitions and deletion based on age, ensuring data is moved to lower-cost storage as access patterns change.

Exam trap

The trap here is that candidates may confuse the required tiering order (cool then archive) with direct archiving after 30 days (Option C) or fail to include a deletion rule (Option D), missing the 7-year compliance requirement.

How to eliminate wrong answers

Option A is wrong because it deletes the data after 7 years but does not include a transition to the archive tier after 1 year, which is required by the compliance policy to archive data after 365 days. Option C is wrong because it transitions to archive tier after only 30 days, which violates the requirement to keep data in a lower-cost tier (cool) for the first year before archiving. Option D is wrong because it transitions to cool tier again after 365 days, which does not archive the data as required, and it lacks a deletion rule for the 7-year retention period.

Full explanation →

727

Multi-Selecthard

You are designing a data lake architecture using Azure Data Lake Storage Gen2. You need to optimize query performance for Azure Synapse Analytics serverless SQL. Which three design considerations should you follow? (Choose three.)

Select 3 answers

A.Store data in Parquet format

B.Partition files by date to enable partition elimination

C.Compress files using snappy or gzip

D.Use many small files (under 64 MB) to increase parallelism

E.Store data in nested folder structures for better organization

AnswersA, B, C

Why this answer

Parquet is a columnar storage format that reduces I/O by reading only the columns needed for a query, which significantly improves performance in Azure Synapse serverless SQL. It also supports efficient compression and encoding schemes, making it ideal for analytical workloads on Azure Data Lake Storage Gen2.

Exam trap

The trap here is that candidates often confuse file size optimization with parallelism, assuming smaller files increase parallelism, but in serverless SQL, too many small files cause excessive metadata requests and reduce throughput, while larger files enable better batch processing.

Why the other options are wrong

D

Small files cause overhead; larger files (128 MB+) are recommended.

E

Deeply nested folders increase file listing time, impacting performance.

Full explanation →

728

Multi-Selecteasy

Which TWO configurations are recommended to secure data processing in Azure Synapse Pipelines?

Select 2 answers

A.Configure a self-hosted integration runtime on a public cloud VM.

B.Use the default Auto-resolve Integration Runtime for all data flows.

C.Store connection strings and secrets in Azure Key Vault and reference them via linked services.

D.Enable Managed Virtual Network (VNet) to isolate data flows.

E.Allow all public IP addresses to access the Azure Synapse workspace.

AnswersC, D

Key Vault centralizes secret management.

Why this answer

Option B is correct: Managed VNet ensures data integration runs in a secure network boundary. Option D is correct: Azure Key Vault integration stores secrets securely. Option A is incorrect: Public endpoint access reduces security.

Option C is incorrect: Auto-resolve integration runtime may use public endpoints. Option E is incorrect: Self-hosted IR is for on-premises, not for cloud security.

Full explanation →

729

MCQeasy

Your organization uses Azure SQL Database with Active Geo-Replication for disaster recovery. You need to ensure that all connections to the database use Microsoft Entra ID authentication and that access is audited. You also want to minimize the attack surface by disabling SQL authentication. What should you do?

A.Configure Conditional Access policies to require MFA for database access.

B.Enable 'Azure AD-only authentication' in the Azure SQL Database server settings and remove all SQL Server authenticated logins.

C.Create a server-level firewall rule to allow only specific IP addresses and enable SQL authentication.

D.Create an Azure RBAC role to restrict access to the database and assign it to users.

AnswerB

Disables SQL authentication and enforces Entra ID.

Why this answer

Option C is correct: Set 'Azure AD-only authentication' to 'True' in the Azure SQL Database server properties. This disables SQL authentication and enforces Entra ID. Option A is incorrect: Azure RBAC controls management plane, not database access.

Option B is incorrect: SQL authentication is still enabled. Option D is incorrect: Conditional Access policies work with Entra ID but don't disable SQL authentication.

Full explanation →

730

MCQeasy

A data engineer needs to store semi-structured JSON logs from multiple sources in Azure. The logs must be queryable using T-SQL and support schema-on-read. Which Azure service should be used?

A.Azure Synapse serverless SQL pool with JSON files in ADLS Gen2.

B.Azure Data Factory mapping data flows.

C.Azure Cosmos DB Core (SQL) API.

D.Azure SQL Database with JSON columns.

AnswerA

Serverless SQL pool can query JSON files with schema-on-read using T-SQL.

Why this answer

Azure Synapse serverless SQL pool can query JSON files stored in ADLS Gen2 using T-SQL, supporting schema-on-read by inferring the schema from the file content at query time. This makes it ideal for semi-structured logs that need to be queried without predefined schema.

Exam trap

The trap here is that candidates often confuse schema-on-read with schema-on-write, picking Azure SQL Database or Cosmos DB because they support JSON, but those require predefined schemas or containers, failing the schema-on-read requirement.

How to eliminate wrong answers

Option B is wrong because Azure Data Factory mapping data flows are designed for ETL/ELT transformations, not for direct T-SQL querying of data at rest. Option C is wrong because Azure Cosmos DB Core (SQL) API stores data as JSON but does not support schema-on-read; it requires a defined container schema and uses its own SQL dialect, not standard T-SQL. Option D is wrong because Azure SQL Database with JSON columns requires a predefined table schema and does not support schema-on-read for external files; it stores JSON in relational columns, not as files.

Full explanation →

731

MCQeasy

You need to monitor an Azure Data Factory pipeline for failures and send an email notification when a pipeline run fails. Which Azure service should you use to create an alert based on the pipeline run metrics?

A.Microsoft Sentinel

B.Azure Monitor

C.Azure Service Health

D.Azure Log Analytics

AnswerB

Azure Monitor can create metric alerts on 'Failed pipeline runs' metric and trigger an action group to send emails.

Why this answer

Option A is correct because Azure Monitor can create alerts based on ADF metrics like 'Failed pipeline runs'. Option B is wrong because Azure Service Health monitors Azure service health, not pipeline runs. Option C is wrong because Azure Log Analytics is for log queries, not alerting.

Option D is wrong because Microsoft Sentinel is for security.

Full explanation →

732

MCQeasy

A company uses Azure Synapse Analytics dedicated SQL pool. They need to load data from Azure Data Lake Storage Gen2 (ADLS Gen2) incrementally. Which PolyBase external table configuration supports incremental loading without reprocessing historical data?

A.Create an external table with a clustered index on the date column.

B.Create an external table partitioned by date column with a partition location pattern.

C.Create an external table on ORC files with a clustered columnstore index.

D.Create an external table with round-robin distribution on the date column.

AnswerB

Partitioned external tables allow PolyBase to eliminate partitions, loading only new data.

Why this answer

Option B is correct because partitioning an external table by a date column with a partition location pattern allows PolyBase to perform partition elimination during queries. This means only the partitions containing new or modified data are read from ADLS Gen2, enabling incremental loading without scanning historical data.

Exam trap

The trap here is that candidates confuse indexing or distribution strategies (which apply to internal tables) with external table capabilities, overlooking that only partition elimination on external tables enables incremental file-level filtering in PolyBase.

How to eliminate wrong answers

Option A is wrong because a clustered index on an external table is not supported; external tables in Azure Synapse are read-only and cannot have indexes. Option C is wrong because a clustered columnstore index applies to internal tables in the dedicated SQL pool, not to external tables, and does not control which files are read from ADLS Gen2. Option D is wrong because round-robin distribution distributes data evenly across distributions but does not enable partition elimination or incremental loading; it is a distribution method for internal tables, not a mechanism for selective file reading.

Full explanation →

733

MCQeasy

You are using Azure Synapse Analytics to process streaming data from Azure Event Hubs. The data must be written to a Delta Lake table in ADLS Gen2 with exactly-once semantics. Which processing engine should you use?

A.Azure Databricks with Structured Streaming

B.Azure Synapse serverless SQL pool

C.Azure Synapse Pipeline with Mapping Data Flow

D.Azure Stream Analytics

AnswerC

Mapping Data Flows in Synapse can achieve exactly-once semantics when writing to Delta Lake.

Why this answer

Option C is correct because Azure Synapse Pipeline with Mapping Data Flow supports Delta Lake as a sink and can be configured to use Spark-based execution for streaming data from Event Hubs, enabling exactly-once semantics through checkpointing and idempotent writes. Mapping Data Flow runs on Spark clusters within Synapse, providing the necessary transactional guarantees for Delta Lake tables.

Exam trap

The trap here is that candidates often assume Azure Databricks is the only option for Delta Lake with exactly-once semantics, overlooking that Azure Synapse Pipeline with Mapping Data Flow provides equivalent functionality within the Synapse ecosystem, which is the focus of the DP-203 exam.

How to eliminate wrong answers

Option A is wrong because Azure Databricks with Structured Streaming can write to Delta Lake with exactly-once semantics, but it is not a native Azure Synapse Analytics component; the question specifies using Azure Synapse Analytics, making Databricks an external service. Option B is wrong because Azure Synapse serverless SQL pool is designed for on-demand querying of data in data lakes, not for processing streaming data or writing to Delta Lake tables with exactly-once semantics. Option D is wrong because Azure Stream Analytics does not natively support Delta Lake as an output sink; it writes to Azure Blob Storage, ADLS Gen2, or Event Hubs in formats like Parquet or Avro, but lacks the transactional capabilities required for exactly-once semantics in Delta Lake.

Full explanation →

734

MCQmedium

Your organization has an Azure Data Factory pipeline that executes a series of activities to transform data. One of the activities is an Azure Databricks notebook that should run only if the previous activity succeeds. You need to configure the pipeline to handle failures gracefully and send an email alert if the Databricks activity fails. What should you do?

A.Add a failure output path from the Databricks activity to a Web activity that calls an email API.

B.Configure a retry policy and a timeout for the Databricks activity.

C.Use a Schedule trigger to run the pipeline and check for failures using Azure Monitor.

D.Set a dependency condition on the Databricks activity to 'Succeeded' and add a Send Email activity on the success path.

AnswerA

You can route failure output to a Web activity to send an email via Logic Apps or Azure Functions.

Why this answer

Option C is correct because you can add a failure path to the Databricks activity and attach a Web activity to send email. Option A is wrong because dependency conditions are on success, not failure. Option B is wrong because a failure policy on the activity does not send alerts.

Option D is wrong because a Schedule trigger does not handle failures.

Full explanation →

735

MCQeasy

A healthcare organization needs to store electronic health records (EHR) in a format that supports schema flexibility and complex nested data. The solution must allow fast queries by patient ID and enable analytics with Azure Synapse. Which data store should you choose?

A.Azure Table Storage

B.Azure Data Lake Storage Gen2 with files in JSON format

C.Azure Cosmos DB with analytical store enabled

D.Azure SQL Database with JSON columns

AnswerC

Cosmos DB provides schema flexibility, nested data support, and Synapse integration for analytics.

Why this answer

Azure Cosmos DB with analytical store enabled is the correct choice because it provides schema flexibility for complex nested EHR data, supports fast point reads by patient ID via its indexed partition key, and the analytical store enables efficient analytics with Azure Synapse through the Synapse Link feature, which automatically synchronizes operational data into a columnar format optimized for large-scale queries.

Exam trap

The trap here is that candidates often choose Azure SQL Database with JSON columns (Option D) because they assume relational databases can handle JSON, but they overlook the requirement for schema flexibility and native analytical store integration, which Cosmos DB with analytical store uniquely provides for hybrid transactional/analytical processing (HTAP) workloads.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a key-value store that does not support complex nested data structures or schema flexibility for hierarchical EHR records, and it lacks native integration with Azure Synapse for analytics. Option B is wrong because while Azure Data Lake Storage Gen2 with JSON files can store nested data, it does not provide fast point queries by patient ID without additional indexing or processing, and it requires separate ETL for analytics rather than real-time analytical store access. Option D is wrong because Azure SQL Database with JSON columns imposes a fixed relational schema and does not offer the same level of schema flexibility as a NoSQL document store; JSON columns also complicate indexing and nested query performance, and it lacks a built-in analytical store for seamless Synapse integration.

Full explanation →

736

MCQeasy

You need to monitor the performance of an Azure Synapse Analytics dedicated SQL pool. Which DMV should you query to find queries that are currently running and their execution status?

A.sys.dm_pdw_nodes

B.sys.dm_pdw_request_steps

C.sys.dm_pdw_errors

D.sys.dm_pdw_exec_requests

AnswerD

This DMV lists all currently executing requests and their status.

Why this answer

Option D is correct because sys.dm_pdw_exec_requests shows currently running requests in a dedicated SQL pool. Option A is wrong because sys.dm_pdw_nodes is a system view, not for queries. Option B is wrong because sys.dm_pdw_request_steps shows steps of completed or running requests, not high-level status.

Option C is wrong because sys.dm_pdw_errors shows errors, not running queries.

Full explanation →

737

Multi-Selecteasy

Which TWO features of Azure Databricks help manage data governance and security for sensitive data?

Select 2 answers

A.Structured Streaming

B.Secret Scopes

C.Auto Loader

D.Delta Live Tables

E.Unity Catalog

AnswersB, E

Secret Scopes securely store and manage access tokens and keys.

Why this answer

Secret Scopes (B) allow secure storage and referencing of sensitive credentials (e.g., API keys, database passwords) in Azure Databricks, preventing hardcoding in notebooks. Unity Catalog (E) provides fine-grained access control, data lineage, and centralized metadata management across workspaces, enabling governance of sensitive data through policies and auditing.

Exam trap

The trap here is that candidates confuse data processing features (Structured Streaming, Auto Loader, Delta Live Tables) with governance/security tools, because all are part of the Databricks ecosystem but serve fundamentally different purposes.

Full explanation →

738

MCQhard

Refer to the exhibit. You are an Azure data engineer responsible for ensuring that all storage accounts used in data pipelines enforce HTTPS traffic. You apply the Azure Policy definition shown above. Later, a data engineer creates a new storage account with 'Enable secure transfer' set to Disabled. What will happen when the policy is evaluated?

A.The storage account will be created, but the policy will be evaluated later during a compliance scan.

B.The storage account will be created with HTTPS enabled automatically.

C.The storage account creation will be denied and the request will fail.

D.The storage account will be created, but an audit event will be logged.

AnswerC

The policy denies the creation if the condition is met.

Why this answer

Option A is correct because the policy denies creation of storage accounts that do not have HTTPS enforced (supportsHttpsTrafficOnly = false). The policy is evaluated at creation time and will deny the request. Option B is wrong because the policy has deny effect, not audit.

Option C is wrong because the policy does not set the property automatically. Option D is wrong because the policy is evaluated during creation, not after.

Full explanation →

739

MCQeasy

Your company uses Azure Blob Storage to store backups. You need to ensure that data is encrypted at rest using a customer-managed key stored in Azure Key Vault. Which feature should you enable?

A.Azure Purview

B.Azure Disk Encryption

C.Azure Storage Service Encryption with customer-managed keys

D.Azure Information Protection

AnswerC

Allows using CMK from Key Vault for Blob Storage.

Why this answer

Option A is correct because Azure Storage Encryption supports customer-managed keys via Key Vault. Option B is incorrect because Azure Information Protection is for classification and labeling. Option C is incorrect because Azure Disk Encryption is for VMs, not Blob Storage.

Option D is incorrect because Azure Purview is a data governance service.

Full explanation →

740

Multi-Selecthard

Your company stores JSON documents in Azure Cosmos DB Core (SQL) API. You need to improve query performance for a common filter on the 'status' field and a sort on 'timestamp'. Which three actions should you take?

Select 3 answers

A.Create a composite index on (status, timestamp)

B.Increase the provisioned RU/s

C.Choose a partition key that ensures even distribution

D.Denormalize the status field into a separate container

E.Disable indexing on the timestamp field

AnswersA, B, C

Why this answer

A composite index on (status, timestamp) is correct because Azure Cosmos DB Core (SQL) API uses composite indexes to efficiently support queries with multiple filter and sort conditions. This index allows the query engine to first filter on the 'status' field and then sort by 'timestamp' in a single index seek, avoiding a full scan or in-memory sort. Without this composite index, the query would require a cross-partition scan or an expensive sort operation, degrading performance.

Exam trap

The trap here is that candidates may think increasing RU/s (Option B) or disabling indexing (Option E) are sufficient fixes, but the core issue is the lack of an appropriate composite index to support the combined filter and sort, which is a common DP-203 exam pattern for Cosmos DB query optimization.

Why the other options are wrong

D

Unnecessary complexity; composite index is sufficient.

E

Sorting requires indexing on that field.

Full explanation →

741

Multi-Selecthard

You are optimizing an Azure Synapse Analytics dedicated SQL pool. The workload includes large fact tables and dimension tables. You need to improve query performance for star join queries. Which TWO actions should you take?

Select 2 answers

A.Use round-robin distribution on fact tables.

B.Use hash distribution on dimension tables.

C.Use heap table structure for fact tables.

D.Use replicated distribution on dimension tables.

E.Use hash distribution on fact tables using the join key.

AnswersD, E

Replicated tables avoid data movement.

Why this answer

Options B and D are correct. Using hash-distributed fact tables on the join key and replicated dimension tables are best practices for star joins. Option A is wrong because round-robin is for staging tables, not for fact tables.

Option C is wrong because clustered columnstore index is default and recommended, not heap. Option E is wrong because distribution on dimension tables should be replicated, not hash.

Full explanation →

742

MCQmedium

Refer to the exhibit. A user with Storage Blob Data Reader role on the container rawdata cannot list files under /2023/07/. What is the most likely reason?

A.The directory ACL does not grant 'execute' permission to the user

B.The user does not have Storage Blob Data Contributor role

C.The user is not the owner of the directory

D.The container name is misspelled

AnswerA

To list directory contents, the user needs execute permission on the directory. 'other' has no permissions, so the user (not being owner or group) is denied.

Why this answer

The ACL shows that 'other' has no permissions (---). The user does not have explicit ACL entries, so they fall under 'other'. Without execute permission (--x) on the directory, they cannot traverse it.

Although they have Reader role at container level, POSIX ACLs on the directory restrict access.

Full explanation →

743

MCQhard

You have a production pipeline in Azure Data Factory that copies data from an on-premises SQL Server to Azure Blob Storage using a self-hosted integration runtime. The pipeline fails intermittently with a 'Connection closed' error. The data volume is 50 GB per run. What should you first troubleshoot to resolve this issue?

A.Increase the memory and CPU resources on the self-hosted integration runtime machine and check network stability.

B.Increase the 'connection timeout' setting in the linked service to 30 minutes.

C.Change the copy activity to use staged copy with Azure Blob Storage as an intermediate store.

D.Disable fault tolerance in the copy activity to improve performance.

AnswerA

The self-hosted IR needs sufficient resources for large data transfers; 'Connection closed' often indicates resource exhaustion or network interruptions.

Why this answer

Option A is correct because the self-hosted IR's memory and network stability are common causes of 'Connection closed' errors with large data volumes. Option B (disable fault tolerance) would actually make the issue worse. Option C (change to staging copy) might help but is not the first step.

Option D (increase TTL) addresses only idle connections, not active data transfers.

Full explanation →

744

Multi-Selectmedium

Which TWO actions should you take to secure data at rest in Azure Data Lake Storage Gen2?

Select 2 answers

A.Enable Azure Storage Service Encryption (SSE) for data at rest

B.Enable diagnostic settings to audit access

C.Configure firewall rules to restrict network access

D.Enable soft delete for blobs

E.Use Azure RBAC to grant least privilege access to storage accounts

AnswersA, E

SSE encrypts all data at rest automatically.

Why this answer

Enabling encryption at rest (Azure Storage Service Encryption) and using Azure RBAC to control access are both security best practices. Option C is wrong because firewall rules are network-level, not data-at-rest. Option D is wrong because soft delete is for recovery, not security.

Option E is wrong because diagnostic settings are for auditing, not encryption.

Full explanation →

745

Multi-Selecthard

Which THREE factors should you consider when choosing between Azure Data Factory Mapping Data Flows and Azure Synapse Spark pools for data transformation?

Select 3 answers

A.Scheduling: Only Data Flows can be scheduled via triggers.

B.Ease of use: Mapping Data Flows provide a visual designer, while Spark requires code.

C.Data volume: Data Flows are limited to 100 GB, while Spark can handle petabytes.

D.Integration with other services: Data Flows can use integration runtimes, while Spark is limited to Synapse.

E.Debugging: Data Flows have a debug session limit of 8 hours, while Spark pools have no debug limit.

AnswersB, D, E

Data Flows are no-code, Spark requires coding.

Why this answer

Option B is correct because Azure Data Factory Mapping Data Flows offer a visual, no-code designer for building data transformations, which lowers the barrier for users who are not proficient in programming. In contrast, Azure Synapse Spark pools require writing code in languages like PySpark, Scala, or SQL, making them more suitable for developers comfortable with coding. This distinction directly addresses ease of use as a key factor in choosing between the two services.

Exam trap

The trap here is that candidates assume Mapping Data Flows have a hard data volume limit (like 100 GB) or that Spark pools cannot be scheduled, when in fact both services are highly scalable and can be orchestrated via triggers, and the key differentiator is the coding versus visual interface.

Full explanation →

746

Multi-Selectmedium

Which THREE factors should be considered when choosing between Azure Synapse Dedicated SQL Pool and Azure SQL Database for a data warehouse workload?

Select 3 answers

A.Need for geo-replication

B.Concurrency and workload management

C.Integration with Azure Data Lake Storage

D.Volume of data to be stored

E.Complexity of analytical queries

AnswersB, D, E

Dedicated SQL Pool has workload management.

Why this answer

Option B is correct because Azure Synapse Dedicated SQL Pool provides built-in workload management features like workload classification, importance, and concurrency slots, which are essential for managing mixed analytical workloads with varying resource demands. Azure SQL Database, while supporting resource governance, lacks the granular workload management capabilities needed for complex data warehouse scenarios with concurrent queries.

Exam trap

The trap here is that candidates often assume geo-replication (Option A) is a critical differentiator, but both services support it, and the real decision hinges on workload management, data volume scalability, and analytical query complexity.

Full explanation →

747

MCQhard

You have an Azure Synapse Analytics dedicated SQL pool with a large fact table partitioned by month. You notice that queries filtering on a specific month still scan all partitions. The table has a clustered columnstore index. What is the most likely cause?

A.The table is hash-distributed on a column other than date, preventing partition elimination.

B.The statistics on the date column are outdated or missing.

C.The columnstore index has a large number of open rowgroups that are not compressed.

D.The table is replicated, so partition elimination does not apply.

AnswerC

Open rowgroups cause the entire columnstore to be scanned, bypassing partition elimination.

Why this answer

Option C is correct because when a clustered columnstore index has a large number of open rowgroups (i.e., rowgroups that have not yet been compressed into columnstore segments), queries may fall back to the delta store, which stores data in a B-tree-like structure. This bypasses partition elimination because the delta store does not maintain the same metadata for partition pruning, causing all partitions to be scanned. Compressing open rowgroups into closed, compressed rowgroups restores proper partition elimination behavior.

Exam trap

The trap here is that candidates often assume partition elimination is always effective with columnstore indexes, overlooking that unclosed delta store rowgroups can disable this optimization, and they may incorrectly attribute the issue to distribution or statistics.

How to eliminate wrong answers

Option A is wrong because hash distribution on a non-date column does not prevent partition elimination; partition elimination is a storage-level optimization based on partition boundaries, independent of distribution method. Option B is wrong because statistics on the date column are used for cardinality estimation, not for partition elimination; partition elimination is determined by metadata about partition ranges, not statistics. Option D is wrong because replicated tables are fully copied to each distribution, but partition elimination still applies when querying a replicated table if it is partitioned; the issue here is specific to the columnstore index's rowgroup state, not replication.

Full explanation →

748

MCQhard

Your team is running a critical Azure Stream Analytics job that writes results to Azure SQL Database. Recently, the job has been failing with high latency and occasional data loss. You need to monitor the job's performance and set up alerts for when the watermark delay exceeds a threshold. What should you use?

A.Application Insights SDK integration in the job.

B.Azure Log Analytics workspace connected to the job diagnostics logs.

C.Azure Monitor metrics for the Stream Analytics job.

D.Azure Data Explorer for querying job performance data.

AnswerC

Azure Monitor provides built-in metrics like watermark delay and can trigger alerts.

Why this answer

Option A is correct because Azure Stream Analytics provides job metrics in Azure Monitor, including watermark delay. Option B is wrong because Log Analytics can store logs but not directly show Stream Analytics metrics. Option C is wrong because Application Insights is for application monitoring, not Stream Analytics.

Option D is wrong because Azure Data Explorer is not for monitoring Stream Analytics.

Full explanation →

749

MCQeasy

You are designing a data processing solution using Azure Databricks. The data is stored in Delta Lake format. You need to ensure that when you read the latest version of the table, you only see committed data and not uncommitted transactions. Which isolation level should you use?

A.WriteSerializable

B.Serializable

C.ReadUncommitted

D.SnapshotIsolation

AnswerD

Delta Lake uses Snapshot isolation to read the latest committed version.

Why this answer

Snapshot isolation is the correct choice because it provides a consistent view of the table by reading only the latest committed data, ignoring any uncommitted transactions. In Delta Lake, snapshot isolation ensures that readers see a snapshot of the table at a specific version, which includes only committed changes, making it ideal for read consistency without blocking concurrent writes.

Exam trap

The trap here is that candidates confuse write isolation levels (like WriteSerializable) with read isolation levels, or assume that Serializable is always the safest choice for consistency, when in fact Snapshot isolation is the specific Delta Lake mechanism for reading only committed data without blocking.

How to eliminate wrong answers

Option A (WriteSerializable) is wrong because it is a write isolation level that ensures serializable isolation for write operations, but it does not control read behavior to exclude uncommitted data. Option B (Serializable) is wrong because it is a general isolation level that prevents dirty reads, non-repeatable reads, and phantom reads, but it is not specifically designed to guarantee that only committed data is visible in Delta Lake; it also imposes more overhead than needed. Option C (ReadUncommitted) is wrong because it allows reading uncommitted data (dirty reads), which directly violates the requirement to see only committed data.

Full explanation →

750

MCQhard

A company is migrating its on-premises SQL Server data warehouse to Azure Synapse Analytics. They have a fact table with 2 billion rows and 30 columns. The table is frequently joined on CustomerID and filtered on OrderDate. What is the recommended table design?

A.Hash-distribute on CustomerID and partition on OrderDate

B.Replicate the table to all nodes

C.Round-robin distribution with partitions on OrderDate

D.Hash-distribute on OrderDate and partition on CustomerID

AnswerA

Optimizes joins and filtering.

Why this answer

Hash-distributing the fact table on CustomerID ensures that rows with the same CustomerID are co-located on the same distribution node, which makes joins on CustomerID efficient by avoiding data movement. Partitioning on OrderDate enables partition elimination when filtering by date, reducing the amount of data scanned. This combination optimizes both the join and filter operations for a large fact table in Azure Synapse Analytics.

Exam trap

The trap here is that candidates often confuse the roles of distribution and partitioning, thinking that partitioning on the join column or distributing on the filter column will improve performance, when in fact distribution should align with join keys and partitioning with filter keys.

How to eliminate wrong answers

Option B is wrong because replicating a 2-billion-row fact table to all nodes would consume excessive storage and cause significant overhead during data loading and maintenance, and it is intended for small dimension tables, not large fact tables. Option C is wrong because round-robin distribution distributes rows evenly but without any logical grouping, so joins on CustomerID would require shuffling all data across nodes, leading to poor performance. Option D is wrong because hash-distributing on OrderDate would scatter rows with the same CustomerID across nodes, making joins on CustomerID highly inefficient, and partitioning on CustomerID is not supported (partition columns must be date/time types in Synapse) and would not help with date-based filtering.

Full explanation →

Page 10 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →