Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 751–825

846 questions total · 12pages · All types, answers revealed

Take a mock exam Exam hub

Page 11 of 12

751

MCQhard

You are reviewing an Azure Policy definition. What does this policy do?

A.Denies the creation or update of a storage account if its default blob service version is not set to '2020-10-02'

B.Sets the default blob service version of all storage accounts to '2020-10-02'

C.Audits storage accounts to check if their default blob service version is set to a value other than '2020-10-02'

D.Denies the creation of storage accounts that have the default blob service version set to '2020-10-02'

AnswerA

The policy denies when the field does not equal the specified version.

Why this answer

Option A is correct because the policy denies any storage account that does not have the default service version set to '2020-10-02'. Option B is wrong because it denies, not audits. Option C is wrong because it applies to all storage accounts, not just those with defaultServiceVersion set.

Option D is wrong because it does not set the version; it denies if not set.

Full explanation →

752

MCQhard

Your Azure Data Lake Storage Gen2 account stores sensitive customer data. You need to ensure that data is encrypted at rest using customer-managed keys (CMK) and that access to the encryption key is logged. What should you do?

A.Enable infrastructure encryption on the storage account.

B.Enable double encryption using both platform-managed and customer-managed keys.

C.Configure customer-managed keys in Azure Key Vault and enable Key Vault diagnostics logging.

D.Use Azure Storage Service Encryption (SSE) with platform-managed keys.

AnswerC

CMK in Key Vault allows customer control, and diagnostics logs capture key access events.

Why this answer

Option D is correct because CMK with Key Vault provides customer-controlled encryption keys, and Key Vault diagnostics logs key access. Option A is wrong because infrastructure encryption uses platform-managed keys. Option B is wrong because SSE with platform-managed keys does not give customer control.

Option C is wrong because Double Encryption uses both platform and customer keys, but the primary requirement is CMK and logging.

Full explanation →

753

Multi-Selectmedium

Which TWO of the following are supported storage options for use as a source in Azure Synapse Pipeline Copy Activity?

Select 2 answers

A.Azure Data Lake Storage Gen2

B.Azure Analysis Services

C.Azure Cognitive Search

D.Azure Purview

E.Azure Blob Storage

AnswersA, E

ADLS Gen2 is a supported source.

Why this answer

Azure Data Lake Storage Gen2 is a supported source for Azure Synapse Pipeline Copy Activity because it combines a hierarchical file system with Azure Blob Storage APIs, enabling efficient data ingestion. The Copy Activity can read data from ADLS Gen2 using the AzureBlobFS linked service, which supports both file and folder-level reads for structured and unstructured data.

Exam trap

The trap here is that candidates confuse Azure services that manage or process data (like Analysis Services, Cognitive Search, or Purview) with actual storage services that can serve as a source for the Copy Activity, leading them to select non-storage options.

Full explanation →

754

MCQeasy

You need to audit all queries run against an Azure Synapse Analytics serverless SQL pool. What should you enable?

A.Azure SQL Auditing on the serverless SQL pool endpoint

B.Microsoft Purview to scan and catalog queries

C.Azure Policy to enforce auditing

D.Azure Monitor diagnostic settings

AnswerA

Auditing captures detailed query logs.

Why this answer

Option B is correct because auditing is configured at the server level for SQL pools, capturing query logs. Option A (Azure Monitor) is for metrics, not query text. Option C (Azure Policy) is for compliance.

Option D (Azure Purview) is for data cataloging.

Full explanation →

755

MCQmedium

You are troubleshooting a failed Azure Synapse Pipeline execution. The pipeline uses a Copy activity to load data from an on-premises SQL Server to Azure Data Lake Storage Gen2. The error indicates a 'Connection timeout' to the on-premises source. The Integration Runtime is Self-Hosted and has been running successfully for months. What is the most likely cause?

A.The SQL Server authentication credentials have expired.

B.The Self-Hosted Integration Runtime is not installed.

C.The on-premises network configuration has changed.

D.The Azure Storage account firewall is blocking access.

AnswerC

Network changes can block connectivity to the SQL Server.

Why this answer

Option B is correct because a change in the on-premises network configuration (e.g., firewall rules, DNS) could block the Integration Runtime from reaching the SQL Server. Option A is wrong because the Integration Runtime was running, so installation is not the issue. Option C is wrong because the storage account is the destination, not the source.

Option D is wrong because the Integration Runtime was successful before, so credentials are likely valid.

Full explanation →

756

MCQmedium

Refer to the exhibit. You are deploying an Azure Synapse Analytics workspace using an ARM template. The exhibit shows the encryption configuration. What is the effect of setting infrastructureEncryption to Enabled?

A.It disables encryption using the customer-managed key.

B.It enables encryption of data in transit between nodes.

C.It enables transparent data encryption for SQL pools.

D.It adds a second layer of encryption at the infrastructure level, ensuring data is encrypted at rest with two different keys.

AnswerD

Infrastructure encryption provides double encryption for data at rest.

Why this answer

Option D is correct because infrastructure encryption (double encryption) encrypts data at both the service level and the infrastructure level. Option A is wrong because it encrypts at rest, not in transit. Option B is wrong because it does not disable encryption; it adds another layer.

Option C is wrong because it's not for logs.

Full explanation →

757

MCQmedium

You are designing a change data capture (CDC) solution to incrementally load data from an on-premises SQL Server database to Azure Synapse Analytics. The source tables have no timestamp columns and you cannot modify the schema. Which Azure service should you use?

A.Azure Synapse Pipelines with mapping data flows

B.Azure Data Factory with change tracking

C.Azure Databricks with Auto Loader

D.Azure Stream Analytics

AnswerB

ADF can enable change tracking on SQL Server or use custom watermark logic.

Why this answer

Azure Data Factory's change tracking capability is the correct choice because it can capture row-level inserts, updates, and deletes from SQL Server without requiring timestamp columns or schema modifications. It uses SQL Server's built-in change tracking feature, which tracks changes at the table level and provides a reliable incremental load mechanism to Azure Synapse Analytics.

Exam trap

The trap here is that candidates often assume a timestamp column is mandatory for incremental loads, but Azure Data Factory's change tracking connector bypasses this requirement by using SQL Server's built-in change tracking mechanism, which does not require any schema modifications.

How to eliminate wrong answers

Option A is wrong because Azure Synapse Pipelines with mapping data flows require a timestamp column or a watermark to identify changed rows, and they cannot leverage SQL Server's change tracking without schema modifications. Option C is wrong because Azure Databricks with Auto Loader is designed for incremental ingestion of file-based data (e.g., from cloud storage), not for capturing changes from a relational database like SQL Server. Option D is wrong because Azure Stream Analytics is a real-time event processing service for streaming data (e.g., from IoT hubs or event hubs), not a batch or incremental load solution for database tables.

Full explanation →

758

MCQmedium

You are responsible for managing an Azure Data Lake Storage Gen2 account that stores parquet files for analytics. You need to implement a data retention policy that automatically deletes files older than 90 days in the 'logs' container. Additionally, you need to ensure that no data is lost due to accidental deletion; you want to be able to recover deleted files within 30 days. You also need to monitor the storage account for unusual access patterns. The solution must minimize administrative effort. What should you do?

A.Enable soft delete with a retention period of 30 days and configure a lifecycle management rule to delete blobs older than 90 days

B.Create an Azure Policy to enforce tag-based retention and use Azure Monitor to alert on access

C.Enable versioning and configure a retention policy in Azure Policy

D.Use Azure Backup for the storage account and manually delete old files

AnswerA

Soft delete provides recovery, lifecycle management deletes old data

Why this answer

Option B is correct because enabling soft delete provides recovery within the retention period (30 days), and using a lifecycle management policy can automatically delete files older than 90 days. Option A is wrong because Azure Policy does not manage retention or recovery. Option C is wrong because snapshots are for blobs, not for ADLS Gen2 directories.

Option D is wrong because it does not provide automatic deletion based on age.

Full explanation →

759

MCQhard

You have a Data Factory pipeline that runs a U-SQL script in Azure Data Lake Analytics. The script processes terabytes of data and outputs to a CSV file. The pipeline is failing with the error: 'The job failed with UserError: Script execution failed.' You need to troubleshoot the issue. Which approach should you take first?

A.Change the output format to Parquet to reduce file size.

B.Review the job logs in Azure Data Lake Analytics to identify the specific script error.

C.Migrate the U-SQL script to Azure Synapse Spark pool.

D.Increase the degree of parallelism for the U-SQL job.

AnswerB

Job logs provide detailed error messages that pinpoint the issue.

Why this answer

The most effective first step is to examine the detailed job logs in Data Lake Analytics, which contain the actual script error. Increasing parallelism or changing output format may not address the underlying script error. Moving to Azure Synapse is a larger architectural change.

Full explanation →

760

MCQmedium

You are reviewing the ARM template snippet for an Azure Data Lake Storage Gen2 account. The template fails to deploy with an error that the encryption key cannot be accessed. What is the most likely cause?

A.The key vault does not have soft-delete enabled.

B.The Data Lake Storage account does not have Get and Wrap Key permissions on the key vault.

C.The key name or version is incorrect.

D.The key vault URI is incorrectly formatted.

AnswerB

The storage account's identity must have these permissions to use the key.

Why this answer

Option C is correct because the Data Lake Storage account's managed identity (or the user deploying) needs access to the Key Vault to retrieve the key. If the identity does not have 'Get' and 'Wrap Key' permissions, it will fail. Option A is wrong because the URI is valid.

Option B is wrong because the key version exists. Option D is wrong because soft-delete is not required for access, though it is recommended.

Full explanation →

761

Multi-Selecteasy

You are developing a data pipeline in Azure Data Factory that ingests data from multiple on-premises SQL Server databases to Azure Data Lake Storage Gen2. The data volume is about 1 TB per day. You need to ensure the pipeline can handle the volume and provide monitoring and alerting. Which THREE components should you include?

Select 3 answers

A.Azure Synapse Analytics pipeline

B.Self-hosted integration runtime

C.Azure Monitor

D.Power BI

E.Data flow activity

AnswersB, C, E

Required to connect to on-premises SQL Server.

Why this answer

Correct answers: A, C, and E. Self-hosted integration runtime is required for on-premises connectivity. Azure Monitor provides monitoring and alerting.

Data flow can be used for data transformation at scale. Option B is wrong because Azure Synapse Analytics is not used in this pipeline. Option D is wrong because Power BI is for visualization, not data ingestion.

Full explanation →

762

Multi-Selecteasy

You need to secure access to an Azure Data Lake Storage Gen2 account. Which THREE methods can you use to authenticate and authorize access?

Select 3 answers

A.SQL connection strings.

B.Shared access signatures (SAS).

C.Managed identities.

D.Access control lists (ACLs).

E.Azure RBAC roles.

AnswersB, D, E

SAS tokens provide delegated access to resources.

Why this answer

Options A, B, and E are correct. RBAC roles, ACLs, and SAS tokens are all valid methods. Option C is wrong because connection strings are for Azure SQL Database.

Option D is wrong because managed identities are identities, not directly authorization methods, but they can be used with RBAC; however, the question asks for methods to authenticate and authorize, and managed identity is an identity type, not a method.

Full explanation →

763

MCQhard

Refer to the exhibit. You are reviewing an ARM template for an Azure Data Lake Storage Gen2 account. Which of the following security best practices is violated in this template?

A.The location should be fixed instead of using resourceGroup().location

B.The SKU should be Standard_GRS for disaster recovery

C.The account does not enable hierarchical namespace (HNS)

D.The account allows HTTP traffic and uses an outdated TLS version

AnswerD

supportsHttpsTrafficOnly: false and TLS1_0 are insecure.

Why this answer

Setting supportsHttpsTrafficOnly to false allows HTTP traffic, which is insecure. Also, minimumTlsVersion TLS1_0 is outdated and insecure. Option A is wrong because HNS is enabled correctly.

Option B is wrong because Standard_LRS is a valid SKU. Option D is wrong because location is correct.

Full explanation →

764

Multi-Selecteasy

You need to design a storage solution for a data lake that will be used by multiple teams for analytics. The solution must support fine-grained access control, versioning of files, and integration with Azure Purview for data cataloging. Which THREE features should you enable in Azure Data Lake Storage Gen2?

Select 3 answers

A.Integration with Microsoft Purview

B.Managed identity authentication

C.Blob versioning

D.Access control lists (ACLs)

E.Soft delete

AnswersA, C, D

Purview integrates with Data Lake Storage for data cataloging.

Why this answer

Option A is correct because Microsoft Purview integration is required for data cataloging, enabling automated scanning, classification, and lineage tracking of data stored in Azure Data Lake Storage Gen2. This directly supports the requirement for data cataloging in the scenario.

Exam trap

The trap here is that candidates confuse soft delete with versioning, but soft delete only recovers deleted blobs for a retention period, whereas versioning maintains all overwrites and deletes as separate versions for granular recovery and audit.

Full explanation →

765

MCQmedium

Refer to the exhibit. You are reviewing a Data Factory JSON definition. The factory has a user-assigned managed identity configured. However, the linked service to Azure Storage uses an account key. What security improvement should you recommend?

A.Add a firewall rule to limit access to the storage account

B.Modify the linked service to use the managed identity for authentication

C.Remove the managed identity and use a service principal

D.Keep the account key but store it in Azure Key Vault

AnswerB

Managed identity eliminates the need for account key.

Why this answer

Using the managed identity instead of account key is more secure because it avoids storing secrets. Option A is wrong because the managed identity is already configured but not used. Option B is wrong because firewall rules are separate.

Option D is wrong because switching to Key Vault still stores a secret; managed identity is preferred when possible.

Full explanation →

766

MCQeasy

You need to implement column-level security in Azure Synapse Analytics to restrict access to salary information. Only users with the 'HRManager' role should see salary columns. Which feature should you use?

A.Row-level security using security predicates

B.Dynamic data masking

C.Column-level security using GRANT on columns

D.Azure Purview data classification

AnswerC

CLS allows granting SELECT on specific columns to roles.

Why this answer

Option A is correct because column-level security (CLS) in Azure Synapse Analytics uses GRANT on specific columns. Option B is wrong because row-level security filters rows, not columns. Option C is wrong because dynamic data masking obfuscates data but does not restrict access.

Option D is wrong because Azure Purview is a governance tool, not a security enforcement mechanism.

Full explanation →

767

MCQeasy

A data engineer needs to store semi-structured JSON logs from IoT devices. The data will be queried using SQL and must support high-throughput writes. Which Azure data store is most appropriate?

A.Azure Blob Storage with JSON blobs

B.Azure Cosmos DB with Core (SQL) API

C.Azure Data Lake Storage Gen2 with JSON files and PolyBase

D.Azure SQL Database with JSON columns

AnswerB

Optimized for semi-structured JSON, native SQL API, and high-throughput writes.

Why this answer

Azure Cosmos DB with Core (SQL) API is the most appropriate choice because it natively stores semi-structured JSON documents, supports high-throughput writes with single-digit millisecond latency, and allows querying the JSON data directly using SQL syntax. Its schema-agnostic design and automatic indexing make it ideal for IoT workloads where device telemetry arrives at high velocity and must be immediately queryable.

Exam trap

The trap here is that candidates often choose Azure Blob Storage or Data Lake Storage because they associate JSON files with cheap storage, but they overlook the requirement for high-throughput writes and native SQL querying, which Cosmos DB uniquely satisfies among the options.

How to eliminate wrong answers

Option A is wrong because Azure Blob Storage with JSON blobs does not provide native SQL querying capabilities; querying would require additional services like Azure Synapse or external tools, and it is optimized for large, infrequent access rather than high-throughput writes. Option C is wrong because Azure Data Lake Storage Gen2 with JSON files and PolyBase is designed for analytical batch processing and large-scale data lakes, not for real-time high-throughput writes; PolyBase is used for querying external data in Synapse, not for direct ingestion at IoT scale. Option D is wrong because Azure SQL Database with JSON columns imposes a fixed relational schema and transactional overhead that cannot match the write throughput and schema flexibility of Cosmos DB; it is optimized for ACID transactions and structured data, not for high-velocity semi-structured ingestion.

Full explanation →

768

MCQmedium

A data engineer is tasked with optimizing a Spark job in Azure Synapse Analytics that processes 10 TB of data daily. The job currently uses 50 executors with 4 cores each. The performance is bottlenecked by shuffle operations. The engineer wants to reduce shuffle data size. Which technique should be applied?

A.Coalesce the number of partitions before the shuffle.

B.Increase the number of shuffle partitions.

C.Broadcast all tables to avoid shuffles.

D.Use column pruning to select only required columns before shuffle.

AnswerD

Reduces data volume.

Why this answer

Option B is correct because column pruning reduces the amount of data shuffled by eliminating unnecessary columns. Option A is incorrect because increasing parallelism may increase shuffle overhead. Option C is incorrect because coalescing reduces partitions but does not reduce shuffle data.

Option D is incorrect because broadcasting is for small tables, not 10 TB.

Full explanation →

769

MCQhard

You are designing a data processing solution for a healthcare organization. The solution must process streaming data from IoT devices and store it in Azure Data Lake Storage Gen2. The data must be available for both real-time dashboards and historical analysis. You need to minimize operational overhead. What should you do?

A.Ingest data via Azure Functions and write to Data Lake Storage; use Power BI to query Data Lake

B.Use Azure Stream Analytics to output to both Power BI and Data Lake Storage

C.Use Azure Databricks with Structured Streaming to write to Data Lake Storage and use Power BI DirectQuery

D.Ingest data to Azure Event Hubs, then use Event Hubs Capture to store in Data Lake Storage; use Power BI with Event Hubs

AnswerB

Stream Analytics is serverless, supports real-time output to Power BI and batch writes to Data Lake.

Why this answer

Using Azure Stream Analytics with output to both Power BI (real-time) and Data Lake Storage (historical) is a common pattern with minimal overhead. Option A is wrong because Azure Functions would require custom code for batching. Option B is wrong because Event Hubs doesn't provide real-time dashboards.

Option C is wrong because Azure Databricks would require cluster management.

Full explanation →

770

MCQmedium

Refer to the exhibit. You run the KQL query in Azure Data Explorer. What is the output?

A.Name: Alice, Age: 30; Name: Charlie, Age: 35

B.Id: 1, Name: Alice, Age: 30; Id: 3, Name: Charlie, Age: 35

C.Name: Alice, Age: 30; Name: Bob, Age: 25; Name: Charlie, Age: 35

D.Name: Alice, Age: 30; Name: Bob, Age: 25; Name: Charlie, Age: 35

AnswerA

Only Alice and Charlie have Age >25.

Why this answer

The KQL query uses the `take` operator to return a specified number of rows from the table. Since the query does not include an `order by` clause, the rows returned are non-deterministic but will be the first rows encountered in the data shard. In this case, the query `take 2` returns two rows, which are Alice (Age 30) and Charlie (Age 35), as shown in the exhibit.

The `project` operator then selects only the Name and Age columns, so the output is exactly those two rows with those columns.

Exam trap

The trap here is that candidates often assume `take` returns the first N rows in the order they appear in the table (like a top-N query without ordering), but without an explicit `order by`, the rows are non-deterministic and depend on data shard layout, leading to confusion when the output does not match the expected sequence.

How to eliminate wrong answers

Option B is wrong because it includes the Id column, but the query uses `project Name, Age` which explicitly excludes the Id column. Option C is wrong because it shows three rows (Alice, Bob, Charlie), but the query uses `take 2` which limits the output to exactly two rows. Option D is wrong for the same reason as Option C — it shows three rows instead of two, and also includes Bob who is not in the result set.

Full explanation →

771

MCQmedium

Your organization uses Azure Synapse Analytics. You need to design a data transformation pipeline that processes streaming data from Azure Event Hubs, performs aggregations over a 5-minute tumbling window, and loads the results into a dedicated SQL pool table. Which Azure service should you use to implement the streaming transformation?

A.Azure Stream Analytics

B.Azure Data Factory

C.Apache Spark for Azure Synapse

D.Azure Functions

AnswerA

Azure Stream Analytics can ingest from Event Hubs, perform tumbling window aggregations, and output to Synapse SQL pool.

Why this answer

Option A is correct because Azure Stream Analytics is the appropriate service for real-time stream processing with windowed aggregations. Option B is wrong because Azure Data Factory is for batch orchestration. Option C is wrong because Spark Structured Streaming is for big data workloads but less integrated with SQL pools.

Option D is wrong because Azure Functions is not designed for streaming windowed aggregations.

Full explanation →

772

Multi-Selecteasy

Which TWO monitoring metrics in Azure Monitor for Azure Synapse Analytics dedicated SQL pool can help identify performance bottlenecks? (Choose two.)

Select 2 answers

A.CPU percentage

B.Queued queries

C.Storage used

D.Data movement (shuffle) metrics

E.DWU used

AnswersB, D

High number of queued queries indicates concurrency issues.

Why this answer

Options A and C are correct. A indicates concurrency bottlenecks. C indicates data movement bottlenecks.

B is wrong because it shows resource usage, not bottlenecks. D is wrong because it shows storage usage. E is wrong because it shows DWU usage, not specific bottlenecks.

Full explanation →

773

MCQeasy

You are designing a data storage solution for IoT sensor data. The data is written thousands of times per second and requires low-latency reads for real-time dashboards. Which Azure storage solution should you use?

A.Azure Blob Storage

B.Azure Cosmos DB

C.Azure SQL Database

D.Azure Data Lake Storage Gen2

AnswerB

Why this answer

Azure Cosmos DB is the correct choice because it provides single-digit millisecond read and write latency at any scale, with automatic indexing and multi-region distribution. Its support for multiple APIs (SQL, MongoDB, Cassandra, etc.) and configurable consistency levels makes it ideal for IoT sensor data requiring high-throughput writes and low-latency reads for real-time dashboards.

Exam trap

The trap here is that candidates often choose Azure Blob Storage or Data Lake Storage Gen2 because they associate IoT data with 'storage' rather than 'real-time querying,' overlooking the critical requirement for low-latency reads and high-frequency writes that only a NoSQL database like Cosmos DB can satisfy.

Why the other options are wrong

A

Blob Storage is optimized for large, unstructured data with higher latency, not real-time ingestion.

C

SQL Database can handle writes but may struggle with the scale and low-latency requirements of IoT sensor data.

D

Designed for big data analytics, not real-time ingestion and query.

Full explanation →

774

Multi-Selecteasy

A company uses Azure Data Lake Storage Gen2 as the data lake. The data engineering team needs to ensure that sensitive data such as credit card numbers are masked when queried by non-admin users. The solution must be implemented within the data lake without moving data to another store. Which TWO features should they use? (Choose two.)

Select 2 answers

A.Azure Policy to audit access

B.Azure SQL Database dynamic data masking

C.Azure Synapse Serverless SQL with dynamic data masking

D.Microsoft Purview data classification and labeling

E.Azure Storage blob-level access policies

AnswersC, D

Can mask data in queries over external tables.

Why this answer

Options B and C are correct. Azure Synapse Serverless SQL can create external tables over Data Lake storage and apply dynamic data masking. Azure Purview can classify sensitive columns for policy.

Option A is incorrect because ADLS does not have built-in masking. Option D is incorrect because Azure SQL Database is a different store. Option E is incorrect because Azure Policy does not mask data.

Full explanation →

775

MCQmedium

You are designing a data lakehouse architecture in Azure using Delta Lake. The solution needs to process batch and streaming data from multiple sources, including IoT devices and CRM systems. You need to ensure data quality by enforcing schema validation and handling schema evolution. You also need to provide a unified catalog for querying. Which service should you use?

A.Azure Purview

B.Azure Data Lake Storage Gen2

C.Azure Synapse Analytics serverless SQL pool

D.Azure Databricks Unity Catalog

AnswerD

Provides schema enforcement, evolution, and a unified catalog.

Why this answer

Option C is correct because Azure Databricks Unity Catalog provides a unified governance solution for data and AI, including schema enforcement and evolution for Delta Lake. Option A is wrong because Azure Purview is for data discovery and lineage, not for schema enforcement. Option B is wrong because Azure Synapse Analytics is a query engine but does not provide the same schema management features as Unity Catalog.

Option D is wrong because Azure Data Lake Storage is storage only.

Full explanation →

776

MCQmedium

You are using Azure Data Explorer to monitor real-time sensor data. You run the KQL query shown in the exhibit. What is the purpose of this query?

A.To detect anomalies in the sensor data

B.To calculate the average value of a numeric column over time

C.To visualize the count of events every 5 minutes over the last hour

D.To filter events from a specific sensor

AnswerC

The summarize and render timechart achieve this.

Why this answer

Option B is correct because the query counts events per 5-minute bin and renders a timechart, showing the event frequency over time. Option A is wrong because it does not detect anomalies. Option C is wrong because it does not calculate averages.

Option D is wrong because it does not filter specific sensors.

Full explanation →

777

MCQmedium

Your Azure Synapse Analytics dedicated SQL pool query performance is degrading over time. You suspect that the statistics might be outdated. What is the most efficient way to update statistics for all tables in the pool?

A.Enable automatic statistics update and wait for the system to update them.

B.Execute `sp_updatestats` in the master database.

C.Execute `sp_updatestats` in the user database.

D.Run `UPDATE STATISTICS` on each table individually.

AnswerC

Correct: `sp_updatestats` updates statistics for all tables in the current database.

Why this answer

Option C is correct because the stored procedure `sp_updatestats` updates statistics for all tables in the database. Option A is wrong because `UPDATE STATISTICS` requires specifying table and index. Option B is wrong because that procedure is for the master database.

Option D is wrong because automatic update is enabled by default but may not trigger for all tables; manual update is more reliable.

Full explanation →

778

MCQeasy

You are designing a data processing solution in Azure Databricks to transform streaming data from Azure Event Hubs. The data must be aggregated in 1-minute tumbling windows and written to Azure Synapse Analytics. Which Spark API should you use?

A.RDD API

B.Structured Streaming

C.Spark Streaming (DStreams)

D.DataFrame API with batch processing

AnswerB

Supports windowed aggregations and streaming sinks.

Why this answer

Structured Streaming is the correct choice because it provides native support for event-time-based aggregations, such as 1-minute tumbling windows, and integrates seamlessly with Azure Event Hubs as a streaming source and Azure Synapse Analytics as a streaming sink using the `foreachBatch` or `writeStream` API. It offers exactly-once semantics and automatic state management for windowed operations, which are essential for reliable streaming ETL.

Exam trap

The trap here is that candidates confuse the older Spark Streaming (DStreams) API with Structured Streaming, assuming both are equally capable for event-time windows, but DStreams lack native event-time support and are deprecated in favor of Structured Streaming.

How to eliminate wrong answers

Option A is wrong because the RDD API operates at a low level without built-in support for event-time windowing, stateful aggregation, or streaming sinks like Azure Synapse Analytics, requiring manual implementation of checkpointing and fault tolerance. Option C is wrong because Spark Streaming (DStreams) uses micro-batch processing with a DStream API that is now in maintenance mode and lacks native event-time handling, making tumbling window aggregations more complex and less efficient compared to Structured Streaming. Option D is wrong because the DataFrame API with batch processing is designed for static data, not continuous streaming; it cannot process unbounded data from Event Hubs in real time or maintain state for tumbling windows without additional custom orchestration.

Full explanation →

779

MCQeasy

A data engineer is setting up Azure Data Lake Storage Gen2 for a new project. The security requirement is to prevent direct access to the storage account from the internet while allowing access from a specific virtual network. Which network security feature should be enabled?

A.Azure Private Endpoint

B.Shared access signature (SAS)

C.Azure Defender for Storage

D.Firewall and virtual network service endpoints

AnswerD

Allow access from specific VNet.

Why this answer

Option D is correct because Firewall and virtual network service endpoints allow you to restrict access to Azure Data Lake Storage Gen2 to only traffic originating from a specific virtual network, effectively blocking all internet-based access. This is achieved by configuring a service endpoint on the subnet and a firewall rule on the storage account that denies all traffic except that from the designated virtual network.

Exam trap

The trap here is that candidates often confuse Azure Private Endpoint with a complete internet-blocking solution, but Private Endpoint alone does not disable the public endpoint; you must also configure the firewall to deny all public traffic.

How to eliminate wrong answers

Option A is wrong because Azure Private Endpoint uses a private IP address from your virtual network to connect to the storage account, but it does not inherently block all internet access; it provides a private connection but still requires additional firewall rules to fully prevent internet access. Option B is wrong because Shared access signature (SAS) provides delegated access to storage resources via tokens that can be used over the internet, and it does not restrict network-level access from the internet. Option C is wrong because Azure Defender for Storage is a security monitoring and threat detection service, not a network access control mechanism; it does not block or restrict network traffic.

Full explanation →

780

Multi-Selecthard

You are designing data security for an Azure Data Lake Storage Gen2 account that stores sensitive customer data. You need to ensure that only authorized users can access the data and that access can be audited. Which TWO actions should you implement?

Select 2 answers

A.Configure diagnostic settings to send logs to a Log Analytics workspace.

B.Use storage account access keys (Shared Key) for authentication.

C.Generate shared access signatures (SAS) with a long expiry time.

D.Configure a firewall to allow access from all Azure services.

E.Enable Azure Active Directory (Azure AD) authentication for the storage account.

AnswersA, E

Diagnostic settings enable auditing of access.

Why this answer

Option A is correct because configuring diagnostic settings to send logs to a Log Analytics workspace enables auditing of data access events, such as read, write, and delete operations, which is essential for compliance and security monitoring. Option E is correct because Azure AD authentication provides fine-grained access control using role-based access control (RBAC) and eliminates the need for shared keys, aligning with the principle of least privilege.

Exam trap

The trap here is that candidates often confuse 'allowing access from all Azure services' (Option D) with a secure configuration, but it actually bypasses authentication and auditing, making it a security risk rather than a control.

Full explanation →

781

MCQhard

You are designing a disaster recovery plan for an Azure Synapse Analytics dedicated SQL pool. The primary region becomes unavailable. You need to fail over to a secondary region with minimal data loss. The recovery point objective (RPO) is 1 hour. What should you configure?

A.Configure active geo-replication to the secondary region.

B.Use automatic restore points and copy them to the secondary region using Azure Data Factory.

C.Enable geo-backup on the dedicated SQL pool.

D.Create user-defined restore points every hour and store them in the secondary region.

AnswerC

Geo-backup automatically creates backups and replicates to a paired region with a 1-hour RPO.

Why this answer

Option D is correct because the geo-backup feature automatically takes snapshots and replicates them to a paired region with a default RPO of 1 hour. Option A is wrong because the restore point is manual. Option B is wrong because active geo-replication is for Azure SQL Database, not Synapse dedicated SQL pool.

Option C is wrong because automatic restore points are local, not in another region.

Full explanation →

782

Multi-Selectmedium

Which TWO are valid ways to process data in Azure Synapse Analytics?

Select 2 answers

A.Use Logic Apps to run data transformations.

B.Use Azure Functions to process data in a serverless manner.

C.Use Synapse SQL pool to run T-SQL queries.

D.Use Power BI to transform data.

E.Use Synapse Spark notebooks to run Scala code.

AnswersC, E

Synapse SQL pool provides distributed query processing.

Why this answer

Option C is correct because Synapse SQL pool (formerly SQL DW) is a dedicated or serverless SQL engine within Azure Synapse Analytics that allows you to run T-SQL queries for data transformation, loading, and querying. It is a first-class compute resource designed for large-scale data warehousing workloads, making T-SQL queries a valid and primary method for processing data in Synapse.

Exam trap

The trap here is that candidates confuse general Azure services (Logic Apps, Functions, Power BI) with native Synapse Analytics processing capabilities, forgetting that only Synapse SQL and Synapse Spark are first-class compute engines within the service.

Full explanation →

783

MCQmedium

You are migrating an on-premises SQL Server database to Azure Synapse Analytics dedicated SQL pool. The database includes a table with 500 million rows that is frequently queried by date range. Which distribution strategy should you use for this table?

A.Hash distribution on the date column.

B.Hash distribution on an identity column.

C.Replicated distribution.

D.Round-robin distribution.

AnswerA

Hash distribution on the date column allows partition elimination and efficient querying by date range.

Why this answer

Hash distribution on the date column is correct because it distributes rows evenly across distributions based on a hash of the date value, enabling partition elimination and collocated joins when queries filter by date range. This strategy optimizes performance for the frequent date-range queries by ensuring that data for a specific date range is likely stored on the same distribution, minimizing data movement.

Exam trap

The trap here is that candidates often choose round-robin distribution for large tables thinking it is simple and balanced, but they overlook that frequent date-range queries require data collocation and partition elimination, which only hash distribution on the query filter column can provide.

How to eliminate wrong answers

Option B is wrong because hash distribution on an identity column would distribute rows randomly with respect to date, causing high data movement for date-range queries and poor query performance. Option C is wrong because replicated distribution is suitable for small, slowly changing dimension tables (typically < 2 GB), not for a 500-million-row fact table, as it would replicate massive data across all nodes, causing storage and maintenance overhead. Option D is wrong because round-robin distribution distributes rows evenly without any key, leading to full table scans and high data shuffling for date-range queries, which is inefficient for frequently queried tables.

Full explanation →

784

MCQhard

You are reviewing the ARM template above. The storage account is created with hierarchical namespace enabled (isHnsEnabled: true). After deployment, you need to ensure that the 'data-engineers' group can execute but not read the contents of the root directory. What should you do?

A.Modify the ARM template to set the 'isHnsEnabled' property to false and redeploy

B.Assign the Storage Blob Data Reader role to the data-engineers group at the storage account level

C.Configure a firewall rule to allow only the data-engineers group's IP addresses

D.Use the Azure portal to set ACLs on the root directory, granting execute permission to the data-engineers group without read permission

AnswerD

ACLs allow granular permissions; execute alone allows traversal but not listing contents.

Why this answer

Option B is correct because ACLs are the way to set execute permissions at the root directory without granting read. Option A is wrong because RBAC roles at the storage account level grant broad permissions. Option C is wrong because the hierarchical namespace is already enabled.

Option D is wrong because firewall rules control network access.

Full explanation →

785

MCQmedium

You are developing a data processing pipeline in Azure Synapse Analytics. The pipeline uses a mapping data flow to transform data from Azure Data Lake Storage Gen2 to a dedicated SQL pool. The data flow includes a Derived Column transformation that uses the expression: `iif(isNull(Column1), 'Default', Column1)`. However, the transformation is not handling NULL values correctly. What is the most likely cause?

A.The Derived Column transformation does not support the iif function; use a Conditional Split instead.

B.The column data type is not string; convert Column1 to string first.

C.The function names in the expression are case-sensitive; use 'isNull' instead of 'isnull'.

D.The expression must use ternary operator syntax: `Column1 == null ? 'Default' : Column1`.

AnswerC

Mapping data flows are case-sensitive; the correct function name is 'isNull'.

Why this answer

Option C is correct because the `iif` function in Azure Synapse Analytics mapping data flows is case-sensitive. The expression uses `isNull` (with a capital 'N'), but the correct function name is `isNull` (with a capital 'N' and lowercase 'ull'? Actually, the correct function is `isNull` with a capital 'N' and lowercase 'ull'? Wait, the expression in the question uses `isNull(Column1)` which is correct; the issue is that the question says the expression uses `iif(isNull(Column1), 'Default', Column1)` but the answer option C says use 'isNull' instead of 'isnull'. The trap is that the function name is case-sensitive; the correct function is `isNull` (capital 'N'), not `isnull` (all lowercase).

The expression in the question already uses `isNull` with capital 'N', but the answer option C suggests using 'isNull' instead of 'isnull' — this implies the candidate might have typed `isnull` (all lowercase) which would fail. The most likely cause is that the function name was typed incorrectly with wrong casing, as mapping data flows are case-sensitive for function names.

Exam trap

The trap here is that candidates assume function names in Azure Synapse mapping data flows are case-insensitive like in SQL, but they are actually case-sensitive, causing a seemingly correct expression to fail due to a subtle casing error.

How to eliminate wrong answers

Option A is wrong because the Derived Column transformation fully supports the `iif` function for conditional logic; a Conditional Split is not required for simple null handling. Option B is wrong because the `iif` function can handle any data type, and converting to string is unnecessary; the issue is not about data type conversion. Option D is wrong because mapping data flows do not support ternary operator syntax (`? :`); they use the `iif` function for conditional expressions.

Full explanation →

786

MCQeasy

You need to perform incremental data loading from Azure SQL Database to Azure Data Lake Storage Gen2 using Azure Data Factory. Which approach is the most efficient?

A.Use a lookup activity to retrieve the last watermark value, then copy only new records with a filter.

B.Use a tumbling window trigger with a data flow that processes all data each time.

C.Use a mapping data flow with a full load and then use Azure Databricks to deduplicate.

D.Copy the entire table every time and use Azure Synapse serverless SQL to filter duplicates.

AnswerA

Watermark pattern is efficient and well-supported.

Why this answer

Option A is correct because it uses a lookup activity to retrieve the last watermark value (e.g., a timestamp or incrementing key), then copies only new or changed records via a filter in the Copy activity. This minimizes data movement and processing time, making it the most efficient incremental loading approach in Azure Data Factory.

Exam trap

Microsoft often tests the misconception that any trigger-based or full-load approach can be adapted for incremental loading, but the key is minimizing data movement; candidates may overlook the watermark pattern and choose a full-load option because they think deduplication later solves the problem.

How to eliminate wrong answers

Option B is wrong because a tumbling window trigger with a data flow that processes all data each time performs a full load on every run, ignoring incremental logic and wasting resources. Option C is wrong because performing a full load followed by deduplication in Azure Databricks is inefficient; it moves all data repeatedly and adds unnecessary compute overhead. Option D is wrong because copying the entire table every time and using Azure Synapse serverless SQL to filter duplicates still transfers all data, incurring high egress costs and defeating the purpose of incremental loading.

Full explanation →

787

Multi-Selecthard

Which THREE factors should you consider when designing a monitoring strategy for Azure Synapse Analytics dedicated SQL pool performance?

Select 3 answers

A.Use dynamic management views (DMVs) to identify long-running queries.

B.Implement workload classification for resource allocation.

C.Ensure data is evenly distributed across distributions.

D.Configure automatic index rebuild for columnstore indexes.

E.Set up alerts for DWU usage to enable dynamic scaling.

AnswersA, B, E

DMVs like sys.dm_pdw_exec_requests help monitor query performance.

Why this answer

Option A is correct because dynamic management views (DMVs) in Azure Synapse Analytics dedicated SQL pool, such as sys.dm_pdw_exec_requests and sys.dm_pdw_request_steps, provide real-time insight into query execution, allowing you to identify long-running queries, monitor resource consumption, and detect performance bottlenecks. This is a foundational monitoring practice for tuning workload performance.

Exam trap

The trap here is that candidates confuse design or maintenance actions (like data distribution or index rebuilds) with monitoring activities, leading them to select options that are valid optimization steps but not part of a monitoring strategy.

Full explanation →

788

MCQeasy

You are a data engineer at a retail company. You need to develop a data processing solution in Azure Synapse Analytics that reads sales transactions from Parquet files stored in Azure Data Lake Storage Gen2, transforms the data by aggregating daily sales per store, and writes the results to a dedicated SQL pool table for reporting. The transformation logic must be reusable and maintained in a source control system. You want to minimize administrative overhead and leverage serverless resources where possible. Which approach should you recommend?

A.Use Azure Data Factory with Mapping Data Flows to read Parquet files, perform aggregations, and write to the dedicated SQL pool.

B.Use a serverless SQL pool to query the Parquet files via OPENROWSET, then use CETAS to write the aggregated results to the dedicated SQL pool using an external table.

C.Create an Azure Synapse Spark notebook that reads Parquet files, performs aggregation using PySpark, and writes the results to the dedicated SQL pool using the Spark Synapse connector.

D.Use PolyBase in a dedicated SQL pool to create external tables over the Parquet files, then use INSERT...SELECT to load aggregated data into the target table.

AnswerB

Serverless SQL pool can read Parquet natively, and CETAS allows writing to dedicated SQL pool via external table.

Why this answer

Option C is correct because a serverless SQL pool can query Parquet files directly, and the CREATE EXTERNAL TABLE AS SELECT (CETAS) statement can transform and store results in a dedicated SQL pool table using PolyBase. This approach uses serverless resources for transformation and avoids managing Spark pools. Option A is wrong because dedicated SQL pool cannot directly read Parquet without PolyBase or external tables.

Option B is wrong because Azure Data Factory with Mapping Data Flows runs on Spark clusters, incurring more overhead. Option D is wrong because using a notebook in a Spark pool requires provisioning a Spark pool, which adds administrative overhead.

Full explanation →

789

MCQeasy

You are designing a data lake for a manufacturing company that will store sensor readings in Parquet format. The data will be used by data scientists for batch training and by analysts for ad-hoc queries. Which Azure service should you use as the primary storage layer?

A.Azure Data Lake Storage Gen2

B.Azure Synapse SQL Pool

C.Azure Blob Storage

D.Azure SQL Database

AnswerA

ADLS Gen2 provides hierarchical namespace and is optimized for analytics workloads.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct primary storage layer because it combines a hierarchical namespace with Azure Blob Storage's scalable object storage, providing POSIX-compliant access control and directory-level operations. This makes it ideal for storing Parquet files used by data scientists for batch training (via Spark or PolyBase) and by analysts for ad-hoc queries (via Synapse SQL or Azure Databricks), as it supports both high-throughput analytics and fine-grained security.

Exam trap

The trap here is confusing a compute service (Synapse SQL Pool) with a storage service, or assuming that Blob Storage alone is sufficient without considering the hierarchical namespace required for data lake operations.

How to eliminate wrong answers

Option B is wrong because Azure Synapse SQL Pool is a distributed query engine and data warehouse, not a primary storage layer; it queries data from external storage like ADLS Gen2 but does not natively store Parquet files as its primary format. Option C is wrong because Azure Blob Storage lacks a hierarchical namespace, making it unsuitable for directory-level operations and POSIX ACLs required for efficient data lake management, though it can store Parquet files, it is less optimized for analytics workloads. Option D is wrong because Azure SQL Database is a relational database for transactional workloads, not designed for storing large volumes of Parquet files or supporting the schema-on-read patterns needed for data science and ad-hoc analytics.

Full explanation →

790

Multi-Selecthard

A multinational corporation is designing a data lake on Azure Data Lake Storage Gen2. The data must be accessible from multiple regions with low latency, but only one region needs writable access. The solution must also comply with data residency requirements. Which two features or configurations should be implemented? (Choose two.)

Select 2 answers

A.Enable hierarchical namespace on the storage account

B.Use read-access geo-redundant storage (RA-GRS)

C.Deploy Azure Front Door to route read requests to the nearest region

D.Configure geo-replication with manual failover

E.Use Azure Data Lake Storage Gen2 with locally redundant storage (LRS)

AnswersB, C

Why this answer

B is correct because read-access geo-redundant storage (RA-GRS) provides a secondary read-only endpoint in a paired region, enabling low-latency reads from multiple regions while maintaining a single writable primary region. This meets the data residency requirement because the secondary region is within the same geography, and RA-GRS ensures data is replicated asynchronously without allowing writes to the secondary.

Exam trap

The trap here is that candidates often confuse RA-GRS with GRS or fail to recognize that RA-GRS provides a read-only secondary endpoint, which is essential for low-latency reads without compromising the single-writable-region requirement.

Why the other options are wrong

A

Hierarchical namespace is a feature of ADLS Gen2 but does not provide multi-region low-latency reads or data residency.

D

Manual failover does not provide automatic read access from secondary regions; it requires a manual process to promote the secondary.

E

LRS does not replicate data to another region, so it cannot provide multi-region low-latency reads.

Full explanation →

791

MCQmedium

A financial services company is migrating its data warehouse to Azure Synapse Analytics. They have a star schema with a 10-billion-row fact table and 50 dimension tables. Query performance is critical, and they need to minimize data movement during joins. Which distribution strategy should they use for the fact table?

A.Replicated distribution

B.Partitioned distribution

C.Hash distribution on the most frequently joined dimension key

D.Round-robin distribution

AnswerC

Collocates data for joins.

Why this answer

Hash distribution on the most frequently joined dimension key is correct because it co-locates matching rows from the fact and dimension tables on the same compute node, minimizing data movement during joins. For a 10-billion-row fact table, this distribution ensures that the most common join operation is performed locally without shuffling data across nodes, which is critical for query performance in Azure Synapse Analytics.

Exam trap

The trap here is that candidates confuse table partitioning with distribution strategy, assuming 'partitioned distribution' is a valid option, but Azure Synapse Analytics uses hash, round-robin, and replicated distributions only, while partitioning is a separate concept for data organization within a distribution.

How to eliminate wrong answers

Option A is wrong because replicated distribution copies the entire table to each node, which is impractical for a 10-billion-row fact table due to massive storage overhead and write performance degradation. Option B is wrong because partitioned distribution is not a valid distribution strategy in Azure Synapse Analytics; partitioning is a table organization feature for data management, not a distribution method for minimizing data movement. Option D is wrong because round-robin distribution spreads data evenly but randomly across nodes, causing all joins to require data movement (shuffling) since no rows are co-located, leading to poor query performance.

Full explanation →

792

MCQeasy

You need to process a large number of small files (each < 1 MB) from Azure Blob Storage in Azure Synapse Analytics. The processing is I/O-bound due to many small file operations. Which approach should you use to improve performance?

A.Use wildcard paths to read multiple files at once.

B.Enable optimized write on the Spark session.

C.Convert the files to a binary format like Avro before processing.

D.Use 'spark.sql.files.maxPartitionBytes' to coalesce small files into larger partitions.

AnswerD

This configuration merges small files into larger partitions, reducing overhead.

Why this answer

Option D is correct because `spark.sql.files.maxPartitionBytes` controls the maximum number of bytes packed into a single partition when reading files. By increasing this value, Spark coalesces many small files into fewer, larger partitions, reducing the overhead of task scheduling and I/O operations. This directly addresses the I/O-bound bottleneck caused by processing numerous small files in Azure Synapse Analytics.

Exam trap

The trap here is that candidates confuse file format conversion (Avro) or write optimization with read-side partition coalescing, failing to recognize that the core issue is the number of partitions created during file scanning, not the data format or write behavior.

How to eliminate wrong answers

Option A is wrong because wildcard paths only simplify file selection but do not reduce the number of partitions or I/O operations; each small file still becomes a separate partition by default. Option B is wrong because 'optimized write' is a Delta Lake feature that improves write performance by reducing small file output, but it does not help when reading existing small files from Blob Storage. Option C is wrong because converting to Avro changes the serialization format but does not inherently reduce the number of file read operations; the small file problem persists regardless of format.

Full explanation →

793

MCQmedium

Your organization uses Azure Synapse Analytics serverless SQL pools to query data in Azure Data Lake Storage Gen2. You need to ensure that only users with specific Microsoft Entra ID roles can query the data. What should you configure?

A.Assign the Storage Blob Data Contributor role to the users on the Azure Data Lake Storage Gen2 account.

B.Generate a shared access signature (SAS) token for the storage account and include it in the external table definition.

C.Configure an IP firewall rule on the storage account to allow only the SQL pool's outbound IP addresses.

D.Create a managed identity for the SQL pool and grant it access to the storage account.

AnswerA

Correct: The serverless SQL pool uses the caller's Microsoft Entra ID identity to access storage. Assigning RBAC role (e.g., Storage Blob Data Contributor) grants the necessary permissions.

Why this answer

Option A is correct because Azure Synapse Analytics serverless SQL pools rely on Microsoft Entra ID (formerly Azure AD) tokens and RBAC roles to authorize access to data in Azure Data Lake Storage Gen2. The Storage Blob Data Contributor role grants read and write access to data. Option B is wrong because SAS tokens are shared secrets, not tied to user identity.

Option C is wrong because managed identity is not suitable for per-user authorization. Option D is wrong because firewall rules control network access, not user-level authorization.

Full explanation →

794

MCQmedium

Your organization uses Azure Purview for data governance. You need to automatically scan an Azure Data Lake Storage Gen2 account and classify sensitive data such as credit card numbers and social security numbers. What should you configure?

A.Azure Information Protection (AIP) scanner

B.Microsoft Defender for Cloud

C.A new scan rule set in Purview with classification rules for sensitive data types

D.Azure Policy with built-in guest configuration

AnswerC

Purview can create custom scan rule sets that include classification rules to detect sensitive data during scans.

Why this answer

Option B is correct because Purview's classification rules can detect sensitive data patterns. Option A is wrong because Azure Policy is for resource compliance, not data classification. Option C is wrong because Azure Information Protection (now part of Microsoft Purview Information Protection) labels files but requires integration.

Option D is wrong because Microsoft Defender for Cloud is for security posture.

Full explanation →

795

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must ensure that data at rest in a dedicated SQL pool is encrypted using customer-managed keys (CMK) stored in Azure Key Vault. The encryption should be enabled at the database level. What should you configure?

A.Transparent Data Encryption (TDE) with a customer-managed key in Azure Key Vault.

B.Azure Purview data classification and encryption policies.

C.Azure Disk Encryption on the nodes hosting the dedicated SQL pool.

D.Always Encrypted with keys stored in Azure Key Vault.

AnswerA

TDE encrypts the entire database at rest, and CMK can be stored in Azure Key Vault.

Why this answer

Option C is correct because Transparent Data Encryption (TDE) with CMK in Azure Key Vault provides database-level encryption with customer-managed keys. Option A is wrong because Always Encrypted is for column-level encryption, not database-level. Option B is wrong because Azure Disk Encryption is for VMs, not SQL pools.

Option D is wrong because Azure Purview is for data governance, not encryption.

Full explanation →

796

MCQhard

Your Azure Data Factory pipeline uses a Self-Hosted Integration Runtime (SHIR) to copy data from an on-premises SQL Server to Azure Blob Storage. The copy activity is failing with a timeout error after 30 minutes. The data volume is 50 GB. You need to optimize the data transfer performance. Which configuration change should you make first?

A.Increase the 'Degree of copy parallelism'

B.Enable staging copy via Azure Blob Storage

C.Increase the 'Activity retry' count

D.Reduce the 'Data Integration Unit' (DIU) setting

AnswerA

Parallelism improves throughput for large data

Why this answer

Option D is correct because enabling parallel copy (degree of copy parallelism) allows multiple threads to read and write data concurrently, improving throughput for large datasets. Option A is wrong because it reduces resource usage. Option B is wrong because staging is for other scenarios.

Option C is wrong because it affects scheduling, not performance.

Full explanation →

797

MCQhard

A company is using Azure Data Factory to copy data from an on-premises SQL Server to Azure Blob Storage. The data must be encrypted in transit using TLS 1.2. The on-premises SQL Server is configured to support TLS 1.2. Which Data Factory property should be configured?

A.The encryptedCredential property in the linked service

B.The typeProperties property in the linked service to include 'Encrypt=True' in the connection string

C.The connectVia property in the linked service

D.The integrationRuntime property in the dataset

AnswerB

The connection string in typeProperties can include 'Encrypt=True' to enforce TLS encryption.

Why this answer

Option B is correct because TLS 1.2 encryption for data in transit between Azure Data Factory and an on-premises SQL Server is enforced by adding 'Encrypt=True' and 'TrustServerCertificate=False' to the connection string within the typeProperties of the linked service. This ensures the SQL Server Native Client uses TLS 1.2 for the connection, aligning with the requirement.

Exam trap

The trap here is that candidates confuse the encryptedCredential property (which encrypts stored credentials at rest) with the connection string encryption settings (which enforce TLS for data in transit), leading them to select Option A incorrectly.

How to eliminate wrong answers

Option A is wrong because the encryptedCredential property stores the encrypted authentication credential (e.g., password) for the linked service, not the encryption protocol for data in transit. Option C is wrong because the connectVia property specifies the integration runtime (e.g., Self-Hosted IR) used for connectivity, not the encryption settings for the SQL Server connection. Option D is wrong because the integrationRuntime property in a dataset defines the compute environment for data movement, not the connection string encryption properties.

Full explanation →

798

MCQmedium

You are designing a data pipeline to ingest streaming data from IoT devices into Azure Synapse Analytics. The data must be available for querying with minimal latency, but you also need to handle spikes in throughput without data loss. Which service should you use as the ingestion layer?

A.Azure Data Lake Storage Gen2

B.Azure Blob Storage

C.Azure IoT Hub

D.Azure Event Hubs

AnswerD

Event Hubs is optimized for high-throughput streaming data ingestion with buffering.

Why this answer

Azure Event Hubs is the correct choice because it is a fully managed, real-time data ingestion service optimized for high-throughput streaming data from millions of IoT devices. It provides at-least-once delivery, supports partitioning for massive scale, and integrates natively with Azure Synapse Analytics via the Synapse Pipeline or Event Hubs Capture to handle throughput spikes without data loss.

Exam trap

The trap here is that candidates often confuse Azure IoT Hub with Event Hubs, assuming IoT Hub is the default for all IoT streaming, but IoT Hub is designed for device management and lower-throughput telemetry, whereas Event Hubs is the dedicated high-throughput ingestion service for analytics pipelines.

How to eliminate wrong answers

Option A is wrong because Azure Data Lake Storage Gen2 is a hierarchical file storage service designed for batch and analytical workloads, not for real-time streaming ingestion; it lacks native event streaming and buffering capabilities. Option B is wrong because Azure Blob Storage is an object storage service for unstructured data, not built for low-latency, high-throughput event ingestion; it would require additional services like Event Hubs to capture streaming data. Option C is wrong because Azure IoT Hub is a managed service for bi-directional communication with IoT devices, but it is not optimized for high-volume streaming ingestion into Synapse; it is better suited for device management and command/control scenarios, and its default throughput is lower than Event Hubs for pure data ingestion.

Full explanation →

799

MCQmedium

You have an Azure Synapse Analytics serverless SQL pool that queries data in Azure Data Lake Storage Gen2. You need to ensure that only users with specific Microsoft Entra ID groups can access the data through the serverless SQL pool. What should you configure?

A.Grant the Microsoft Entra ID group the Storage Blob Data Reader role on the storage account

B.Grant the Microsoft Entra ID group CONNECT permission on the serverless SQL pool and configure ACLs on the storage to allow read access for the group

C.Configure a firewall rule to allow only the Microsoft Entra ID group IP ranges

D.Use a shared access signature (SAS) token with the SQL pool and distribute it to users

AnswerB

Both SQL permissions and storage ACLs are needed.

Why this answer

Controlling access involves two layers: the SQL pool itself and the underlying storage. Users must have both CONNECT permission on the SQL pool and read permissions on the storage via ACLs (using their managed identity or user identity). Option A is wrong because firewall rules are network-level.

Option B is wrong because storage account keys are not used with Microsoft Entra authentication. Option D is wrong because SAS tokens are not recommended for user-level access.

Full explanation →

800

Drag & Dropmedium

Drag and drop the steps to implement Slowly Changing Dimension (SCD) Type 2 in Azure Synapse Analytics dedicated SQL pool into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

SCD Type 2: stage data, merge to find changes, expire old records, insert new versions, and update attributes (if needed).

Full explanation →

801

MCQhard

You are optimizing an Azure Synapse Analytics dedicated SQL pool. The pool has a large fact table distributed by hash on CustomerID. Most queries filter on OrderDate. You need to improve query performance for date-range queries without changing the distribution. What should you do?

A.Change the distribution to round-robin on OrderDate

B.Create a materialized view on OrderDate

C.Partition the table on OrderDate

D.Create a non-clustered index on OrderDate

AnswerC

Partition pruning eliminates scans of irrelevant partitions.

Why this answer

Adding a partition on OrderDate allows partition elimination, reducing data scanned. Distribution remains unchanged. Option A is wrong because changing to round-robin would increase data movement.

Option B is wrong because adding indexes is not supported in dedicated SQL pool (only clustered columnstore indexes are used). Option D is wrong because materialized views can help but are not specifically for date-range filtering on a large table; partition elimination is more direct.

Full explanation →

802

MCQmedium

You are monitoring an Azure Cosmos DB account using Azure Monitor. The 'Normalized RU Consumption' metric for a container is consistently above 90%. You need to ensure that the container can handle the load without throttling. What should you do?

A.Change the partition key to a different property.

B.Increase the provisioned throughput (RU/s) for the container.

C.Switch the account to serverless mode.

D.Modify the indexing policy to exclude unused paths.

AnswerB

Increasing RU/s provides more capacity, lowering the normalized RU consumption percentage for the same workload.

Why this answer

The 'Normalized RU Consumption' metric indicates the percentage of provisioned throughput (RU/s) being used. Consistently above 90% means the container is operating near its capacity limit, risking throttling (HTTP 429 errors) during traffic spikes. Increasing the provisioned throughput (RU/s) directly raises the capacity, allowing the container to handle the load without throttling.

Exam trap

The trap here is that candidates confuse optimizing RU consumption (e.g., indexing or partition key changes) with the need to increase capacity when the metric already shows the system is at its limit, leading them to choose options that reduce per-request cost rather than addressing the throughput ceiling.

How to eliminate wrong answers

Option A is wrong because changing the partition key does not increase the total throughput; it redistributes existing throughput across partitions, which may improve distribution but does not solve a capacity shortage. Option C is wrong because switching to serverless mode caps throughput at a lower maximum (typically 5,000 RU/s per container) and is intended for intermittent or low-traffic workloads, not for consistently high load. Option D is wrong because modifying the indexing policy to exclude unused paths reduces RU consumption per request, but with normalized RU already above 90%, the reduction is unlikely to bring consumption below the threshold and does not address the root cause of insufficient provisioned capacity.

Full explanation →

803

Multi-Selectmedium

Which TWO strategies reduce data movement in Azure Synapse Analytics pipelines? (Choose two.)

Select 2 answers

A.Use Data Flow to transform data before loading

B.Use Stored Procedure activity to insert data

C.Use serverless SQL pool to query data in place

D.Use Copy activity to move data from source to staging

E.Use PolyBase to load data from external tables

AnswersC, E

Queries without moving data.

Why this answer

Option C is correct because serverless SQL pools in Azure Synapse Analytics allow you to query data directly from files in Azure Data Lake Storage or other external sources without moving the data into a dedicated SQL pool. This eliminates data movement entirely by using the compute resources of the serverless pool to process queries in place, leveraging the OPENROWSET or CREATE EXTERNAL TABLE syntax to read data from its original location.

Exam trap

The trap here is that candidates confuse 'reducing data movement' with 'optimizing data movement'—they may think Data Flow or Copy activity with staging reduces movement when in fact they still move data, whereas serverless SQL and PolyBase query data in place without relocation.

Full explanation →

804

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must process streaming data from IoT devices and store it in a dedicated SQL pool for reporting. The data volume is high (millions of events per hour), and you need to optimize for both ingestion speed and query performance. You also need to ensure that the data can be partitioned by date for efficient maintenance. Which architecture should you recommend?

A.Ingest data to Azure Data Lake Storage Gen2 in Delta format, then use PolyBase to load into a dedicated SQL pool partitioned by date.

B.Use Azure Stream Analytics to write directly to a dedicated SQL pool with a time-based window.

C.Store data in Azure SQL Database with elastic scaling and use linked server queries.

D.Use Event Hubs Capture to store data in Avro files in Blob Storage and then query with external tables.

AnswerA

Optimizes ingestion and query performance.

Why this answer

Option B is correct: Stream data to Azure Data Lake Storage (ADLS) in Delta format, then use PolyBase to load into Synapse dedicated SQL pool. Delta format provides fast reads and optimizations. Option A is incorrect: Direct streaming to SQL pool is not efficient for high volumes.

Option C is incorrect: Azure SQL Database is not designed for large-scale analytics. Option D is incorrect: Event Hubs Capture stores data in Avro, which is less performant for queries than Delta.

Full explanation →

805

MCQhard

Your company uses Azure Synapse Analytics for its enterprise data warehouse. The main fact table, OrdersFact, is distributed using hash on OrderID. It has 10 billion rows. The table is partitioned by month. Recently, the data engineering team added a new column 'OrderStatus' that is used in many queries with filters like 'WHERE OrderStatus = 'Shipped''. These queries are scanning the entire table because the partition pruning is not effective. You need to improve query performance for these status-based queries without redesigning the entire table. What should you do?

A.Repartition the table by OrderStatus

B.Create a non-clustered columnstore index on OrderStatus

C.Change the distribution key to include OrderStatus

D.Create a materialized view that aggregates by OrderStatus and month

AnswerB

A non-clustered columnstore index on OrderStatus allows efficient filtering.

Why this answer

Creating a non-clustered columnstore index on OrderStatus can significantly improve query performance by allowing index seek on status values. Option A is wrong because adding OrderID to the distribution key does not help status queries. Option B is wrong because repartitioning by status is not recommended for a table already partitioned by month and may cause data skew.

Option D is wrong because materialized views can help but are not necessary if indexing solves the problem more directly.

Full explanation →

806

MCQhard

You are using Azure Synapse SQL Pool to store a large fact table partitioned by date. Queries frequently filter on a specific date range and aggregate by a column called 'product_id'. Which table distribution and indexing strategy will minimize query execution time?

A.Hash distribution on product_id with clustered rowstore index

B.Replicated table with clustered columnstore index

C.Round-robin distribution with clustered columnstore index

D.Hash distribution on product_id with clustered columnstore index

AnswerD

Hash distribution enables co-location for aggregation, and columnstore is efficient for large scans.

Why this answer

Option D is correct because hash distribution on product_id ensures that rows with the same product_id are co-located on the same distribution, enabling efficient local aggregation without data movement. The clustered columnstore index provides high compression and fast scan performance for large fact tables, especially when queries filter on a date range and aggregate by product_id.

Exam trap

The trap here is that candidates often choose round-robin distribution (Option C) thinking it is best for large tables, but they overlook that hash distribution on the aggregation column eliminates expensive data shuffling, which is critical for minimizing query execution time.

How to eliminate wrong answers

Option A is wrong because a clustered rowstore index is not optimal for large fact tables with aggregation queries; it lacks the column-level compression and batch processing benefits of columnstore, leading to higher I/O and slower scans. Option B is wrong because replicated tables are designed for small, slowly changing dimension tables, not large fact tables; replicating a large fact table would consume excessive storage and cause high replication overhead. Option C is wrong because round-robin distribution distributes data evenly but does not co-locate rows by product_id, forcing data shuffling across distributions during aggregation, which increases query execution time.

Full explanation →

807

MCQeasy

You are tuning an Azure Stream Analytics job that reads from an Event Hub and writes to an Azure Synapse Analytics table. The job's SU% utilization is consistently at 90%. Which action would most likely reduce the SU% utilization?

A.Decrease the Event Hub throughput units.

B.Partition the output table in Azure Synapse Analytics.

C.Use a reference data join to filter events.

D.Increase the number of streaming units (SU) allocated to the job.

AnswerD

More SUs provide additional compute resources, lowering the utilization percentage for the same workload.

Why this answer

Increasing the number of streaming units (SU) allocated to the job directly adds more compute resources, which reduces the SU% utilization by distributing the workload across more SUs. Since the job is consistently at 90% utilization, adding SUs lowers the per-SU load, preventing throttling and improving throughput. This is the standard scaling approach for Azure Stream Analytics when SU% is high.

Exam trap

The trap here is that candidates often confuse scaling the input source (Event Hub throughput units) or optimizing the output sink (partitioning) with directly addressing the compute bottleneck, but only increasing SUs reduces the compute utilization percentage.

How to eliminate wrong answers

Option A is wrong because decreasing Event Hub throughput units reduces the ingress capacity, which can cause backpressure and increase SU% utilization as the job struggles to keep up with incoming data. Option B is wrong because partitioning the output table in Azure Synapse Analytics improves write throughput but does not affect the compute load on the Stream Analytics job itself, so SU% utilization remains unchanged. Option C is wrong because using a reference data join to filter events adds additional processing overhead (lookups and state management), which would likely increase, not decrease, SU% utilization.

Full explanation →

808

MCQmedium

A company is designing a data lake in Azure Data Lake Storage Gen2 (ADLS Gen2) to store IoT sensor data from millions of devices. The data is ingested in Parquet format, partitioned by date and device ID. The analytics team frequently queries the last 30 days of data for specific device types. Which partition strategy minimizes query cost and optimizes performance?

A.Partition by device ID first, then by date.

B.Partition by device ID only, with a separate directory for each device.

C.Partition by date (yyyy/MM/dd) first, then by device type (e.g., sensor_type=temp).

D.Partition by device type only, with a directory for each type.

AnswerC

This allows date pruning first, then efficient filtering by device type within each day.

Why this answer

Option C is correct because partitioning by date first enables efficient partition pruning for the common query pattern (last 30 days), and then by device type further filters the data within those date partitions. In ADLS Gen2, queries using partition elimination skip entire directories, reducing the amount of data scanned and minimizing query cost. This strategy aligns with the typical query workload, where date-range filtering is the most selective predicate.

Exam trap

The trap here is that candidates often assume partitioning by the most granular attribute (device ID) first will provide the best performance, but they overlook that query patterns typically filter by time range, making date the most effective first-level partition for cost and performance optimization.

How to eliminate wrong answers

Option A is wrong because partitioning by device ID first, then by date, would require scanning all device ID partitions even when querying only recent data, leading to high I/O and cost. Option B is wrong because partitioning only by device ID with a separate directory per device does not support efficient date-range pruning; queries for the last 30 days would need to scan every device directory, which is prohibitively expensive for millions of devices. Option D is wrong because partitioning only by device type would force scanning all date directories for every query, even when the query is limited to a specific time range, resulting in unnecessary data reads and higher costs.

Full explanation →

809

MCQeasy

You are designing a batch processing solution in Azure Data Factory. The source is an Azure Blob Storage container with CSV files. The target is an Azure SQL Database. The pipeline must run daily and incrementally load only new or changed rows. Which Data Factory feature should you use?

A.Mapping Data Flow with a watermark column

B.Stored Procedure activity to run a merge statement

C.Copy activity with a full load every time

D.Change data capture (CDC) resource

AnswerD

CDC enables incremental loading by tracking changes.

Why this answer

Option D is correct because Change Data Capture (CDC) in Azure Data Factory enables incremental loading by capturing insert, update, and delete operations from the source. For Azure Blob Storage CSV files, CDC can be implemented using the 'LastModifiedDate' or a watermark column to identify new or changed rows, avoiding full reloads and reducing latency.

Exam trap

The trap here is that candidates confuse CDC with a full load or assume a stored procedure can handle incremental logic without source-side change tracking, missing that CDC is the only native incremental load feature for Blob Storage in Azure Data Factory.

How to eliminate wrong answers

Option A is wrong because Mapping Data Flow with a watermark column is not a built-in feature for CDC; it requires custom logic to track changes and does not natively handle incremental loads from Blob Storage. Option B is wrong because a Stored Procedure activity to run a merge statement is a target-side operation that assumes the source already provides changed data, but it does not solve the problem of identifying new or changed rows in the source CSV files. Option C is wrong because a Copy activity with a full load every time contradicts the requirement for incremental loading, leading to unnecessary data transfer and performance degradation.

Full explanation →

810

MCQmedium

You are building a data processing solution that requires exactly-once semantics when writing to Azure Event Hubs from Azure Stream Analytics. Which output format should you configure?

A.JSON

B.CSV

C.Parquet

D.Avro

AnswerD

Avro format in Stream Analytics provides exactly-once semantics when writing to Event Hubs.

Why this answer

Azure Stream Analytics supports exactly-once semantics when writing to Event Hubs only when using Avro serialization. This is because Avro provides a compact binary format with embedded schema, enabling Stream Analytics to track and deduplicate events precisely during output, which is required for exactly-once delivery. JSON, CSV, and Parquet do not support the necessary metadata and checkpointing mechanisms for exactly-once guarantees in this specific integration.

Exam trap

The trap here is that candidates often assume JSON is the default or most compatible format for streaming outputs, but they overlook that exactly-once semantics require a binary format with embedded schema support, which only Avro provides in this specific Azure Stream Analytics to Event Hubs integration.

How to eliminate wrong answers

Option A is wrong because JSON is a text-based format that does not support the schema evolution and binary encoding required for Stream Analytics to enforce exactly-once semantics to Event Hubs. Option B is wrong because CSV lacks schema information and binary encoding, making it impossible for Stream Analytics to guarantee exactly-once delivery due to potential data loss or duplication during serialization. Option C is wrong because Parquet is a columnar storage format optimized for analytics and batch processing, not for real-time streaming output to Event Hubs, and Stream Analytics does not support exactly-once semantics with Parquet output to Event Hubs.

Full explanation →

811

MCQeasy

You are building a data pipeline in Azure Data Factory to copy data from an on-premises SQL Server database to Azure Blob Storage. The pipeline must run daily and handle incremental updates. The on-premises SQL Server table has a LastModifiedDate column that is updated when a row changes. What is the most efficient way to implement incremental loads?

A.Enable Change Data Capture (CDC) on the SQL Server database and use an ADF mapping data flow to read changes.

B.Use a Lookup activity to get the maximum LastModifiedDate from the destination, then use a Copy activity with a query that filters rows where LastModifiedDate > that value.

C.Use a tumbling window trigger with a window size of 1 day and copy all data from the source each time.

D.Perform a full load every day and use a Delete activity to remove duplicates.

AnswerB

This is the standard watermark pattern for incremental loads.

Why this answer

Option B is correct because it uses a Lookup activity to retrieve the maximum LastModifiedDate from the destination (Azure Blob Storage), then passes that value as a parameter to a Copy activity that queries only rows where LastModifiedDate exceeds it. This minimizes data transfer by reading only new or changed rows, and it avoids the overhead of enabling Change Data Capture or performing full loads, making it the most efficient approach for incremental loads in Azure Data Factory.

Exam trap

The trap here is that candidates often overcomplicate the solution by choosing CDC (Option A) or full loads (Options C and D), when a simple watermark-based query using the existing LastModifiedDate column is the most efficient and straightforward approach for incremental loads in Azure Data Factory.

How to eliminate wrong answers

Option A is wrong because enabling Change Data Capture (CDC) on SQL Server introduces additional overhead and complexity, and using an ADF mapping data flow for CDC is less efficient than a simple query-based incremental load when a LastModifiedDate column exists. Option C is wrong because using a tumbling window trigger with a 1-day window and copying all data each time performs a full load every run, which is inefficient and wasteful for incremental updates. Option D is wrong because performing a full load daily and then using a Delete activity to remove duplicates is extremely inefficient, as it transfers all data every day and requires additional processing to identify and delete duplicates, negating the benefits of incremental loading.

Full explanation →

812

MCQmedium

Your Azure Synapse Analytics dedicated SQL pool is experiencing performance degradation. Queries that previously completed in seconds now take minutes. You notice high queue wait times in sys.dm_pdw_exec_requests. What is the most likely cause?

A.Outdated statistics

B.A single long-running query blocking others

C.Concurrency throttling due to insufficient resources

D.Data skew in distribution

AnswerC

Queue waits indicate queries are waiting for slots; increase SLO or optimize concurrency.

Why this answer

High queue wait times indicate that queries are waiting for resources. This is typically due to concurrency throttling when the number of concurrent queries exceeds the capacity of the current Service Level Objective (SLO). Option A is wrong because data skew would cause uneven distribution, not necessarily queue waits.

Option C is wrong because a single large query would show high execution time, not queue wait. Option D is wrong because stats being out of date would affect execution plans, not queue waits.

Full explanation →

813

MCQmedium

Refer to the exhibit. You have an Azure Data Factory pipeline that copies trade data from Azure Blob Storage to Azure SQL Database. The pipeline runs every hour and truncates the destination table before each copy. However, users report that data is missing during the copy window. What is the most likely cause?

A.The output dataset is not correctly configured

B.The writeBatchSize is too low, causing timeouts

C.The preCopyScript truncates the table before the copy completes, causing a period with no data

D.The source dataset is set to recursive, which includes unwanted files

AnswerC

Table is empty during copy.

Why this answer

Option B is correct. The preCopyScript truncates the table before copying, so if the copy fails or takes time, the table is empty. Option A is wrong because writeBatchSize does not cause data loss.

Option C is wrong because recursive does not affect data loss. Option D is wrong because outputs and inputs are correct.

Full explanation →

814

Multi-Selecthard

You are tuning a Spark job in Azure Synapse Analytics that processes large Parquet files. The job currently takes too long due to data skew. Which three actions can improve performance? (Choose three.)

Select 3 answers

A.Use bucketing on the join key when writing intermediate data.

B.Add a salt key to the join column to distribute the load.

C.Use coalesce to reduce the number of partitions.

D.Increase executor memory to handle larger partitions.

E.Repartition the data on the skewed column.

AnswersA, B, E

Bucketing pre-partitions data to avoid shuffle and reduce skew.

Why this answer

Options A, B, and E are correct. Repartitioning redistributes data. Salting adds a random key to break skew.

Bucketing pre-partitions data. Option C is wrong because coalesce reduces partitions but does not address skew. Option D is wrong because increasing executor memory may help but does not directly solve skew.

Full explanation →

815

MCQeasy

Your team is developing a data processing solution using Azure Databricks. The data is stored in Delta Lake format in Azure Data Lake Storage Gen2. You need to ensure that when multiple jobs concurrently write to the same Delta table, the operations are atomic and consistent. Which Delta Lake feature should you use?

A.Enable Optimized Write on the Delta table.

B.Enable Auto Optimize on the Delta table.

C.Rely on Delta Lake's built-in ACID transactions.

D.Use Dynamic Partition Pruning in your Spark jobs.

AnswerC

Delta Lake provides ACID transactions, ensuring atomic and consistent concurrent writes.

Why this answer

Delta Lake provides built-in ACID (Atomicity, Consistency, Isolation, Durability) transactions that guarantee atomic and consistent concurrent writes. When multiple jobs write to the same Delta table, Delta Lake uses a transaction log (stored as JSON files in the `_delta_log` directory) to serialize writes, ensuring that each write is either fully committed or rolled back, preventing partial updates or data corruption.

Exam trap

The trap here is that candidates confuse performance-tuning features (Optimized Write, Auto Optimize, Dynamic Partition Pruning) with transactional guarantees, assuming they provide atomicity or consistency when they only address file layout or query speed.

How to eliminate wrong answers

Option A is wrong because Optimized Write is a performance feature that reduces the number of small files written by coalescing partitions, but it does not provide atomicity or consistency guarantees for concurrent writes. Option B is wrong because Auto Optimize is a Delta Lake feature that automatically compacts small files and optimizes file layout, but it does not handle transactional concurrency or atomicity. Option D is wrong because Dynamic Partition Pruning is a Spark SQL optimization that improves query performance by skipping irrelevant partitions during joins, not a mechanism for ensuring atomic or consistent concurrent writes.

Full explanation →

816

Multi-Selecthard

Which THREE measures should you implement to monitor and optimize the performance of Azure Data Lake Storage Gen2?

Select 3 answers

A.Enable Network Security Group flow logs for the storage account subnet.

B.Enable Storage Insights to monitor capacity and transactions.

C.Configure lifecycle management policies to move cold data to archive tier.

D.Use Azure Storage Analytics logs to analyze latency and request rate.

E.Enable Azure Monitor diagnostic settings to capture read and write requests.

AnswersB, D, E

Storage Insights provides performance metrics.

Why this answer

Option A is correct: Metrics like 'Ingress' help identify bottlenecks. Option C is correct: Diagnostic logs provide request details for troubleshooting. Option D is correct: Azure Storage Analytics offers latency and throughput data.

Option B is incorrect: Tiering is for cost optimization, not performance monitoring. Option E is incorrect: NSG flow logs are for network security, not storage performance.

Full explanation →

817

MCQhard

Refer to the exhibit. You are reviewing an Azure Cosmos DB for NoSQL container configuration. The container stores customer orders. The application frequently queries orders by orderId. However, these queries are consuming high RUs and are slow. What is the most likely cause?

A.The unique key on orderId does not index the field; queries by orderId are cross-partition.

B.The _etag field is excluded from indexing, causing high RU for queries that include it.

C.The indexing mode is set to consistent, which causes high RU usage.

D.The container is provisioned with only 400 RU/s, which is too low for the workload.

AnswerA

Unique keys do not serve as indexes; the query must specify customerId to be efficient.

Why this answer

Option A is correct because the unique key constraint on `orderId` does not automatically create an index for that field; it only enforces uniqueness. Without an explicit indexing policy that includes `orderId`, queries filtering by `orderId` must perform a cross-partition scan, which consumes high RUs and is slow. In Azure Cosmos DB for NoSQL, queries that cannot be served from a single physical partition due to missing index or partition key mismatch result in fan-out across all partitions, dramatically increasing RU consumption.

Exam trap

Microsoft often tests the misconception that a unique key constraint automatically creates an index for querying, when in fact unique keys only enforce uniqueness and do not affect query performance; the trap is that candidates confuse unique keys with indexing policies.

How to eliminate wrong answers

Option B is wrong because `_etag` is a system property used for optimistic concurrency control, and excluding it from indexing does not cause high RU for queries that include it; queries that reference `_etag` would simply not benefit from an index, but the primary issue is the missing index on `orderId`. Option C is wrong because the indexing mode 'consistent' is the default and recommended mode for balanced write/read performance; it does not inherently cause high RU usage. Option D is wrong because while 400 RU/s is low, the question states that queries by `orderId` are consuming high RUs and are slow, which points to a query design issue (cross-partition) rather than insufficient throughput; increasing RU/s without fixing the indexing would only mask the problem.

Full explanation →

818

Multi-Selectmedium

Which TWO factors should you consider when choosing between Azure SQL Database and Azure Cosmos DB for a transactional workload that requires low-latency reads and writes globally?

Select 2 answers

A.ACID transaction support across multiple documents.

B.Support for secondary indexes.

C.Consistency models (strong, bounded staleness, session, eventual, consistent prefix).

D.Ability to run stored procedures.

E.Global distribution capabilities with multi-region writes.

AnswersC, E

Cosmos DB offers five consistency models; SQL Database offers snapshot isolation.

Why this answer

Option C is correct because Azure Cosmos DB offers multiple well-defined consistency models (strong, bounded staleness, session, consistent prefix, and eventual) that allow you to tune the trade-off between consistency and latency for globally distributed workloads. For a transactional workload requiring low-latency reads and writes globally, choosing the appropriate consistency model (e.g., session or eventual) can significantly reduce write latency by avoiding the overhead of synchronous replication required for strong consistency. This flexibility is a key factor when designing for global distribution with multi-region writes.

Exam trap

The trap here is that candidates often assume ACID transactions (Option A) or stored procedures (Option D) are unique to one service, when in fact both Azure SQL Database and Azure Cosmos DB support these features, making them irrelevant as distinguishing factors for global low-latency workloads.

Full explanation →

819

MCQeasy

You are designing a data processing solution in Azure Databricks. The data is stored in Azure Data Lake Storage Gen2 and you need to perform transformations using Apache Spark. The security requirements mandate that all data in transit must be encrypted and that the storage account must not be accessible from the public internet. What should you configure?

A.Enable the storage account firewall and add a private endpoint for Azure Databricks to use.

B.Disable TLS on the storage account and use a shared access signature (SAS) token for authentication.

C.Use Azure Databricks with VNet injection and configure a service endpoint for the storage account.

D.Configure the storage account to use HTTPS only and enable firewall rules to allow only Azure services.

AnswerA

Private endpoint ensures private connectivity and encryption in transit.

Why this answer

Option A is correct because it satisfies both security requirements: encrypting data in transit and preventing public internet access. A private endpoint uses Azure Private Link to connect Azure Databricks to the storage account over the Microsoft backbone network, ensuring all traffic stays within the Azure network and is encrypted via TLS. The storage account firewall is then configured to deny all public traffic, so only the private endpoint can access the storage account, meeting the 'not accessible from the public internet' mandate.

Exam trap

The trap here is confusing service endpoints (which still use public endpoints) with private endpoints (which use private IPs and fully isolate the resource from the internet), leading candidates to pick Option C thinking VNet injection plus service endpoints provides complete public internet isolation.

How to eliminate wrong answers

Option B is wrong because disabling TLS removes encryption in transit, violating the requirement that all data in transit must be encrypted; SAS tokens provide authentication but do not encrypt the channel. Option C is wrong because a service endpoint still exposes the storage account to the public internet (it only restricts source IPs via the firewall), so it does not meet the 'not accessible from the public internet' requirement; VNet injection alone does not enforce private connectivity. Option D is wrong because enabling HTTPS only and allowing only Azure services still leaves the storage account accessible from the public internet (Azure services can originate from public IPs), failing the requirement to block all public internet access.

Full explanation →

820

MCQeasy

You are designing a data storage solution for real-time streaming data from IoT devices. The data must be stored in its original format for immediate processing and later transformed for analytics. Which Azure service should you use for raw data ingestion?

A.Azure Data Lake Storage Gen2

B.Azure Event Hubs

C.Azure Stream Analytics

D.Azure Data Factory

AnswerB

C is correct because Event Hubs is designed for real-time data ingestion.

Why this answer

Azure Event Hubs is a fully managed, real-time data ingestion service optimized for high-throughput streaming data from IoT devices. It can receive millions of events per second, store them in a partitioned, ordered log for immediate processing, and retain them for up to 7 days (or longer with Event Hubs Capture) for later transformation and analytics. This makes it the correct choice for raw data ingestion before any transformation occurs.

Exam trap

The trap here is that candidates confuse data ingestion (Event Hubs) with data storage (Data Lake Storage) or data processing (Stream Analytics), assuming a single service must handle both raw capture and transformation, when in fact the question explicitly asks for raw data ingestion only.

How to eliminate wrong answers

Option A is wrong because Azure Data Lake Storage Gen2 is a hierarchical file store for analytics workloads, not a real-time ingestion endpoint; it cannot natively accept streaming events at high velocity without an intermediary like Event Hubs or IoT Hub. Option C is wrong because Azure Stream Analytics is a stream processing engine that consumes data from sources like Event Hubs and performs real-time transformations, but it does not store raw data itself—it outputs results to sinks. Option D is wrong because Azure Data Factory is a cloud-based ETL and orchestration service for batch and scheduled data movement, not designed for real-time, high-throughput streaming ingestion from IoT devices.

Full explanation →

821

MCQhard

Your organization uses Azure Purview for data governance. You need to ensure that only authorized users can register data sources and create classification rules, while other data consumers can only search and browse the data catalog. What should you configure?

A.Assign the Catalog Admin role to curators and Data Reader role to consumers

B.Assign the Data Curator role to curators and Data Reader role to consumers

C.Assign the Collection Admin role to curators and Data Reader role to consumers

D.Assign the Data Source Administrator role to curators and Data Reader role to consumers

AnswerB

Data Curator has full catalog management; Data Reader has read-only access.

Why this answer

Azure Purview uses roles: Data Curator can register sources and manage classifications; Data Reader can only search and browse. Option A is wrong because Data Source Administrator can only manage source registrations. Option C is wrong because Collection Admin manages collections.

Option D is wrong because there is no 'Catalog Admin' role; Purview has built-in roles.

Full explanation →

822

MCQeasy

You need to execute a T-SQL stored procedure in Azure Synapse dedicated SQL pool that performs a large data load. The stored procedure takes approximately 45 minutes to run. You want to monitor the progress and see the statement text currently being executed. Which dynamic management view (DMV) should you query?

A.sys.dm_workload_management_workload_groups_stats

B.sys.dm_pdw_exec_requests (with status='running')

C.sys.dm_pdw_waits

D.sys.dm_pdw_exec_requests

AnswerB

Filtering by status='running' shows currently executing requests with the command text.

Why this answer

B is correct because `sys.dm_pdw_exec_requests` with the filter `WHERE status = 'running'` returns the currently executing requests in an Azure Synapse dedicated SQL pool, including the full statement text in the `command` column. This allows you to see the exact T-SQL being executed during the 45-minute data load, enabling progress monitoring.

Exam trap

The trap here is that candidates often choose D without the `status='running'` filter, thinking it shows current execution, but it returns all historical requests, requiring an additional filter to isolate the active one.

How to eliminate wrong answers

Option A is wrong because `sys.dm_workload_management_workload_groups_stats` provides statistics about workload group resource utilization (e.g., CPU, memory), not the currently executing statement text. Option C is wrong because `sys.dm_pdw_waits` shows wait information (e.g., locks, waits on resources) but does not include the statement text of running requests. Option D is wrong because `sys.dm_pdw_exec_requests` without the `status='running'` filter returns all requests (completed, failed, cancelled, etc.), not just the currently executing one, so you would not see the active statement text without additional filtering.

Full explanation →

823

MCQeasy

A company uses Azure Data Lake Storage Gen2 with hierarchical namespace enabled. They need to restrict a specific application's access to only write files in a particular directory without being able to read or list files. Which type of permission should be assigned?

A.Configure a firewall rule to allow only the application's IP address.

B.Configure an access control list (ACL) that grants execute and write permissions to the application's service principal.

C.Generate a shared access signature (SAS) with write and list permissions.

D.Assign the Storage Blob Data Contributor role at the storage account level.

AnswerB

ACLs allow fine-grained write-only access without read or list.

Why this answer

Option C is correct because POSIX-style ACLs allow granular permissions like write-only. Option A is wrong because RBAC roles like Storage Blob Data Contributor include read and list permissions. Option B is wrong because SAS tokens can be scoped but not to write-only at directory level easily.

Option D is wrong because firewall rules control network access, not permissions.

Full explanation →

824

MCQhard

Refer to the exhibit. You have a Mapping Data Flow in Azure Data Factory that reads JSON files from a folder partitioned by year/month/day. The source setting includes a row limit of 10,000. The sink writes Parquet files with a file pattern and partition columns. You notice that the job processes only the first 10,000 rows from the entire dataset instead of 10,000 rows per partition. How should you modify the data flow to achieve row limit per partition?

A.Remove the rowLimit from source settings. In the Optimize tab of the source, set 'Partition option' to 'Set partitioning' with 'Value' as 'year,month'. Then add a 'Top N' transformation after the source to limit rows per partition.

B.Add a 'RowNumber' transformation after the source, then filter rows where row number <= 10000 per partition.

C.In the source settings, change rowLimit to a dynamic expression that uses partition variables to limit per partition.

D.Use a 'Distinct' transformation with 'All columns' to remove duplicate rows, which inherently limits rows.

AnswerA

This approach partitions the data first and then applies a per-partition row limit using Top N.

Why this answer

The row limit in the source settings applies to the entire dataset, not per partition. To achieve row limit per partition, you need to set the row limit in a derived column or filter transformation after partitioning. However, the correct approach is to remove the row limit from source and instead use a 'Top N' transformation after a 'Partition' transformation, but Data Flow does not have a Top N per partition.

The practical solution is to use a 'Window' transformation to rank rows within each partition and then filter. Alternatively, you can use a 'SurrogateKey' to assign row numbers per partition. The correct answer here is to enable 'Sampling' within each partition using the 'Optimize' tab, but the closest option is to apply the row limit in a filter transformation after partitioning.

Option A correctly suggests using the 'Optimize' tab to set 'Partition' and then apply a 'Top N' filter. Option B is wrong because row limit cannot be set per partition in source settings. Option C is wrong because Data Flows do not have a 'RowNumber' transformation.

Option D is wrong because 'Distinct' does not limit rows.

Full explanation →

825

MCQhard

You are developing a data processing solution in Azure Synapse Analytics. The solution must support both batch and streaming data ingestion into a dedicated SQL pool. You need to ensure that data from streaming sources is available for queries within 5 seconds. Which approach should you use?

A.Use Azure Stream Analytics with a custom SQL function that writes directly to the dedicated SQL pool

B.Use Azure Databricks with Structured Streaming, write to Data Lake Storage, and then use PolyBase to load into SQL pool

C.Use Azure Data Factory with tumbling window triggers to load data from Event Hubs every 5 seconds

D.Use Event Hubs Capture to write to Data Lake Storage, then use PolyBase to load into the SQL pool every 5 seconds

AnswerA

Stream Analytics can achieve sub-second latency and write directly to SQL pool via a stored procedure.

Why this answer

Using Azure Stream Analytics with a custom function to write to the dedicated SQL pool via a stored procedure can achieve low latency. Option A is wrong because Azure Data Factory is batch-only. Option B is wrong because Event Hubs Capture writes in batches (minutes).

Option D is wrong because Spark Structured Streaming to Data Lake and then PolyBase introduces additional latency.

Full explanation →

Page 11 of 12

All pages

1 2 3 4 5 6 7 8 9 10 11 12

Practice DP-203 by domain

Target a specific domain to shore up weak areas.

Secure, monitor, and optimize data storage and data processing Design and develop data processing Design and implement data security Monitor and optimize data storage and processing Design and implement data storage Develop data processing

See all domains with question counts →