Knowledge + Practice

Microsoft Azure Data Engineer Associate DP-203 (DP-203) — Questions 376–450

846 questions total · 12 pages · All types, answers revealed

376

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must support incremental loading of data from an Azure SQL Database to a dedicated SQL pool using PolyBase. Which approach should you use to minimize data movement and maximize performance?

A.Use the bcp utility to export data from Azure SQL Database to a text file, then bulk insert into the dedicated SQL pool.

B.Create external tables in the dedicated SQL pool that reference the source data, then use CREATE TABLE AS SELECT (CTAS) to load incrementally.

C.Use Azure Data Factory with a copy activity to load data into staging tables, then merge into the target table.

D.Use Azure Databricks to read the source data, apply transformations, and write to the dedicated SQL pool using the Spark connector.

AnswerB

PolyBase external tables enable direct query of source data, and CTAS allows efficient incremental loading with minimal data movement.

Why this answer

Option B is correct because using external tables with PolyBase in Azure Synapse Analytics allows you to directly query the source Azure SQL Database without moving the data first. The CREATE TABLE AS SELECT (CTAS) statement then loads only the incremental data into the dedicated SQL pool, minimizing data movement by leveraging PolyBase's parallel streaming capability for maximum performance.

Exam trap

The trap here is that candidates often assume external tables are only for static data or Hadoop, but PolyBase in Synapse supports external tables against Azure SQL Database for efficient incremental loading, making options that introduce extra hops (like Data Factory or bcp) seem more familiar but less optimal.

How to eliminate wrong answers

Option A is wrong because the bcp utility exports data to a text file, which introduces an intermediate storage step and additional I/O overhead, increasing data movement and latency compared to direct PolyBase access. Option C is wrong because Azure Data Factory copy activity moves data through an intermediate staging area (e.g., Azure Blob Storage), which adds extra data transfer and storage costs, whereas PolyBase can read directly from the source without staging. Option D is wrong because Azure Databricks with the Spark connector requires moving data out of Azure SQL Database into a Spark cluster for processing, then writing back to the dedicated SQL pool, which increases data movement and complexity compared to the native PolyBase approach.

377

MCQmedium

Your team is using Azure Synapse Analytics to process sensitive customer data. You need to ensure that column-level security is applied to a specific table so that only users with a certain role can view certain columns. Which feature should you use?

A.Column-level security (CLS)

B.Row-level security (RLS)

C.Azure Purview data policies

D.Dynamic data masking (DDM)

AnswerA

CLS restricts column access based on the user's role or group membership.

Why this answer

Option A is correct because column-level security in Azure Synapse restricts access to individual columns based on the user's role or group membership. Option B is incorrect because row-level security restricts rows, not columns. Option C is incorrect because Azure Purview is a data governance service, not a mechanism for column-level security. Option D is incorrect because dynamic data masking obfuscates data but does not restrict access.

378

MCQhard

An Azure Data Factory pipeline runs multiple times daily, loading data from an on-premises SQL Server to Azure Blob Storage. You notice that the pipeline sometimes fails due to transient network errors. You need to implement a retry policy with exponential backoff. Which configuration should you apply?

A.Set the pipeline's retry property to 3 and retry interval to 60 seconds.

B.Set the activity's retry property to 3 and enable exponential backoff.

C.Set the activity's retry property to 3 and retry secs to 60.

D.Set the trigger's retry policy to 3 with exponential backoff.

AnswerB

Activity-level retry with exponential backoff automatically increases wait time.

Why this answer

Option B is correct because the retry policy, including exponential backoff, is configured on the activity. Option A is wrong because it describes a simple retry at a fixed 60-second interval, not exponential backoff, and retry settings belong to activities rather than to the pipeline as a whole. Option C is wrong because the activity-level retry interval in seconds is a fixed interval, not exponential backoff. Option D is wrong because a retry policy on a trigger governs reruns of the triggered pipeline run, not retries of the failing activity.
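To make the difference between a fixed retry interval and exponential backoff concrete, here is a minimal Python sketch. It is not Data Factory's implementation: the function names, the 60-second base, and the doubling factor are assumptions chosen only for illustration.

```python
def backoff_delays(retries: int, base_seconds: float = 60.0, factor: float = 2.0) -> list[float]:
    """Wait before each retry attempt, growing by `factor` every time (assumed policy)."""
    return [base_seconds * factor ** attempt for attempt in range(retries)]


def fixed_delays(retries: int, interval_seconds: float = 60.0) -> list[float]:
    """Wait before each retry attempt when the interval never changes."""
    return [interval_seconds] * retries


print(backoff_delays(3))  # [60.0, 120.0, 240.0]
print(fixed_delays(3))    # [60.0, 60.0, 60.0]
```

With three retries, the fixed policy waits 60 seconds each time, while the backoff policy spaces attempts further apart, which gives a transient network fault more time to clear.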

379

Multi-Selecteasy

You are using Azure Stream Analytics to process real-time data from an IoT hub. The output is sent to Azure Blob Storage for long-term storage. You need to ensure that the output files are partitioned by date and hour for easy querying. Which THREE configurations should you set? (Choose three.)

Select 3 answers

A.Use a path pattern that includes {date} and {time} tokens.

B.Configure the event ordering policy to adjust late events.

C.Set the output serialization format to Avro or Parquet.

D.Set the compatibility level to 1.2 or higher.

E.Enable 'Write to blob storage partitioned by time' in the output settings.

AnswersA, C, E

Tokens in the path pattern create folder structure based on date and time.

Why this answer

Options A, C, and E are correct. Option A: a path pattern with the {date} and {time} tokens creates the date and hour folder structure. Option C: setting the output format to Avro or Parquet is common for partitioned data. Option E: enabling 'Write to blob storage partitioned by time' in the output settings allows time-based partitioning. Option B is wrong because the event ordering policy does not control output partitioning. Option D is wrong because the compatibility level does not affect output partitioning.
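As a rough illustration of how {date} and {time} tokens turn into a folder hierarchy, the sketch below expands a path pattern for one event timestamp. The helper `expand_path_pattern` and the `telemetry` prefix are hypothetical, and Python `strftime` codes stand in for the date and time format settings of the Stream Analytics output.

```python
from datetime import datetime, timezone


def expand_path_pattern(pattern: str, ts: datetime,
                        date_format: str = "%Y/%m/%d", time_format: str = "%H") -> str:
    """Replace the {date} and {time} tokens with the formatted timestamp."""
    return (pattern
            .replace("{date}", ts.strftime(date_format))
            .replace("{time}", ts.strftime(time_format)))


event_time = datetime(2024, 3, 15, 9, 30, tzinfo=timezone.utc)
print(expand_path_pattern("telemetry/{date}/{time}", event_time))  # telemetry/2024/03/15/09
```

Every event from the same hour resolves to the same folder, which is what makes the output easy to query by date and hour.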

380

MCQhard

You are designing a data processing solution for a financial services company. The solution must process sensitive customer data and comply with GDPR. The data will be stored in Azure Synapse Analytics. You need to ensure that only authorized users can view specific columns (e.g., credit card numbers). Which security feature should you implement?

A.Row-level security (RLS)

B.Column-level security

C.Dynamic data masking

D.Microsoft Defender for Cloud

AnswerB

Column-level security restricts access to specific columns.

Why this answer

Option B is correct because column-level security in Azure Synapse allows you to restrict access to specific columns for specific users or roles. Option A is wrong because row-level security restricts rows, not columns. Option C is wrong because dynamic data masking obfuscates data but does not prevent access entirely. Option D is wrong because Microsoft Defender for Cloud is a security monitoring tool, not a column-level access control.

381

Drag & Dropmedium

Drag and drop the steps to configure Azure Databricks auto-scaling cluster for ETL workloads into the correct order.

Why this order

First create the workspace, then the cluster with auto-scaling settings, choose runtime, attach libraries, and set policies.

382

MCQeasy

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a dedicated SQL pool to store fact and dimension tables. The fact table is expected to have billions of rows. Which distribution strategy should you recommend for the fact table to optimize query performance and minimize data movement?

A.Round-robin distribution.

B.Partitioned table with a partition key.

C.Hash distribution on a column that is frequently used in joins and aggregations.

D.Replicated distribution.

AnswerC

Hash distribution collocates rows with the same key, reducing data movement.

Why this answer

Hash distribution on a column frequently used in joins and aggregations is the best choice for a fact table with billions of rows in a dedicated SQL pool. It distributes rows across distributions based on a hash of the distribution column, ensuring that rows with the same key value are co-located on the same distribution. This minimizes data movement during joins and aggregations, as the data required for these operations is already local to each distribution, significantly improving query performance.
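The co-location property can be sketched in a few lines of Python. This is an illustration only: the SHA-256 based hash is a stand-in rather than the function Synapse uses internally, and the `sales` rows and `customer_id` column are hypothetical. The figure of 60 distributions is the fixed number used by a dedicated SQL pool.

```python
import hashlib
from collections import defaultdict

DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions


def distribution_for(key: str) -> int:
    """Map a distribution-column value to a distribution (illustrative hash only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def distribute(rows: list[dict], column: str) -> dict[int, list[dict]]:
    """Group rows by the distribution their key hashes to."""
    placed = defaultdict(list)
    for row in rows:
        placed[distribution_for(str(row[column]))].append(row)
    return dict(placed)


sales = [
    {"customer_id": "C1", "amount": 10},
    {"customer_id": "C2", "amount": 5},
    {"customer_id": "C1", "amount": 7},
]
for dist, rows in distribute(sales, "customer_id").items():
    print(dist, rows)
```

Both C1 rows always land in the same distribution, so a join or aggregation on `customer_id` can be computed locally without shuffling rows between distributions.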

Exam trap

The trap here is that candidates often confuse partitioning with distribution, thinking that partitioning alone can optimize data movement across nodes, but partitioning operates within a distribution and does not affect how data is distributed across compute resources.

How to eliminate wrong answers

Option A is wrong because round-robin distribution distributes rows evenly without considering data relationships, which leads to excessive data movement during joins and aggregations, degrading performance for large fact tables. Option B is wrong because partitioning is a data organization technique within a distribution, not a distribution strategy; it helps with data management and partition elimination but does not control how data is distributed across compute nodes, so it cannot minimize data movement across distributions. Option D is wrong because replicated distribution copies the entire table to each compute node, which is impractical for a fact table with billions of rows due to massive storage overhead and write performance penalties; it is intended for small dimension tables, not large fact tables.

383

MCQmedium

Refer to the exhibit. A custom RBAC role is defined as shown. A user is assigned this role at the resource group scope. Which operation can the user perform?

A.Delete containers

B.Write blob data to containers

C.List containers in a storage account within DataRG

D.Read blob data from containers

AnswerC

The action permits reading container properties and listing containers.

Why this answer

The custom RBAC role includes the 'Microsoft.Storage/storageAccounts/blobServices/containers/read' action, which allows listing containers. Since the user is assigned this role at the resource group scope (DataRG), they can list containers in any storage account within that resource group. The role does not include any data plane actions (e.g., read/write/delete blob data) or container deletion permissions, so only the list operation is permitted.

Exam trap

The trap here is that candidates often confuse control plane container listing permissions with data plane blob read permissions, assuming that 'read' on containers implies access to blob content, whereas Azure RBAC strictly separates these scopes.

How to eliminate wrong answers

Option A is wrong because deleting containers requires the 'Microsoft.Storage/storageAccounts/blobServices/containers/delete' action, which is not included in the role. Option B is wrong because writing blob data requires a data plane permission such as 'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write', which is absent; the control plane 'Microsoft.Storage/storageAccounts/blobServices/containers/write' action creates or modifies containers rather than blob content, and it is not granted either. Option D is wrong because reading blob data requires the 'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read' action, which is not granted; the role only grants read access to container metadata (listing), not to blob content.
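The split between control plane actions and data plane actions can be sketched as a simple lookup. The exhibit is not reproduced here, so the `ROLE` dictionary below is an assumption built only from what the explanation states (one container read action, no data actions); `is_permitted` is a hypothetical helper and ignores wildcards and NotActions.

```python
BASE = "Microsoft.Storage/storageAccounts/blobServices/containers"

ROLE = {
    # Control plane action named in the explanation; the rest of the exhibit is not shown.
    "actions": [f"{BASE}/read"],
    # Assumed empty: the explanation says no data plane actions are granted.
    "data_actions": [],
}


def is_permitted(role: dict, operation: str, data_plane: bool) -> bool:
    """Check an operation against the matching list (exact, case-insensitive match)."""
    granted = role["data_actions"] if data_plane else role["actions"]
    return operation.lower() in {g.lower() for g in granted}


CHECKS = {
    "List containers": (f"{BASE}/read", False),
    "Delete containers": (f"{BASE}/delete", False),
    "Read blob data": (f"{BASE}/blobs/read", True),
    "Write blob data": (f"{BASE}/blobs/write", True),
}
for name, (operation, data_plane) in CHECKS.items():
    print(f"{name}: {is_permitted(ROLE, operation, data_plane)}")
```

Only the container listing check returns True, which mirrors why option C is the sole permitted operation.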

384

MCQmedium

Refer to the exhibit. An ARM template deploys an Azure Synapse Analytics workspace. What is the purpose of the 'managedVirtualNetwork' property set to 'default'?

A.It configures the workspace to use a user-assigned managed identity

B.It disables public network access to the workspace

C.It creates the workspace in a private endpoint configuration

D.It enables the workspace to use a managed virtual network for network isolation

AnswerD

Managed VNet provides isolation.

Why this answer

Setting 'managedVirtualNetwork' to 'default' in an ARM template for Azure Synapse Analytics enables a managed virtual network that provides network isolation for the workspace. This allows the workspace to use private endpoints and managed private endpoints for secure data integration without exposing traffic to the public internet. It is a foundational setting for implementing a secure, network-isolated Synapse environment.

Exam trap

The trap here is that candidates confuse 'managedVirtualNetwork' with directly creating private endpoints or disabling public access, when in fact it is the prerequisite that enables the workspace to use a managed virtual network for network isolation, with private endpoints and public network access controls being separate configurations.

How to eliminate wrong answers

Option A is wrong because the 'managedVirtualNetwork' property controls network isolation, not identity configuration; user-assigned managed identities are configured via the 'identity' property in the ARM template. Option B is wrong because disabling public network access is a separate setting (e.g., 'publicNetworkAccess' property), not the purpose of 'managedVirtualNetwork'. Option C is wrong because setting 'managedVirtualNetwork' to 'default' does not directly create the workspace in a private endpoint configuration; it enables the managed virtual network, and private endpoints must be explicitly created within that network for specific resources.

385

MCQhard

Refer to the exhibit. You are creating a serverless SQL table in Azure Synapse Analytics that reads Parquet files from the specified location. The folder contains multiple Parquet files with different schemas. When querying the table, you get an error about schema mismatch. What is the most likely reason?

A.The Parquet files are not using the .parquet extension.

B.The derivedModel option is set to false, which disables schema inference.

C.The serverless SQL pool infers schema from the first file and expects all files to have the same schema.

D.The recursive option is causing the table to include files from subfolders that have different schemas.

AnswerC

Serverless SQL uses schema inference from the first file; subsequent files with different schemas cause errors.

Why this answer

Serverless SQL pools infer the schema from the first file encountered. If files have different schemas, the inference may be inconsistent. Option C is correct because the schema inference is done on the first file, and mismatches in later files cause errors. Option A is wrong because the file extension does not cause a schema mismatch. Option B is wrong because derivedModel is set to false, which means no model is used, but that does not cause a schema mismatch. Option D is wrong because the recursive option only controls whether subfolders are read; it is not itself the source of the differing schemas.
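One way to catch this before querying is to compare every file's schema against the first file's. The sketch below does that over plain dictionaries; the file names and type strings are hypothetical, and this is a pre-flight check under those assumptions, not a description of what the serverless SQL pool does internally.

```python
def find_schema_mismatches(file_schemas: dict[str, dict[str, str]]) -> list[str]:
    """Return the files whose schema differs from the first file's schema."""
    names = list(file_schemas)
    if not names:
        return []
    reference = file_schemas[names[0]]
    return [name for name in names[1:] if file_schemas[name] != reference]


schemas = {
    "part-0001.parquet": {"id": "int", "amount": "decimal"},
    "part-0002.parquet": {"id": "int", "amount": "decimal"},
    "part-0003.parquet": {"id": "int", "amount": "varchar", "region": "varchar"},
}
print(find_schema_mismatches(schemas))  # ['part-0003.parquet']
```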

386

MCQeasy

You are designing a data processing pipeline using Azure Data Factory. The pipeline must ingest data from an HTTP endpoint that returns a JSON array. The data must be transformed by flattening nested arrays and then loaded into an Azure SQL Database table. The pipeline should be triggered daily. You need to choose the appropriate activities and transformations. The solution must be cost-effective and easy to maintain. Which combination of activities should you use?

A.Use a Lookup activity to read the JSON, then a ForEach activity to iterate and insert rows into SQL Database.

B.Use a Copy activity to ingest data from the HTTP source into Azure Blob Storage, then a Data Flow activity with a Flatten transformation to flatten the JSON, and finally a Copy activity to load into SQL Database.

C.Use a Data Flow activity directly from HTTP source with a Flatten transformation and sink to SQL Database.

D.Use two Copy activities: one to copy JSON to Blob Storage, and another to copy from Blob Storage to SQL Database without transformation.

AnswerB

This is the standard pattern: ingest, transform, load.

Why this answer

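Option B is correct. A Copy activity can ingest the JSON from the HTTP endpoint into Blob Storage, a Data Flow with a Flatten transformation can flatten the nested arrays, and a final Copy activity loads the result into SQL Database. Option A is wrong because a Lookup activity is meant for retrieving a small result set to drive control flow, not for data ingestion. Option C is wrong because the HTTP connector is supported as a source for the Copy and Lookup activities, not for mapping data flows. Option D is wrong because two Copy activities without a transformation leave the nested arrays unflattened.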
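To show what flattening a nested array means for the row shape, here is a small Python sketch of the idea. It is not the Data Flow Flatten transformation itself; the `flatten` helper, the `orders` payload, and the assumption that each array element is an object are all illustrative.

```python
def flatten(records: list[dict], array_field: str) -> list[dict]:
    """Emit one row per element of `array_field`, repeating the parent fields."""
    rows = []
    for record in records:
        parent = {k: v for k, v in record.items() if k != array_field}
        for item in record.get(array_field, []):
            rows.append({**parent, **item})  # assumes each element is a dict
    return rows


orders = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": [{"sku": "C", "qty": 5}]},
]
for row in flatten(orders, "items"):
    print(row)
```

Two nested documents become three flat rows, each carrying its parent `order_id`, which is the tabular shape a SQL Database table expects.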

387

MCQmedium

You are reviewing an Azure Data Factory pipeline JSON that copies data from Azure Blob Storage to Azure SQL Database using a stored procedure. The pipeline fails with a 'Parameter supplied for object is not valid' error. What is the most likely cause?

A.The source type 'BlobSource' is not compatible with Azure Blob Storage.

B.The SQL table type 'dbo.InsertType' does not exist.

C.The stored procedure parameters are not mapped to source columns.

D.The dataset references are incorrect.

AnswerC

Copy activity needs mapping from source columns to stored procedure parameters.

Why this answer

Option C is correct because the stored procedure parameter 'Param1' is defined with a static string value 'value1', whereas the copy activity should map source columns to the stored procedure parameters. Option A is wrong because the source type is valid. Option B is wrong because the error is about the supplied parameters, not about a missing table type. Option D is wrong because the dataset references are correctly structured and the error concerns parameters, not dataset names.

388

MCQmedium

A company is ingesting streaming data from IoT devices into Azure Event Hubs. The data must be processed in near real-time and stored in Azure Synapse Analytics for reporting. The solution must handle late-arriving data and ensure exactly-once semantics. Which Azure service should you use for stream processing?

A.Azure Data Factory with Event Hubs source

B.Azure Synapse Spark Structured Streaming

C.Azure Stream Analytics

D.Azure Event Hubs with Capture

AnswerC

Provides exactly-once delivery and can handle late arrivals.

Why this answer

Option C is correct because Azure Stream Analytics supports exactly-once semantics to Synapse, handles late arrivals, and is optimized for real-time processing. Option A is incorrect because Data Factory is for batch, not streaming. Option B is incorrect because Spark Structured Streaming in Synapse does not provide exactly-once delivery to Synapse out of the box. Option D is incorrect because Event Hubs Capture only lands the events in storage; Event Hubs itself does not process data.

389

MCQeasy

You are designing a data pipeline in Azure Data Factory that copies data from Azure Blob Storage to Azure SQL Database. The data contains personally identifiable information (PII). What should you use to protect the data during transit?

A.Azure Information Protection

B.Encryption over HTTPS/TLS

C.Azure Disk Encryption

D.Azure Storage Service Encryption

AnswerB

Azure Data Factory uses TLS to encrypt data in transit between endpoints.

Why this answer

Option B is correct because Azure Data Factory always encrypts data in transit using TLS. Option A is wrong because Azure Information Protection is for labeling, not transit encryption. Option C is wrong because Azure Disk Encryption is for at-rest encryption of disks.

Option D is wrong because Azure Storage Service Encryption is for at-rest encryption.

390

MCQmedium

A company uses Azure Synapse Analytics with dedicated SQL pools. They notice that query performance degrades significantly during peak hours. They have already scaled up the Data Warehouse Units (DWU) to the maximum. Which action should they take next to improve performance?

A.Enable result-set caching.

B.Rebuild all clustered columnstore indexes.

C.Increase the number of concurrency slots.

D.Move the data to Azure Data Lake Storage Gen2.

AnswerA

Result-set caching stores query results in the dedicated SQL pool's user database, reducing compute resource usage and improving performance for repeated queries.

Why this answer

When a dedicated SQL pool is already at maximum DWU, further scaling is not possible. Enabling result-set caching persists query results in the user database of the SQL pool, allowing repeated queries to be served directly from the cache without re-scanning data or re-computing aggregations, as long as the underlying data has not changed. This reduces I/O and CPU pressure during peak hours, improving performance for recurring queries without requiring additional compute resources.
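The effect of serving repeated queries from a cache can be sketched with a small Python class. This is an analogy under stated assumptions, not the Synapse feature: `ResultSetCache`, `slow_scan`, and the query text are hypothetical, the cache is keyed on exact query text, and `invalidate` stands in for the underlying data changing.

```python
class ResultSetCache:
    """Serve repeated identical queries from memory instead of re-running them."""

    def __init__(self, run_query):
        self._run_query = run_query
        self._results = {}
        self.executions = 0  # how many times the expensive query actually ran

    def query(self, sql: str):
        if sql not in self._results:
            self.executions += 1
            self._results[sql] = self._run_query(sql)
        return self._results[sql]

    def invalidate(self) -> None:
        """Drop cached results, e.g. after the underlying data changes."""
        self._results.clear()


def slow_scan(sql: str) -> str:
    return f"rows for: {sql}"  # placeholder for an expensive scan and aggregation


cache = ResultSetCache(slow_scan)
report = "SELECT region, SUM(amount) FROM sales GROUP BY region"
cache.query(report)
cache.query(report)
print(cache.executions)  # 1
cache.invalidate()
cache.query(report)
print(cache.executions)  # 2
```

The second identical query costs no execution, which is why caching helps most when the same reports are run repeatedly during peak hours.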

Exam trap

The trap here is that candidates often confuse result-set caching with materialized views or index maintenance, assuming that only index rebuilds or scaling can fix performance, but result-set caching is a lightweight, no-cost configuration change that directly addresses repeated query patterns during peak load.

How to eliminate wrong answers

Option B is wrong because rebuilding clustered columnstore indexes is a maintenance task that can improve compression and scan performance, but it does not address the root cause of peak-hour degradation when the pool is already at maximum DWU; it also consumes significant resources during rebuild. Option C is wrong because concurrency slots are a resource governance mechanism that limits the number of concurrent queries, not a performance-tuning feature; increasing concurrency slots would actually reduce the resources available per query, potentially worsening performance. Option D is wrong because moving data to Azure Data Lake Storage Gen2 changes the storage layer but does not directly improve query performance in a dedicated SQL pool; the pool still reads data through its compute nodes, and the bottleneck is compute, not storage location.

391

MCQhard

You are reviewing an Azure Data Factory dataset JSON definition for a data lake. The dataset is used in a copy activity that loads sales data into Azure Data Lake Storage Gen2. The pipeline runs successfully, but you notice that the output file always overwrites the previous file with the name 'sales.parquet' regardless of the folderPath parameter. What is the most likely cause?

A.The folderPath parameter is not being evaluated correctly

B.The linkedServiceName is not correctly configured

C.The fileName property is hardcoded and not parameterized

D.The format type is incorrect; it should be 'Parquet' instead of 'ParquetFormat'

AnswerC

The fileName is fixed to 'sales.parquet', causing overwrites.

Why this answer

Option C is correct because the dataset's fileName property is hardcoded to 'sales.parquet', which means every pipeline run writes to the same file, overwriting it regardless of the folderPath parameter. In Azure Data Factory, if fileName is static and not parameterized, the copy activity will always target that exact file name, even if folderPath changes dynamically.

Exam trap

The trap here is that candidates often focus on folderPath or linked service configuration, overlooking that the fileName property is hardcoded, which is the direct cause of the overwrite behavior.

How to eliminate wrong answers

Option A is wrong because if the folderPath parameter were not evaluated correctly, the pipeline would likely fail or write to an unexpected folder, but the observed behavior is consistent overwriting of the same file, indicating the folderPath is working but the fileName is fixed. Option B is wrong because an incorrectly configured linkedServiceName would cause authentication or connection failures, not a consistent overwrite behavior. Option D is wrong because 'ParquetFormat' is a valid format type in Azure Data Factory datasets for Parquet files; the issue is not about format type but about the fileName property being static.

392

MCQmedium

Your team uses Azure Data Factory to orchestrate data movement. You need to monitor pipeline runs and set up alerts when a pipeline fails more than three times in an hour. What is the most efficient approach?

A.Create an alert rule in Azure Data Factory based on the 'Failed pipeline runs' metric.

B.Configure diagnostic settings to send logs to Log Analytics and create a log alert.

C.Use a Logic App to periodically check the pipeline run status and send notifications.

D.Create an Azure Monitor metric alert for the 'Failed pipeline runs' metric with a threshold of 3 in 1 hour.

AnswerD

Azure Monitor metric alerts are efficient for monitoring pipeline failures.

Why this answer

Option D is correct because Azure Monitor alerts can be configured based on metrics like Failed pipeline runs with a threshold of 3 in 1 hour. Option A is wrong because Alert rules in Data Factory are limited. Option B is wrong because diagnostic settings send logs to Log Analytics, but you would need to create a log alert, which is less efficient than a metric alert.

Option C is wrong because a logic app is not the most efficient for simple threshold alerts.

393

MCQhard

You are reviewing a mapping data flow in Azure Data Factory that reads a CSV file from ADLS Gen2 and writes to an Azure Synapse Analytics dedicated SQL pool. The data flow includes a Derived Column transformation with the expression: `column1 == "Error" ? toString(column1) : column1`. The pipeline fails with an error indicating that the sink table could not be created. What is the most likely cause?

A.The source file does not have a header row.

B.The Derived Column expression has a syntax error.

C.Using allowCopyCommand with autoCreate is not supported.

D.The sink dataset is configured for JSON format.

AnswerC

allowCopyCommand requires the table to exist.

Why this answer

Option C is correct because `allowCopyCommand` is set to `true`, which requires the sink table to already exist. The `tableOption` is `autoCreate`, but with `allowCopyCommand` enabled, auto-create is not supported. Option A is wrong because the issue is with the sink, not the source. Option B is wrong because the expression is syntactically correct. Option D is wrong because the sink is a dedicated SQL pool table and the source is delimited text; no JSON dataset is involved.

394

Multi-Selectmedium

You are using Azure Data Factory to ingest data from a REST API into Azure Synapse Analytics. The API has a rate limit of 100 requests per minute. You need to ensure that the pipeline respects the rate limit and retries on failure. Which two settings should you configure in the copy activity? (Choose two.)

Select 2 answers

A.Enable 'Enable staging' to use a staging blob.

B.Configure the 'Batch size' to 100.

C.Set the 'Throttle' property to limit the number of concurrent connections.

D.Set the 'Retry' property to a value greater than 0.

E.Increase the 'Timeout' value to 10 minutes.

AnswersC, D

Throttling concurrent connections helps stay within the rate limit.

Why this answer

Options C and D are correct. Setting 'Throttle' to limit concurrent connections helps respect the rate limit, and setting 'Retry' to a value greater than 0 handles transient failures. Option A is wrong because 'Enable staging' is for large data transfers, not rate limiting. Option B is wrong because 'Batch size' is for bulk operations, not rate limiting. Option E is wrong because 'Timeout' cancels the activity; it does not retry it.
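The arithmetic behind a 100 requests per minute limit can be sketched as follows. This is a generic pacing calculation, not a Data Factory setting: `min_interval_seconds` and `schedule` are hypothetical helpers, and evenly spaced requests are only one simple way to stay under a rate limit.

```python
def min_interval_seconds(max_requests: int, window_seconds: float = 60.0) -> float:
    """Smallest gap between requests that keeps an evenly paced client under the limit."""
    return window_seconds / max_requests


def schedule(request_count: int, max_requests: int, window_seconds: float = 60.0) -> list[float]:
    """Start offsets in seconds for evenly paced requests."""
    gap = min_interval_seconds(max_requests, window_seconds)
    return [round(i * gap, 3) for i in range(request_count)]


print(min_interval_seconds(100))  # 0.6
print(schedule(4, 100))           # [0.0, 0.6, 1.2, 1.8]
```

At 100 requests per minute, a client that sends at most one request every 0.6 seconds stays at or below the limit; anything faster needs throttling, and rejected calls need the retry setting.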

395

Multi-Selecteasy

You are monitoring an Azure Data Factory pipeline that processes streaming data from Event Hubs to Azure Synapse Analytics. Which TWO Azure Monitor metrics should you set alerts on to detect data loss or processing delays?

Select 2 answers

A.InputEvents and OutputEvents metrics

B.Duration metric

C.Data read and data written metrics

D.Pipeline run count metric

E.Backlogged input events metric

AnswersA, E

Comparing input and output events helps detect data loss.

Why this answer

Option A is correct because 'InputEvents' and 'OutputEvents' allow you to compare whether all events are being processed. Option E is correct because 'Backlogged input events' indicates that data is accumulating and not being processed quickly enough. Option B is wrong because 'Duration' is not a streaming metric. Option C is wrong because 'Data read' and 'Data written' are for copy activities, not streaming. Option D is wrong because 'Pipeline run count' is for batch pipelines.

396

MCQhard

You are monitoring an Azure Synapse Pipeline that uses a Mapping Data Flow. The data flow processes 2 GB of data from a CSV source and writes to a Delta sink. The pipeline fails with a 'DataFlowException: Operation aborted' error after running for 45 minutes. The cluster is configured with 8 cores. What is the most likely cause?

A.The cluster size is too small for the data volume.

B.The CSV source contains malformed rows that cause parsing errors.

C.The data flow cluster's time-to-live (TTL) is set to 45 minutes and the job exceeded it.

D.The data flow is using the Spark cluster's default timeout setting.

AnswerC

If the data flow cluster's time-to-live (TTL) is set to 45 minutes, the cluster may be terminated while a longer-running job is still in progress.

Why this answer

Option C is correct because the 'Operation aborted' error after 45 minutes aligns with a time-to-live (TTL) of 45 minutes configured for the Mapping Data Flow cluster. When the TTL expires, the cluster is terminated, and any running job is aborted. The 8-core cluster and 2 GB data volume are not inherently problematic for a 45-minute window, but the 45-minute TTL causes the abort if the job runs longer than that.

Exam trap

The trap here is that candidates confuse the TTL (a Synapse cluster lifecycle setting) with a Spark job timeout or a data volume issue, leading them to incorrectly select cluster size or malformed data as the cause.

How to eliminate wrong answers

Option A is wrong because 8 cores can process 2 GB of data within 45 minutes under normal conditions; the error is not due to insufficient cluster size but rather a timeout. Option B is wrong because malformed rows would cause a parsing error (e.g., 'MalformedRecordException'), not a generic 'Operation aborted' error after a fixed duration. Option D is wrong because the Spark cluster's default timeout is not a configurable setting that causes this specific error; the TTL is a Synapse-specific cluster lifecycle setting, not a Spark-level timeout.

397

MCQmedium

You are designing a data processing pipeline in Azure Data Factory that ingests data from an on-premises SQL Server database to Azure Data Lake Storage Gen2. The data volume is large (500 GB). The network connection between on-premises and Azure is limited to 100 Mbps. You need to minimize the time to transfer the initial full load while ensuring data integrity. Which approach should you recommend?

A.Use Azure Data Factory copy activity with parallel connections

B.Use Azure ExpressRoute to increase bandwidth

C.Compress the data using GZip and use copy activity

D.Use Azure Data Box to copy the data offline

AnswerD

Data Box transfers data physically, bypassing network limitations.

Why this answer

Option D is correct because Azure Data Box physically ships the data, bypassing network bandwidth limitations for large initial loads. Option A is wrong because transferring 500 GB over a 100 Mbps link would take over 11 hours even at full bandwidth, and the network may not be stable. Option B is wrong because ExpressRoute must first be provisioned and still moves the data over a network link. Option C is wrong because GZip compression reduces the volume but still sends the data over the same constrained network.
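The "over 11 hours" figure follows from simple arithmetic, sketched below. The calculation assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and a link that runs at full speed with no protocol overhead; `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Ideal transfer time in hours, ignoring overhead and retransmissions."""
    bits = size_gb * 1e9 * 8              # gigabytes to bits
    seconds = bits / (link_mbps * 1e6)    # bits divided by bits per second
    return seconds / 3600


print(round(transfer_hours(500, 100), 1))  # 11.1
```

Real transfers are slower than this ideal figure, since the link is rarely saturated for the whole run.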

398

Multi-Selectmedium

Which TWO actions should you take to secure access to an Azure Data Lake Storage Gen2 account using Microsoft Entra ID?

Select 2 answers

A.Generate a shared access signature (SAS) token with limited permissions.

B.Assign Azure RBAC roles such as Storage Blob Data Contributor to users or groups.

C.Configure a storage firewall to allow only specific IP addresses.

D.Use storage account access keys for authentication.

E.Enable hierarchical namespace on the storage account.

AnswersB, E

RBAC provides role-based access control integrated with Entra ID.

Why this answer

Options B and E are correct. Option B: RBAC roles like Storage Blob Data Contributor provide coarse-grained access for Entra ID identities. Option E: enabling hierarchical namespace is required for ACLs. Option A is wrong because SAS tokens bypass identity. Option C is wrong because firewall rules do not use Entra ID. Option D is wrong because storage account keys bypass identity.

399

Multi-Selecthard

Which THREE metrics should you monitor to optimize the performance of an Azure Synapse Analytics dedicated SQL pool? (Choose three.)

Select 3 answers

A.Storage used percentage

B.DWU (Data Warehouse Unit) usage percentage

C.Active queries count

D.Buffer cache hit ratio

E.Memory grant waiters count

AnswersB, C, E

Indicates resource utilization

Why this answer

Options B, C, and E are correct. Option B: DWU usage indicates whether the pool is under- or over-provisioned. Option E: Memory grant waiters shows queries waiting for memory.

Option C: Active queries count helps assess concurrency. Option D is wrong because buffer cache hit ratio is a SQL Server metric. Option A is wrong because storage usage is about capacity, not performance.

400

MCQeasy

You are designing a data pipeline that uses Azure Data Factory to load data from an FTP server to Azure Data Lake Storage. The FTP server requires authentication with username and password. Which type of linked service should you create?

A.FTP

B.Azure Blob Storage

C.HTTP

D.Rest service

AnswerA

FTP linked service supports username and password authentication.

Why this answer

Option A is correct because the FTP linked service supports username/password authentication. Option B is incorrect because the Azure Blob Storage linked service is for Azure blobs. Option C is incorrect because the HTTP linked service is for HTTP endpoints.

Option D is incorrect because Rest service is for REST APIs.

401

MCQeasy

Your company uses Azure Databricks for data processing. You need to ensure that Spark jobs cannot access certain storage accounts. What is the most secure approach?

A.Store storage account keys in Azure Key Vault and retrieve them in notebooks.

B.Use shared access keys and restrict their usage.

C.Use Azure RBAC to grant specific storage account permissions to the Azure Databricks managed identity.

D.Disable public network access on storage accounts.

AnswerC

RBAC provides fine-grained access control using managed identities.

Why this answer

Option C is correct because Azure RBAC on storage accounts using Microsoft Entra ID (formerly Azure AD) is the recommended way to control access. Option B is wrong because shared access keys provide broad access. Option A is wrong because account keys retrieved from Key Vault still grant full access to the storage account.

Option D is wrong because disabling public network access restricts the network path, not which identities can access the data.

402

MCQeasy

Your Azure Data Factory pipeline is failing with the error: 'Operation on target Copy data1 failed: The remote server returned an error: (403) Forbidden.' The source is Azure Blob Storage and the sink is Azure SQL Database. You have verified the SQL Database firewall rules allow Azure services. What is the most likely cause?

A.The SQL Database firewall is blocking the Data Factory IP

B.The storage account is behind a private endpoint

C.The Data Factory managed identity lacks Storage Blob Data Contributor role on the storage account

D.The SQL Database is throttling the write operations

AnswerC

403 Forbidden indicates authentication/authorization failure

Why this answer

Option C is correct because a 403 error typically indicates that the managed identity used by Data Factory does not have the correct RBAC role (e.g., Storage Blob Data Contributor) on the storage account. Option A is wrong because the SQL Database firewall has already been verified to allow Azure services. Option B is less likely because nothing in the scenario indicates a private endpoint on the storage account.

Option D is wrong because there is no indication of sink throttling.

403

Multi-Selectmedium

Which TWO of the following are valid methods to secure data at rest in Azure Data Lake Storage Gen2?

Select 2 answers

A.Assign RBAC roles for data access

B.Configure storage firewall rules

C.Use customer-managed keys in Azure Key Vault

D.Use Azure Storage Service Encryption (SSE)

E.Enable TLS 1.2 for all connections

AnswersC, D

Customer-managed keys provide additional control over encryption.

Why this answer

Option C is correct because using customer-managed keys (CMK) in Azure Key Vault allows you to control and rotate the encryption keys used for Azure Storage Service Encryption (SSE), providing an additional layer of security for data at rest. This is a valid method to secure data at rest in Azure Data Lake Storage Gen2, as it ensures that only authorized parties with access to the key vault can decrypt the data.

Exam trap

The trap here is that candidates often confuse access control methods (RBAC, firewalls) or transport security (TLS) with data at rest encryption, mistakenly thinking they secure the stored data itself, when in fact only encryption mechanisms like SSE or CMK protect data at rest.

404

MCQmedium

A company uses Azure Synapse Analytics with dedicated SQL pools. They need to allow a data scientist to read all tables in the 'sales' schema but prevent access to columns containing personally identifiable information (PII). Which feature should be used?

A.Dynamic data masking

B.Row-level security

C.Column-level security

D.Azure Active Directory authentication

AnswerC

Column-level security restricts access to specific columns based on user or role.

Why this answer

Column-level security (C) is the correct choice because it allows you to restrict access to specific columns in a table, such as PII columns, while granting read access to all other columns in the 'sales' schema. This is achieved with a GRANT SELECT statement on the table that lists only the permitted columns and omits the PII columns. Unlike Dynamic Data Masking, which obfuscates data at query time but still lets the user query the column, column-level security denies access to the specified columns entirely.

Exam trap

The trap here is that candidates often confuse Dynamic Data Masking with column-level security, assuming that masking PII is sufficient, but the exam tests the distinction that masking does not prevent data access—it only obfuscates the output, whereas column-level security actually denies read permission on the column.

How to eliminate wrong answers

Option A is wrong because Dynamic Data Masking (DDM) obfuscates PII data at query time but does not prevent the user from querying the column; a user granted the UNMASK permission sees the original values, and a user without it may still infer values by filtering on the masked column. Option B is wrong because Row-level security (RLS) restricts access to rows based on a predicate, not columns; it cannot hide specific columns within a row. Option D is wrong because Azure Active Directory authentication controls who can connect to the SQL pool but does not provide granular column-level access control within tables.

405

MCQmedium

You are running a pipeline in Azure Data Factory that uses a Mapping Data Flow. The data flow reads from Azure SQL Database and writes to Azure Synapse Analytics. You find that the data flow is very slow. Which configuration change would most likely improve performance?

A.Set the 'Staging' option to 'Use staging'

B.Increase the 'Compute type' to 'Memory Optimized' and the 'Core count'

C.Enable staging for the sink and use PolyBase

D.Set the 'Partition option' to 'Round robin' on the source

AnswerB

More compute resources speed up data flow execution.

Why this answer

Mapping Data Flows in Azure Data Factory execute on Spark clusters. The default compute configuration may not provide sufficient memory or parallelism for large data volumes. Increasing the 'Compute type' to 'Memory Optimized' and raising the 'Core count' directly allocates more memory and processing cores to the Spark cluster, which accelerates transformations and data movement between Azure SQL Database and Azure Synapse Analytics.

Exam trap

The trap here is that candidates confuse Mapping Data Flow performance tuning with Copy Activity optimizations, such as PolyBase or staging, which are irrelevant to Spark-based data flows.

How to eliminate wrong answers

Option A is wrong because setting 'Staging' to 'Use staging' in a Mapping Data Flow is not a valid configuration; staging is used for copy activities, not for data flows. Option C is wrong because enabling staging for the sink and using PolyBase is a performance optimization for Copy Activity, not for Mapping Data Flow, which uses Spark-native connectors. Option D is wrong because setting the 'Partition option' to 'Round robin' on the source distributes data evenly but does not address the root cause of slow performance, which is insufficient compute resources for the Spark cluster.

406

MCQhard

You are a data engineer for a global retail company. The company has a hybrid architecture with on-premises SQL Server databases and Azure Synapse Analytics. You need to design a data processing solution that ingests incremental changes from the on-premises SQL Server database (source) into Azure Synapse Analytics (sink) with low latency (under 15 minutes) and high reliability. The source database is 5 TB and experiences high transaction volume during business hours. The solution must minimize impact on the source system and handle schema changes automatically. You have the following options: Option A: Use Azure Data Factory with a copy activity that uses a watermark column to query incremental changes every 10 minutes. The copy activity writes directly to the Synapse table using PolyBase. Option B: Use Azure Data Factory with a mapping data flow that reads from the source using a SQL query with a watermark, performs transformations, and writes to Synapse using staging via Blob Storage and PolyBase. Option C: Use SQL Server Integration Services (SSIS) running on Azure-SSIS Integration Runtime to extract data using change data capture (CDC) and load into Synapse. Option D: Use Azure Databricks with Auto Loader to ingest files from a staging area that is populated by a separate log-shipping process from the source. Which option should you choose?

A.Option C

B.Option A

C.Option D

D.Option B

AnswerD

Mapping data flow supports schema drift and uses staging for PolyBase.

Why this answer

Option B is correct because it handles incremental loads with low latency, uses PolyBase for efficient loading, and mapping data flow allows for schema drift handling and transformations without impacting source. Option A lacks schema drift handling. Option C requires SSIS packages and may have higher latency.

Option D requires additional log-shipping, increasing complexity and latency.

407

MCQhard

You are designing a data processing solution for a retail company. The solution must ingest streaming sales data from point-of-sale (POS) systems and batch uploads from stores that are offline. The total data volume is 5 TB daily. The solution must allow real-time dashboards and periodic batch processing. Which combination of services and ingestion patterns is most cost-effective and scalable?

A.Use Azure IoT Hub for POS streaming and Azure Blob Storage for offline store uploads, then process with Stream Analytics and Data Factory

B.Use Azure Event Hubs with Kafka protocol for all incoming data. Use Stream Analytics for real-time dashboards and Event Hubs Capture to land data in ADLS for batch processing

C.Use Azure Data Lake Storage for all data, then use Azure Databricks structured streaming for real-time and batch

D.Use Azure Stream Analytics directly on POS data and store offline uploads in Blob Storage, then batch process with U-SQL

AnswerB

Why this answer

Option B is correct because Azure Event Hubs with Kafka protocol provides a unified ingestion endpoint for both streaming POS data and batch offline uploads, eliminating the need for separate services. Stream Analytics enables real-time dashboards, while Event Hubs Capture automatically lands data into Azure Data Lake Storage for cost-effective batch processing, making this the most scalable and cost-effective solution for 5 TB daily.

Exam trap

The trap here is that candidates often assume IoT Hub is required for streaming data, but Event Hubs with Kafka protocol is more cost-effective and scalable for high-volume POS data, and they overlook Event Hubs Capture as a built-in mechanism for batch landing.

Why the other options are wrong

A

Two separate ingestion services increase management overhead and cost; IoT Hub is designed for device-to-cloud, not necessarily POS systems.

C

ADLS is storage, not an ingestion service; Databricks structured streaming can read from ADLS but not directly ingest streaming data without a queue.

D

Stream Analytics cannot directly ingest from POS systems without a messaging layer; offline store uploads still need ingestion.

408

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics that uses serverless SQL pools to query Parquet files in Azure Data Lake Storage Gen2. The files are partitioned by year and month. You need to optimize query performance and reduce data scanned. What should you do?

A.Use CREATE EXTERNAL TABLE AS SELECT (CETAS) to create new external tables.

B.Use OPENROWSET with the DATA_SOURCE parameter.

C.Create views that filter on partition columns.

D.Increase the number of files per partition.

AnswerC

Allows partition elimination when querying.

Why this answer

Option C is correct because serverless SQL pools in Azure Synapse Analytics support partition elimination only when queries use views or inline queries that explicitly filter on partition columns (e.g., year, month) in the WHERE clause. This allows the pool to skip scanning irrelevant partitions, reducing data scanned and improving performance. Creating views that encapsulate these filters ensures consistent partition pruning across queries.

Exam trap

The trap here is that candidates often assume that simply using external tables or OPENROWSET automatically provides partition pruning, but in serverless SQL pools, partition elimination only occurs when the query explicitly references the partition columns in the WHERE clause, typically through a view or inline filter.

How to eliminate wrong answers

Option A is wrong because CETAS creates external tables that store query results as new files, but it does not inherently optimize query performance or reduce data scanned for existing partitioned Parquet files; it is a data movement operation, not a query optimization technique. Option B is wrong because OPENROWSET with the DATA_SOURCE parameter allows querying files directly, but without explicit partition column filters in the WHERE clause, the serverless pool cannot perform partition elimination and will scan all files. Option D is wrong because increasing the number of files per partition increases metadata overhead and can degrade query performance due to more file open/read operations, and it does not reduce the amount of data scanned.
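The idea behind partition elimination can be illustrated outside of Synapse with a small sketch. It is not how the serverless engine is implemented; the file paths and the `year=`/`month=` folder naming are assumptions made up for this example. It only shows that a filter on the partition values lets a reader skip whole folders.

```python
import re

# Hypothetical file layout: one folder per year and month (assumed for illustration).
FILES = [
    "sales/year=2022/month=12/part-0.parquet",
    "sales/year=2023/month=01/part-0.parquet",
    "sales/year=2023/month=01/part-1.parquet",
    "sales/year=2023/month=02/part-0.parquet",
]

PARTITION_RE = re.compile(r"year=(\d{4})/month=(\d{2})/")

def prune(files, year, month):
    """Keep only files whose folder matches the requested partition values."""
    kept = []
    for path in files:
        match = PARTITION_RE.search(path)
        if match and int(match.group(1)) == year and int(match.group(2)) == month:
            kept.append(path)
    return kept

selected = prune(FILES, year=2023, month=1)
print(f"scanning {len(selected)} of {len(FILES)} files")  # scanning 2 of 4 files
```

Without the year and month filter, every file would have to be read, which is the behavior the question asks you to avoid.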

409

MCQhard

A multinational bank needs to store customer transaction records for 10 years to meet regulatory compliance. The data is rarely accessed after the first year. The solution must minimize storage costs while allowing queries on recent data with low latency. Which tiering strategy should you implement?

A.Store all data in Azure SQL Database with partitioning and drop older partitions

B.Use Azure Data Lake Storage Gen2 with a single storage tier

C.Store data in Azure Cosmos DB with time-to-live (TTL) and use Azure Blob Storage for backups

D.Use Azure Blob Storage with lifecycle management to transition from Hot to Cool to Archive tiers

AnswerD

Lifecycle management automates tier transitions, minimizing cost while retaining data.

Why this answer

Option D is correct because Azure Blob Storage lifecycle management automatically transitions blobs from Hot to Cool to Archive tiers based on age, minimizing storage costs for rarely accessed data after the first year while keeping recent data in Hot tier for low-latency queries. This aligns with the 10-year retention requirement and cost optimization goal without manual intervention.

Exam trap

The trap here is that candidates may choose Option C thinking TTL in Cosmos DB can handle retention, but TTL deletes data automatically, which violates the regulatory retention requirement.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database with partitioning and dropping older partitions permanently deletes data, violating the 10-year regulatory retention requirement. Option B is wrong because Azure Data Lake Storage Gen2 with a single storage tier (e.g., Hot) does not provide automatic cost optimization for rarely accessed data over 10 years, leading to higher costs. Option C is wrong because Azure Cosmos DB with TTL automatically deletes expired data, which cannot be used for long-term retention, and Azure Blob Storage for backups does not replace a tiering strategy for the primary data store.
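A lifecycle management policy is a JSON document of rules. The sketch below builds one as a Python dictionary to show the general shape. The rule name and the day thresholds (365 and 730) are assumptions chosen for this scenario, not values from the question, and the `tier_for_age` helper is a hypothetical function that only mirrors the rule for illustration.

```python
import json

# Assumed thresholds: move to Cool after one year, to Archive after two years.
policy = {
    "rules": [
        {
            "name": "tier-by-age",  # hypothetical rule name
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 365},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 730},
                    }
                },
            },
        }
    ]
}

def tier_for_age(days_since_modified: int) -> str:
    """Mirror the rule above: which tier a blob of this age would end up in."""
    actions = policy["rules"][0]["definition"]["actions"]["baseBlob"]
    if days_since_modified > actions["tierToArchive"]["daysAfterModificationGreaterThan"]:
        return "Archive"
    if days_since_modified > actions["tierToCool"]["daysAfterModificationGreaterThan"]:
        return "Cool"
    return "Hot"

print(json.dumps(policy, indent=2))
print(tier_for_age(30), tier_for_age(400), tier_for_age(1000))  # Hot Cool Archive
```

No delete action is included, so blobs are retained for the full 10 years; only their tier changes.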

410

Multi-Selectmedium

You are designing a data processing pipeline in Azure Data Factory that uses a Mapping Data Flow. You need to handle errors gracefully, such as when a row fails to convert a column value. Which TWO actions should you take? (Choose two.)

Select 2 answers

A.Wrap the data flow in a Try-Catch activity in the pipeline.

B.Set the data flow's error handling to 'Abort on error' to stop processing on first failure.

C.Enable schema drift on the source to automatically handle data type mismatches.

D.Configure the sink transformation to allow errors and log error rows to a separate file.

E.Use a Conditional Split transformation to separate rows that cause errors based on a condition.

AnswersD, E

Sink can be configured to continue on error and write error rows to a file.

Why this answer

Options D and E are correct. Option E: Using a conditional split to route error rows is a common pattern. Option D: Configuring the sink to allow errors and logging them ensures fault tolerance.

Option A is wrong because try-catch is not available in Mapping Data Flows. Option B is wrong because aborting the activity is not graceful. Option C is wrong because schema drift does not handle conversion errors.

411

MCQmedium

You are responsible for securing an Azure Synapse Analytics workspace. You need to ensure that only authorized users can query the serverless SQL pool. Which authentication method should you use?

A.Microsoft Entra ID authentication

B.Managed identity authentication

C.SQL authentication

D.Azure Key Vault authentication

AnswerA

Provides centralized identity management and security.

Why this answer

Option A is correct because Microsoft Entra ID authentication is recommended for serverless SQL pool. Option C is wrong because SQL authentication is less secure and not recommended. Option B is wrong because managed identity is for service-to-service access, not user queries.

Option D is wrong because Azure Key Vault stores secrets, not authentication.

412

MCQhard

You are a data engineer working for a logistics company. You have an existing Azure Data Factory pipeline that ingests data from a REST API to Azure Data Lake Storage Gen2. The API has rate limiting that can cause failures. You need to implement a solution that can handle rate limiting by retrying with exponential backoff. The pipeline should also log the number of retries for each API call. What should you do?

A.Configure the Copy activity with retry policy using exponential backoff by setting the retry count and retry interval. Enable diagnostic logs to capture retry details.

B.Use a Web activity with an Until loop to implement custom retry logic.

C.Use an Azure Function as a custom activity in Azure Data Factory to implement retry logic with exponential backoff.

D.Use Azure Logic Apps to call the API and then copy the response to Azure Data Lake Storage Gen2.

AnswerA

Built-in retry with exponential backoff and logging.

Why this answer

Option A is correct because Azure Data Factory's Copy activity has built-in support for retry with exponential backoff when you set the retry count and retry interval. You can also log retry attempts using diagnostic logs. Option D is wrong because Azure Logic Apps would be a separate service and add complexity.

Option C is wrong because Azure Functions would require custom development. Option B is wrong because a Web activity in an Until loop requires hand-built retry logic; it does not have built-in retry with exponential backoff.
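For readers unfamiliar with the term, exponential backoff means doubling the wait after each failed attempt. The sketch below is a generic illustration of that idea with a retry counter; it is not Data Factory's implementation, and `RateLimitError`, `call_with_retry`, and `flaky_api` are hypothetical names introduced for the example.

```python
import time

class RateLimitError(Exception):
    """Hypothetical error raised when the API rejects a call due to rate limiting."""

def call_with_retry(call, max_retries=4, base_seconds=1.0, sleep=time.sleep):
    """Return (result, retries_used); re-raise after the final failed attempt."""
    retries = 0
    while True:
        try:
            return call(), retries
        except RateLimitError:
            if retries == max_retries:
                raise
            sleep(base_seconds * 2 ** retries)  # waits of 1s, 2s, 4s, 8s, ...
            retries += 1

# Simulated API that is rate limited on its first two calls.
calls = {"n": 0}

def flaky_api():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitError("429 Too Many Requests")
    return "payload"

waits = []  # record the delays instead of actually sleeping
result, retries = call_with_retry(flaky_api, sleep=waits.append)
print(result, retries, waits)  # payload 2 [1.0, 2.0]
```

Returning the retry count alongside the result is one simple way to satisfy a requirement to log the number of retries per call.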

413

MCQmedium

Refer to the exhibit. You are querying the sys.external_tables view in an Azure Synapse Analytics serverless SQL pool. The query returns no rows, but you know that external tables have been created. What is the most likely reason?

A.The external tables are using PolyBase, which is not supported in serverless SQL pool.

B.Serverless SQL pool does not support external tables; you must use a dedicated SQL pool.

C.The external tables were created using OPENROWSET, not CREATE EXTERNAL TABLE, so they do not appear in sys.external_tables.

D.The user does not have permission to view the sys.external_tables view.

AnswerC

OPENROWSET queries do not create external table metadata; they are ad-hoc queries.

Why this answer

In Azure Synapse Analytics serverless SQL pool, external tables are created using the CREATE EXTERNAL TABLE statement, and they are listed in the sys.external_tables view. However, if you query data directly via OPENROWSET without creating an external table, those ad-hoc queries do not register any metadata in sys.external_tables. Since the question states that external tables are believed to exist but the view returns no rows, the most likely reason is that the data was actually accessed through OPENROWSET rather than CREATE EXTERNAL TABLE, so nothing is cataloged in the system view.

Exam trap

The trap here is that candidates may assume any external data access creates a catalog entry, but the exam tests the specific difference between DDL-based external tables and ad-hoc OPENROWSET queries in serverless SQL pool.

How to eliminate wrong answers

Option A is wrong because serverless SQL pool can read external data, so the underlying access technology would not hide tables created with CREATE EXTERNAL TABLE from the view. Option B is wrong because serverless SQL pool does support external tables via CREATE EXTERNAL TABLE, and they are visible in sys.external_tables. Option D is less likely because the scenario gives no indication of a permission problem, whereas data queried only through OPENROWSET never populates the catalog.

414

MCQmedium

Refer to the exhibit. You have a managed identity that needs to read data from the 'data' container in Azure Data Lake Storage Gen2. The policy currently denies access. What is the most likely cause?

A.The condition on 'acs:RequestVersion' is preventing access because the request does not use the specified API version

B.The resource path is malformed; it should include the blob path

C.The action 'Microsoft.Storage/storageAccounts/blobServices/containers/read' is incorrect; it should be 'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'

D.The principal is a managed identity, but the policy requires a user-assigned identity

AnswerA

The condition requires API version 2019-12-12, which may not be used.

Why this answer

Option A is correct because the condition on 'acs:RequestVersion' requires the request to use a specific API version, which may not be met. Option C is wrong because the action is correct for reading containers. Option B is wrong because the resource path is correct.

Option D is wrong because the principal is a managed identity, which the policy accepts.

415

MCQmedium

You are designing a data processing solution that uses Azure Databricks to transform large datasets. You need to ensure that the processing is cost-effective and can scale to handle variable workloads. Which cluster configuration should you recommend?

A.Use an auto-scaling cluster with spot instances.

B.Use a fixed-size cluster with premium tier.

C.Use a Photon-accelerated cluster with premium tier.

D.Use an interactive cluster with a large number of workers.

AnswerA

Auto-scaling and spot instances provide cost-effectiveness and scalability.

Why this answer

Option A is correct because auto-scaling clusters in Azure Databricks dynamically adjust the number of workers based on workload demands, ensuring cost-effectiveness by scaling down during low activity. Spot instances (Azure Spot VMs) further reduce costs by using unused Azure capacity at a significant discount, making this combination ideal for variable workloads where fault tolerance is acceptable.

Exam trap

The trap here is that candidates often assume premium tier or Photon acceleration automatically improves cost-effectiveness, but these features address performance or governance, not the core requirement of scaling with variable workloads and minimizing cost via spot pricing.

How to eliminate wrong answers

Option B is wrong because a fixed-size cluster cannot scale to handle variable workloads, leading to either over-provisioning (higher costs) or under-provisioning (performance degradation). Option C is wrong because Photon-accelerated clusters are optimized for high-performance SQL and DataFrame workloads, but they do not inherently address cost-effectiveness for variable workloads; the premium tier adds features like role-based access control but does not enable scaling or spot pricing. Option D is wrong because an interactive cluster with a large number of workers is designed for ad-hoc analysis and collaboration, not for cost-effective batch processing; it lacks auto-scaling and spot instance support, leading to higher costs during idle periods.

416

MCQeasy

A company wants to ingest streaming data from IoT devices into Azure for real-time analytics. The data must be available for immediate querying and also stored long-term in a cost-effective format. Which Azure service should be used as the primary ingestion endpoint?

A.Azure SQL Database

B.Azure Event Hubs

C.Azure Data Lake Storage Gen2

D.Azure Blob Storage

AnswerB

Optimized for high-throughput streaming ingestion.

Why this answer

Azure Event Hubs is the correct primary ingestion endpoint for streaming IoT data because it is a fully managed, real-time data ingestion service optimized for high-throughput, low-latency event streaming. It can ingest millions of events per second from IoT devices and integrates natively with Azure Stream Analytics and other analytics services for immediate querying, while also supporting long-term retention via Event Hubs Capture to cost-effective storage like Azure Blob Storage or Data Lake Storage.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage or Data Lake Storage as ingestion endpoints because they are cost-effective for storage, but they lack the real-time streaming capabilities and event-ordering guarantees that Event Hubs provides for immediate querying.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational database designed for OLTP workloads, not for high-volume streaming ingestion; it lacks native support for event streaming protocols like AMQP or Kafka and would create bottlenecks and high costs for real-time IoT data. Option C is wrong because Azure Data Lake Storage Gen2 is a hierarchical file storage optimized for big data analytics and batch processing, not a real-time ingestion endpoint; it cannot natively accept streaming events or provide sub-second querying without an intermediary ingestion service. Option D is wrong because Azure Blob Storage is an object storage service for unstructured data, not designed for real-time event ingestion; it does not support streaming protocols or provide the low-latency, ordered event delivery required for immediate querying.

417

MCQmedium

You are designing a data pipeline in Azure Data Factory (ADF) that copies data from an on-premises SQL Server database to Azure Synapse Analytics dedicated SQL pool. The pipeline must run daily and handle incremental loads efficiently. Which sink dataset type and copy method should you use?

A.Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and use the Copy activity with PolyBase enabled.

B.Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and enable the built-in Upsert option.

C.Use Azure Blob Storage as the sink dataset, then use PolyBase to load into the dedicated SQL pool.

D.Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and use Stored Procedure with staging table and PolyBase.

AnswerD

This combination enables high-throughput ingestion and supports incremental loading via merge logic in the stored procedure.

Why this answer

Option D is correct because it uses a staging table and PolyBase to efficiently load incremental data into Azure Synapse Analytics dedicated SQL pool. PolyBase provides high-throughput parallel loading, and the stored procedure handles the merge logic (upsert) to manage incremental changes. This approach is recommended for large-scale, daily incremental loads to Synapse.

Exam trap

The trap here is that candidates assume the built-in Upsert option works for all Azure SQL targets, but it is not supported for Azure Synapse Analytics dedicated SQL pool, requiring a custom staging-and-merge pattern instead.

How to eliminate wrong answers

Option A is wrong because the Copy activity with PolyBase enabled does not natively support incremental upsert logic; it only supports bulk insert or append, not merge operations. Option B is wrong because the built-in Upsert option is not available for Azure Synapse Analytics dedicated SQL pool as a sink in ADF Copy activity; it is only supported for Azure SQL Database and SQL Server. Option C is wrong because using Azure Blob Storage as an intermediate sink adds unnecessary complexity and latency; PolyBase can load directly from ADF into Synapse without an intermediate Blob Storage hop.

418

MCQeasy

You are designing a data processing solution for a marketing company that uses Azure Synapse Analytics. The solution needs to process customer data from multiple sources, including CRM and web analytics. The data must be cleansed and transformed before loading into a dedicated SQL pool. The transformations include string manipulations, date conversions, and lookups. You need to choose a serverless transformation approach that integrates with Azure Synapse pipelines. Which approach should you use?

A.Use Azure Stream Analytics to transform the data in real time.

B.Use PolyBase to load data and then use T-SQL stored procedures to transform.

C.Use Azure Databricks notebooks with Spark to perform transformations.

D.Use mapping data flows in Azure Synapse pipelines.

AnswerD

Serverless, visual transformation within Synapse.

Why this answer

Option D is correct because Azure Synapse pipelines support mapping data flows, which are serverless and provide a visual interface for transformations. Option B is wrong because PolyBase is for loading, not transformation. Option C is wrong because Azure Databricks would require a cluster.

Option A is wrong because Azure Stream Analytics is for streaming, not batch transformations.

419

MCQeasy

A company stores IoT sensor data in Azure Blob Storage. The data is appended every minute and must be queried in near real-time using a SQL interface. Which Azure service should be used to enable this?

A.Azure Cosmos DB

B.Azure SQL Database

C.Azure Synapse SQL Pool

D.Azure Data Lake Storage Gen2

AnswerD

Correct. Data Lake Storage Gen2 enables SQL queries via Azure Synapse Serverless SQL.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines Blob Storage's scalable, append-friendly architecture with a hierarchical namespace and full POSIX-like ACLs, enabling SQL-based querying via Azure Synapse Serverless SQL or PolyBase. The append-blob pattern (every-minute writes) is natively supported, and the SQL interface (e.g., OPENROWSET) can query the data in near real-time without moving it.

Exam trap

The trap here is that candidates confuse 'SQL interface' with a traditional relational database (like Azure SQL Database) or a NoSQL store (like Cosmos DB), missing that ADLS Gen2 paired with Synapse Serverless SQL provides a schema-on-read SQL layer directly over blob data.

How to eliminate wrong answers

Option A is wrong because Azure Cosmos DB is a NoSQL database optimized for low-latency, globally distributed key-value or document workloads, not for SQL-based ad-hoc querying of append-only blob data. Option B is wrong because Azure SQL Database is a relational OLTP engine that requires schema-defined tables and transactional ingestion, making it unsuitable for direct, schema-on-read queries over raw append-blob files. Option C is wrong because Azure Synapse SQL Pool (dedicated) is a massively parallel processing (MPP) data warehouse designed for large-scale batch analytics, not for near real-time queries on continuously appended blob data without complex ingestion pipelines.

Full explanation →

420

MCQmedium

You have an Azure Data Factory pipeline that uses a Self-Hosted Integration Runtime (SHIR) to copy data from an on-premises Oracle database to Azure Blob Storage. The pipeline is failing with a connectivity error. You have verified that the SHIR is running and the network firewall allows outbound traffic to Azure. What is the most likely cause of the failure?

A.The SHIR is not registered with Azure Data Factory.

B.The SHIR cannot reach the Oracle database due to a network firewall.

C.The SHIR requires inbound port 443 from Azure to on-premises.

D.The SHIR does not have access to Azure Key Vault.

AnswerB

The SHIR must have network access to the on-premises database.

Why this answer

Option B is correct because the SHIR needs network access to the on-premises database; if a firewall or network configuration blocks the path to Oracle, the copy fails with a connectivity error. Option A is wrong because the scenario has already confirmed that the SHIR is running and can reach Azure, so the failure is on the on-premises side. Option C is wrong because the SHIR only opens outbound connections to Azure and does not require inbound ports.

Option D is wrong because the SHIR does not require Azure Key Vault for basic connectivity.

Full explanation →

421

Multi-Selecthard

Which THREE components are required to implement a modern data warehouse architecture on Microsoft Azure using Azure Synapse Analytics?

Select 3 answers

A.Microsoft Purview for data governance and lineage.

B.Power BI as the data visualization layer.

C.Azure Data Lake Storage Gen2 as the data lake.

D.Azure Analysis Services for semantic modeling.

E.Azure Synapse dedicated SQL pool for data warehousing.

AnswersA, C, E

Purview provides metadata management and data discovery.

Why this answer

Options A, C, and E are correct because a modern data warehouse typically uses a data lake for storage (ADLS Gen2), a compute engine for analytics (Synapse dedicated SQL pool), and a metadata catalog (Purview). Option B is wrong because Power BI is a visualization tool and is not required. Option D is wrong because Azure Analysis Services is an optional semantic modeling layer.

Full explanation →

422

MCQhard

You have an Azure Data Factory pipeline defined as shown. The pipeline is failing because the preCopyScript truncates the staging table before each run, but the table is empty on the first run. What change would you make to ensure the pipeline works correctly?

A.Remove the preCopyScript entirely.

B.Increase the writeBatchSize to 50000 to speed up the copy.

C.Change the preCopyScript to: IF OBJECT_ID('dbo.Staging') IS NOT NULL TRUNCATE TABLE dbo.Staging.

D.Set recursive to false in the source.

AnswerC

The conditional check skips the TRUNCATE when dbo.Staging does not yet exist, so the first run no longer fails.

Why this answer

Option C is correct because the preCopyScript should check whether the table exists before truncating it. The script 'IF OBJECT_ID('dbo.Staging') IS NOT NULL TRUNCATE TABLE dbo.Staging' handles the first run. Option B is wrong because writeBatchSize only controls how many rows are written per batch; raising it does not affect the truncate step and may cause memory issues.

Option D is wrong because the recursive setting is unrelated to the truncate issue. Option A is wrong because the script runs on every execution, not only the first, so removing it would leave rows from earlier runs in the staging table.

Full explanation →

423

MCQmedium

A company is designing a data lake on Azure Data Lake Storage Gen2. Data comes from multiple sources with varying schemas. The team must minimize storage costs while keeping all data available for future processing. Which storage tier should they use for the raw ingested data?

A.Premium tier

B.Archive tier

C.Cool tier

D.Hot tier

AnswerC

Correct. Cool tier balances cost and availability for infrequently accessed data.

Why this answer

The Cool tier is the optimal choice for raw ingested data in a data lake because it offers low storage costs while maintaining low-latency access for future processing. Unlike the Archive tier, Cool tier data is immediately available without the multi-hour rehydration delay, and it is significantly cheaper than the Hot tier for data that is infrequently accessed but must remain online for ETL or batch processing.

Exam trap

The trap here is that candidates often confuse 'minimize storage costs' with 'cheapest tier possible' and select Archive, forgetting that raw data must be immediately accessible for future processing, which Archive cannot provide without significant delay.

How to eliminate wrong answers

Option A is wrong because the Premium tier is designed for high-transaction workloads with sub-millisecond latency, not for cost-efficient storage of raw data; it would incur unnecessary expense. Option B is wrong because the Archive tier requires data to be rehydrated (taking up to 15 hours) before it can be read, making it unsuitable for raw data that must be available for future processing. Option D is wrong because the Hot tier is optimized for frequent access and has the highest storage cost, which contradicts the requirement to minimize storage costs for data that is not accessed often.

Full explanation →

424

MCQmedium

You are designing a data processing pipeline in Azure Synapse Analytics that ingests streaming data from Azure Event Hubs and stores it in a dedicated SQL pool. The data volume is approximately 500 GB per hour with peak spikes. The pipeline must minimize data loss during transient failures. Which feature should you implement?

A.Use Azure Synapse Pipeline with Auto-commit and checkpointing to process streaming data.

B.Use PolyBase to load data directly from Event Hubs to the dedicated SQL pool.

C.Use COPY INTO statement to ingest data from Event Hubs into the dedicated SQL pool.

D.Enable Event Hubs Capture to write data to Azure Data Lake Storage and then load using PolyBase.

AnswerA

Auto-commit with checkpointing in Synapse Pipeline provides fault tolerance and exactly-once processing for streaming data.

Why this answer

Option A is correct because Azure Synapse Pipeline with Auto-commit and checkpointing provides exactly-once processing semantics for streaming data from Event Hubs, ensuring no data loss during transient failures by committing offsets only after successful writes to the dedicated SQL pool. This feature is designed for high-volume streaming (500 GB/hour) and handles peak spikes through parallelization and retry logic, making it the optimal choice for minimizing data loss.

Exam trap

The trap here is that candidates often confuse batch-loading technologies like PolyBase or COPY INTO with streaming capabilities, overlooking that only checkpointing-based pipelines provide the fault tolerance needed for real-time data ingestion with minimal loss.

How to eliminate wrong answers

Option B is wrong because PolyBase is a bulk-loading technology for batch data from external sources like Azure Data Lake Storage or Blob Storage, not designed for real-time streaming from Event Hubs, and it lacks checkpointing or auto-commit mechanisms to handle transient failures without data loss. Option C is wrong because the COPY INTO statement is used for batch ingestion from files in Azure Data Lake Storage or Blob Storage, not for streaming data from Event Hubs, and it does not provide streaming-specific fault tolerance like checkpointing. Option D is wrong because Event Hubs Capture writes data to Azure Data Lake Storage in batches (e.g., every 5 minutes or 200 MB), introducing latency and potential data loss during the capture window, and PolyBase loading from there adds further delay, failing the requirement to minimize data loss during transient failures in a streaming pipeline.

Full explanation →

425

MCQhard

You have an Azure Synapse Analytics dedicated SQL pool that is used for reporting. You notice that the tempdb database is growing rapidly and causing queries to fail. Which two actions should you take to mitigate the issue? (Select two.)

A.Enable result-set caching to reduce query reruns.

B.Increase the service level (DWU) of the dedicated SQL pool.

C.Reduce the degree of parallelism (MAXDOP) for the workload.

D.Move tempdb to a separate storage account.

E.Optimize queries that perform large sorts or hash joins.

AnswerC, E

Lowering MAXDOP reduces the number of concurrent operations that can consume tempdb resources.

Why this answer

Options C and E are correct. Reducing the degree of parallelism (MAXDOP) limits the number of concurrent operations, which reduces tempdb usage. Optimizing queries reduces the large sorts and hash joins that spill to tempdb.

Option D is wrong because moving tempdb is not supported in Azure Synapse. Option B is wrong because increasing DWU may provide more tempdb space but does not address the root cause. Option A is wrong because result-set caching does not affect tempdb usage.

Full explanation →

426

MCQmedium

Your organization uses Azure Synapse Analytics serverless SQL pool to query data in Azure Data Lake Storage Gen2. You notice that queries are taking longer than expected. You need to identify which queries are consuming the most resources and optimize them. What should you do first?

A.Use query hints to optimize execution plans.

B.Query the sys.dm_exec_requests DMV to view running queries and their resource usage.

C.Enable diagnostic settings and send query logs to Log Analytics.

D.Create statistics on all columns used in queries.

AnswerB

DMVs give real-time insight into resource consumption.

Why this answer

Option B is correct because DMVs such as sys.dm_exec_requests show running queries and their resource consumption in real time for the serverless SQL pool. Option A is wrong because query hints may help, but you need to identify the problematic queries first. Option C is wrong because diagnostic settings send logs to Log Analytics, which is useful for history, whereas DMVs are available immediately.

Option D is wrong because statistics are already maintained by the serverless pool.

Full explanation →

427

MCQmedium

You execute the above T-SQL in a serverless SQL pool in Azure Synapse Analytics. The external table creation succeeds, but when you query the table, it returns zero rows. The folder 'sales/products/' exists in the container and contains multiple .parquet files. What is the most likely cause?

A.The file format is incorrect; should be DELIMITEDTEXT instead of PARQUET.

B.The LOCATION path in the external table does not match the actual file path.

C.The external data source location uses the wrong endpoint; should use .blob.core.windows.net instead.

D.The credential used in the external data source does not have read permissions.

AnswerB

If the files are in a subfolder or the path is incorrect, no files are read.

Why this answer

Option B is correct because the external table's LOCATION parameter specifies a path relative to the external data source's root. Even though the folder 'sales/products/' exists, the LOCATION must exactly match the subfolder path within the container. A mismatch (e.g., missing a trailing slash, case sensitivity, or an extra prefix) causes the serverless SQL pool to scan no files, returning zero rows.

Exam trap

The trap here is that candidates assume the LOCATION must include the full container path, but it is relative to the external data source's root, so a mismatch in the relative subfolder (e.g., missing a slash or using an absolute path) leads to zero rows without an error.

How to eliminate wrong answers

Option A is wrong because the files are .parquet, so PARQUET is the correct file format; using DELIMITEDTEXT would fail to parse the binary Parquet data. Option C is wrong because serverless SQL pools in Azure Synapse use the .dfs.core.windows.net endpoint (Azure Data Lake Storage Gen2) by default; .blob.core.windows.net is for legacy Blob Storage and would cause a connection error, not zero rows. Option D is wrong because if the credential lacked read permissions, the query would throw an authorization error (e.g., 'Access denied'), not silently return zero rows.
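The relative-path behaviour described above can be sketched in a few lines. This is only an illustration of the idea that the table's LOCATION is appended to the external data source's root; the account name, container name, and the helper `resolve_external_path` are hypothetical, and the exact normalisation rules used by the serverless SQL pool are not reproduced here.

```python
def resolve_external_path(data_source_location: str, table_location: str) -> str:
    """Join the data source root and the table's relative LOCATION (illustrative only)."""
    return data_source_location.rstrip("/") + "/" + table_location.lstrip("/")


# Hypothetical storage account ("myaccount") and container ("data").
actual_folder = "https://myaccount.dfs.core.windows.net/data/sales/products/"

# Data source rooted at the container: the relative LOCATION lines up with the folder.
good = resolve_external_path(
    "https://myaccount.dfs.core.windows.net/data", "sales/products/"
)

# Data source already rooted at 'sales': repeating 'sales/' adds an extra prefix.
bad = resolve_external_path(
    "https://myaccount.dfs.core.windows.net/data/sales", "sales/products/"
)

print(good == actual_folder)
print(bad == actual_folder)
print(bad)
```

Printing the resolved path for both definitions makes it easy to see which one points at a folder that does not exist.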

Full explanation →

428

MCQmedium

A company uses Azure Synapse Analytics dedicated SQL pool. The data engineering team notices that queries against a large fact table are running slowly. The table uses round-robin distribution and has a columnstore index. The team wants to improve query performance without adding more resources. Which action should the team take?

A.Keep round-robin distribution but increase the degree of parallelism.

B.Change the distribution to hash on multiple columns.

C.Change the distribution to hash on the column that is most frequently used in joins.

D.Rebuild the table as a heap to improve insert performance.

AnswerC

Hash distribution on a join key reduces data shuffling.

Why this answer

Option C is correct because hash-distributing the fact table on the column most frequently used in joins co-locates matching rows, which reduces data movement and improves performance without extra resources. Option B is incorrect because hashing on a combination of columns does not align rows with a join on a single key, so the data movement remains. Option A is incorrect because round-robin is already in use and is not optimal for large fact tables that are joined frequently.

Option D is incorrect because clustered columnstore is already in place; a heap would degrade performance.
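To see why hashing on the join key reduces data movement, here is a small simulation. It is a sketch under stated assumptions: `zlib.crc32` stands in for the pool's internal hash function (which is not public), the customer keys are synthetic, and the dimension table is assumed to be hash-distributed on the same key. The only fixed fact is that a dedicated SQL pool spreads each table over 60 distributions.

```python
import zlib

DISTRIBUTIONS = 60  # a dedicated SQL pool always uses 60 distributions


def hash_distribution(key: int) -> int:
    """Stand-in for the pool's hash function (an assumption, not the real one)."""
    return zlib.crc32(str(key).encode()) % DISTRIBUTIONS


# Synthetic data: 100,000 fact rows referencing 1,000 customer keys.
fact_keys = [i % 1000 for i in range(100_000)]

# The dimension row for each key lives in the distribution chosen by the hash.
dim_home = {key: hash_distribution(key) for key in range(1000)}


def rows_needing_movement(placement) -> int:
    """Count fact rows stored in a different distribution than their matching dimension row."""
    return sum(
        1 for i, key in enumerate(fact_keys) if placement(i, key) != dim_home[key]
    )


round_robin_moves = rows_needing_movement(lambda i, key: i % DISTRIBUTIONS)
hash_moves = rows_needing_movement(lambda i, key: hash_distribution(key))

print("round-robin rows to move:", round_robin_moves)
print("hash-on-join-key rows to move:", hash_moves)
```

With both tables hashed on the same key, every fact row already sits next to its matching dimension row, so the join needs no shuffle; under round-robin most rows have to move.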

Full explanation →

429

MCQmedium

Refer to the exhibit. A data engineer notices that the target SQL table contains duplicate rows after a pipeline run. Which change to the pipeline configuration would prevent duplicates?

A.Remove the 'preCopyScript'

B.Change 'writeBatchSize' to 5000

C.Add a 'Upsert' setting in SqlSink with a key column

D.Set 'recursive' to false

AnswerC

Upsert ensures that rows are updated or inserted based on a key, preventing duplicates.

Why this answer

Option C is correct because an upsert writes each incoming row by key: a row whose key already exists is updated, and only rows with new keys are inserted, so rerunning the pipeline cannot add a second copy. The preCopyScript truncation prevents duplicates only if it runs successfully before every copy; if it fails or is skipped, an append-only sink writes the same rows again.

Options B and D do not help because writeBatchSize and recursive affect batching and source file traversal, not how rows are matched in the sink. Among the options, an upsert on a key column is the most effective fix.
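The difference between an append-only sink and a keyed upsert can be shown with a minimal in-memory sketch. The functions `append_load` and `upsert_load` and the `OrderId` column are hypothetical stand-ins; they mimic the behaviour, not the actual Data Factory sink implementation.

```python
def append_load(target, batch):
    """Plain insert: every incoming row is added, even if its key already exists."""
    target.extend(batch)


def upsert_load(target, batch, key):
    """Update the row with a matching key, otherwise insert it."""
    position = {row[key]: i for i, row in enumerate(target)}
    for row in batch:
        if row[key] in position:
            target[position[row[key]]] = row
        else:
            position[row[key]] = len(target)
            target.append(row)


batch = [{"OrderId": 1, "Amount": 10}, {"OrderId": 2, "Amount": 25}]

appended, upserted = [], []
for _ in range(2):  # the same pipeline run executed twice
    append_load(appended, batch)
    upsert_load(upserted, batch, key="OrderId")

print(len(appended), len(upserted))
```

Running the same batch twice doubles the row count in the appended table, while the upserted table keeps one row per key.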

Full explanation →

430

MCQeasy

You have a pipeline in Azure Data Factory that copies data from on-premises SQL Server to Azure Blob Storage. The pipeline fails with a 'Connection timed out' error. You have already verified that the Integration Runtime is running and the SQL Server firewall allows connections from the Integration Runtime. What should you check next?

A.Ensure the Integration Runtime is registered and online

B.Check if the Blob Storage endpoint is accessible from the Integration Runtime

C.Check if the SQL Server is configured to allow remote connections and that TCP/IP is enabled

D.Verify that the SQL Server login credentials are correct

AnswerC

Timeout often indicates network blocking or SQL Server not listening on TCP/IP.

Why this answer

Option C is correct because the 'Connection timed out' error, despite the Integration Runtime being running and the firewall allowing connections, typically indicates that SQL Server is not listening on the expected TCP port. This often happens when TCP/IP is disabled in SQL Server Configuration Manager or remote connections are not enabled. Without TCP/IP enabled, the Integration Runtime cannot establish a network connection to the SQL Server instance, leading to a timeout.

Exam trap

The trap here is that candidates assume a 'Connection timed out' error is always a firewall or network issue, overlooking the SQL Server-side protocol configuration that must be explicitly enabled for remote TCP connections.

How to eliminate wrong answers

Option A is wrong because the question states that the Integration Runtime is already verified as running, so re-checking its registration and online status is redundant and does not address the timeout. Option B is wrong because the error is a connection timeout to SQL Server, not to Blob Storage; the pipeline fails before data transfer begins, so Blob Storage accessibility is irrelevant at this stage. Option D is wrong because incorrect login credentials would result in an authentication error (e.g., 'Login failed'), not a 'Connection timed out' error, which is a network-level issue.
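A quick way to test whether anything is listening on the SQL Server TCP port is a plain socket check run from the Integration Runtime machine. This is a minimal sketch: the helper name `tcp_port_open` is hypothetical, and port 1433 is only the default for a default instance, so a named instance or custom port would need a different value.

```python
import socket


def tcp_port_open(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (hypothetical host name):
# tcp_port_open("sqlserver01.corp.local")
```

If this returns False while the firewall is known to allow the traffic, the next thing to inspect is whether TCP/IP is enabled for the instance in SQL Server Configuration Manager.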

Full explanation →

431

MCQhard

A company stores sensitive customer data in Azure Data Lake Storage Gen2. They need to implement a data retention policy where data older than 90 days is automatically moved to the 'cold' access tier, and data older than 365 days is deleted. Which Azure feature should be used to automate this?

A.Blob Storage lifecycle management

B.Azure Automation

C.Azure Data Factory

D.Azure Policy

AnswerA

Lifecycle management policies can automatically move blobs between tiers and delete based on age.

Why this answer

Azure Blob Storage lifecycle management is the correct feature because it allows you to define rules that automatically transition blobs to cooler access tiers (like the 'cold' tier) after a specified number of days and delete them after a further period. This directly meets the requirement to move data older than 90 days to the cold tier and delete data older than 365 days, all without custom code or manual intervention.

Exam trap

The trap here is that candidates often confuse Azure Policy (which enforces resource-level compliance) with data lifecycle management, or they assume a general automation tool like Azure Automation is needed when a native, policy-driven feature already exists.

How to eliminate wrong answers

Option B (Azure Automation) is wrong because it is a general-purpose automation service for running PowerShell or Python runbooks, not a native data lifecycle management feature; it would require custom scripting to enumerate blobs, check ages, and perform tier changes or deletions, adding complexity and maintenance overhead. Option C (Azure Data Factory) is wrong because it is an orchestration and data integration service for moving and transforming data between stores, not a policy-based lifecycle management tool; it could be used to copy or delete data but lacks the declarative, rule-based tiering and deletion capabilities of lifecycle management. Option D (Azure Policy) is wrong because it enforces compliance rules on Azure resource configurations (e.g., requiring encryption or specific SKUs) and cannot directly manage blob tier transitions or deletions based on data age.
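A lifecycle management policy for this scenario might look roughly like the JSON produced below. Treat it as a sketch: the rule name is hypothetical, and the action names (`tierToCold`, `delete`) and the `daysAfterModificationGreaterThan` condition are written from memory of the policy schema, so they should be checked against the current documentation before use.

```python
import json

# Sketch of a lifecycle policy: move to cold after 90 days, delete after 365 days.
policy = {
    "rules": [
        {
            "enabled": True,
            "name": "retention-90-365",  # hypothetical rule name
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"]},
                "actions": {
                    "baseBlob": {
                        # Assumed action name for the cold access tier.
                        "tierToCold": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}

print(json.dumps(policy, indent=2))
```

Both actions live in one declarative rule, which is what distinguishes this approach from scripting the same logic in Azure Automation or a Data Factory pipeline.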

Full explanation →

432

MCQhard

You are designing a near-real-time data processing solution for a retail company. The source is a Kafka cluster on-premises. The target is an Azure Synapse Dedicated SQL Pool. The solution must handle up to 10,000 events per second with less than 5-minute latency. Which Azure service should you use to ingest the data?

A.Azure Event Hubs (with Kafka protocol support)

B.Azure Data Lake Storage Gen2

C.Azure IoT Hub

D.Azure Stream Analytics

AnswerA

Event Hubs supports Kafka protocol and can ingest 10K events/sec with low latency.

Why this answer

Azure Event Hubs with Kafka protocol support is the correct choice because it provides a fully managed, high-throughput data ingestion service that can handle up to 10,000 events per second with sub-second latency, and it natively supports the Kafka protocol, allowing direct integration with your on-premises Kafka cluster without custom code or additional gateways. This meets the near-real-time requirement (<5-minute latency) and scales to the specified throughput.

Exam trap

The trap here is that candidates often confuse Azure Stream Analytics as an ingestion service, but it is a processing engine that requires an ingestion layer (like Event Hubs) first, and they may overlook that Azure Event Hubs natively supports the Kafka protocol, making it the direct replacement for Kafka ingestion in Azure.

How to eliminate wrong answers

Option B (Azure Data Lake Storage Gen2) is wrong because it is a hierarchical file store designed for batch analytics and data lake storage, not a real-time event ingestion service; it cannot natively consume Kafka streams or provide sub-5-minute latency for streaming data. Option C (Azure IoT Hub) is wrong because it is optimized for device-to-cloud telemetry from IoT devices, not for high-throughput event streams from a Kafka cluster, and it imposes device identity and throttling limits that are unsuitable for 10,000 events per second from a non-IoT source. Option D (Azure Stream Analytics) is wrong because it is a stream processing engine that requires an input source (like Event Hubs) to ingest data; it cannot directly ingest from Kafka on-premises and is not an ingestion service itself.

Full explanation →

433

MCQeasy

You are designing a data processing solution in Azure Synapse Analytics. The solution must support both batch and streaming data ingestion. Which Azure service should you use to ingest streaming data into Synapse Analytics?

A.Azure Data Factory

B.Azure Blob Storage

C.Azure Event Hubs

D.Azure Analysis Services

AnswerC

Event Hubs is designed for streaming data ingestion and works with Synapse.

Why this answer

Option C is correct because Azure Event Hubs is a big data streaming platform and event ingestion service that integrates with Synapse. Option A is wrong because Azure Data Factory is primarily for batch data integration. Option B is wrong because Azure Blob Storage is a storage service, not an ingestion service.

Option D is wrong because Azure Analysis Services is for semantic modeling, not streaming ingestion.

Full explanation →

434

MCQmedium

You are designing a data processing solution for a global company. Data must be processed in near real-time and aggregated by region. You need to minimize latency for downstream consumers. Which Azure service should you use for stream processing?

A.Azure Batch

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Synapse Pipelines

AnswerB

Stream Analytics provides real-time stream processing with SQL-like queries.

Why this answer

Azure Stream Analytics is the correct choice because it is a fully managed stream processing engine designed for near real-time analytics on high-volume data streams. It can ingest data from sources like Azure Event Hubs or IoT Hub, apply SQL-based transformations, and output aggregated results to sinks such as Azure Synapse or Power BI with sub-second latency, meeting the requirement for minimal downstream latency.

Exam trap

The trap here is that candidates often confuse Azure Data Factory or Synapse Pipelines with stream processing because they support 'real-time' triggers, but these services are fundamentally batch-oriented and cannot achieve the sub-second latency required for continuous stream aggregation.

How to eliminate wrong answers

Option A is wrong because Azure Batch is a batch processing service for running large-scale parallel jobs, not designed for near real-time stream processing; it introduces significant latency due to job scheduling and queuing. Option C is wrong because Azure Data Factory is an ETL and data orchestration service focused on batch data movement and transformation, lacking native support for continuous stream processing. Option D is wrong because Azure Synapse Pipelines are built on the same orchestration engine as Data Factory and are intended for batch-oriented workflows, not for real-time stream aggregation.

Full explanation →

435

MCQhard

You are designing a data pipeline in Azure Data Factory to load incremental changes from an on-premises SQL Server database to Azure Synapse Analytics. The source table has over 1 billion rows and a datetime column 'LastUpdated' that is indexed but not always increasing. The requirement is to capture all changes with minimal latency and no missed rows. Which approach should you recommend?

A.Enable Change Data Capture (CDC) on the source table.

B.Use SQL Server change tracking with a watermark table.

C.Use a custom staging table with triggers to capture changes.

D.Use a watermark column (LastUpdated) and query rows where LastUpdated > last run time.

AnswerB

Change tracking captures changes reliably and works with all editions.

Why this answer

Option B is correct because change tracking reliably captures all changes, including deletes and out-of-order updates, and the watermark table records the last synchronized version so that no rows are missed. Option D is incorrect because filtering on LastUpdated can miss rows when timestamps arrive out of order. Option C is incorrect because it requires schema changes (triggers) and may impact source performance.

Option A is incorrect because CDC adds change tables and capture jobs on the source, which is costlier than change tracking for this requirement.
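The missed-row problem with a timestamp watermark can be reproduced with three synthetic changes. This is a sketch only: `commit_version` is a stand-in for the monotonically increasing version number that change tracking assigns at commit time, and the timestamps are simplified to minutes since midnight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    row_id: int
    last_updated: int    # minutes since midnight, stamped by the application
    commit_version: int  # stand-in for the change tracking version


changes = [
    Change(row_id=1, last_updated=600, commit_version=1),
    Change(row_id=2, last_updated=605, commit_version=2),
    # Committed last, but stamped with an earlier LastUpdated value.
    Change(row_id=3, last_updated=603, commit_version=3),
]

# Run 1 sees only the first two commits and stores both kinds of watermark.
visible_run1 = [c for c in changes if c.commit_version <= 2]
ts_watermark = max(c.last_updated for c in visible_run1)
version_watermark = max(c.commit_version for c in visible_run1)

# Run 2 asks for everything newer than the stored watermark.
run2_by_timestamp = [c.row_id for c in changes if c.last_updated > ts_watermark]
run2_by_version = [c.row_id for c in changes if c.commit_version > version_watermark]

print("timestamp watermark picks up:", run2_by_timestamp)
print("version watermark picks up:", run2_by_version)
```

Row 3 falls below the timestamp watermark and is never loaded, while the version-based filter still returns it.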

Full explanation →

436

Multi-Selecteasy

Which TWO are valid methods to load data into Azure Synapse Analytics dedicated SQL pool?

Select 2 answers

A.bcp utility

B.Azure Data Factory Copy Activity

C.PolyBase

D.BULK INSERT

E.COPY statement

AnswersC, E

Loads from Azure Storage.

Why this answer

PolyBase is a valid method for loading data into Azure Synapse Analytics dedicated SQL pool because it uses the T-SQL language to access and combine data from external sources like Azure Blob Storage or Azure Data Lake Store without needing to move the data first. The COPY statement is also valid as it provides a high-throughput, flexible ingestion mechanism that supports various file formats and error handling options directly via T-SQL.

Exam trap

The trap here is that candidates often confuse BULK INSERT (which is SQL Server-specific) with the COPY statement (which is Synapse-specific), or they mistakenly think Azure Data Factory is a direct loading method rather than an orchestration tool.

Full explanation →

437

Multi-Selectmedium

Which TWO actions should you take to secure sensitive data in Azure Data Lake Storage Gen2? (Choose two.)

Select 2 answers

A.Enable public network access from all networks for ease of use

B.Use access control lists (ACLs) to restrict access to specific directories

C.Allow anonymous access to enable sharing

D.Disable soft delete to prevent accidental retention of deleted data

E.Enable encryption at rest using customer-managed keys in Azure Key Vault

AnswersB, E

Granular permissions

Why this answer

Options B and E are correct. Option B: Using ACLs provides granular access control on specific directories. Option E: Enabling encryption at rest with customer-managed keys in Azure Key Vault keeps the data encrypted under keys that you control.

Option D is wrong because disabling soft delete reduces security by removing protection against accidental deletion. Option A is wrong because public network access should be disabled for security. Option C is wrong because anonymous access is a security risk.

Full explanation →

438

MCQeasy

Which Azure storage solution is best suited for storing large volumes of unstructured data, such as log files and media files, and supports both hierarchical namespace and POSIX-like access control lists?

A.Azure Blob Storage

B.Azure Data Lake Storage Gen2

C.Azure Files

D.Azure SQL Database

AnswerB

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) combines a hierarchical namespace with POSIX-like access control lists (ACLs) on top of Azure Blob Storage. This makes it ideal for storing large volumes of unstructured data (e.g., log files, media files) while supporting fine-grained, POSIX-compliant permissions and directory-level operations that are essential for big data analytics workloads.

Exam trap

The trap here is that candidates often choose Azure Blob Storage because it is the underlying storage for ADLS Gen2, but they overlook the key differentiators—hierarchical namespace and POSIX ACLs—that are exclusive to ADLS Gen2 and not available in standard Blob Storage.

Why the other options are wrong

A

Blob Storage supports unstructured data but does not provide a hierarchical namespace or POSIX ACLs by default.

C

Azure Files is for SMB file shares, not optimized for large-scale unstructured data.

D

Azure SQL Database is a relational database for structured data, not for unstructured data.

Full explanation →

439

MCQmedium

Your company uses Azure Synapse Analytics to run a data warehouse. You have a dedicated SQL pool with a hash-distributed fact table named Sales. The distribution column is ProductID. You notice that queries against the Sales table are slow due to data skew. After analysis, you find that a few products (e.g., ProductID 100, 200) account for 80% of the rows. You need to optimize query performance without redesigning the entire table. You also need to minimize data movement during queries. Which action should you take?

A.Change the distribution to round-robin.

B.Increase the number of distributions to 120.

C.Change the distribution to replicate for the Sales table.

D.Create non-clustered indexes on the ProductID column.

AnswerA

Round-robin distributes rows evenly, eliminating skew.

Why this answer

Option A is correct. Round-robin distribution spreads rows evenly across distributions, eliminating skew. However, it may increase data movement for joins.

Given the severe skew, round-robin is a reasonable trade-off. Option B is wrong because a dedicated SQL pool always uses 60 distributions; the number cannot be increased, and more distributions would not fix skew caused by a few dominant key values. Option C is wrong because replicated distribution is not suitable for large fact tables.

Option D is wrong because creating non-clustered indexes does not address distribution skew.
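The effect of skew can be estimated with a short simulation. It is a sketch under stated assumptions: `zlib.crc32` stands in for the pool's real hash function, and the row counts are synthetic, with ProductID 100 and 200 together holding 80% of the rows as in the scenario.

```python
import zlib
from collections import Counter

DISTRIBUTIONS = 60  # fixed for a dedicated SQL pool


def hash_distribution(key: int) -> int:
    """Stand-in for the pool's hash function (an assumption, not the real one)."""
    return zlib.crc32(str(key).encode()) % DISTRIBUTIONS


# Synthetic Sales rows: two products hold 80% of the data.
rows = [100] * 40_000 + [200] * 40_000 + [1000 + (i % 2000) for i in range(20_000)]


def largest_share(placements) -> float:
    """Fraction of all rows held by the fullest distribution."""
    counts = Counter(placements)
    return max(counts.values()) / len(rows)


hash_share = largest_share(hash_distribution(product_id) for product_id in rows)
round_robin_share = largest_share(i % DISTRIBUTIONS for i in range(len(rows)))

print(f"hash on ProductID, fullest distribution: {hash_share:.1%}")
print(f"round-robin, fullest distribution: {round_robin_share:.1%}")
```

Under hash distribution at least one distribution carries 40% or more of the table, so queries wait on that one node; round-robin keeps every distribution close to 1/60 of the rows.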

Full explanation →

440

MCQhard

Your company uses Azure Synapse Analytics to run a large-scale batch processing job every night. The job currently runs on a dedicated SQL pool and takes 4 hours. Management wants to reduce the runtime to under 2 hours without increasing cost. The job involves heavy compute operations with no data movement limitations. What should you do?

A.Create materialized views on frequently queried tables.

B.Increase the service level objective (DWU) of the dedicated SQL pool.

C.Implement workload management to prioritize the job.

D.Enable result-set caching on the dedicated SQL pool.

AnswerD

Result-set caching reduces compute for repeated queries without increasing cost.

Why this answer

Option D is correct because result-set caching stores query results, reducing compute time for repeated queries. Option B is wrong because increasing the service level (DWU) would increase cost. Option A is wrong because materialized views require storage and also incur compute for refresh.

Option C is wrong because workload management allocates resources among workloads but does not by itself reduce compute time without additional cost.

Full explanation →

441

MCQhard

You are optimizing a data pipeline in Azure Data Factory that uses a Copy activity to transfer data from an Azure SQL Database to a dedicated SQL pool in Azure Synapse Analytics. The source table has 500 million rows and the copy operation is taking too long. You need to reduce the copy duration. Which configuration change will have the most impact?

A.Enable staging and use PolyBase as the copy method for the sink.

B.Change the copy behavior to 'sequential' to reduce load on the source.

C.Increase the degree of copy parallelism (DOP) to the maximum value supported.

D.Split the source data into multiple smaller files and use multiple copy activities running in parallel.

AnswerA

Staging with PolyBase dramatically improves performance for large data loads.

Why this answer

Enabling staging with PolyBase as the copy method for the sink is the most impactful change because PolyBase leverages the massively parallel processing (MPP) architecture of Azure Synapse Analytics to load data in parallel directly into the dedicated SQL pool. This bypasses the single-threaded bottleneck of the standard INSERT-based copy method, dramatically reducing the time required to ingest 500 million rows.

Exam trap

The trap here is that candidates often assume increasing parallelism (DOP) or splitting data into multiple activities is always better, but they overlook that PolyBase's MPP integration with Synapse is the only option that fundamentally changes the data loading mechanism from a serial to a parallel bulk operation.

How to eliminate wrong answers

Option B is wrong because changing the copy behavior to 'sequential' would reduce parallelism and increase the copy duration, not reduce it. Option C is wrong because increasing the degree of copy parallelism (DOP) to the maximum value supported can cause resource contention and throttling on the source Azure SQL Database, often leading to diminishing returns or even slower performance. Option D is wrong because splitting the source data into multiple smaller files and using multiple copy activities running in parallel would require additional orchestration and staging, and without PolyBase or staging, each copy activity would still use the slow row-by-row INSERT method, making it less effective than a single PolyBase-based load.
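
As a rough illustration of the configuration described above, the following sketch builds the typeProperties section of a Copy activity with staging and PolyBase enabled. The linked service name and staging path are hypothetical placeholders, and the PolyBase reject settings are illustrative values, not taken from the question.

```python
import json

# Sketch of a Copy activity's "typeProperties" with staged copy and PolyBase.
# Values in angle brackets are hypothetical placeholders, not real resources.
copy_type_properties = {
    "source": {"type": "AzureSqlSource"},
    "sink": {
        "type": "SqlDWSink",
        "allowPolyBase": True,  # load through PolyBase instead of bulk insert
        "polyBaseSettings": {
            "rejectType": "value",  # illustrative reject handling
            "rejectValue": 0,
            "useTypeDefault": True,
        },
    },
    "enableStaging": True,  # stage the data in Blob Storage first
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "<staging-blob-linked-service>",
            "type": "LinkedServiceReference",
        },
        "path": "<staging-container/path>",
    },
}

print(json.dumps(copy_type_properties, indent=2))
```

The printed JSON is only the fragment that changes; the rest of the pipeline definition (datasets, activity name, dependencies) would stay as it is.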


442

Multi-Selectmedium

Which TWO of the following are recommended practices for designing a data storage solution using Azure Data Lake Storage Gen2?

Select 2 answers

A.Enable soft delete to protect against accidental deletion

B.Use Kerberos authentication for the storage account

C.Use a partition strategy that groups related data together

D.Store all files in a single directory for simplicity

E.Enable anonymous public access for ease of use

AnswersA, C

Soft delete provides a recovery window for deleted data.

Why this answer

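Option A is correct because enabling soft delete on Azure Data Lake Storage Gen2 protects against accidental deletion by retaining deleted data for a specified retention period. This allows recovery of data that was deleted by mistake, which is a critical data protection practice for enterprise storage solutions. Option C is correct because a partition strategy that groups related data together (for example, by date or by source system in the folder hierarchy) lets query engines read only the relevant folders instead of scanning the whole lake.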

Exam trap

Microsoft often tests the misconception that Kerberos is the primary authentication method for Azure Data Lake Storage Gen2, but the correct protocol is Azure AD OAuth 2.0, and candidates may confuse Gen2 with Gen1 or on-premises Hadoop.


443

MCQeasy

Refer to the exhibit. You have created an external table in Azure Synapse Analytics serverless SQL pool to query Parquet files stored in Azure Data Lake Storage Gen2. When you query the external table, you get an error that the external table is not accessible. What should you check first?

A.Check that the external table's LOCATION path is relative to the container and does not start with a slash.

B.Verify that the serverless SQL pool has been granted the 'Storage Blob Data Reader' role on the storage account.

C.Ensure that the external file format is correctly referencing the Parquet format.

D.Confirm that the Snappy compression codec is supported by the serverless SQL pool.

AnswerB

The serverless SQL pool needs read permissions on the storage account to access the data.

Why this answer

The error 'external table is not accessible' in Azure Synapse serverless SQL pool typically indicates an authorization failure when the SQL pool attempts to read the underlying Parquet files in Azure Data Lake Storage Gen2. The identity that the serverless SQL pool uses to reach storage (for example, the workspace managed identity) must be granted the 'Storage Blob Data Reader' role on the storage account to have read permissions. Without this role assignment, the SQL pool is not authorized to read the files, resulting in the access error.

Exam trap

The trap here is that candidates often confuse 'external table not accessible' with file path or format issues, but the error message specifically points to a permissions/authorization problem, not a configuration or syntax error.

How to eliminate wrong answers

Option A is wrong because the LOCATION path in an external table for serverless SQL pool must be relative to the container and should not start with a slash; however, an incorrect path format would cause a 'file not found' or 'path does not exist' error, not an 'external table is not accessible' error which is specifically about permissions. Option C is wrong because if the external file format incorrectly references the Parquet format, the error would be about format mismatch or parsing failure (e.g., 'Cannot parse file'), not about table accessibility. Option D is wrong because Snappy compression is fully supported by serverless SQL pool for Parquet files; an unsupported codec would cause a decompression error, not an access-denied error.


444

Multi-Selectmedium

Which TWO Azure services can be used to implement a polyglot persistence architecture for an e-commerce application that requires both a relational database for orders and a document database for product catalogs?

Select 2 answers

A.Azure Cache for Redis

B.Azure SQL Database

C.Azure Cosmos DB

D.Azure Table Storage

E.Azure Data Lake Storage Gen2

AnswersB, C

Suitable for relational data like orders.

Why this answer

Azure SQL Database is a relational database service that supports ACID transactions and structured querying, making it ideal for storing and managing e-commerce order data with strong consistency and referential integrity. Azure Cosmos DB is a multi-model NoSQL database that provides document database capabilities with flexible schemas and low-latency access, perfectly suited for product catalogs that require high read throughput and schema evolution. Together, they enable a polyglot persistence architecture by using the best storage model for each workload.

Exam trap

The trap here is that candidates often confuse Azure Table Storage with a document database, but Table Storage is a key-value store without native JSON support or rich querying, whereas Cosmos DB provides a true document database with SQL API and indexing.


445

MCQhard

You are troubleshooting a slow-running pipeline in Azure Data Factory. The pipeline copies data from an on-premises SQL Server to Azure Synapse Analytics using a self-hosted integration runtime. The copy activity is using the 'Auto' copy method. You notice that network bandwidth is limited. Which configuration change would most likely improve performance?

A.Enable staging using Azure Blob Storage and use PolyBase to load into Synapse

B.Increase the Data Integration Units (DIU) for the copy activity

C.Change the copy method to 'Bulk insert'

D.Set the Fault Tolerance option to skip incompatible rows

AnswerA

Staging improves performance by using parallel uploads to Blob Storage.

Why this answer

Option A is correct because staged copy helps at both ends of the transfer. The data still has to cross the on-premises-to-cloud link once, but with staging enabled the self-hosted integration runtime can compress the data before uploading it to Azure Blob Storage, so fewer bytes travel over the constrained link. PolyBase then loads the staged files into Synapse in parallel over Azure's internal network, which is much faster than inserting rows over the limited connection.

Exam trap

The trap here is that candidates assume increasing DIU or changing the copy method directly speeds up data movement, when in fact the real bottleneck is the network link; only staged copy reduces what crosses that link (through compression) and hands the final load to PolyBase inside Azure.

How to eliminate wrong answers

Option B is wrong because Data Integration Units (DIU) control parallelism within the copy activity but do not address the underlying network bandwidth bottleneck; increasing DIU on a constrained link can actually worsen contention. Option C is wrong because 'Bulk insert' is the default method for loading into Synapse and does not change the data path; it still sends all data over the limited network connection. Option D is wrong because Fault Tolerance skips incompatible rows to avoid failures, but it has no impact on data transfer speed or network utilization.


446

MCQeasy

You are monitoring an Azure Data Factory pipeline that runs daily. You notice that some runs are failing due to transient network errors. You want to automatically retry the failed activities with a 5-minute delay, up to 3 times. How should you configure this?

A.Set the pipeline's 'Concurrency' to 3 and 'Retry' to 1.

B.Leave the default settings as they are because Azure Data Factory automatically retries failed activities 3 times.

C.On each activity, set 'Retry' to 3 and 'Retry interval' to 00:05:00.

D.Configure a 'Retry' policy on the pipeline itself, setting maximum retries to 3 and retry interval to 5 minutes.

AnswerC

Correct: Activities have individual retry settings. Setting retry to 3 with 5-minute interval achieves the requirement.

Why this answer

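Option C is correct because Azure Data Factory activities have a 'Retry' property that can be set to 3 and a 'Retry interval' that can be set to 00:05:00 (5 minutes). Option A is wrong because 'Concurrency' controls how many pipeline runs execute at the same time, and a 'Retry' of 1 is too low. Option B is wrong because the default retry count is 0.

Option D is wrong because retry is configured per activity, not at the pipeline level.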
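
A minimal sketch of what the per-activity setting looks like in the pipeline JSON, expressed here as a Python dictionary. The activity name is a hypothetical placeholder, and `to_seconds` is a helper introduced only for this example; in the JSON definition the interval is stored in seconds under `retryIntervalInSeconds`.

```python
import json


def to_seconds(hhmmss: str) -> int:
    """Convert an HH:MM:SS string such as '00:05:00' into seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Retry settings live in the "policy" block of each activity.
activity_policy = {
    "retry": 3,  # up to 3 retries after the first failure
    "retryIntervalInSeconds": to_seconds("00:05:00"),  # wait 5 minutes between attempts
}

# Hypothetical activity fragment showing where the policy sits.
activity = {
    "name": "<copy-activity-name>",
    "type": "Copy",
    "policy": activity_policy,
}

print(json.dumps(activity, indent=2))
```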


447

MCQeasy

You run the above query on a table named 'visits' in a dedicated SQL pool. The table has 1 billion rows and is hash-distributed on user_id. The query takes a long time. What is the most likely reason?

A.The query uses a date filter which cannot be pushed down to the distribution.

B.The table is hash-distributed on user_id, but the query uses a different column for aggregation.

C.The table should use a replicated distribution instead of hash distribution.

D.COUNT(DISTINCT) operations are expensive because they require data movement across distributions.

AnswerD

COUNT(DISTINCT) needs to combine distinct values from all distributions.

Why this answer

In a dedicated SQL pool, COUNT(DISTINCT) is inherently expensive because it requires all distinct values to be gathered across distributions before counting. Since the table is hash-distributed on user_id, the distinct count on a different column (likely visit_date or another attribute) forces data shuffling across all distributions to ensure uniqueness, causing significant performance degradation.

Exam trap

The trap here is that candidates often blame the distribution key mismatch (Option B) or filter pushdown (Option A), overlooking the fact that COUNT(DISTINCT) forces a global data movement step regardless of distribution strategy.

How to eliminate wrong answers

Option A is wrong because date filters can be pushed down to distributions in dedicated SQL pool via partition elimination or predicate pushdown, so this is not the primary bottleneck. Option B is wrong because aggregation on a different column than the distribution key does not inherently cause slowness; hash distribution supports aggregation on any column, though it may require partial aggregation per distribution. Option C is wrong because replicated distribution is typically beneficial for small dimension tables (under 2 GB), not for a 1-billion-row fact table, and would cause massive storage overhead and maintenance issues.
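
The effect can be illustrated with a small simulation. This is only a sketch: the rows, the `page` column, the four distributions, and the hash stand-in are all assumptions made for the demo (the original query is not shown, and a real dedicated SQL pool uses 60 distributions). It shows why distinct counts computed inside each distribution cannot simply be added together when the counted column is not the distribution key.

```python
import zlib
from collections import defaultdict

NUM_DISTRIBUTIONS = 4  # dedicated SQL pools use 60; 4 keeps the demo readable

# Hypothetical visits: (user_id, page)
rows = [(1, "home"), (2, "home"), (3, "cart"), (4, "home"), (5, "cart"), (6, "help")]


def bucket_of(value):
    """Deterministic stand-in for the pool's hash function."""
    if isinstance(value, int):
        return value % NUM_DISTRIBUTIONS
    return zlib.crc32(value.encode("utf-8")) % NUM_DISTRIBUTIONS


def distribute(data, key_index):
    """Place each row in a distribution based on the chosen key column."""
    buckets = defaultdict(list)
    for row in data:
        buckets[bucket_of(row[key_index])].append(row)
    return buckets


def sum_of_local_distinct_pages(buckets):
    """Count distinct pages inside each distribution, then add the counts."""
    return sum(len({page for _, page in part}) for part in buckets.values())


true_count = len({page for _, page in rows})

# Table as stored: hash-distributed on user_id. Local counts overcount
# because the same page can live in several distributions.
naive_count = sum_of_local_distinct_pages(distribute(rows, key_index=0))

# After a shuffle on the counted column, each page lives in one distribution,
# so the local counts add up to the correct answer.
shuffled_count = sum_of_local_distinct_pages(distribute(rows, key_index=1))

print(true_count, naive_count, shuffled_count)
```

The redistribution on the counted column is the data movement step that makes the query expensive on a 1-billion-row table.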


448

MCQeasy

A logistics company needs to store delivery tracking data that is updated frequently by multiple services. The solution must support transactions across multiple documents and provide real-time analytics. Which Azure service should you recommend?

A.Azure Table Storage

B.Azure Cosmos DB with SQL API

C.Azure Data Lake Storage Gen2

D.Azure Service Bus

AnswerB

Cosmos DB supports transactions via stored procedures and analytical store for analytics.

Why this answer

Azure Cosmos DB with SQL API is the correct choice because it provides multi-document transaction support (ACID within a logical partition) and real-time analytics via its change feed and integrated analytical store. This meets the requirement for frequent updates from multiple services while enabling low-latency reads for analytics, unlike other Azure storage options that lack transactional guarantees across documents or real-time query capabilities.

Exam trap

The trap here is that candidates often treat Azure Table Storage's limited, single-partition batch transactions as equivalent to Cosmos DB's multi-document support, or mistakenly think Azure Data Lake Storage Gen2 can handle transactional updates, when it is designed for append-heavy, analytical workloads.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage offers only batch transactions limited to entities in the same partition, and it has no integrated analytical store or rich query capabilities for real-time analytics. Option C is wrong because Azure Data Lake Storage Gen2 is optimized for large-scale batch analytics and data lake workloads, not for transactional updates or real-time analytics on frequently updated data. Option D is wrong because Azure Service Bus is a message broker for decoupling services and asynchronous communication, not a data store for transactional or analytical workloads.


449

MCQhard

Refer to the exhibit. The pipeline fails because the source and sink datasets do not match. The source file is a CSV with columns: CustomerID, Name, City. The sink table dbo.Customer has columns: CustomerID, Name, City, CreatedDate (default GETDATE()). The pipeline uses auto-mapping. Why does the pipeline fail?

A.Auto-mapping fails because the source and sink have different column counts.

B.The recursive setting should be false for CSV files.

C.The writeBatchSize is too large for the sink.

D.The pre-copy script is not allowed for SQL sink.

AnswerA

Auto-mapping maps by name but expects the same number of columns or explicit mapping for extras.

Why this answer

Option A is correct because the source has 3 columns and the sink has 4 columns. With auto-mapping, the copy activity tries to map all source columns to sink columns by name but fails because the sink has an extra column (CreatedDate) that is not nullable and has no default value in the mapping context (the default is a SQL Server default constraint, but auto-mapping expects an explicit mapping or a NULL). The pre-copy script truncates the table, but that does not resolve the column count mismatch.

Option B is wrong because recursive is fine for a single file. Option C is wrong because writeBatchSize is acceptable. Option D is wrong because the pre-copy script is valid.
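
One way to make the mapping explicit is a TabularTranslator block on the copy activity that lists only the three source columns, leaving CreatedDate to its SQL default constraint. The sketch below builds that block as a Python dictionary; the exhibit's actual pipeline is not shown, so treat this as an illustration under that assumption, not as the exhibit's definition.

```python
import json

# Columns present in the CSV source, as stated in the question.
source_columns = ["CustomerID", "Name", "City"]

# Explicit mapping: each source column maps to the sink column of the same name.
# CreatedDate is deliberately left out so the sink's default applies.
translator = {
    "type": "TabularTranslator",
    "mappings": [
        {"source": {"name": column}, "sink": {"name": column}}
        for column in source_columns
    ],
}

print(json.dumps(translator, indent=2))
```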


450

MCQmedium

You need to process a large dataset stored as CSV files in Azure Data Lake Storage Gen2 using Azure Databricks. The processing involves several transformations and aggregations. You want to minimize shuffle operations. Which approach should you use?

A.Use Delta Lake and apply Z-ordering on the columns used in filters and aggregations

B.Cache the data in memory after reading

C.Use bucketing with a fixed number of buckets

D.Partition the data by a high-cardinality column

AnswerA

Z-ordering co-locates related data, reducing data shuffling.

Why this answer

Z-ordering in Delta Lake co-locates related data within files based on specified columns, which significantly reduces the amount of data scanned during filter and aggregation operations. By minimizing the data that needs to be read, Z-ordering inherently reduces shuffle operations because fewer partitions need to be exchanged across the cluster during transformations. This approach is specifically designed to optimize query performance on large datasets in Azure Databricks without increasing the number of shuffle stages.

Exam trap

The trap here is that candidates often confuse partitioning (which can increase shuffle) with Z-ordering (which reduces shuffle by improving data locality without creating new partitions), leading them to choose bucketing or high-cardinality partitioning as a solution for shuffle minimization.

How to eliminate wrong answers

Option B is wrong because caching data in memory after reading only speeds up repeated access to the same data but does not reduce shuffle operations during transformations or aggregations; shuffle is caused by data movement across partitions, not by I/O latency. Option C is wrong because bucketing with a fixed number of buckets can actually increase shuffle operations if the bucketing columns do not align with the join or aggregation keys, and it does not inherently minimize shuffle; it is primarily used for optimizing joins and aggregations when the number of buckets matches the cluster parallelism. Option D is wrong because partitioning by a high-cardinality column (e.g., a column with many unique values) creates many small partitions, which leads to excessive shuffle overhead and task scheduling inefficiency, increasing rather than minimizing shuffle operations.
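
For reference, Z-ordering is applied with Delta Lake's OPTIMIZE ... ZORDER BY statement. The sketch below only builds the statement text; the table and column names are hypothetical, and `build_zorder_statement` is a helper introduced for this example. Running it requires a Databricks notebook or another Spark session with Delta Lake available.

```python
def build_zorder_statement(table: str, columns: list[str]) -> str:
    """Build a Delta Lake OPTIMIZE statement that Z-orders by the given columns."""
    if not columns:
        raise ValueError("ZORDER BY needs at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Hypothetical Delta table and the columns used in filters and aggregations.
statement = build_zorder_statement("sales_delta", ["customer_id", "order_date"])
print(statement)

# In a Databricks notebook, where a SparkSession named `spark` is predefined,
# the statement would be executed with: spark.sql(statement)
```

Because the source data is CSV, it would first need to be written out as a Delta table before this statement applies.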

