Knowledge + Practice

CCNA Develop Data Processing Questions

75 of 297 questions · Page 2/4 · Develop Data Processing topic · Answers revealed

Practice these questions Exam hub All questions

76

MCQhard

Refer to the exhibit. You have an Azure Data Factory pipeline that copies data from a CSV file in Blob Storage to a Synapse dedicated SQL pool table named dbo.Sales. The pipeline fails. The error message indicates that the 'Amount' column in the sink table does not allow NULLs but the source contains NULL values. What is the best way to resolve this issue without losing data?

A.Use a Mapping Data Flow with a Derived Column transformation to replace NULLs with 0

B.Add a filter in the copy activity to exclude rows with NULL Amount

C.Modify the sink table to have a default value for the Amount column

D.Change the sink table column to allow NULLs

AnswerA

Mapping Data Flow allows you to handle NULLs by providing a default value, ensuring data integrity.

Why this answer

Option D is correct: Adding a derived column transformation in a Mapping Data Flow allows you to replace NULLs with a default value. Option A is wrong because it changes the sink schema which may not be allowed. Option B is wrong because filtering out rows with NULLs would lose data.

Option C is wrong because setting the column to allow NULLs might not be permitted and could impact downstream processes.

Practice this question →

77

Multi-Selecthard

You are designing a data processing solution in Azure Synapse Analytics. You need to load data from multiple sources into a dedicated SQL pool. Which THREE of the following are best practices for loading data?

Select 3 answers

A.Use round-robin distribution for staging tables

B.Use a staging table to load data and then insert into the final table

C.Split data into many small files to increase parallelism

D.Use PolyBase or COPY statement for parallel loading

E.Enable clustered columnstore index during load to improve query performance

AnswersA, B, D

Round-robin distributes data evenly for fast load.

Why this answer

Round-robin distribution is recommended for staging tables because it distributes data evenly across all distributions without requiring a hash key, minimizing data movement during the load process. Since staging tables are temporary and used for intermediate storage, the even distribution ensures that the subsequent INSERT...SELECT operation into the final table benefits from maximum parallelism and reduced skew.

Exam trap

The trap here is that candidates often assume more files always increase parallelism, but Azure Synapse Analytics optimizes for fewer, larger files to minimize metadata operations and maximize throughput.

Practice this question →

78

MCQhard

Refer to the exhibit. You have an Azure Synapse pipeline that runs a Spark notebook daily. The notebook uses the inputDate parameter to filter data. The notebook successfully processes data for '2024-01-01' but fails for '2024-01-02' with an error that the 'sales' table does not exist. The 'sales' table is created daily by a preceding job. What is the most likely cause?

A.The notebook expects a table named 'sales_20240102' but the preceding job creates 'sales_20240101'

B.The notebook activity should have a dependency on the job that creates the table

C.The Spark pool does not have permissions to read the storage account where the table data is stored

D.The pipeline parameter 'inputDate' is not being passed to the notebook correctly

AnswerA

The notebook likely constructs table name from the date parameter, and the table for the new date hasn't been created.

Why this answer

The error indicates that the notebook is looking for a table named 'sales_20240102' (based on the inputDate parameter for '2024-01-02'), but the preceding job creates a table named 'sales_20240101' (the previous day's table). This mismatch occurs because the notebook dynamically constructs the table name using the inputDate parameter, and the preceding job likely creates the table with a date suffix that does not align with the current inputDate. The correct answer is A because the table naming convention is inconsistent between the two processes.

Exam trap

The trap here is that candidates may assume the error is due to missing dependencies or permissions, but the real issue is a logical mismatch in table naming conventions between the table creation job and the notebook's expected table name.

How to eliminate wrong answers

Option B is wrong because adding a dependency on the job that creates the table would only ensure the job runs before the notebook, but it would not fix the naming mismatch between the table created and the table expected. Option C is wrong because the error message explicitly states the 'sales' table does not exist, not a permission issue; a permissions error would typically manifest as an 'Access Denied' or 'AuthorizationFailure' exception. Option D is wrong because if the inputDate parameter were not passed correctly, the notebook would likely fail for all dates or use a default value, not fail specifically for '2024-01-02' while succeeding for '2024-01-01'.

Practice this question →

79

MCQmedium

You are developing a data processing pipeline in Azure Synapse Analytics that uses a SQL script to transform data in a dedicated SQL pool. The pipeline currently runs in the built-in serverless pool, but you want to migrate it to a dedicated SQL pool for better performance. What must you change in the pipeline?

A.Change the linked service from serverless to dedicated SQL pool.

B.Replace the SQL script with a Mapping Data Flow activity.

C.Update the integration runtime to a self-hosted IR.

D.Modify the SQL script to use T-SQL compatible with dedicated SQL pool.

AnswerD

Dedicated SQL pool has a different T-SQL surface area; some functions may need to be rewritten.

Why this answer

Option A is correct because dedicated SQL pool uses T-SQL with specific syntax differences from serverless SQL pool. Option B (linked service) may also need updating but is not the primary change. Option C (integration runtime) is not necessarily required if both pools are in the same workspace.

Option D (data flow) is a different activity type.

Practice this question →

80

MCQhard

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query data in Azure Data Lake Storage Gen2. The data is stored in Delta Lake format. Which of the following statements is true regarding querying Delta Lake tables with serverless SQL pool?

A.Serverless SQL pool supports Delta Lake only if the files are in Parquet format.

B.You can use the OPENROWSET function with the BULK option and FORMAT='DELTA' to query Delta Lake tables.

C.Serverless SQL pool cannot query Delta Lake tables; you must convert them to Parquet first.

D.You must create an external table in serverless SQL pool using the CREATE EXTERNAL TABLE statement with the Delta format.

AnswerB

OPENROWSET with FORMAT='DELTA' allows querying Delta Lake tables directly.

Why this answer

Serverless SQL pool in Azure Synapse Analytics supports querying Delta Lake tables directly using the OPENROWSET function with the BULK option and FORMAT='DELTA'. This allows you to read Delta Lake data stored in Azure Data Lake Storage Gen2 without needing to convert it to Parquet or create external tables first. The Delta format is natively supported, enabling time travel and schema evolution features.

Exam trap

The trap here is that candidates assume Delta Lake requires special handling or conversion to Parquet, but serverless SQL pool natively supports Delta via OPENROWSET with FORMAT='DELTA', making options A and C incorrect, while option D misleads by suggesting external tables are the primary method.

How to eliminate wrong answers

Option A is wrong because serverless SQL pool supports Delta Lake natively, not only when files are in Parquet format; Delta Lake itself uses Parquet as the underlying storage format but adds transaction logs and metadata. Option C is wrong because serverless SQL pool can query Delta Lake tables directly using OPENROWSET with FORMAT='DELTA', so conversion to Parquet is unnecessary. Option D is wrong because while you can create external tables for Delta Lake, the correct and simplest method is to use OPENROWSET with FORMAT='DELTA', not a CREATE EXTERNAL TABLE statement with Delta format (which is not supported for external tables in serverless SQL pool).

Practice this question →

81

Multi-Selectmedium

Which THREE options are valid ways to transform data in Azure Synapse Analytics?

Select 3 answers

A.Use Power Query online in Synapse pipelines.

B.Use T-SQL scripts in a dedicated SQL pool.

C.Use Mapping Data Flows in Synapse pipelines.

D.Use Spark notebooks in Synapse Spark pools.

E.Use Azure Machine Learning pipelines for data wrangling.

AnswersB, C, D

T-SQL is a primary way to transform data in Synapse SQL pools.

Why this answer

Option B is correct because T-SQL scripts are a native and primary method for transforming data within a dedicated SQL pool in Azure Synapse Analytics. You can use CREATE TABLE AS SELECT (CTAS), INSERT...SELECT, and other T-SQL statements to perform complex transformations like aggregations, joins, and data cleansing directly on the distributed data, leveraging the MPP (Massively Parallel Processing) engine for high performance.

Exam trap

The trap here is that candidates often confuse Power Query Online (a Power BI/ADF feature) with Mapping Data Flows (a Synapse pipeline activity), or assume Azure Machine Learning pipelines are valid for data wrangling in Synapse, when in fact they are separate services for ML lifecycle management.

Practice this question →

82

Multi-Selecthard

Which THREE considerations are important when designing a data processing solution using Azure Databricks for ETL workloads? (Select three.)

Select 3 answers

A.Auto-scaling of compute resources

B.Use of Delta Lake for data reliability

C.Fixed pricing model

D.Minimum network bandwidth

E.Cluster configuration and autoscaling

AnswersA, B, E

Auto-scaling helps handle variable workloads efficiently.

Why this answer

Option A is correct because cluster configuration affects performance and cost. Option C is correct because Delta Lake provides ACID transactions and optimization. Option E is correct because auto-scaling adjusts resources based on workload.

Option B is wrong because fixed pricing is not a design consideration. Option D is wrong because minimum bandwidth is not a typical consideration.

Practice this question →

83

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must process streaming data from Azure Event Hubs and store the results in a dedicated SQL pool. The solution must support exactly-once semantics and handle late-arriving data. Which Azure service should you use to implement this solution?

A.Azure Data Factory with tumbling window trigger.

B.Azure Stream Analytics.

C.Azure Functions with Event Hubs trigger.

D.Azure HDInsight Spark Structured Streaming.

AnswerB

Azure Stream Analytics provides exactly-once semantics and handles late-arriving data.

Why this answer

Azure Stream Analytics is the correct choice because it natively integrates with Azure Event Hubs and dedicated SQL pools, supports exactly-once semantics through checkpointing and output deduplication, and provides built-in handling for late-arriving data via configurable late arrival tolerance windows and out-of-order event policies.

Exam trap

The trap here is that candidates often confuse batch-oriented services like Azure Data Factory with streaming solutions, or assume that any event-driven compute (like Azure Functions) can provide exactly-once semantics and late-arriving data handling without understanding the specialized streaming engine requirements.

How to eliminate wrong answers

Option A is wrong because Azure Data Factory with a tumbling window trigger is a batch-oriented orchestration service that cannot process streaming data in real time; it lacks native support for exactly-once semantics in streaming contexts and cannot handle late-arriving data with event-time ordering. Option C is wrong because Azure Functions with an Event Hubs trigger processes events individually or in small batches, does not provide built-in exactly-once output guarantees to a dedicated SQL pool, and lacks native support for late-arriving data handling such as watermarking or out-of-order policies. Option D is wrong because Azure HDInsight Spark Structured Streaming requires significant manual configuration for exactly-once semantics (e.g., managing checkpoint locations and idempotent sinks) and does not offer the same level of integrated, low-latency output to dedicated SQL pools as Azure Stream Analytics; it also adds operational overhead for cluster management.

Practice this question →

84

MCQeasy

You have an Azure Data Factory pipeline that uses a Copy activity to move data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails intermittently with a timeout error. You need to improve the reliability of the data transfer. Which configuration change should you make?

A.Use staged copy with an intermediate Azure Blob Storage.

B.Use PolyBase as the sink.

C.Enable fault tolerance and configure skip incompatible rows.

D.Increase the retry count in the pipeline activity.

AnswerC

This allows the copy to continue even if some rows fail, improving reliability.

Why this answer

Option C is correct because enabling fault tolerance and configuring 'skip incompatible rows' allows the Copy activity to continue processing even when some rows cause errors (e.g., type conversion failures), which can manifest as timeouts when the activity repeatedly retries the same problematic rows. This setting improves reliability by skipping rows that cannot be copied, preventing the entire pipeline from failing on intermittent data issues.

Exam trap

The trap here is that candidates confuse 'fault tolerance' with 'retry policy,' assuming that increasing retries is the only way to handle failures, whereas fault tolerance addresses row-level errors that cause timeouts without requiring a full activity restart.

How to eliminate wrong answers

Option A is wrong because using staged copy with an intermediate Azure Blob Storage is designed to improve performance for large data transfers or to enable PolyBase, not to address timeout errors caused by incompatible rows or transient failures. Option B is wrong because PolyBase is a sink for loading data into Azure Synapse Analytics (SQL Data Warehouse), not for Azure Blob Storage, and it does not resolve timeout errors in a Copy activity. Option D is wrong because increasing the retry count only re-executes the entire activity on failure, which can exacerbate timeout issues if the root cause is incompatible rows or data skew, leading to longer execution times without addressing the underlying problem.

Practice this question →

85

MCQmedium

You are using Azure Data Lake Storage Gen2 as the data lake for your organization. You need to process files in the 'incoming' folder using a scheduled Azure Databricks notebook. After processing, the files should be moved to the 'processed' folder. The files are large (up to 10 GB) and you want to minimize the time to move them. Which approach should you use?

A.Use the Azure Databricks dbutils.fs.mv() to move the file.

B.Use Azure Data Factory with a Copy activity to move the file, then delete the source.

C.Change the file's metadata to update its directory path.

D.Use the Azure Databricks dbutils.fs.cp() to copy the file to the processed folder, then delete the original.

AnswerA

The move operation in ADLS Gen2 is a metadata rename, which is instantaneous.

Why this answer

Option A is correct because `dbutils.fs.mv()` performs a metadata-only rename operation on Azure Data Lake Storage Gen2, which is instantaneous regardless of file size. This avoids any data movement, making it the fastest approach for moving large files (up to 10 GB) between folders within the same storage account.

Exam trap

The trap here is that candidates often assume moving large files requires copying, but Azure Data Lake Storage Gen2's hierarchical namespace enables instant metadata-only renames, making `dbutils.fs.mv()` the optimal choice.

How to eliminate wrong answers

Option B is wrong because Azure Data Factory Copy activity physically copies the file data across folders, which is slower and incurs additional read/write costs, even though it can delete the source afterward. Option C is wrong because Azure Data Lake Storage Gen2 does not support moving files by changing metadata; directory paths are part of the file's hierarchical namespace and cannot be updated via metadata alone. Option D is wrong because `dbutils.fs.cp()` performs a full data copy, which is time-consuming for large files, and then requires an explicit delete, adding unnecessary overhead compared to a rename.

Practice this question →

86

MCQhard

You are optimizing an Azure Synapse Analytics dedicated SQL pool. A fact table named Sales is partitioned by date and has a hash-distributed column ProductKey. You notice that queries filtering on OrderDate are performing poorly. You need to improve query performance for date range filters. What should you do?

A.Ensure the table is partitioned on the same column used in WHERE clause (OrderDate) and enable partition elimination.

B.Create a non-clustered index on OrderDate.

C.Add ProductKey as a distribution column alongside the existing hash.

D.Change the distribution to round-robin to spread data evenly.

AnswerA

Partition elimination reduces the data scanned by skipping partitions not matching the filter.

Why this answer

Option A is correct because Partition Elimination allows the engine to skip irrelevant partitions. Option B is wrong because round-robin distribution does not improve partition elimination. Option C is wrong because increasing distribution columns may not help with date filtering.

Option D is wrong because indexes are not the primary solution for partition pruning.

Practice this question →

87

MCQhard

You are analyzing a Kusto query in Azure Data Explorer that calculates total sales per product for January 2024 and filters for products with sales over 10,000. The query uses the materialize() function. You notice that the query runs slower than expected. What is the primary reason the materialize() function may not be providing the expected performance benefit in this query?

A.The join with the Products table forces a shuffle that bypasses the materialized result

B.The query uses summarize, which already materializes results internally

C.The datetime range filter is not sargable, causing full table scan

D.The materialize() result is referenced only once in the query, so materialization adds unnecessary overhead

AnswerD

materialize() caches the result; if used once, caching is wasted.

Why this answer

Option B is correct because materialize() only helps if the result is referenced multiple times; in this query, TotalSales is used only once, so materialize() adds overhead without benefit. Option A is wrong because the query is already using summarize. Option C is wrong because datetime filtering is fine.

Option D is wrong because order by does not conflict with materialize.

Practice this question →

88

Multi-Selecteasy

You are optimizing a Spark DataFrame transformation in Azure Synapse Analytics. The DataFrame has 20 columns and 100 million rows. You notice that the job is slow due to many small files being written to the output. Which two actions can you take to reduce the number of output files? (Choose two.)

Select 2 answers

A.Use coalesce() to reduce the number of partitions without a shuffle.

B.Enable caching on the DataFrame before writing.

C.Apply bucketing on a column to group data.

D.Increase the number of partitions using repartition() with a larger number.

E.Use repartition() with a smaller number of partitions.

AnswersA, E

Coalesce reduces partitions and thus output files, minimizing shuffle.

Why this answer

Option A is correct because `coalesce()` reduces the number of partitions without triggering a full shuffle, which minimizes the number of output files while preserving performance. Since the DataFrame already has 100 million rows and 20 columns, coalescing to fewer partitions directly reduces the number of files written, addressing the small-file problem efficiently.

Exam trap

The trap here is that candidates often confuse `coalesce()` with `repartition()`, assuming both cause a shuffle, or they mistakenly think increasing partitions (Option D) will improve performance when it actually exacerbates the small-file issue.

Practice this question →

89

MCQmedium

You are designing a data transformation solution for a retail company. The company receives daily CSV files from 200 stores via SFTP. The files must be cleaned, validated, and aggregated before loading into Azure Synapse dedicated SQL pool. The solution must minimize administrative overhead and support easy monitoring. Which approach do you recommend?

A.Use Azure Functions to process each file and write to Synapse via REST API

B.Use PolyBase external tables to load raw data and then use T-SQL stored procedures for transformation

C.Use Azure Databricks with Python notebooks to process the files and write to Synapse

D.Use Azure Data Factory with Mapping Data Flows to clean, validate, and aggregate the data, then load into Synapse SQL pool

AnswerD

Mapping Data Flows provide a visual interface for transformations, are serverless, and have rich monitoring via ADF.

Why this answer

Option D is correct because Azure Data Factory (ADF) with Mapping Data Flows provides a fully managed, code-free ETL service that can read CSV files from SFTP, perform cleaning, validation, and aggregation at scale using Spark clusters, and load the results directly into Azure Synapse dedicated SQL pool via the PolyBase sink. This minimizes administrative overhead by eliminating infrastructure management and supports easy monitoring through ADF’s built-in integration with Azure Monitor and pipeline run views.

Exam trap

The trap here is that candidates often overestimate the simplicity of Azure Functions for batch ETL or assume PolyBase alone handles transformations, when in fact ADF Mapping Data Flows are purpose-built for visual, scalable, and monitorable ETL with minimal overhead.

How to eliminate wrong answers

Option A is wrong because Azure Functions are stateless, event-driven compute units that lack native connectors for SFTP and Synapse, requiring custom code for file parsing, state management, and batch loading, which increases administrative overhead and complexity. Option B is wrong because PolyBase external tables can only load raw data into staging tables, but the transformation logic (cleaning, validation, aggregation) would need to be implemented in T-SQL stored procedures, which are harder to monitor, scale, and maintain compared to a visual ETL tool like ADF. Option C is wrong because Azure Databricks with Python notebooks introduces significant administrative overhead for cluster management, notebook orchestration, and monitoring, and requires more specialized skills than ADF’s low-code Mapping Data Flows, making it less suitable for minimizing overhead.

Practice this question →

90

MCQhard

You are troubleshooting a Synapse Spark notebook that fails when reading Parquet files from Azure Data Lake Storage Gen2. The error message indicates 'Permission denied'. The notebook uses a managed identity (System-assigned) for authentication. The Data Lake Storage account has a firewall enabled with 'Allow Azure services on the trusted services list' turned on. The storage account's RBAC role assignments include 'Storage Blob Data Contributor' for the managed identity. What is the most likely cause of the failure?

A.Parquet files require special permissions that are not granted by RBAC roles

B.The managed identity has not been granted the 'Storage Blob Data Reader' role in addition to 'Storage Blob Data Contributor'

C.The storage account firewall does not have a 'Resource instances' exception for the managed identity

D.The notebook is using an incorrect connection string with account key

AnswerC

Firewall rules require explicit addition of the managed identity as a resource instance to allow access when the firewall is enabled.

Why this answer

Option C is correct because when a storage account firewall is enabled with 'Allow Azure services on the trusted services list' turned on, it only allows trusted Azure platform services to access the storage account, but it does not automatically grant access to a specific managed identity. To allow a managed identity to bypass the firewall, you must add a 'Resource instances' exception for that managed identity's resource (e.g., the Synapse workspace). Without this explicit exception, the managed identity's request is blocked by the firewall, resulting in a 'Permission denied' error even though the RBAC role assignment is correct.

Exam trap

The trap here is that candidates assume 'Allow Azure services on the trusted services list' automatically includes all Azure resources with managed identities, but it only covers specific Azure platform services, not custom managed identities from services like Synapse.

How to eliminate wrong answers

Option A is wrong because Parquet files do not require special permissions beyond what RBAC roles provide; RBAC roles like 'Storage Blob Data Contributor' grant sufficient permissions to read and write Parquet files. Option B is wrong because 'Storage Blob Data Contributor' already includes all permissions of 'Storage Blob Data Reader' (read, write, delete), so adding the reader role is redundant and not the cause of the failure. Option D is wrong because the notebook uses a managed identity for authentication, not a connection string with an account key; the error is about permission denied, not an incorrect connection string or key.

Practice this question →

91

Multi-Selectmedium

You are implementing a data processing solution using Azure Data Factory. You have a pipeline that copies data from Azure Blob Storage to Azure Data Lake Storage Gen2. You need to ensure that the copy activity uses managed identity for authentication and that the data is transferred securely. Which TWO configurations should you apply?

Select 2 answers

A.Set the copy activity's 'enableDataIntegrityValidation' to true.

B.Set the copy activity's 'UseManagedIdentity' property to true.

C.Configure the source and sink linked services to use managed identity authentication.

D.Set the integration runtime to 'AutoResolveIntegrationRuntime' to ensure data stays within Azure.

E.Enable staged copy with a temporary blob store to improve security.

AnswersC, D

Linked services support managed identity for authentication.

Why this answer

Options B and D are correct. Option B is correct because managed identity authentication is configured in the linked service. Option D is correct because setting integration runtime to 'AutoResolveIntegrationRuntime' uses Azure's network, ensuring data transfer remains within Azure and does not go over public internet (if using managed VNet).

Option A is wrong because the copy activity does not have a 'UseManagedIdentity' setting; it's in the linked service. Option C is wrong because HTTPS is already used by default; setting it explicitly is not required. Option E is wrong because staging is not needed for security.

Practice this question →

92

Multi-Selecteasy

Which TWO of the following are supported sources for Azure Data Factory Copy activity? (Choose two.)

Select 2 answers

A.Power BI Dataset

B.Azure DevOps

C.Amazon S3

D.Azure Blob Storage

E.Azure Analysis Services

AnswersC, D

Amazon S3 is supported via the Amazon S3 connector.

Why this answer

Option A is correct because Azure Blob Storage is a supported source. Option C is correct because Amazon S3 is a supported source via a connector. Option B is wrong because Power BI is not a data source for Copy activity.

Option D is wrong because Azure Analysis Services is not a supported source. Option E is wrong because Azure DevOps is not a data store.

Practice this question →

93

MCQmedium

You have a dedicated SQL pool in Azure Synapse that stores a fact table with over 100 billion rows. Query performance is degrading over time. You notice that the table is hash-distributed on a column with many duplicate values. What is the most likely impact?

A.Statistics on the table are outdated.

B.The table is not properly partitioned.

C.Data compression is not working efficiently.

D.Data is unevenly distributed across distributions, causing some distributions to be overloaded.

AnswerD

Duplicate values in hash column cause skew.

Why this answer

D is correct because a hash-distributed table with a column that has many duplicate values leads to data skew. When the hash function maps many rows to the same distribution, some distributions become overloaded with data while others are underutilized. This imbalance causes query performance to degrade as the overloaded distributions become bottlenecks for processing.

Exam trap

The trap here is that candidates often confuse distribution skew with partitioning or statistics issues, but the key clue is the mention of 'many duplicate values' in the hash-distributed column, which directly points to data skew as the root cause.

How to eliminate wrong answers

Option A is wrong because outdated statistics can cause suboptimal query plans, but the primary issue described is data skew due to hash distribution on a column with many duplicates, not statistics freshness. Option B is wrong because partitioning is a separate concept from distribution; while partitioning can help with partition elimination, it does not address the fundamental data skew caused by hash distribution on a high-duplicate column. Option C is wrong because data compression efficiency is affected by data patterns and storage, not directly by distribution skew; compression works at the page level and is not the root cause of query performance degradation from uneven distribution.

Practice this question →

94

MCQhard

You have a pipeline in Azure Data Factory that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails with a 'Connection closed' error after 1 hour. The data volume is 500 GB over a slow network. What is the most likely cause and solution?

A.The source SQL Server query command timeout is set to 1 hour; increase the command timeout or use staging copy.

B.The network bandwidth is insufficient; use a faster network connection.

C.The copy activity is using a single parallel copy; increase the degree of copy parallelism.

D.The Self-Hosted Integration Runtime (SHIR) has a time-to-live (TTL) setting that is too short; extend the TTL.

AnswerA

The default command timeout for SQL source is 30 minutes, but if set to 1 hour, the connection closes after that. Increasing timeout or using staging copy resolves it.

Why this answer

The 'Connection closed' error after exactly 1 hour indicates the SQL Server source query command timeout (default 60 minutes) is being hit. When copying 500 GB over a slow network, the query execution exceeds this timeout, causing the connection to drop. Increasing the command timeout or using staging copy (which breaks the query into chunks) resolves this by allowing the query to run longer or reducing the per-query duration.

Exam trap

The trap here is that candidates confuse a network bandwidth issue (Option B) with a command timeout, but the exact 1-hour failure is a dead giveaway of the default SQL command timeout, not a generic connectivity problem.

How to eliminate wrong answers

Option B is wrong because insufficient bandwidth would cause slow transfer or eventual timeout, but not a precise 1-hour 'Connection closed' error; the error is tied to a timeout setting, not bandwidth. Option C is wrong because increasing parallelism improves throughput but does not address the root cause of the source query timing out after a fixed duration. Option D is wrong because the Self-Hosted Integration Runtime TTL controls how long the SHIR stays active between activities, not the duration of a single copy operation; the error occurs during the copy, not after idle time.

Practice this question →

95

MCQeasy

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline uses a self-hosted integration runtime and runs successfully during business hours. However, after a recent network security update, the pipeline fails with a connection error to the on-premises SQL Server. What is the most likely cause?

A.A firewall rule on the on-premises SQL Server is blocking the self-hosted integration runtime.

B.The Azure Blob Storage account has been moved to a different subscription.

C.The self-hosted integration runtime node has been de-registered from Azure Data Factory.

D.The blob container has reached its maximum capacity.

AnswerA

Network security updates often change firewall rules, blocking the IR.

Why this answer

The self-hosted integration runtime (SHIR) connects to on-premises SQL Server via TCP port 1433 by default. A recent network security update likely added a firewall rule on the SQL Server or the on-premises network that blocks outbound or inbound traffic on this port, preventing the SHIR from establishing a connection. Since the pipeline ran successfully before the update, the most probable cause is a new firewall restriction targeting the SHIR's IP address or subnet.

Exam trap

The trap here is that candidates may confuse a source-side connectivity failure with a sink-side or authentication issue, but the question explicitly states a 'connection error to the on-premises SQL Server,' which points directly to network or firewall blocking at the source, not to storage capacity or SHIR registration status.

How to eliminate wrong answers

Option B is wrong because moving an Azure Blob Storage account to a different subscription does not affect the connectivity between the on-premises SHIR and the on-premises SQL Server; it would only change the storage account's resource ID and require updating linked service credentials, not cause a connection error to SQL Server. Option C is wrong because if the SHIR node were de-registered, the pipeline would fail with an integration runtime not found error, not a connection error to the on-premises SQL Server; the error message would reference the SHIR status, not a network-level timeout or refused connection. Option D is wrong because a blob container reaching maximum capacity (5 TB per container) would result in a storage write error (e.g., 403 or 409) when copying data, not a connection error to the on-premises SQL Server; the error would occur at the sink, not the source.

Practice this question →

96

MCQmedium

Your company is building a real-time dashboard for monitoring website traffic. The data is ingested from web servers into Azure Event Hubs. You need to design a stream processing solution using Azure Stream Analytics that computes the number of unique visitors per minute, per country, and outputs the results to Azure Synapse Analytics for reporting. The solution must handle out-of-order events with a maximum late arrival of 10 seconds. You also need to ensure exactly-once semantics for the output. Which combination of settings should you use?

A.Use a Sliding window of length 1 minute, set out-of-order policy to 10 seconds, and use 'At least once' output mode.

B.Use a Session window with timeout of 1 minute and maximum duration of 10 seconds, set late arrival to 10 seconds, and use 'Exactly once' output mode.

C.Use a Hopping window of size 1 minute and hop 1 minute, set late arrival policy to 10 seconds, and use 'Exactly Once' output mode.

D.Use a Tumbling window of size 1 minute, set late arrival policy to 10 seconds, and configure output to Azure Synapse Analytics with `Exactly once` semantics.

AnswerD

Tumbling window aligns with fixed intervals; late arrival policy handles out-of-order events; Azure Synapse Analytics supports exactly-once with proper configuration.

Why this answer

Option D is correct because a Tumbling window counts events per minute, and the late arrival policy with a 10-second window handles out-of-order events. Exactly-once semantics require that output format is Delta Lake or similar with Idempotent writes; however, Azure Synapse Analytics supports exactly-once with appropriate configuration. Option A is wrong because Hopping windows are for overlapping windows.

Option B is wrong because Sliding windows are for point-in-time counts. Option C is wrong because Session windows are for periods of inactivity.

Practice this question →

97

MCQmedium

Your organization uses Azure Synapse Analytics to run large-scale ETL jobs. A pipeline that loads data from Azure Data Lake Storage Gen2 into a dedicated SQL pool is failing with 'Out of memory' errors during the staging step. The source data is 500 GB in size, and the SQL pool is currently set to DW500c. What should you do to resolve the issue without changing the pipeline logic?

A.Use PolyBase instead of the Copy activity.

B.Switch the source to Azure Data Lake Storage Gen1.

C.Increase the DWU of the dedicated SQL pool to a higher tier (e.g., DW1000c).

D.Partition the source data into smaller files.

AnswerC

Higher DWU provides more memory per distribution, resolving OOM errors.

Why this answer

The 'Out of memory' error during the staging step indicates that the dedicated SQL pool lacks sufficient memory resources to handle the 500 GB data load within the current DWU setting. Increasing the DWU (Data Warehouse Units) from DW500c to a higher tier, such as DW1000c, scales up the memory and compute resources available to the pool, resolving the memory pressure without altering the pipeline logic or source data.

Exam trap

The trap here is that candidates may assume partitioning the source data is a 'logical' fix without realizing it requires pipeline logic changes, or they may confuse PolyBase as a memory-saving alternative rather than a loading protocol that still relies on the pool's resources.

How to eliminate wrong answers

Option A is wrong because PolyBase is not a separate activity but a technology used within the Copy activity or other loading methods; switching to PolyBase does not inherently increase memory and would still fail under the same resource constraints. Option B is wrong because switching the source to Azure Data Lake Storage Gen1 does not affect the memory allocation of the dedicated SQL pool and introduces no performance benefit for the staging step. Option D is wrong because partitioning the source data into smaller files would require changing the pipeline logic (e.g., adding loops or multiple activities) to process files sequentially, which violates the constraint of not changing pipeline logic.

Practice this question →

98

MCQhard

Refer to the exhibit. The pipeline fails with a 'Type mismatch' error. The source file has a column 'Name' of type String, and the destination table expects 'FullName' of type String. What is the most likely cause of the failure?

A.The recursive setting on the source store is causing duplicate reads.

B.The source file does not contain a column named 'Name'.

C.The source file has leading/trailing whitespace in the column names.

D.The sink column 'FullName' is expecting a different data type than String.

AnswerB

The column mapping references a source column 'Name', but if the actual file has a different column header, the copy activity fails with type mismatch.

Why this answer

The pipeline fails with a 'Type mismatch' error because the source file does not contain a column named 'Name'. When the source dataset is configured to read a 'Name' column but the actual file lacks that column, Azure Data Factory or Synapse Pipelines cannot map it to the sink column 'FullName', resulting in a type mismatch error. The error message is misleading because the mismatch is not about data types but about missing source columns, which the service treats as a type incompatibility.

Exam trap

The trap here is that candidates assume 'Type mismatch' always refers to incompatible data types (e.g., String vs. Int), when in fact it can also be triggered by a missing source column that the pipeline expects to map, causing a schema-level mismatch.

How to eliminate wrong answers

Option A is wrong because the recursive setting on the source store controls whether subfolders are included in the read operation; it does not cause duplicate reads that would lead to a type mismatch error. Option C is wrong because leading/trailing whitespace in column names would cause a different error (e.g., column not found) but not a 'Type mismatch' error, and the pipeline would still attempt to map the column if the name matched after trimming. Option D is wrong because the sink column 'FullName' is of type String, matching the source column 'Name' type String, so there is no data type mismatch between the two; the error arises from the missing source column, not from incompatible types.

Practice this question →

99

MCQhard

You are optimizing a data pipeline in Azure Synapse Analytics that loads data from a CSV file in ADLS Gen2 into a dedicated SQL pool using PolyBase. The load is slow and you need to improve performance. Which action would be MOST effective?

A.Increase the service level (DWU) of the dedicated SQL pool.

B.Change the file format from CSV to Avro.

C.Combine the CSV files into fewer, larger files before loading.

D.Use Azure Data Factory to stage the data in Azure Blob Storage before loading.

AnswerC

PolyBase performs more efficiently with fewer, larger files due to reduced metadata operations.

Why this answer

Combining many small CSV files into fewer, larger files reduces the number of file open/close operations and minimizes the overhead of PolyBase's external file enumeration and parallel split logic. PolyBase performs best when each file is at least 256 MB, as it can then assign a full file to each distribution, avoiding the overhead of splitting tiny files across multiple threads.

Exam trap

The trap here is that candidates assume scaling up (DWU) or changing file formats always improves performance, but the exam specifically tests the understanding that PolyBase's parallel processing is most efficient when file sizes align with distribution boundaries, making file consolidation the most effective optimization.

How to eliminate wrong answers

Option A is wrong because increasing DWU scales resources but does not address the root cause of slow PolyBase loads—file fragmentation and metadata overhead—and may incur unnecessary cost. Option B is wrong because changing to Avro improves compression and schema evolution but does not reduce the number of file operations; the performance gain from file count reduction is more direct. Option D is wrong because staging in Azure Blob Storage adds an extra copy step without reducing the number of files PolyBase must process; the bottleneck remains the file enumeration and split overhead.

Practice this question →

100

MCQhard

You are designing a data processing solution for a financial services company. They need to process sensitive customer data in Azure Databricks while complying with GDPR. The data must be encrypted at rest and in transit, and access must be audited. You need to recommend a configuration that meets these requirements. Which combination of actions should you take?

A.Enable Azure Storage Service Encryption (SSE) with platform-managed keys and configure Azure Databricks to use a VNet injection.

B.Apply an Azure Policy to require encryption and assign a built-in GDPR blueprint.

C.Use Azure Security Center to monitor for data exposure and enable Azure Defender for Storage.

D.Enable double encryption using Azure Key Vault with customer-managed keys and enable Azure Databricks audit logs.

AnswerD

Double encryption provides encryption at rest and in transit; customer-managed keys give control; audit logs track access.

Why this answer

Option B is correct because enabling double encryption and using Azure Key Vault for encryption keys ensures encryption at rest and in transit; Azure Databricks audit logs capture access. Option A (Azure Storage Service Encryption) encrypts at rest but not necessarily with customer-managed keys. Option C (Azure Policy) enforces compliance but does not directly encrypt data.

Option D (Azure Security Center) provides monitoring but not encryption.

Practice this question →

101

MCQmedium

You are building a data pipeline in Azure Synapse Analytics that ingests JSON files from Azure Data Lake Storage Gen2, transforms the data using a mapping data flow, and loads it into a dedicated SQL pool. The pipeline must support incremental loads based on a LastModified timestamp in the source files. Which configuration should you use in the mapping data flow source transformation?

A.Use a 'Column pattern' to detect new columns and filter rows.

B.Configure 'Row sampling' settings to take only new rows.

C.Set 'Source' to 'New rows only' in the sink transformation.

D.Enable 'Incremental extraction' and specify the timestamp column for filtering.

AnswerD

Incremental extraction filters rows based on a watermark column like LastModified.

Why this answer

Option C is correct because the 'Incremental extraction' option in a mapping data flow source transformation allows you to filter rows based on a last modified timestamp column, enabling incremental loads. Option A is wrong because 'Column pattern' is used for detecting changes in column names, not row-level timestamps. Option B is wrong because 'Row sampling' selects a random subset of rows, not based on timestamps.

Option D is wrong because 'Source metadata' in the sink transformation is unrelated to incremental loading.

Practice this question →

102

MCQhard

Your team uses Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query performance is inconsistent, and some queries take a long time to execute. You need to improve query performance. What should you do?

A.Increase the MAXDOP setting in the query

B.Create statistics on the columns used in joins and filters

C.Move the data to a dedicated SQL pool

D.Convert the Parquet files to CSV format

AnswerB

Statistics help the optimizer generate efficient plans.

Why this answer

Option B (Create statistics on the columns used in joins and filters) is correct because serverless SQL pool relies on statistics for optimal query plans. Option A (Increase the maximum degree of parallelism) is not directly applicable. Option C (Convert to CSV) would degrade performance.

Option D (Use a dedicated SQL pool) may be an option but not the best immediate step.

Practice this question →

103

MCQmedium

A data engineering team is building a batch processing solution for a financial services company. Data is ingested daily from multiple sources into Azure Data Lake Storage Gen2 in CSV format. The data must be transformed (filtered, aggregated, joined) and loaded into Azure Synapse Analytics dedicated SQL pool. The team must optimize for cost and performance. The total data volume is 2 TB per day. The team has the following options: Option A: Use Azure Data Factory pipelines with copy activity to load raw CSV files into Synapse staging tables, then use T-SQL stored procedures in Synapse to perform transformations. Option B: Use Azure Databricks with Auto Loader to incrementally ingest CSV files, perform transformations in Spark, and write the results to Synapse using the Spark Synapse connector. Option C: Use Azure Data Factory with mapping data flows to transform the data in a serverless environment and then write to Synapse. Option D: Use Azure Synapse Pipelines (built on ADF) with a notebook activity that runs a PySpark notebook in Synapse Spark pool to transform and load data. Which option should the team choose to minimize cost and management overhead while meeting performance requirements?

A.Option B

B.Option C

C.Option A

D.Option D

AnswerB

Serverless, cost-effective, low overhead.

Why this answer

Option C is correct because mapping data flows run on a serverless ADF integration runtime, which scales automatically and incurs cost only per execution, minimizing management overhead. Option A requires managing staging tables and stored procedures. Option B requires managing Spark clusters.

Option D requires managing Synapse Spark pools. Mapping data flows are cost-effective for batch transformations of this size.

Practice this question →

104

MCQhard

A financial services firm uses Azure Synapse Analytics to process daily trade data. The data is stored in a dedicated SQL pool as partitioned tables by date. Recently, queries that filter on a specific date range have become slow. You suspect that partition pruning is not working effectively. What should you do to improve query performance?

A.Rebuild the columnstore indexes on the table

B.Convert the table to a rowstore heap

C.Create statistics on the date column

D.Increase the number of partitions for the table

AnswerA

Rebuilding reduces fragmentation and improves partition pruning.

Why this answer

Option D (Rebuild the columnstore indexes) is correct because fragmented columnstore indexes can cause poor partition pruning and overall slow query performance. Option A (Increase the number of partitions) may worsen performance due to partition overhead. Option B (Switch to rowstore) would degrade performance for large tables.

Option C (Create statistics on the date column) helps but is not the primary issue here; the problem is likely index fragmentation.

Practice this question →

105

MCQmedium

You are developing a data processing pipeline in Azure Data Factory. The pipeline must copy data from an on-premises SQL Server to Azure Blob Storage. The data volumes are large, and the network connection is unreliable. Which configuration should you use to ensure resilience and resume capability?

A.Use a staging copy with Azure Blob Storage as an intermediate store.

B.Configure the copy activity with fault tolerance and enable 'Skip incompatible rows'. Use a self-hosted integration runtime for the source.

C.Increase the 'parallelCopies' setting to 10 and use a staging table.

D.Use PolyBase to load data directly into Azure Synapse Analytics.

AnswerB

Fault tolerance allows the copy to continue despite errors, and self-hosted IR supports checkpointing for resume.

Why this answer

Option B is correct because using a self-hosted integration runtime enables connectivity to on-premises SQL Server, and enabling fault tolerance with 'Skip incompatible rows' allows the copy activity to continue despite transient failures or data inconsistencies. This configuration ensures resilience by automatically retrying failed operations and skipping rows that cause errors, which is critical for large data volumes over an unreliable network.

Exam trap

Microsoft often tests the misconception that staging or parallelization alone provides resilience, but the key is that fault tolerance with row-level skipping is the only option that directly handles data errors and network interruptions without failing the entire copy.

How to eliminate wrong answers

Option A is wrong because using Azure Blob Storage as an intermediate store (staging copy) adds unnecessary complexity and cost without addressing the core issue of network unreliability; it is typically used for cross-region or cross-account copies, not for resilience against transient failures. Option C is wrong because increasing 'parallelCopies' to 10 can improve throughput but does not provide fault tolerance or resume capability; it may even exacerbate failures by creating more concurrent connections that can fail independently. Option D is wrong because PolyBase is a technology for loading data into Azure Synapse Analytics, not for copying data to Azure Blob Storage, and it does not address the on-premises source or network resilience requirements.

Practice this question →

106

Multi-Selectmedium

You are optimizing an Azure Synapse Pipeline that processes large volumes of data using a Mapping Data Flow. The pipeline is taking too long due to data skew. Which THREE techniques can help mitigate data skew?

Select 3 answers

A.Use a broadcast hash join to distribute the smaller table to all nodes.

B.Increase the Data Integration Units (DIU) in the data flow.

C.Use the 'Optimize' tab in the data flow to enable skewed data optimization.

D.Apply a salting technique by adding a random salt to the join key.

E.Repartition the data on a different column using a random distribution.

AnswersA, C, D

Broadcast join avoids shuffling large data, reducing skew impact.

Why this answer

Option B (use skewed data optimization), Option D (broadcast hash join), and Option E (salting the key) are techniques to handle skew in data flows. Option A (increasing DIU) adds resources but doesn't address skew. Option C (repartition) can help but is not a direct skew mitigation; broadcast join is more effective.

Practice this question →

107

MCQeasy

A small business uses Azure SQL Database as their operational data store. They want to build a simple reporting solution that refreshes daily. The data volume is about 10 GB. They have limited budget and no dedicated data engineering team. They need to extract data from Azure SQL Database, perform basic transformations (e.g., aggregations, joins), and load into a reporting table in the same database. They want to minimize administrative effort and cost. Which approach should they take? Option A: Use Azure Data Factory with a copy activity to copy data from the source to a staging table, then use a stored procedure activity to run T-SQL transformations. Option B: Use Azure Logic Apps to periodically query the source and insert transformed data into the reporting table. Option C: Use Azure Databricks notebooks scheduled via Azure Data Factory to perform the ETL. Option D: Use Azure SQL Database elastic job to run a T-SQL script that performs the ETL within the database.

A.Option B

B.Option A

C.Option D

D.Option C

AnswerC

Simple, cost-effective, and fits requirements.

Why this answer

Option D is correct because using an Azure SQL Database elastic job to run a T-SQL script directly within the database is the simplest, lowest-cost, and lowest-administration approach for this scenario. It avoids external services, requires no data movement, and leverages existing SQL skills, making it ideal for a small business with limited budget and no dedicated data engineering team.

Exam trap

The trap here is that candidates often over-engineer the solution by choosing Azure Data Factory or Databricks, assuming that any ETL requires a dedicated orchestration or big-data tool, when in fact a simple in-database job is the most cost-effective and administratively lightweight option for small-scale, in-place transformations.

How to eliminate wrong answers

Option A is wrong because Azure Data Factory with a copy activity and stored procedure activity introduces unnecessary complexity, cost, and administrative overhead for a simple in-database ETL that could be done with a single T-SQL script. Option B is wrong because Azure Logic Apps is designed for lightweight orchestration and integration, not for performing complex T-SQL transformations or handling 10 GB of data efficiently; it would be slow, costly, and lack proper error handling for data processing. Option C is wrong because Azure Databricks is a powerful but overkill solution for a 10 GB workload; it requires cluster management, incurs significant cost, and demands specialized Spark knowledge, which contradicts the requirement to minimize administrative effort and cost.

Practice this question →

108

MCQhard

You are designing a real-time analytics solution for IoT devices that emit telemetry data every second. The data must be aggregated every minute and stored in Azure SQL Database for historical analysis. You need to minimize latency and operational overhead. Which approach should you recommend?

A.Use Azure Databricks with Structured Streaming to aggregate and write to SQL Database

B.Use Event Hubs Capture to store raw data in blob storage, then use Azure Data Factory to load into SQL Database hourly

C.Use Azure Stream Analytics with a tumbling window of 1 minute and output to Azure SQL Database

D.Use Azure Functions to process events and write to SQL Database

AnswerC

Minimal latency and operational overhead.

Why this answer

Option C is correct because Azure Stream Analytics natively supports real-time stream processing with tumbling windows, allowing you to aggregate IoT telemetry data every minute and output directly to Azure SQL Database with minimal latency. This approach avoids the overhead of managing clusters (Databricks) or orchestrating batch loads (Data Factory), directly meeting the requirement for low latency and operational simplicity.

Exam trap

The trap here is that candidates often over-engineer the solution by choosing Databricks (Option A) for its flexibility, overlooking that Stream Analytics is purpose-built for low-latency, windowed aggregations with minimal operational overhead, while Databricks adds unnecessary complexity for simple time-based aggregations.

How to eliminate wrong answers

Option A is wrong because Azure Databricks with Structured Streaming introduces significant operational overhead for cluster management and is overkill for simple minute-level aggregation, plus it adds latency from Spark job initialization and checkpointing. Option B is wrong because Event Hubs Capture stores raw data in blob storage, and using Azure Data Factory to load hourly into SQL Database introduces at least 60 minutes of latency, failing the real-time requirement. Option D is wrong because Azure Functions are stateless and event-driven, lacking built-in windowing capabilities for time-based aggregation, so you would need to implement custom state management (e.g., using Durable Functions or external storage), increasing complexity and latency.

Practice this question →

109

Multi-Selectmedium

You are building a real-time processing solution using Azure Stream Analytics. The solution must handle out-of-order events and late arrivals. Which THREE mechanisms should you configure in the Stream Analytics job?

Select 3 answers

A.Set an 'Out-of-order tolerance' window in the event ordering settings.

B.Adjust the 'Streaming units' to handle higher throughput.

C.Configure a 'Late-arrival tolerance' window.

D.Enable 'Event hub capture' to store raw events for reprocessing.

E.Choose an output adapter that supports exactly-once delivery.

AnswersA, C, E

This defines how late events can be reordered.

Why this answer

Option A is correct because Azure Stream Analytics allows you to configure an 'Out-of-order tolerance' window in the event ordering settings. This window defines the maximum time difference that out-of-order events can be reordered before being considered late. By setting this tolerance, you ensure that events arriving slightly out of sequence are still processed correctly, which is critical for real-time analytics where event order matters.

Exam trap

The trap here is that candidates confuse scaling mechanisms (like Streaming units) or storage features (like Event Hubs Capture) with event ordering controls, which are specifically designed to manage temporal anomalies in streaming data.

Practice this question →

110

MCQmedium

Your organization is using Azure Synapse Analytics dedicated SQL pool. You notice that queries are running slower than expected. Upon reviewing the execution plans, you see that some queries are performing table scans instead of seeks on large fact tables. What is the most likely cause?

A.The statistics on the tables are outdated or missing.

B.The tables are distributed using round-robin distribution.

C.Result-set caching is disabled.

D.The resource class for the user is set to smallrc.

AnswerA

Outdated or missing statistics can cause the optimizer to choose suboptimal access methods like table scans.

Why this answer

Outdated or missing statistics prevent the Azure Synapse Analytics dedicated SQL pool query optimizer from accurately estimating row counts and data distribution. Without reliable statistics, the optimizer may incorrectly choose a table scan over a more efficient index seek or partition elimination, leading to slower query performance on large fact tables.

Exam trap

The trap here is that candidates often confuse performance issues caused by distribution type or resource class with the optimizer's reliance on statistics, overlooking that even with optimal distribution and sufficient resources, stale statistics force scans instead of seeks.

How to eliminate wrong answers

Option B is wrong because round-robin distribution evenly distributes data across distributions without considering join keys, which can cause data movement but does not directly cause table scans instead of seeks; scans are a symptom of missing statistics or poor index usage. Option C is wrong because result-set caching only affects repeated execution of the same query by storing results, not the initial query plan choice between scan and seek. Option D is wrong because the resource class (e.g., smallrc) controls memory and concurrency slots for the user, not the query optimizer's decision to use scans versus seeks; scans occur regardless of resource class if statistics are stale.

Practice this question →

111

Multi-Selecthard

You are a Data Engineer at Contoso Ltd. The company operates an e-commerce platform that generates streaming data from user interactions (clicks, page views, purchases) at a rate of 50,000 events per second. The data is ingested into Azure Event Hubs. You need to design a data processing solution that meets the following requirements: 1. Real-time dashboards must display aggregated metrics (e.g., total sales, active users) with a latency of less than 2 seconds. 2. Historical data must be stored in Azure Data Lake Storage Gen2 in Delta Lake format for batch analytics. 3. The solution must support exactly-once semantics for the streaming output to both the dashboard and the data lake. 4. You must use Azure Stream Analytics as the primary processing engine. 5. The output to the dashboard should use Azure Power BI, while the output to the data lake should use Azure Synapse Analytics serverless SQL pool for querying. You have configured an Azure Stream Analytics job with Event Hubs as input. For output, you added a Power BI dataset and an Azure Data Lake Storage Gen2 output. However, you discover that the Power BI dataset is being updated with duplicate records, and the data lake output sometimes misses small windows of data during job restarts. Which two actions should you take to meet the requirements? (Choose two.) A. Change the Power BI output to use the 'Exactly once' output mode. B. Change the Event Hubs compatibility level to 1.0 to guarantee exactly-once. C. Enable checkpointing in the Stream Analytics job and use the 'Exactly once' output mode for the Azure Data Lake Storage Gen2 output. D. Use a second Stream Analytics job: one for Power BI with 'At least once' mode, and another for the data lake with 'Exactly once' mode.

Select 2 answers

A.Enable checkpointing in the Stream Analytics job and use the 'Exactly once' output mode for the Azure Data Lake Storage Gen2 output.

B.Change the Power BI output to use the 'Exactly once' output mode.

C.Change the Event Hubs compatibility level to 1.0 to guarantee exactly-once.

D.Use a second Stream Analytics job: one for Power BI with 'At least once' mode, and another for the data lake with 'Exactly once' mode.

AnswersA, B

Checkpointing prevents data loss; exactly-once mode ensures no duplicates.

Why this answer

Option A is correct because Power BI output supports 'Exactly once' mode to avoid duplicates. Option C is correct because enabling checkpointing ensures no data loss during restarts, and 'Exactly once' mode for ADLS Gen2 ensures exactly-once delivery. Option B is wrong because compatibility level 1.0 is outdated and does not guarantee exactly-once.

Option D is wrong because it adds complexity and does not solve the issue of duplicates on Power BI.

Practice this question →

112

MCQmedium

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. You notice that queries are slower than expected. The files are large (500 MB each) and not partitioned. You need to improve query performance without moving data. What should you do?

A.Convert the Parquet files to Delta format using Azure Databricks.

B.Change the files to CSV format with a header row.

C.Use OPENROWSET with BULK and specify ROWSET_OPTIONS for better performance.

D.Create external tables with explicit file format and partition elimination hints.

AnswerD

External tables with file format optimization can improve query performance by enabling metadata-based pruning.

Why this answer

Creating external tables with file format options allows the serverless SQL pool to use metadata for optimization. Changing the file format to CSV or using OPENROWSET without external tables does not improve performance as much. Converting to Delta format requires data movement.

Partitioning the files would be best, but that requires reorganizing files.

Practice this question →

113

MCQeasy

Your team is building a real-time dashboard using Azure Stream Analytics. The data source is an Azure Event Hub that receives clickstream events. You need to output aggregated data (counts per page per minute) to an Azure SQL Database for reporting. The query must handle late-arriving events and ensure exactly-once semantics. Which Stream Analytics feature should you use?

A.Use a temporal window function with a 'late arrival' policy specified in the query.

B.Use the Input Order section in the Stream Analytics job configuration to set a late arrival tolerance window.

C.Define a watermark in the query to specify a maximum out-of-order tolerance.

D.Set the event ordering policy to 'Adjust' to reorder events within a certain time window.

AnswerB

Input Order policy allows handling late events, and Stream Analytics ensures exactly-once delivery to SQL Database.

Why this answer

Option C is correct because the temporal window functions (e.g., TumblingWindow) with a late arrival policy and exactly-once semantics are built into Stream Analytics. Option A is wrong because watermarks are a concept in Spark, not Stream Analytics. Option B is wrong because the Input Order policy in Stream Analytics allows handling late events, but exactly-once semantics are guaranteed by the combination of checkpointing and output adapters.

Option D is wrong because the event ordering policy is for out-of-order events, not for late arrival.

Practice this question →

114

MCQhard

You are designing a data processing pipeline in Azure Synapse Analytics that reads streaming data from Azure Event Hubs, performs aggregations in real time, and writes results to Azure Cosmos DB for a dashboard. The data volume is 10,000 events per second with 2 KB each. The latency requirement is under 5 seconds from event ingestion to dashboard visibility. Which technology should you use for the real-time aggregation?

A.Azure Synapse Spark with Structured Streaming

B.Azure Stream Analytics

C.Azure Data Factory mapping data flows

D.Azure Synapse dedicated SQL pool with T-SQL queries

AnswerB

Sub-second latency, native Event Hubs and Cosmos DB connectors.

Why this answer

Azure Stream Analytics is the correct choice because it is a fully managed, real-time analytics service designed specifically for low-latency stream processing. It can ingest data from Azure Event Hubs, perform windowed aggregations (e.g., tumbling, hopping, sliding windows) with sub-second latency, and output directly to Azure Cosmos DB, meeting the 5-second latency requirement for the dashboard.

Exam trap

The trap here is that candidates often confuse Azure Synapse Spark Structured Streaming (which is micro-batch, not true streaming) with a real-time engine, or they assume Azure Data Factory can handle streaming data because it supports 'real-time' triggers, but it cannot perform in-flight aggregations with sub-second latency.

How to eliminate wrong answers

Option A is wrong because Azure Synapse Spark with Structured Streaming is a batch-micro-batch engine with higher latency (typically seconds to minutes) and is not optimized for sub-5-second real-time aggregation; it is better suited for complex transformations on large datasets. Option C is wrong because Azure Data Factory mapping data flows are designed for batch ETL/ELT operations on data at rest, not for real-time streaming ingestion or low-latency aggregations. Option D is wrong because Azure Synapse dedicated SQL pool with T-SQL queries is a massively parallel processing (MPP) data warehouse for analytical queries on stored data, not for real-time stream processing; it cannot directly ingest streaming data from Event Hubs and perform sub-5-second aggregations.

Practice this question →

115

MCQmedium

You are implementing a data pipeline in Azure Synapse Analytics that uses PolyBase to load data from Azure Data Lake Storage Gen2 into a dedicated SQL pool. The pipeline runs nightly and processes approximately 500 GB of data. You notice that the load operation is slow and frequently times out. What should you do to improve performance?

A.Create a columnstore index on the target table.

B.Use a clustered index on the target table.

C.Increase the resource class for the loading user.

D.Change the distribution of the target table to round-robin.

AnswerC

Higher resource class allocates more resources to the load operation.

Why this answer

Option C is correct because increasing the resource class in dedicated SQL pool provides more resources (CPU, memory) to the PolyBase load operation, which directly addresses the timeout issue. Option A is wrong because columnstore indexes are relevant for query performance, not load speed. Option B is wrong because round-robin distribution can improve load speed, but the primary issue is resource allocation.

Option D is wrong because the data type of the clustered index does not impact PolyBase load performance.

Practice this question →

116

Multi-Selectmedium

Which THREE factors should you consider when designing a real-time streaming solution using Azure Stream Analytics to process IoT data from thousands of devices?

Select 3 answers

A.The need to join input data with reference data that changes every few seconds.

B.The batch size for output to Azure Synapse Analytics to minimize write transactions.

C.The windowing function and late arrival policy for handling out-of-order events.

D.The latency requirements for writing output to Azure Cosmos DB for NoSQL.

E.The number of streaming units and partitioning scheme for the input Event Hubs.

AnswersB, C, E

Batching reduces transaction costs and improves throughput.

Why this answer

Option A is correct because output batching reduces the number of write operations. Option B is correct because partitioning the input stream improves parallel processing. Option D is correct because late arrival policies handle out-of-order events.

Option C is wrong because reference data is typically loaded once and does not require real-time joins. Option E is wrong because Azure Stream Analytics cannot directly write to Azure Cosmos DB for NoSQL without a custom output, and even then, it is not the primary consideration for real-time processing.

Practice this question →

117

MCQeasy

Your team is developing a data processing solution in Azure Synapse Analytics. You need to ensure that the solution can automatically scale compute resources based on workload demand for serverless SQL pools. Which feature should you configure?

A.Set a cache size for the serverless SQL pool

B.Configure a dedicated SQL pool with auto-scaling

C.Use workload classification to assign resources

D.Enable auto-resume and auto-pause on the serverless SQL pool endpoint

AnswerD

Serverless SQL pools automatically scale compute resources.

Why this answer

Option A is correct because serverless SQL pools automatically scale based on query demand without manual configuration. Option B is wrong because it's for dedicated SQL pools. Option C is wrong because serverless pools don't use explicit cache.

Option D is wrong because serverless pools do not support workload classification.

Practice this question →

118

MCQmedium

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails intermittently with timeout errors. You need to improve reliability. What should you do?

A.Use a self-hosted integration runtime with high availability

B.Enable fault tolerance and use staging

C.Change the source to Azure SQL Database

D.Increase the degree of copy parallelism

AnswerB

Fault tolerance allows the copy activity to retry on transient errors, and staging improves performance and reliability.

Why this answer

Option B is correct because enabling fault tolerance with staging in Azure Data Factory allows the copy activity to automatically retry transient failures (such as timeout errors) by staging intermediate data in Azure Blob Storage. This mechanism uses a two-phase commit approach: data is first written to a staging location, and then committed to the final sink only after successful validation, which isolates the pipeline from intermittent source or sink failures.

Exam trap

The trap here is that candidates often confuse high availability of the integration runtime (Option A) with fault tolerance of the copy activity, not realizing that HA only protects the IR nodes, not the data transfer itself.

How to eliminate wrong answers

Option A is wrong because using a self-hosted integration runtime with high availability improves the reliability of the integration runtime itself (e.g., node failures), but does not address timeout errors caused by the copy activity's interaction with the source or sink; it does not provide retry logic for data transfer failures. Option C is wrong because changing the source to Azure SQL Database does not resolve timeout errors in the copy activity; it merely shifts the source location, and the pipeline could still experience timeouts due to network latency or throttling. Option D is wrong because increasing the degree of copy parallelism can actually exacerbate timeout issues by overwhelming the source or sink with concurrent requests, and it does not provide any fault tolerance or retry mechanism for transient failures.

Practice this question →

119

MCQeasy

You need to incrementally load new and updated records from a source SQL Server database to Azure Synapse Dedicated SQL Pool. The source table has a LastModifiedDate column. Which Azure Data Factory feature should you use to implement incremental loading efficiently?

A.Alter Row transformation

B.Incremental copy (watermark) pattern using a Lookup activity and a Copy activity

C.Schedule trigger

D.Lookup activity alone

AnswerB

The watermark pattern uses a lookup to get the last watermark value and a copy activity to copy data changed after that watermark.

Why this answer

The incremental copy (watermark) pattern using a Lookup activity and a Copy activity is the correct approach because it allows you to query the source table for the maximum LastModifiedDate value (the watermark), store it in a control table or variable, and then use a Copy activity with a WHERE clause to load only rows where LastModifiedDate is greater than the last run's watermark. This pattern is purpose-built for efficiently handling new and updated records in Azure Data Factory without reprocessing the entire dataset.

Exam trap

The trap here is that candidates often confuse a scheduling mechanism (Schedule trigger) with the actual data processing logic required for incremental loads, or they mistakenly think a single activity like Lookup or Alter Row can handle the entire incremental copy workflow without understanding the need for a watermark pattern.

How to eliminate wrong answers

Option A is wrong because Alter Row transformation is a data flow transformation used to mark rows for insert, update, upsert, or delete in a sink, but it does not provide the incremental loading logic or watermark mechanism needed to identify new/updated rows from a source. Option C is wrong because a Schedule trigger only defines when a pipeline runs (e.g., every hour), but it does not implement the incremental copy logic itself; you still need the watermark pattern inside the pipeline. Option D is wrong because a Lookup activity alone can retrieve the watermark value but cannot copy data; it must be combined with a Copy activity to actually move the incremental rows.

Practice this question →

120

MCQmedium

You are designing a data pipeline in Azure Synapse Analytics to ingest data from Azure Blob Storage into a dedicated SQL pool. The source files are CSV with varying row lengths, and you need to ensure optimal performance for reads. Which file format and compression should you recommend?

A.Avro with Deflate compression

B.CSV with Gzip compression

C.Parquet with Snappy compression

D.ORC with Zlib compression

AnswerC

Parquet is columnar and Snappy provides fast compression/decompression, ideal for Synapse dedicated SQL pool.

Why this answer

Parquet with Snappy compression is optimal for dedicated SQL pools in Azure Synapse Analytics because Parquet is a columnar format that enables efficient predicate pushdown and column pruning, reducing I/O. Snappy provides fast compression/decompression with minimal CPU overhead, which is critical for high-throughput reads in a distributed MPP environment.

Exam trap

Microsoft often tests the misconception that row-based formats like Avro or CSV are suitable for analytical workloads, but the trap here is that columnar formats (Parquet/ORC) are required for optimal read performance in Synapse dedicated SQL pools, and Snappy is preferred over Zlib for speed-critical pipelines.

How to eliminate wrong answers

Option A is wrong because Avro is a row-based format that does not support columnar pruning, leading to higher I/O for analytical queries on dedicated SQL pools. Option B is wrong because CSV with Gzip compression is row-oriented and not splittable at the row level, causing poor parallelism and slower read performance in Synapse. Option D is wrong because ORC with Zlib compression offers higher compression ratios but significantly slower decompression compared to Snappy, which can bottleneck read performance in Synapse's MPP engine.

Practice this question →

121

Multi-Selecthard

Which THREE are required to run a stored procedure in Azure SQL Database from Azure Data Factory?

Select 3 answers

A.A linked service to Azure SQL Database.

B.A dataset that references the stored procedure.

C.A self-hosted integration runtime.

D.A staging blob storage account.

E.A stored procedure activity in the pipeline.

AnswersA, B, E

The linked service provides connection details.

Why this answer

A linked service to Azure SQL Database is required because it defines the connection string and authentication method (e.g., SQL authentication or managed identity) that Azure Data Factory (ADF) uses to connect to the database. Without this connection configuration, ADF cannot authenticate or communicate with the database to execute the stored procedure.

Exam trap

The trap here is that candidates often think a dataset is mandatory for any activity, but for stored procedure execution, the dataset is only needed if you want to pass parameters or map output, not for the execution itself.

Practice this question →

122

MCQhard

You are designing a data processing pipeline in Azure Synapse Analytics that uses a mapping data flow with Azure Integration Runtime (IR). The pipeline runs slowly and you notice that the IR's data movement is limited by the number of cores. Which configuration should you adjust to improve performance?

A.Enable staging for the copy activity within the data flow.

B.Increase the 'Data Flow Compute Type' and 'Core Count' in the Azure IR settings.

C.Use a Self-Hosted IR instead of Azure IR for data flows.

D.Increase the 'Number of partitions' in the source transformation.

AnswerB

These settings directly allocate more resources to mapping data flows.

Why this answer

The Azure Integration Runtime (IR) for mapping data flows uses a Spark cluster, and its performance is directly tied to the compute resources allocated. By increasing the 'Data Flow Compute Type' (e.g., from General Purpose to Memory Optimized) and the 'Core Count' (e.g., from 4 to 8 or 16 cores), you provide more parallel processing power, which directly addresses the core-limited data movement bottleneck.

Exam trap

The trap here is that candidates confuse the 'Number of partitions' setting (which controls data parallelism within the flow) with the Azure IR's core count (which controls the Spark cluster's overall compute capacity), leading them to pick D instead of B.

How to eliminate wrong answers

Option A is wrong because enabling staging for the copy activity is used to offload data movement to a staging blob store for PolyBase or COPY statement scenarios, not to increase the core count of the Azure IR for mapping data flows. Option C is wrong because Self-Hosted IR is designed for on-premises or private network data sources and does not improve the Spark-based compute performance of a mapping data flow; in fact, it adds network latency. Option D is wrong because increasing the 'Number of partitions' in the source transformation can improve parallelism within the data flow, but it does not change the underlying Azure IR's core count or compute type, which is the root cause of the core-limited bottleneck.

Practice this question →

123

MCQeasy

You are developing a data processing solution that requires aggregating sales data from multiple CSV files stored in Azure Data Lake Storage Gen2. The data should be cleansed and transformed before loading into Azure Synapse Analytics. Which Azure service should you use to implement a code-free transformation pipeline?

A.Azure HDInsight with Hive

B.Azure Analysis Services

C.Azure Data Factory with Mapping Data Flows

D.Azure Databricks with PySpark

AnswerC

Mapping Data Flows provide code-free data transformation at scale.

Why this answer

Azure Data Factory with Mapping Data Flows allows code-free visual transformations. Azure Databricks and HDInsight require code. Azure Analysis Services is for tabular modeling, not data processing.

Practice this question →

124

MCQeasy

Your company uses Azure Data Lake Storage Gen2 as the central data lake. You need to process batch data using serverless Spark jobs that can be scheduled daily. Which Azure service should you use?

A.Azure Batch with custom Spark containers.

B.Azure Synapse Analytics serverless Spark pool with pipelines.

C.Azure Databricks with a job cluster.

D.Azure Machine Learning with Spark compute.

AnswerB

Synapse provides serverless Spark pools with automatic scaling and built-in scheduling via pipelines.

Why this answer

Option C is correct because Azure Synapse Analytics provides serverless Spark pools with a built-in scheduler. Option A (Azure Databricks) is not serverless (requires cluster management). Option B (Azure Batch) is for custom compute jobs.

Option D (Azure Machine Learning) is for ML workloads.

Practice this question →

125

Multi-Selecteasy

Which TWO of the following are required components to set up a data pipeline that uses Change Data Capture (CDC) to incrementally load data from SQL Server to Azure Synapse using Azure Data Factory?

Select 2 answers

A.CDC enabled on the source SQL Server database and tables

B.A staging Azure Blob Storage account

C.A Lookup activity to get the last watermark

D.A stored procedure in the source database to capture changes

E.A linked service to the Azure Synapse dedicated SQL pool

AnswersA, E

CDC must be enabled on the source to track changes.

Why this answer

Option A is correct because Change Data Capture (CDC) must be enabled on the source SQL Server database and the specific tables you intend to track. Without CDC enabled, SQL Server will not generate the change tracking tables (e.g., cdc.<capture_instance>_CT) that Azure Data Factory’s CDC connector reads to identify inserts, updates, and deletes. This is a prerequisite for any incremental load using the native CDC mechanism in ADF.

Exam trap

The trap here is that candidates often confuse CDC-based incremental loading with watermark-based incremental loading, leading them to incorrectly select a Lookup activity (Option C) or a staging storage account (Option B) as required components.

Practice this question →

126

MCQmedium

You are building a real-time dashboard to monitor user activity on a website. The data is ingested via Azure Event Hubs and must be aggregated every minute with a 30-second late-arrival tolerance. The aggregated results should be stored in Azure Cosmos DB for low-latency reads. Which Azure service should you use to perform the windowed aggregation?

A.Azure Stream Analytics with a tumbling window of 1 minute and a late-arrival policy of 30 seconds.

B.Azure Functions triggered by Event Hubs to aggregate data and write to Cosmos DB.

C.Azure Databricks with structured streaming and a sliding window.

D.Azure Analysis Services to process streaming data directly from Event Hubs.

AnswerA

Stream Analytics provides built-in windowing functions and late-arrival handling, perfect for this scenario.

Why this answer

Azure Stream Analytics is the correct choice because it natively supports windowed aggregations (tumbling, hopping, sliding, session) and allows you to define a late-arrival policy to handle out-of-order events. A tumbling window of 1 minute with a late-arrival tolerance of 30 seconds meets the requirement exactly, and the output can be directly written to Azure Cosmos DB for low-latency reads.

Exam trap

The trap here is that candidates often confuse tumbling windows (fixed, non-overlapping) with sliding windows (continuous, overlapping) or assume that any compute service (like Functions or Databricks) can easily replicate Stream Analytics' built-in windowing and late-arrival handling, ignoring the complexity of state management and exactly-once semantics.

How to eliminate wrong answers

Option B is wrong because Azure Functions triggered by Event Hubs do not provide built-in windowing or late-arrival policy support; you would have to manually implement stateful aggregation, which is complex and error-prone. Option C is wrong because Azure Databricks with structured streaming uses a sliding window, not a tumbling window, and does not offer a native late-arrival policy configuration as simple as Stream Analytics; it also introduces unnecessary overhead for this real-time dashboard scenario. Option D is wrong because Azure Analysis Services is an OLAP engine for analytical queries on pre-aggregated data, not a real-time stream processing service; it cannot directly process streaming data from Event Hubs.

Practice this question →

127

Multi-Selecteasy

You are designing a data processing solution in Azure Data Factory that uses mapping data flows. You need to perform type conversions on incoming data. Which three transformations can be used to change data types? (Choose three.)

Select 3 answers

A.Conditional Split

B.Derived Column

C.Assert

D.Sort

E.Select

AnswersB, C, E

Derived Column allows type conversion via expressions.

Why this answer

Options A, B, and C are correct. Derived Column can change types via expressions. Select can cast columns during projection.

Assert can check and convert types. Option D is wrong because Sort only orders data. Option E is wrong because Conditional Split routes rows based on conditions, not type conversion.

Practice this question →

128

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must support incremental loading of data from an Azure SQL Database to a dedicated SQL pool using PolyBase. Which approach should you use to minimize data movement and maximize performance?

A.Use the bcp utility to export data from Azure SQL Database to a text file, then bulk insert into the dedicated SQL pool.

B.Create external tables in the dedicated SQL pool that reference the source data, then use CREATE TABLE AS SELECT (CTAS) to load incrementally.

C.Use Azure Data Factory with a copy activity to load data into staging tables, then merge into the target table.

D.Use Azure Databricks to read the source data, apply transformations, and write to the dedicated SQL pool using the Spark connector.

AnswerB

PolyBase external tables enable direct query of source data, and CTAS allows efficient incremental loading with minimal data movement.

Why this answer

Option B is correct because using external tables with PolyBase in Azure Synapse Analytics allows you to directly query the source Azure SQL Database without moving the data first. The CREATE TABLE AS SELECT (CTAS) statement then loads only the incremental data into the dedicated SQL pool, minimizing data movement by leveraging PolyBase's parallel streaming capability for maximum performance.

Exam trap

The trap here is that candidates often assume external tables are only for static data or Hadoop, but PolyBase in Synapse supports external tables against Azure SQL Database for efficient incremental loading, making options that introduce extra hops (like Data Factory or bcp) seem more familiar but less optimal.

How to eliminate wrong answers

Option A is wrong because the bcp utility exports data to a text file, which introduces an intermediate storage step and additional I/O overhead, increasing data movement and latency compared to direct PolyBase access. Option C is wrong because Azure Data Factory copy activity moves data through an intermediate staging area (e.g., Azure Blob Storage), which adds extra data transfer and storage costs, whereas PolyBase can read directly from the source without staging. Option D is wrong because Azure Databricks with the Spark connector requires moving data out of Azure SQL Database into a Spark cluster for processing, then writing back to the dedicated SQL pool, which increases data movement and complexity compared to the native PolyBase approach.

Practice this question →

129

Multi-Selecteasy

You are using Azure Stream Analytics to process real-time data from an IoT hub. The output is sent to Azure Blob Storage for long-term storage. You need to ensure that the output files are partitioned by date and hour for easy querying. Which THREE configurations should you set? (Choose three.)

Select 3 answers

A.Use a path pattern that includes {date} and {time} tokens.

B.Configure the event ordering policy to adjust late events.

C.Set the output serialization format to Avro or Parquet.

D.Set the compatibility level to 1.2 or higher.

E.Enable 'Write to blob storage partitioned by time' in the output settings.

AnswersA, C, E

Tokens in the path pattern create folder structure based on date and time.

Why this answer

Options A, D, and E are correct. Option A: Setting the output format to Avro or Parquet is common for partitioned data. Option D: Enabling 'DateTime format' in the output allows partitioning.

Option E: Using a path pattern with {date} and {time} creates folder structure. Option B is wrong because partitioning is not part of event ordering. Option C is wrong because compatibility level does not affect output partitioning.

Practice this question →

130

MCQhard

You are designing a data processing solution for a financial services company. The solution must process sensitive customer data and comply with GDPR. The data will be stored in Azure Synapse Analytics. You need to ensure that only authorized users can view specific columns (e.g., credit card numbers). Which security feature should you implement?

A.Row-level security (RLS)

B.Column-level security

C.Dynamic data masking

D.Microsoft Defender for Cloud

AnswerB

Column-level security restricts access to specific columns.

Why this answer

Option C is correct because Column-level security in Azure Synapse allows you to restrict access to specific columns for specific users or roles. Option A is wrong because Row-level security restricts rows, not columns. Option B is wrong because Dynamic data masking obfuscates data but does not prevent access entirely.

Option D is wrong because Microsoft Defender for Cloud is a security monitoring tool, not for column-level access control.

Practice this question →

131

MCQeasy

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a dedicated SQL pool to store fact and dimension tables. The fact table is expected to have billions of rows. Which distribution strategy should you recommend for the fact table to optimize query performance and minimize data movement?

A.Round-robin distribution.

B.Partitioned table with a partition key.

C.Hash distribution on a column that is frequently used in joins and aggregations.

D.Replicated distribution.

AnswerC

Hash distribution collocates rows with the same key, reducing data movement.

Why this answer

Hash distribution on a column frequently used in joins and aggregations is the best choice for a fact table with billions of rows in a dedicated SQL pool. It distributes rows across distributions based on a hash of the distribution column, ensuring that rows with the same key value are co-located on the same distribution. This minimizes data movement during joins and aggregations, as the data required for these operations is already local to each distribution, significantly improving query performance.

Exam trap

The trap here is that candidates often confuse partitioning with distribution, thinking that partitioning alone can optimize data movement across nodes, but partitioning operates within a distribution and does not affect how data is distributed across compute resources.

How to eliminate wrong answers

Option A is wrong because round-robin distribution distributes rows evenly without considering data relationships, which leads to excessive data movement during joins and aggregations, degrading performance for large fact tables. Option B is wrong because partitioning is a data organization technique within a distribution, not a distribution strategy; it helps with data management and partition elimination but does not control how data is distributed across compute nodes, so it cannot minimize data movement across distributions. Option D is wrong because replicated distribution copies the entire table to each compute node, which is impractical for a fact table with billions of rows due to massive storage overhead and write performance penalties; it is intended for small dimension tables, not large fact tables.

Practice this question →

132

MCQhard

Refer to the exhibit. You are creating a serverless SQL table in Azure Synapse Analytics that reads Parquet files from the specified location. The folder contains multiple Parquet files with different schemas. When querying the table, you get an error about schema mismatch. What is the most likely reason?

A.The Parquet files are not using the .parquet extension.

B.The derivedModel option is set to false, which disables schema inference.

C.The serverless SQL pool infers schema from the first file and expects all files to have the same schema.

D.The recursive option is causing the table to include files from subfolders that have different schemas.

AnswerC

Serverless SQL uses schema inference from the first file; subsequent files with different schemas cause errors.

Why this answer

Serverless SQL pools infer the schema from the first file encountered. If files have different schemas, the inference may be inconsistent. Option A is correct because the schema inference is done on the first file, and mismatches cause errors.

Option B is wrong because recursive reads subfolders but not different schemas. Option C is wrong because .parquet extension is fine. Option D is wrong because derivedModel is false, which means no model is used, but that doesn't cause schema mismatch.

Practice this question →

133

MCQeasy

You are designing a data processing pipeline using Azure Data Factory. The pipeline must ingest data from an HTTP endpoint that returns a JSON array. The data must be transformed by flattening nested arrays and then loaded into an Azure SQL Database table. The pipeline should be triggered daily. You need to choose the appropriate activities and transformations. The solution must be cost-effective and easy to maintain. Which combination of activities should you use?

A.Use a Lookup activity to read the JSON, then a ForEach activity to iterate and insert rows into SQL Database.

B.Use a Copy activity to ingest data from the HTTP source into Azure Blob Storage, then a Data Flow activity with a Flatten transformation to flatten the JSON, and finally a Copy activity to load into SQL Database.

C.Use a Data Flow activity directly from HTTP source with a Flatten transformation and sink to SQL Database.

D.Use two Copy activities: one to copy JSON to Blob Storage, and another to copy from Blob Storage to SQL Database without transformation.

AnswerB

This is the standard pattern: ingest, transform, load.

Why this answer

Option A is correct. A Copy activity can ingest the JSON from HTTP, and a Data Flow with Flatten transformation can flatten the nested arrays. Option B is wrong because Execute Pipeline is unnecessary.

Option C is wrong because a Lookup activity is for reading a single row, not for data ingestion. Option D is wrong because multiple Copy activities are not needed.

Practice this question →

134

MCQmedium

A company is ingesting streaming data from IoT devices into Azure Event Hubs. The data must be processed in near real-time and stored in Azure Synapse Analytics for reporting. The solution must handle late-arriving data and ensure exactly-once semantics. Which Azure service should you use for stream processing?

A.Azure Data Factory with Event Hubs source

B.Azure Synapse Spark Structured Streaming

C.Azure Stream Analytics

D.Azure Event Hubs with Capture

AnswerC

Provides exactly-once delivery and can handle late arrivals.

Why this answer

Option D is correct because Azure Stream Analytics supports exactly-once semantics to Synapse, handles late arrivals, and is optimized for real-time. Option A is incorrect because Spark Structured Streaming in Synapse does not provide exactly-once to Synapse out of the box. Option B is incorrect because Event Hubs itself does not process data.

Option C is incorrect because Data Factory is for batch, not streaming.

Practice this question →

135

MCQhard

You are reviewing a mapping data flow in Azure Data Factory that reads a CSV file from ADLS Gen2 and writes to an Azure Synapse Analytics dedicated SQL pool. The data flow includes a Derived Column transformation with the expression: `column1 == "Error" ? toString(column1) : column1`. The pipeline fails with an error indicating that the sink table could not be created. What is the most likely cause?

A.The source file does not have a header row.

B.The Derived Column expression has a syntax error.

C.Using allowCopyCommand with autoCreate is not supported.

D.The sink dataset is configured for JSON format.

AnswerC

allowCopyCommand requires the table to exist.

Why this answer

Option B is correct because `allowCopyCommand` is set to `true`, which requires the sink table to already exist. The `tableOption` is `autoCreate`, but with `allowCopyCommand` enabled, auto-create is not supported. Option A is wrong because the expression is syntactically correct.

Option C is wrong because the issue is with sink, not source. Option D is wrong because the source is delimited text, not JSON.

Practice this question →

136

MCQhard

You are monitoring an Azure Synapse Pipeline that uses a Mapping Data Flow. The data flow processes 2 GB of data from a CSV source and writes to a Delta sink. The pipeline fails with a 'DataFlowException: Operation aborted' error after running for 45 minutes. The cluster is configured with 8 cores. What is the most likely cause?

A.The cluster size is too small for the data volume.

B.The CSV source contains malformed rows that cause parsing errors.

C.The data flow cluster's time-to-live (TTL) is set to 45 minutes and the job exceeded it.

D.The data flow is using the Spark cluster's default timeout setting.

AnswerC

The default TTL for data flow clusters is 60 minutes, but if custom set to 45 minutes, the cluster may be terminated during long-running jobs.

Why this answer

Option C is correct because the error 'Operation aborted' after exactly 45 minutes aligns with the default time-to-live (TTL) setting for Azure Synapse Mapping Data Flow clusters. When the TTL expires, the cluster is terminated, and any running job is aborted. The 8-core cluster and 2 GB data volume are not inherently problematic for a 45-minute window, but the TTL default of 45 minutes causes the abort if the job runs longer than that.

Exam trap

The trap here is that candidates confuse the TTL (a Synapse cluster lifecycle setting) with a Spark job timeout or a data volume issue, leading them to incorrectly select cluster size or malformed data as the cause.

How to eliminate wrong answers

Option A is wrong because 8 cores can process 2 GB of data within 45 minutes under normal conditions; the error is not due to insufficient cluster size but rather a timeout. Option B is wrong because malformed rows would cause a parsing error (e.g., 'MalformedRecordException'), not a generic 'Operation aborted' error after a fixed duration. Option D is wrong because the Spark cluster's default timeout is not a configurable setting that causes this specific error; the TTL is a Synapse-specific cluster lifecycle setting, not a Spark-level timeout.

Practice this question →

137

MCQmedium

You are designing a data processing pipeline in Azure Data Factory that ingests data from an on-premises SQL Server database to Azure Data Lake Storage Gen2. The data volume is large (500 GB). The network connection between on-premises and Azure is limited to 100 Mbps. You need to minimize the time to transfer the initial full load while ensuring data integrity. Which approach should you recommend?

A.Use Azure Data Factory copy activity with parallel connections

B.Use Azure ExpressRoute to increase bandwidth

C.Compress the data using GZip and use copy activity

D.Use Azure Data Box to copy the data offline

AnswerD

Data Box transfers data physically, bypassing network limitations.

Why this answer

Option D is correct because Azure Data Box physically ships the data, bypassing network bandwidth limitations for large initial loads. Option A is wrong because it would take over 11 hours even at full bandwidth, and network may not be stable. Option B is wrong because it compresses but still uses network.

Option C is wrong because VPN adds overhead.

Practice this question →

138

MCQeasy

You are designing a data pipeline that uses Azure Data Factory to load data from an FTP server to Azure Data Lake Storage. The FTP server requires authentication with username and password. Which type of linked service should you create?

B.Azure Blob Storage

D.Rest service

AnswerA

FTP linked service supports username and password authentication.

Why this answer

Option A is correct because FTP linked service supports username/password authentication. Option B is incorrect because HTTP linked service is for HTTP endpoints. Option C is incorrect because Azure Blob Storage linked service is for Azure blobs.

Option D is incorrect because Rest service is for REST APIs.

Practice this question →

139

MCQmedium

You are running a pipeline in Azure Data Factory that uses a Mapping Data Flow. The data flow reads from Azure SQL Database and writes to Azure Synapse Analytics. You find that the data flow is very slow. Which configuration change would most likely improve performance?

A.Set the 'Staging' option to 'Use staging'

B.Increase the 'Compute type' to 'Memory Optimized' and the 'Core count'

C.Enable staging for the sink and use PolyBase

D.Set the 'Partition option' to 'Round robin' on the source

AnswerB

More compute resources speed up data flow execution.

Why this answer

Mapping Data Flows in Azure Data Factory execute on Spark clusters. The default compute configuration may not provide sufficient memory or parallelism for large data volumes. Increasing the 'Compute type' to 'Memory Optimized' and raising the 'Core count' directly allocates more memory and processing cores to the Spark cluster, which accelerates transformations and data movement between Azure SQL Database and Azure Synapse Analytics.

Exam trap

The trap here is that candidates confuse Mapping Data Flow performance tuning with Copy Activity optimizations, such as PolyBase or staging, which are irrelevant to Spark-based data flows.

How to eliminate wrong answers

Option A is wrong because setting 'Staging' to 'Use staging' in a Mapping Data Flow is not a valid configuration; staging is used for copy activities, not for data flows. Option C is wrong because enabling staging for the sink and using PolyBase is a performance optimization for Copy Activity, not for Mapping Data Flow, which uses Spark-native connectors. Option D is wrong because setting the 'Partition option' to 'Round robin' on the source distributes data evenly but does not address the root cause of slow performance, which is insufficient compute resources for the Spark cluster.

Practice this question →

140

MCQhard

You are a data engineer for a global retail company. The company has a hybrid architecture with on-premises SQL Server databases and Azure Synapse Analytics. You need to design a data processing solution that ingests incremental changes from the on-premises SQL Server database (source) into Azure Synapse Analytics (sink) with low latency (under 15 minutes) and high reliability. The source database is 5 TB and experiences high transaction volume during business hours. The solution must minimize impact on the source system and handle schema changes automatically. You have the following options: Option A: Use Azure Data Factory with a copy activity that uses a watermark column to query incremental changes every 10 minutes. The copy activity writes directly to the Synapse table using PolyBase. Option B: Use Azure Data Factory with a mapping data flow that reads from the source using a SQL query with a watermark, performs transformations, and writes to Synapse using staging via Blob Storage and PolyBase. Option C: Use SQL Server Integration Services (SSIS) running on Azure-SSIS Integration Runtime to extract data using change data capture (CDC) and load into Synapse. Option D: Use Azure Databricks with Auto Loader to ingest files from a staging area that is populated by a separate log-shipping process from the source. Which option should you choose?

A.Option C

B.Option A

C.Option D

D.Option B

AnswerD

Mapping data flow supports schema drift and uses staging for PolyBase.

Why this answer

Option B is correct because it handles incremental loads with low latency, uses PolyBase for efficient loading, and mapping data flow allows for schema drift handling and transformations without impacting source. Option A lacks schema drift handling. Option C requires SSIS packages and may have higher latency.

Option D requires additional log-shipping, increasing complexity and latency.

Practice this question →

141

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics that uses serverless SQL pools to query Parquet files in Azure Data Lake Storage Gen2. The files are partitioned by year and month. You need to optimize query performance and reduce data scanned. What should you do?

A.Use CREATE EXTERNAL TABLE AS SELECT (CETAS) to create new external tables.

B.Use OPENROWSET with the DATA_SOURCE parameter.

C.Create views that filter on partition columns.

D.Increase the number of files per partition.

AnswerC

Allows partition elimination when querying.

Why this answer

Option C is correct because serverless SQL pools in Azure Synapse Analytics support partition elimination only when queries use views or inline queries that explicitly filter on partition columns (e.g., year, month) in the WHERE clause. This allows the pool to skip scanning irrelevant partitions, reducing data scanned and improving performance. Creating views that encapsulate these filters ensures consistent partition pruning across queries.

Exam trap

The trap here is that candidates often assume that simply using external tables or OPENROWSET automatically provides partition pruning, but in serverless SQL pools, partition elimination only occurs when the query explicitly references the partition columns in the WHERE clause, typically through a view or inline filter.

How to eliminate wrong answers

Option A is wrong because CETAS creates external tables that store query results as new files, but it does not inherently optimize query performance or reduce data scanned for existing partitioned Parquet files; it is a data movement operation, not a query optimization technique. Option B is wrong because OPENROWSET with the DATA_SOURCE parameter allows querying files directly, but without explicit partition column filters in the WHERE clause, the serverless pool cannot perform partition elimination and will scan all files. Option D is wrong because increasing the number of files per partition increases metadata overhead and can degrade query performance due to more file open/read operations, and it does not reduce the amount of data scanned.

Practice this question →

142

Multi-Selectmedium

You are designing a data processing pipeline in Azure Data Factory that uses a Mapping Data Flow. You need to handle errors gracefully, such as when a row fails to convert a column value. Which TWO actions should you take? (Choose two.)

Select 2 answers

A.Wrap the data flow in a Try-Catch activity in the pipeline.

B.Set the data flow's error handling to 'Abort on error' to stop processing on first failure.

C.Enable schema drift on the source to automatically handle data type mismatches.

D.Configure the sink transformation to allow errors and log error rows to a separate file.

E.Use a Conditional Split transformation to separate rows that cause errors based on a condition.

AnswersD, E

Sink can be configured to continue on error and write error rows to a file.

Why this answer

Options A and D are correct. Option A: Using a conditional split to route error rows is a common pattern. Option D: Configuring the sink to allow errors and logging them ensures fault tolerance.

Option B is wrong because try-catch is not available in Mapping Data Flows. Option C is wrong because aborting the activity is not graceful. Option E is wrong because schema drift does not handle conversion errors.

Practice this question →

143

MCQhard

You are a data engineer working for a logistics company. You have an existing Azure Data Factory pipeline that ingests data from a REST API to Azure Data Lake Storage Gen2. The API has rate limiting that can cause failures. You need to implement a solution that can handle rate limiting by retrying with exponential backoff. The pipeline should also log the number of retries for each API call. What should you do?

A.Configure the Copy activity with retry policy using exponential backoff by setting the retry count and retry interval. Enable diagnostic logs to capture retry details.

B.Use a Web activity with a Until loop to implement custom retry logic.

C.Use an Azure Function as a custom activity in Azure Data Factory to implement retry logic with exponential backoff.

D.Use Azure Logic Apps to call the API and then copy the response to Azure Data Lake Storage Gen2.

AnswerA

Built-in retry with exponential backoff and logging.

Why this answer

Option B is correct because Azure Data Factory's Copy activity has built-in support for retry with exponential backoff when you set the retry count and retry interval. You can also log retry attempts using diagnostic logs. Option A is wrong because Azure Logic Apps would be a separate service and add complexity.

Option C is wrong because Azure Functions would require custom development. Option D is wrong because the Web activity does not have built-in retry with exponential backoff.

Practice this question →

144

MCQmedium

Refer to the exhibit. You are querying the sys.external_tables view in an Azure Synapse Analytics serverless SQL pool. The query returns no rows, but you know that external tables have been created. What is the most likely reason?

A.The external tables are using PolyBase, which is not supported in serverless SQL pool.

B.Serverless SQL pool does not support external tables; you must use a dedicated SQL pool.

C.The external tables were created using OPENROWSET, not CREATE EXTERNAL TABLE, so they do not appear in sys.external_tables.

D.The user does not have permission to view the sys.external_tables view.

AnswerC

OPENROWSET queries do not create external table metadata; they are ad-hoc queries.

Why this answer

In Azure Synapse Analytics serverless SQL pool, external tables are created using the CREATE EXTERNAL TABLE statement, and they are listed in the sys.external_tables view. However, if you query data directly via OPENROWSET without creating an external table, those ad-hoc queries do not register any metadata in sys.external_tables. Since the question states that external tables have been created but the view returns no rows, the most likely reason is that the tables were actually created using OPENROWSET, not CREATE EXTERNAL TABLE, so they are not cataloged in the system view.

Exam trap

The trap here is that candidates may assume any external data access creates a catalog entry, but the exam tests the specific difference between DDL-based external tables and ad-hoc OPENROWSET queries in serverless SQL pool.

How to eliminate wrong answers

Option A is wrong because PolyBase is fully supported in serverless SQL pool for reading external data; it is not unsupported. Option B is wrong because serverless SQL pool does support external tables via CREATE EXTERNAL TABLE, and they are visible in sys.external_tables. Option D is wrong because if the user lacked permission to view sys.external_tables, the query would typically return an error or no rows, but the question states the user knows external tables exist, making a permission issue less likely than the metadata not being populated due to using OPENROWSET.

Practice this question →

145

MCQmedium

Refer to the exhibit. You have a managed identity that needs to read data from the 'data' container in Azure Data Lake Storage Gen2. The policy currently denies access. What is the most likely cause?

A.The condition on 'acs:RequestVersion' is preventing access because the request does not use the specified API version

B.The resource path is malformed; it should include the blob path

C.The action 'Microsoft.Storage/storageAccounts/blobServices/containers/read' is incorrect; it should be 'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'

D.The principal is a managed identity, but the policy requires a user-assigned identity

AnswerA

The condition requires API version 2019-12-12, which may not be used.

Why this answer

The condition `acs:RequestVersion` requires the request to use a specific API version, which may not be met. Option A is wrong because the action is correct for reading containers. Option B is wrong because the resource path is correct.

Option C is wrong because the principal is a managed identity.

Practice this question →

146

MCQmedium

You are designing a data processing solution that uses Azure Databricks to transform large datasets. You need to ensure that the processing is cost-effective and can scale to handle variable workloads. Which cluster configuration should you recommend?

A.Use an auto-scaling cluster with spot instances.

B.Use a fixed-size cluster with premium tier.

C.Use a Photon-accelerated cluster with premium tier.

D.Use an interactive cluster with a large number of workers.

AnswerA

Auto-scaling and spot instances provide cost-effectiveness and scalability.

Why this answer

Option A is correct because auto-scaling clusters in Azure Databricks dynamically adjust the number of workers based on workload demands, ensuring cost-effectiveness by scaling down during low activity. Spot instances (Azure Spot VMs) further reduce costs by using unused Azure capacity at a significant discount, making this combination ideal for variable workloads where fault tolerance is acceptable.

Exam trap

The trap here is that candidates often assume premium tier or Photon acceleration automatically improves cost-effectiveness, but these features address performance or governance, not the core requirement of scaling with variable workloads and minimizing cost via spot pricing.

How to eliminate wrong answers

Option B is wrong because a fixed-size cluster cannot scale to handle variable workloads, leading to either over-provisioning (higher costs) or under-provisioning (performance degradation). Option C is wrong because Photon-accelerated clusters are optimized for high-performance SQL and DataFrame workloads, but they do not inherently address cost-effectiveness for variable workloads; the premium tier adds features like role-based access control but does not enable scaling or spot pricing. Option D is wrong because an interactive cluster with a large number of workers is designed for ad-hoc analysis and collaboration, not for cost-effective batch processing; it lacks auto-scaling and spot instance support, leading to higher costs during idle periods.

Practice this question →

147

MCQmedium

You are designing a data pipeline in Azure Data Factory (ADF) that copies data from an on-premises SQL Server database to Azure Synapse Analytics dedicated SQL pool. The pipeline must run daily and handle incremental loads efficiently. Which sink dataset type and copy method should you use?

A.Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and use the Copy activity with PolyBase enabled.

B.Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and enable the built-in Upsert option.

C.Use Azure Blob Storage as the sink dataset, then use PolyBase to load into the dedicated SQL pool.

D.Use Azure Synapse Analytics dedicated SQL pool as the sink dataset and use Stored Procedure with staging table and PolyBase.

AnswerD

This combination enables high-throughput ingestion and supports incremental loading via merge logic in the stored procedure.

Why this answer

Option D is correct because it uses a staging table and PolyBase to efficiently load incremental data into Azure Synapse Analytics dedicated SQL pool. PolyBase provides high-throughput parallel loading, and the stored procedure handles the merge logic (upsert) to manage incremental changes. This approach is recommended for large-scale, daily incremental loads to Synapse.

Exam trap

The trap here is that candidates assume the built-in Upsert option works for all Azure SQL targets, but it is not supported for Azure Synapse Analytics dedicated SQL pool, requiring a custom staging-and-merge pattern instead.

How to eliminate wrong answers

Option A is wrong because the Copy activity with PolyBase enabled does not natively support incremental upsert logic; it only supports bulk insert or append, not merge operations. Option B is wrong because the built-in Upsert option is not available for Azure Synapse Analytics dedicated SQL pool as a sink in ADF Copy activity; it is only supported for Azure SQL Database and SQL Server. Option C is wrong because using Azure Blob Storage as an intermediate sink adds unnecessary complexity and latency; PolyBase can load directly from ADF into Synapse without an intermediate Blob Storage hop.

Practice this question →

148

MCQeasy

You are designing a data processing solution for a marketing company that uses Azure Synapse Analytics. The solution needs to process customer data from multiple sources, including CRM and web analytics. The data must be cleansed and transformed before loading into a dedicated SQL pool. The transformations include string manipulations, date conversions, and lookups. You need to choose a serverless transformation approach that integrates with Azure Synapse pipelines. Which approach should you use?

A.Use Azure Stream Analytics to transform the data in real time.

B.Use PolyBase to load data and then use T-SQL stored procedures to transform.

C.Use Azure Databricks notebooks with Spark to perform transformations.

D.Use mapping data flows in Azure Synapse pipelines.

AnswerD

Serverless, visual transformation within Synapse.

Why this answer

Option C is correct because Azure Synapse pipelines support mapping data flows, which are serverless and provide a visual interface for transformations. Option A is wrong because PolyBase is for loading, not transformation. Option B is wrong because Azure Databricks would require a cluster.

Option D is wrong because Azure Stream Analytics is for streaming, not batch transformations.

Practice this question →

149

MCQmedium

You are designing a data processing pipeline in Azure Synapse Analytics that ingests streaming data from Azure Event Hubs and stores it in a dedicated SQL pool. The data volume is approximately 500 GB per hour with peak spikes. The pipeline must minimize data loss during transient failures. Which feature should you implement?

A.Use Azure Synapse Pipeline with Auto-commit and checkpointing to process streaming data.

B.Use PolyBase to load data directly from Event Hubs to the dedicated SQL pool.

C.Use COPY INTO statement to ingest data from Event Hubs into the dedicated SQL pool.

D.Enable Event Hubs Capture to write data to Azure Data Lake Storage and then load using PolyBase.

AnswerA

Auto-commit with checkpointing in Synapse Pipeline provides fault tolerance and exactly-once processing for streaming data.

Why this answer

Option A is correct because Azure Synapse Pipeline with Auto-commit and checkpointing provides exactly-once processing semantics for streaming data from Event Hubs, ensuring no data loss during transient failures by committing offsets only after successful writes to the dedicated SQL pool. This feature is designed for high-volume streaming (500 GB/hour) and handles peak spikes through parallelization and retry logic, making it the optimal choice for minimizing data loss.

Exam trap

The trap here is that candidates often confuse batch-loading technologies like PolyBase or COPY INTO with streaming capabilities, overlooking that only checkpointing-based pipelines provide the fault tolerance needed for real-time data ingestion with minimal loss.

How to eliminate wrong answers

Option B is wrong because PolyBase is a bulk-loading technology for batch data from external sources like Azure Data Lake Storage or Blob Storage, not designed for real-time streaming from Event Hubs, and it lacks checkpointing or auto-commit mechanisms to handle transient failures without data loss. Option C is wrong because the COPY INTO statement is used for batch ingestion from files in Azure Data Lake Storage or Blob Storage, not for streaming data from Event Hubs, and it does not provide streaming-specific fault tolerance like checkpointing. Option D is wrong because Event Hubs Capture writes data to Azure Data Lake Storage in batches (e.g., every 5 minutes or 200 MB), introducing latency and potential data loss during the capture window, and PolyBase loading from there adds further delay, failing the requirement to minimize data loss during transient failures in a streaming pipeline.

Practice this question →

150

MCQmedium

A company uses Azure Synapse Analytics dedicated SQL pool. The data engineering team notices that queries against a large fact table are running slowly. The table uses round-robin distribution and has a columnstore index. The team wants to improve query performance without adding more resources. Which action should the team take?

A.Keep round-robin distribution but increase the degree of parallelism.

B.Change the distribution to hash on multiple columns.

C.Change the distribution to hash on the column that is most frequently used in joins.

D.Rebuild the table as a heap to improve insert performance.

AnswerC

Hash distribution on a join key reduces data shuffling.

Why this answer

Option B is correct because hash-distributing the fact table on a join key that is frequently used in queries allows parallel processing and reduces data movement, improving performance. Option A is incorrect because increasing distribution columns in a hash-distributed table is not supported. Option C is incorrect because round-robin is already used and is not optimal for large fact tables.

Option D is incorrect because clustered columnstore is already in place; a heap would degrade performance.

Practice this question →

← PreviousPage 2 of 4 · 297 questions totalNext →

Ready to test yourself?

Try a timed practice session using only Develop Data Processing questions.

Start 20-question session