Knowledge + Practice

CCNA Develop data processing Questions

75 of 297 questions · Page 1/4 · Develop data processing · Answers revealed

Practice these questions Domain overview All questions

1

MCQhard

Your company runs a streaming job in Azure Stream Analytics that ingests data from Event Hubs and outputs to Azure Synapse Analytics. The job is failing with a 'Watermark delay' alert and the output to Synapse is delayed by over 30 minutes. The input rate is 5,000 events per second. The job uses a 1-minute tumbling window. What is the most likely cause of the delay?

A.The output schema in Synapse does not match the Stream Analytics output.

B.The Event Hubs has a large number of late-arriving events.

C.The tumbling window size is too large.

D.The Stream Analytics job is under-provisioned in terms of Streaming Units (SUs).

AnswerD

Insufficient SUs cause backpressure and watermark delay.

Why this answer

Option D is correct because a watermark delay alert in Azure Stream Analytics indicates that the job is falling behind in processing incoming data. With an input rate of 5,000 events per second and a 1-minute tumbling window, the job requires sufficient Streaming Units (SUs) to keep up. Under-provisioned SUs cause backpressure, leading to output delays exceeding 30 minutes.

Exam trap

The trap here is that candidates may confuse a watermark delay alert with late-arriving events (Option B), but the alert indicates the job is falling behind overall, not just handling late data, and the 30-minute delay points to insufficient compute resources rather than data timing issues.

How to eliminate wrong answers

Option A is wrong because a schema mismatch between Stream Analytics output and Synapse would cause data write errors or failures, not a watermark delay alert or a 30-minute output delay. Option B is wrong because a large number of late-arriving events would increase the watermark delay but the alert specifically indicates the job is falling behind overall processing, not just handling late data; late events are managed by the late arrival policy and do not inherently cause a 30-minute delay. Option C is wrong because a 1-minute tumbling window is small and appropriate for real-time analytics; a larger window would reduce processing frequency, not cause delay.

Practice this question →

2

MCQhard

You are designing a data processing solution for a financial services company. The solution must process sensitive customer data in Azure Databricks with encryption at rest and in transit, and must also support row-level security (RLS). Which combination of features should you use?

A.Use Azure SQL Database with row-level security and connect Databricks via JDBC

B.Use Azure Databricks with Azure Key Vault for encryption and implement RLS in application code

C.Use Azure Synapse Analytics dedicated SQL pool with dynamic data masking and always encrypted

D.Use Azure Databricks with Delta Lake and enable row-level security on the Delta table

AnswerD

Delta Lake in Databricks supports RLS natively, and encryption is handled automatically.

Why this answer

Option D is correct because Delta Lake in Azure Databricks now supports row-level security (RLS) natively through Delta Sharing and Spark SQL predicates, allowing you to enforce data access policies directly on the Delta table without external systems. Combined with Azure Databricks' built-in encryption at rest (using Azure Storage Service Encryption or customer-managed keys) and in transit (TLS 1.2+), this meets all requirements without introducing additional services.

Exam trap

The trap here is that candidates often assume row-level security is only available in traditional relational databases like SQL Server or Azure SQL Database, but Delta Lake in Azure Databricks has supported RLS since runtime 10.4 LTS, making it a fully native option for data lakehouse architectures.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database with row-level security requires JDBC connectivity, which introduces a separate database service and does not leverage Databricks' native Delta Lake capabilities; it also adds latency and complexity for encryption management. Option B is wrong because implementing RLS in application code is not a native feature of Azure Databricks and violates the principle of defense in depth—RLS must be enforced at the storage layer, not in application logic, to prevent bypass. Option C is wrong because Azure Synapse Analytics dedicated SQL pool with dynamic data masking and always encrypted does not provide row-level security; dynamic data masking only obfuscates data at query results, and always encrypted protects data at rest and in transit but does not filter rows based on user permissions.

Practice this question →

3

MCQeasy

You are implementing a data processing solution in Azure Databricks. The solution must read data from Azure Data Lake Storage Gen2, transform it using PySpark, and write the results back to a different location in the same storage account. You need to authenticate to the storage account securely without storing secrets in the notebook. What should you use?

A.Service Principal with a client secret stored in the notebook

B.Azure Key Vault-backed secret scope

C.Shared access signature (SAS) token

D.Storage account access key

AnswerB

Secret scopes securely reference secrets from Key Vault.

Why this answer

Option A is correct because Azure Key Vault-backed secret scopes allow you to securely reference secrets without storing them in the notebook. Option B is wrong because Service Principals require managing client secrets. Option C is wrong because Access keys are long-lived and insecure.

Option D is wrong because Shared access signatures can expose tokens.

Practice this question →

4

MCQmedium

You are developing a streaming solution with Azure Stream Analytics that outputs results to Azure Synapse Analytics. The output requires exactly-once semantics to avoid duplicates. Which configuration should you use?

A.Use an Azure Function output to write data idempotently to Synapse.

B.Configure the output to Azure Synapse Analytics with 'Exactly once' output mode.

C.Set the event serialization format to Avro and enable 'Exactly once' event delivery.

D.Output to Azure Blob Storage and then use a COPY statement in Synapse.

AnswerB

Stream Analytics supports exactly-once output to Synapse dedicated SQL pool.

Why this answer

Azure Stream Analytics supports an 'Exactly once' output mode for Azure Synapse Analytics, which ensures that each event is written exactly once to the destination, preventing duplicates. This mode uses a combination of checkpointing and transactional writes to guarantee idempotent delivery without requiring custom logic.

Exam trap

The trap here is that candidates often confuse 'exactly-once' with 'at-least-once' and assume that idempotent writes via Azure Functions or staging in Blob Storage can achieve the same guarantee, but Stream Analytics' built-in output mode is the only supported way to guarantee exactly-once delivery to Synapse without custom complexity.

How to eliminate wrong answers

Option A is wrong because Azure Functions do not natively provide exactly-once semantics for Stream Analytics outputs; they are at-least-once by default, and implementing idempotency in a function would require custom state management and is not a built-in feature. Option C is wrong because Avro serialization and 'Exactly once event delivery' are not supported output configurations for Azure Synapse Analytics; Stream Analytics uses its own output mode for Synapse, not event delivery settings. Option D is wrong because writing to Blob Storage and then using a COPY statement introduces at-least-once semantics from Stream Analytics to Blob, and the COPY statement can still produce duplicates if the same file is loaded multiple times, failing to achieve end-to-end exactly-once semantics.

Practice this question →

5

MCQhard

You have a mission-critical pipeline that processes financial transactions in Azure Synapse Analytics. The pipeline uses Azure Data Factory with a mapping data flow to transform data. You need to ensure high availability and minimal data loss in case of a regional failure. What should you implement?

A.Store the source data in a GRS storage account and use Azure Data Factory to copy from the secondary endpoint.

B.Configure the pipeline to retry on failure and manually restore from backup.

C.Use Azure SQL Database active geo-replication as the source.

D.Use Azure Synapse Link for Cosmos DB to enable near real-time analytics with multi-region writes.

AnswerD

Synapse Link with Cosmos DB's multi-master supports high availability and minimal data loss.

Why this answer

Option D is correct because Azure Synapse Link for Cosmos DB provides near real-time sync with automatic failover. Option A (active geo-replication for SQL pool) is not native to Synapse. Option B (manual restore) has data loss.

Option C (backup to GRS) does not provide automatic failover.

Practice this question →

6

MCQhard

You have a streaming pipeline using Azure Stream Analytics that ingests data from Event Hubs and outputs to Azure Synapse Analytics. The job has a high watermark delay and is falling behind. You need to reduce the latency. Which action should you take?

A.Add more partitions to the Event Hubs.

B.Increase the number of Streaming Units (SUs) for the Stream Analytics job.

C.Replace the output with Azure Functions for each event.

D.Change the input to a reference data input.

AnswerB

More SUs provide more compute power to process events faster.

Why this answer

Increasing the number of Streaming Units (SUs) for the Stream Analytics job allocates more compute resources, reducing latency. Adding more Event Hubs partitions may improve throughput but not directly reduce latency if the job is already bottlenecked. Switching to reference data input does not help.

Using Azure Functions for output may add overhead.

Practice this question →

7

Multi-Selecthard

Which TWO techniques should you use to secure sensitive data in Azure Synapse Analytics dedicated SQL pools when implementing column-level security?

Select 2 answers

A.Define dynamic data masking rules to obfuscate sensitive columns for unauthorized users.

B.Enable Transparent Data Encryption (TDE) on the database.

C.Create a security policy with filter predicates to restrict access to specific columns.

D.Use GRANT SELECT on specific columns to authorized users.

E.Implement Always Encrypted to encrypt columns at the application level.

AnswersA, D

Dynamic data masking hides sensitive data from non-privileged users.

Why this answer

Option B is correct because column-level security uses GRANT on specific columns. Option C is correct because dynamic data masking obfuscates data at query time. Option A is wrong because row-level security filters rows, not columns.

Option D is wrong because Always Encrypted is for client-side encryption, not column-level security in Synapse. Option E is wrong because Transparent Data Encryption encrypts data at rest, not at the column level.

Practice this question →

8

Multi-Selecthard

You are optimizing the performance of a large-scale batch processing job in Azure Databricks. The job reads data from Azure Data Lake Storage Gen2, performs transformations, and writes results back. You notice that the job is I/O bound. Which THREE strategies can improve performance? (Choose three.)

Select 3 answers

A.Use Delta Lake format and optimize the table with Z-ordering on frequently filtered columns.

B.Increase the number of partitions in the DataFrame to improve parallelism.

C.Cache the DataFrame in memory after reading to avoid re-reading from disk.

D.Reduce the number of shuffle partitions to minimize data movement.

E.Enable autoscaling on the cluster to add more nodes during processing.

AnswersA, B, C

Z-ordering reduces the amount of data read by pruning files.

Why this answer

Options A, B, and E are correct. Option A: Increasing the number of partitions can improve parallelism. Option B: Using Delta Lake and optimizing the file layout reduces I/O.

Option E: Caching intermediate results reduces repeated reads. Option C is wrong because reducing shuffle partitions can cause OOM errors. Option D is wrong because autoscaling may not help if the bottleneck is I/O.

Practice this question →

9

MCQeasy

You are designing a data processing solution for a real-time dashboard. The source data is streaming from IoT devices at a rate of 10,000 events per second. Each event is under 1 KB. You need to store the raw data in a staging area before processing, and you must minimize storage costs. Which storage solution should you use?

A.Azure Data Lake Storage Gen2

B.Azure Blob Storage

C.Azure SQL Database

D.Azure Cosmos DB

AnswerB

Blob Storage provides low-cost object storage for large volumes of streaming data, ideal for staging before processing.

Why this answer

Azure Blob Storage is cost-effective for large volumes of streaming data that will be processed later. Azure Data Lake Storage Gen2 is built on Blob Storage but adds hierarchical namespace, which is not required for simple staging. Azure SQL Database and Azure Cosmos DB are more expensive and suited for transactional or real-time query workloads.

Practice this question →

10

MCQeasy

You are processing CSV files in Azure Synapse Analytics serverless SQL pool. The files contain newline characters within quoted fields, causing parsing errors. Which file format option should you specify to handle this correctly?

A.Use ROWTERMINATOR = '0x0a'

B.Use FORMAT = 'CSV'

C.Use FIELDQUOTE = '"'

D.Use PARSER_VERSION = '2.0'

AnswerD

Parser version 2.0 supports quoted fields with embedded newlines.

Why this answer

Option D is correct because PARSER_VERSION = '2.0' in Azure Synapse serverless SQL pool supports quoted fields that contain embedded newline characters (CR/LF). The parser version 2.0 is designed to handle CSV files with complex quoting rules, including newlines within quoted fields, which the older parser version 1.0 cannot parse correctly.

Exam trap

The trap here is that candidates often assume FIELDQUOTE or ROWTERMINATOR can fix embedded newlines, but only the parser version upgrade (2.0) changes the underlying parsing logic to treat newlines inside quotes as part of the field value.

How to eliminate wrong answers

Option A is wrong because ROWTERMINATOR = '0x0a' only specifies the row delimiter (line feed) and does not address the issue of newline characters inside quoted fields; it would still cause parsing errors when newlines appear within quotes. Option B is wrong because FORMAT = 'CSV' is a general format specification that does not by itself enable handling of embedded newlines in quoted fields; it relies on the parser version for such behavior. Option C is wrong because FIELDQUOTE = '"' defines the quote character (default is double quote) but does not change how the parser treats newlines within quoted fields; the parser version determines that behavior.

Practice this question →

11

Multi-Selecteasy

Which TWO options are correct for configuring a copy activity in Azure Data Factory to load data from Azure Blob Storage to Azure SQL Database?

Select 2 answers

A.Use a staging table in Azure SQL Database before inserting into the final table.

B.Use staging via Azure Blob Storage when loading large volumes to improve performance.

C.Use Azure Data Lake Storage as the staging location for better throughput.

D.Use PolyBase to load directly from Blob Storage to Azure SQL Database.

E.Always skip staging to reduce latency.

AnswersA, B

Staging table allows data validation and transformation before final load.

Why this answer

Correct options: A (staging via Blob Storage for large loads) and D (using a staging table and then inserting into final). B is wrong because staging cannot be skipped for large data. C is wrong because Blob Storage is not an intermediate for Azure SQL.

E is wrong because PolyBase is not supported for Azure SQL Database.

Practice this question →

12

MCQeasy

You are tasked with transforming data in an Azure Synapse Analytics pipeline using a mapping data flow. The source data contains a column 'FullName' in the format 'LastName, FirstName'. You need to split this into two separate columns: 'LastName' and 'FirstName'. Which transformation should you use?

A.Pivot transformation

B.Aggregate transformation

C.Lookup transformation

D.Derived Column transformation

AnswerD

Derived column can use expressions to split strings.

Why this answer

Option B is correct because the Derived Column transformation allows you to create new columns using expressions, including string functions like split(). Option A is wrong because Aggregate is for grouping and aggregations, not string manipulation. Option C is wrong because Pivot rotates data from rows to columns.

Option D is wrong because Lookup is for joining data from another source.

Practice this question →

13

MCQmedium

You are designing a data processing solution in Azure Synapse Analytics. The solution must process streaming data from Azure Event Hubs and store the results in a dedicated SQL pool. You need to choose the most appropriate service for near real-time ingestion with minimal latency. What should you use?

A.Azure Databricks with Structured Streaming

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Functions with Event Hub trigger

AnswerB

Azure Stream Analytics is optimized for low-latency streaming and can directly ingest from Event Hubs and write to SQL pools.

Why this answer

Azure Stream Analytics is the correct choice because it is purpose-built for near real-time stream processing with sub-second latency, directly integrates with Azure Event Hubs as an input source and dedicated SQL pool as an output sink, and provides a SQL-like query language for defining transformations. This minimizes architectural complexity and latency compared to other services.

Exam trap

The trap here is that candidates often confuse 'near real-time' with 'batch processing' and choose Azure Data Factory (option C) because it is a familiar data integration tool, overlooking that it lacks native streaming capabilities and introduces latency from scheduled pipeline runs.

How to eliminate wrong answers

Option A is wrong because Azure Databricks with Structured Streaming introduces additional overhead from cluster startup time and micro-batch processing, which typically results in higher latency (seconds to minutes) compared to Stream Analytics' continuous processing model. Option C is wrong because Azure Data Factory is a batch-oriented ETL/ELT orchestration service that does not support native streaming ingestion; it polls sources on a schedule, introducing minutes of latency. Option D is wrong because Azure Functions with Event Hub trigger processes events one at a time in a serverless compute model, which can lead to cold-start delays and lacks built-in windowing, aggregation, and exactly-once semantics for streaming workloads.

Practice this question →

14

Multi-Selecthard

Which THREE components are required to implement a change data capture (CDC) solution in Azure Data Factory for incremental loading from Azure SQL Database to Azure Synapse Analytics?

Select 3 answers

A.A Mapping Data Flow to transform the data before load.

B.A Lookup activity to retrieve the last processed watermark.

C.Azure Functions to process change data.

D.A Copy activity to transfer changed data to Synapse.

E.Change Data Capture (CDC) enabled on the source Azure SQL Database tables.

AnswersB, D, E

Needed to get the last high-water mark.

Why this answer

Option B is correct because a Lookup activity is required to retrieve the last processed watermark (e.g., a timestamp or LSN) from a control table or source query. This watermark defines the incremental boundary for CDC, ensuring only new or changed rows are loaded in each pipeline run. Without it, the pipeline would have no way to track progress and would either reload all data or miss changes.

Exam trap

The trap here is that candidates often assume CDC requires a Mapping Data Flow or Azure Functions for transformation or processing, when in fact the core pattern uses only Lookup and Copy activities with change tracking metadata.

Practice this question →

15

Multi-Selecteasy

Which TWO data processing activities in Azure Data Factory allow you to run custom code?

Select 2 answers

A.Data Flow activity

B.Copy activity

C.Custom activity

D.Azure Function activity

E.Stored Procedure activity

AnswersC, D

Runs custom code on Azure Batch.

Why this answer

Option C is correct because the Custom activity in Azure Data Factory allows you to execute custom code by running a .NET or PowerShell script in an Azure Batch service. This enables you to implement complex transformation logic that is not natively supported by ADF's built-in activities, such as custom data validation or integration with third-party APIs.

Exam trap

The trap here is that candidates often confuse the Custom activity with the Azure Function activity, thinking both run custom code, but the question asks for 'custom code' in the context of ADF's native activities—Azure Function activity is also correct because it runs custom code via an Azure Function, making it a valid choice alongside Custom activity.

Practice this question →

16

MCQeasy

You are monitoring an Azure Synapse Analytics pipeline that runs daily. The pipeline uses a data flow to transform data. You notice that the data flow is slow and consumes a lot of compute resources. Which action can you take to optimize performance?

A.Use a staging table to load data before transformation

B.Increase the batch size in the data flow

C.Use PolyBase to load data into the dedicated SQL pool

D.Partition the data flow by a key column

AnswerD

Partitioning allows parallel execution across nodes, improving throughput.

Why this answer

Partitioning the data flow by a key column can improve performance by allowing parallel processing. Option A is wrong because using a staging table adds overhead. Option B is wrong because increasing batch size may cause memory issues.

Option D is wrong because data flows do not use PolyBase.

Practice this question →

17

Multi-Selectmedium

You are designing an ETL process in Azure Data Factory. You need to transform data using Mapping Data Flows. Which THREE of the following transformations are available in Mapping Data Flows?

Select 3 answers

A.Pivot

B.Derived Column

C.Union All

D.Aggregate

E.Merge Join

AnswersA, B, D

Used to rotate data from rows to columns.

Why this answer

Option A is correct because the Pivot transformation in Mapping Data Flows allows you to rotate data from rows into columns, enabling aggregation and restructuring of datasets. It is a native transformation in Azure Data Factory's Mapping Data Flows, supporting group-by columns, pivot keys, and aggregated column values.

Exam trap

The trap here is that candidates confuse the 'Union All' and 'Merge Join' names from other tools (like SSIS or T-SQL) with the actual transformation names in Azure Data Factory Mapping Data Flows, leading them to select options that sound familiar but are not available.

Practice this question →

18

Multi-Selecteasy

Which TWO are valid output sinks for an Azure Stream Analytics job? (Choose two.)

Select 2 answers

A.Azure SQL Data Warehouse (dedicated SQL pool)

B.Azure Blob Storage

C.Azure Cosmos DB (SQL API)

D.Azure Table Storage

E.Azure Synapse Analytics

AnswersB, E

Supported as a sink.

Why this answer

Options A and D are correct. Azure Synapse Analytics and Azure Blob Storage are supported sinks. Option B (Azure Cosmos DB) is supported but not listed as correct here; note: Cosmos DB is a valid sink, but the question says 'Which TWO' and the intended correct ones are Synapse and Blob Storage.

Option C (Azure Table Storage) is not directly supported. Option E (Azure SQL Data Warehouse) is the same as Synapse.

Practice this question →

19

MCQmedium

You are troubleshooting a slow-running pipeline in Azure Data Factory that uses a Copy activity to transfer data from Azure Blob Storage to Azure Synapse Analytics. The pipeline processes about 100 GB of CSV files. The copy performance is poor even though the source and sink are in the same region. What is the most likely cause?

A.The copy activity is not using staging and PolyBase

B.The source and sink are in different Azure regions

C.The Data Integration Unit (DIU) setting is too low

D.The source files are compressed

AnswerA

PolyBase dramatically improves load performance.

Why this answer

Option D (The copy activity is not using staging and PolyBase) is correct because PolyBase can significantly improve load performance into Synapse. Option A (Data Integration Unit too low) is possible but less likely than using PolyBase. Option B (Source files are compressed) should not degrade performance if using compressed format.

Option C (Source and sink in different region) is not the case.

Practice this question →

20

Multi-Selecthard

Which THREE of the following are valid ways to authenticate to Azure Data Lake Storage Gen2 from Azure Synapse serverless SQL pool?

Select 3 answers

A.Storage account key

B.Managed identity of the Synapse workspace

C.Service principal with certificate

D.Microsoft Entra ID (Azure AD) pass-through authentication

E.Shared access signature (SAS) token

AnswersB, D, E

Serverless SQL pool can use the workspace's managed identity to access storage.

Why this answer

Managed identity of the Synapse workspace (Option B) is correct because Azure Synapse serverless SQL pool can use the workspace's system-assigned managed identity to authenticate to Azure Data Lake Storage Gen2 without storing any credentials. This is enabled by granting the managed identity the appropriate RBAC role (e.g., Storage Blob Data Contributor) on the storage account, and then using the `IDENTITY='Managed Identity'` parameter in the `OPENROWSET` or `CREATE EXTERNAL TABLE AS SELECT` statements. It provides a secure, credential-free authentication method that is fully supported by serverless SQL pool.

Exam trap

The trap here is that candidates often assume storage account keys (Option A) are universally supported across all Azure Synapse components, but serverless SQL pool explicitly does not support key-based authentication, requiring token-based methods instead.

Practice this question →

21

Multi-Selecthard

Which TWO options are correct about using Azure Stream Analytics for real-time data processing?

Select 3 answers

A.You can join streaming data with static reference data for enrichment.

B.Stream Analytics supports windowing functions like Tumbling, Hopping, and Sliding windows natively.

C.Stream Analytics can output to any Azure data store without custom code.

D.Stream Analytics guarantees exactly-once processing semantics.

E.You can use custom JavaScript or C# functions for complex transformations.

AnswersA, B, E

Reference data joins are a key feature of Stream Analytics.

Why this answer

Correct: B (reference data joins for enrichment) and D (windowing functions for time-based aggregations). A is wrong because Stream Analytics cannot write to all sinks; some require custom code. C is wrong because windowing is built-in but not for every scenario.

E is wrong because exactly-once semantics depend on the sink.

Practice this question →

22

MCQeasy

You are running a Python script in Azure Databricks that reads a CSV file from DBFS. The script runs successfully in an interactive notebook but fails when executed as a job with the error: 'Path does not exist: dbfs:/tmp/data.csv'. What is the most likely cause?

A.The job is using a different runtime that does not support Python.

B.The file is too large for DBFS.

C.The job cluster does not have permission to access DBFS.

D.The file was uploaded to the workspace filesystem, not to DBFS.

AnswerD

Workspace files are not automatically in DBFS.

Why this answer

Option C is correct because job clusters are ephemeral and may not have the file in DBFS if it was uploaded to the workspace files or requires mounting. Option A is incorrect because the cluster has access. Option B is incorrect because Python is supported.

Option D is incorrect because the file exists in interactive but not in job context.

Practice this question →

23

MCQhard

You are designing a data processing solution for a retail company that uses Azure Synapse Analytics. The solution must process point-of-sale (POS) data from multiple stores. The data arrives in CSV files in Azure Data Lake Storage Gen2. Each store sends a file every hour. You need to process the files as they arrive and load the data into a dedicated SQL pool. The solution must handle late-arriving files (files that arrive after the scheduled processing time) and ensure that the data is consistent. Which approach should you use?

A.Use Azure Data Factory with a Copy activity to load data into a staging table, then use a Data Flow activity to perform upserts.

B.Use Azure Databricks to read the CSV files, perform upserts, and write to the dedicated SQL pool using JDBC.

C.Use PolyBase to create external tables on the CSV files and then use CREATE TABLE AS SELECT to load into the dedicated SQL pool.

D.Use Azure Data Factory with a Copy activity to load data into a staging table in the dedicated SQL pool, then use a Stored Procedure activity to merge the data into the final table.

AnswerD

Handles upserts and late-arriving data effectively.

Why this answer

Option D is correct because using a stored procedure with merge logic can handle upserts and late-arriving data. Azure Data Factory can execute the stored procedure after copying data to a staging table. Option A is wrong because PolyBase with external tables does not handle upserts well.

Option B is wrong because Azure Databricks would be overkill for this scenario. Option C is wrong because partitioning alone does not handle late-arriving data.

Practice this question →

24

MCQhard

Your organization uses Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. You notice that queries are slow when filtering on a date column. You need to improve query performance without increasing costs. What should you do?

A.Increase the maximum query concurrency limit

B.Provision a dedicated SQL pool with more DTUs

C.Create a clustered columnstore index on the date column

D.Partition the data by date in the data lake (e.g., folder structure: /year=*/month=*/day=*)

AnswerD

Partition elimination reduces data scanned.

Why this answer

Option D is correct because partitioning the data by date in the data lake (e.g., /year=*/month=*/day=*) allows the serverless SQL pool to leverage partition elimination. When querying with a filter on the date column, the pool can read only the relevant partitions (folders) instead of scanning all Parquet files, drastically reducing I/O and improving query performance at no additional cost.

Exam trap

The trap here is that candidates often confuse serverless SQL pool with dedicated SQL pool and incorrectly choose to create indexes or scale resources, not realizing that serverless SQL pool relies on external data partitioning and file-skipping techniques rather than internal indexing or provisioning.

How to eliminate wrong answers

Option A is wrong because increasing the maximum query concurrency limit does not improve the performance of a single query; it only allows more concurrent queries to run, which could even degrade individual query performance due to resource contention. Option B is wrong because provisioning a dedicated SQL pool with more DTUs increases costs and is not a serverless SQL pool feature; serverless SQL pool scales automatically and does not use DTUs, so this would be an expensive and incorrect solution. Option C is wrong because clustered columnstore indexes are not supported in serverless SQL pool; they are a feature of dedicated SQL pools, and creating one on a date column in a serverless context is not possible.

Practice this question →

25

MCQhard

You are reviewing a Mapping Data Flow in Azure Data Factory that copies data from a CSV file to an Azure Synapse table. The data flow uses 'allowSchemaDrift: true' and 'validateSchema: false'. After running the pipeline, you notice that the target table does not have the expected columns. The CSV file sometimes has extra columns. What is the most likely reason?

A.Schema drift is not enabled on the source.

B.The sink mapping is explicitly defined and does not include auto-mapping for drifted columns.

C.The sink dataset has a fixed schema that does not allow drift.

D.The source dataset has a fixed schema that does not include extra columns.

AnswerB

The sink script defines input columns, so extra columns are not mapped.

Why this answer

Option B is correct because when 'allowSchemaDrift' is enabled on the source, drifted columns are detected but will only be written to the sink if auto-mapping is used. If the sink mapping is explicitly defined (e.g., column-by-column mappings), it overrides auto-mapping and drifted columns are ignored. Since the target table is missing expected columns, the explicit mapping likely excludes the drifted columns.

Exam trap

The trap here is that candidates assume enabling 'allowSchemaDrift' on the source automatically writes all columns to the sink, but they overlook that explicit sink mappings override auto-mapping and exclude drifted columns.

How to eliminate wrong answers

Option A is wrong because the question states 'allowSchemaDrift: true' on the source, so schema drift is enabled. Option C is wrong because the sink dataset's schema does not control drift behavior at runtime; the sink's 'allowSchemaDrift' and mapping settings in the data flow determine whether drifted columns are written. Option D is wrong because the source dataset's schema does not prevent drift detection; the 'allowSchemaDrift' setting on the source overrides the dataset schema, and the CSV file's extra columns are still read as drifted columns.

Practice this question →

26

Multi-Selecthard

You are developing a data processing pipeline in Azure Data Factory. The pipeline uses a mapping data flow to transform data. You need to ensure that the data flow can handle schema drift from the source. Which THREE of the following actions should you take? (Select THREE.)

Select 3 answers

A.Use a 'Select' transformation to explicitly choose only the known columns.

B.Enable 'Allow schema drift' in the source settings.

C.Set 'Ignore drift column types' in the sink settings to avoid type mismatch errors.

D.Use 'Column pattern' transformations to apply transformations to drifted columns.

E.Add a 'Derived Column' transformation for every possible column that might be added.

AnswersB, C, D

This allows the data flow to accept additional columns.

Why this answer

Option B is correct because enabling 'Allow schema drift' in the source settings of a mapping data flow tells Azure Data Factory to accept incoming columns that are not defined in the source schema. This is essential for handling schema drift, as it prevents the pipeline from failing when new columns appear in the source data.

Exam trap

The trap here is that candidates often think they must explicitly define every column or use static transformations to handle schema drift, when in fact Azure Data Factory's mapping data flow provides dynamic mechanisms like 'Allow schema drift', column patterns, and drift type handling to automatically accommodate unknown columns.

Practice this question →

27

MCQhard

You are designing a data processing solution using Azure Synapse Analytics serverless SQL pool. You have Parquet files in Azure Data Lake Storage Gen2 partitioned by date and hour. You need to query the data for the last 7 days efficiently, minimizing the amount of data scanned. Which approach should you use?

A.Use OPENROWSET with a wildcard path that includes all partitions and then apply a WHERE clause.

B.Create an external table with partition definition and query with a WHERE clause on the partition column.

C.Create a view using PolyBase external table and query with a date filter.

D.Use OPENROWSET with a file path that limits to the last 7 days' partitions and use a WHERE clause on the date column.

AnswerD

Limiting the file path reduces the files scanned, and the WHERE clause further filters.

Why this answer

Option D is correct because using OPENROWSET with a file path that restricts to the last 7 days' partitions (e.g., '/date=2025-03-*/hour=*/*.parquet') minimizes data scanned by only reading the relevant Parquet files. Applying a WHERE clause on the date column further filters rows within those files, leveraging partition pruning at the file system level. This approach avoids scanning all partitions, which is critical for cost and performance in serverless SQL pool.

Exam trap

The trap here is that candidates confuse serverless SQL pool's behavior with dedicated SQL pool's partition elimination, assuming that a WHERE clause on a partition column in an external table will automatically prune files, when in fact serverless SQL pool requires explicit path restriction in OPENROWSET to avoid scanning all partitions.

How to eliminate wrong answers

Option A is wrong because using a wildcard path that includes all partitions and then applying a WHERE clause does not prune partitions at the file level; serverless SQL pool still scans all files before filtering, resulting in unnecessary data read. Option B is wrong because creating an external table with partition definition in serverless SQL pool does not support automatic partition elimination; the WHERE clause on the partition column does not skip file scans as it would in a dedicated SQL pool. Option C is wrong because PolyBase external tables are not supported in serverless SQL pool; PolyBase is a feature of dedicated SQL pools, and using it here would be invalid.

Practice this question →

28

MCQhard

Refer to the exhibit. You are deploying an Azure Synapse Analytics dedicated SQL pool using the provided ARM template snippet. After deployment, you need to adjust the performance level to DW200c to handle increased workload. Which parameter should you modify?

A.storageAccountType

B.maxSizeBytes

C.collation

D.sku.name

AnswerD

The sku name specifies the Data Warehouse Unit (DWU) level, e.g., DW100c or DW200c.

Why this answer

Option B is correct because the sku name defines the performance level (DW100c, DW200c, etc.). Option A is wrong because maxSizeBytes is for storage limit, not performance. Option C is wrong because storageAccountType is for data redundancy.

Option D is wrong because collation is for language support.

Practice this question →

29

Multi-Selecthard

You are designing a data processing solution that must handle both batch and stream processing in a unified manner. You need to choose a technology that supports a single code base for both processing modes. Which TWO technologies meet this requirement?

Select 2 answers

A.Azure Databricks with Delta Live Tables

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure SQL Database with Change Tracking

E.Apache Spark Structured Streaming in Azure Synapse

AnswersA, E

Delta Live Tables supports both batch and streaming with a unified declarative pipeline.

Why this answer

Option B (Azure Databricks with Delta Live Tables) and Option D (Apache Spark Structured Streaming in Azure Synapse) support a unified batch-stream programming model. Option A (Azure Stream Analytics) is streaming-only. Option C (Azure Data Factory) is batch-only.

Option E (Azure SQL Database) is not a processing engine.

Practice this question →

30

Multi-Selecteasy

Which TWO actions should you take to optimize a Spark job in Azure Synapse Analytics that is experiencing excessive shuffling and long execution times?

Select 2 answers

A.Coalesce the number of partitions to reduce overhead.

B.Enable checkpointing to persist intermediate results.

C.Broadcast one of the tables if it is small enough to fit in memory.

D.Disable dynamic resource allocation.

E.Increase the number of shuffle partitions using 'spark.sql.shuffle.partitions'.

AnswersC, E

Broadcast join eliminates shuffle by replicating the small table to all executors.

Why this answer

Option A is correct because broadcasting a small lookup table can avoid shuffles in joins. Option C is correct because increasing the number of partitions can spread data more evenly, reducing shuffle size. Option B is wrong because coalescing reduces partitions, which may worsen shuffling.

Option D is wrong because disabling dynamic allocation may lead to resource contention. Option E is wrong because enabling checkpointing adds overhead and does not directly reduce shuffling.

Practice this question →

31

MCQhard

You are monitoring an Azure Stream Analytics job that processes data from an IoT hub. The job's output to Azure Synapse Analytics is experiencing high latency. The job's SU% utilization is at 90%. Which action will most likely reduce the latency?

A.Increase the number of Streaming Units (SUs) allocated to the job.

B.Decrease the watermark delay interval.

C.Increase the late arrival tolerance window.

D.Increase the number of partitions in the output table.

AnswerA

More SUs provide more processing capacity, reducing latency.

Why this answer

The job's SU% utilization is at 90%, indicating that the current Streaming Units (SUs) are nearly saturated, causing a processing bottleneck. Increasing the number of SUs allocates more compute resources (CPU and memory) to the Stream Analytics job, allowing it to process incoming IoT data faster and reduce the latency to Azure Synapse Analytics. This directly addresses the high utilization issue, which is the most likely root cause of the latency.

Exam trap

The trap here is that candidates often confuse output-side tuning (like partitioning or sink configuration) with the actual processing bottleneck, overlooking that high SU% utilization directly indicates the Stream Analytics job itself is the limiting factor.

How to eliminate wrong answers

Option B is wrong because decreasing the watermark delay interval would make the job emit results more frequently, but it does not increase processing capacity; with SU utilization already at 90%, this could worsen backpressure and increase latency. Option C is wrong because increasing the late arrival tolerance window only allows the job to handle out-of-order events for a longer period; it does not improve throughput or reduce latency caused by resource saturation. Option D is wrong because increasing the number of partitions in the output table improves write parallelism on the Synapse side, but the bottleneck is the Stream Analytics job's processing capacity (90% SU utilization), not the output sink's partitioning.

Practice this question →

32

MCQmedium

You are designing a data processing solution that requires running custom Python scripts for data transformation. The scripts have dependencies on specific libraries that are not pre-installed in the environment. You need a fully managed, serverless compute option that allows you to install custom libraries. Which service should you use?

A.Azure Databricks with cluster-scoped libraries

B.Azure Functions with custom handlers

C.Azure Container Instances with a custom Docker image

D.Azure Synapse Analytics serverless SQL pool

AnswerA

Databricks allows installing libraries at cluster level and is fully managed.

Why this answer

Azure Databricks with cluster configuration allows installation of custom libraries. Azure Synapse Spark pools also support libraries but require cluster management. Azure Functions may have limitations.

Azure Container Instances require container images.

Practice this question →

33

MCQmedium

You are building a data pipeline that uses Azure Data Factory to copy data from a REST API to Azure Blob Storage. The REST API returns JSON data in pages of 1000 records each. The total number of records is 50,000. Which activity or feature should you use to loop through the pages?

A.Use a ForEach activity to iterate over a fixed number of pages.

B.Use a Lookup activity to retrieve the total number of pages and then use a ForEach.

C.Use an Until activity to loop until the API returns no more pages.

D.Use a Copy activity with pagination rules enabled in the source.

AnswerD

The Copy activity supports pagination for REST APIs using pagination rules (e.g., next URL).

Why this answer

Option C is correct because the Until activity with a dynamic pagination logic is appropriate for looping until all pages are fetched. Option A is incorrect because the Copy activity alone does not handle pagination. Option B is incorrect because ForEach is for iterating a fixed set of items, not dynamic pagination.

Option D is incorrect because Lookup activity returns a single value, not suitable for pagination.

Practice this question →

34

MCQhard

You are troubleshooting a slow-running Azure Synapse Pipeline that loads data from Azure Blob Storage into a dedicated SQL pool using a Copy activity. The source is a set of CSV files totaling 500 GB. The sink is a staging table with a clustered columnstore index. The pipeline takes 4 hours to complete. You need to reduce the execution time to under 1 hour. What should you do?

A.Enable PolyBase in the Copy activity sink settings.

B.Increase the Data Integration Units (DIU) in the Copy activity to the maximum.

C.Increase the dedicated SQL pool's DWU setting to the highest tier.

D.Partition the staging table on a date column.

AnswerA

PolyBase provides the fastest way to load data into dedicated SQL pool by leveraging its parallel architecture.

Why this answer

Option D is correct because enabling PolyBase in the Copy activity uses the fast PolyBase loading path, which can significantly accelerate loading into dedicated SQL pool. Option A (increasing DIU) helps but is limited by the sink. Option B (increasing DWU) scales the warehouse but does not change the copy method.

Option C (partitioning the table) might help at query time but not during initial load.

Practice this question →

35

MCQmedium

Refer to the exhibit. You are deploying an Azure Synapse Analytics workspace using an ARM template. The template defines a managed virtual network integration runtime. You need to ensure that the integration runtime can run mapping data flows with a time-to-live (TTL) of 10 minutes. What is the purpose of the 'timeToLive' property in this configuration?

A.It defines how long the cluster will be kept alive after a data flow completes, allowing subsequent data flows to reuse the cluster.

B.It sets the timeout for the integration runtime to connect to the data sources.

C.It specifies the maximum duration a data flow activity can run before timing out.

D.It determines the maximum number of concurrent data flows that can run on the cluster.

AnswerA

TTL keeps the cluster warm for reuse, reducing startup time.

Why this answer

The 'timeToLive' property in an Azure Synapse Analytics managed virtual network integration runtime controls how long the cluster remains alive after a mapping data flow completes. By setting a TTL of 10 minutes, subsequent data flows can reuse the same warm cluster, avoiding the 5–10 minute cold start time for new clusters. This optimizes performance and reduces latency for consecutive data flow executions.

Exam trap

The trap here is that candidates confuse 'timeToLive' with activity timeout or concurrency limits, because all three involve time or capacity constraints, but TTL specifically governs cluster reuse after a data flow completes, not execution duration or parallelism.

How to eliminate wrong answers

Option B is wrong because the connection timeout to data sources is configured separately via linked service properties or the 'connectVia' runtime settings, not through the 'timeToLive' property. Option C is wrong because the maximum duration a data flow activity can run is set by the activity's 'timeout' property in the pipeline, not by the integration runtime's TTL. Option D is wrong because the maximum number of concurrent data flows is controlled by the 'concurrency' property on the integration runtime, not by 'timeToLive'.

Practice this question →

36

Multi-Selectmedium

Which TWO actions can you take to optimize the performance of a dedicated SQL pool in Azure Synapse Analytics when loading large volumes of data?

Select 2 answers

A.Create nonclustered indexes on all columns of the target table

B.Use ROUND_ROBIN distribution for the staging table

C.Set the row group size to 100,000 rows for optimal compression

D.Enable change tracking on the target table

E.Use CREATE TABLE AS SELECT (CTAS) with partition switching

AnswersB, E

Round-robin distributes data evenly, speeding up loads.

Why this answer

Options A and D are correct. Round-robin distribution ensures even data distribution during loads. Using CTAS with partition switching minimizes logging and fragmentation.

Option B is wrong because smaller row group sizes increase columnstore segment count, reducing compression and query performance. Option C is wrong because creating indexes on all columns increases write overhead. Option E is wrong because enabling change tracking adds overhead.

Practice this question →

37

MCQhard

You are designing a near-real-time data processing solution that ingests millions of events per second from IoT devices. The data must be aggregated on a per-minute basis and stored in Azure Data Lake Storage Gen2 for long-term analytics. The solution must also support alerting when certain thresholds are exceeded. Which combination of Azure services should you use?

A.Azure Event Hubs, Azure Data Factory, and Azure SQL Database.

B.Azure IoT Hub, Azure Stream Analytics, and Azure Functions.

C.Azure IoT Hub, Azure Databricks with Structured Streaming, and Azure Data Lake Storage Gen2.

D.Azure Event Hubs, Azure Data Explorer, and Power BI.

AnswerB

IoT Hub ingests device data, Stream Analytics performs per-minute aggregation and alerting, and Functions can output to Data Lake Storage.

Why this answer

Option D is correct because Azure IoT Hub ingests device data, Azure Stream Analytics performs real-time aggregation and alerting, and Azure Functions can write aggregated results to Data Lake Storage. Option A (Azure Event Hubs) is a generic event broker but lacks device management capabilities. Option B (Azure Databricks) is more suited for complex analytics than simple aggregation.

Option C (Azure Data Explorer) is for ad-hoc analytics, not for streaming aggregation with alerting.

Practice this question →

38

MCQeasy

Refer to the exhibit. You have a mapping data flow in Azure Data Factory that aggregates sales data. The data flow runs successfully but the sink table contains only the total sum per run instead of per product. What is missing?

A.The source dataset is not filtering by date

B.The aggregate transformation does not have a groupBy column

C.The data flow is missing a filter transformation

D.The sink dataset is not configured to append

AnswerB

Without groupBy, it aggregates all rows.

Why this answer

Option C is correct. The aggregate transformation does not have a groupBy clause, so it sums all rows. Adding a groupBy on ProductID will produce per-product totals.

Option A is wrong because the source is correct. Option B is wrong because the sink is fine. Option D is wrong because there is no filter.

Practice this question →

39

MCQhard

You are building a batch processing solution in Azure Synapse Analytics that reads data from a dedicated SQL pool, applies complex transformations using Synapse Spark, and writes the results back to the dedicated SQL pool. The pipeline must run on a schedule and handle transient failures with retries. Which approach should you use?

A.Use Azure Batch with a custom application to run Spark jobs

B.Use Azure Synapse Pipelines with a Notebook activity that runs Spark code

C.Use Azure Functions to trigger Spark jobs on demand

D.Use Azure Databricks with Auto Loader and Delta Live Tables

AnswerB

Synapse Pipelines support scheduling and retries for Spark notebooks.

Why this answer

Option B is correct because Azure Synapse Pipelines support scheduling, retry policies, and can orchestrate Notebook activities for Spark transformations. Option A is wrong because Azure Batch is not integrated with Synapse. Option C is wrong because Databricks is a separate service; using it would require additional integration.

Option D is wrong because Azure Functions are for event-driven, short-lived tasks, not complex transformations.

Practice this question →

40

MCQmedium

You are partitioning a large fact table in Azure Synapse Dedicated SQL Pool by date. The table is used for queries that filter on CustomerID and Date. You want to minimize data movement. Which distribution strategy should you use?

A.Round-robin distribution

B.Hash distribution on CustomerID

C.Replicate distribution

D.Hash distribution on Date

AnswerB

Hash distribution on CustomerID ensures collocation for joins on that key, reducing data movement.

Why this answer

Hash distribution on CustomerID is correct because queries filtering on CustomerID and Date will benefit from collocated joins and aggregations when CustomerID is the distribution key. Since the table is large and partitioned by Date, hash distribution on CustomerID minimizes data movement by ensuring that rows with the same CustomerID reside on the same distribution node, allowing filters on Date to be applied locally within each partition.

Exam trap

The trap here is that candidates often assume partitioning and distribution should be on the same column (Date) to optimize date-range queries, but this ignores that distribution on the join key (CustomerID) is what minimizes data movement for the most common query pattern involving both CustomerID and Date filters.

How to eliminate wrong answers

Option A is wrong because round-robin distribution distributes rows evenly without any logical grouping, causing every query that filters on CustomerID to require full data movement across all distributions. Option C is wrong because replicate distribution copies the entire table to each distribution node, which is impractical for a large fact table due to storage overhead and write performance degradation. Option D is wrong because hash distribution on Date would cause data movement for queries filtering on CustomerID, as the distribution key does not align with the join or filter column, and partitioning on Date already provides local pruning without needing distribution on the same column.

Practice this question →

41

MCQeasy

You need to transform JSON data containing nested arrays into a tabular format for analysis in Azure Synapse Analytics. Which transformation in Azure Data Factory or Synapse Pipelines should you use?

A.Join transformation

B.Derived Column transformation

C.Aggregate transformation

D.Flatten transformation

AnswerD

Flatten is specifically designed to unroll nested array elements into separate rows in mapping data flows.

Why this answer

The Flatten transformation is specifically designed to denormalize nested JSON arrays into a tabular format by expanding array elements into multiple rows while preserving parent attributes. In Azure Data Factory and Synapse Pipelines, this transformation handles complex hierarchical structures like arrays of objects, making it the correct choice for converting JSON with nested arrays into a row-based dataset suitable for analysis in Azure Synapse Analytics.

Exam trap

The trap here is that candidates often confuse the Flatten transformation with the Unpivot transformation, which pivots columns into rows but does not handle nested JSON arrays, or they mistakenly think the Derived Column transformation can handle array expansion through expressions.

How to eliminate wrong answers

Option A is wrong because the Join transformation combines two streams based on matching keys, not for flattening nested arrays within a single JSON document. Option B is wrong because the Derived Column transformation creates or modifies columns using expressions but cannot expand nested arrays into multiple rows. Option C is wrong because the Aggregate transformation performs grouping and summarization operations (e.g., SUM, COUNT) and does not handle the denormalization of nested array structures.

Practice this question →

42

MCQhard

You are designing a data processing solution for a large e-commerce company. The data includes sensitive customer information that must be encrypted at rest and in transit. The solution uses Azure Data Lake Storage Gen2 and Azure Databricks. Compliance requires that all access to the data is logged and monitored. Which combination of security features should you implement?

A.Enable Azure RBAC for storage and use Azure AD authentication.

B.Use Azure Firewall to restrict access and enable Azure AD Authentication for Databricks.

C.Enable Azure Storage encryption at rest, enforce HTTPS, and configure diagnostic settings to log all read/write operations.

D.Enable transparent data encryption (TDE) on the storage account and use service endpoints.

AnswerC

Meets all requirements.

Why this answer

Option B is correct because it covers encryption at rest (Azure Storage encryption), in transit (HTTPS), and logging (diagnostic settings). Option A is incorrect because RBAC alone does not log access details. Option C is incorrect because Azure Firewall is not for storage logging.

Option D is incorrect because TDE is for SQL databases, not Data Lake.

Practice this question →

43

MCQmedium

You are a data engineer at a retail company. You need to design a batch processing solution that ingests daily sales data from multiple stores. Each store uploads a CSV file to a dedicated folder in Azure Data Lake Storage Gen2. The files have the same schema but may have minor variations in column order and include null values. After ingestion, you must clean the data by removing rows with null values in the 'SalesAmount' column, convert the 'Date' column from string to date type, and aggregate sales by product category. The output should be stored as Parquet files partitioned by year and month in the same Data Lake. You need to choose a compute service and implement the transformation with minimal coding effort. The solution must be cost-effective and require no cluster management. What should you do?

A.Use Azure Synapse Serverless SQL pool. Create external tables over the CSV files, write a T-SQL query to filter, cast, aggregate, and use CETAS to write Parquet partitions.

B.Use Azure Synapse Dedicated SQL pool. Load CSV files via PolyBase, transform with T-SQL, and use CREATE TABLE AS SELECT to output partitioned Parquet.

C.Use Azure Data Factory with a Mapping Data Flow. Configure the source to read all CSV files from the ADLS Gen2 folder. Add a Filter transformation to remove null SalesAmount, a Derived Column to parse the Date, and an Aggregate to sum sales by category. Sink to ADLS Gen2 as Parquet with partition by year and month.

D.Use Azure Databricks with a PySpark notebook. Mount the ADLS Gen2 storage, read CSV files with schema inference, filter, cast, aggregate, and write partitioned Parquet.

AnswerC

Mapping Data Flow provides code-free transformations and handles schema drift.

Why this answer

Azure Data Factory Mapping Data Flows provide a code-free visual interface to perform transformations like filter, derived column, and aggregate. It can read from ADLS Gen2, handle schema drift, and write partitioned Parquet files. Option A is correct.

Option B requires coding in PySpark and cluster management. Option C uses serverless SQL which is not ideal for complex transformations and file partitioning. Option D uses SQL pool which requires provisioning and is more expensive.

Practice this question →

44

MCQeasy

Refer to the exhibit. You are reviewing an Azure Stream Analytics job query. The job has a stream input and a reference data input. The job is failing with the error 'Reference data input must be of type Reference, not Stream'. What is the cause of the error?

A.The input alias is incorrect.

B.The JOIN syntax is incorrect for reference data.

C.The input type is Stream; it should be Reference.

D.The output type is set to ReferenceData; it should be a different type.

AnswerD

ReferenceData is only for input; output must be a sink type.

Why this answer

Option B is correct because the output type is incorrectly set to ReferenceData; it should be set to a sink type like SQL or Blob. Option A is wrong because the alias is not the issue. Option C is wrong because the query syntax is correct.

Option D is wrong because the input type is correct.

Practice this question →

45

Multi-Selectmedium

You are building a data processing pipeline in Azure Synapse Analytics that ingests data from Azure Blob Storage and writes to a dedicated SQL pool. You need to ensure the pipeline can handle schema changes in the source data without failing. Which TWO actions should you take?

Select 2 answers

A.Use staged copy with an intermediate storage.

B.Enable schema drift in a Mapping Data Flow activity.

C.Define a fixed schema in the source dataset.

D.Use PolyBase to load data into the dedicated SQL pool.

E.Use the Copy activity with AutoCreateTable enabled.

AnswersB, E

Mapping Data Flow can automatically handle schema drift by mapping new columns.

Why this answer

Mapping Data Flows in Azure Synapse Analytics support schema drift, which allows the pipeline to dynamically handle changes in source data structure (e.g., new columns, changed data types) without failing. By enabling schema drift, the data flow can automatically detect and propagate these changes downstream, ensuring resilience against schema evolution.

Exam trap

The trap here is that candidates often confuse PolyBase or staged copy with schema drift handling, but those features are designed for performance or staging, not for dynamic schema adaptation.

Practice this question →

46

MCQmedium

You are designing a data processing solution in Azure using Azure Data Lake Storage Gen2 as the storage layer. You need to ensure that data ingested from various sources is immutable and can be used for both batch and streaming workloads. Which storage design pattern should you implement?

A.Store data in a normalized relational database structure.

B.Implement a medallion architecture with bronze, silver, and gold layers.

C.Use a data vault model with hubs, links, and satellites.

D.Design a star schema with fact and dimension tables.

AnswerB

The medallion architecture provides data immutability and supports both processing paradigms.

Why this answer

Option A is correct because a medallion architecture (bronze, silver, gold) is a common pattern for organizing data in a lakehouse, providing immutability in bronze, transformations in silver, and aggregations in gold. It supports both batch and streaming. Option B is wrong because a star schema is a dimensional modeling approach for data warehouses, not a storage pattern for immutability.

Option C is wrong because a data vault is an enterprise data modeling pattern, not specifically for immutability in a lake. Option D is wrong because a normalized relational model is not suited for big data lake scenarios.

Practice this question →

47

MCQmedium

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query performance is slow. Which action would most likely improve performance?

A.Partition the Parquet files by date

B.Create a materialized view on the external data

C.Create a PolyBase external table and query that

D.Create statistics on the columns used in WHERE clauses

AnswerD

Statistics enable the optimizer to generate better execution plans.

Why this answer

In Azure Synapse Analytics serverless SQL pool, query performance on external data like Parquet files heavily relies on statistics to enable the optimizer to generate efficient query plans. Without statistics, the engine makes default cardinality estimates, often leading to suboptimal joins and filters. Creating statistics on columns used in WHERE clauses allows the serverless SQL pool to accurately estimate row counts and choose better execution strategies, directly improving query performance.

Exam trap

The trap here is that candidates often assume physical data organization (partitioning) or materialized views are the primary levers for performance in serverless SQL pool, overlooking that statistics are the critical metadata for the query optimizer in this stateless, compute-on-demand environment.

How to eliminate wrong answers

Option A is wrong because partitioning Parquet files by date is a storage optimization that primarily benefits file pruning and partition elimination in batch processing or dedicated SQL pools, but serverless SQL pool does not automatically leverage file partitioning for query optimization in the same way; it relies on file metadata and statistics instead. Option B is wrong because materialized views are not supported in serverless SQL pool; they are a feature of dedicated SQL pool in Azure Synapse Analytics. Option C is wrong because PolyBase external tables are the default mechanism for querying external data in serverless SQL pool, so creating one does not change the query execution path or improve performance; the issue is lack of statistics, not the table type.

Practice this question →

48

Multi-Selectmedium

A data engineer is building a pipeline in Azure Data Factory to load data from Azure Blob Storage to Azure Synapse Analytics using PolyBase. The pipeline must handle large files efficiently and minimize load time. Which TWO actions should the engineer take? (Choose two.)

Select 2 answers

A.Split large files into multiple smaller files (e.g., 256 MB each).

B.Use a staging table to stage the data before loading.

C.Convert files to Parquet format for better compression.

D.Use the copy activity with staging enabled.

E.Configure PolyBase settings to use 'RejectValue' for handling bad records.

AnswersA, E

Enables parallelism.

Why this answer

Options B and D are correct. Splitting large files into multiple smaller files enables parallel loading with PolyBase. Using the 'RejectValue' option handles errors without failing the entire load.

Option A is incorrect because Parquet is not supported by PolyBase to Synapse dedicated SQL pool. Option C is incorrect because PolyBase does not use staging tables. Option E is incorrect because copy activity with staging uses PolyBase but the question asks for PolyBase-specific optimizations.

Practice this question →

49

Multi-Selecteasy

Which TWO configuration options are required to enable change data capture (CDC) in Azure Data Factory for incremental loading from Azure SQL Database?

Select 2 answers

A.Create a dedicated folder in Azure Blob Storage for CDC logs.

B.Enable CDC on the source Azure SQL Database.

C.Create a 'Change data capture' resource in Azure Data Factory.

D.Provision a staging Azure SQL Database.

E.Configure a full load of the entire table first.

AnswersB, C

CDC must be enabled at the database level.

Why this answer

Option B is correct because enabling CDC on the source Azure SQL Database is a prerequisite for Azure Data Factory (ADF) to capture incremental changes. ADF uses the native CDC mechanism of the source database, which requires the database to have CDC enabled at the database level and on the specific tables being tracked. Without this, ADF cannot read the change data from the system tables.

Exam trap

The trap here is that candidates confuse the 'Change Data Capture' resource in ADF (which is a configuration option within a dataset) with a separate resource type that must be explicitly created, leading them to select option C as a required configuration when it is actually part of the dataset setup and not a standalone resource.

Practice this question →

50

MCQmedium

Your company uses Azure Data Factory to load data from an on-premises SQL Server database to Azure Synapse Analytics. The pipeline runs daily and the load volume is about 50 GB per run. Recently, the pipeline started failing with timeout errors during the copy activity. You need to ensure reliable data transfer. What should you do?

A.Use a self-hosted integration runtime and increase the copy activity timeout

B.Use PolyBase to load data in parallel

C.Split the copy into multiple smaller activities

D.Switch to Azure Integration Runtime

AnswerA

Self-hosted IR connects to on-premises; increasing timeout prevents failures.

Why this answer

Option B (Use a self-hosted integration runtime and increase the timeout) is correct because a self-hosted IR is required for on-premises sources, and increasing the timeout can handle larger volumes. Option A (Use Azure Integration Runtime) does not connect to on-premises. Option C (Use PolyBase) is for loading into Synapse but does not address the connectivity or timeout.

Option D (Split the data into smaller chunks) might help but is not the primary solution; increasing timeout is more direct.

Practice this question →

51

MCQhard

Refer to the exhibit. You have an Azure Data Factory dataset definition for a Parquet file stored in Azure Data Lake Storage Gen2. You attempt to use this dataset as a source in a copy activity, but the copy activity fails with an error indicating that the file is not found. The file 'sales_orders.parquet' exists at the specified path. What is the most likely cause of the error?

A.The dataset is configured with a specific file name, but the data is partitioned across multiple folders; you should use a wildcard pattern in the folderPath.

B.The compression codec 'snappy' is not supported for Parquet files.

C.The dataset type should be 'AzureBlobStorageLocation' instead of 'AzureBlobFSLocation'.

D.The linked service 'ADLSGen2' is not properly configured with the storage account key.

AnswerA

While the exhibit shows a single file, the error suggests the file is not found, likely because the dataset expects a different path structure. Using wildcards would allow the copy activity to find the file within the partition structure.

Why this answer

Option A is correct because the dataset definition specifies a single file name ('sales_orders.parquet') in the 'fileName' property, but the actual data is likely stored across multiple Parquet files in a partitioned folder structure (e.g., 'sales_orders/year=2024/month=01/'). When a copy activity uses this dataset as a source, Azure Data Factory looks for the exact file name at the specified path, not finding any file because the data is spread across subfolders. Using a wildcard pattern (e.g., '*.parquet') in the 'folderPath' or setting 'fileName' to '*' allows the copy activity to read all Parquet files in the folder, resolving the 'file not found' error.

Exam trap

The trap here is that candidates often assume the error is due to authentication or configuration issues (like linked service keys or location types) rather than recognizing that the dataset's explicit file name prevents ADF from scanning subdirectories for partitioned files.

How to eliminate wrong answers

Option B is wrong because Snappy is a widely supported compression codec for Parquet files in Azure Data Factory; it is not the cause of a 'file not found' error. Option C is wrong because 'AzureBlobFSLocation' is the correct location type for Azure Data Lake Storage Gen2 (which uses the Blob Filesystem API), while 'AzureBlobStorageLocation' is for Azure Blob Storage (which uses the Blob REST API); using the wrong type would cause a different error, not a 'file not found' error. Option D is wrong because if the linked service were misconfigured (e.g., wrong storage account key), the error would typically indicate authentication or authorization failure, not a 'file not found' error.

Practice this question →

52

MCQmedium

You are configuring a data pipeline in Azure Data Factory that uses a Mapping Data Flow. The source is a SQL Server table with 50 million rows. The sink is a Delta table in ADLS Gen2. The pipeline runs slowly. You need to improve performance by reducing the number of partitions in the data flow. Which setting should you adjust?

A.In the source transformation, set the 'Optimize' tab to 'Use current partitioning'.

B.Decrease the cluster size to 4 cores.

C.In the sink transformation, set the 'Optimize' tab to 'Single partition'.

D.In the pipeline activity, set the 'Data flow' property 'partitionOption' to 'Single partition'.

AnswerC

Sink optimization controls how data is written; setting to single partition reduces file count.

Why this answer

Option C is correct because setting the sink transformation's 'Optimize' tab to 'Single partition' forces all data to be written into a single partition, which reduces the number of output files and the associated metadata overhead. This can improve performance when the downstream processing or storage benefits from fewer, larger files, especially for small-to-medium datasets or when the sink is a Delta table that must be compacted later.

Exam trap

The trap here is that candidates often confuse the 'Optimize' tab settings on the sink with the 'Partition option' in the source or pipeline activity, or mistakenly think that reducing cluster size (Option B) is a valid performance tuning technique for partition reduction.

How to eliminate wrong answers

Option A is wrong because 'Use current partitioning' in the source transformation does not reduce partitions; it preserves the existing partition count from the source, which may still be high and cause slow performance. Option B is wrong because decreasing the cluster size to 4 cores reduces compute resources, which typically worsens performance rather than improving it, and does not directly control the number of partitions in the data flow. Option D is wrong because the 'partitionOption' property is not available at the pipeline activity level for Mapping Data Flows; partition settings are configured within the data flow transformations themselves, not in the pipeline activity properties.

Practice this question →

53

MCQeasy

Your organization uses Azure Data Lake Storage Gen2 (ADLS Gen2) and wants to transform data using Azure Databricks. The data is stored in Parquet format. You need to read the data into a Spark DataFrame. Which DataFrame reader method should you use?

A.spark.read.avro()

B.spark.read.json()

C.spark.read.parquet()

D.spark.read.csv()

AnswerC

The parquet() method reads Parquet files directly.

Why this answer

Option C is correct because the data is stored in Parquet format, and the Spark DataFrame reader method `spark.read.parquet()` is specifically designed to read Parquet files, which is a columnar storage format optimized for big data processing in Azure Databricks.

Exam trap

The trap here is that candidates may confuse file format reader methods (e.g., using `spark.read.avro()` for Parquet data) due to assuming all binary formats are interchangeable, but each reader method is strictly tied to its specific file format.

How to eliminate wrong answers

Option A is wrong because `spark.read.avro()` is used for Avro format, not Parquet. Option B is wrong because `spark.read.json()` is for JSON files, which are text-based and not columnar like Parquet. Option D is wrong because `spark.read.csv()` is for CSV files, which are row-based and lack the compression and schema efficiency of Parquet.

Practice this question →

54

MCQhard

Your company uses Azure Data Lake Storage Gen2 and Azure Databricks for data processing. Some Parquet files in the lake are written with a schema that includes a column 'address' of struct type. A downstream process expects 'address' to be a string. You need to transform the data in a way that minimizes read overhead and does not rewrite the entire dataset. Which approach should you use?

A.Use schema evolution in Delta Lake to change the column type.

B.Apply a schema-on-read transformation that casts the column to string when reading.

C.Create a view that selects the column as a string and save it as a new table.

D.Rewrite the Parquet files with the correct schema using a Spark job.

AnswerB

Efficient, no rewrite needed.

Why this answer

Option C is correct because schema-on-read allows you to read the struct and cast it to string without rewriting files. Option A is incorrect because rewriting is costly. Option B is incorrect because schema evolution does not change data types.

Option D is incorrect because it would create a new copy.

Practice this question →

55

Multi-Selectmedium

Which TWO of the following are built-in components of Azure Data Factory that can be used to perform data transformation without writing custom code?

Select 2 answers

A.Mapping Data Flow

B.Copy Activity

C.Data Flow Script

D.Azure Function Activity

E.Stored Procedure Activity

AnswersA, C

Mapping Data Flows allow visual data transformation without code.

Why this answer

Mapping Data Flow (A) is a built-in Azure Data Factory component that allows you to design data transformations visually using a graphical interface, without writing any code. It executes at scale on Azure Databricks clusters and supports operations like joins, aggregations, and pivots, making it a code-free transformation tool.

Exam trap

The trap here is that candidates often confuse the Copy Activity with a transformation tool because it can perform simple column mapping and data type conversions, but it lacks the ability to handle complex, multi-step transformations like those in Mapping Data Flows.

Practice this question →

56

MCQhard

You are designing a batch processing solution for a financial services company that processes transactions from multiple sources. The data is stored in Azure Data Lake Storage Gen2 in Parquet format. You need to perform complex transformations including joins, aggregations, and window functions, and then load the results into an Azure Synapse Analytics dedicated SQL pool. The transformations must be written in SQL and executed on a serverless Spark cluster to minimize costs. You want to manage the code in a Git repository and automate the deployment using Azure DevOps. Which approach should you recommend?

A.Use Azure Databricks with Auto Loader to read from Data Lake Storage, perform transformations using Spark SQL, and write to the dedicated SQL pool via JDBC.

B.Use Azure Data Factory with Mapping Data Flows to visually design transformations and write to the dedicated SQL pool.

C.Create a Spark job definition in Azure Synapse that reads Parquet files, performs transformations using PySpark, and writes to the dedicated SQL pool using the Spark Synapse connector.

D.Use a serverless SQL pool in Azure Synapse to query the Parquet files using T-SQL, then use CETAS to write the results to the dedicated SQL pool.

AnswerD

Serverless SQL pool is cost-effective for ad-hoc querying; CETAS allows moving data to dedicated pool.

Why this answer

Option B is correct because Synapse Serverless SQL pool can query Parquet files directly using T-SQL, and you can use CREATE EXTERNAL TABLE AS SELECT (CETAS) to write results to the dedicated SQL pool. This avoids Spark costs and uses serverless compute. Option A uses Spark, which costs more.

Option C uses Azure Databricks, which is external and adds complexity. Option D uses Azure Data Factory with Data Flows that run on Spark, also incurring costs.

Practice this question →

57

MCQeasy

You need to process a large dataset that contains personally identifiable information (PII). The data must be anonymized before being used for analytics. Which Azure service should you use to apply column-level masking dynamically?

A.Azure API Management policies

B.Azure Synapse Analytics dynamic data masking

C.Azure Data Lake Storage access control lists (ACLs)

D.Azure Purview classification and labeling

AnswerB

Dynamic data masking hides sensitive data from non-privileged users in query results.

Why this answer

Azure Synapse Analytics provides dynamic data masking at the column level. Azure Purview is for data governance. Azure API Management is for APIs.

Azure Data Lake Storage does not provide masking.

Practice this question →

58

MCQhard

Refer to the exhibit. The pipeline executes a Spark job definition that reads from the input path and writes to the output path. The job fails with an error indicating that the output path already exists. The job is configured with default behavior. What configuration change should you make to allow the job to overwrite the existing output?

A.Modify the Spark job definition to include 'overwrite' save mode.

B.Change the output path in the arguments to a non-existing folder.

C.Set the pipeline property 'outputMode' to 'overwrite' on the SynapseSparkJob activity.

D.Add a Delete activity before the Spark job to delete the output folder if it exists.

AnswerD

Using a Delete activity in the pipeline to remove the existing output folder before running the Spark job ensures the job can write without conflict.

Why this answer

Spark jobs by default fail if the output directory exists. To overwrite, you need to set the Spark configuration 'spark.hadoop.parquet.enable.summary-metadata' is not relevant. The correct approach is to set the Spark configuration 'spark.sql.sources.partitionOverwriteMode' to 'STATIC' or use 'overwrite' save mode in the code.

However, in the exhibit, the job is a Spark job definition that calls a file; the simplest way is to set the argument to use overwrite mode. Option B (Delete activity) is a separate activity. Option C changes the job definition.

Option D (output mode) is not a pipeline property.

Practice this question →

59

MCQmedium

You are debugging an Azure Data Factory pipeline that copies CSV files from Azure Blob Storage to Azure SQL Database. The pipeline is defined as shown in the exhibit. The first row of each CSV file contains column headers. After a run, you notice that the first data row is missing from the target table. What is the most likely cause?

A.The 'writeBehavior' is set to 'insert', which does not overwrite existing rows

B.The 'skipLineCount' is set to 1, which skips the first data row after the header

C.The 'recursive' setting is true, causing the pipeline to read files in subdirectories and duplicate data

D.The 'enableStaging' is false, which prevents PolyBase from loading data

AnswerB

skipLineCount skips lines after header, removing first data row.

Why this answer

Option B is correct because 'skipLineCount': 1 skips the first line after the header line; since header is already parsed, the first data row is skipped. Option A is wrong because 'recursive': true only affects subdirectories, not line skipping. Option C is wrong because 'writeBehavior': 'insert' is default and does not skip rows.

Option D is wrong because 'enableStaging': false is fine for direct copy.

Practice this question →

60

MCQmedium

You are developing a data processing pipeline for a gaming company that uses Azure Databricks. The pipeline processes game event data from Azure Event Hubs. You need to detect cheating patterns by analyzing events in real time. The solution must be able to handle high throughput and low latency. The output should be written to Azure Cosmos DB for real-time dashboards. Which approach should you use?

A.Use Azure Databricks Structured Streaming to read from Event Hubs, use Spark SQL and machine learning to detect cheating patterns, and write to Cosmos DB using the Azure Cosmos DB Spark connector.

B.Use Azure Functions with Event Hubs trigger to process each event and write to Cosmos DB.

C.Use Azure Data Factory with continuous copy to load data into Cosmos DB and then use Azure Synapse Analytics to detect patterns.

D.Use Azure Stream Analytics to query the stream for cheating patterns and output to Cosmos DB.

AnswerA

Supports high throughput and complex analytics.

Why this answer

Option B is correct because Azure Databricks Structured Streaming can handle high throughput and low latency, and it can write to Cosmos DB using the Azure Cosmos DB Spark connector. Option A is wrong because Azure Stream Analytics outputs to Cosmos DB but may not provide the same flexibility for complex pattern detection. Option C is wrong because Azure Data Factory is not for real-time processing.

Option D is wrong because Azure Functions would be too complex for high throughput.

Practice this question →

61

Multi-Selectmedium

You are designing a data processing solution using Azure Databricks. You need to read data from Azure Data Lake Storage Gen2, transform it using Spark SQL, and write to a Delta table. Which TWO configurations are required to ensure optimal performance for large datasets?

Select 2 answers

A.Disable automatic schema detection to reduce overhead.

B.Use Delta Lake's OPTIMIZE command to compact small files.

C.Use Delta Lake Z-order optimization on frequently filtered columns.

D.Cache the entire DataFrame in memory after reading.

E.Enable auto-compaction in Spark configuration.

AnswersB, C

Compacting small files improves read performance.

Why this answer

Options A and D are correct. Option A is correct because optimizing the Delta table with Z-order improves data skipping for columns used in queries. Option D is correct because using Delta Lake's optimize command (bin-packing) compacts small files, improving read performance.

Option B is wrong because disabling automatic schema detection can be useful but is not required for performance. Option C is wrong because caching is optional and may not help for large datasets that don't fit in memory. Option E is wrong because enabling auto compaction helps but is not a requirement; it's a feature.

Practice this question →

62

MCQeasy

Refer to the exhibit. You are deploying the ARM template to create an Azure Synapse Analytics workspace. The deployment fails because the managed resource group cannot be created. What is the most likely reason?

A.The managedResourceGroupName property is not allowed; Azure Synapse creates the resource group automatically

B.The apiVersion is invalid for this resource type

C.The location is missing the 'properties' section

D.The storage account URL must use the blob endpoint, not DFS

AnswerA

You cannot specify a custom managed resource group name.

Why this answer

Option A is correct because the `managedResourceGroupName` property is not allowed in an ARM template for Azure Synapse Analytics. Azure Synapse automatically creates and manages its own managed resource group (e.g., `managed-rg-<workspace-name>`) to hold provisioned resources like Spark pools and SQL pools. Specifying this property in the template causes a validation error, as the service requires full control over the naming and lifecycle of this group.

Exam trap

The trap here is that candidates often assume they can control the naming of all Azure resources in an ARM template, but Azure Synapse enforces automatic management of its internal resource group, and specifying it triggers a validation error rather than a runtime failure.

How to eliminate wrong answers

Option B is wrong because the `apiVersion` shown in the exhibit (e.g., `2021-06-01-preview` or similar) is valid for the `Microsoft.Synapse/workspaces` resource type; an invalid apiVersion would produce a different error, not a managed resource group creation failure. Option C is wrong because the `location` property is a required top-level property for the resource, and it does not need a nested `properties` section; the error is unrelated to location formatting. Option D is wrong because Azure Synapse supports both blob (`*.blob.core.windows.net`) and DFS (`*.dfs.core.windows.net`) endpoints for the default storage account; using DFS is actually recommended for optimal performance with Synapse pipelines and Spark, and it would not cause a managed resource group creation failure.

Practice this question →

63

MCQeasy

You need to process streaming data from Azure Event Hubs and store the results in Azure Cosmos DB for a real-time dashboard. The solution must handle duplicate events and ensure exactly-once processing. Which Azure service should you use?

A.Azure Data Factory

B.Azure Functions with Event Hubs trigger

C.Azure Stream Analytics

D.Azure Databricks Structured Streaming

AnswerC

Supports exactly-once semantics with Event Hubs.

Why this answer

Option C (Azure Stream Analytics) is correct because it provides exactly-once processing when configured with Event Hubs and Cosmos DB output. Option A (Azure Data Factory) is batch-oriented. Option B (Azure Functions) may have at-least-once guarantees.

Option D (Azure Databricks) can achieve exactly-once but requires more configuration.

Practice this question →

64

Multi-Selecthard

You are designing a data processing solution for a retail company that uses Azure Databricks. The solution needs to process streaming sales data from Event Hubs and batch data from Azure Data Lake Storage Gen2. You need to ensure that the solution can handle late-arriving data and maintain exactly-once semantics. Which TWO technologies should you use?

Select 2 answers

A.Delta Lake

B.Azure Databricks Structured Streaming

C.PolyBase

D.Azure Stream Analytics

E.Azure Data Factory

AnswersA, B

Provides ACID transactions and supports exactly-once semantics.

Why this answer

Correct answers: A and D. Delta Lake provides ACID transactions and schema enforcement, enabling exactly-once semantics. Azure Databricks Structured Streaming can handle late-arriving data using watermarking and event-time processing.

Option B is wrong because Azure Stream Analytics does not integrate natively with Delta Lake for exactly-once semantics. Option C is wrong because Azure Data Factory is not suitable for streaming processing. Option E is wrong because PolyBase is for batch loading, not streaming.

Practice this question →

65

MCQmedium

You are working with Azure Synapse Analytics serverless SQL pool. You need to query a set of Parquet files located in ADLS Gen2. The files have nested columns (structs and arrays). Which function should you use to flatten the nested data?

A.OPENJSON

B.PIVOT

C.UNNEST

D.CROSS APPLY

AnswerA

OPENJSON can parse nested JSON structures and flatten them into rows.

Why this answer

OPENJSON is the correct function because it parses JSON text and returns objects and properties from JSON input as rows and columns. In Azure Synapse serverless SQL pool, when Parquet files contain nested columns (structs and arrays), they are exposed as JSON strings, and OPENJSON can flatten these nested structures into a relational format. This allows you to query complex nested data directly without needing to pre-process the files.

Exam trap

The trap here is that candidates confuse CROSS APPLY with a flattening function, but CROSS APPLY only invokes a table-valued function (like OPENJSON) and does not flatten data by itself.

How to eliminate wrong answers

Option B (PIVOT) is wrong because PIVOT rotates table-valued expressions by turning unique values from one column into multiple columns in the output; it does not flatten nested data. Option C (UNNEST) is wrong because UNNEST is a PostgreSQL function for expanding arrays into rows; it is not supported in Azure Synapse serverless SQL pool. Option D (CROSS APPLY) is wrong because CROSS APPLY joins a table with a table-valued function, but it does not inherently flatten nested columns; it would require an additional function like OPENJSON to parse the nested data first.

Practice this question →

66

MCQeasy

You need to perform incremental data loading from Azure SQL Database to Azure Data Lake Storage Gen2. You want to minimize cost and complexity. Which Azure Data Factory feature should you use?

A.Use a Lookup activity to get the maximum timestamp from the sink and filter the source

B.Use the 'Incremental copy' capability with change tracking enabled on the source

C.Use a Stored Procedure activity to delete and reinsert data

D.Use a Mapping Data Flow to compare source and sink

AnswerB

This is the simplest and most cost-effective method.

Why this answer

Option B is correct because Azure Data Factory's 'Incremental copy' capability with change tracking on Azure SQL Database automatically identifies and transfers only the changed rows since the last run, using the built-in change tracking mechanism. This minimizes cost and complexity by avoiding custom logic for watermark columns or full reloads, as it handles the delta extraction natively.

Exam trap

The trap here is that candidates often confuse the 'Incremental copy' capability with manual watermark-based approaches (Option A) or assume that a Mapping Data Flow (Option D) is the only way to compare datasets, overlooking the native, cost-optimized change tracking integration.

How to eliminate wrong answers

Option A is wrong because using a Lookup activity to get the maximum timestamp from the sink and filter the source requires manual implementation of a watermark column, which adds complexity and does not leverage Azure SQL Database's native change tracking, potentially missing deletes or updates. Option C is wrong because using a Stored Procedure activity to delete and reinsert data performs a full reload of the target, which is costly and inefficient for incremental loading, and does not minimize cost or complexity. Option D is wrong because using a Mapping Data Flow to compare source and sink requires a full scan of both datasets to identify differences, which is resource-intensive and expensive, defeating the goal of minimizing cost and complexity.

Practice this question →

67

MCQmedium

You are designing a data processing pipeline in Azure Synapse Analytics. The pipeline must ingest streaming data from Azure Event Hubs, perform real-time aggregations, and store the results in a dedicated SQL pool. Which component should you use to perform the real-time transformations?

A.Azure Databricks with Structured Streaming

B.Azure Data Factory

C.Azure Synapse Pipelines

D.Azure Stream Analytics

AnswerD

Azure Stream Analytics is optimized for real-time stream processing and can output to Synapse dedicated SQL pool.

Why this answer

Azure Stream Analytics is the correct choice because it is a fully managed, real-time analytics service designed specifically for processing streaming data from sources like Azure Event Hubs. It supports SQL-based query language for performing aggregations, windowing functions, and temporal joins, and can directly output results to a dedicated SQL pool in Azure Synapse Analytics. This makes it the optimal component for ingesting streaming data, performing real-time transformations, and storing aggregated results in a Synapse dedicated SQL pool.

Exam trap

Microsoft often tests the distinction between batch and real-time processing services, and the trap here is that candidates may confuse Azure Synapse Pipelines or Azure Data Factory as capable of real-time streaming, when in fact they are batch-oriented orchestration tools.

How to eliminate wrong answers

Option A is wrong because Azure Databricks with Structured Streaming, while capable of real-time processing, introduces additional complexity and cost, and is not the native or most straightforward choice for directly outputting to a dedicated SQL pool in Synapse; it typically requires a separate write path or connector. Option B is wrong because Azure Data Factory is an orchestration and ETL service for batch data movement and transformation, not designed for real-time streaming ingestion or continuous aggregation. Option C is wrong because Azure Synapse Pipelines are built on Azure Data Factory and share the same batch-oriented architecture, lacking native support for real-time stream processing and windowed aggregations.

Practice this question →

68

MCQhard

You are a data engineer for a global e-commerce company. You need to design a data processing solution using Azure Databricks that processes real-time clickstream data from Azure Event Hubs. The solution must join the streaming data with a slowly changing dimension (SCD) table that stores product details. The SCD table is stored in Azure Data Lake Storage Gen2 as Delta format and is updated every few hours. The joined results must be written to a Delta table for near-real-time dashboards. The key requirement is to ensure that the join always uses the latest version of the SCD data without reprocessing the entire stream. The solution must minimize latency and cost. Which approach should you recommend?

A.Use Spark Structured Streaming with foreachBatch and read the SCD table inside the foreachBatch function.

B.Use Spark Structured Streaming with a static DataFrame for the SCD table and refresh it periodically using a trigger that reads the latest snapshot.

C.Use Spark Structured Streaming with a batch read of the SCD table in each micro-batch using spark.read.

D.Use Spark Structured Streaming with a streaming join on the SCD table by converting it to a stream using readStream.

AnswerB

Periodic refresh of a static DataFrame minimizes overhead and ensures latest data.

Why this answer

Option B is correct because it uses a static DataFrame for the SCD table and refreshes it periodically using a trigger (e.g., a time-based or file-based trigger). This approach ensures that the join always uses the latest version of the SCD data without reprocessing the entire stream, as the static DataFrame is re-read only when the SCD is updated. It minimizes latency and cost by avoiding the overhead of reading the SCD in every micro-batch (as in Option A) or using a streaming join (as in Option D), which is not suitable for batch-updated SCD tables.

Exam trap

The trap here is that candidates often assume that reading the SCD table in every micro-batch (Option A or C) is the simplest way to get the latest data, but they overlook the significant performance and cost penalties of full table scans in each micro-batch, especially for large SCD tables.

How to eliminate wrong answers

Option A is wrong because foreachBatch with a read of the SCD table inside the function would read the SCD in every micro-batch, causing high latency and cost due to repeated full scans of the Delta table, and it does not efficiently handle the periodic updates of the SCD. Option C is wrong because reading the SCD table in each micro-batch using spark.read would also perform a full scan of the Delta table every micro-batch, leading to excessive I/O and cost, and it does not leverage the periodic refresh pattern. Option D is wrong because converting the SCD table to a stream using readStream is inappropriate for a slowly changing dimension that is updated every few hours; streaming the SCD would require it to be a continuous stream of changes (e.g., from a change data capture feed), which is not the case here, and it would add unnecessary complexity and cost.

Practice this question →

69

MCQeasy

You are designing a batch processing solution for a data lake. Source files arrive daily in Parquet format in Azure Data Lake Storage Gen2. The data must be cleaned, aggregated, and loaded into an Azure Synapse SQL pool. The solution should minimize compute costs and management overhead. Which technology should you use for the transformation?

A.Azure HDInsight with Spark jobs scheduled in Azure Data Factory.

B.Azure Synapse Pipelines with mapping data flows.

C.Azure Data Factory with a custom SSIS package.

D.Azure Databricks with an Auto Loader pipeline.

AnswerB

Mapping data flows in Synapse Pipelines provide serverless, code-free transformation with minimal management.

Why this answer

Option C is correct because Synapse Pipelines (built on ADF) can orchestrate mapping data flows that run on serverless Spark clusters, minimizing management overhead. Option A (HDInsight) requires cluster management. Option B (Databricks) incurs higher costs for simple batch.

Option D (SSIS) is legacy and not cloud-native.

Practice this question →

70

MCQhard

You are a data engineer for a healthcare company that processes patient data. You have an Azure Databricks workspace with a cluster configured for data processing. You need to implement a solution that processes streaming data from Azure Event Hubs, enriches it with reference data stored in Azure Cosmos DB, and writes the output to Delta Lake in Azure Data Lake Storage Gen2. The solution must ensure that the data processing is fault-tolerant and can handle schema evolution. The reference data is updated infrequently. You need to choose an approach that minimizes complexity and cost. What should you do?

A.Use Azure Databricks Auto Loader with Delta Live Tables to ingest streaming data, and use Change Data Capture from Cosmos DB to update the reference data inline.

B.Use Azure Data Factory to copy data from Event Hubs to Azure Data Lake Storage Gen2 in batches, then use Azure Databricks to process and enrich with Cosmos DB.

C.Use Azure Stream Analytics to ingest from Event Hubs, join with Cosmos DB reference data, and output to Azure Data Lake Storage Gen2 in Parquet format.

D.Use Azure Databricks Structured Streaming to read from Event Hubs, use a streaming static join to enrich with reference data from Cosmos DB, and write to Delta Lake. Enable schema evolution on the Delta table.

AnswerD

Simplifies processing and handles schema evolution.

Why this answer

Option B is correct because Auto Loader can incrementally process streaming data from Event Hubs (via Event Hubs connector), and it can automatically handle schema evolution. Broadening the reference data join to use a streaming static join in Structured Streaming is efficient for infrequently updated reference data. Option A is wrong because using Change Data Capture adds complexity.

Option C is wrong because using Azure Stream Analytics would require additional service cost and integration. Option D is wrong because using batch processing for streaming data would increase latency.

Practice this question →

71

MCQmedium

You are designing a streaming data solution for IoT devices that generate 10,000 events per second. The data must be processed with sub-second latency and then stored in Azure Data Lake Storage Gen2 for archival. Which Azure service should you use for the stream processing?

A.Azure HDInsight including Spark Structured Streaming

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Event Hubs

AnswerB

Azure Stream Analytics provides sub-second latency stream processing and native output to Azure Data Lake Storage Gen2.

Why this answer

Azure Stream Analytics is designed for real-time stream processing with sub-second latency and can output to Azure Data Lake Storage Gen2. Option A (HDInsight including Spark) is batch-oriented and has higher latency. Option C (Data Factory) is for orchestration, not streaming.

Option D (Event Hubs) is an ingestion service, not a processing service.

Practice this question →

72

MCQmedium

You are designing a data processing solution for an e-commerce company that uses Azure Synapse Analytics. The solution must process clickstream data from a web application. The data arrives in JSON format through Azure Event Hubs. You need to load the data into a dedicated SQL pool every 5 minutes with minimal latency. The data volume is about 100 MB every 5 minutes. You want to use PolyBase for loading. Which approach should you use?

A.Use Azure Stream Analytics to transform the JSON data and output directly to the dedicated SQL pool.

B.Use Azure Data Factory with a Copy activity to copy data from Event Hubs to Azure Data Lake Storage Gen2 as JSON files, then use a PolyBase activity to load from ADLS Gen2 to the dedicated SQL pool.

C.Use Azure Databricks to read from Event Hubs, transform the data, and write to the dedicated SQL pool using JDBC.

D.Use PolyBase directly from Event Hubs to dedicated SQL pool by creating an external data source that points to Event Hubs.

AnswerB

Simplifies staging and leverages PolyBase for efficient loading.

Why this answer

Option A is correct because using Azure Data Factory to copy JSON data to ADLS Gen2 as a staging area, then using PolyBase to load into dedicated SQL pool, is efficient for small frequent loads. Option B is wrong because loading directly from Event Hubs using PolyBase is not supported. Option C is wrong because using Azure Stream Analytics adds unnecessary complexity.

Option D is wrong because using Azure Databricks would require a cluster running continuously, increasing cost.

Practice this question →

73

MCQhard

You are optimizing a pipeline in Azure Data Factory that copies data from Azure Blob Storage to Azure Synapse Analytics. The pipeline uses a copy activity with PolyBase. The data is partitioned by date in Blob Storage. You notice that the load is slow. What is the most likely cause?

A.The source files are stored in Azure Blob Storage instead of Data Lake Storage Gen2

B.The source files are in CSV format instead of Parquet

C.The source files are too many and too small (e.g., thousands of 1 MB files)

D.The sink table has a clustered columnstore index

AnswerC

Many small files cause overhead; PolyBase is optimized for fewer, larger files.

Why this answer

PolyBase in Azure Synapse Analytics performs best when reading large, contiguous files. When the source contains thousands of small files (e.g., 1 MB each), PolyBase must initiate a separate read operation for each file, causing excessive overhead from file open/close operations and metadata requests. This dramatically reduces throughput compared to reading fewer, larger files.

Exam trap

The trap here is that candidates often focus on file format (Parquet vs. CSV) or storage type (Blob vs. ADLS Gen2) as the primary performance factor, when in reality the number and size of files is a more common and impactful bottleneck in PolyBase loads.

How to eliminate wrong answers

Option A is wrong because Azure Blob Storage is fully supported as a PolyBase source; Data Lake Storage Gen2 offers hierarchical namespace benefits but does not inherently improve PolyBase load speed. Option B is wrong because while Parquet is more efficient for analytics due to columnar storage and compression, CSV is still a valid PolyBase source and the primary bottleneck here is file count, not format. Option D is wrong because a clustered columnstore index is actually recommended for Synapse Analytics tables to improve query performance and compression; it does not slow down the PolyBase load itself.

Practice this question →

74

Multi-Selectmedium

Which TWO Azure services can be used to perform real-time data processing on streaming data?

Select 2 answers

A.Azure Synapse Analytics (dedicated SQL pool)

B.Azure Stream Analytics

C.Azure Data Factory

D.Azure Logic Apps

E.Azure Databricks

AnswersA, B

Synapse dedicated SQL pool can ingest and query streaming data.

Why this answer

Azure Synapse Analytics (dedicated SQL pool) can ingest and process streaming data using PolyBase or the COPY INTO command to load data from Azure Stream Analytics or Event Hubs into staging tables, then perform real-time analytics via materialized views or incremental statistics. This enables near-real-time processing on high-velocity data by combining batch and streaming ingestion patterns.

Exam trap

The trap here is that candidates often confuse Azure Data Factory or Logic Apps as real-time streaming services because they can handle event-driven triggers, but they lack the continuous, low-latency query engine required for true stream processing, unlike Stream Analytics and Synapse's dedicated SQL pool with streaming ingestion.

Practice this question →

75

Multi-Selecteasy

Which TWO techniques can you use to handle schema drift in Azure Data Factory mapping data flows?

Select 2 answers

A.Enable 'Allow schema drift' in the source transformation

B.Use derived column transformation to handle each new column manually

C.Use column pattern matching to automatically map columns with similar names

D.Use assertion rules to reject rows with unknown columns

E.Use a fixed schema mapping to ignore unknown columns

AnswersA, C

Allows columns to be added without breaking the pipeline.

Why this answer

Options A and D are correct. 'Allow schema drift' allows flexible column mapping based on incoming schema. Column pattern matching can map dynamic columns to a target. Option B is wrong because using a fixed schema mapping is the opposite of handling drift.

Option C is wrong because deriving columns is for static transformations. Option E is wrong because assertion rules are for validation, not adaptation.

Practice this question →

Page 1 of 4 · 297 questions totalNext →

Ready to test yourself?

Try a timed practice session using only Develop data processing questions.

Start 20-question session