DP-203 Develop data processing — All Questions With Answers

Question 1 (medium, multiple choice)

You are designing a data processing pipeline in Azure Synapse Analytics that ingests streaming data from Azure Event Hubs and stores it in a dedicated SQL pool. The data volume is approximately 500 GB per hour with peak spikes. The pipeline must minimize data loss during transient failures. Which feature should you implement?

Question 2 (hard, multiple choice)

You are designing a batch processing solution using Azure Databricks. The data source is a large Parquet dataset stored in Azure Data Lake Storage Gen2 (ADLS Gen2). The processing requires joining two datasets: one with 10 billion rows and another with 1 million rows. The cluster uses Photon runtime. Which optimization should you apply to minimize shuffle?
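For readers unfamiliar with why a very small join side matters for shuffle, here is a minimal sketch of a broadcast-style hash join in plain Python. It is not Spark or Databricks code: `broadcast_hash_join` is a hypothetical helper, the sample rows are made up, and the small side is assumed to have unique keys.

```python
from typing import Dict, Hashable, Iterable, Iterator, Tuple


def broadcast_hash_join(
    large: Iterable[Tuple[Hashable, str]],
    small: Iterable[Tuple[Hashable, str]],
) -> Iterator[Tuple[Hashable, str, str]]:
    """Join two keyed datasets without repartitioning the large side.

    The small side is loaded into an in-memory lookup (the "broadcast" copy),
    and the large side is streamed past it row by row.
    Assumption: keys on the small side are unique.
    """
    lookup: Dict[Hashable, str] = dict(small)
    for key, value in large:
        if key in lookup:
            yield key, value, lookup[key]


# Made-up sample rows standing in for the large and small datasets.
joined = list(
    broadcast_hash_join([(1, "a"), (2, "b"), (3, "c")], [(1, "x"), (3, "z")])
)
```

Because the large side is only streamed, never regrouped by key, this style of join avoids moving the large dataset between workers.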

Question 3 (easy, multiple choice)

You are running a Spark job in Azure Synapse Analytics that reads from a Delta Lake table and performs multiple transformations. The job fails with an out-of-memory error on the executors. Which action should you take first to resolve the issue?

Question 4 (medium, multiple choice)

You are designing a data pipeline in Azure Data Factory (ADF) that copies data from an on-premises SQL Server database to Azure Synapse Analytics dedicated SQL pool. The pipeline must run daily and handle incremental loads efficiently. Which sink dataset type and copy method should you use?

Question 5 (hard, multiple choice)

You are implementing a streaming solution using Azure Stream Analytics. The input is from an IoT Hub receiving telemetry from thousands of devices. The output is to Azure Synapse Analytics dedicated SQL pool. The requirement is to compute rolling averages over a 5-minute tumbling window and write results every minute. Which windowing function and output configuration should you use?

Question 6 (easy, multi-select)

You are optimizing a Spark DataFrame transformation in Azure Synapse Analytics. The DataFrame has 20 columns and 100 million rows. You notice that the job is slow due to many small files being written to the output. Which two actions can you take to reduce the number of output files? (Choose two.)

Question 7 (easy, multiple choice)

You are designing a data processing solution using Azure Databricks. The data is stored in Delta Lake format. You need to ensure that when you read the latest version of the table, you only see committed data and not uncommitted transactions. Which isolation level should you use?

Question 8 (hard, multiple choice)

You are monitoring an Azure Synapse Pipeline that uses a Mapping Data Flow. The data flow processes 2 GB of data from a CSV source and writes to a Delta sink. The pipeline fails with a 'DataFlowException: Operation aborted' error after running for 45 minutes. The cluster is configured with 8 cores. What is the most likely cause?

Question 9 (medium, multiple choice)

You are building a data pipeline that uses Azure Data Factory to copy data from a REST API to Azure Blob Storage. The REST API returns JSON data in pages of 1000 records each. The total number of records is 50,000. Which activity or feature should you use to loop through the pages?
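The paging arithmetic in this scenario can be sketched in plain Python. This is only an illustration of looping until the API runs out of data: `fetch_page` is a hypothetical stand-in for the REST call, not an Azure Data Factory feature, and only the 50,000-record total and 1000-record page size come from the question.

```python
from typing import Iterator, List

TOTAL_RECORDS = 50_000  # from the question
PAGE_SIZE = 1_000       # from the question


def fetch_page(page: int, page_size: int = PAGE_SIZE) -> List[int]:
    """Hypothetical stand-in for one REST call; returns the record ids on a page."""
    start = page * page_size
    end = min(start + page_size, TOTAL_RECORDS)
    return list(range(start, end))


def iterate_pages() -> Iterator[List[int]]:
    """Request consecutive pages until the API returns an empty page."""
    page = 0
    while True:
        records = fetch_page(page)
        if not records:
            break
        yield records
        page += 1


pages = list(iterate_pages())
page_count = len(pages)
record_count = sum(len(p) for p in pages)
```

With these numbers the loop makes 50 data-bearing requests plus one final request that returns nothing and ends the loop.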

Question 10 (easy, multiple choice)

You are designing a streaming job in Azure Stream Analytics. The job needs to count the number of events per device type every 10 seconds. The input is from Event Hubs. Which query should you use?
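To make the idea of counting in fixed, non-overlapping 10-second buckets concrete, here is a small plain-Python sketch. It is not Stream Analytics query syntax: the function name and sample events are made up, and windows are treated as `[start, start + 10)` for simplicity, so boundary handling and window labeling in the real service may differ.

```python
from collections import Counter
from typing import Iterable, Tuple

WINDOW_SECONDS = 10  # window length from the question


def tumbling_counts(events: Iterable[Tuple[float, str]]) -> Counter:
    """Count events per (window_start, device_type).

    Each event is (timestamp_in_seconds, device_type). Tumbling windows are
    fixed-size and non-overlapping, so every event lands in exactly one window.
    """
    counts: Counter = Counter()
    for ts, device_type in events:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, device_type)] += 1
    return counts


# Made-up sample events: (timestamp in seconds, device type).
sample = [
    (0.5, "sensor"),
    (3.0, "sensor"),
    (9.9, "camera"),
    (10.0, "sensor"),
    (19.0, "camera"),
]
result = tumbling_counts(sample)
```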

Question 11 (medium, multiple choice)

You are working with Azure Synapse Analytics serverless SQL pool. You need to query a set of Parquet files located in ADLS Gen2. The files have nested columns (structs and arrays). Which function should you use to flatten the nested data?

Question 12 (hard, multiple choice)

You are designing a data processing solution using Azure Databricks with Delta Lake. The data is ingested from multiple sources and needs to be deduplicated based on a composite key (source_id, record_id). New data may have duplicates within the same batch. Which write mode and table property should you use to handle this efficiently?
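The dedupe-then-upsert idea behind this scenario can be sketched with plain Python dictionaries standing in for the table. This is not Delta Lake code; the helper names and sample rows are hypothetical, and keeping the last row seen per key is an assumption (a real pipeline would usually pick the winner by a timestamp or sequence column).

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Key = Tuple[str, str]  # (source_id, record_id)


def dedupe_batch(batch: List[Row]) -> Dict[Key, Row]:
    """Collapse duplicates inside one batch, keeping the last row per key (assumption)."""
    latest: Dict[Key, Row] = {}
    for row in batch:
        latest[(row["source_id"], row["record_id"])] = row
    return latest


def merge_into(target: Dict[Key, Row], batch: List[Row]) -> None:
    """Upsert the deduplicated batch: update matching keys, insert new ones."""
    target.update(dedupe_batch(batch))


# Made-up sample data: one existing row, one update, one in-batch duplicate.
target: Dict[Key, Row] = {
    ("s1", "r1"): {"source_id": "s1", "record_id": "r1", "value": "old"}
}
batch = [
    {"source_id": "s1", "record_id": "r1", "value": "new"},
    {"source_id": "s2", "record_id": "r9", "value": "a"},
    {"source_id": "s2", "record_id": "r9", "value": "b"},
]
merge_into(target, batch)
```

Deduplicating the batch first matters because an upsert that matches one target row against several source rows is ambiguous.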

Question 13 (medium, multiple choice)

You are configuring a data pipeline in Azure Data Factory that uses a Mapping Data Flow. The source is a SQL Server table with 50 million rows. The sink is a Delta table in ADLS Gen2. The pipeline runs slowly. You need to improve performance by reducing the number of partitions in the data flow. Which setting should you adjust?

Question 14 (easy, multiple choice)

You are using Azure Synapse Analytics to process streaming data from Azure Event Hubs. The data must be written to a Delta Lake table in ADLS Gen2 with exactly-once semantics. Which processing engine should you use?

Question 15 (hard, multiple choice)

You are designing a batch processing pipeline in Azure Databricks. The data is stored in Delta Lake and you need to perform a time-series join between two tables: 'events' (100 billion rows) and 'sessions' (10 billion rows). The join condition is on 'device_id' and a timestamp range (event_time BETWEEN session_start AND session_end). Which join strategy would be most efficient?

Question 16 (medium, multiple choice)

You are monitoring an Azure Data Factory pipeline that runs every hour. The pipeline uses a Copy activity to copy data from Azure SQL Database to Azure Blob Storage. Recently, the pipeline has been failing with a 'Timeout' error. The source SQL database has a large number of records. What should you do to resolve the timeout?

Question 17 (easy, multiple choice)

You are designing a data pipeline that uses Azure Data Factory to load data from an FTP server to Azure Data Lake Storage. The FTP server requires authentication with username and password. Which type of linked service should you create?

Question 18 (medium, multiple choice)

You are using Azure Synapse Analytics dedicated SQL pool to run a query that joins a large fact table (10 billion rows) and a small dimension table (1 million rows). The query is slow. Which distribution strategy should you use for the dimension table to improve performance?

Question 19 (medium, multiple choice)

You are designing a data processing solution in Azure Synapse Analytics. The solution must support incremental loading of data from an Azure SQL Database to a dedicated SQL pool using PolyBase. Which approach should you use to minimize data movement and maximize performance?

Question 20 (hard, multiple choice)

You are troubleshooting a pipeline in Azure Data Factory that copies data from Azure Blob Storage to an Azure Synapse Analytics dedicated SQL pool. The pipeline fails with the error: 'PolyBase requires a varchar(max) column to be less than 1 MB.' Which action should you take to resolve this issue?

Question 21 (easy, multiple choice)

You are developing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query Parquet files stored in Azure Data Lake Storage Gen2. Which authentication method should you use to ensure that the queries use the identity of the caller and adhere to Azure role-based access control (RBAC) permissions?

Question 22 (hard, multiple choice)

Refer to the exhibit. You have an Azure Data Factory pipeline that performs an incremental load from an Azure SQL Database source to a target Azure SQL Database. The pipeline uses a watermark column approach. After running the pipeline, you notice that the target table is empty. What is the most likely cause of this issue?

Exhibit

{
  "name": "IncrementalLoad",
  "properties": {
    "activities": [
      {
        "name": "WatermarkQuery",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT MAX(LastModified) AS NewWatermark FROM source_table"
          },
          "dataset": {
            "referenceName": "AzureSqlTable",
            "type": "DatasetReference"
          }
        }
      },
      {
        "name": "CopyData",
        "type": "Copy",
        "dependsOn": [
          {
            "activity": "WatermarkQuery",
            "dependencyConditions": ["Succeeded"]
          }
        ],
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT * FROM source_table WHERE LastModified > '@{activity('WatermarkQuery').output.firstRow.NewWatermark}'"
          },
          "sink": {
            "type": "AzureSqlSink",
            "preCopyScript": "TRUNCATE TABLE target_table"
          }
        },
        "inputs": [
          {
            "referenceName": "AzureSqlTable",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "AzureSqlTable",
            "type": "DatasetReference"
          }
        ]
      }
    ]
  }
}
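The watermark filter in the exhibit can be tried out with an in-memory SQLite table. This is a sketch of the query logic only, not of Azure Data Factory: the table contents and the stored `old_watermark` value are made up for illustration, and the range filter shown second is a commonly used alternative rather than something the exhibit contains.

```python
import sqlite3

# Made-up source rows standing in for source_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER, LastModified TEXT)")
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?)",
    [(1, "2023-01-01"), (2, "2023-01-02"), (3, "2023-01-03")],
)

# Same lookup as the WatermarkQuery activity: the current maximum.
new_watermark = conn.execute(
    "SELECT MAX(LastModified) FROM source_table"
).fetchone()[0]

# Filter as written in the exhibit: strictly greater than the current maximum.
rows_exhibit = conn.execute(
    "SELECT id FROM source_table WHERE LastModified > ?", (new_watermark,)
).fetchall()

# Alternative (assumption): keep the previous run's watermark and copy the
# rows that fall between the old and new values.
old_watermark = "2023-01-01"  # hypothetical value saved by the previous run
rows_range = conn.execute(
    "SELECT id FROM source_table "
    "WHERE LastModified > ? AND LastModified <= ? ORDER BY id",
    (old_watermark, new_watermark),
).fetchall()
```

No row can have a `LastModified` value greater than the maximum of that same column, so the first filter selects nothing, while the range filter picks up the two rows changed since the previous run.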

Question 23 (medium, multiple choice)

You are designing a data processing solution in Azure Synapse Analytics. The solution must process streaming data from Azure Event Hubs and store the results in a dedicated SQL pool. The solution must support exactly-once semantics and handle late-arriving data. Which Azure service should you use to implement this solution?

Question 24 (medium, multiple choice)

You are developing a data processing pipeline in Azure Synapse Analytics. The pipeline uses a mapping data flow to transform data from Azure Data Lake Storage Gen2 to a dedicated SQL pool. The data flow includes a Derived Column transformation that uses the expression: `iif(isNull(Column1), 'Default', Column1)`. However, the transformation is not handling NULL values correctly. What is the most likely cause?

Question 25 (hard, multiple choice)

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query data in Azure Data Lake Storage Gen2. The data is stored in Delta Lake format. Which of the following statements is true regarding querying Delta Lake tables with serverless SQL pool?

Question 26 (easy, multiple choice)

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a dedicated SQL pool to store fact and dimension tables. The fact table is expected to have billions of rows. Which distribution strategy should you recommend for the fact table to optimize query performance and minimize data movement?

Question 27 (medium, multiple choice)

You are developing a data processing pipeline in Azure Data Factory. The pipeline must copy data from an on-premises SQL Server to Azure Blob Storage. The data volumes are large, and the network connection is unreliable. Which configuration should you use to ensure resilience and resume capability?

Question 28 (medium, multi-select)

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a serverless SQL pool to query data in Azure Data Lake Storage Gen2. The data is stored as Parquet files partitioned by date. Which TWO of the following statements are true regarding querying this data? (Select TWO.)

Question 29 (hard, multi-select)

You are developing a data processing pipeline in Azure Data Factory. The pipeline uses a mapping data flow to transform data. You need to ensure that the data flow can handle schema drift from the source. Which THREE of the following actions should you take? (Select THREE.)

Question 30 (easy, multi-select)

You are designing a data processing solution in Azure Synapse Analytics. The solution must use a dedicated SQL pool to support both batch and near-real-time data ingestion. Which TWO of the following methods can you use to ingest data into a dedicated SQL pool? (Select TWO.)

Question 31 (medium, multiple choice)

Refer to the exhibit. You are deploying an Azure Synapse Analytics workspace using an ARM template. The template defines a managed virtual network integration runtime. You need to ensure that the integration runtime can run mapping data flows with a time-to-live (TTL) of 10 minutes. What is the purpose of the 'timeToLive' property in this configuration?

Exhibit

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.Synapse/workspaces/integrationRuntimes",
            "apiVersion": "2021-06-01-preview",
            "name": "[concat(parameters('workspaceName'), '/MyManagedVNetIR')]",
            "properties": {
                "type": "Managed",
                "typeProperties": {
                    "computeProperties": {
                        "location": "AutoResolve",
                        "dataFlowProperties": {
                            "computeType": "General",
                            "coreCount": 8,
                            "timeToLive": 10
                        }
                    }
                }
            }
        }
    ]
}

Question 32 (hard, multiple choice)

Refer to the exhibit. You have an Azure Data Factory dataset definition for a Parquet file stored in Azure Data Lake Storage Gen2. You attempt to use this dataset as a source in a copy activity, but the copy activity fails with an error indicating that the file is not found. The file 'sales_orders.parquet' exists at the specified path. What is the most likely cause of the error?

Exhibit

{
  "name": "sales_orders",
  "properties": {
    "folder": "orders",
    "type": "AzureBlobFSLocation",
    "linkedServiceName": {
      "referenceName": "ADLSGen2",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "fileName": "sales_orders.parquet",
      "folderPath": "data/orders/year=2023/month=01/day=15/"
    },
    "compressionCodec": "snappy",
    "columnDelimiter": ","
  }
}

Question 33 (easy, multiple choice)

Refer to the exhibit. You have created an external table in Azure Synapse Analytics serverless SQL pool to query Parquet files stored in Azure Data Lake Storage Gen2. When you query the external table, you get an error that the external table is not accessible. What should you check first?

Exhibit

CREATE EXTERNAL DATA SOURCE myDataSource
WITH (
    LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer'
);

CREATE EXTERNAL FILE FORMAT myFileFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE dbo.Sales (
    SaleID int,
    ProductName varchar(100),
    SaleDate date,
    Amount decimal(10,2)
)
WITH (
    LOCATION = 'sales/',
    DATA_SOURCE = myDataSource,
    FILE_FORMAT = myFileFormat
);

Question 34 (medium, multiple choice)

Your organization uses Azure Synapse Analytics to run large-scale ETL jobs. A pipeline that loads data from Azure Data Lake Storage Gen2 into a dedicated SQL pool is failing with 'Out of memory' errors during the staging step. The source data is 500 GB in size, and the SQL pool is currently set to DW500c. What should you do to resolve the issue without changing the pipeline logic?

Question 35 (hard, multiple choice)

You are designing a near-real-time data processing solution for a retail company. The source is an on-premises Kafka cluster. The target is an Azure Synapse Analytics dedicated SQL pool. The solution must handle up to 10,000 events per second with less than 5 minutes of latency. Which Azure service should you use to ingest the data?

Question 36 (easy, multiple choice)

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline uses a self-hosted integration runtime and runs successfully during business hours. However, after a recent network security update, the pipeline fails with a connection error to the on-premises SQL Server. What is the most likely cause?

Question 37 (medium, multiple choice)

You are building a data processing solution in Azure Synapse Analytics. The solution requires creating a table that stores sales transactions. The table will be used for both point-of-sale lookups and large aggregation queries. The data is not updated frequently. Which table distribution should you recommend?

Question 38 (hard, multiple choice)

Your company runs a streaming job in Azure Stream Analytics that ingests data from Event Hubs and outputs to Azure Synapse Analytics. The job is failing with a 'Watermark delay' alert and the output to Synapse is delayed by over 30 minutes. The input rate is 5,000 events per second. The job uses a 1-minute tumbling window. What is the most likely cause of the delay?

Question 39 (easy, multiple choice)

You are designing a batch processing solution in Azure Data Factory. The source is an Azure Blob Storage container with CSV files. The target is an Azure SQL Database. The pipeline must run daily and incrementally load only new or changed rows. Which Data Factory feature should you use?

Question 40 (medium, multiple choice)

You have a dedicated SQL pool in Azure Synapse that stores a fact table with over 100 billion rows. Query performance is degrading over time. You notice that the table is hash-distributed on a column with many duplicate values. What is the most likely impact?
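The effect of the distribution column's cardinality on row placement can be illustrated with a short Python simulation. It relies on the fact that a dedicated SQL pool spreads each table across 60 distributions; everything else is an assumption made for illustration, including the MD5-based `distribution_of` function (not the engine's actual hash) and the made-up key values.

```python
import hashlib
from collections import Counter
from typing import Iterable

DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions


def distribution_of(value: str) -> int:
    """Map a key to a distribution with a stable hash (illustrative only)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % DISTRIBUTIONS


def rows_per_distribution(keys: Iterable[str]) -> Counter:
    """Count how many rows land on each distribution."""
    return Counter(distribution_of(k) for k in keys)


# Made-up keys: three distinct values versus one distinct value per row.
low_cardinality = [f"region-{i % 3}" for i in range(60_000)]
high_cardinality = [f"order-{i}" for i in range(60_000)]

skewed = rows_per_distribution(low_cardinality)
even = rows_per_distribution(high_cardinality)
```

In this simulation the three-value key places all 60,000 rows on at most three distributions, while the unique key spreads them at roughly 1,000 rows per distribution. Rows with the same key always hash to the same distribution, which is why heavy duplication concentrates data and work on a few of them.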

Question 41 (hard, multiple choice)

You are designing a data processing solution for a financial services company. The solution must process sensitive customer data and comply with GDPR. The data will be stored in Azure Synapse Analytics. You need to ensure that only authorized users can view specific columns (e.g., credit card numbers). Which security feature should you implement?

Question 42 (easy, multiple choice)

You are processing streaming data from IoT devices using Azure Stream Analytics. The data includes temperature readings and device IDs. You need to calculate the average temperature per device over a 5-minute window, sliding every 1 minute. Which window function should you use?
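A window that is 5 minutes long but advances every 1 minute overlaps its neighbours, so each reading contributes to several windows. The plain-Python sketch below shows that behaviour. It is not Stream Analytics syntax: the function names and sample readings are made up, and windows are treated as `[start, start + 300)` seconds for simplicity, which may not match the service's boundary rules.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW = 300  # 5 minutes, in seconds (from the question)
HOP = 60      # 1 minute, in seconds (from the question)


def window_starts(ts: int) -> List[int]:
    """Return the start of every [start, start + WINDOW) window that contains ts."""
    last_start = (ts // HOP) * HOP
    first_start = last_start - WINDOW + HOP
    return list(range(first_start, last_start + 1, HOP))


def hopping_averages(
    readings: Iterable[Tuple[int, str, float]]
) -> Dict[Tuple[int, str], float]:
    """Average temperature per (window_start, device_id) over overlapping windows."""
    sums: Dict[Tuple[int, str], List[float]] = defaultdict(lambda: [0.0, 0])
    for ts, device_id, temperature in readings:
        for start in window_starts(ts):
            acc = sums[(start, device_id)]
            acc[0] += temperature
            acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up readings: (timestamp in seconds, device id, temperature).
averages = hopping_averages([(300, "d1", 20.0), (360, "d1", 30.0)])
```

With a 5-minute size and a 1-minute hop, every reading falls into five windows, which is the main difference from a tumbling window where it would fall into exactly one.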

Question 43 (medium, multi-select)

Which TWO Azure services can be used to perform real-time data processing on streaming data?

Question 44 (hard, multi-select)

Which THREE factors should you consider when choosing between Azure Data Factory Mapping Data Flows and Azure Synapse Spark pools for data transformation?

Question 45 (easy, multi-select)

Which TWO configuration options are required to enable change data capture (CDC) in Azure Data Factory for incremental loading from Azure SQL Database?

Question 46 (hard, multiple choice)

You are reviewing a Mapping Data Flow in Azure Data Factory that copies data from a CSV file to an Azure Synapse table. The data flow uses 'allowSchemaDrift: true' and 'validateSchema: false'. After running the pipeline, you notice that the target table does not have the expected columns. The CSV file sometimes has extra columns. What is the most likely reason?

Exhibit

Refer to the exhibit.

{
  "type": "MappingDataFlow",
  "typeProperties": {
    "sources": [
      {
        "dataset": {
          "referenceName": "DelimitedTextSource",
          "type": "DatasetReference"
        },
        "script": "source(\n  output() as (\n    col1 string,\n    col2 string\n  ),\n  allowSchemaDrift: true,\n  validateSchema: false\n) ~> Source1"
      }
    ],
    "sinks": [
      {
        "dataset": {
          "referenceName": "AzureSynapseTableSink",
          "type": "DatasetReference"
        },
        "script": "Source1 sink(\n  input() as (\n    col1 string,\n    col2 string\n  ),\n  allowSchemaDrift: true,\n  validateSchema: false\n) ~> Sink1"
      }
    ]
  }
}

Question 47 (medium, multiple choice)

You are reviewing a Spark job definition in Azure Synapse Analytics. The job aggregates sales data. The job runs successfully but takes longer than expected. You notice that dynamic allocation is disabled and the executor instances are fixed at 10. The cluster has a maximum of 20 nodes. What is the most likely reason for the slow performance?

Exhibit

Refer to the exhibit.

{
  "name": "AggregateProductSales",
  "properties": {
    "folder": {
      "name": "Sales"
    },
    "content": {
      "jobType": "SparkJob",
      "jobDefinition": {
        "file": "abfss://container@storage.dfs.core.windows.net/synapse/workspaces/workspace/sparkjobdefinitions/aggregate_sales.py",
        "conf": {
          "spark.dynamicAllocation.enabled": "false",
          "spark.executor.instances": 10,
          "spark.executor.cores": 4,
          "spark.executor.memory": "8g"
        }
      }
    }
  }
}

Question 48 (medium, multiple choice)

You are deploying an Azure Synapse workspace using an ARM template. The template includes a Managed integration runtime with 'AutoResolve' location and a TTL of 10 minutes for data flows. After deployment, you notice that the first data flow execution takes a long time to start. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.Synapse/workspaces/integrationRuntimes",
      "apiVersion": "2021-06-01-preview",
      "name": "[concat(parameters('workspaceName'), '/AutoResolveIntegrationRuntime')]",
      "properties": {
        "type": "Managed",
        "typeProperties": {
          "computeProperties": {
            "location": "AutoResolve",
            "dataFlowProperties": {
              "computeType": "General",
              "coreCount": 8,
              "timeToLive": 10
            },
            "pipelineExternalComputeScaleProperties": {
              "numberOfExternalNodes": 1,
              "numberOfPipelineNodes": 1
            }
          }
        }
      }
    }
  ]
}

Question 49 (medium, multiple choice)

You are designing a data pipeline in Azure Synapse Analytics to ingest data from Azure Blob Storage into a dedicated SQL pool. The source files are CSV with varying row lengths, and you need to ensure optimal performance for reads. Which file format and compression should you recommend?

Question 50 (easy, multiple choice)

You are using Azure Synapse Pipelines to perform an incremental load from Azure SQL Database to Azure Synapse Analytics. You need to identify rows that have changed since the last load. Which approach should you use?

Question 51 (medium, multiple choice)

You have an Azure Stream Analytics job that reads from an Event Hub and writes to Azure SQL Database. The job processes high-velocity IoT sensor data. You notice that the output to SQL Database is slower than expected and the job's watermark delay is increasing. What should you do to improve throughput?

Question 52 (easy, multiple choice)

Your team is building a real-time dashboard in Power BI that displays sales data from Azure Stream Analytics. The data must be updated every 5 seconds with low latency. Which output type should you configure in Stream Analytics to achieve this?

Question 53 (medium, multiple choice)

You are designing a data processing solution in Azure Synapse Analytics that uses serverless SQL pools to query Parquet files in Azure Data Lake Storage Gen2. The files are partitioned by year and month. You need to optimize query performance and reduce data scanned. What should you do?

Question 54 (hard, multiple choice)

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline uses a self-hosted integration runtime. You notice that the copy activity fails intermittently with the error: 'Failure happened on 'Source' side. ErrorCode=SqlOperationFailed'. The on-premises SQL Server is under heavy load during business hours. What is the most likely cause?

Question 55 (easy, multiple choice)

You need to transform data in Azure Synapse Analytics using a language that supports procedural logic and error handling. Which option should you use?

Question 56 (medium, multi-select)

Which TWO actions should you take to optimize performance of a dedicated SQL pool in Azure Synapse Analytics when loading large volumes of data?

Question 57 (hard, multi-select)

Which THREE components are required to implement a change data capture (CDC) solution in Azure Data Factory for incremental loading from Azure SQL Database to Azure Synapse Analytics?

Question 58 (easy, multi-select)

Which TWO data processing activities in Azure Data Factory allow you to run custom code?

Question 59 (easy, multiple choice)

You are designing a data processing solution in Azure Databricks to transform streaming data from Azure Event Hubs. The data must be aggregated in 1-minute tumbling windows and written to Azure Synapse Analytics. Which Spark API should you use?

Question 60 (medium, multiple choice)

You are designing a data processing pipeline in Azure Synapse Analytics. The pipeline must ingest streaming data from Azure Event Hubs, perform real-time aggregations, and store the results in a dedicated SQL pool. Which component should you use to perform the real-time transformations?

Question 61 (easy, multiple choice)

You are processing CSV files in Azure Synapse Analytics serverless SQL pool. The files contain newline characters within quoted fields, causing parsing errors. Which file format option should you specify to handle this correctly?

Question 62 (hard, multiple choice)

You have a pipeline in Azure Data Factory that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails with a 'Connection closed' error after 1 hour. The data volume is 500 GB over a slow network. What is the most likely cause and solution?

Question 63 (medium, multiple choice)

You are designing a data processing solution that uses Azure Databricks to transform large datasets. You need to ensure that the processing is cost-effective and can scale to handle variable workloads. Which cluster configuration should you recommend?

Question 64 (easy, multiple choice)

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query returns fewer rows than expected. What should you check first?

Question 65 (hard, multiple choice)

You are monitoring an Azure Stream Analytics job that processes data from an IoT hub. The job's output to Azure Synapse Analytics is experiencing high latency. The job's SU% utilization is at 90%. Which action will most likely reduce the latency?

Question 66 (medium, multiple choice)

You have an Azure Data Factory pipeline that executes a stored procedure in Azure SQL Database. The pipeline fails with an error indicating that the stored procedure ran out of memory. What change should you make to the pipeline to resolve this?

Question 67 (easy, multiple choice)

You need to perform incremental data loading from Azure SQL Database to Azure Data Lake Storage Gen2 using Azure Data Factory. Which approach is the most efficient?

Question 68 (hard, multiple choice)

You are designing a data processing pipeline in Azure Synapse Analytics that uses a mapping data flow with Azure Integration Runtime (IR). The pipeline runs slowly and you notice that the IR's data movement is limited by the number of cores. Which configuration should you adjust to improve performance?

Question 69 (medium, multi-select)

Which TWO are valid ways to process data in Azure Synapse Analytics?

Question 70 (hard, multi-select)

Which THREE are required to run a stored procedure in Azure SQL Database from Azure Data Factory?

Question 71 (easy, multi-select)

Which TWO are benefits of using Azure Databricks Auto Loader for incremental data ingestion?

Question 72 (medium, multiple choice)

You are designing a data pipeline to ingest streaming data from IoT devices into Azure Synapse Analytics. The data must be available for querying with minimal latency, but you also need to handle spikes in throughput without data loss. Which service should you use as the ingestion layer?

Question 73 (hard, multiple choice)

You are troubleshooting a slow-running pipeline in Azure Data Factory. The pipeline copies data from an on-premises SQL Server to Azure Synapse Analytics using a self-hosted integration runtime. The copy activity is using the 'Auto' copy method. You notice that network bandwidth is limited. Which configuration change would most likely improve performance?

Question 74 (easy, multiple choice)

You need to transform data in Azure Databricks using Apache Spark. The data is stored in Delta Lake format in Azure Data Lake Storage Gen2. Which method should you use to read the data into a Spark DataFrame?

Question 75mediummultiple choice

Read the full Develop data processing explanation →

You are designing a data processing solution for a global company. Data must be processed in near real-time and aggregated by region. You need to minimize latency for downstream consumers. Which Azure service should you use for stream processing?

Question 76hardmultiple choice

Read the full Develop data processing explanation →

You are optimizing a pipeline in Azure Data Factory that copies data from Azure Blob Storage to Azure Synapse Analytics. The pipeline uses a copy activity with PolyBase. The data is partitioned by date in Blob Storage. You notice that the load is slow. What is the most likely cause?

Question 77easymultiple choice

Read the full Develop data processing explanation →

You need to perform incremental data loading from Azure SQL Database to Azure Data Lake Storage Gen2. You want to minimize cost and complexity. Which Azure Data Factory feature should you use?

Question 78mediummultiple choice

Read the full Develop data processing explanation →

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query performance is slow. Which action would most likely improve performance?

Question 79hardmultiple choice

Read the full Develop data processing explanation →

You are designing a data processing solution using Azure Databricks with Delta Lake. You need to ensure ACID transactions and schema enforcement. Which feature should you enable?

Question 80mediummultiple choice

Read the full Develop data processing explanation →

You are running a pipeline in Azure Data Factory that uses a Mapping Data Flow. The data flow reads from Azure SQL Database and writes to Azure Synapse Analytics. You find that the data flow is very slow. Which configuration change would most likely improve performance?

Question 81hardmulti select

Read the full Develop data processing explanation →

You are designing a data processing solution in Azure Synapse Analytics. You need to load data from multiple sources into a dedicated SQL pool. Which THREE of the following are best practices for loading data?

Question 82mediummulti select

Read the full Develop data processing explanation →

You are using Azure Stream Analytics to process real-time data from an Event Hub. Which TWO of the following are valid output sinks?

Question 83mediummulti select

Read the full Develop data processing explanation →

You are designing an ETL process in Azure Data Factory. You need to transform data using Mapping Data Flows. Which THREE of the following transformations are available in Mapping Data Flows?

Question 84mediummultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You are monitoring the CopyDataPipeline in Azure Data Factory. The copy activity is failing with timeout errors. What is the most likely cause?

Exhibit

{
    "name": "CopyDataPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSQL",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "BlobDataset",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "SQLDataset",
                        "type": "DatasetReference"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "BlobSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "00:00:30",
                        "sqlWriterUseTableLock": true
                    },
                    "enableStaging": false
                }
            }
        ]
    }
}
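
To make the sink settings in the exhibit easier to reason about, the following minimal Python sketch parses them and converts the timespan to seconds. It only restates what the exhibit configures; the helper name parse_timespan is hypothetical, and the sketch does not call Azure Data Factory in any way.

```python
import json
from datetime import timedelta

# Sink settings copied from the exhibit.
sink_json = """
{
    "type": "SqlSink",
    "writeBatchSize": 10000,
    "writeBatchTimeout": "00:00:30",
    "sqlWriterUseTableLock": true
}
"""


def parse_timespan(value: str) -> timedelta:
    """Parse an hh:mm:ss timespan string (hypothetical helper)."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


sink = json.loads(sink_json)
timeout = parse_timespan(sink["writeBatchTimeout"])
print(
    f"Each batch of {sink['writeBatchSize']} rows has "
    f"{timeout.total_seconds():.0f} seconds to complete"
)
```

Reading the two values side by side (rows per batch and the time allowed per batch) is usually the first step when a copy activity reports timeouts.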

Question 85hardmultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You are running the KQL query in Azure Data Explorer. The query returns no results, but you know there is data in the table T. What is the most likely issue?

Exhibit

// KQL query in Azure Data Explorer
let startTime = datetime(2024-01-01);
let endTime = datetime(2024-01-02);
T
| where Timestamp between (startTime .. endTime)
| summarize count() by bin(Timestamp, 1h)
| render timechart
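
As a rough illustration of what the filter and the hourly binning in the exhibit compute, here is a Python sketch over made-up timestamps. The sample rows are hypothetical stand-ins for table T; the sketch assumes that `between` is inclusive on both ends and that `bin(Timestamp, 1h)` rounds down to the start of the hour. It does not diagnose why the real query returns no rows.

```python
from collections import Counter
from datetime import datetime

start_time = datetime(2024, 1, 1)
end_time = datetime(2024, 1, 2)

# Hypothetical sample rows standing in for table T.
timestamps = [
    datetime(2023, 12, 31, 23, 59),  # before the range
    datetime(2024, 1, 1, 0, 0),      # equals startTime, so included
    datetime(2024, 1, 1, 9, 15),
    datetime(2024, 1, 1, 9, 45),
    datetime(2024, 1, 2, 0, 0),      # equals endTime, so included
    datetime(2024, 1, 2, 0, 1),      # after the range
]


def hour_bin(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


# where Timestamp between (startTime .. endTime)
in_range = [ts for ts in timestamps if start_time <= ts <= end_time]

# summarize count() by bin(Timestamp, 1h)
counts = Counter(hour_bin(ts) for ts in in_range)

for bucket, n in sorted(counts.items()):
    print(bucket.isoformat(), n)
```

If every stored timestamp falls outside the one-day range, the filtered list is empty and nothing is left to summarize, which is one situation worth ruling out first.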

Question 86easymultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You are deploying the ARM template to create an Azure Synapse Analytics workspace. The deployment fails because the managed resource group cannot be created. What is the most likely reason?

Exhibit

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.Synapse/workspaces",
            "apiVersion": "2021-06-01",
            "name": "myworkspace",
            "location": "[resourceGroup().location]",
            "properties": {
                "defaultDataLakeStorage": {
                    "accountUrl": "https://mystorage.dfs.core.windows.net",
                    "filesystem": "myfilesystem"
                },
                "managedResourceGroupName": "synapse-managed-rg"
            }
        }
    ]
}

Question 87mediummultiple choice

Read the full Develop data processing explanation →

You are designing a data pipeline in Azure Synapse Analytics that ingests streaming taxi trip data from Azure Event Hubs. The data must be processed in near real-time and stored in a dedicated SQL pool. The pipeline should handle late-arriving data (up to 30 minutes late) without reprocessing the entire stream. Which Azure service should you use to process the streaming data?

Question 88hardmultiple choice

Read the full Develop data processing explanation →

You are troubleshooting a Synapse Spark notebook that fails when reading Parquet files from Azure Data Lake Storage Gen2. The error message indicates 'Permission denied'. The notebook uses a managed identity (System-assigned) for authentication. The Data Lake Storage account has a firewall enabled with 'Allow Azure services on the trusted services list' turned on. The storage account's RBAC role assignments include 'Storage Blob Data Contributor' for the managed identity. What is the most likely cause of the failure?

Question 89easymultiple choice

Read the full Develop data processing explanation →

You need to execute a T-SQL stored procedure in Azure Synapse dedicated SQL pool that performs a large data load. The stored procedure takes approximately 45 minutes to run. You want to monitor the progress and see the statement text currently being executed. Which dynamic management view (DMV) should you query?

Question 90mediummultiple choice

Read the full Develop data processing explanation →

You are designing a data transformation solution for a retail company. The company receives daily CSV files from 200 stores via SFTP. The files must be cleaned, validated, and aggregated before loading into Azure Synapse dedicated SQL pool. The solution must minimize administrative overhead and support easy monitoring. Which approach do you recommend?

Question 91hardmultiple choice

Read the full Develop data processing explanation →

You are optimizing an Azure Synapse serverless SQL pool query that reads Parquet files in Azure Data Lake Storage. The query takes longer than expected, and you notice that it reads more data than necessary. What is the most effective way to reduce the amount of data scanned?

Question 92easymultiple choice

Read the full Develop data processing explanation →

You have a pipeline in Azure Data Factory that copies data from on-premises SQL Server to Azure Blob Storage. The pipeline fails with a 'Connection timed out' error. You have already verified that the Integration Runtime is running and the SQL Server firewall allows connections from the Integration Runtime. What should you check next?

Question 93mediummultiple choice

You are designing a data processing solution for a financial services company. The solution must process sensitive customer data from multiple sources. You need to ensure that the data is encrypted at rest and in transit, and that access to the data is audited. Which combination of Azure services should you use?

Question 94hardmultiple choice

Read the full Develop data processing explanation →

You are running a data transformation pipeline in Azure Synapse Spark that writes output to Delta tables. You notice that the job eventually slows down and then fails with an out-of-memory error. The input data size is 1 TB, and the cluster has 10 nodes with 16 GB memory each. What is the most likely cause?

Question 95easymultiple choice

Read the full Develop data processing explanation →

You are using Azure Data Factory to copy data from an Azure SQL Database to Azure Synapse dedicated SQL pool. The copy activity uses PolyBase as the copy method. The activity fails with the error 'Operation not supported: PolyBase cannot write to a table with clustered columnstore index'. What should you do to resolve this error?

Question 96mediummulti select

Read the full Develop data processing explanation →

Which TWO of the following are built-in components of Azure Data Factory that can be used to perform data transformation without writing custom code?

Question 97hardmulti select

Read the full Develop data processing explanation →

Which THREE of the following are valid ways to authenticate to Azure Data Lake Storage Gen2 from Azure Synapse serverless SQL pool?

Question 98easymulti select

Read the full Develop data processing explanation →

Which TWO of the following are required components to set up a data pipeline that uses Change Data Capture (CDC) to incrementally load data from SQL Server to Azure Synapse using Azure Data Factory?

Question 99hardmultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You have an Azure Data Factory pipeline that copies data from a CSV file in Blob Storage to a Synapse dedicated SQL pool table named dbo.Sales. The pipeline fails. The error message indicates that the 'Amount' column in the sink table does not allow NULLs but the source contains NULL values. What is the best way to resolve this issue without losing data?

Exhibit

{
  "name": "CopyDataFromBlobToSynapse",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "inputs": [{"referenceName": "BlobSource", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "SynapseSink", "type": "DatasetReference"}],
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
              "type": "AzureBlobStorageReadSettings",
              "recursive": true
            }
          },
          "sink": {
            "type": "SqlPoolSink",
            "writeBatchSize": 10000,
            "preCopyScript": "TRUNCATE TABLE dbo.Sales"
          },
          "translator": {
            "type": "TabularTranslator",
            "mappings": [
              {"source": "ProductID", "sink": "ProductID"},
              {"source": "SaleDate", "sink": "SaleDate"},
              {"source": "Quantity", "sink": "Quantity"},
              {"source": "Amount", "sink": "Amount"}
            ]
          }
        }
      }
    ]
  }
}

Question 100mediummultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You have created an external table in Azure Synapse serverless SQL pool as shown. You run a query: SELECT ProductID, SUM(Amount) FROM dbo.ExternalSales WHERE SaleDate > '2024-01-01' GROUP BY ProductID. The query is slow and scans all files in the /sales/ folder, which contains data from 2023 and 2024. The files are partitioned by year and month in the folder structure, e.g., /sales/year=2023/month=01/. What should you do to improve query performance?

Exhibit

CREATE EXTERNAL DATA SOURCE MyDataSource
WITH (
    LOCATION = 'abfss://container@storageaccount.dfs.core.windows.net',
    CREDENTIAL = MyCredential
);

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE dbo.ExternalSales
(
    ProductID INT,
    SaleDate DATE,
    Quantity INT,
    Amount DECIMAL(10,2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = ParquetFormat
);
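
The question describes a year=/month= folder layout. The idea of skipping folders that cannot contain matching rows (partition elimination) can be sketched in Python as below. The folder listing and the function name are hypothetical, and this is not how serverless SQL pool is configured; it only shows which folders a date filter would need to touch.

```python
import re

# Hypothetical folder listing that follows the layout in the question.
folders = [
    "/sales/year=2023/month=11/",
    "/sales/year=2023/month=12/",
    "/sales/year=2024/month=01/",
    "/sales/year=2024/month=02/",
]

PARTITION_PATTERN = re.compile(r"year=(\d{4})/month=(\d{2})/")


def folders_from(all_folders, min_year, min_month):
    """Keep only folders whose (year, month) is at or after the given month."""
    selected = []
    for folder in all_folders:
        match = PARTITION_PATTERN.search(folder)
        if match is None:
            continue  # not a partition folder
        year, month = int(match.group(1)), int(match.group(2))
        if (year, month) >= (min_year, min_month):
            selected.append(folder)
    return selected


# SaleDate > '2024-01-01' can only match rows in 2024-01 or later.
print(folders_from(folders, 2024, 1))
```

In this sample, half of the folders are never opened, which is the effect a partition-aware query is after.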

Question 101hardmultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You have an Azure Synapse pipeline that runs a Spark notebook daily. The notebook uses the inputDate parameter to filter data. The notebook successfully processes data for '2024-01-01' but fails for '2024-01-02' with an error that the 'sales' table does not exist. The 'sales' table is created daily by a preceding job. What is the most likely cause?

Exhibit

{
  "name": "SalesAggregation",
  "properties": {
    "activities": [
      {
        "name": "Notebook1",
        "type": "SynapseNotebook",
        "dependsOn": [],
        "typeProperties": {
          "notebook": "SalesAggregationNotebook",
          "parameters": {}
        },
        "linkedServiceName": {
          "referenceName": "mySparkPool",
          "type": "LinkedServiceReference"
        }
      }
    ],
    "parameters": {
      "inputDate": {
        "type": "string",
        "defaultValue": "2024-01-01"
      }
    }
  }
}
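
When reviewing a definition like the one in the exhibit, it can help to compare the parameters the pipeline declares with the parameters each activity actually receives. The Python sketch below does that for a trimmed copy of the exhibit. The function name is hypothetical, and matching by parameter name is a simplifying assumption: a real activity could forward a pipeline parameter under a different name through an expression.

```python
import json

# Trimmed copy of the exhibit (linked service and dependsOn omitted).
pipeline_json = """
{
  "name": "SalesAggregation",
  "properties": {
    "activities": [
      {
        "name": "Notebook1",
        "type": "SynapseNotebook",
        "typeProperties": {
          "notebook": "SalesAggregationNotebook",
          "parameters": {}
        }
      }
    ],
    "parameters": {
      "inputDate": {"type": "string", "defaultValue": "2024-01-01"}
    }
  }
}
"""


def unforwarded_parameters(pipeline: dict) -> dict:
    """Map each activity name to the pipeline parameters it does not receive."""
    props = pipeline["properties"]
    declared = set(props.get("parameters", {}))
    report = {}
    for activity in props["activities"]:
        passed = set(activity.get("typeProperties", {}).get("parameters", {}))
        report[activity["name"]] = sorted(declared - passed)
    return report


print(unforwarded_parameters(json.loads(pipeline_json)))
```

For the exhibit, the report lists inputDate under Notebook1, because the activity's parameters object is empty.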

Question 102mediummultiple choice

Read the full Develop data processing explanation →

You are designing a streaming data solution for IoT devices that generate 10,000 events per second. The data must be processed with sub-second latency and then stored in Azure Data Lake Storage Gen2 for archival. Which Azure service should you use for the stream processing?

Question 103hardmultiple choice

Read the full Develop data processing explanation →

You are troubleshooting a Synapse Pipeline that runs a Copy activity from an on-premises SQL Server to Azure Synapse Dedicated SQL Pool. The pipeline fails with the error: 'Failure happened on 'Source' side. ErrorCode=SqlOperationFailed.' The on-premises SQL Server has no firewall restrictions. What is the most likely cause?

Question 104easymultiple choice

Read the full Develop data processing explanation →

You need to transform JSON data containing nested arrays into a tabular format for analysis in Azure Synapse Analytics. Which transformation in Azure Data Factory or Synapse Pipelines should you use?

Question 105hardmultiple choice

Refer to the exhibit. The pipeline fails with a 'Type mismatch' error. The source file has a column 'Name' of type String, and the destination table expects 'FullName' of type String. What is the most likely cause of the failure?

Exhibit

{
  "type": "Microsoft.DataFactory/factories/pipelines",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "policy": {
          "retry": 2,
          "timeout": "0:10:00"
        },
        "inputs": [
          {
            "referenceName": "DelimitedTextInput",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "ParquetOutput",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
              "type": "AzureBlobStorageReadSettings",
              "recursive": true
            }
          },
          "sink": {
            "type": "ParquetSink",
            "storeSettings": {
              "type": "AzureBlobFSWriteSettings"
            },
            "formatSettings": {
              "type": "ParquetWriteSettings"
            }
          },
          "translator": {
            "type": "TabularTranslator",
            "mappings": [
              {
                "source": {
                  "name": "Name",
                  "type": "String"
                },
                "sink": {
                  "name": "FullName",
                  "type": "String"
                }
              }
            ]
          }
        }
      }
    ]
  }
}

Question 106mediummultiple choice

Read the full Develop data processing explanation →

You are running a Spark notebook in Azure Synapse Analytics that reads from a Delta table and writes to a Parquet file. The job fails with the error: 'AnalysisException: Table or view not found: bronze.sales'. The table exists in the lakehouse. What is the most likely cause?

Question 107easymultiple choice

Read the full Develop data processing explanation →

You need to incrementally load new and updated records from a source SQL Server database to Azure Synapse Dedicated SQL Pool. The source table has a LastModifiedDate column. Which Azure Data Factory feature should you use to implement incremental loading efficiently?

Question 108mediummultiple choice

Read the full Develop data processing explanation →

You are partitioning a large fact table in Azure Synapse Dedicated SQL Pool by date. The table is used for queries that filter on CustomerID and Date. You want to minimize data movement. Which distribution strategy should you use?

Question 109hardmultiple choice

Refer to the exhibit. The pipeline executes a Spark job definition that reads from the input path and writes to the output path. The job fails with an error indicating that the output path already exists. The job is configured with default behavior. What configuration change should you make to allow the job to overwrite the existing output?

Question 110easymultiple choice

Read the full Develop data processing explanation →

You need to process a large number of small files (each < 1 MB) from Azure Blob Storage in Azure Synapse Analytics. The processing is I/O-bound due to many small file operations. Which approach should you use to improve performance?

Question 111mediummulti select

Read the full Develop data processing explanation →

Which TWO of the following are valid ways to handle late-arriving data in a streaming solution with Azure Stream Analytics? (Choose two.)

Question 112hardmulti select

Read the full Develop data processing explanation →

Which THREE of the following are best practices for optimizing performance of Delta Lake tables in Azure Synapse Analytics? (Choose three.)

Question 113easymulti select

Read the full Develop data processing explanation →

Which TWO of the following are supported sources for Azure Data Factory Copy activity? (Choose two.)

Question 114hardmultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You are creating a serverless SQL table in Azure Synapse Analytics that reads Parquet files from the specified location. The folder contains multiple Parquet files with different schemas. When querying the table, you get an error about schema mismatch. What is the most likely reason?

Exhibit

{
  "type": "Microsoft.Synapse/workspaces/databases/tables",
  "properties": {
    "source": {
      "provider": "ABFS",
      "location": "abfss://container@storage.dfs.core.windows.net/data/"
    },
    "format": {
      "type": "parquet",
      "derivedModel": false
    },
    "options": {
      "recursive": true
    }
  }
}

Question 115mediummultiple choice

Read the full Develop data processing explanation →

You are building a data processing solution that requires exactly-once semantics when writing to Azure Event Hubs from Azure Stream Analytics. Which output format should you configure?

Question 116easymultiple choice

You need to orchestrate a data pipeline that includes a Python script and a Data Flow in Azure Synapse Analytics. The Python script must run before the Data Flow. Which activity should you use to run the Python script?

Question 117mediummultiple choice

Read the full Develop data processing explanation →

Your organization is using Azure Synapse Analytics dedicated SQL pool. You notice that queries are running slower than expected. Upon reviewing the execution plans, you see that some queries are performing table scans instead of seeks on large fact tables. What is the most likely cause?

Question 118hardmultiple choice

Read the full Develop data processing explanation →

You are designing a data ingestion pipeline for Azure Synapse Analytics using PolyBase. The source data is in Azure Blob Storage, with files in the Parquet format. The data volume is approximately 2 TB daily. You need to minimize the time to load data into a dedicated SQL pool table. Which approach should you recommend?

Question 119easymultiple choice

Read the full Develop data processing explanation →

You are building a data pipeline in Azure Data Factory to copy data from an on-premises SQL Server database to Azure Blob Storage. The pipeline must run daily and handle incremental updates. The on-premises SQL Server table has a LastModifiedDate column that is updated when a row changes. What is the most efficient way to implement incremental loads?
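
One common way to use a LastModifiedDate column for incremental loads is a high-watermark pattern: remember the largest value already copied and select only rows changed after it. The Python sketch below illustrates the idea with an in-memory SQLite table. The table name Orders, the sample rows, and the function load_increment are hypothetical, and the sketch is not presented as the expected answer or as Data Factory configuration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (Id INTEGER, LastModifiedDate TEXT)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [
        (1, "2024-01-01T08:00:00"),
        (2, "2024-01-02T09:30:00"),
        (3, "2024-01-03T11:45:00"),
    ],
)


def load_increment(connection, last_watermark):
    """Return rows changed after last_watermark plus the new watermark."""
    new_watermark = connection.execute(
        "SELECT MAX(LastModifiedDate) FROM Orders"
    ).fetchone()[0]
    rows = connection.execute(
        "SELECT Id, LastModifiedDate FROM Orders "
        "WHERE LastModifiedDate > ? AND LastModifiedDate <= ? "
        "ORDER BY Id",
        (last_watermark, new_watermark),
    ).fetchall()
    return rows, new_watermark


# The first run picks up everything changed since the stored watermark.
rows, watermark = load_increment(conn, "2024-01-01T23:59:59")
print(rows)
print(watermark)

# A second run with the updated watermark finds nothing new.
print(load_increment(conn, watermark)[0])
```

ISO 8601 strings sort in chronological order, which is why plain string comparison works in this sketch; a real source table would normally use a datetime column.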

Question 120mediummultiple choice

Read the full Develop data processing explanation →

You are designing a streaming solution in Azure Synapse Analytics that uses the serverless SQL pool to query streaming data in real time. The data is ingested via Azure Event Hubs and processed using Azure Stream Analytics. The output of Stream Analytics is written to Azure Data Lake Storage Gen2 in Delta Lake format. You need to ensure that the serverless SQL pool can query the latest data with minimal latency. Which approach should you use?

Question 121hardmultiple choice

Read the full Develop data processing explanation →

Your company uses Azure Synapse Analytics and has deployed a pipeline that uses a Mapping Data Flow to transform data. The data flow reads from a source in Azure Blob Storage and writes to a dedicated SQL pool. You notice that the data flow is running slowly and consuming a lot of Data Flow cluster resources. You need to improve performance without increasing the cluster size. Which action should you take?

Question 122easymultiple choice

Read the full Develop data processing explanation →

You are designing a data processing solution in Azure Databricks. The data is stored in Azure Data Lake Storage Gen2 and you need to perform transformations using Apache Spark. The security requirements mandate that all data in transit must be encrypted and that the storage account must not be accessible from the public internet. What should you configure?

Question 123hardmultiple choice

Read the full Develop data processing explanation →

You are optimizing a data pipeline in Azure Data Factory that uses a Copy activity to transfer data from an Azure SQL Database to a dedicated SQL pool in Azure Synapse Analytics. The source table has 500 million rows and the copy operation is taking too long. You need to reduce the copy duration. Which configuration change will have the most impact?

Question 124easymultiple choice

Read the full Develop data processing explanation →

Your team is building a real-time dashboard using Azure Stream Analytics. The data source is an Azure Event Hub that receives clickstream events. You need to output aggregated data (counts per page per minute) to an Azure SQL Database for reporting. The query must handle late-arriving events and ensure exactly-once semantics. Which Stream Analytics feature should you use?
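
The aggregation described here (counts per page per minute) is a fixed, non-overlapping window computation. A minimal Python sketch of that bucketing follows. The events are made up, and the sketch labels each window by its start and uses half-open [start, end) buckets as a simplification; the service's own window boundary conventions may differ. Late-arrival handling and delivery guarantees are outside its scope.

```python
from collections import Counter
from datetime import datetime

# Hypothetical clickstream events: (event time, page).
events = [
    (datetime(2024, 1, 1, 12, 0, 5), "/home"),
    (datetime(2024, 1, 1, 12, 0, 40), "/home"),
    (datetime(2024, 1, 1, 12, 0, 59), "/cart"),
    (datetime(2024, 1, 1, 12, 1, 0), "/home"),
]


def tumbling_counts(stream):
    """Count events per (one-minute window start, page)."""
    counts = Counter()
    for ts, page in stream:
        window_start = ts.replace(second=0, microsecond=0)
        counts[(window_start, page)] += 1
    return counts


counts = tumbling_counts(events)
for (window_start, page), n in sorted(counts.items()):
    print(window_start.strftime("%H:%M"), page, n)
```

Each event lands in exactly one window, which is what distinguishes this kind of window from overlapping ones.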

Question 125mediummultiple choice

Read the full Develop data processing explanation →

You are using Azure Data Lake Storage Gen2 as the data lake for your organization. You need to process files in the 'incoming' folder using a scheduled Azure Databricks notebook. After processing, the files should be moved to the 'processed' folder. The files are large (up to 10 GB) and you want to minimize the time to move them. Which approach should you use?

Question 126mediummulti select

Read the full Develop data processing explanation →

You are designing a data processing pipeline in Azure Data Factory that uses a Mapping Data Flow. You need to handle errors gracefully, such as when a row fails to convert a column value. Which TWO actions should you take? (Choose two.)

Question 127hardmulti select

Read the full Develop data processing explanation →

You are optimizing the performance of a large-scale batch processing job in Azure Databricks. The job reads data from Azure Data Lake Storage Gen2, performs transformations, and writes results back. You notice that the job is I/O bound. Which THREE strategies can improve performance? (Choose three.)

Question 128easymulti select

Read the full Develop data processing explanation →

You are using Azure Stream Analytics to process real-time data from an IoT hub. The output is sent to Azure Blob Storage for long-term storage. You need to ensure that the output files are partitioned by date and hour for easy querying. Which THREE configurations should you set? (Choose three.)
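
To picture what date-and-hour partitioning of output paths looks like, the Python sketch below expands a path pattern containing {date} and {time} tokens for a given event time. The function render_path, the prefix telemetry, and the chosen formats (year/month/day and a two-digit hour) are assumptions for illustration; the sketch does not configure a Stream Analytics output.

```python
from datetime import datetime


def render_path(pattern, ts, date_format="%Y/%m/%d", time_format="%H"):
    """Expand {date} and {time} tokens in a path pattern (hypothetical helper)."""
    return (
        pattern
        .replace("{date}", ts.strftime(date_format))
        .replace("{time}", ts.strftime(time_format))
    )


event_time = datetime(2024, 1, 1, 13, 5)
path = render_path("telemetry/{date}/{time}", event_time)
print(path)
```

With this layout, all events from the same hour share a folder, so a query for one hour only needs to list one folder.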

Question 129hardmultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You are an Azure data engineer responsible for ensuring that all storage accounts used in data pipelines enforce HTTPS traffic. You apply the Azure Policy definition shown in the exhibit. Later, a data engineer creates a new storage account with 'Enable secure transfer' set to Disabled. What will happen when the policy is evaluated?

Exhibit

{
  "name": "Policy to enforce HTTPS on storage accounts",
  "properties": {
    "policyRule": {
      "if": {
        "allOf": [
          { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
          { "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly", "equals": "false" }
        ]
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}
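
The structure of the rule in the exhibit (an allOf list of field/equals conditions and a then effect) can be traced with a toy evaluator. The Python sketch below is not the Azure Policy engine: it supports only allOf and equals, it represents a resource as a flat dictionary keyed by field name, and it compares values case-insensitively as strings. All three are assumptions made for illustration.

```python
policy_rule = {
    "if": {
        "allOf": [
            {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
            {
                "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly",
                "equals": "false",
            },
        ]
    },
    "then": {"effect": "deny"},
}


def condition_matches(condition, resource):
    """Evaluate an allOf group or a single field/equals condition."""
    if "allOf" in condition:
        return all(condition_matches(c, resource) for c in condition["allOf"])
    actual = resource.get(condition["field"])
    # Assumption: compare as case-insensitive strings.
    return str(actual).lower() == str(condition["equals"]).lower()


def evaluate(rule, resource):
    """Return the rule's effect if the 'if' block matches, otherwise None."""
    if condition_matches(rule["if"], resource):
        return rule["then"]["effect"]
    return None


# Hypothetical resources, flattened to the fields the rule inspects.
insecure_account = {
    "type": "Microsoft.Storage/storageAccounts",
    "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly": False,
}
secure_account = {
    "type": "Microsoft.Storage/storageAccounts",
    "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly": True,
}

print(evaluate(policy_rule, insecure_account))
print(evaluate(policy_rule, secure_account))
```

The sketch only shows which resources the rule's conditions select; when the effect is applied during a deployment is determined by the service.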

Question 130mediummultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You are querying the sys.external_tables view in an Azure Synapse Analytics serverless SQL pool. The query returns no rows, but you know that external tables have been created. What is the most likely reason?

Exhibit

{
  "name": "List all external tables in a serverless SQL pool",
  "query": "SELECT name, location FROM sys.external_tables WHERE type = 'HADOOP'"
}

Question 131easymultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You are reviewing an ARM template snippet for an Azure Synapse Analytics workspace. The template defines an integration runtime. A colleague asks whether this integration runtime can be used to copy data from an on-premises SQL Server database to Azure Blob Storage. What should you answer?

Exhibit

{
  "type": "Microsoft.Synapse/workspaces/integrationRuntimes",
  "apiVersion": "2021-06-01-preview",
  "properties": {
    "type": "SelfHosted",
    "description": "Integration runtime for on-premises data sources"
  }
}

Question 132mediummultiple choice

Read the full Develop data processing explanation →

You are designing a data processing solution in Azure Synapse Analytics. The solution must process streaming data from Azure Event Hubs and store the results in a dedicated SQL pool. You need to choose the most appropriate service for near real-time ingestion with minimal latency. What should you use?

Question 133easymultiple choice

Read the full Develop data processing explanation →

In Azure Synapse Analytics serverless SQL pool, you query Parquet files stored in Azure Data Lake Storage Gen2. You notice that queries are slow. Which configuration change is most likely to improve performance?

Question 134hardmultiple choice

You are designing a data processing solution for a financial services company. The solution must process sensitive customer data in Azure Databricks with encryption at rest and in transit, and must also support row-level security (RLS). Which combination of features should you use?

Question 135mediummultiple choice

Read the full Develop data processing explanation →

You have an Azure Data Factory pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails intermittently with timeout errors. You need to improve reliability. What should you do?

Question 136hardmultiple choice

Read the full Develop data processing explanation →

You are designing a data processing solution for a healthcare organization. The solution must process streaming data from IoT devices and store it in Azure Data Lake Storage Gen2. The data must be available for both real-time dashboards and historical analysis. You need to minimize operational overhead. What should you do?

Question 137easymultiple choice

Read the full Develop data processing explanation →

You are monitoring an Azure Synapse Analytics pipeline that runs daily. The pipeline uses a data flow to transform data. You notice that the data flow is slow and consumes a lot of compute resources. Which action can you take to optimize performance?

Question 138mediummultiple choice

Read the full Develop data processing explanation →

You need to process a large dataset stored as CSV files in Azure Data Lake Storage Gen2 using Azure Databricks. The processing involves several transformations and aggregations. You want to minimize shuffle operations. Which approach should you use?

Question 139hardmultiple choice

Read the full Develop data processing explanation →

You are developing a data processing solution in Azure Synapse Analytics. The solution must support both batch and streaming data ingestion into a dedicated SQL pool. You need to ensure that data from streaming sources is available for queries within 5 seconds. Which approach should you use?

Question 140easymultiple choice

Read the full Develop data processing explanation →

You are using Azure Data Factory to copy data from Azure Blob Storage to Azure SQL Database. The copy operation fails with error 'Cannot insert duplicate key'. What is the most likely cause and solution?

Question 141mediummulti select

Read the full Develop data processing explanation →

Which TWO actions can you take to optimize query performance in Azure Synapse Analytics dedicated SQL pool?

Question 142hardmulti select

Read the full Develop data processing explanation →

Which THREE components are required to implement a real-time data processing solution using Azure Stream Analytics?

Question 143easymulti select

Read the full Develop data processing explanation →

Which TWO features of Azure Databricks help manage data governance and security for sensitive data?

Question 144mediummultiple choice

Read the full Develop data processing explanation →

Refer to the exhibit. You have a managed identity that needs to read data from the 'data' container in Azure Data Lake Storage Gen2. The policy currently denies access. What is the most likely cause?

Exhibit

The following is a JSON policy for an Azure Data Lake Storage Gen2 storage account:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "ManagedIdentity": "..."
      },
      "Action": [
        "Microsoft.Storage/storageAccounts/blobServices/containers/read"
      ],
      "Resource": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Storage/storageAccounts/.../blobServices/default/containers/data/...",
      "Condition": {
        "Bool": {
          "acs:RequestVersion": "2019-12-12"
        }
      }
    }
  ]
}

Question 145 (hard, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You have an Azure Synapse Analytics workspace. You need to ensure that data processing jobs can access the Data Lake Storage Gen2 account using a managed identity. What should you do?

Exhibit

Refer to the exhibit. The following is an Azure CLI output after running a command on a Synapse Analytics workspace:

{
  "name": "myworkspace",
  "type": "Microsoft.Synapse/workspaces",
  "location": "eastus",
  "properties": {
    "defaultDataLakeStorage": {
      "accountUrl": "https://mydatalake.dfs.core.windows.net",
      "filesystem": "myfilesystem"
    },
    "sqlAdministratorLogin": "adminuser",
    "managedResourceGroupName": "managedRG",
    "provisioningState": "Succeeded",
    "privateEndpointConnections": []
  }
}

Question 146 (medium, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You run the KQL query in Azure Data Explorer. What is the output?

Exhibit

Refer to the exhibit. The following is a KQL query run in Azure Data Explorer:

let T = datatable(Id:int, Name:string, Age:int)
[
  1, 'Alice', 30,
  2, 'Bob', 25,
  3, 'Charlie', 35
];
T
| where Age > 25
| project Name, Age
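
The filter-then-project logic of the query can be traced by hand with a plain Python sketch. This is an illustration only, not KQL; the names `rows` and `result` are ours and do not come from the exhibit.

```python
# Rows mirror the datatable in the exhibit: (Id, Name, Age).
rows = [
    (1, "Alice", 30),
    (2, "Bob", 25),
    (3, "Charlie", 35),
]

# `where Age > 25` keeps rows whose Age is strictly greater than 25.
# `project Name, Age` keeps only those two columns, in that order.
result = [(name, age) for (_id, name, age) in rows if age > 25]

for name, age in result:
    print(name, age)
```

Note that the comparison is strict, so a row with Age equal to 25 does not pass the filter.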

Question 147 (medium, multiple choice)

Read the full Develop data processing explanation →

You are building a data pipeline in Azure Synapse Analytics that ingests JSON files from Azure Data Lake Storage Gen2, transforms the data using a mapping data flow, and loads it into a dedicated SQL pool. The pipeline must support incremental loads based on a LastModified timestamp in the source files. Which configuration should you use in the mapping data flow source transformation?
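
The idea behind a timestamp-based incremental load can be sketched in plain Python: keep a watermark from the previous run and pick up only files modified after it. This is a conceptual sketch, not the mapping data flow configuration; the function name `files_to_load` and the `path` / `last_modified` keys are hypothetical.

```python
from datetime import datetime, timezone

def files_to_load(files, last_watermark):
    """Return files modified after the watermark, plus the next watermark."""
    selected = [f for f in files if f["last_modified"] > last_watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max(
        (f["last_modified"] for f in selected), default=last_watermark
    )
    return selected, new_watermark

# Hypothetical file listing with modification times in UTC.
files = [
    {"path": "raw/a.json", "last_modified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"path": "raw/b.json", "last_modified": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"path": "raw/c.json", "last_modified": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
watermark = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
selected, new_watermark = files_to_load(files, watermark)
print([f["path"] for f in selected], new_watermark.isoformat())
```

Persisting the returned watermark between runs is what makes the next run skip files it has already processed.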

Question 148 (hard, multiple choice)

Read the full Develop data processing explanation →

You are designing a batch processing solution in Azure Databricks that reads Parquet files from Azure Data Lake Storage Gen2, performs aggregations, and writes results to a Delta table. The data volume is expected to grow to 10 TB per batch. You need to minimize shuffle operations during the aggregation step. Which approach should you recommend?

Question 149 (easy, multiple choice)

Read the full Develop data processing explanation →

You are developing a real-time data processing solution using Azure Stream Analytics. The input is an Azure Event Hubs stream with JSON data containing a 'timestamp' field. You need to output the average temperature per device every minute using a tumbling window. Which query should you use?
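
What a one-minute tumbling window computes can be sketched in plain Python: each event falls into exactly one fixed, non-overlapping window, and the average is taken per device per window. This is not Stream Analytics query syntax; the `deviceId` and `temperature` field names are assumptions, and window boundary handling in a real engine may differ from this simplified version.

```python
from collections import defaultdict
from datetime import datetime, timezone

def tumbling_avg(events, window_seconds=60):
    """Average temperature per (device, window start) over fixed windows."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for e in events:
        epoch = int(datetime.fromisoformat(e["timestamp"]).timestamp())
        # Align the event time down to the start of its window.
        window_start = epoch - (epoch % window_seconds)
        acc = sums[(e["deviceId"], window_start)]
        acc[0] += e["temperature"]
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Hypothetical sample events.
events = [
    {"deviceId": "d1", "timestamp": "2024-01-01T00:00:10+00:00", "temperature": 20.0},
    {"deviceId": "d1", "timestamp": "2024-01-01T00:00:50+00:00", "temperature": 22.0},
    {"deviceId": "d1", "timestamp": "2024-01-01T00:01:05+00:00", "temperature": 30.0},
    {"deviceId": "d2", "timestamp": "2024-01-01T00:00:30+00:00", "temperature": 10.0},
]
averages = tumbling_avg(events)
for (device, start), avg in sorted(averages.items()):
    print(device, start, avg)
```

Because the windows do not overlap, every event contributes to one result row only, which is the property that distinguishes tumbling windows from hopping or sliding ones.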

Question 150 (hard, multiple choice)

Read the full Develop data processing explanation →

You are troubleshooting a data processing job in Azure Synapse Pipelines that fails intermittently with the error: 'Operation on target Sink failed: The request was aborted: Could not create SSL/TLS secure channel.' The pipeline reads from Azure Blob Storage and writes to an Azure SQL Database. The source and sink are in the same region. What is the most likely cause?

Question 151 (medium, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution in Azure using Azure Data Lake Storage Gen2 as the storage layer. You need to ensure that data ingested from various sources is immutable and can be used for both batch and streaming workloads. Which storage design pattern should you implement?

Question 152 (hard, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. The pipeline fails because the source and sink datasets do not match. The source file is a CSV with columns: CustomerID, Name, City. The sink table dbo.Customer has columns: CustomerID, Name, City, CreatedDate (default GETDATE()). The pipeline uses auto-mapping. Why does the pipeline fail?

Exhibit

{
  "name": "CopyDataFromBlobToSQL",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "inputs": [{"referenceName": "BlobSourceDataset", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "SQLSinkDataset", "type": "DatasetReference"}],
        "typeProperties": {
          "source": {
            "type": "BlobSource",
            "recursive": true
          },
          "sink": {
            "type": "SqlSink",
            "writeBatchSize": 10000,
            "preCopyScript": "TRUNCATE TABLE dbo.Customer"
          }
        }
      }
    ]
  }
}

Question 153 (easy, multiple choice)

Read the full Develop data processing explanation →

You are tasked with transforming data in an Azure Synapse Analytics pipeline using a mapping data flow. The source data contains a column 'FullName' in the format 'LastName, FirstName'. You need to split this into two separate columns: 'LastName' and 'FirstName'. Which transformation should you use?
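
The required column split can be illustrated with a small Python sketch. It shows only the string logic, not the data flow expression language; the function name `split_full_name` is hypothetical.

```python
def split_full_name(full_name):
    """Split 'LastName, FirstName' into two trimmed columns."""
    # partition() splits on the first comma only.
    last, _, first = full_name.partition(",")
    return {"LastName": last.strip(), "FirstName": first.strip()}

print(split_full_name("Doe, Jane"))
```

Trimming matters here because the source format puts a space after the comma.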

Question 154 (medium, multiple choice)

Read the full Develop data processing explanation →

You are monitoring a production data pipeline in Azure Data Factory that runs hourly. You notice that the pipeline has been failing for the last 3 hours due to a timeout error when writing to the sink Azure SQL Database. The database is in the Standard tier with 100 DTUs. The pipeline writes approximately 1 million rows per run. What is the most effective way to prevent the timeout?

Question 155 (hard, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You submit a Spark job in Azure Synapse Analytics using the Azure CLI. The job runs slowly during the shuffle phase. The input data is about 200 GB. Which configuration change would best improve performance for this shuffle-heavy workload?

Exhibit (image not available)

Question 156 (medium, multiple choice)

Read the full Develop data processing explanation →

You are developing a streaming solution with Azure Stream Analytics that outputs results to Azure Synapse Analytics. The output requires exactly-once semantics to avoid duplicates. Which configuration should you use?

Question 157 (medium, multi select)

Read the full Develop data processing explanation →

You are designing a data processing solution using Azure Databricks. You need to read data from Azure Data Lake Storage Gen2, transform it using Spark SQL, and write to a Delta table. Which TWO configurations are required to ensure optimal performance for large datasets?

Question 158 (hard, multi select)

Read the full Develop data processing explanation →

You are building a data processing pipeline in Azure Synapse Analytics that uses a mapping data flow to perform a lookup transformation. The lookup source is a dimension table with 10 million rows. You need to optimize the lookup performance. Which THREE actions should you take?

Question 159 (medium, multi select)

Read the full Develop data processing explanation →

You are implementing a data processing solution using Azure Data Factory. You have a pipeline that copies data from Azure Blob Storage to Azure Data Lake Storage Gen2. You need to ensure that the copy activity uses managed identity for authentication and that the data is transferred securely. Which TWO configurations should you apply?

Question 160 (medium, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You deploy this ARM template to create a managed integration runtime in Azure Synapse Analytics. You notice that the integration runtime shows as 'Running' but copy activities using it are slow. The data volume is 500 GB per run. What is the most likely cause of the poor performance?

Exhibit

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.Synapse/workspaces/integrationRuntimes",
      "apiVersion": "2021-06-01-preview",
      "name": "[concat(parameters('workspaceName'), '/MyIntegrationRuntime')]",
      "properties": {
        "type": "Managed",
        "typeProperties": {
          "computeProperties": {
            "location": "AutoResolve",
            "nodeSize": "Standard_D3_v2",
            "numberOfNodes": 2,
            "maxParallelExecutionsPerNode": 2
          }
        }
      }
    }
  ]
}

Question 161 (hard, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution using Azure Synapse Analytics serverless SQL pool. You have Parquet files in Azure Data Lake Storage Gen2 partitioned by date and hour. You need to query the data for the last 7 days efficiently, minimizing the amount of data scanned. Which approach should you use?
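
Limiting a scan to recent partitions comes down to touching only the folders for the dates of interest. The sketch below builds the folder prefixes for the last seven days. The `date=YYYY-MM-DD` folder layout, the `sales` root, and the function name are assumptions for illustration; the question does not state the actual path scheme.

```python
from datetime import date, timedelta

def last_n_day_prefixes(today, n=7, root="sales"):
    """Folder prefixes for the n most recent days, newest first."""
    return [
        f"{root}/date={(today - timedelta(days=i)).isoformat()}/"
        for i in range(n)
    ]

prefixes = last_n_day_prefixes(date(2024, 3, 3))
for p in prefixes:
    print(p)
```

A query engine that filters on the partition folders needs to read only these seven prefixes instead of the whole dataset.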

Question 162 (medium, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing pipeline in Azure Synapse Analytics that ingests streaming data from Azure Event Hubs and stores it in a dedicated SQL pool. The data must be available for querying within 5 minutes of ingestion. Which processing approach should you recommend?

Question 163 (hard, multiple choice)

Read the full Develop data processing explanation →

You are troubleshooting a slow-running Azure Synapse Pipeline that loads data from Azure Blob Storage into a dedicated SQL pool using a Copy activity. The source is a set of CSV files totaling 500 GB. The sink is a staging table with a clustered columnstore index. The pipeline takes 4 hours to complete. You need to reduce the execution time to under 1 hour. What should you do?

Question 164 (easy, multiple choice)

Read the full Develop data processing explanation →

Your team is developing a data processing solution using Azure Databricks. The data is stored in Delta Lake format in Azure Data Lake Storage Gen2. You need to ensure that when multiple jobs concurrently write to the same Delta table, the operations are atomic and consistent. Which Delta Lake feature should you use?

Question 165 (hard, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution for a financial services company. They need to process sensitive customer data in Azure Databricks while complying with GDPR. The data must be encrypted at rest and in transit, and access must be audited. You need to recommend a configuration that meets these requirements. Which combination of actions should you take?

Question 166 (medium, multiple choice)

Read the full Develop data processing explanation →

You are building a real-time dashboard that displays sales data from an Azure SQL Database. The dashboard must refresh every 30 seconds with minimal latency. You need to choose the appropriate Azure service for data processing and visualization. Which service should you use?

Question 167 (easy, multiple choice)

Read the full Develop data processing explanation →

You have an Azure Data Factory pipeline that uses a Copy activity to move data from an on-premises SQL Server to Azure Blob Storage. The pipeline fails intermittently with a timeout error. You need to improve the reliability of the data transfer. Which configuration change should you make?

Question 168 (hard, multiple choice)

Read the full Develop data processing explanation →

You are designing a near-real-time data processing solution that ingests millions of events per second from IoT devices. The data must be aggregated on a per-minute basis and stored in Azure Data Lake Storage Gen2 for long-term analytics. The solution must also support alerting when certain thresholds are exceeded. Which combination of Azure services should you use?

Question 169 (medium, multiple choice)

Read the full Develop data processing explanation →

You are developing a data processing pipeline in Azure Synapse Analytics that uses a SQL script to transform data. The script currently runs against the built-in serverless SQL pool, but you want to migrate it to a dedicated SQL pool for better performance. What must you change in the pipeline?

Question 170 (easy, multiple choice)

Read the full Develop data processing explanation →

You have an Azure Databricks notebook that processes data from a Delta table. The notebook runs slowly due to many small files. You need to optimize the Delta table for faster reads. Which Delta Lake operation should you run?

Question 171 (hard, multi select)

Read the full Develop data processing explanation →

You are designing a data processing solution that must handle both batch and stream processing in a unified manner. You need to choose a technology that supports a single code base for both processing modes. Which TWO technologies meet this requirement?

Question 172 (medium, multi select)

Read the full Develop data processing explanation →

You are building a data processing pipeline in Azure Synapse Analytics that ingests data from Azure Blob Storage and writes to a dedicated SQL pool. You need to ensure the pipeline can handle schema changes in the source data without failing. Which TWO actions should you take?

Question 173 (medium, multi select)

Read the full Develop data processing explanation →

You are optimizing an Azure Synapse Pipeline that processes large volumes of data using a Mapping Data Flow. The pipeline is taking too long due to data skew. Which THREE techniques can help mitigate data skew?

Question 174 (medium, multiple choice)

Read the full Develop data processing explanation →

You are building a streaming pipeline in Azure Synapse Analytics to ingest real-time sensor data from IoT devices. The data must be processed with a 2-second latency and stored in a dedicated SQL pool for reporting. The source emits JSON messages with a nested structure. Which approach should you use to ingest and transform the data?

Question 175 (hard, multiple choice)

Read the full Develop data processing explanation →

You have a production pipeline in Azure Data Factory that copies data from an on-premises SQL Server to Azure Blob Storage using a self-hosted integration runtime. The pipeline fails intermittently with a 'Connection closed' error. The data volume is 50 GB per run. What should you first troubleshoot to resolve this issue?

Question 176 (easy, multiple choice)

Read the full Develop data processing explanation →

You are designing a batch processing solution for a data lake. Source files arrive daily in Parquet format in Azure Data Lake Storage Gen2. The data must be cleaned, aggregated, and loaded into an Azure Synapse SQL pool. The solution should minimize compute costs and management overhead. Which technology should you use for the transformation?

Question 177 (medium, multiple choice)

Read the full Develop data processing explanation →

You are building a real-time dashboard to monitor user activity on a website. The data is ingested via Azure Event Hubs and must be aggregated every minute with a 30-second late-arrival tolerance. The aggregated results should be stored in Azure Cosmos DB for low-latency reads. Which Azure service should you use to perform the windowed aggregation?

Question 178 (hard, multiple choice)

Read the full Develop data processing explanation →

You are optimizing a data pipeline in Azure Synapse Analytics that loads data from a CSV file in ADLS Gen2 into a dedicated SQL pool using PolyBase. The load is slow and you need to improve performance. Which action would be MOST effective?

Question 179 (easy, multiple choice)

Read the full Develop data processing explanation →

Your company uses Azure Data Lake Storage Gen2 as the central data lake. You need to process batch data using serverless Spark jobs that can be scheduled daily. Which Azure service should you use?

Question 180 (medium, multiple choice)

Read the full Develop data processing explanation →

You are troubleshooting a slow-running Azure Data Factory pipeline that copies data from an Azure SQL Database to ADLS Gen2. The pipeline uses a copy activity with the default settings. The source table has 10 million rows. Which optimization should you apply first?

Question 181 (hard, multiple choice)

Read the full Develop data processing explanation →

You have a mission-critical pipeline that processes financial transactions in Azure Synapse Analytics. The pipeline uses Azure Data Factory with a mapping data flow to transform data. You need to ensure high availability and minimal data loss in case of a regional failure. What should you implement?

Question 182 (easy, multiple choice)

Read the full Develop data processing explanation →

You need to transform semi-structured JSON data into a tabular format for analysis in Azure Synapse Analytics. The data is stored in ADLS Gen2. Which feature should you use to query the JSON data directly without loading it into a table?

Question 183 (medium, multi select)

Read the full Develop data processing explanation →

You are designing a data processing solution that must handle personally identifiable information (PII). The data will be ingested from multiple sources into Azure Data Lake Storage. Which TWO actions should you take to protect the PII data during processing?

Question 184 (hard, multi select)

Read the full Develop data processing explanation →

You are optimizing an Azure Synapse Analytics pipeline that uses mapping data flows. The pipeline runs slowly when processing 100 GB of data. Which THREE settings should you adjust to improve performance?

Question 185 (medium, multi select)

Read the full Develop data processing explanation →

You are building a real-time processing solution using Azure Stream Analytics. The solution must handle out-of-order events and late arrivals. Which THREE mechanisms should you configure in the Stream Analytics job?

Question 186 (medium, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution for an e-commerce company. The company receives millions of clickstream events per hour from their website and needs to aggregate the data by product category and windowed time intervals for real-time dashboards. You need to minimize latency and cost. Which service should you use?

Question 187 (hard, multiple choice)

Read the full Develop data processing explanation →

A financial services firm uses Azure Synapse Analytics to process daily trade data. The data is stored in a dedicated SQL pool as partitioned tables by date. Recently, queries that filter on a specific date range have become slow. You suspect that partition pruning is not working effectively. What should you do to improve query performance?

Question 188 (easy, multiple choice)

Read the full Develop data processing explanation →

You need to process a large number of CSV files stored in Azure Data Lake Storage Gen2 using Azure Databricks. The files are nested in multiple folders, and the schema varies slightly between files. You want to automatically infer the schema and handle schema evolution. Which read option should you use?

Question 189 (medium, multiple choice)

Read the full Develop data processing explanation →

Your company uses Azure Data Factory to load data from an on-premises SQL Server database to Azure Synapse Analytics. The pipeline runs daily and the load volume is about 50 GB per run. Recently, the pipeline started failing with timeout errors during the copy activity. You need to ensure reliable data transfer. What should you do?

Question 190 (hard, multiple choice)

Read the full Develop data processing explanation →

You are designing a real-time analytics solution for IoT devices that emit telemetry data every second. The data must be aggregated every minute and stored in Azure SQL Database for historical analysis. You need to minimize latency and operational overhead. Which approach should you recommend?

Question 191 (medium, multiple choice)

Read the full Develop data processing explanation →

You are troubleshooting a slow-running pipeline in Azure Data Factory that uses a Copy activity to transfer data from Azure Blob Storage to Azure Synapse Analytics. The pipeline processes about 100 GB of CSV files. The copy performance is poor even though the source and sink are in the same region. What is the most likely cause?

Question 192 (easy, multiple choice)

Read the full Develop data processing explanation →

You need to process streaming data from Azure Event Hubs and store the results in Azure Cosmos DB for a real-time dashboard. The solution must handle duplicate events and ensure exactly-once processing. Which Azure service should you use?

Question 193 (hard, multiple choice)

Read the full Develop data processing explanation →

Your team uses Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query performance is inconsistent, and some queries take a long time to execute. You need to improve query performance. What should you do?

Question 194 (medium, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing pipeline that ingests data from a REST API endpoint every hour. The API returns JSON data with a varying schema. You need to store the raw data in Azure Data Lake Storage Gen2 and later process it using Azure Databricks. Which file format should you use for the raw data storage?

Question 195 (medium, multi select)

Read the full Develop data processing explanation →

Which TWO strategies reduce data movement in Azure Synapse Analytics pipelines? (Choose two.)

Question 196 (hard, multi select)

Read the full Develop data processing explanation →

Which THREE actions improve the performance of a Delta table in Azure Databricks? (Choose three.)

Question 197 (easy, multi select)

Read the full Develop data processing explanation →

Which TWO are valid output sinks for an Azure Stream Analytics job? (Choose two.)

Question 198 (medium, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You have an Azure Data Factory pipeline that copies trade data from Azure Blob Storage to Azure SQL Database. The pipeline runs every hour and truncates the destination table before each copy. However, users report that data is missing during the copy window. What is the most likely cause?

Exhibit

{
  "name": "CopyTradeData",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "SinkDataset", "type": "DatasetReference"}],
        "typeProperties": {
          "source": {
            "type": "AzureBlobStorageSource",
            "recursive": true
          },
          "sink": {
            "type": "AzureSqlSink",
            "writeBatchSize": 10000,
            "preCopyScript": "TRUNCATE TABLE dbo.Trades"
          }
        }
      }
    ]
  }
}

Question 199 (hard, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You created an external table in Azure Synapse Analytics serverless SQL pool to query Parquet files. Queries return no rows even though the files exist. What is the most likely issue?

Exhibit

CREATE EXTERNAL DATA SOURCE SalesData WITH (
  LOCATION = 'https://mystorageaccount.dfs.core.windows.net/sales',
  CREDENTIAL = StorageCred
);
CREATE EXTERNAL FILE FORMAT ParquetFormat WITH (
  FORMAT_TYPE = PARQUET,
  DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
CREATE EXTERNAL TABLE dbo.Sales (
  SaleID INT,
  ProductID INT,
  SaleDate DATE,
  Amount DECIMAL(10,2)
) WITH (
  LOCATION = 'parquet/sales/',
  DATA_SOURCE = SalesData,
  FILE_FORMAT = ParquetFormat
);

Question 200 (easy, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You have a mapping data flow in Azure Data Factory that aggregates sales data. The data flow runs successfully but the sink table contains only the total sum per run instead of per product. What is missing?

Exhibit

{
  "dataflows": [
    {
      "name": "TransformSales",
      "properties": {
        "sources": [
          {
            "name": "SalesSource",
            "dataset": {
              "referenceName": "SalesDataset",
              "type": "DatasetReference"
            }
          }
        ],
        "transformations": [
          {
            "name": "AggregateSales",
            "type": "Aggregate",
            "inputs": ["SalesSource"],
            "aggregates": [
              {
                "column": "TotalAmount",
                "function": "SUM",
                "input": "Amount"
              }
            ]
          }
        ],
        "sink": {
          "name": "SalesSink",
          "dataset": {
            "referenceName": "AggregatedSalesDataset",
            "type": "DatasetReference"
          }
        }
      }
    }
  ]
}
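
The difference between a single total and a per-product total can be seen in a short Python sketch. It is a conceptual illustration, not data flow script; the `ProductId` column is an assumption, since the exhibit only names the `Amount` input.

```python
from collections import defaultdict

# Hypothetical sales rows.
sales = [
    {"ProductId": "A", "Amount": 10.0},
    {"ProductId": "A", "Amount": 5.0},
    {"ProductId": "B", "Amount": 7.5},
]

# Aggregating with no grouping key collapses every row into one total.
total_all = sum(r["Amount"] for r in sales)

# Aggregating with a grouping key yields one total per distinct key.
total_per_product = defaultdict(float)
for r in sales:
    total_per_product[r["ProductId"]] += r["Amount"]

print(total_all)
print(dict(total_per_product))
```

An aggregate that defines only the SUM and no grouping column behaves like the first computation.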

Question 201 (medium, multiple choice)

Read the full Develop data processing explanation →

Your organization uses Azure Synapse Analytics. You need to design a data transformation pipeline that processes streaming data from Azure Event Hubs, performs aggregations over a 5-minute tumbling window, and loads the results into a dedicated SQL pool table. Which Azure service should you use to implement the streaming transformation?

Question 202 (hard, multiple choice)

Read the full Develop data processing explanation →

You are building a data processing pipeline in Azure Synapse Analytics. The pipeline should read data from Azure Data Lake Storage Gen2 (Parquet files), apply transformations using a mapping data flow, and write the results to a dedicated SQL pool table. The source data contains personally identifiable information (PII). You need to mask the PII columns (e.g., email) using a data masking function within the data flow. Which transformation should you use?

Question 203 (easy, multiple choice)

Read the full Develop data processing explanation →

Your team is using Azure Data Factory to orchestrate a data pipeline that copies data from an on-premises SQL Server to Azure Blob Storage. The pipeline runs successfully during testing. However, after moving to production, you notice that the pipeline fails intermittently with connectivity errors. You need to ensure reliable data transfer. What should you implement?

Question 204 (medium, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution for a retail company. The solution must ingest clickstream data from a website, process it in near real-time to update user session information, and store the results in Azure Cosmos DB for low-latency queries. Which combination of Azure services should you use?

Question 205 (hard, multiple choice)

Read the full Develop data processing explanation →

You are optimizing an Azure Synapse Analytics dedicated SQL pool. A fact table named Sales is partitioned by date and hash-distributed on the ProductKey column. You notice that queries filtering on OrderDate are performing poorly. You need to improve query performance for date range filters. What should you do?

Question 206 (easy, multiple choice)

Read the full Develop data processing explanation →

Your organization uses Azure Data Lake Storage Gen2 (ADLS Gen2) and wants to transform data using Azure Databricks. The data is stored in Parquet format. You need to read the data into a Spark DataFrame. Which DataFrame reader method should you use?

Question 207 (medium, multiple choice)

Read the full Develop data processing explanation →

You are developing a real-time data processing solution using Azure Stream Analytics. The input is from an Azure Event Hub, and the output is to an Azure Synapse Analytics dedicated SQL pool. You need to ensure exactly-once delivery semantics to the SQL pool. What should you configure?

Question 208 (hard, multiple choice)

Read the full Develop data processing explanation →

You are working with a Delta Lake table in Azure Databricks. The table is updated frequently with new data and occasionally with updates to existing rows. You need to optimize read performance for queries that filter on a specific date column. The table is partitioned by date. Which optimization technique should you apply?

Question 209 (medium, multiple choice)

Read the full Develop data processing explanation →

Your organization has an Azure Data Factory pipeline that executes a series of activities to transform data. One of the activities is an Azure Databricks notebook that should run only if the previous activity succeeds. You need to configure the pipeline to handle failures gracefully and send an email alert if the Databricks activity fails. What should you do?

Question 210 (medium, multi select)

Read the full Develop data processing explanation →

Which TWO components are required to set up a streaming data pipeline using Azure Synapse Analytics? (Select two.)

Question 211 (hard, multi select)

Read the full Develop data processing explanation →

Which THREE considerations are important when designing a data processing solution using Azure Databricks for ETL workloads? (Select three.)

Question 212 (easy, multi select)

Read the full Develop data processing explanation →

Which TWO Azure services can be used to perform data transformation in a data pipeline? (Select two.)

Question 213 (medium, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You have an Azure Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL Database. The copy activity uses a preCopyScript to truncate the destination table before writing. During a recent run, the copy activity failed after the truncation, leaving the destination table empty. You need to prevent data loss in future failures. What should you modify?

Exhibit

{
  "name": "CopyFromBlobToSQL",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "BlobInput",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "SQLOutput",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "BlobSource",
            "recursive": true
          },
          "sink": {
            "type": "SqlSink",
            "writeBatchSize": 10000,
            "preCopyScript": "TRUNCATE TABLE dbo.Destination"
          }
        }
      }
    ]
  }
}
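
The failure mode described in the question (table cleared, load fails, table left empty) goes away when clearing and loading succeed or fail as one unit. The sketch below shows that idea with SQLite from the Python standard library. It is not Data Factory or Azure SQL Database code, and the `destination` table and `reload` function are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE destination (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO destination VALUES (1, 'old')")
conn.commit()

def reload(conn, rows):
    """Replace the table contents atomically: new rows land, or old rows stay."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("DELETE FROM destination")
            conn.executemany("INSERT INTO destination VALUES (?, ?)", rows)
    except sqlite3.Error:
        return False
    return True

# The duplicate primary key makes the load fail part-way through.
ok = reload(conn, [(1, "new"), (1, "dup")])
remaining = conn.execute("SELECT id, value FROM destination").fetchall()
print(ok, remaining)
```

Because the delete and the inserts share one transaction, the failed load rolls back the delete as well, so readers still see the previous data.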

Question 214 (hard, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You are deploying an Azure Synapse Analytics dedicated SQL pool using the provided ARM template snippet. After deployment, you need to adjust the performance level to DW200c to handle increased workload. Which parameter should you modify?

Exhibit

{
  "type": "Microsoft.Synapse/workspaces/sqlPools",
  "properties": {
    "createMode": "Default",
    "storageAccountType": "GRS",
    "collation": "SQL_Latin1_General_CP1_CI_AS",
    "maxSizeBytes": 263882790666240,
    "sku": {
      "name": "DW100c",
      "tier": "ServiceObjective"
    },
    "restorePointInTime": null,
    "sourceDatabaseId": null
  }
}

Question 215 (easy, multiple choice)

Read the full Develop data processing explanation →

Refer to the exhibit. You are reviewing an Azure Stream Analytics job query. The job has a stream input and a reference data input. The job is failing with the error 'Reference data input must be of type Reference, not Stream'. What is the cause of the error?

Exhibit

{
  "inputAlias": "input",
  "type": "Stream",
  "output": {
    "outputAlias": "output",
    "type": "ReferenceData"
  },
  "query": "SELECT input.* FROM input JOIN output ON input.ProductId = output.ProductId"
}

Question 216 (medium, multiple choice)

Read the full Develop data processing explanation →

Your organization uses Azure Synapse Analytics dedicated SQL pool to store sales data. You need to design a data loading process for a nightly batch that inserts new rows and updates existing rows based on the business key. The table has a clustered columnstore index. Which approach minimizes table fragmentation?

Question 217 (hard, multiple choice)

You are designing a data processing pipeline in Azure Synapse Analytics that reads streaming data from Azure Event Hubs, performs aggregations in real time, and writes results to Azure Cosmos DB for a dashboard. The data volume is 10,000 events per second at 2 KB each. The latency requirement is under 5 seconds from event ingestion to dashboard visibility. Which technology should you use for the real-time aggregation?

Question 218 (easy, multiple choice)

Your team is developing a data processing solution in Azure Synapse Analytics. You need to ensure that the solution can automatically scale compute resources based on workload demand for serverless SQL pools. Which feature should you configure?

Question 219 (medium, multiple choice)

Your company uses Azure Data Lake Storage Gen2 as a data lake. You need to process CSV files that arrive in a 'raw' container, transform them into Parquet format, and write them to a 'curated' container. The transformation includes filtering out rows with null values in the 'customer_id' column and adding a partition column 'year' based on the 'order_date'. You use Azure Synapse Pipelines. Which activity should you use for the transformation?
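
The row-level logic described in this question (drop rows with a null 'customer_id', derive 'year' from 'order_date') can be sketched in plain Python. This is only an illustration of the transformation itself, not of any Synapse activity; the function name `curate` and the sample rows are hypothetical.

```python
from datetime import date


def curate(rows):
    """Drop rows whose customer_id is None and add a 'year' key from order_date."""
    curated = []
    for row in rows:
        if row["customer_id"] is None:
            continue  # filtered out, as the question requires
        curated.append({**row, "year": row["order_date"].year})
    return curated


# Hypothetical sample input standing in for parsed CSV rows.
raw = [
    {"customer_id": 1, "order_date": date(2023, 12, 31)},
    {"customer_id": None, "order_date": date(2024, 1, 1)},
    {"customer_id": 2, "order_date": date(2024, 1, 2)},
]

for row in curate(raw):
    print(row)
```

The derived 'year' value is what a partitioned Parquet writer would use to choose the output folder for each row.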

Question 220 (hard, multiple choice)

You are building a batch processing solution in Azure Synapse Analytics that reads data from a dedicated SQL pool, applies complex transformations using Synapse Spark, and writes the results back to the dedicated SQL pool. The pipeline must run on a schedule and handle transient failures with retries. Which approach should you use?

Question 221 (easy, multiple choice)

Your team is developing a real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs. The output must be written to Azure SQL Database. You need to ensure that the processing guarantees exactly-once semantics for the output. Which output configuration should you use?

Question 222 (medium, multiple choice)

You are designing a data processing pipeline in Azure Data Factory that ingests data from an on-premises SQL Server database to Azure Data Lake Storage Gen2. The data volume is large (500 GB). The network connection between on-premises and Azure is limited to 100 Mbps. You need to minimize the time to transfer the initial full load while ensuring data integrity. Which approach should you recommend?
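
To put the figures in this question in perspective, the transfer time over the network can be estimated with simple arithmetic. The sketch below assumes decimal units (1 GB = 8,000 megabits) and a steady link; the `efficiency` parameter is a hypothetical knob for protocol overhead and is not part of the question.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_gb over a link_mbps link.

    Assumes 1 GB = 8,000 megabits; `efficiency` is the usable fraction
    of the nominal bandwidth (an illustrative assumption).
    """
    megabits = size_gb * 8_000
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


print(round(transfer_hours(500, 100), 1))       # 11.1
print(round(transfer_hours(500, 100, 0.8), 1))  # 13.9
```

Even at full line rate the initial load takes roughly 11 hours, which is why the volume of data actually sent over the link matters for this scenario.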

Question 223 (hard, multiple choice)

Your organization uses Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. You notice that queries are slow when filtering on a date column. You need to improve query performance without increasing costs. What should you do?

Question 224 (easy, multiple choice)

Your team is developing a data processing solution that uses Azure Databricks to transform streaming data from Azure Event Hubs. The transformation includes joining the stream with a static reference table stored in Azure Data Lake Storage Gen2. You need to implement the join efficiently. Which approach should you use?

Question 225 (medium, multi-select)

Which TWO actions can you take to optimize the performance of a dedicated SQL pool in Azure Synapse Analytics when loading large volumes of data?

Question 226 (hard, multi-select)

Which THREE Azure services can you use together to build a serverless, event-driven data processing pipeline that ingests data from Azure Blob Storage, transforms it using custom code, and loads it into Azure Cosmos DB?

Question 227 (easy, multi-select)

Which TWO techniques can you use to handle schema drift in Azure Data Factory mapping data flows?

Question 228 (medium, multiple choice)

You are debugging an Azure Data Factory pipeline that copies CSV files from Azure Blob Storage to Azure SQL Database. The pipeline is defined as shown in the exhibit. The first row of each CSV file contains column headers. After a run, you notice that the first data row is missing from the target table. What is the most likely cause?

Exhibit

```json
{
  "name": "CopyDataPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "inputs": [{"referenceName": "BlobInputDataset", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "SQLOutputDataset", "type": "DatasetReference"}],
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
              "type": "AzureBlobStorageReadSettings",
              "recursive": true
            },
            "formatSettings": {
              "type": "DelimitedTextReadSettings",
              "skipLineCount": 1
            }
          },
          "sink": {
            "type": "SqlSink",
            "writeBehavior": "insert",
            "sqlWriterUseTableLock": false
          },
          "enableStaging": false
        }
      }
    ]
  }
}
```
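
The exhibit does not show BlobInputDataset, so whether that dataset treats the first row as a header is an assumption here. Under that assumption, the plain-Python sketch below imitates a reader that first skips `skipLineCount` lines and only then consumes a header row. `read_rows` and its parameters are hypothetical names for illustration, not Data Factory APIs.

```python
import csv


def read_rows(text: str, skip_line_count: int, first_row_as_header: bool):
    """Skip lines first, then optionally treat the next line as the header."""
    lines = text.splitlines()[skip_line_count:]
    rows = list(csv.reader(lines))
    if first_row_as_header and rows:
        rows = rows[1:]  # the header row is consumed, not loaded as data
    return rows


sample = "id,name\n1,alpha\n2,beta\n"

# Skipping one line AND consuming a header drops the first data row.
print(read_rows(sample, skip_line_count=1, first_row_as_header=True))
# [['2', 'beta']]

# Skipping nothing and consuming the header keeps every data row.
print(read_rows(sample, skip_line_count=0, first_row_as_header=True))
# [['1', 'alpha'], ['2', 'beta']]
```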

Question 229 (hard, multiple choice)

You are analyzing a Kusto query in Azure Data Explorer that calculates total sales per product for January 2024 and filters for products with sales over 10,000. The query uses the materialize() function. You notice that the query runs slower than expected. What is the primary reason the materialize() function may not be providing the expected performance benefit in this query?

Exhibit

```kusto
let StartDate = datetime(2024-01-01);
let EndDate = datetime(2024-01-31);
let TotalSales = materialize(
    Sales
    | where OrderDate between (StartDate .. EndDate)
    | summarize TotalAmount = sum(Amount) by ProductID
);
TotalSales
| where TotalAmount > 10000
| join kind=inner (Products) on ProductID
| project ProductName, TotalAmount
| order by TotalAmount desc
```

Question 230 (hard, multi-select)

You are a Data Engineer at Contoso Ltd. The company operates an e-commerce platform that generates streaming data from user interactions (clicks, page views, purchases) at a rate of 50,000 events per second. The data is ingested into Azure Event Hubs. You need to design a data processing solution that meets the following requirements:

1. Real-time dashboards must display aggregated metrics (e.g., total sales, active users) with a latency of less than 2 seconds.
2. Historical data must be stored in Azure Data Lake Storage Gen2 in Delta Lake format for batch analytics.
3. The solution must support exactly-once semantics for the streaming output to both the dashboard and the data lake.
4. You must use Azure Stream Analytics as the primary processing engine.
5. The output to the dashboard should use Power BI, while the output to the data lake should use Azure Synapse Analytics serverless SQL pool for querying.

You have configured an Azure Stream Analytics job with Event Hubs as input. For output, you added a Power BI dataset and an Azure Data Lake Storage Gen2 output. However, you discover that the Power BI dataset is being updated with duplicate records, and the data lake output sometimes misses small windows of data during job restarts. Which two actions should you take to meet the requirements? (Choose two.)

A. Change the Power BI output to use the 'Exactly once' output mode.

B. Change the Event Hubs compatibility level to 1.0 to guarantee exactly-once.

C. Enable checkpointing in the Stream Analytics job and use the 'Exactly once' output mode for the Azure Data Lake Storage Gen2 output.

D. Use a second Stream Analytics job: one for Power BI with 'At least once' mode, and another for the data lake with 'Exactly once' mode.

Question 231 (medium, multiple choice)

A manufacturing company uses Azure Data Lake Storage Gen2 to store IoT sensor data. The data arrives in JSON format with a nested structure. You need to transform the data into a tabular format for downstream analytics using Azure Synapse Pipelines. Which data flow transformation should you use?

Question 232 (hard, multiple choice)

Your company uses Azure Synapse Analytics to run a large-scale batch processing job every night. The job currently runs on a dedicated SQL pool and takes 4 hours. Management wants to reduce the runtime to under 2 hours without increasing cost. The job involves heavy compute operations with no data movement limitations. What should you do?

Question 233 (easy, multiple choice)

You are designing a data processing solution in Azure Synapse Analytics. The solution must support both batch and streaming data ingestion. Which Azure service should you use to ingest streaming data into Synapse Analytics?

Question 234 (medium, multiple choice)

You are troubleshooting a failed Azure Synapse Pipeline execution. The pipeline uses a Copy activity to load data from an on-premises SQL Server to Azure Data Lake Storage Gen2. The error indicates a 'Connection timeout' to the on-premises source. The Integration Runtime is Self-Hosted and has been running successfully for months. What is the most likely cause?

Question 235 (hard, multiple choice)

You are building a data processing solution using Azure Databricks. The solution must process streaming data from Azure Event Hubs, join it with a static reference table stored in Azure Data Lake Storage Gen2 (Parquet format), and write the output to Azure Synapse Analytics. The reference table is updated daily. Which approach minimizes latency and ensures data consistency?

Question 236 (easy, multiple choice)

You are designing a data processing pipeline in Azure Data Factory. The pipeline must copy data from Azure Blob Storage to Azure SQL Database and transform the data using a mapping data flow. The data flow includes a Derived Column transformation. What is the purpose of the Derived Column transformation?

Question 237 (hard, multiple choice)

You are using Azure Stream Analytics to process real-time temperature data from IoT devices. The output must be written to Azure SQL Database. The job has been running successfully for weeks, but recently you notice that the output data has duplicate rows. The input events are unique. The job uses a windowed aggregation (TumblingWindow). What is the most likely cause of duplicates?

Question 238 (medium, multiple choice)

You are designing a data processing solution using Azure Synapse Analytics serverless SQL pool. The solution must query data stored in Parquet files in Azure Data Lake Storage Gen2. The queries are ad-hoc and vary greatly. Which feature should you use to optimize query performance for frequently accessed data partitions?

Question 239 (easy, multiple choice)

You are implementing a data processing solution in Azure Databricks. The solution must read data from Azure Data Lake Storage Gen2, transform it using PySpark, and write the results back to a different location in the same storage account. You need to authenticate to the storage account securely without storing secrets in the notebook. What should you use?

Question 240 (medium, multi-select)

You are designing a batch processing solution in Azure Synapse Analytics using pipelines. The solution must load data from multiple sources (Azure Blob Storage, Azure SQL Database, and REST API) into a dedicated SQL pool. After loading, you need to run a stored procedure to aggregate the data. Which two activities should you include in the pipeline? (Choose two.)

Question 241 (hard, multi-select)

You are tuning a Spark job in Azure Synapse Analytics that processes large Parquet files. The job currently takes too long due to data skew. Which three actions can improve performance? (Choose three.)

Question 242 (easy, multi-select)

You are designing a data processing solution in Azure Data Factory that uses mapping data flows. You need to perform type conversions on incoming data. Which three transformations can be used to change data types? (Choose three.)

Question 243 (hard, multiple choice)

You are a data engineer for a global e-commerce company. You need to design a data processing solution using Azure Databricks that processes real-time clickstream data from Azure Event Hubs. The solution must join the streaming data with a slowly changing dimension (SCD) table that stores product details. The SCD table is stored in Azure Data Lake Storage Gen2 in Delta format and is updated every few hours. The joined results must be written to a Delta table for near-real-time dashboards. The key requirement is to ensure that the join always uses the latest version of the SCD data without reprocessing the entire stream. The solution must minimize latency and cost. Which approach should you recommend?

Question 244 (medium, multiple choice)

Your company uses Azure Synapse Analytics to run a data warehouse. You have a dedicated SQL pool with a hash-distributed fact table named Sales. The distribution column is ProductID. You notice that queries against the Sales table are slow due to data skew. After analysis, you find that a few products (e.g., ProductID 100, 200) account for 80% of the rows. You need to optimize query performance without redesigning the entire table. You also need to minimize data movement during queries. Which action should you take?
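
The skew described here can be illustrated with a small simulation. A dedicated SQL pool spreads each table across 60 distributions; the sketch below uses a modulo as a stand-in for the service's internal hash function (an assumption, since the real function is not exposed), and the row counts are hypothetical but follow the 80% figure from the question.

```python
from collections import Counter

DISTRIBUTIONS = 60  # number of distributions in a dedicated SQL pool


def rows_per_distribution(product_ids):
    """Count rows per distribution, using modulo as a stand-in hash."""
    return Counter(pid % DISTRIBUTIONS for pid in product_ids)


# Hypothetical data: 80% of rows belong to ProductID 100 and 200.
hot = [100] * 4_000 + [200] * 4_000
cold = list(range(1_000, 3_000))  # 2,000 rows spread over many products
counts = rows_per_distribution(hot + cold)

largest = max(counts.values())
average = len(hot + cold) / DISTRIBUTIONS
print(f"largest distribution: {largest} rows, average: {average:.0f} rows")
```

Because every row with the same ProductID lands in the same distribution, two distributions end up holding most of the table while the other 58 stay nearly idle, and a query can only finish as fast as its slowest distribution.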

Question 245 (easy, multiple choice)

You are designing a data processing pipeline using Azure Data Factory. The pipeline must ingest data from an HTTP endpoint that returns a JSON array. The data must be transformed by flattening nested arrays and then loaded into an Azure SQL Database table. The pipeline should be triggered daily. You need to choose the appropriate activities and transformations. The solution must be cost-effective and easy to maintain. Which combination of activities should you use?
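
As a plain-Python illustration of what "flattening nested arrays" means for the JSON in this scenario, the sketch below turns one parent record with a nested array into one flat row per array element. The function name `flatten`, the field names, and the prefixing scheme are hypothetical and only show the shape of the result, not how any Data Factory transformation is implemented.

```python
def flatten(record: dict, array_key: str) -> list[dict]:
    """Return one flat row per element of record[array_key].

    Scalar fields of the parent are repeated on every row; fields of each
    array element are prefixed with the array key to avoid name clashes.
    """
    parent = {k: v for k, v in record.items() if k != array_key}
    return [
        {**parent, **{f"{array_key}_{k}": v for k, v in item.items()}}
        for item in record.get(array_key, [])
    ]


# Hypothetical element of the JSON array returned by the HTTP endpoint.
order = {"orderId": 7, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}

for row in flatten(order, "items"):
    print(row)
```

Each resulting row has only scalar columns, which is the shape a relational table such as one in Azure SQL Database expects.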

Question 246 (easy, multiple choice)

You are designing a data processing solution for a real-time dashboard. The source data is streaming from IoT devices at a rate of 10,000 events per second. Each event is under 1 KB. You need to store the raw data in a staging area before processing, and you must minimize storage costs. Which storage solution should you use?

Question 247 (medium, multiple choice)

You are running a batch processing job using Azure Data Factory. The job reads from Azure Blob Storage, transforms data with a Data Flow, and writes to Azure Synapse Analytics. The job fails intermittently with the error: 'Operation on target WriteToSynapse failed: Cannot bulk load because the file could not be opened.' You need to resolve the issue with minimal downtime. What should you do?

Question 248 (hard, multiple choice)

Refer to the exhibit. You have a Mapping Data Flow in Azure Data Factory that reads JSON files from a folder partitioned by year/month/day. The source setting includes a row limit of 10,000. The sink writes Parquet files with a file pattern and partition columns. You notice that the job processes only the first 10,000 rows from the entire dataset instead of 10,000 rows per partition. How should you modify the data flow to apply the row limit to each partition?

Exhibit

```json
{
  "name": "dataflow",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        {
          "name": "source1",
          "dataset": {
            "referenceName": "JsonDataset",
            "type": "DatasetReference"
          },
          "sourceSettings": {
            "rowLimit": 10000,
            "partitionRootPath": "year=2024/month=01/day=15"
          }
        }
      ],
      "sinks": [
        {
          "name": "sink1",
          "dataset": {
            "referenceName": "ParquetDataset",
            "type": "DatasetReference"
          },
          "sinkSettings": {
            "filePattern": "output_*.parquet",
            "partitionBy": [
              {
                "name": "year",
                "type": "Dynamic",
                "value": "2024"
              },
              {
                "name": "month",
                "type": "Dynamic",
                "value": "01"
              }
            ]
          }
        }
      ],
      "transformations": [
        {
          "name": "filter1",
          "type": "Filter",
          "inputs": ["source1"]
        }
      ]
    }
  }
}
```

Question 249 (easy, multiple choice)

You are developing a data processing solution that requires aggregating sales data from multiple CSV files stored in Azure Data Lake Storage Gen2. The data should be cleansed and transformed before loading into Azure Synapse Analytics. Which Azure service should you use to implement a code-free transformation pipeline?

Question 250 (medium, multiple choice)

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. You notice that queries are slower than expected. The files are large (500 MB each) and not partitioned. You need to improve query performance without moving data. What should you do?

Question 251 (hard, multiple choice)

You have a streaming pipeline using Azure Stream Analytics that ingests data from Event Hubs and outputs to Azure Synapse Analytics. The job has a high watermark delay and is falling behind. You need to reduce the latency. Which action should you take?

Question 252 (easy, multiple choice)

You need to process a large dataset that contains personally identifiable information (PII). The data must be anonymized before being used for analytics. Which Azure service should you use to apply column-level masking dynamically?

Question 253 (medium, multiple choice)

You are designing a data processing solution that requires running custom Python scripts for data transformation. The scripts have dependencies on specific libraries that are not pre-installed in the environment. You need a fully managed, serverless compute option that allows you to install custom libraries. Which service should you use?

Question 254 (hard, multiple choice)

You have a Data Factory pipeline that runs a U-SQL script in Azure Data Lake Analytics. The script processes terabytes of data and outputs to a CSV file. The pipeline is failing with the error: 'The job failed with UserError: Script execution failed.' You need to troubleshoot the issue. Which approach should you take first?

Question 255 (easy, multi-select)

Which TWO options are correct for configuring a copy activity in Azure Data Factory to load data from Azure Blob Storage to Azure SQL Database?

Question 256 (medium, multi-select)

Which THREE options are valid ways to transform data in Azure Synapse Analytics?

Question 257 (hard, multi-select)

Which TWO options are correct about using Azure Stream Analytics for real-time data processing?

Question 258 (medium, multi-select)

Which TWO options are correct approaches to handle schema drift in Azure Data Factory Mapping Data Flows?

Question 259 (medium, multiple choice)

You are a data engineer at a retail company. You need to design a batch processing solution that ingests daily sales data from multiple stores. Each store uploads a CSV file to a dedicated folder in Azure Data Lake Storage Gen2. The files have the same schema but may have minor variations in column order and include null values. After ingestion, you must clean the data by removing rows with null values in the 'SalesAmount' column, convert the 'Date' column from string to date type, and aggregate sales by product category. The output should be stored as Parquet files partitioned by year and month in the same Data Lake. You need to choose a compute service and implement the transformation with minimal coding effort. The solution must be cost-effective and require no cluster management. What should you do?

Question 260 (hard, multiple choice)

You are developing a real-time data processing solution for a financial services company. The system ingests stock trade data from Azure Event Hubs at 50,000 events per second. Each event is a JSON object with fields: TradeID, Symbol, Price, Quantity, Timestamp. You need to calculate a 5-minute rolling average of the trade price per symbol and store the result in Azure Cosmos DB for low-latency queries. Additionally, you need to detect anomalies where the price deviates more than 10% from the rolling average within the same window, and send alerts to Azure Event Grid. You must minimize latency and ensure that the processing is stateful across multiple partitions. What should you do?
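
The stateful logic this question describes (a per-symbol 5-minute rolling average plus a 10% deviation check) can be sketched in plain Python. This is only a model of the computation, not of any Azure service. It assumes events arrive in timestamp order, and it interprets "deviates from the rolling average" as a comparison against the earlier trades still inside the window; the class and method names are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # 5-minute window from the question
THRESHOLD = 0.10      # 10% deviation from the question


class RollingAverages:
    """Keeps per-symbol state: the (timestamp, price) pairs inside the window."""

    def __init__(self):
        self._trades = defaultdict(deque)
        self._sums = defaultdict(float)

    def add(self, symbol: str, price: float, ts: float):
        """Record a trade; return (rolling_average, is_anomaly)."""
        window = self._trades[symbol]
        # Evict trades that have fallen out of the 5-minute window.
        while window and window[0][0] <= ts - WINDOW_SECONDS:
            _, old_price = window.popleft()
            self._sums[symbol] -= old_price
        # Compare the new price with the average of the trades still in the window.
        is_anomaly = False
        if window:
            baseline = self._sums[symbol] / len(window)
            is_anomaly = abs(price - baseline) > THRESHOLD * baseline
        window.append((ts, price))
        self._sums[symbol] += price
        return self._sums[symbol] / len(window), is_anomaly


state = RollingAverages()
print(state.add("MSFT", 100.0, ts=0))    # (100.0, False)
print(state.add("MSFT", 102.0, ts=60))   # (101.0, False)
print(state.add("MSFT", 120.0, ts=120))  # flagged: 120 is more than 10% above 101
```

Because the state is keyed by symbol, all trades for a given symbol must reach the same processing instance, which is the partitioning concern the question raises.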

Question 261 (medium, multiple choice)

A company uses Azure Synapse Analytics dedicated SQL pool. The data engineering team notices that queries against a large fact table are running slowly. The table uses round-robin distribution and has a columnstore index. The team wants to improve query performance without adding more resources. Which action should the team take?

Question 262 (hard, multiple choice)

You are designing a data pipeline in Azure Data Factory to load incremental changes from an on-premises SQL Server database to Azure Synapse Analytics. The source table has over 1 billion rows and a datetime column 'LastUpdated' that is indexed but not always increasing. The requirement is to capture all changes with minimal latency and no missed rows. Which approach should you recommend?
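
The phrase "not always increasing" matters because a high-watermark filter silently skips any row whose 'LastUpdated' value is earlier than the stored watermark. The plain-Python sketch below illustrates that failure mode with hypothetical rows and timestamps; it models the filter logic only and is not Data Factory code.

```python
from datetime import datetime


def incremental_rows(rows, watermark):
    """Rows whose LastUpdated is strictly later than the stored watermark."""
    return [r for r in rows if r["LastUpdated"] > watermark]


# Run 1 picks up row 1 and advances the watermark to its timestamp.
table = [{"id": 1, "LastUpdated": datetime(2024, 1, 1, 10, 5)}]
run1 = incremental_rows(table, datetime(2024, 1, 1, 10, 0))
watermark = max(r["LastUpdated"] for r in run1)

# A row then appears with a LastUpdated earlier than the watermark.
table.append({"id": 2, "LastUpdated": datetime(2024, 1, 1, 10, 3)})
run2 = incremental_rows(table, watermark)

print([r["id"] for r in run1])  # [1]
print([r["id"] for r in run2])  # [] : row 2 is never copied
```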

Question 263 (easy, multiple choice)

You are building a data transformation in Azure Databricks using PySpark. The data includes a column 'timestamp' in string format 'yyyy-MM-dd HH:mm:ss'. You need to convert this to a timestamp type and extract the date part for partitioning. Which code snippet should you use?
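
For reference, the same two steps (parse the string, then take the date part) look like this in plain Python. This is not PySpark: the Spark pattern 'yyyy-MM-dd HH:mm:ss' corresponds to the `strptime` pattern used below, and in PySpark the analogous functions are `to_timestamp` and `to_date` from `pyspark.sql.functions`. The helper name `parse_and_split` is hypothetical.

```python
from datetime import datetime


def parse_and_split(value: str):
    """Parse 'yyyy-MM-dd HH:mm:ss' text and return (timestamp, date part)."""
    ts = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return ts, ts.date()


ts, day = parse_and_split("2024-03-15 08:30:00")
print(ts.isoformat(), day.isoformat())  # 2024-03-15T08:30:00 2024-03-15
```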

Question 264 (medium, multiple choice)

A data engineer is tasked with optimizing a Spark job in Azure Synapse Analytics that processes 10 TB of data daily. The job currently uses 50 executors with 4 cores each. The performance is bottlenecked by shuffle operations. The engineer wants to reduce shuffle data size. Which technique should be applied?

Question 265 (hard, multiple choice)

Your company uses Azure Data Lake Storage Gen2 and Azure Databricks for data processing. Some Parquet files in the lake are written with a schema that includes a column 'address' of struct type. A downstream process expects 'address' to be a string. You need to transform the data in a way that minimizes read overhead and does not rewrite the entire dataset. Which approach should you use?

Question 266 (easy, multiple choice)

You are monitoring an Azure Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL Database. The pipeline fails intermittently with the error: 'Operation on target SQL table failed: String or binary data would be truncated.' Which action should you take to resolve this issue?

Question 267 (medium, multiple choice)

A company is ingesting streaming data from IoT devices into Azure Event Hubs. The data must be processed in near real-time and stored in Azure Synapse Analytics for reporting. The solution must handle late-arriving data and ensure exactly-once semantics. Which Azure service should you use for stream processing?

Question 268 (hard, multiple choice)

You are designing a data processing solution for a large e-commerce company. The data includes sensitive customer information that must be encrypted at rest and in transit. The solution uses Azure Data Lake Storage Gen2 and Azure Databricks. Compliance requires that all access to the data is logged and monitored. Which combination of security features should you implement?

Question 269 (easy, multiple choice)

You are running a Python script in Azure Databricks that reads a CSV file from DBFS. The script runs successfully in an interactive notebook but fails when executed as a job with the error: 'Path does not exist: dbfs:/tmp/data.csv'. What is the most likely cause?

Question 270 (medium, multi-select)

A data engineer is building a pipeline in Azure Data Factory to load data from Azure Blob Storage to Azure Synapse Analytics using PolyBase. The pipeline must handle large files efficiently and minimize load time. Which TWO actions should the engineer take? (Choose two.)

Question 271 (hard, multi-select)

You are designing a real-time data processing solution using Azure Stream Analytics. The input is from Azure Event Hubs, and the output is to Azure Synapse Analytics. The solution must guarantee exactly-once delivery to Synapse. Which THREE configurations are required? (Choose three.)

Question 272 (easy, multi-select)

A company uses Azure Data Lake Storage Gen2 as the data lake. The data engineering team needs to ensure that sensitive data such as credit card numbers are masked when queried by non-admin users. The solution must be implemented within the data lake without moving data to another store. Which TWO features should they use? (Choose two.)

Question 273 (medium, multiple choice)

Refer to the exhibit. A data engineer is reviewing a Mapping Data Flow in Azure Data Factory. The source reads 1000 rows from a CSV file. The sink writes to a Delta dataset with optimizeWrite enabled. The data flow currently has no transformations. The engineer runs the data flow and notices that the sink writes only 500 rows. What is the most likely cause?

Exhibit

```json
{
  "name": "transform-sales",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        {
          "name": "salesSource",
          "dataset": {"referenceName": "SalesCSV", "type": "DatasetReference"},
          "sourceSettings": {
            "rowLimit": 1000,
            "partitionOption": "none"
          }
        }
      ],
      "sinks": [
        {
          "name": "salesSink",
          "dataset": {"referenceName": "SalesDelta", "type": "DatasetReference"},
          "sinkSettings": {
            "optimizeWrite": true,
            "allowSchemaDrift": true
          }
        }
      ],
      "scriptLines": [
        "source(output( \n\t\tProductID as integer, \n\t\tSaleDate as string, \n\t\tAmount as decimal(10,2)\n\t), \n\tallowSchemaDrift: true, \n\tvalidateSchema: false) ~> salesSource",
        "salesSource sink(input(), \n\tallowSchemaDrift: true, \n\tvalidateSchema: false, \n\tpartitionBy('hash', 1)) ~> salesSink"
      ]
    }
  }
}
```

Question 274 (hard, multiple choice)

You are a data engineer for a global retail company. The company has a hybrid architecture with on-premises SQL Server databases and Azure Synapse Analytics. You need to design a data processing solution that ingests incremental changes from the on-premises SQL Server database (source) into Azure Synapse Analytics (sink) with low latency (under 15 minutes) and high reliability. The source database is 5 TB and experiences high transaction volume during business hours. The solution must minimize impact on the source system and handle schema changes automatically. You have the following options:

Option A: Use Azure Data Factory with a copy activity that uses a watermark column to query incremental changes every 10 minutes. The copy activity writes directly to the Synapse table using PolyBase.

Option B: Use Azure Data Factory with a mapping data flow that reads from the source using a SQL query with a watermark, performs transformations, and writes to Synapse using staging via Blob Storage and PolyBase.

Option C: Use SQL Server Integration Services (SSIS) running on Azure-SSIS Integration Runtime to extract data using change data capture (CDC) and load into Synapse.

Option D: Use Azure Databricks with Auto Loader to ingest files from a staging area that is populated by a separate log-shipping process from the source.

Which option should you choose?

Question 275 (medium, multiple choice)

A data engineering team is building a batch processing solution for a financial services company. Data is ingested daily from multiple sources into Azure Data Lake Storage Gen2 in CSV format. The data must be transformed (filtered, aggregated, joined) and loaded into Azure Synapse Analytics dedicated SQL pool. The team must optimize for cost and performance. The total data volume is 2 TB per day. The team has the following options:

Option A: Use Azure Data Factory pipelines with copy activity to load raw CSV files into Synapse staging tables, then use T-SQL stored procedures in Synapse to perform transformations.

Option B: Use Azure Databricks with Auto Loader to incrementally ingest CSV files, perform transformations in Spark, and write the results to Synapse using the Spark Synapse connector.

Option C: Use Azure Data Factory with mapping data flows to transform the data in a serverless environment and then write to Synapse.

Option D: Use Azure Synapse Pipelines (built on ADF) with a notebook activity that runs a PySpark notebook in Synapse Spark pool to transform and load data.

Which option should the team choose to minimize cost and management overhead while meeting performance requirements?

Question 276 (easy, multiple choice)

A small business uses Azure SQL Database as their operational data store. They want to build a simple reporting solution that refreshes daily. The data volume is about 10 GB. They have limited budget and no dedicated data engineering team. They need to extract data from Azure SQL Database, perform basic transformations (e.g., aggregations, joins), and load into a reporting table in the same database. They want to minimize administrative effort and cost. Which approach should they take?

Option A: Use Azure Data Factory with a copy activity to copy data from the source to a staging table, then use a stored procedure activity to run T-SQL transformations.

Option B: Use Azure Logic Apps to periodically query the source and insert transformed data into the reporting table.

Option C: Use Azure Databricks notebooks scheduled via Azure Data Factory to perform the ETL.

Option D: Use an Azure SQL Database elastic job to run a T-SQL script that performs the ETL within the database.

Question 277 (medium, multiple choice)

Read the full Develop data processing explanation →

You are implementing a data pipeline in Azure Synapse Analytics that uses PolyBase to load data from Azure Data Lake Storage Gen2 into a dedicated SQL pool. The pipeline runs nightly and processes approximately 500 GB of data. You notice that the load operation is slow and frequently times out. What should you do to improve performance?

Question 278 (hard, multi-select)

Read the full Develop data processing explanation →

You are designing a data processing solution for a retail company that uses Azure Databricks. The solution needs to process streaming sales data from Event Hubs and batch data from Azure Data Lake Storage Gen2. You need to ensure that the solution can handle late-arriving data and maintain exactly-once semantics. Which TWO technologies should you use?

Question 279 (easy, multi-select)

Read the full Develop data processing explanation →

You are developing a data pipeline in Azure Data Factory that ingests data from multiple on-premises SQL Server databases to Azure Data Lake Storage Gen2. The data volume is about 1 TB per day. You need to ensure the pipeline can handle the volume and provide monitoring and alerting. Which THREE components should you include?

Question 280 (hard, multiple choice)

Read the full Develop data processing explanation →

You are reviewing a mapping data flow in Azure Data Factory that reads a CSV file from ADLS Gen2 and writes to an Azure Synapse Analytics dedicated SQL pool. The data flow includes a Derived Column transformation with the expression: `column1 == "Error" ? toString(column1) : column1`. The pipeline fails with an error indicating that the sink table could not be created. What is the most likely cause?

Exhibit

Refer to the exhibit.
{
  "type": "MappingDataFlow",
  "typeProperties": {
    "sources": [
      {
        "dataset": {
          "referenceName": "ADLS_Source",
          "type": "DatasetReference"
        },
        "sourceSettings": {
          "sourceType": "DelimitedText",
          "columnDelimiter": ",",
          "rowDelimiter": "\n"
        }
      }
    ],
    "sinks": [
      {
        "dataset": {
          "referenceName": "Synapse_Sink",
          "type": "DatasetReference"
        },
        "sinkSettings": {
          "sinkType": "AzureSynapse",
          "allowCopyCommand": true,
          "tableOption": "autoCreate"
        }
      }
    ],
    "transformations": [
      {
        "name": "DerivedColumn",
        "type": "DerivedColumn",
        "script": "column1 == \"Error\" ? toString(column1) : column1"
      }
    ]
  }
}

Question 281 (hard, multiple choice)

Read the full Develop data processing explanation →

You are a data engineer for a healthcare company that processes patient data. You have an Azure Databricks workspace with a cluster configured for data processing. You need to implement a solution that processes streaming data from Azure Event Hubs, enriches it with reference data stored in Azure Cosmos DB, and writes the output to Delta Lake in Azure Data Lake Storage Gen2. The solution must ensure that the data processing is fault-tolerant and can handle schema evolution. The reference data is updated infrequently. You need to choose an approach that minimizes complexity and cost. What should you do?

Question 282 (medium, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution for an e-commerce company that uses Azure Synapse Analytics. The solution must process clickstream data from a web application. The data arrives in JSON format through Azure Event Hubs. You need to load the data into a dedicated SQL pool every 5 minutes with minimal latency. The data volume is about 100 MB every 5 minutes. You want to use PolyBase for loading. Which approach should you use?

Question 283 (easy, multiple choice)

Read the full Develop data processing explanation →

You are a data engineer at a financial services company. You are developing a data processing pipeline that uses Azure Data Factory to copy transactional data from an Azure SQL Database to Azure Data Lake Storage Gen2. The pipeline runs daily and processes about 10 GB of data. You need to implement error handling for the pipeline. Specifically, if the copy activity fails due to a transient error, the pipeline should retry automatically. If the retry fails, the pipeline should log the error and send an email alert to the operations team. What should you do?

Question 284 (medium, multiple choice)

Read the full Develop data processing explanation →

You are designing a data lakehouse architecture in Azure using Delta Lake. The solution needs to process batch and streaming data from multiple sources, including IoT devices and CRM systems. You need to ensure data quality by enforcing schema validation and handling schema evolution. You also need to provide a unified catalog for querying. Which service should you use?

Question 285 (hard, multiple choice)

Read the full Develop data processing explanation →

You are a data engineer working for a logistics company. You have an existing Azure Data Factory pipeline that ingests data from a REST API to Azure Data Lake Storage Gen2. The API has rate limiting that can cause failures. You need to implement a solution that can handle rate limiting by retrying with exponential backoff. The pipeline should also log the number of retries for each API call. What should you do?
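The retry behavior this scenario describes can be sketched in plain Python. This is only a minimal illustration of exponential backoff with a retry counter, not an Azure Data Factory feature and not a statement of which answer is correct. `RateLimitError`, `call_with_backoff`, and all parameter names are hypothetical, and the sketch assumes the client raises an exception when the API reports rate limiting.

```python
import logging
import time

logger = logging.getLogger("api_ingest")  # hypothetical logger name


class RateLimitError(Exception):
    """Hypothetical error raised when the API reports rate limiting."""


def call_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Run call(); on RateLimitError, wait 1x, 2x, 4x, ... base_delay and retry.

    Returns (result, retries) so the caller can log the retry count per call.
    """
    retries = 0
    while True:
        try:
            result = call()
            logger.info("call succeeded after %d retries", retries)
            return result, retries
        except RateLimitError:
            if retries >= max_retries:
                logger.error("giving up after %d retries", retries)
                raise
            # Double the delay on each retry, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** retries)
            retries += 1
            logger.warning("rate limited; retry %d in %.1f s", retries, delay)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. A production version would usually add random jitter to the delay so that many clients do not retry at the same moment.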

Question 286 (medium, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution for a media company that uses Azure Synapse Analytics. The solution must process video metadata stored in Azure Cosmos DB and combine it with user interaction data from Azure Data Lake Storage Gen2. The combined data must be stored in a dedicated SQL pool for reporting. The data volume is moderate, and the processing should be done using T-SQL. Which approach should you use?

Question 287 (easy, multiple choice)

Read the full Develop data processing explanation →

You are developing a data processing pipeline in Azure Databricks that processes streaming data from Azure Event Hubs. You need to ensure that the pipeline can recover from failures and process data exactly once. The pipeline writes to Delta Lake. Which approach should you use?

Question 288 (hard, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution for a retail company that uses Azure Synapse Analytics. The solution must process point-of-sale (POS) data from multiple stores. The data arrives in CSV files in Azure Data Lake Storage Gen2. Each store sends a file every hour. You need to process the files as they arrive and load the data into a dedicated SQL pool. The solution must handle late-arriving files (files that arrive after the scheduled processing time) and ensure that the data is consistent. Which approach should you use?

Question 289 (medium, multiple choice)

Read the full Develop data processing explanation →

You are a data engineer at a manufacturing company. You need to process sensor data from IoT devices that arrives in real time. The data is sent to Azure Event Hubs. You need to aggregate the data over 5-minute windows and store the results in Azure Data Lake Storage Gen2 in Parquet format. The solution should minimize cost and use serverless components. Which solution should you use?
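To make "aggregate over 5-minute windows" concrete, here is a minimal plain-Python sketch of a tumbling-window average. It illustrates only the windowing arithmetic, independent of any Azure service; the function names and the `(device_id, timestamp, value)` record shape are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60  # fixed, non-overlapping 5-minute windows


def window_start(ts):
    """Round a timezone-aware timestamp down to the start of its window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def aggregate(readings):
    """Average sensor values per (device_id, window start).

    readings: iterable of (device_id, timestamp, value) tuples (assumed shape).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for device_id, ts, value in readings:
        key = (device_id, window_start(ts))
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Each reading belongs to exactly one window, which is what distinguishes a tumbling window from hopping or sliding windows, where an event can contribute to several overlapping windows.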

Question 290 (easy, multiple choice)

Read the full Develop data processing explanation →

You are designing a data processing solution for a marketing company that uses Azure Synapse Analytics. The solution needs to process customer data from multiple sources, including CRM and web analytics. The data must be cleansed and transformed before loading into a dedicated SQL pool. The transformations include string manipulations, date conversions, and lookups. You need to choose a serverless transformation approach that integrates with Azure Synapse pipelines. Which approach should you use?

Question 291 (medium, multiple choice)

Read the full Develop data processing explanation →

You are developing a data processing pipeline for a gaming company that uses Azure Databricks. The pipeline processes game event data from Azure Event Hubs. You need to detect cheating patterns by analyzing events in real time. The solution must be able to handle high throughput and low latency. The output should be written to Azure Cosmos DB for real-time dashboards. Which approach should you use?

Question 292 (easy, multi-select)

Read the full Develop data processing explanation →

Which TWO actions should you take to optimize a Spark job in Azure Synapse Analytics that is experiencing excessive shuffling and long execution times?

Question 293 (medium, multi-select)

Read the full Develop data processing explanation →

Which THREE factors should you consider when designing a real-time streaming solution using Azure Stream Analytics to process IoT data from thousands of devices?

Question 294 (hard, multi-select)

Read the full Develop data processing explanation →

Which TWO techniques should you use to secure sensitive data in Azure Synapse Analytics dedicated SQL pools when implementing column-level security?

Question 295 (easy, multiple choice)

Read the full Develop data processing explanation →

You are a data engineer at a retail company. You need to develop a data processing solution in Azure Synapse Analytics that reads sales transactions from Parquet files stored in Azure Data Lake Storage Gen2, transforms the data by aggregating daily sales per store, and writes the results to a dedicated SQL pool table for reporting. The transformation logic must be reusable and maintained in a source control system. You want to minimize administrative overhead and leverage serverless resources where possible. Which approach should you recommend?

Question 296 (medium, multiple choice)

Read the full Develop data processing explanation →

Your company is building a real-time dashboard for monitoring website traffic. The data is ingested from web servers into Azure Event Hubs. You need to design a stream processing solution using Azure Stream Analytics that computes the number of unique visitors per minute, per country, and outputs the results to Azure Synapse Analytics for reporting. The solution must handle out-of-order events with a maximum late arrival of 10 seconds. You also need to ensure exactly-once semantics for the output. Which combination of settings should you use?
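The computation in this scenario (distinct visitors per minute, per country, with a 10-second lateness limit) can be sketched in plain Python. This is a simplified model and not Stream Analytics itself: it assumes each event carries both an event time and an arrival time in epoch seconds, and it simply drops events that arrive more than 10 seconds late, whereas a real job can be configured to adjust such events instead. The function name and field names are hypothetical.

```python
from collections import defaultdict

LATE_TOLERANCE_S = 10  # maximum accepted lateness, from the scenario
WINDOW_S = 60          # one-minute tumbling windows


def unique_visitors(events):
    """Count distinct visitor IDs per (window start, country).

    events: iterable of dicts with event_time and arrival_time (epoch
    seconds), country, and visitor_id. The field names are assumptions.
    """
    seen = defaultdict(set)
    for e in events:
        if e["arrival_time"] - e["event_time"] > LATE_TOLERANCE_S:
            continue  # arrived too late; dropped in this simplified model
        window = e["event_time"] - e["event_time"] % WINDOW_S
        seen[(window, e["country"])].add(e["visitor_id"])
    return {key: len(ids) for key, ids in seen.items()}
```

Windows are assigned by event time rather than arrival time, so an event that arrives slightly out of order still lands in the minute in which it actually occurred.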

Question 297 (hard, multiple choice)

Read the full Develop data processing explanation →

You are designing a batch processing solution for a financial services company that processes transactions from multiple sources. The data is stored in Azure Data Lake Storage Gen2 in Parquet format. You need to perform complex transformations including joins, aggregations, and window functions, and then load the results into an Azure Synapse Analytics dedicated SQL pool. The transformations must be written in SQL and executed on a serverless Spark cluster to minimize costs. You want to manage the code in a Git repository and automate the deployment using Azure DevOps. Which approach should you recommend?