DP-203 Develop data processing • Complete Question Bank
Complete DP-203 Develop data processing question bank, with answers and detailed explanations.
Refer to the exhibit.
{
"name": "IncrementalLoad",
"properties": {
"activities": [
{
"name": "WatermarkQuery",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT MAX(LastModified) AS NewWatermark FROM source_table"
},
"dataset": {
"referenceName": "AzureSqlTable",
"type": "DatasetReference"
}
}
},
{
"name": "CopyData",
"type": "Copy",
"dependsOn": [
{
"activity": "WatermarkQuery",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM source_table WHERE LastModified > '@{activity('WatermarkQuery').output.firstRow.NewWatermark}'"
},
"sink": {
"type": "AzureSqlSink",
"preCopyScript": "TRUNCATE TABLE target_table"
}
},
"inputs": [
{
"referenceName": "AzureSqlTable",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlTable",
"type": "DatasetReference"
}
]
}
]
}
}
Refer to the exhibit.
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"resources": [
{
"type": "Microsoft.Synapse/workspaces/integrationRuntimes",
"apiVersion": "2021-06-01-preview",
"name": "[concat(parameters('workspaceName'), '/MyManagedVNetIR')]",
"properties": {
"type": "Managed",
"typeProperties": {
"computeProperties": {
"location": "AutoResolve",
"dataFlowProperties": {
"computeType": "General",
"coreCount": 8,
"timeToLive": 10
}
}
}
}
}
]
}
Refer to the exhibit.
{
"name": "sales_orders",
"properties": {
"folder": "orders",
"type": "AzureBlobFSLocation",
"linkedServiceName": {
"referenceName": "ADLSGen2",
"type": "LinkedServiceReference"
},
"typeProperties": {
"fileName": "sales_orders.parquet",
"folderPath": "data/orders/year=2023/month=01/day=15/"
},
"compressionCodec": "snappy",
"columnDelimiter": ","
}
}
Refer to the exhibit.
CREATE EXTERNAL DATA SOURCE myDataSource
WITH (
LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer'
);
CREATE EXTERNAL FILE FORMAT myFileFormat
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
CREATE EXTERNAL TABLE dbo.Sales (
SaleID int,
ProductName varchar(100),
SaleDate date,
Amount decimal(10,2)
)
WITH (
LOCATION = 'sales/',
DATA_SOURCE = myDataSource,
FILE_FORMAT = myFileFormat
);
Refer to the exhibit.
{
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"dataset": {
"referenceName": "DelimitedTextSource",
"type": "DatasetReference"
},
"script": "source(\n output() as (\n col1 string,\n col2 string\n ),\n allowSchemaDrift: true,\n validateSchema: false\n) ~> Source1"
}
],
"sinks": [
{
"dataset": {
"referenceName": "AzureSynapseTableSink",
"type": "DatasetReference"
},
"script": "Source1 sink(\n input() as (\n col1 string,\n col2 string\n ),\n allowSchemaDrift: true,\n validateSchema: false\n) ~> Sink1"
}
]
}
}
Refer to the exhibit.
{
"name": "AggregateProductSales",
"properties": {
"folder": {
"name": "Sales"
},
"content": {
"jobType": "SparkJob",
"jobDefinition": {
"file": "abfss://container@storage.dfs.core.windows.net/synapse/workspaces/workspace/sparkjobdefinitions/aggregate_sales.py",
"conf": {
"spark.dynamicAllocation.enabled": "false",
"spark.executor.instances": 10,
"spark.executor.cores": 4,
"spark.executor.memory": "8g"
}
}
}
}
}
Refer to the exhibit.
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"resources": [
{
"type": "Microsoft.Synapse/workspaces/integrationRuntimes",
"apiVersion": "2021-06-01-preview",
"name": "[concat(parameters('workspaceName'), '/AutoResolveIntegrationRuntime')]",
"properties": {
"type": "Managed",
"typeProperties": {
"computeProperties": {
"location": "AutoResolve",
"dataFlowProperties": {
"computeType": "General",
"coreCount": 8,
"timeToLive": 10
},
"pipelineExternalComputeScaleProperties": {
"numberOfExternalNodes": 1,
"numberOfPipelineNodes": 1
}
}
}
}
}
]
}
Refer to the exhibit.
{
"name": "CopyDataPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SQLDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "00:00:30",
"sqlWriterUseTableLock": true
},
"enableStaging": false
}
}
]
}
}
Refer to the exhibit.
// KQL query in Azure Data Explorer
let startTime = datetime(2024-01-01);
let endTime = datetime(2024-01-02);
T
| where Timestamp between (startTime .. endTime)
| summarize count() by bin(Timestamp, 1h)
| render timechart
Refer to the exhibit.
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"resources": [
{
"type": "Microsoft.Synapse/workspaces",
"apiVersion": "2021-06-01",
"name": "myworkspace",
"location": "[resourceGroup().location]",
"properties": {
"defaultDataLakeStorage": {
"accountUrl": "https://mystorage.dfs.core.windows.net",
"filesystem": "myfilesystem"
},
"managedResourceGroupName": "synapse-managed-rg"
}
}
]
}

{
"name": "CopyDataFromBlobToSynapse",
"properties": {
"activities": [
{
"name": "CopyData",
"type": "Copy",
"inputs": [{"referenceName": "BlobSource", "type": "DatasetReference"}],
"outputs": [{"referenceName": "SynapseSink", "type": "DatasetReference"}],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "SqlPoolSink",
"writeBatchSize": 10000,
"preCopyScript": "TRUNCATE TABLE dbo.Sales"
},
"translator": {
"type": "TabularTranslator",
"mappings": [
{"source": "ProductID", "sink": "ProductID"},
{"source": "SaleDate", "sink": "SaleDate"},
{"source": "Quantity", "sink": "Quantity"},
{"source": "Amount", "sink": "Amount"}
]
}
}
}
]
}
}

CREATE EXTERNAL DATA SOURCE MyDataSource
WITH (
LOCATION = 'abfss://container@storageaccount.dfs.core.windows.net',
CREDENTIAL = MyCredential
);
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
CREATE EXTERNAL TABLE dbo.ExternalSales
(
ProductID INT,
SaleDate DATE,
Quantity INT,
Amount DECIMAL(10,2)
)
WITH (
LOCATION = '/sales/',
DATA_SOURCE = MyDataSource,
FILE_FORMAT = ParquetFormat
);

{
"name": "SalesAggregation",
"properties": {
"activities": [
{
"name": "Notebook1",
"type": "SynapseNotebook",
"dependsOn": [],
"typeProperties": {
"notebook": "SalesAggregationNotebook",
"parameters": {}
},
"linkedServiceName": {
"referenceName": "mySparkPool",
"type": "LinkedServiceReference"
}
}
],
"parameters": {
"inputDate": {
"type": "string",
"defaultValue": "2024-01-01"
}
}
}
}

{
"type": "Microsoft.DataFactory/factories/pipelines",
"properties": {
"activities": [
{
"name": "CopyData",
"type": "Copy",
"policy": {
"retry": 2,
"timeout": "0:10:00"
},
"inputs": [
{
"referenceName": "DelimitedTextInput",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "ParquetOutput",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "ParquetSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "ParquetWriteSettings"
}
},
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "Name",
"type": "String"
},
"sink": {
"name": "FullName",
"type": "String"
}
}
]
}
}
}
]
}
}

{
"type": "Microsoft.Synapse/workspaces/databases/tables",
"properties": {
"source": {
"provider": "ABFS",
"location": "abfss://container@storage.dfs.core.windows.net/data/"
},
"format": {
"type": "parquet",
"derivedModel": false
},
"options": {
"recursive": true
}
}
}

{
"name": "Policy to enforce HTTPS on storage accounts",
"properties": {
"policyRule": {
"if": {
"allOf": [
{ "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
{ "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly", "equals": "false" }
]
},
"then": {
"effect": "deny"
}
}
}
}

{
"name": "List all external tables in a serverless SQL pool",
"query": "SELECT name, location FROM sys.external_tables WHERE type = 'HADOOP'"
}

{
"type": "Microsoft.Synapse/workspaces/integrationRuntimes",
"apiVersion": "2021-06-01-preview",
"properties": {
"type": "SelfHosted",
"description": "Integration runtime for on-premises data sources"
}
}
Refer to the exhibit. The following is a JSON policy for an Azure Data Lake Storage Gen2 storage account:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"ManagedIdentity": "..."
},
"Action": [
"Microsoft.Storage/storageAccounts/blobServices/containers/read"
],
"Resource": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Storage/storageAccounts/.../blobServices/default/containers/data/...",
"Condition": {
"Bool": {
"acs:RequestVersion": "2019-12-12"
}
}
}
]
}
Refer to the exhibit. The following is an Azure CLI output after running a command on a Synapse Analytics workspace:
{
"name": "myworkspace",
"type": "Microsoft.Synapse/workspaces",
"location": "eastus",
"properties": {
"defaultDataLakeStorage": {
"accountUrl": "https://mydatalake.dfs.core.windows.net",
"filesystem": "myfilesystem"
},
"sqlAdministratorLogin": "adminuser",
"managedResourceGroupName": "managedRG",
"provisioningState": "Succeeded",
"privateEndpointConnections": []
}
}
Refer to the exhibit. The following is a KQL query run in Azure Data Explorer:
let T = datatable(Id:int, Name:string, Age:int)
[
1, 'Alice', 30,
2, 'Bob', 25,
3, 'Charlie', 35
];
T
| where Age > 25
| project Name, Age

{
"name": "CopyDataFromBlobToSQL",
"properties": {
"activities": [
{
"name": "CopyData",
"type": "Copy",
"inputs": [{"referenceName": "BlobSourceDataset", "type": "DatasetReference"}],
"outputs": [{"referenceName": "SQLSinkDataset", "type": "DatasetReference"}],
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"preCopyScript": "TRUNCATE TABLE dbo.Customer"
}
}
}
]
}
}

{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"resources": [
{
"type": "Microsoft.Synapse/workspaces/integrationRuntimes",
"apiVersion": "2021-06-01-preview",
"name": "[concat(parameters('workspaceName'), '/MyIntegrationRuntime')]",
"properties": {
"type": "Managed",
"typeProperties": {
"computeProperties": {
"location": "AutoResolve",
"nodeSize": "Standard_D3_v2",
"numberOfNodes": 2,
"maxParallelExecutionsPerNode": 2
}
}
}
}
]
}

{
"name": "CopyTradeData",
"properties": {
"activities": [
{
"name": "CopyData",
"type": "Copy",
"inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
"outputs": [{"referenceName": "SinkDataset", "type": "DatasetReference"}],
"typeProperties": {
"source": {
"type": "AzureBlobStorageSource",
"recursive": true
},
"sink": {
"type": "AzureSqlSink",
"writeBatchSize": 10000,
"preCopyScript": "TRUNCATE TABLE dbo.Trades"
}
}
}
]
}
}

CREATE EXTERNAL DATA SOURCE SalesData
WITH (
LOCATION = 'https://mystorageaccount.dfs.core.windows.net/sales',
CREDENTIAL = StorageCred
);
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
CREATE EXTERNAL TABLE dbo.Sales (
SaleID INT,
ProductID INT,
SaleDate DATE,
Amount DECIMAL(10,2)
)
WITH (
LOCATION = 'parquet/sales/',
DATA_SOURCE = SalesData,
FILE_FORMAT = ParquetFormat
);

{
"dataflows": [
{
"name": "TransformSales",
"properties": {
"sources": [
{
"name": "SalesSource",
"dataset": {
"referenceName": "SalesDataset",
"type": "DatasetReference"
}
}
],
"transformations": [
{
"name": "AggregateSales",
"type": "Aggregate",
"inputs": ["SalesSource"],
"aggregates": [
{
"column": "TotalAmount",
"function": "SUM",
"input": "Amount"
}
]
}
],
"sink": {
"name": "SalesSink",
"dataset": {
"referenceName": "AggregatedSalesDataset",
"type": "DatasetReference"
}
}
}
}
]
}
Refer to the exhibit.
{
"name": "CopyFromBlobToSQL",
"properties": {
"activities": [
{
"name": "CopyData",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobInput",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SQLOutput",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"preCopyScript": "TRUNCATE TABLE dbo.Destination"
}
}
}
]
}
}
Refer to the exhibit.
{
"type": "Microsoft.Synapse/workspaces/sqlPools",
"properties": {
"createMode": "Default",
"storageAccountType": "GRS",
"collation": "SQL_Latin1_General_CP1_CI_AS",
"maxSizeBytes": 263882790666240,
"sku": {
"name": "DW100c",
"tier": "ServiceObjective"
},
"restorePointInTime": null,
"sourceDatabaseId": null
}
}
Refer to the exhibit.
{
"inputAlias": "input",
"type": "Stream",
"output": {
"outputAlias": "output",
"type": "ReferenceData"
},
"query": "SELECT input.* FROM input JOIN output ON input.ProductId = output.ProductId"
}
Refer to the exhibit.
```json
{
"name": "CopyDataPipeline",
"properties": {
"activities": [
{
"name": "CopyData",
"type": "Copy",
"inputs": [{"referenceName": "BlobInputDataset", "type": "DatasetReference"}],
"outputs": [{"referenceName": "SQLOutputDataset", "type": "DatasetReference"}],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
},
"formatSettings": {
"type": "DelimitedTextReadSettings",
"skipLineCount": 1
}
},
"sink": {
"type": "SqlSink",
"writeBehavior": "insert",
"sqlWriterUseTableLock": false
},
"enableStaging": false
}
}
]
}
}
```
Refer to the exhibit.
```kusto
let StartDate = datetime(2024-01-01);
let EndDate = datetime(2024-01-31);
let TotalSales = materialize(
Sales
| where OrderDate between (StartDate .. EndDate)
| summarize TotalAmount = sum(Amount) by ProductID
);
TotalSales
| where TotalAmount > 10000
| join kind=inner (Products) on ProductID
| project ProductName, TotalAmount
| order by TotalAmount desc
```
You are a Data Engineer at Contoso Ltd. The company operates an e-commerce platform that generates streaming data from user interactions (clicks, page views, purchases) at a rate of 50,000 events per second. The data is ingested into Azure Event Hubs. You need to design a data processing solution that meets the following requirements:
1. Real-time dashboards must display aggregated metrics (e.g., total sales, active users) with a latency of less than 2 seconds.
2. Historical data must be stored in Azure Data Lake Storage Gen2 in Delta Lake format for batch analytics.
3. The solution must support exactly-once semantics for the streaming output to both the dashboard and the data lake.
4. You must use Azure Stream Analytics as the primary processing engine.
5. The output to the dashboard should use Azure Power BI, while the output to the data lake should use Azure Synapse Analytics serverless SQL pool for querying.
You have configured an Azure Stream Analytics job with Event Hubs as input. For output, you added a Power BI dataset and an Azure Data Lake Storage Gen2 output. However, you discover that the Power BI dataset is being updated with duplicate records, and the data lake output sometimes misses small windows of data during job restarts. Which two actions should you take to meet the requirements? (Choose two.)
A. Change the Power BI output to use the 'Exactly once' output mode.
B. Change the Event Hubs compatibility level to 1.0 to guarantee exactly-once.
C. Enable checkpointing in the Stream Analytics job and use the 'Exactly once' output mode for the Azure Data Lake Storage Gen2 output.
D. Use a second Stream Analytics job: one for Power BI with 'At least once' mode, and another for the data lake with 'Exactly once' mode.
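
The duplicate records described in this scenario are the typical symptom of at-least-once delivery, where events can be replayed after a restart. As a minimal illustration in plain Python (not Stream Analytics code), the sketch below compares an append-only sink, which double-counts a replayed event, with a sink that upserts on a unique key and is unaffected by the replay. The event shape and the `event_id` field are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (event_id, sale_amount).
Event = Tuple[str, float]


def append_only_total(deliveries: Iterable[Event]) -> float:
    """Append-only sink: every delivery is counted, including replays."""
    return sum(amount for _, amount in deliveries)


def idempotent_total(deliveries: Iterable[Event]) -> float:
    """Keyed sink: a replayed event overwrites its earlier copy."""
    store: Dict[str, float] = {}
    for event_id, amount in deliveries:
        store[event_id] = amount  # upsert keyed on event_id
    return sum(store.values())


# Simulate a restart that delivers the last event a second time.
events: List[Event] = [("e1", 10.0), ("e2", 5.0), ("e3", 2.5)]
replayed: List[Event] = events + [events[-1]]

print(append_only_total(replayed))  # the replayed event is counted twice
print(idempotent_total(replayed))   # the replayed event has no effect
```

The point of the design is that the sink, not the delivery channel, absorbs the duplicates: writing by key makes a repeated delivery harmless.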
{
"name": "dataflow",
"properties": {
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"name": "source1",
"dataset": {
"referenceName": "JsonDataset",
"type": "DatasetReference"
},
"sourceSettings": {
"rowLimit": 10000,
"partitionRootPath": "year=2024/month=01/day=15"
}
}
],
"sinks": [
{
"name": "sink1",
"dataset": {
"referenceName": "ParquetDataset",
"type": "DatasetReference"
},
"sinkSettings": {
"filePattern": "output_*.parquet",
"partitionBy": [
{
"name": "year",
"type": "Dynamic",
"value": "2024"
},
{
"name": "month",
"type": "Dynamic",
"value": "01"
}
]
}
}
],
"transformations": [
{
"name": "filter1",
"type": "Filter",
"inputs": ["source1"]
}
]
}
}
}

{
"name": "transform-sales",
"properties": {
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"name": "salesSource",
"dataset": {"referenceName": "SalesCSV", "type": "DatasetReference"},
"sourceSettings": {
"rowLimit": 1000,
"partitionOption": "none"
}
}
],
"sinks": [
{
"name": "salesSink",
"dataset": {"referenceName": "SalesDelta", "type": "DatasetReference"},
"sinkSettings": {
"optimizeWrite": true,
"allowSchemaDrift": true
}
}
],
"scriptLines": [
"source(output( \n\t\tProductID as integer, \n\t\tSaleDate as string, \n\t\tAmount as decimal(10,2)\n\t), \n\tallowSchemaDrift: true, \n\tvalidateSchema: false) ~> salesSource",
"salesSource sink(input(), \n\tallowSchemaDrift: true, \n\tvalidateSchema: false, \n\tpartitionBy('hash', 1)) ~> salesSink"
]
}
}
}

You are a data engineer for a global retail company. The company has a hybrid architecture with on-premises SQL Server databases and Azure Synapse Analytics. You need to design a data processing solution that ingests incremental changes from the on-premises SQL Server database (source) into Azure Synapse Analytics (sink) with low latency (under 15 minutes) and high reliability. The source database is 5 TB and experiences high transaction volume during business hours. The solution must minimize impact on the source system and handle schema changes automatically. You have the following options:
Option A: Use Azure Data Factory with a copy activity that uses a watermark column to query incremental changes every 10 minutes. The copy activity writes directly to the Synapse table using PolyBase.
Option B: Use Azure Data Factory with a mapping data flow that reads from the source using a SQL query with a watermark, performs transformations, and writes to Synapse using staging via Blob Storage and PolyBase.
Option C: Use SQL Server Integration Services (SSIS) running on Azure-SSIS Integration Runtime to extract data using change data capture (CDC) and load into Synapse.
Option D: Use Azure Databricks with Auto Loader to ingest files from a staging area that is populated by a separate log-shipping process from the source.
Which option should you choose?
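
Options A and B both depend on a watermark column. As a rough sketch of that pattern only, the example below uses Python's built-in `sqlite3` module as a stand-in for the source and sink. The `source_table`, `target_table`, and `watermark` tables and the `LastModified` column are hypothetical names chosen for illustration; they are not part of the scenario.

```python
import sqlite3


def incremental_copy(conn: sqlite3.Connection) -> int:
    """Copy rows changed since the stored watermark; return how many were copied."""
    cur = conn.cursor()
    old_mark = cur.execute("SELECT value FROM watermark").fetchone()[0]
    new_mark = cur.execute("SELECT MAX(LastModified) FROM source_table").fetchone()[0]
    if new_mark is None or new_mark <= old_mark:
        return 0  # nothing new since the last run

    # Bound the window on both sides so rows arriving mid-run wait for the next run.
    rows = cur.execute(
        "SELECT id, LastModified FROM source_table "
        "WHERE LastModified > ? AND LastModified <= ?",
        (old_mark, new_mark),
    ).fetchall()
    cur.executemany("INSERT INTO target_table (id, LastModified) VALUES (?, ?)", rows)

    # Advance the watermark only after the copy has been staged.
    cur.execute("UPDATE watermark SET value = ?", (new_mark,))
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_table (id INTEGER, LastModified TEXT);
CREATE TABLE target_table (id INTEGER, LastModified TEXT);
CREATE TABLE watermark (value TEXT);
INSERT INTO watermark VALUES ('2024-01-01T00:00:00');
INSERT INTO source_table VALUES
    (1, '2023-12-31T23:00:00'),
    (2, '2024-01-01T08:00:00'),
    (3, '2024-01-01T09:30:00');
""")

first_run = incremental_copy(conn)   # picks up the two rows after the watermark
second_run = incremental_copy(conn)  # no new changes, so nothing is copied
print(first_run, second_run)
```

Each run issues a query against the source table, which is the trade-off to weigh against the scenario's requirement to minimize impact on the source system.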
A data engineering team is building a batch processing solution for a financial services company. Data is ingested daily from multiple sources into Azure Data Lake Storage Gen2 in CSV format. The data must be transformed (filtered, aggregated, joined) and loaded into Azure Synapse Analytics dedicated SQL pool. The team must optimize for cost and performance. The total data volume is 2 TB per day. The team has the following options:
Option A: Use Azure Data Factory pipelines with copy activity to load raw CSV files into Synapse staging tables, then use T-SQL stored procedures in Synapse to perform transformations.
Option B: Use Azure Databricks with Auto Loader to incrementally ingest CSV files, perform transformations in Spark, and write the results to Synapse using the Spark Synapse connector.
Option C: Use Azure Data Factory with mapping data flows to transform the data in a serverless environment and then write to Synapse.
Option D: Use Azure Synapse Pipelines (built on ADF) with a notebook activity that runs a PySpark notebook in Synapse Spark pool to transform and load data.
Which option should the team choose to minimize cost and management overhead while meeting performance requirements?
A small business uses Azure SQL Database as their operational data store. They want to build a simple reporting solution that refreshes daily. The data volume is about 10 GB. They have limited budget and no dedicated data engineering team. They need to extract data from Azure SQL Database, perform basic transformations (e.g., aggregations, joins), and load into a reporting table in the same database. They want to minimize administrative effort and cost. Which approach should they take?
Option A: Use Azure Data Factory with a copy activity to copy data from the source to a staging table, then use a stored procedure activity to run T-SQL transformations.
Option B: Use Azure Logic Apps to periodically query the source and insert transformed data into the reporting table.
Option C: Use Azure Databricks notebooks scheduled via Azure Data Factory to perform the ETL.
Option D: Use Azure SQL Database elastic job to run a T-SQL script that performs the ETL within the database.
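
Options A and D both end in a SQL statement that joins and aggregates operational rows into a reporting table. To make that step concrete, here is a small sketch that uses Python's built-in `sqlite3` module in place of Azure SQL Database. The `customers`, `orders`, and `daily_sales_report` tables and their columns are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
CREATE TABLE daily_sales_report (
    region TEXT PRIMARY KEY, order_count INTEGER, total_amount REAL
);
INSERT INTO customers VALUES (1, 'East'), (2, 'West');
INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")


def refresh_report(db: sqlite3.Connection) -> None:
    """Rebuild the reporting table in a single transaction (full refresh)."""
    with db:  # commits on success, rolls back if either statement fails
        db.execute("DELETE FROM daily_sales_report")
        db.execute("""
            INSERT INTO daily_sales_report (region, order_count, total_amount)
            SELECT c.region, COUNT(*), SUM(o.amount)
            FROM orders AS o
            JOIN customers AS c ON c.customer_id = o.customer_id
            GROUP BY c.region
        """)


refresh_report(conn)
refresh_report(conn)  # safe to rerun: the table is rebuilt, not appended to

report = conn.execute(
    "SELECT region, order_count, total_amount "
    "FROM daily_sales_report ORDER BY region"
).fetchall()
print(report)
```

Because the whole transformation runs inside the database engine, no data leaves the database and no separate compute service is needed for the refresh itself.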
Refer to the exhibit.
{
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"dataset": {
"referenceName": "ADLS_Source",
"type": "DatasetReference"
},
"sourceSettings": {
"sourceType": "DelimitedText",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
}
],
"sinks": [
{
"dataset": {
"referenceName": "Synapse_Sink",
"type": "DatasetReference"
},
"sinkSettings": {
"sinkType": "AzureSynapse",
"allowCopyCommand": true,
"tableOption": "autoCreate"
}
}
],
"transformations": [
{
"name": "DerivedColumn",
"type": "DerivedColumn",
"script": "column1 == \"Error\" ? toString(column1) : column1"
}
]
}
}