DP-900 | Chapter 11 of 101 | Objective 3.2

Azure Data Factory

This chapter covers Azure Data Factory (ADF), a cloud-based data integration service that orchestrates and automates data movement and transformation. For the DP-900 exam, understanding ADF's core components — pipelines, activities, datasets, linked services, triggers, and integration runtimes — is critical as it represents a key part of the 'Analytics' domain (Objective 3.2). Expect 2-3 questions on ADF, focusing on its role in extract-transform-load (ETL) and extract-load-transform (ELT) workflows, how it differs from other Azure data services, and its primary use cases.

25 min read
Intermediate
Updated May 31, 2026

Azure Data Factory as a Data Pipeline Orchestrator

Think of Azure Data Factory (ADF) as a sophisticated logistics company that manages the movement of goods (data) between warehouses (data stores). The company has a central dispatcher (the ADF orchestrator) that plans routes (pipelines). Each route consists of multiple legs (activities): first, a truck picks up raw materials from a supplier's warehouse (source data store), then it may stop at a processing plant (data transformation) where materials are cleaned and packaged, and finally it delivers the finished goods to a distribution center (destination data store). The dispatcher can run multiple trucks simultaneously (parallel execution) and can schedule pickups at specific times (triggers). If a road is closed (source unavailable), the dispatcher can reroute (error handling). The company also keeps a log of every shipment (monitoring and logging). Just as a logistics company doesn't own the warehouses but only coordinates movement, ADF doesn't store data but orchestrates its movement and transformation across various Azure and external data services.

How It Actually Works

What is Azure Data Factory?

Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create, schedule, and orchestrate data pipelines at scale. It is designed to handle complex hybrid ETL/ELT scenarios, moving data between various on-premises and cloud data stores, and transforming it using compute services like Azure HDInsight, Azure Databricks, or Azure Synapse Analytics. ADF is serverless, meaning you don't need to manage any infrastructure; it scales automatically based on your workload.

How ADF Works Internally

ADF operates on a metadata-driven architecture. When you define a pipeline, ADF stores the definition as JSON metadata that the service manages for you. Execution is handled by the Integration Runtime (IR), which provides the compute environment for data movement and transformation. The core components are:

- Pipeline: A logical grouping of activities that together perform a unit of work. Pipelines can be triggered or run on demand.
- Activity: A single processing step in a pipeline. There are three main types:
  - Data movement activities: Copy data from a source to a sink.
  - Data transformation activities: Execute transformations using HDInsight, Databricks, etc.
  - Control activities: Orchestrate pipeline flow (e.g., ForEach, If Condition, Wait).
- Linked Service: Defines connection information to a data store or compute resource (like connection strings, authentication).
- Dataset: A named view of data that points to the data you want to use as input or output of an activity. It references a linked service and defines the structure (e.g., table name, file path).
- Trigger: Defines when a pipeline run should be executed. Types include schedule (time-based), tumbling window (for incremental processing), event-based (e.g., on blob creation), and manual.
- Integration Runtime (IR): The compute infrastructure used by ADF to perform activities. There are three types:
  - Azure IR: For data movement and transformation within Azure.
  - Self-hosted IR: For on-premises or network-restricted environments (requires installing a gateway on a local machine).
  - Azure-SSIS IR: For lifting and shifting SQL Server Integration Services (SSIS) packages to Azure.
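The way these components reference one another can be sketched as plain Python dictionaries that mirror ADF's JSON authoring format. This is a minimal illustration, not a deployable factory: every name (BlobStoreLS, RawSalesCsv, CleanSalesCsv, CopySalesPipeline) and the placeholder connection string are assumptions made for the example.

```python
import json

# Hypothetical linked service: holds connection information only.
linked_service = {
    "name": "BlobStoreLS",
    "properties": {
        "type": "AzureBlobStorage",
        # Placeholder value; a real factory would normally reference Key Vault.
        "typeProperties": {"connectionString": "<connection-string>"},
    },
}


def dataset(name: str, folder: str) -> dict:
    """Build a dataset: a pointer to data reached through a linked service."""
    return {
        "name": name,
        "properties": {
            "type": "DelimitedText",
            "linkedServiceName": {
                "referenceName": linked_service["name"],
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "sales",
                    "folderPath": folder,
                }
            },
        },
    }


raw_ds = dataset("RawSalesCsv", "raw")
clean_ds = dataset("CleanSalesCsv", "clean")

# The pipeline groups activities; the Copy activity names datasets, not data.
pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyRawToClean",
                "type": "Copy",
                "inputs": [{"referenceName": raw_ds["name"], "type": "DatasetReference"}],
                "outputs": [{"referenceName": clean_ds["name"], "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Reading the references bottom-up gives the chain the exam expects you to know: activity, then dataset, then linked service, then the data store itself.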

Data Flow Execution Details

When a pipeline executes, ADF goes through several phases:
1. Parsing: ADF reads the pipeline JSON definition and validates it.
2. Scheduling: Based on triggers, ADF creates pipeline runs.
3. Allocation: For each activity, ADF allocates the appropriate IR. A copy activity runs on a single IR, selected from the IRs referenced by its source and sink linked services.
4. Execution: Activities run in the order defined by dependencies. Data movement activities use the Copy Data engine, which can scale out across multiple Data Integration Units (DIUs, formerly called Data Movement Units or DMUs). A copy activity on the Azure IR can use up to 256 DIUs, and you can cap this with an explicit setting.
5. Monitoring: ADF emits logs to Azure Monitor, and you can view pipeline runs, activity runs, and errors in the ADF Monitor tab.

Key Configuration Parameters

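Data Integration Units (DIUs), formerly Data Movement Units (DMUs): Used by copy activities that run on the Azure IR. A DIU represents a bundle of CPU, memory, and network resources. The default setting is Auto, which lets the service choose a value for the source-sink pair; you can also set an explicit value, up to 256 for Azure-to-Azure copies.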

Parallel Copies: The number of parallel reads from the source. Default is 8 for file-based sources.

Staging: For cross-region copies or when the sink is behind a firewall, you can use staging via Azure Blob Storage to improve performance.
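As a rough sketch of where these settings live, the copy activity's typeProperties carry the performance options. In the authoring JSON the unit setting is named dataIntegrationUnits (Data Integration Units is the current name for what this chapter calls DMUs). The activity name, the StagingBlobLS linked service, the staging path, and the chosen values are all hypothetical.

```python
from typing import Optional


def copy_type_properties(
    diu: Optional[int] = None,
    parallel_copies: Optional[int] = None,
    staging_linked_service: Optional[str] = None,
    staging_path: Optional[str] = None,
) -> dict:
    """Build the typeProperties section of a Copy activity.

    Settings left as None are omitted, so the service applies its defaults.
    """
    props: dict = {
        "source": {"type": "AzureSqlSource"},
        "sink": {"type": "SqlDWSink", "allowPolyBase": True},
    }
    if diu is not None:
        props["dataIntegrationUnits"] = diu
    if parallel_copies is not None:
        props["parallelCopies"] = parallel_copies
    if staging_linked_service is not None:
        # Staged copy: data lands in interim Blob storage before the final load.
        props["enableStaging"] = True
        props["stagingSettings"] = {
            "linkedServiceName": {
                "referenceName": staging_linked_service,
                "type": "LinkedServiceReference",
            },
            "path": staging_path,
        }
    return props


tuned = copy_type_properties(
    diu=32,
    parallel_copies=8,
    staging_linked_service="StagingBlobLS",  # hypothetical linked service name
    staging_path="staging/sales",            # hypothetical container/folder
)
defaults = copy_type_properties()
print(tuned)
```

Leaving a setting out, as in the defaults example, is usually the safer starting point; explicit values are worth adding only after monitoring shows a bottleneck.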

Fault Tolerance: You can configure a copy activity to skip incompatible rows instead of failing. For transient errors, each activity has a policy with a retry count (default 0), a retry interval, and a timeout (default 7 days).
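The retry behaviour can be illustrated with a small simulation. The policy dictionary uses the property names from the activity policy in ADF's JSON (timeout, retry, retryIntervalInSeconds); the run_with_policy function and the flaky_copy stand-in are hypothetical helpers written for this sketch, and the wait between attempts is skipped so the example runs instantly.

```python
from typing import Callable, Tuple

# Activity policy as it would appear in pipeline JSON (values are examples).
policy = {"timeout": "7.00:00:00", "retry": 3, "retryIntervalInSeconds": 30}


def run_with_policy(action: Callable[[], str], policy: dict) -> Tuple[str, int]:
    """Run an action, retrying on ConnectionError up to policy['retry'] times.

    Returns the result and the number of attempts used.
    """
    max_attempts = 1 + policy["retry"]  # first attempt plus the retries
    attempts = 0
    while True:
        attempts += 1
        try:
            return action(), attempts
        except ConnectionError:
            if attempts >= max_attempts:
                raise
            # The real service would wait retryIntervalInSeconds here.


calls = {"count": 0}


def flaky_copy() -> str:
    """Hypothetical activity that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("transient network error")
    return "Succeeded"


result, attempts_used = run_with_policy(flaky_copy, policy)
print(result, attempts_used)
```

With the default retry count of 0, the same flaky activity would fail on its first error, which is why production pipelines usually set a non-zero retry for network-bound activities.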

Interaction with Related Technologies

ADF integrates closely with:
- Azure Monitor & Log Analytics: For monitoring pipeline runs and setting alerts.
- Azure Key Vault: To securely store credentials used in linked services.
- Azure DevOps/GitHub: For source control and CI/CD of ADF artifacts.
- Azure Synapse Analytics: ADF can load data into Synapse using PolyBase or the COPY statement for high throughput.
- Power BI: ADF can trigger Power BI data refreshes after data load.

Command Examples

While ADF is primarily a GUI-based service, you can interact with it via PowerShell, Azure CLI, or REST API. Example PowerShell to create a linked service:

Set-AzDataFactoryV2LinkedService -ResourceGroupName "rg-adf" -DataFactoryName "MyADF" -Name "AzureBlobStorageLinkedService" -File .\AzureBlobStorageLinkedService.json

Example Azure CLI to trigger a pipeline run:

az datafactory pipeline create-run --factory-name "MyADF" --resource-group "rg-adf" --name "MyPipeline"

Monitoring with REST API

To get a list of pipeline runs, send a POST request whose JSON body gives the time window to search (lastUpdatedAfter and lastUpdatedBefore):

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/queryPipelineRuns?api-version=2018-06-01
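A minimal sketch of building that call in Python follows. The queryPipelineRuns operation is a POST whose body carries the time window and optional filters. The function name build_query_runs_request, the subscription ID, and the token value are placeholders invented for the example, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_query_runs_request(
    subscription_id: str,
    resource_group: str,
    factory: str,
    pipeline_name: str,
    token: str,
    now: datetime,
) -> urllib.request.Request:
    """Build (but do not send) a queryPipelineRuns request for the last 24 hours."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        "/queryPipelineRuns?api-version=2018-06-01"
    )
    body = {
        "lastUpdatedAfter": (now - timedelta(days=1)).isoformat(),
        "lastUpdatedBefore": now.isoformat(),
        # Optional filter: only runs of one pipeline.
        "filters": [
            {"operand": "PipelineName", "operator": "Equals", "values": [pipeline_name]}
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # token acquisition not shown
            "Content-Type": "application/json",
        },
    )


request = build_query_runs_request(
    subscription_id="00000000-0000-0000-0000-000000000000",  # placeholder
    resource_group="rg-adf",
    factory="MyADF",
    pipeline_name="MyPipeline",
    token="<access-token>",  # placeholder
    now=datetime(2026, 5, 31, tzinfo=timezone.utc),
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen with a valid bearer token would return a JSON document listing the matching runs.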

Common Defaults and Limits

Maximum activities per pipeline: 40

Maximum pipelines per factory: 1000 (can be increased via support ticket)

Maximum datasets per factory: 5000

Maximum linked services per factory: 500

Pipeline run timeout: 7 days by default (can be changed)

Retry interval: 30 seconds (default)

Copy activity timeout: 7 days (default)

Data Flow (Mapping Data Flows)

ADF also supports Mapping Data Flows, which are visual data transformation designs that run on Spark clusters. These are billed separately based on cluster size and duration. Data flows are stateless and use a cluster that spins up on demand (typically takes 2-5 minutes).

Walk-Through

1. Define Linked Services

First, you create linked services that store connection details to your data stores (e.g., Azure Blob Storage, SQL Server, or an on-premises Oracle database). For each linked service, you specify the type, connection string, authentication method (e.g., account key, service principal, or managed identity), and optionally reference secrets from Azure Key Vault. For self-hosted IR, you also need to install the integration runtime gateway on a machine that can access the on-premises source.
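For illustration, here is roughly what a linked service for an on-premises SQL Server could look like when it pulls its password from Key Vault and connects through a self-hosted IR. The dictionaries mirror ADF's JSON format; the names KeyVaultLS, OnPremSqlLS, SelfHostedIR, the vault URL, and the secret name are assumptions made for this sketch.

```python
import json

# Hypothetical Key Vault linked service that other linked services point to.
key_vault_ls = {
    "name": "KeyVaultLS",
    "properties": {
        "type": "AzureKeyVault",
        "typeProperties": {"baseUrl": "https://<vault-name>.vault.azure.net"},
    },
}

# Hypothetical on-premises SQL Server linked service.
sql_ls = {
    "name": "OnPremSqlLS",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": (
                "Data Source=<server>;Initial Catalog=<database>;"
                "Integrated Security=False;User ID=<user>;"
            ),
            # The password is a reference to a secret, never a literal value.
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": key_vault_ls["name"],
                    "type": "LinkedServiceReference",
                },
                "secretName": "sql-password",
            },
        },
        # Route the connection through the self-hosted integration runtime.
        "connectVia": {
            "referenceName": "SelfHostedIR",
            "type": "IntegrationRuntimeReference",
        },
    },
}

print(json.dumps(sql_ls, indent=2))
```

Because the secret is resolved at run time, rotating the password in Key Vault does not require republishing the factory.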

2. Define Datasets

Next, you create datasets that represent the structure of the data you want to read or write. A dataset references a linked service and defines the location (e.g., file path, table name) and format (e.g., CSV, Parquet, JSON). For relational sources, you specify the schema mapping. Datasets are not the actual data but pointers; they allow ADF to understand how to interact with the data.

3. Create Pipeline with Activities

You then create a pipeline and add activities. For a copy activity, you configure the source (dataset) and sink (dataset). You can also set transformation activities like 'Data Flow' or 'Stored Procedure'. Control activities like 'If Condition' or 'ForEach' allow branching and looping. Each activity can have dependencies on previous activities, forming a directed acyclic graph (DAG).
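The dependency graph can be made concrete with a short sketch. Each activity lists its predecessors in a dependsOn array, as in ADF's pipeline JSON, and a topological sort gives one valid execution order. The activity names are hypothetical, and the sort uses Python's standard graphlib module as a stand-in for ADF's own scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical activities; dependsOn follows the shape used in pipeline JSON.
activities = [
    {"name": "CopyRaw", "type": "Copy", "dependsOn": []},
    {
        "name": "TransformSales",
        "type": "ExecuteDataFlow",
        "dependsOn": [{"activity": "CopyRaw", "dependencyConditions": ["Succeeded"]}],
    },
    {
        "name": "LoadWarehouse",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
            {"activity": "TransformSales", "dependencyConditions": ["Succeeded"]}
        ],
    },
]

# Map each activity to the set of activities it must wait for.
graph = {a["name"]: {d["activity"] for d in a["dependsOn"]} for a in activities}

# static_order() raises CycleError if the dependencies are not acyclic.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Activities with no dependency path between them have no fixed relative order, which is what lets ADF run independent branches in parallel.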

4. Configure Trigger and Publish

After building the pipeline, you create a trigger to automate execution. Triggers can be schedule-based (e.g., every hour) or event-based (e.g., when a new blob appears). You must publish the pipeline (using the 'Publish All' button or CI/CD) to deploy the changes to the ADF service. Until you publish, changes exist only in your authoring session or, when Git integration is enabled, in your Git branch.
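As a sketch of what the trigger definition contains, the dictionary below mirrors the JSON of an hourly schedule trigger attached to one pipeline. The trigger name, pipeline name, and start time are hypothetical.

```python
import json

# Hypothetical hourly schedule trigger in ADF's JSON shape.
hourly_trigger = {
    "name": "HourlySalesTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Hour",
                "interval": 1,
                "startTime": "2026-06-01T00:00:00Z",
                "timeZone": "UTC",
            }
        },
        # A schedule trigger can start one or more pipelines.
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopySalesPipeline",
                    "type": "PipelineReference",
                },
                "parameters": {},
            }
        ],
    },
}

started_pipelines = [
    p["pipelineReference"]["referenceName"]
    for p in hourly_trigger["properties"]["pipelines"]
]
print(json.dumps(hourly_trigger, indent=2))
```

A published trigger also has to be started before it fires; a stopped trigger stays attached to the pipeline but never creates runs.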

5. Monitor and Troubleshoot

Once the pipeline runs, you can monitor it in the ADF Monitor tab. You see pipeline runs, activity runs, and error details. For each run, you can view input/output JSON, duration, and logs. If a copy activity fails, you can check the error message, which often indicates source/sink connectivity issues or schema mismatches. You can also set up alerts via Azure Monitor.

What This Looks Like on the Job

Enterprise Scenario 1: Incremental Data Load from On-Premises SQL Server to Azure Synapse Analytics

A retail company needs to load daily sales data from their on-premises SQL Server to Azure Synapse Analytics for reporting. They use ADF with a self-hosted integration runtime installed on a VM in their local network. The pipeline uses a tumbling window trigger to run every hour, copying only new or modified rows using a watermark column (e.g., LastModifiedDate). The copy activity stages data in Azure Blob Storage so that it can be loaded into Synapse with PolyBase, which improves throughput. In production, they handle 500 GB of data per day. A common issue is the self-hosted IR going offline after network changes, which causes pipeline failures; they mitigate this by deploying multiple IR nodes in a high-availability cluster.
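The watermark pattern in this scenario boils down to querying only the rows changed since the previous run. A minimal sketch, assuming a hypothetical dbo.Sales table and the LastModifiedDate column mentioned above; in a real pipeline the two timestamps would come from a stored watermark and the trigger's window, and the query would be parameterized rather than built by string formatting.

```python
from datetime import datetime


def incremental_query(
    table: str, watermark_column: str, old_watermark: datetime, new_watermark: datetime
) -> str:
    """Build a source query selecting rows changed in (old_watermark, new_watermark]."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > '{old_watermark.strftime(fmt)}' "
        f"AND {watermark_column} <= '{new_watermark.strftime(fmt)}'"
    )


query = incremental_query(
    table="dbo.Sales",  # hypothetical table name
    watermark_column="LastModifiedDate",
    old_watermark=datetime(2026, 5, 31, 9, 0, 0),
    new_watermark=datetime(2026, 5, 31, 10, 0, 0),
)
print(query)
```

Using an exclusive lower bound and an inclusive upper bound means consecutive windows never read the same row twice and never skip one.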

Enterprise Scenario 2: Orchestrating a Complex ETL with Databricks

A financial services firm uses ADF to orchestrate a complex ETL pipeline that transforms data using Azure Databricks. The pipeline starts with a copy activity to ingest raw CSV files from Azure Blob Storage into Azure Data Lake Storage Gen2. Then a Databricks notebook activity runs a Python script that cleans and aggregates the data. Finally, a stored procedure activity loads the transformed data into Azure SQL Database. ADF handles dependencies and retries: if the Databricks cluster fails to start, ADF retries up to 3 times with a 5-minute interval. They use Azure DevOps for CI/CD, storing ADF artifacts in Git. Misconfiguration often occurs when the Databricks linked service uses an expired access token, causing authentication failures.

Enterprise Scenario 3: Event-Driven Data Processing with Blob Storage

A media company uses ADF to process video files uploaded to Azure Blob Storage. They set up an event-based trigger that fires when a new blob is created. The pipeline then copies the blob to a processing folder, runs a custom .NET activity (via Azure Batch) to transcode the video, and moves the result to an output container. The event trigger uses Azure Event Grid subscription. In production, they process thousands of files daily. A common pitfall is that the event trigger fires multiple times if the blob is overwritten; they mitigate by using a deduplication pattern with a metadata flag.
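The deduplication idea can be sketched in a few lines. The source does not say where the flag is kept, so this example assumes a flag store that survives blob overwrites (for instance a small tracking table); the dictionary flag_store, the handle_blob_event function, and the file name are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

# Hypothetical flag store keyed by blob name; stands in for persistent metadata.
flag_store: Dict[str, Dict[str, str]] = {}
transcoded: List[str] = []


def handle_blob_event(blob_name: str, process: Callable[[str], None]) -> bool:
    """Process a blob once; return False if it was already handled."""
    flags = flag_store.setdefault(blob_name, {})
    if flags.get("processed") == "true":
        return False  # duplicate event, skip the expensive work
    process(blob_name)
    flags["processed"] = "true"  # mark only after processing succeeds
    return True


# The same blob raises two events, e.g. because it was overwritten.
first = handle_blob_event("videos/intro.mp4", transcoded.append)
second = handle_blob_event("videos/intro.mp4", transcoded.append)
print(first, second, transcoded)
```

Setting the flag after processing rather than before means a failed run is retried on the next event instead of being silently skipped.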

How DP-900 Actually Tests This

DP-900 Objective 3.2: Identify components of Azure Data Factory

The exam tests your understanding of ADF's role in data integration and its core components. Specifically, you should know:
- Pipeline, Activity, Dataset, Linked Service, Trigger, Integration Runtime: definitions and purposes.
- When to use ADF vs. other services: ADF is a standalone orchestration service; Azure Synapse pipelines offer the same pipeline capability inside a Synapse Analytics workspace; SSIS is the on-premises tool whose existing packages can be lifted and shifted to Azure with the Azure-SSIS IR.
- Self-hosted IR vs. Azure IR: Self-hosted is for on-premises or network-restricted data sources; Azure IR is for cloud sources.
- Copy activity vs. Data Flow: Copy is for simple movement; Data Flow is for transformations (uses Spark).

Common Wrong Answers

1. Choosing 'Azure Data Factory is a data warehouse': ADF is not a storage service; it orchestrates movement. The correct answer is 'data integration service'.

2. Confusing Linked Service with Dataset: A linked service defines the connection; a dataset defines the data's structure and location. Many candidates mix these up.

3. Thinking ADF can reach on-premises sources without any gateway: For on-premises sources, you need a self-hosted integration runtime.

4. Selecting 'Azure Data Factory stores data permanently': ADF only moves and transforms data; it does not store it. Staged copies use temporary blobs in a storage account that you provide.

Exam-Specific Numbers

Maximum activities per pipeline: 40

Default copy activity DIU (formerly DMU) setting: Auto

Self-hosted IR installation: Requires a machine with internet access to Azure.

Trigger types: Schedule, tumbling window, event, manual.

Edge Cases

Cross-region copy: Use staging for better performance.

Empty datasets: ADF can handle empty files; you can configure skip behavior.

Pipeline run timeout: Default 7 days; if exceeded, run is marked as timed out.

Eliminating Wrong Answers

If a question asks about moving data from on-premises SQL Server to Azure, any answer that doesn't mention a self-hosted IR is likely wrong. For transformation questions, an answer that mentions 'Data Flow' or 'HDInsight' is likely correct; an answer that says 'Copy Data' describes movement only.

Key Takeaways

ADF is a cloud-based ETL/ELT orchestration service, not a data store.

Core components: Pipeline, Activity, Dataset, Linked Service, Trigger, Integration Runtime.

Use self-hosted IR for on-premises data sources; Azure IR for cloud sources.

Copy activity moves data; Data Flow transforms data using Spark.

Maximum 40 activities per pipeline; pipeline run timeout default is 7 days.

Triggers can be schedule, tumbling window, event-based, or manual.

ADF integrates with Azure Monitor for logging and alerts.

For cross-region copies, use staging via Azure Blob Storage.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Data Factory

Cloud-native, serverless, no infrastructure management.

Supports hybrid (on-premises and cloud) via self-hosted IR.

Pricing based on data movement and orchestration (activity runs).

Visual authoring with code-free pipelines.

Integrates natively with Azure services like Databricks, Synapse.

SSIS (SQL Server Integration Services)

On-premises, requires SQL Server installation.

Existing SSIS packages can be lifted and shifted to Azure by running them on ADF's Azure-SSIS IR.

Licensing costs based on SQL Server edition.

Traditional developer environment with Visual Studio.

Limited cloud integration; requires additional connectors.

Watch Out for These

Mistake

Azure Data Factory is a data storage service.

Correct

ADF is a data integration and orchestration service. It does not store data permanently; it only moves and transforms data between stores.

Mistake

You need a self-hosted IR for all data sources.

Correct

Self-hosted IR is only required for on-premises or network-restricted sources. For Azure cloud sources, you use Azure IR.

Mistake

Datasets contain the actual data.

Correct

Datasets are references or pointers to data, not the data itself. They define the structure and location.

Mistake

ADF can only copy data, not transform it.

Correct

ADF supports transformations via Data Flows, HDInsight, Databricks, and other compute services.

Mistake

Pipelines run automatically after creation without triggers.

Correct

Pipelines need to be triggered (manual, schedule, tumbling window, or event) to execute. Just publishing does not run them.

Frequently Asked Questions

What is the difference between a linked service and a dataset in Azure Data Factory?

A linked service defines the connection information to a data store (like connection string, authentication), while a dataset represents the structure and location of the data within that store. For example, a linked service for Azure Blob Storage contains the storage account key, and a dataset points to a specific container and folder. In the exam, remember that linked services are like 'connection managers', datasets are like 'data pointers'.

When should I use a self-hosted integration runtime?

Use a self-hosted IR when your source or sink data store is located on-premises or in a virtual network that cannot be accessed directly from the public internet. You install the IR gateway on a machine that can reach the data store, and it acts as a bridge. For cloud-only scenarios, use Azure IR.

Can Azure Data Factory transform data without using external compute?

Yes, Azure Data Factory has Mapping Data Flows, which are visual transformations that run on Spark clusters managed by ADF. However, these are still compute-intensive and incur separate costs. For simple transformations like column mapping or data type conversion, you can use the Copy Data activity with mapping options.

How do I monitor pipeline runs in Azure Data Factory?

You can monitor pipeline runs in the Monitor tab of ADF Studio, which you open from the Azure portal. It shows run status, duration, and errors. For advanced monitoring, you can route logs to Azure Monitor Log Analytics and create custom dashboards. You can also set up alerts on pipeline failures using Azure Monitor.

What is the maximum number of activities per pipeline?

The maximum is 40 activities per pipeline. This includes all activity types (copy, transformation, control). If you need more, consider splitting into multiple pipelines or using nested pipelines (execute pipeline activity).

Can I use Azure Data Factory with GitHub for source control?

Yes, ADF supports Git integration with Azure Repos and GitHub. You can configure a Git repository for your factory, allowing version control, collaboration, and CI/CD. Any changes made in the UI are committed to a branch, and you can publish to the ADF service after merging.

What is a tumbling window trigger?

A tumbling window trigger is a time-based trigger that fires at a fixed interval (e.g., every 15 minutes) over a series of fixed-size, non-overlapping, contiguous time windows. Each run processes exactly one window (e.g., one 15-minute slice of data), and the trigger can backfill windows in the past. It is commonly used for incremental data loads.
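To see what fixed-size, back-to-back windows look like, the sketch below computes them for one hour at a 15-minute interval. The function tumbling_windows is a hypothetical helper that imitates the windowing arithmetic only; it does not call ADF.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def tumbling_windows(
    start: datetime, end: datetime, size: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Split [start, end) into contiguous, non-overlapping windows of equal size."""
    windows = []
    window_start = start
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size  # the next window begins where this one ends
    return windows


windows = tumbling_windows(
    start=datetime(2026, 5, 31, 0, 0),
    end=datetime(2026, 5, 31, 1, 0),
    size=timedelta(minutes=15),
)
for window_start, window_end in windows:
    print(window_start.time(), "to", window_end.time())
```

Starting the calculation from a date in the past is how backfill works: every historical window gets its own run, each with its own start and end time.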
