This chapter covers Azure Data Factory (ADF) activities and pipelines, the core orchestration components for data movement and transformation in Azure. Understanding these topics is critical for the DP-900 exam, as approximately 15–20% of questions relate to data integration and orchestration. You will learn the types of activities, how pipelines control execution flow, and how to monitor and troubleshoot pipeline runs. This knowledge directly supports exam objective 3.2 (Analytics) and is essential for designing scalable, reliable data integration solutions.
Imagine a custom manufacturing facility that builds furniture. The facility has a set of blueprints (pipelines) that define the steps to create a specific piece of furniture. Each blueprint consists of multiple stations (activities) arranged in sequence or in parallel. At each station, a worker performs a specific action: one station cuts wood (Copy activity), another drills holes (Data Flow), a third applies varnish (Stored Procedure activity), and a fourth inspects the piece (Validation activity). The facility manager (Azure Data Factory) orchestrates the entire process: it decides which blueprints to run, when to start them, how to handle errors (e.g., if the drill breaks, skip to a repair station), and how to pass partially finished pieces between stations via conveyor belts (activity dependencies). The facility also has a central control room (the Monitor tab) that shows the real-time status of each station and each piece. If a blueprint fails, the manager can rerun only the failed station without restarting the entire process. The facility sources raw materials from various warehouses (data sources) and delivers finished products to different distribution centers (data sinks).
This analogy mirrors how ADF pipelines orchestrate activities: activities are the individual steps, pipelines are the orchestration units that execute activities in a defined order with conditions, and the control flow (triggers and dependencies) manages the overall execution. Just as the facility can run multiple blueprints concurrently, ADF can run multiple pipelines in parallel. And just as the manager can retry or abandon a station that is down, ADF supports activity retry policies and timeouts.
What Are Activities and Pipelines?
Azure Data Factory (ADF) is a cloud-based ETL (Extract, Transform, Load) and data integration service. The fundamental building blocks are pipelines and activities. A pipeline is a logical grouping of activities that together perform a unit of work. An activity is a single step in a pipeline, representing an action such as copying data, running a transformation, or executing a stored procedure.
Activities can be classified into three categories:
- Data Movement Activities: Move data between source and sink data stores. The primary activity is Copy activity, which can handle many connectors and perform staging via PolyBase or Copy Command.
- Data Transformation Activities: Transform or process data using Azure HDInsight, Azure Databricks, Azure Machine Learning, Stored Procedure, or Data Flow (native transformation).
- Control Flow Activities: Orchestrate the execution of other activities, including branching, looping, and error handling. Examples include If Condition, ForEach, Until, Wait, Execute Pipeline, Set Variable, Append Variable, Validation, Web, and Fail.
Pipeline Execution and Dependencies
Pipelines define the order of activity execution through dependencies. An activity can depend on the status (success, failure, skipped, completion) of one or more preceding activities. Together, these dependencies form a directed acyclic graph (DAG). ADF does not support cycles between activities; if you need looping, use the ForEach or Until control flow activities.
When a pipeline is triggered (manually, by schedule, or by event), each activity runs on the compute of its Integration Runtime. Activities with no dependencies between them can run in parallel. The pipeline run status is derived from the statuses of its activities. When an activity fails, activities that depend on its success are skipped; to let the pipeline handle the error or carry on, attach a downstream activity through a Failure or Completion dependency.
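The way dependency conditions decide which activities run can be sketched in a few lines of Python. This is a simplified model, not ADF code: the activity names and the `simulate` helper are hypothetical, and only the four condition names (Succeeded, Failed, Skipped, Completed) come from ADF.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: activity -> {upstream activity: accepted conditions}.
pipeline = {
    "LookupWatermark": {},
    "CopyNewRows": {"LookupWatermark": ["Succeeded"]},
    "LogFailure": {"CopyNewRows": ["Failed"]},
    "Cleanup": {"CopyNewRows": ["Completed"]},
}

def simulate(pipeline, outcomes):
    """Return the final status of every activity.

    outcomes maps an activity name to "Succeeded" or "Failed"; it is only
    consulted for activities whose dependency conditions are all met.
    Activities missing from outcomes are assumed to succeed.
    """
    # static_order() yields upstream activities first and raises
    # graphlib.CycleError if the dependencies contain a cycle.
    graph = {name: set(deps) for name, deps in pipeline.items()}
    status = {}
    for activity in TopologicalSorter(graph).static_order():
        runnable = all(
            status[up] in conds
            or ("Completed" in conds and status[up] in ("Succeeded", "Failed"))
            for up, conds in pipeline[activity].items()
        )
        status[activity] = outcomes.get(activity, "Succeeded") if runnable else "Skipped"
    return status

happy_path = simulate(pipeline, {})
copy_fails = simulate(pipeline, {"CopyNewRows": "Failed"})
print(happy_path)
print(copy_fails)
```

In the failing run, LogFailure and Cleanup both execute because their Failed and Completed conditions are satisfied, while in the happy path LogFailure is skipped.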
Copy Activity Deep Dive
The Copy activity is the most commonly used activity. It moves data from a source to a sink. Internally, it uses an Integration Runtime (IR) to perform the data movement. The IR can be Azure (managed), Self-hosted (on-premises), or Azure-SSIS (for SQL Server Integration Services).
Key properties:
- Source: dataset reference or inline source object with file path, query, etc.
- Sink: dataset reference or inline sink object.
- Mapping: defines how source columns map to sink columns.
- Settings: includes fault tolerance (skip incompatible rows), staging (for PolyBase or Copy Command), data consistency verification, and performance settings (Data Integration Units and degree of copy parallelism).
Default values:
- Data Integration Units (DIU): used for Azure IR, defaults to 4, can be set from 2 to 256.
- Parallel copies: default is 1, can be increased for better throughput.
- Fault tolerance: disabled by default; when enabled, you can skip incompatible rows or log them.
- Timeout: 12 hours by default for an individual activity run, configurable up to 7 days.
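To make these properties concrete, here is a sketch of a Copy activity definition written as a Python dictionary that serializes to pipeline JSON. The activity, dataset, and linked service names are hypothetical, and the property names reflect the ADF JSON schema as commonly documented; check them against the current reference before relying on them.

```python
import json

copy_activity = {
    "name": "CopySalesToSynapse",  # hypothetical activity name
    "type": "Copy",
    "inputs": [{"referenceName": "SalesCsv", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SalesTable", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},
        "sink": {"type": "SqlDWSink", "allowPolyBase": True},
        # Staged copy through an interim Blob Storage location.
        "enableStaging": True,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "StagingBlob",  # hypothetical linked service
                "type": "LinkedServiceReference",
            },
            "path": "staging",
        },
        # Performance settings.
        "dataIntegrationUnits": 16,
        "parallelCopies": 4,
        # Fault tolerance: skip rows that do not fit the sink schema.
        "enableSkipIncompatibleRow": True,
    },
    # Explicit policy values; the timeout format is d.hh:mm:ss (here 2 hours).
    "policy": {"timeout": "0.02:00:00", "retry": 3, "retryIntervalInSeconds": 60},
}

print(json.dumps(copy_activity, indent=2))
```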
Data Flow Activity
Data Flow is a transformation activity that runs natively on ADF-managed Spark clusters. It provides a visual designer to build transformation logic without coding. Key properties:
- Data Flow: reference to a Data Flow object that defines the transformation logic.
- Integration Runtime: choose an Azure IR with appropriate cluster size (number of cores, time-to-live).
- Staging: for PolyBase or Copy Command when loading to Azure Synapse.
Data Flows execute on Spark, and the cluster is automatically spun up. The time-to-live (TTL) setting keeps the cluster warm for subsequent runs. Default TTL is 60 minutes; you can set it from 0 to 120 minutes. The cluster size is defined by the number of cores (default 8, min 4, max 256).
Control Flow Activities
Control flow activities manage the execution order and logic:
- If Condition: evaluates a Boolean expression (using pipeline expressions) and runs either the true or false branch. Both branches contain a list of activities.
- ForEach: iterates over a list of items (from an array) and runs a set of activities for each item. By default, iterations run in parallel (isSequential is false), with batchCount capping the concurrency (default 20, max 50); set isSequential to true to run them one at a time.
- Until: repeats a set of activities until a condition evaluates to true. The condition is checked at the end of each iteration.
- Wait: pauses the pipeline for a specified duration (in seconds).
- Execute Pipeline: invokes another pipeline from within a pipeline. The child pipeline runs synchronously (waits for completion) or asynchronously (fire-and-forget). The default is synchronous.
- Set Variable: sets the value of an existing pipeline variable.
- Append Variable: appends a value to an existing array variable.
- Validation: ensures that a dataset exists and is valid before proceeding. It can check for file existence, table presence, etc.
- Web: calls an external REST endpoint. Supports authentication (anonymous, managed identity, service principal, basic).
- Fail: terminates the pipeline by raising a custom error. You can specify the error message and error code.
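The interplay between sequential execution and batchCount in ForEach can be modelled with a thread pool. This is only an analogy in Python: `for_each` and its parameters are hypothetical stand-ins for the activity's isSequential and batchCount settings, not part of any ADF SDK.

```python
from concurrent.futures import ThreadPoolExecutor

def for_each(items, inner, is_sequential, batch_count):
    """Run inner(item) for every item, loosely mimicking ForEach.

    batch_count caps how many iterations run at once; it is ignored
    when is_sequential is True, where one iteration runs at a time.
    """
    if not 1 <= batch_count <= 50:
        raise ValueError("batchCount must be between 1 and 50")
    workers = 1 if is_sequential else batch_count
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even when the
        # underlying calls finish out of order.
        return list(pool.map(inner, items))

tables = ["customers", "orders", "products"]  # hypothetical item list
parallel = for_each(tables, str.upper, is_sequential=False, batch_count=2)
ordered = for_each(tables, str.upper, is_sequential=True, batch_count=2)
print(parallel, ordered)
```

One difference worth keeping in mind: the sketch returns results in input order, whereas parallel ForEach iterations in ADF complete in no guaranteed order, which matters when they append to a shared variable.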
Integration Runtime
Activities execute on an Integration Runtime (IR). There are three types:
- Azure IR: fully managed, used for cloud data movement and Data Flows. It can be configured with a specific location (auto-resolve or explicit region).
- Self-hosted IR: installed on-premises or in a VM, used to connect to on-premises data stores. It must be registered with ADF.
- Azure-SSIS IR: dedicated cluster for running SQL Server Integration Services (SSIS) packages.
For the Copy activity, the IR determines the network path. If source and sink are both in Azure, Azure IR is used. If either is on-premises, a Self-hosted IR is required.
Triggers
Triggers determine when a pipeline runs. Types:
- Schedule trigger: runs at a specific time or recurring interval (e.g., every 5 minutes, daily at 8 AM).
- Tumbling window trigger: runs at a periodic interval from a start time, with built-in retry and backfill support. Commonly used for batch processing.
- Event trigger: runs when a blob is created or deleted in a storage account.
- Manual trigger: on-demand via portal, CLI, or API.
Schedule and event triggers can be associated with one or more pipelines, while a tumbling window trigger references exactly one pipeline. Tumbling window triggers can also be chained: one can depend on the completion of another tumbling window trigger.
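The backfill behaviour of a tumbling window trigger comes down to enumerating fixed, non-overlapping windows from the start time. A minimal Python sketch of that arithmetic follows; the function name and the dates are illustrative, and this models the concept rather than calling ADF.

```python
from datetime import datetime, timedelta

def tumbling_windows(start, now, interval):
    """Yield (window_start, window_end) for every fully elapsed window."""
    window_start = start
    while window_start + interval <= now:
        yield window_start, window_start + interval
        window_start += interval

# A 15-minute trigger that was unable to run for one hour.
start = datetime(2024, 1, 1, 8, 0)
now = datetime(2024, 1, 1, 9, 0)
windows = list(tumbling_windows(start, now, timedelta(minutes=15)))

for window_start, window_end in windows:
    print(window_start.time(), "to", window_end.time())
```

Each window would become one pipeline run that receives its own start and end time, which is why missed windows can be reprocessed individually.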
Monitoring and Alerts
ADF provides monitoring in the Monitor tab of the ADF authoring UI. You can view pipeline runs, activity runs, and trigger runs. Each run has a status: Succeeded, Failed, In Progress, Cancelled, or Queued.
Key metrics:
- Pipeline run duration
- Activity run duration
- Data read/written
- Integration Runtime metrics (CPU, memory, queue length)
Alerts can be set on metrics (e.g., failed runs, latency) using Azure Monitor.
Expressions and Variables
ADF uses its own expression language, which it shares with Azure Logic Apps. An expression starts with the @ character. Examples:
- @pipeline().RunId - Run ID of the current pipeline.
- @utcnow() - current UTC time.
- @concat('prefix_', variables('myVar')) - string concatenation.
Variables are scoped to the pipeline. You can define variables of type string, integer, boolean, array, or object. They are set using Set Variable and Append Variable activities.
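As an illustration, the sketch below shows how an expression might be embedded in a Set Variable activity definition, expressed as a Python dictionary that mirrors the pipeline JSON. The activity and variable names are hypothetical, and `is_expression` is a helper written for this example; it only covers the leading-@ rule (with @@ as the escape for a literal @), not string interpolation.

```python
# Hypothetical Set Variable activity that builds a file name at run time.
set_variable = {
    "name": "BuildFileName",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "fileName",
        "value": {
            "value": "@concat('sales_', pipeline().RunId, '.csv')",
            "type": "Expression",
        },
    },
}

def is_expression(text):
    """True if text would be evaluated rather than used literally.

    A leading @ marks an expression; a leading @@ is the escape
    for a literal string that starts with @.
    """
    return text.startswith("@") and not text.startswith("@@")

print(is_expression(set_variable["typeProperties"]["value"]["value"]))
print(is_expression("@@not-an-expression"))
print(is_expression("plain text"))
```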
Activity Retry and Timeout
Each activity has a retry policy:
- retry: number of retry attempts (default 0, max 1000).
- retryIntervalInSeconds: interval between retries (default 30 seconds, min 30).
- timeout: maximum time the activity can run before being marked as failed (default 12 hours, maximum 7 days).
These settings are crucial for handling transient failures.
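The retry semantics can be sketched as a small Python loop. The `run_with_policy` function is a hypothetical model of the retry and retryIntervalInSeconds settings (it does not model timeout), and the injectable `sleep` parameter exists only so the example runs instantly.

```python
import time

def run_with_policy(activity, retry=0, retry_interval_seconds=30, sleep=time.sleep):
    """Call activity() and retry on failure, up to `retry` extra attempts.

    Returns (result, attempts). Re-raises the last error once the
    retries are exhausted, which corresponds to a Failed activity run.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return activity(), attempts
        except Exception:
            if attempts > retry:
                raise
            sleep(retry_interval_seconds)

calls = {"count": 0}

def flaky_copy():
    """Fails twice with a simulated transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient network error")
    return "rows copied"

# The sleep is stubbed out so the example does not actually wait 60 seconds.
result, attempts = run_with_policy(
    flaky_copy, retry=3, retry_interval_seconds=60, sleep=lambda seconds: None
)
print(result, "after", attempts, "attempts")
```

With retry set to 3, the activity gets at most four attempts in total: the initial run plus three retries.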
Pipeline Parameters and Global Parameters
Pipelines can accept parameters that are passed at runtime. Parameters are defined in the pipeline and can be used in expressions. Global parameters are defined at the factory level and are available to all pipelines.
Security
ADF uses managed identity for authentication to Azure resources. You can also use service principals, user-assigned managed identities, or connection strings. Data stores are accessed via linked services that store credentials securely.
Common Patterns
- Incremental Load: Use the Lookup activity to get the last watermark, then Copy activity with a query filtering on that watermark.
- Metadata-driven Pipeline: Use a control table to define source/sink mappings and iterate with ForEach.
- Error Handling: Use If Condition to check activity output and Fail activity to stop on critical errors.
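The incremental load pattern is easier to see with real tables. The sketch below reproduces its three steps (look up the watermark, copy newer rows, advance the watermark) using two in-memory SQLite databases as stand-ins for the source and sink. The table names, columns, and the `incremental_load` function are all hypothetical, and in ADF these steps would be Lookup, Copy, and Stored Procedure activities rather than Python.

```python
import sqlite3

# Stand-in source system with three rows.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, modified TEXT)")
src.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-01-01T08:00:00"),
        (2, "2024-01-01T09:00:00"),
        (3, "2024-01-01T10:00:00"),
    ],
)

# Stand-in sink with a control table holding the last watermark.
sink = sqlite3.connect(":memory:")
sink.execute("CREATE TABLE orders (id INTEGER, modified TEXT)")
sink.execute("CREATE TABLE watermark (value TEXT)")
sink.execute("INSERT INTO watermark VALUES ('2024-01-01T08:30:00')")

def incremental_load(src, sink):
    """Copy rows newer than the stored watermark; return the row count."""
    # Step 1 (Lookup): read the last watermark from the control table.
    (old_mark,) = sink.execute("SELECT value FROM watermark").fetchone()
    # Step 2 (Copy): select only rows modified after the watermark.
    rows = src.execute(
        "SELECT id, modified FROM orders WHERE modified > ? ORDER BY modified",
        (old_mark,),
    ).fetchall()
    if rows:
        sink.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        # Step 3: advance the watermark only after the copy succeeded.
        sink.execute("UPDATE watermark SET value = ?", (rows[-1][1],))
        sink.commit()
    return len(rows)

copied = incremental_load(src, sink)
print("rows copied:", copied)
```

Running the function a second time copies nothing, because the watermark has moved past every source row. Updating the watermark last is what prevents duplicate loads after a failed copy.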
Performance Considerations
- For Copy activity, use parallel copies and appropriate DIU.
- For Data Flow, choose cluster size based on data volume and complexity.
- In ForEach, tune batchCount (up to 50) for parallel execution, and set isSequential only when iteration order matters.
- Use staging for large data loads to Synapse.
Default Values and Limits
Maximum activities per pipeline: 40
Maximum pipelines per factory: 500
Maximum concurrent pipeline runs per factory: 100 (default, can be increased)
Maximum concurrent activity runs per pipeline: 40
Maximum data size per Copy activity: 100 TB
Maximum number of connectors: 100+
Define Linked Services and Datasets
First, create linked services to connect to source and sink data stores (e.g., Azure Blob Storage, SQL Database). Linked services contain connection strings and authentication information. Then create datasets that represent the specific data structures (e.g., a specific table or file). Datasets reference linked services and define schema, file format, and location. This step is essential because activities reference datasets. Without them, the activity cannot know where to read from or write to.
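The relationship between a linked service and a dataset can be sketched as two Python dictionaries that mirror their JSON definitions. All names here are hypothetical, the connection string is a placeholder, and the property names follow the ADF JSON schema as commonly documented, so treat this as an outline rather than a deployable definition.

```python
import json

# Linked service: how to connect (placeholder connection string).
linked_service = {
    "name": "SalesBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<connection-string-placeholder>"},
    },
}

# Dataset: which data to use, pointing back at the linked service by name.
dataset = {
    "name": "SalesCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",
                "fileName": "sales.csv",
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}

print(json.dumps(dataset, indent=2))
```

An activity then refers to the dataset by name, so the chain is activity to dataset to linked service to data store.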
Create Activities and Define Dependencies
Add activities to the pipeline canvas. For each activity, configure its properties: source and sink datasets for Copy activity, transformation logic for Data Flow, or condition expression for If Condition. Then define dependencies by connecting the output of one activity to the input of another. The dependency status can be Success, Failure, Skipped, or Completion. This creates the DAG that determines execution order. ADF validates that no cycles exist.
Configure Integration Runtime
For each activity that requires data movement or transformation, select the appropriate Integration Runtime (IR). For cloud-to-cloud, use Azure IR; for on-premises, use Self-hosted IR. For Data Flow, choose a Data Flow-specific Azure IR with appropriate cluster settings (core count, TTL). The IR must be up and running before the pipeline executes. Self-hosted IR requires installation and registration on a Windows machine.
Set Trigger and Parameters
Create a trigger (schedule, tumbling window, or event) and associate it with the pipeline. Pass pipeline parameters if needed. For example, a schedule trigger can pass the current time as a parameter to filter data. Triggers can be started, stopped, or paused. You can also manually trigger a pipeline with custom parameter values.
Monitor Pipeline and Activity Runs
After triggering, monitor the pipeline run in the ADF Monitor tab. You can see each activity run status, duration, and input/output. For failed runs, drill into the activity to view error details. Use Azure Monitor alerts for proactive notifications. You can rerun the entire pipeline or rerun it from the failed activity. Monitoring helps identify performance bottlenecks and errors.
Enterprise Scenario 1: Incremental Data Load from On-Premises SQL Server to Azure Synapse
A financial services company needs to load transactional data from an on-premises SQL Server to Azure Synapse Analytics every 15 minutes. They deploy a Self-hosted Integration Runtime on a VM in their data center, registered with ADF. The pipeline uses a Lookup activity to retrieve the last maximum timestamp from a control table in Azure SQL Database. Then a Copy activity copies new rows from on-premises SQL Server using a query with a WHERE clause filtering on that timestamp. The sink is an Azure Synapse table. For performance, they set parallel copies to 4 and DIU to 16. They also enable staging via PolyBase for faster loads. If the Copy activity fails due to a network glitch, the retry policy (3 retries, 60-second interval) handles it. The pipeline runs on a tumbling window trigger with a 15-minute window and built-in backfill for missed windows. Common issues: Self-hosted IR goes offline due to machine restarts, causing pipeline failures. They set up Azure Monitor alerts on IR availability. Also, if the control table is not updated correctly, they get duplicate loads; they use a stored procedure to update the watermark after successful load.
Enterprise Scenario 2: Data Transformation Using Data Flow
A retail company transforms raw sales data from Azure Blob Storage into aggregated reports. They use a Data Flow activity to join sales transactions with product catalogs, filter out returns, and compute daily sales by store. The Data Flow runs on an Azure IR with 32 cores and a TTL of 60 minutes to reduce warm-up time for near-real-time processing. The pipeline is triggered by an event trigger when a new file arrives in Blob Storage. The Data Flow writes the output to a SQL Database table. They monitor the pipeline using ADF metrics and set up a log analytics workspace for long-term retention. A common misconfiguration is not setting the TTL, causing cold-start delays of 5-10 minutes for each run. They also use the 'Skip incompatible rows' option in Copy activity for upstream data quality issues.
Enterprise Scenario 3: Orchestrating Multiple Pipelines with Execute Pipeline
A logistics company has separate pipelines for data ingestion, validation, transformation, and loading. They use an orchestrator pipeline that calls each child pipeline using Execute Pipeline activity. The child pipelines run synchronously, meaning the orchestrator waits for each to complete before starting the next. If a child pipeline fails, the orchestrator uses an If Condition to decide whether to continue or fail. They pass global parameters like business date to all pipelines. This modular approach allows separate development and testing. Performance consideration: if child pipelines have many activities, the orchestrator may hit the 40-activity limit per pipeline. They break the orchestrator into multiple layers. A common issue is circular dependency when pipeline A calls pipeline B and B calls A, which is not allowed. They enforce a strict hierarchy.
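An orchestrator of this kind could be assembled as follows, shown as Python that builds the pipeline JSON. The pipeline names, the `execute_pipeline` helper, and the use of a pipeline parameter called businessDate are assumptions made for this example; the scenario itself passes the date through global parameters.

```python
import json

def execute_pipeline(name, child, depends_on=None):
    """Build an Execute Pipeline activity that waits for its child."""
    depends = []
    if depends_on:
        depends = [{"activity": depends_on, "dependencyConditions": ["Succeeded"]}]
    return {
        "name": name,
        "type": "ExecutePipeline",
        "dependsOn": depends,
        "typeProperties": {
            "pipeline": {"referenceName": child, "type": "PipelineReference"},
            "waitOnCompletion": True,  # synchronous: wait for the child to finish
            "parameters": {
                "businessDate": {
                    "value": "@pipeline().parameters.businessDate",
                    "type": "Expression",
                }
            },
        },
    }

children = ["Ingest", "Validate", "Transform", "Load"]  # hypothetical child pipelines
activities = []
previous = None
for child in children:
    name = f"Run{child}"
    activities.append(execute_pipeline(name, child, depends_on=previous))
    previous = name

orchestrator = {
    "name": "Orchestrator",
    "properties": {
        "parameters": {"businessDate": {"type": "String"}},
        "activities": activities,
    },
}

print(json.dumps(orchestrator, indent=2))
```

Because each activity depends on the success of the previous one, the four child pipelines run strictly in order, and the orchestrator itself holds only four activities regardless of how large the children are.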
Exam Objective 3.2: Analytics
The DP-900 exam tests your understanding of Azure Data Factory activities and pipelines in the context of data integration. Specifically, you should be able to:
Distinguish between pipeline and activity.
Identify the purpose of common activities: Copy, Data Flow, Stored Procedure, If Condition, ForEach, Execute Pipeline, Wait, Validation, Web, Fail.
Understand the role of Integration Runtime (Azure, Self-hosted, Azure-SSIS).
Know the difference between schedule, tumbling window, and event triggers.
Recognize the monitoring capabilities: pipeline runs, activity runs, statuses.
Common Wrong Answers and Why Candidates Choose Them
1. "Data Flow activities run on the same Integration Runtime as Copy activities." Wrong. Data Flow requires a dedicated Azure IR with Spark compute. Copy activities can use a different Azure IR. Candidates confuse the general-purpose Azure IR with the Data Flow-specific IR.
2. "A pipeline can contain up to 100 activities." Wrong. The limit is 40 activities per pipeline. Candidates often overestimate based on other Azure service limits.
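3. "The ForEach activity runs iterations sequentially by default." Wrong. The isSequential property defaults to false, so iterations run in parallel, with batchCount (default 20, max 50) capping the concurrency. Candidates assume sequential because loops in most programming languages run one iteration at a time.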
4. "You can use the Wait activity to pause a pipeline indefinitely." Wrong. Wait has a maximum duration of 7 days (configurable timeout). Candidates think it can wait forever.
Specific Numbers and Terms That Appear on the Exam
40 activities per pipeline
1000 max retry attempts
30 seconds default retry interval
12 hours default activity timeout (7 days maximum)
4 default DIU for Copy activity
50 max batchCount for ForEach
60 minutes default TTL for Data Flow IR
3 types of IR: Azure, Self-hosted, Azure-SSIS
3 types of triggers: schedule, tumbling window, event
Edge Cases and Exceptions
Tumbling window trigger backfill: If the window is set to 15 minutes and the trigger is paused for an hour, it will automatically run 4 backfill runs when resumed.
Data Flow TTL: Setting TTL to 0 means no warm cluster; each run spins up a new cluster, increasing latency.
Execute Pipeline activity: By default, it runs synchronously. To run asynchronously, set waitOnCompletion to false.
Validation activity: Can check if a file exists or if a table has rows. It does not validate data content.
How to Eliminate Wrong Answers
Focus on the fundamental definitions: pipeline = unit of work, activity = step. If a question asks about orchestrating multiple pipelines, look for Execute Pipeline. If it asks about moving data, look for Copy. If it asks about transformation without coding, look for Data Flow. For control flow, look for If Condition, ForEach, Until. Remember that Data Flow is a transformation activity, not a movement activity. Also, Integration Runtime is the compute where activities run; a Self-hosted IR is required for on-premises data sources.
A pipeline is a logical grouping of activities; an activity is a single execution step.
Copy activity moves data; Data Flow transforms data; Control Flow activities orchestrate execution.
Maximum activities per pipeline: 40.
Default retry interval for activities: 30 seconds; default timeout: 12 hours (maximum 7 days).
ForEach runs iterations in parallel by default (batchCount default 20, max 50); set isSequential to true for one-at-a-time execution.
Data Flow requires a dedicated Azure IR with Spark; set TTL to avoid cold starts.
Three types of triggers: schedule, tumbling window, event.
Self-hosted IR is required for on-premises data sources.
Execute Pipeline runs child pipelines synchronously by default.
Use Validation activity to check dataset existence before proceeding.
These come up on the exam all the time. Here's how to tell them apart.
Copy Activity
Moves data from source to sink with optional mapping.
Uses an Integration Runtime (Azure or Self-hosted).
Supports staging via PolyBase or Copy Command.
Best for bulk data movement without transformation.
Lower cost per GB moved compared to Data Flow.
Data Flow Activity
Transforms data through a visual, code-free designer; the logic executes on Spark.
Runs only on Azure IR with Spark compute.
Can perform complex transformations like joins, aggregations, pivots.
Best for data preparation and cleansing.
Higher cost due to Spark cluster usage.
Schedule Trigger
Runs at specified times (e.g., every 5 minutes, daily at 8 AM).
No built-in retry for missed runs.
Can be used for any pipeline.
Simple to configure.
Does not support backfill.
Tumbling Window Trigger
Runs at periodic intervals from a start time (e.g., every 15 minutes).
Built-in retry and backfill for missed windows.
Commonly used for batch processing with windowing.
Supports dependency on other tumbling window triggers.
Can automatically backfill historical windows when paused and resumed.
Azure IR
Fully managed by Azure.
Used for cloud-to-cloud data movement.
No installation required.
Supports Data Flow activities.
Region can be auto-resolve or specific.
Self-hosted IR
Installed on-premises or in a VM.
Used for on-premises or network-restricted data sources.
Requires manual installation and registration.
Does not support Data Flow activities.
Can act as a proxy for an Azure-SSIS IR to reach on-premises data in hybrid scenarios.
Mistake
A pipeline must contain at least one activity.
Correct
A pipeline can have zero activities (empty pipeline). It will run and succeed immediately. This is useful for placeholder or organizational purposes.
Mistake
The Copy activity can only copy data between Azure data stores.
Correct
Copy activity supports over 100 connectors, including on-premises sources via Self-hosted IR, and SaaS sources like Salesforce, Amazon S3, Google Cloud Storage.
Mistake
Data Flow activities can be used to move data without transformation.
Correct
Data Flow is designed for transformation. For simple copy without transformation, use Copy activity, which is more performant and cost-effective.
Mistake
All activities run on the same Integration Runtime.
Correct
Each activity can use a different IR. For example, Copy activity can use Azure IR, while a Stored Procedure activity can use a Self-hosted IR if the database is on-premises.
Mistake
Triggers are required to run a pipeline.
Correct
Pipelines can be run manually (on-demand) without a trigger. Triggers are only for automated execution.
A pipeline is a logical container that groups a set of activities. An activity is a single step that performs a specific action, such as copying data, running a transformation, or executing a stored procedure. Pipelines orchestrate the execution order of activities using dependencies. Multiple activities can run in parallel if they have no dependencies. The pipeline status depends on the status of its activities.
Yes, you can manually trigger a pipeline on-demand via the Azure portal, PowerShell, CLI, or SDK. Triggers are only needed for automated scheduled or event-based execution. Manual triggers allow you to pass custom parameter values.
The maximum is 40 activities per pipeline. If you need more, consider breaking the logic into multiple pipelines and using Execute Pipeline activity to chain them.
Attach downstream activities through Failure or Completion dependency conditions so the pipeline can react to, or continue past, a failed activity. Use an If Condition activity to check the output of previous activities and take corrective actions like sending an email or invoking a Fail activity. You can also set retry policies on activities to automatically retry on transient failures.
You must use a Self-hosted Integration Runtime installed on a Windows machine in your on-premises network. It registers with ADF and provides a secure connection. For cloud-only scenarios, use Azure IR.
No, Data Flow activities require an Azure Integration Runtime with Spark compute. Self-hosted IR does not support Data Flow. Use Azure IR for Data Flow.
The default timeout is 12 hours. You can shorten it or raise it to a maximum of 7 days via the activity settings. If the activity runs longer than the timeout, it is marked as failed.