AZ-305 | Chapter 54 of 103 | Objective 2.4

Azure Data Factory for Data Integration Design

This chapter covers Azure Data Factory (ADF) as the primary data integration and orchestration service in Azure, focusing on its role in designing scalable data pipelines. For the AZ-305 exam, understanding ADF's capabilities for data movement, transformation, and scheduling is critical because it appears in roughly 10-15% of questions related to Data Storage and Analytics objectives. This chapter will equip you with the technical depth needed to design ADF solutions that meet enterprise data integration requirements, including hybrid scenarios, security, and performance optimization.

25 min read
Intermediate
Updated May 31, 2026

The Automated Assembly Line for Data

Think of Azure Data Factory (ADF) as an automated assembly line in a large manufacturing plant. Raw materials (data) arrive from various sources: some come on conveyor belts from suppliers (on-premises databases), others from trucks (cloud storage), and some from smaller internal workshops (SaaS applications). The assembly line has multiple stations (activities) that process the materials: one station cleans, sorts, and reshapes them (Mapping Data Flow), another hands them to specialized machinery (external compute such as a Databricks notebook), and a third packages and ships the final product (Copy Activity). The entire line is controlled by a programmable logic controller (pipeline) that dictates the sequence and timing of each station. A central control room (Azure Data Factory portal) monitors the entire operation, showing real-time status, alerts, and logs. The line can be scheduled to run daily at midnight (schedule trigger), or it can start automatically when a new shipment arrives (event trigger). If a station jams, the line can retry the operation up to three times (retry policy). The key insight is that ADF orchestrates these activities in a specific order, handling dependencies, retries, and errors, just as an assembly line coordinates physical steps. The data never leaves the factory floor until the final product is shipped to its destination.

How It Actually Works

What is Azure Data Factory and Why Does It Exist?

Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create, schedule, and orchestrate data workflows (pipelines) at scale. It was designed to solve the challenge of moving and transforming data across heterogeneous sources—on-premises, cloud, and SaaS—without writing complex code. ADF provides a code-free visual interface for building pipelines, but also supports code-first development via SDKs and Git integration.

How It Works Internally: The Mechanism

ADF operates on a set of core concepts: pipelines, activities, datasets, linked services, integration runtimes (IR), and triggers.

- Pipelines: A logical group of activities that perform a unit of work. Activities can be chained with dependencies (e.g., success, failure, completion).
- Activities: The actual work performed, such as Copy Data, Data Flow, Execute Pipeline, or a Custom activity that runs your own code.
- Datasets: Named views of data that point to the data you want to use as input or output. They reference linked services and define structure (e.g., table name, file path).
- Linked Services: Connection definitions that describe how to connect to external resources (e.g., SQL Server, Blob Storage, REST API). Their credentials are stored securely in Azure Key Vault or encrypted in ADF.
- Integration Runtime (IR): The compute infrastructure used by ADF to perform data movement, activity dispatch, and SSIS package execution. There are three types:
  - Azure IR: Fully managed, scales automatically, used for cloud-to-cloud operations.
  - Self-hosted IR: Installed on-premises or in a VM, used for hybrid scenarios (e.g., connecting to on-premises databases).
  - Azure-SSIS IR: A dedicated cluster for running SQL Server Integration Services (SSIS) packages in the cloud.
- Triggers: Define when a pipeline runs. Types include:
  - Schedule trigger: Runs at specified intervals (e.g., every 5 minutes, daily at 2 AM).
  - Tumbling window trigger: Runs at fixed intervals from a start time, with stateful execution (e.g., every hour, processing data for that hour).
  - Event trigger: Runs when a blob is created or deleted in Azure Blob Storage (storage event trigger) or when a custom event is published to Azure Event Grid (custom event trigger).
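To make the relationships between these concepts concrete, the following is a minimal Python sketch that assembles the JSON for one linked service, two datasets, and a pipeline with a single Copy activity. Every name in it (`BlobStore`, `RawSales`, `CuratedSales`, `CopySalesPipeline`, the `sales` container) is hypothetical, and the property layout follows the general shape of ADF's JSON definitions rather than a complete, deployable schema.

```python
import json


def reference(name, kind):
    """Build the reference object ADF JSON uses to point at another entity."""
    return {"referenceName": name, "type": kind}


# Linked service: HOW to connect (connection details deliberately left empty).
linked_service = {
    "name": "BlobStore",  # hypothetical name
    "properties": {"type": "AzureBlobStorage", "typeProperties": {}},
}


def blob_dataset(name, fmt, folder):
    """Dataset: WHICH data to use, resolved through the linked service."""
    return {
        "name": name,
        "properties": {
            "type": fmt,
            "linkedServiceName": reference(linked_service["name"], "LinkedServiceReference"),
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "sales",  # hypothetical container
                    "folderPath": folder,
                }
            },
        },
    }


source = blob_dataset("RawSales", "DelimitedText", "raw")
sink = blob_dataset("CuratedSales", "Parquet", "curated")

# Pipeline: groups activities; each activity points at datasets by reference.
pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyRawToCurated",
                "type": "Copy",
                "inputs": [reference(source["name"], "DatasetReference")],
                "outputs": [reference(sink["name"], "DatasetReference")],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The point of the sketch is the chain of references: the activity names datasets, each dataset names a linked service, and only the linked service knows how to connect. A file produced this way could serve as the `pipeline.json` input used with the CLI later in this chapter, once the definitions are completed for a real factory.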

Key Components, Values, Defaults, and Timers

Copy Activity: The most common activity. It can copy data between over 90 connectors. Default parallelism: 4 parallel copies per activity (configurable). Default retry: 0 (can be set up to 1000). Timeout: 7 days (max).

Mapping Data Flow: Visual data transformation using Spark clusters. Each data flow runs on a Spark cluster with configurable compute size (default: 8 cores). Time-to-live (TTL) for clusters: 60 minutes (can be set to 0 for immediate shutdown).

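Pipeline runs: Each pipeline run has a run ID and a status (Queued, InProgress, Succeeded, Failed, Cancelled). Concurrent pipeline runs per factory are capped by a service limit (10,000 in Microsoft's published limits, shared among all pipelines in the factory); an individual pipeline can be capped lower with its concurrency property.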

Activity runs: Each activity within a pipeline has its own run ID and status. Activity retry interval: default 30 seconds (configurable).

Linked Service: Supports connection strings, service principals, managed identities. For Azure SQL Database, default authentication is SQL authentication (user/password).

Self-hosted IR: Minimum version: 5.0 (as of 2025). Auto-update enabled by default. Heartbeat interval: 30 seconds. Maximum concurrent jobs per node: 100 (default).
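The retry, retry interval, timeout, and parallel copy values above end up as a small fragment of the activity's JSON. The sketch below is one way to build that fragment in Python; the helper names `to_adf_timespan` and `copy_activity_settings` are hypothetical, and the defaults simply mirror the figures quoted in this chapter rather than values read from a live service.

```python
from datetime import timedelta


def to_adf_timespan(delta):
    """Format a timedelta as d.hh:mm:ss, the timespan notation used for activity timeouts."""
    total_seconds = int(delta.total_seconds())
    days, remainder = divmod(total_seconds, 86_400)
    hours, remainder = divmod(remainder, 3_600)
    minutes, seconds = divmod(remainder, 60)
    return f"{days}.{hours:02d}:{minutes:02d}:{seconds:02d}"


def copy_activity_settings(retries=0, retry_interval_seconds=30,
                           timeout=timedelta(days=7), parallel_copies=None):
    """Return the policy and tuning fragments of a Copy activity definition.

    Defaults follow the numbers quoted in this chapter (assumption, not queried).
    """
    if retries < 0:
        raise ValueError("retries must be zero or greater")
    settings = {
        "policy": {
            "timeout": to_adf_timespan(timeout),
            "retry": retries,
            "retryIntervalInSeconds": retry_interval_seconds,
        },
        "typeProperties": {},
    }
    # Only emit parallelCopies when the caller overrides it explicitly.
    if parallel_copies is not None:
        settings["typeProperties"]["parallelCopies"] = parallel_copies
    return settings


defaults = copy_activity_settings()
tuned = copy_activity_settings(retries=3, parallel_copies=10, timeout=timedelta(hours=2))
print(defaults["policy"])
print(tuned)
```

Leaving `parallelCopies` out unless it is set explicitly keeps the definition honest about which values are deliberate overrides.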

Configuration and Verification Commands

To create a pipeline using Azure CLI:

az datafactory pipeline create --factory-name MyFactory \
    --resource-group MyRG \
    --name MyPipeline \
    --pipeline @pipeline.json

To trigger a pipeline run:

az datafactory pipeline create-run --factory-name MyFactory \
    --resource-group MyRG \
    --name MyPipeline

To monitor pipeline runs:

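# The time window is required; the timestamps below are example values.
az datafactory pipeline-run query-by-factory --factory-name MyFactory \
    --resource-group MyRG \
    --last-updated-after "2026-05-01T00:00:00Z" \
    --last-updated-before "2026-05-31T00:00:00Z" \
    --filters operand="Status" operator="Equals" values="Failed"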

How It Interacts with Related Technologies

Azure Key Vault: ADF integrates tightly with Key Vault to store credentials. When creating a linked service, you can reference a secret from Key Vault instead of hardcoding the password. This is a best practice for security.
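As a rough illustration of the Key Vault pattern, the Python sketch below builds an Azure SQL Database linked service whose password is a Key Vault secret reference rather than a literal. The names `CorpKeyVault`, `SalesSqlDb`, and `sales-sql-password` are hypothetical, and the connection string uses placeholders instead of real values.

```python
import json


def key_vault_secret(vault_linked_service, secret_name):
    """Reference a secret through a Key Vault linked service instead of inlining it."""
    return {
        "type": "AzureKeyVaultSecret",
        "store": {"referenceName": vault_linked_service, "type": "LinkedServiceReference"},
        "secretName": secret_name,
    }


sql_linked_service = {
    "name": "SalesSqlDb",  # hypothetical name
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            # Placeholders only: the password is intentionally absent from this string.
            "connectionString": "Server=tcp:<server>.database.windows.net;Database=<database>;User ID=<user>",
            "password": key_vault_secret("CorpKeyVault", "sales-sql-password"),
        },
    },
}

print(json.dumps(sql_linked_service, indent=2))
```

Because the definition holds only the secret's name, rotating the password in Key Vault does not require editing or republishing the linked service.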

Azure Monitor & Log Analytics: ADF sends diagnostic logs to Azure Monitor. You can set up alerts for failed pipeline runs, long-running activities, or specific error codes.

Azure DevOps & GitHub: ADF supports Git integration for source control. You can connect a factory to a Git repository, which enables CI/CD pipelines. A common deployment approach is to release the ARM templates that ADF generates on publish through an ARM template deployment task in Azure DevOps.

Azure Functions: ADF can call Azure Functions via the Azure Function activity, allowing custom C# or JavaScript code to run as part of a pipeline.

Azure Databricks: ADF can trigger Databricks notebooks as a pipeline activity, combining orchestration with advanced analytics.

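Power BI: ADF has no native Power BI sink. Pipelines typically load data into a store that Power BI reads (for example, Azure SQL Database or Azure Synapse Analytics) and can then call the Power BI REST API from a Web activity to trigger a dataset refresh.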

Performance Considerations

Copy Activity Performance: For large data volumes, use staging copy (via Blob Storage) to reduce load on source/destination. Enable parallel copies and tune the parallelCopies property. For on-premises sources, ensure the self-hosted IR has enough memory and CPU.
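The staging and parallelism settings mentioned above are plain properties on the Copy activity. The sketch below shows roughly where they sit; the function name `staged_copy_properties`, the `StagingBlob` linked service, and the `staging/sales` path are hypothetical, and the source and sink types are chosen only to echo a SQL Server to Synapse load.

```python
def staged_copy_properties(staging_linked_service, staging_path, parallel_copies):
    """Return Copy activity typeProperties with staged copy and explicit parallelism."""
    if parallel_copies < 1:
        raise ValueError("parallel_copies must be at least 1")
    return {
        "source": {"type": "SqlServerSource"},
        "sink": {"type": "SqlDWSink"},
        # Route data through interim Blob Storage instead of copying directly.
        "enableStaging": True,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": staging_linked_service,
                "type": "LinkedServiceReference",
            },
            "path": staging_path,
        },
        # Explicit override of the degree of parallelism for this activity.
        "parallelCopies": parallel_copies,
    }


properties = staged_copy_properties("StagingBlob", "staging/sales", parallel_copies=10)
print(properties["stagingSettings"]["path"], properties["parallelCopies"])
```

Treat the value of `parallelCopies` as something to tune against measured throughput: raising it helps only while the source, sink, and integration runtime still have headroom.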

Data Flow Performance: Choose the correct compute size (General Purpose vs. Memory Optimized). Use optimized Spark partitioning (e.g., repartitioning after a filter). Avoid wide transformations that cause shuffling of large datasets.

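Pipeline Concurrency: Large numbers of parallel pipeline runs can hit factory-level or subscription-level service limits (Microsoft's published limit is 10,000 concurrent pipeline runs per factory). If you approach a limit, cap noisy pipelines with the concurrency property or distribute workloads across multiple factories.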

Security

Authentication: Use managed identities for Azure resources where possible. For on-premises, use Windows authentication or SQL authentication via self-hosted IR.

Encryption: Data in transit is encrypted using TLS 1.2 (minimum). Data at rest is encrypted using Azure Storage encryption.

Network Isolation: Use Azure Private Link to connect ADF to your VNet, preventing data from traversing the public internet. For self-hosted IR, use a VPN or ExpressRoute.

Monitoring and Alerts

ADF Monitor: Built-in monitoring in the Azure portal shows pipeline runs, activity runs, and trigger runs. You can filter by status, time range, and annotations.

Alert Rules: Create alerts based on metrics like 'Failed Pipeline Runs Count' (threshold > 0) or 'Data Flow Duration' (threshold > 30 minutes).

Diagnostic Settings: Send logs to Log Analytics for advanced querying. Example KQL query that counts failed activity runs by pipeline and activity:

ADFActivityRun
| where Status == 'Failed'
| summarize count() by PipelineName, ActivityName

Walk-Through

1. Define Linked Services

First, create linked services for each data source and destination. For example, a linked service to Azure SQL Database requires the server name, database name, and authentication method (e.g., SQL auth or managed identity). The connection string is stored encrypted in ADF or retrieved from Key Vault. The self-hosted IR is selected if the source is on-premises. This step is critical because without correct linked services, no data movement can occur.

2. Create Datasets

Datasets represent the structure of the data within the linked service. For a SQL table, you specify the table name. For a Blob Storage file, you specify the container, folder path, and file format (e.g., CSV, Parquet). Datasets are referenced by activities as inputs or outputs. ADF uses datasets to determine the schema and location of data.

3. Build Pipeline with Activities

Create a pipeline and add activities. For a simple copy, add a Copy activity, set the source dataset and sink dataset. Configure mapping if the schemas differ. For transformations, add a Data Flow activity, define the data flow with source, transformation steps (e.g., select, filter, aggregate), and sink. Activities can be chained: for example, after a copy completes successfully, trigger a stored procedure activity.
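Activity chaining is expressed through each activity's `dependsOn` list. The Python sketch below simulates how dependency conditions decide what runs after a copy; it is a simplified model (it assumes activities are listed in dependency order and treats conditions on the same upstream activity as alternatives), and the activity names and the `plan_run` helper are hypothetical.

```python
def condition_met(upstream_status, conditions):
    """True if the upstream outcome satisfies any of the listed dependency conditions."""
    if upstream_status in conditions:
        return True
    # "Completed" means the upstream activity finished, whether it succeeded or failed.
    return "Completed" in conditions and upstream_status in ("Succeeded", "Failed")


def plan_run(activities, outcomes):
    """Return the final status of each activity given simulated outcomes.

    An activity runs only when every upstream dependency is satisfied;
    otherwise it is marked "Skipped".
    """
    status = {}
    for activity in activities:
        dependencies = activity.get("dependsOn", [])
        runnable = all(
            condition_met(status[dep["activity"]], dep["dependencyConditions"])
            for dep in dependencies
        )
        status[activity["name"]] = outcomes[activity["name"]] if runnable else "Skipped"
    return status


pipeline_activities = [
    {"name": "CopySales", "type": "Copy"},
    {"name": "LogLoad", "type": "SqlServerStoredProcedure",
     "dependsOn": [{"activity": "CopySales", "dependencyConditions": ["Succeeded"]}]},
    {"name": "NotifyFailure", "type": "WebActivity",
     "dependsOn": [{"activity": "CopySales", "dependencyConditions": ["Failed"]}]},
    {"name": "Cleanup", "type": "Delete",
     "dependsOn": [{"activity": "CopySales", "dependencyConditions": ["Completed"]}]},
]

all_ok = {name: "Succeeded" for name in ("CopySales", "LogLoad", "NotifyFailure", "Cleanup")}
copy_fails = dict(all_ok, CopySales="Failed")

print(plan_run(pipeline_activities, all_ok))
print(plan_run(pipeline_activities, copy_fails))
```

In the success case the stored procedure runs and the failure notification is skipped; when the copy fails the roles reverse, while the cleanup step runs either way.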

4. Configure Triggers

Assign a trigger to the pipeline. Choose schedule trigger for periodic runs (e.g., every 15 minutes). For event-driven runs, use event trigger (e.g., when a new CSV file lands in Blob Storage). Tumbling window triggers are used for windowed processing (e.g., process data for the last hour every hour). Triggers can be paused and resumed.
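What separates a tumbling window trigger from a schedule trigger is that each run owns one fixed, non-overlapping slice of time. The Python sketch below computes those slices for an hourly window; the `tumbling_windows` helper and the dates are illustrative only and do not call any ADF API.

```python
from datetime import datetime, timedelta, timezone


def tumbling_windows(start, interval, now):
    """Return (window_start, window_end) pairs for every window that has fully elapsed."""
    windows = []
    window_start = start
    while window_start + interval <= now:
        windows.append((window_start, window_start + interval))
        window_start += interval
    return windows


start = datetime(2026, 5, 31, 0, 0, tzinfo=timezone.utc)
now = datetime(2026, 5, 31, 3, 30, tzinfo=timezone.utc)

for window_start, window_end in tumbling_windows(start, timedelta(hours=1), now):
    # Each pair would be handed to one pipeline run as its processing range.
    print(window_start.isoformat(), "->", window_end.isoformat())
```

In ADF itself the window bounds are available to the pipeline through the trigger outputs `@trigger().outputs.windowStartTime` and `@trigger().outputs.windowEndTime`, which are usually passed in as pipeline parameters so that a retried run reprocesses exactly the same slice.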

5. Publish and Monitor

Publish the pipeline to save changes. Then monitor runs in the ADF Monitor. Check pipeline run status, activity run details, and error messages. For failed runs, use the error output to debug. You can also set up alerts via Azure Monitor to notify on failures. Use the 'rerun' option to rerun a failed pipeline from the beginning or from a specific activity.

What This Looks Like on the Job

Enterprise Scenario 1: Hybrid Data Integration for Retail

A global retailer needs to consolidate sales data from thousands of on-premises SQL Server databases in stores worldwide into Azure Synapse Analytics for reporting. The challenge: each store has a different schema version, and data must be cleansed and transformed before loading. ADF is deployed with a self-hosted IR installed on a VM in each region (e.g., North America, Europe, Asia) to handle the on-premises connections. The pipeline uses a Copy activity to stage raw data into Azure Blob Storage (Parquet format), then a Mapping Data Flow to apply schema mapping, remove duplicates, and aggregate sales by product. The pipeline runs on a tumbling window trigger every hour. In production, the self-hosted IR nodes are scaled to 8 cores each to handle peak loads. A common misconfiguration is not enabling parallel copies, causing slow data movement. The solution is to set parallelCopies to 10 and use staging copy to reduce load on the source SQL Server.

Enterprise Scenario 2: Real-Time IoT Data Ingestion

A manufacturing company streams sensor data from factory equipment to Azure Event Hubs. They need to process this data in near real-time and load it into Azure Data Lake Storage Gen2 for analytics. The stream is first landed as blobs in a landing zone (Blob Storage), and ADF uses an event trigger that fires every time a new blob is created there. The pipeline runs a Data Flow to parse JSON sensor data, filter out anomalies, and write to Delta Lake format. The challenge is handling high throughput (thousands of blobs per minute). The solution uses multiple pipelines with different triggers, and the self-hosted IR is not needed because all sources are cloud-based. Performance tuning involves increasing the data flow compute size to 16 cores and setting the cluster TTL to 10 minutes to avoid cold starts. A common pitfall is starting one pipeline run per blob at this rate, which can run into concurrency limits; batching several blobs per run or distributing pipelines across multiple factories relieves the pressure.

Enterprise Scenario 3: SaaS Data Integration for Marketing

A marketing agency uses Salesforce, Marketo, and Google Analytics. They need to pull data daily into Azure SQL Database for cross-channel analysis. ADF provides built-in connectors for Salesforce and Marketo. The pipeline uses a Copy activity to extract data from Salesforce (using the Salesforce connector) and Marketo (using its connector or a REST linked service). The data is then transformed using a Data Flow to join and aggregate. The schedule trigger runs at 2 AM daily to avoid business hours. The key challenge is handling API rate limits from SaaS providers. ADF's retry policy is configured with 3 retries and a 30-second interval. Additionally, an incremental copy pattern (filtering on a last-modified watermark) is used to copy only new records, reducing API calls. A common mistake is not setting the query property on the Salesforce source to filter only recent records, which extracts every record on each run and hits API limits.

How AZ-305 Actually Tests This

AZ-305 Exam Focus on Azure Data Factory

The AZ-305 exam tests your ability to recommend ADF for data integration scenarios as part of its "Design data storage solutions" skill area. The relevant objective is "Design data integration" under Domain 2: Data Storage. Specifically, you need to understand when to use ADF vs. other services (e.g., Azure Logic Apps, Azure Data Lake Analytics, SSIS).

Common Wrong Answers and Why Candidates Choose Them

1. Choosing Azure Logic Apps over ADF for complex data transformations: Candidates often pick Logic Apps because it's simpler for small workflows. However, Logic Apps is designed for business process automation, not large-scale data movement. ADF supports more than 90 connectors, parallel copies, and Spark-based transformations. The exam will test that ADF is for data integration and Logic Apps is for business workflow automation.

2. Selecting Azure Data Lake Analytics (ADLA) for ETL: ADLA is for big data analytics using U-SQL, not for ETL orchestration. Candidates confuse it with ADF. The correct answer is ADF for orchestration, possibly using Data Lake Storage as a destination.

3. Using self-hosted IR for cloud-only scenarios: Candidates may think self-hosted IR is needed for all data movement. Actually, for cloud-to-cloud, Azure IR is sufficient. Self-hosted IR is only required when the source or destination is on-premises or in a VNet without Private Link.

4. Forgetting to configure staging copy for cross-region data movement: Candidates may assume direct copy between regions works efficiently. ADF can use staging copy (via Blob Storage) to improve performance and avoid throttling. The exam may ask how to optimize cross-region copy.

Specific Numbers and Terms to Memorize

Default parallel copies: 4 (range 1-256).

Maximum retry count: 1000.

Activity timeout: 7 days.

Self-hosted IR heartbeat interval: 30 seconds.

Concurrent pipeline runs per factory: 10,000 (published service limit).

Data flow TTL: 60 minutes (default).

Supported connectors: Over 90.

Edge Cases and Exceptions

Incremental copy: ADF supports incremental copy using watermark columns or change tracking. The exam may test that you can use a Lookup activity to get the last watermark, then a Copy activity with a query using that watermark.
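The watermark pattern can be simulated in a few lines of Python. This is only a model of the logic, not ADF code: the `incremental_copy` helper, the `modified_at` column, and the sample rows are hypothetical, and ISO 8601 strings are used so that plain string comparison orders the timestamps correctly.

```python
def incremental_copy(source_rows, sink_rows, watermark, column="modified_at"):
    """Copy only rows changed since the stored watermark, then advance the watermark."""
    old_mark = watermark["value"]                                         # Lookup: last watermark
    new_mark = max([old_mark] + [row[column] for row in source_rows])     # Lookup: current maximum
    changed = [row for row in source_rows if old_mark < row[column] <= new_mark]  # Copy with filter
    sink_rows.extend(changed)
    watermark["value"] = new_mark                                         # Persist the new watermark
    return len(changed)


source = [
    {"id": 1, "modified_at": "2026-05-01T08:00:00"},
    {"id": 2, "modified_at": "2026-05-02T09:30:00"},
]
sink = []
watermark = {"value": "1900-01-01T00:00:00"}

first = incremental_copy(source, sink, watermark)    # copies both existing rows
second = incremental_copy(source, sink, watermark)   # nothing new to copy
source.append({"id": 3, "modified_at": "2026-05-03T10:00:00"})
third = incremental_copy(source, sink, watermark)    # copies only the new row

print(first, second, third, watermark["value"])
```

In a pipeline, the two lookups correspond to Lookup activities, the filtered list to a Copy activity whose source query is bounded by both watermark values, and the final assignment to a step that writes the new watermark back, commonly a stored procedure.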

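Event trigger limitations: Storage event triggers respond only to blob created and blob deleted events in Azure Storage, and custom event triggers respond to events published to an Azure Event Grid custom topic. Other sources have no native event trigger, so use schedule triggers for them.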

Self-hosted IR scaling: You can add multiple nodes for high availability and scaling. The exam may ask about installing the self-hosted IR on an Azure VM in the same region as the data source.

Data Flow vs. Copy Activity: Data Flow is for transformations; Copy Activity is for data movement. The exam will test when to use each. For simple copy with no transformation, use Copy Activity.

How to Eliminate Wrong Answers

1. Identify the requirement: If the question asks for "orchestrating data workflows" or "ETL/ELT", the answer is likely ADF. If it's about "connecting SaaS apps" or "simple automation", consider Logic Apps.

2. Check the data volume: For large volumes (>1 TB), ADF with parallel copies is appropriate. For small volumes, Logic Apps may suffice.

3. Look for hybrid: If the scenario involves on-premises data, the answer must include self-hosted IR. If all cloud, use Azure IR.

4. Security requirements: If the question mentions "managed identity" or "Key Vault", ADF supports these natively. If it mentions "private endpoints", use Azure Private Link with ADF.
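The integration runtime rule from step 3 can be written down as a tiny decision function. This is a study aid that encodes only the distinctions drawn in this chapter; the function name and its parameters are hypothetical, and real designs weigh more factors (for example, managed virtual networks and private endpoints).

```python
def choose_integration_runtime(on_premises=False,
                               private_network_without_private_link=False,
                               runs_ssis_packages=False):
    """Pick an integration runtime type using the rules of thumb from this chapter."""
    if runs_ssis_packages:
        # Lift-and-shift of existing SSIS packages needs the dedicated SSIS runtime.
        return "Azure-SSIS IR"
    if on_premises or private_network_without_private_link:
        # The data store is not reachable from the managed Azure IR.
        return "Self-hosted IR"
    # Cloud-to-cloud movement over reachable endpoints.
    return "Azure IR"


print(choose_integration_runtime())                        # cloud-only scenario
print(choose_integration_runtime(on_premises=True))        # hybrid scenario
print(choose_integration_runtime(runs_ssis_packages=True)) # SSIS migration
```

When reading an exam scenario, the same three questions apply in the same order: are SSIS packages involved, is any store on-premises or otherwise unreachable, and if neither, is everything in the cloud.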

Key Takeaways

ADF is the primary data integration service in Azure for ETL/ELT workflows.

Use self-hosted IR for on-premises or VNet-connected data sources.

Copy Activity is for data movement; Mapping Data Flow is for transformations.

Default parallel copies per activity: 4; maximum retry count: 1000.

Triggers: Schedule, Tumbling Window, and Event (storage blob create/delete, or custom Event Grid events).

Data Flow runs on Spark clusters with configurable TTL (default 60 min).

Secure credentials using Azure Key Vault linked services.

Monitor pipeline runs via Azure Monitor and set alerts on failures.

For cross-region copy, use staging copy via Blob Storage.

ADF supports Git integration for CI/CD (Azure DevOps, GitHub).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Data Factory (ADF)

Designed for large-scale data integration and ETL/ELT.

Supports over 90 connectors, including on-premises via self-hosted IR.

Activities can run in parallel with configurable concurrency.

Built-in monitoring and alerting for pipeline runs.

Supports code-first development (SDKs, Git).

Azure Logic Apps

Designed for business process automation and small workflows.

Supports 400+ connectors, but limited for large data volumes.

Workflow steps run sequentially by default; parallelism limited.

Monitoring via Azure Monitor, but less granular for data movement.

Primarily designer-based, limited code integration.

Watch Out for These

Mistake

Azure Data Factory can only copy data between Azure services.

Correct

ADF supports over 90 connectors including on-premises databases (SQL Server, Oracle), SaaS apps (Salesforce, Dynamics 365), and other clouds (Amazon S3, Google BigQuery). Use self-hosted IR for on-premises sources.

Mistake

Data Flow activities run on the same compute as Copy activities.

Correct

Copy activities run on the integration runtime's data movement compute (for example, the Azure IR), while Data Flow activities run on separate Apache Spark clusters that ADF manages for you. You can configure the data flow cluster size and TTL independently.

Mistake

ADF triggers can execute pipelines immediately without scheduling.

Correct

Triggers define when pipelines run. To run a pipeline immediately, use 'Trigger Now' in the portal or the `create-run` CLI command. Triggers are not required for manual runs.

Mistake

Self-hosted IR can be installed on either Windows or Linux.

Correct

Self-hosted IR is supported only on 64-bit Windows. To reach data sources that run on Linux hosts, install the self-hosted IR on a Windows machine that has network access to them.

Mistake

ADF cannot handle real-time data streaming.

Correct

ADF is primarily for batch and near-real-time (minutes) processing. For sub-second streaming, use Azure Stream Analytics or Event Hubs with Azure Functions. ADF can trigger on blob creation events for near-real-time.


Frequently Asked Questions

What is the difference between Copy Activity and Data Flow in Azure Data Factory?

Copy Activity is used for moving data between sources and destinations with optional schema mapping. It does not perform complex transformations. Data Flow (Mapping Data Flow) is a visual transformation tool that runs on Spark clusters, allowing you to perform joins, aggregations, and pivots. Use Copy Activity for simple data movement, and Data Flow for complex transformations. On the exam, if the scenario requires 'transforming data' or 'cleansing', think Data Flow.

When should I use self-hosted integration runtime instead of Azure integration runtime?

Use self-hosted IR when your data source or destination is on-premises (e.g., SQL Server in a corporate network) or in a VNet that is not accessible via public endpoints. Azure IR is used for cloud-to-cloud operations. If you need to connect to an Azure VM that is not exposed to the internet, you can also use self-hosted IR installed on that VM. The exam will test this distinction in hybrid scenarios.

How can I schedule a pipeline to run every hour in Azure Data Factory?

Create a schedule trigger with recurrence set to '1 Hour'. You can also specify start time and time zone. For windowed processing (e.g., process data for the last hour), use a tumbling window trigger. The tumbling window trigger ensures that each run processes a specific time window and handles state (e.g., if a run fails, it can be retried for that window).

What is the maximum number of concurrent pipeline runs in a single Azure Data Factory?

Microsoft's published service limit is 10,000 concurrent pipeline runs per factory, shared among all pipelines in that factory. If many parallel triggers push you toward this or other service limits, cap individual pipelines with the concurrency property or distribute pipelines across multiple factories.

Can Azure Data Factory connect to Amazon S3 or Google Cloud Storage?

Yes, ADF has built-in connectors for Amazon S3 and Google Cloud Storage. For Amazon S3, you need to provide an access key ID and secret access key. For Google Cloud Storage, the connector uses the S3-compatible interoperability API, so you provide an HMAC access key ID and secret generated for a service account. These connectors allow you to copy data from these clouds into Azure, enabling multi-cloud data integration.

How do I monitor failed pipeline runs and set up alerts?

In ADF Monitor, you can view failed runs and error messages. To set up alerts, go to Azure Monitor > Alerts > Create alert rule. Select ADF as the resource, choose a signal like 'Failed Pipeline Runs Count' (metric), set a threshold (e.g., >0), and define an action group (e.g., email, SMS). You can also send logs to Log Analytics for advanced querying.

What is the purpose of staging copy in Azure Data Factory?

Staging copy is used when copying data between two databases (e.g., SQL Server to SQL Database) or across regions. It temporarily stores data in Azure Blob Storage, which reduces load on the source and destination, improves performance, and allows for parallel copying. Enable it by setting enableStaging to true and providing stagingSettings (the staging linked service and path) in the Copy activity.
