AZ-305Chapter 7 of 103Objective 2.4

Designing Data Integration Solutions

This chapter covers designing data integration solutions in Azure, focusing on Azure Data Factory (ADF), Azure Synapse Pipelines, and related services like Azure Databricks and Azure Logic Apps. For the AZ-305 exam, data integration is a core part of the Data Storage domain (objective 2.4), and questions typically account for 10-15% of the exam. You will need to choose the right integration tool based on scenarios involving on-premises/cloud hybrid, real-time vs. batch, code-free vs. code-heavy, and orchestration vs. transformation requirements. We'll dive deep into the mechanics of ADF, its components, pricing models, security, and monitoring, along with comparisons to other services.

25 min read
Intermediate
Updated May 31, 2026

Azure Data Factory as a Package Delivery Hub

Think of Azure Data Factory as a global package delivery hub like FedEx. You have various senders (on-premises SQL Server, Azure Blob Storage, Salesforce) needing to send packages (data) to various recipients (Azure Synapse, Azure SQL Database, Power BI). The hub receives packages at its 'landing dock' (Copy Activity) and then sorts them onto conveyor belts (pipelines) that move through sorting machines (data flows) where packages can be combined, filtered, or repackaged. Each package has a tracking label (metadata) that records its source, destination, and any transformations applied. The hub operates on a schedule (triggers) and can handle millions of packages per day at scale. If a package is damaged (data error), it goes to a 'damaged goods' bin (error handling) for review. Importantly, the hub does not store packages long-term — it only moves them. This mirrors ADF's design: it is a serverless orchestration service that copies and transforms data between supported sources and sinks without storing the data itself, using linked services (address books), datasets (package types), and activities (machines).

How It Actually Works

What is Azure Data Factory and Why Does It Exist?

Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating and automating data movement and data transformation. It was built to solve the problem of moving data between diverse sources and destinations in a scalable, reliable, and cost-effective manner, especially in hybrid environments where data may reside on-premises, in other clouds, or in various Azure services. ADF is serverless, meaning you don't provision any infrastructure; it scales automatically based on workload.

Core Components of Azure Data Factory

Pipeline: A logical grouping of activities that perform a unit of work. Pipelines can be scheduled or triggered by events. Each pipeline is a set of steps (activities) executed in sequence or in parallel.

Activity: The smallest unit of work in a pipeline. There are three main types: data movement activities (Copy Data), data transformation activities (Data Flow, Stored Procedure, etc.), and control activities (ForEach, If Condition, Wait, etc.).

Dataset: A named view of data that points to the data you want to use in an activity as input or output. Datasets reference linked services and define the structure (e.g., table name, file path).

Linked Service: Essentially a connection string that defines the connection information to an external resource (SQL Server, Blob Storage, etc.). It's like a data source adapter.

Integration Runtime (IR): The compute infrastructure used by ADF to perform activities. There are three types: Azure IR (for cloud data movement and dispatch), Self-hosted IR (for on-premises or network-restricted environments), and Azure-SSIS IR (for running SQL Server Integration Services packages).

Trigger: Defines when a pipeline run is executed. Types include schedule trigger, tumbling window trigger, storage event trigger (e.g., blob created), and custom event trigger.

How Data Movement Works: The Copy Activity

The Copy Activity is the workhorse for data ingestion. It copies data from a source to a sink. Under the hood, the Copy Activity uses the ADF copy engine, which can read from a source, optionally apply transformations (e.g., column mapping, partitioning), and write to a sink. The engine can use parallel copies (up to 256 for Azure Blob to Azure Synapse) to improve throughput. The default copy behavior is to copy as-is, but you can specify column mappings and use staging for large copies (e.g., copy from on-premises to Azure Synapse via Blob staging).

Data Flows: Code-Free Transformations

Mapping Data Flows allow you to transform data at scale without writing code. They are visually designed transformation pipelines that run on Spark clusters managed by ADF. Data Flows support transformations like Aggregate, Join, Lookup, Derived Column, Pivot, and more. They are billed based on the number of cores used per hour (minimum 8 cores, but you can scale up). Data Flows are ideal for ETL (Extract, Transform, Load) scenarios where you need to clean and reshape data before loading into a data warehouse.

Orchestration vs. Transformation

A key distinction: ADF is primarily an orchestration service. It orchestrates activities that may run on other services (e.g., Azure Databricks, HDInsight, SQL Server). For heavy transformations, you can use Data Flows or call external compute. This is different from Azure Databricks, which is a compute engine for data engineering and data science, or Azure Synapse Pipelines, which are essentially the same as ADF but integrated into Synapse Studio.

Integration Runtime Deep Dive

The Integration Runtime (IR) is the backbone of connectivity. The Azure IR is fully managed and can access Azure endpoints over the public internet. The Self-hosted IR is installed on a Windows machine (on-premises or Azure VM) and acts as a proxy to access on-premises data sources. It must have outbound internet access to Azure, and it communicates over HTTPS (port 443). The Azure-SSIS IR is a dedicated cluster of Azure VMs that run SSIS packages; it can be scaled up to 100 nodes and supports custom setup scripts.

Security and Authentication

ADF supports various authentication methods: Azure AD managed identity, service principal, access keys, SQL authentication, and OAuth. For on-premises sources, the Self-hosted IR uses Windows authentication or SQL authentication. Data is encrypted in transit using TLS (minimum 1.2). For storage accounts, you can use firewall rules and virtual network service endpoints to restrict access. ADF can also use Azure Key Vault to store secrets (e.g., connection strings, passwords) and reference them in linked services.

Monitoring and Alerts

ADF provides built-in monitoring in the Azure portal and via Azure Monitor. You can view pipeline runs, activity runs, and trigger runs. For each run, you can see duration, status, and error details. You can also set up alerts on failed runs or long-running pipelines. The monitoring experience includes a visual timeline and the ability to rerun failed activities. For advanced monitoring, you can send logs to Log Analytics and create custom dashboards.

Pricing Model

ADF pricing is based on the number of operations (pipeline runs, activity runs) and the compute used for Data Flows and Azure-SSIS IR. Copy Activity runs are charged per DIU (Data Integration Unit) hour — each DIU is a combination of CPU, memory, and network resources. The default DIU is 4, but you can set a maximum. Data Flows are charged per vCore-hour. Self-hosted IR has no additional cost beyond the VM it runs on. Triggers are charged per execution. There is also a charge for Azure-SSIS IR based on the VM size and number of nodes.

Integration with Other Services

ADF integrates deeply with Azure DevOps and GitHub for CI/CD. You can export pipelines as ARM templates and deploy them via Azure Pipelines. It also integrates with Azure Purview for data lineage and cataloging. For real-time scenarios, you would use Azure Stream Analytics or Event Hubs with Azure Functions, not ADF, as ADF is batch-oriented.

Common Configurations and Defaults

Copy Activity: default DIU = 4, max DIU = 256 (for Azure Blob to Azure Synapse via PolyBase).

Data Flow: default compute type = General Purpose, default core count = 8, max = 256.

Self-hosted IR: minimum version = 5.0, default heartbeat interval = 60 seconds.

Pipeline run timeout: default 7 days, max 30 days.

Trigger: schedule trigger supports recurrence up to once per minute.

Verification Commands (PowerShell/CLI)

# Get a list of pipelines
Get-AzDataFactoryV2Pipeline -ResourceGroupName "rg" -DataFactoryName "adf"

# Trigger a pipeline run
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "rg" -DataFactoryName "adf" -PipelineName "pipeline"

# Monitor a pipeline run
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "rg" -DataFactoryName "adf" -PipelineRunId "run_id"
# Azure CLI equivalent
az datafactory pipeline list --resource-group rg --factory-name adf
az datafactory pipeline create-run --resource-group rg --factory-name adf --pipeline-name pipeline
az datafactory pipeline-run show --resource-group rg --factory-name adf --run-id run_id

Walk-Through

1

Create Linked Services

First, define the connection information for each data source and destination. For an on-premises SQL Server, you create a linked service that uses a Self-hosted Integration Runtime. You specify the server name, database, authentication type (e.g., SQL authentication), and optionally store the password in Azure Key Vault. For Azure Blob Storage, you create a linked service using managed identity or account key. Each linked service is a JSON object containing connection string, authentication type, and IR assignment. The Self-hosted IR must be installed on a machine that can reach the on-premises SQL Server and has outbound internet to Azure. The linked service is tested at creation time using the IR to ensure connectivity.

2

Define Datasets

Datasets represent the structure of the data within the linked services. For a SQL table, you specify the schema, table name, and optionally a query. For a blob file, you specify the container, folder path, file name (or pattern), and format (e.g., Parquet, CSV, JSON). Datasets are typed and can be used as input or output in activities. They also support parameters, allowing you to pass dynamic values at runtime (e.g., table name). The dataset definition includes a reference to the linked service and a 'type' (e.g., AzureSqlTable, DelimitedText). During pipeline execution, the dataset is resolved to actual data.

3

Build Pipeline with Activities

Create a pipeline and add activities. For a simple copy, add a Copy Data activity. Configure the source dataset (e.g., on-premises SQL table) and sink dataset (e.g., Azure Synapse table). Optionally, enable staging via Azure Blob for large copies. Set column mappings if the schema differs. For transformations, add a Data Flow activity. Connect the Data Flow to a source dataset, add transformation steps (e.g., Aggregate, Join), and sink to a dataset. You can also add control activities like ForEach to loop over a list of tables, or If Condition to branch based on a parameter. Each activity can have dependencies, creating a DAG (Directed Acyclic Graph).

4

Configure Triggers

Triggers define when the pipeline runs. For a daily schedule, create a schedule trigger with recurrence every 1 day. For event-based ingestion, create a storage event trigger that fires when a new blob is created in a specific container. The trigger can pass blob metadata as parameters to the pipeline. Tumbling window triggers are used for windowed processing (e.g., process all data from the last hour). You can also use custom event triggers with Azure Event Grid. Triggers can be paused and resumed. When a trigger fires, it creates a pipeline run. Multiple pipelines can be attached to one trigger.

5

Monitor and Validate Runs

After deployment, monitor pipeline runs in the ADF Monitor tab. You can see each run's status (Succeeded, Failed, In Progress), duration, and activity-level details. For failed runs, click into the activity to see error messages. Use Azure Monitor alerts to get notified on failures. You can also enable diagnostic settings to send logs to Log Analytics for advanced querying. Validate data integrity by comparing row counts between source and sink. For Data Flows, inspect the output to ensure transformations are correct. Use 'Debug' mode during development to test pipelines without waiting for triggers.

What This Looks Like on the Job

Enterprise Scenario 1: On-Premises SQL Server to Azure Synapse Analytics

A large retail company needs to move daily sales data from an on-premises SQL Server to Azure Synapse Analytics for reporting. They deploy a Self-hosted Integration Runtime on a Windows VM in their data center. They create a linked service for SQL Server (using SQL authentication) and another for Azure Synapse (using managed identity). A pipeline with a Copy Activity runs nightly via a schedule trigger (12 AM). To handle large volume (500 GB), they enable staging copy: data is first copied to Azure Blob Storage (intermediate), then loaded into Synapse using PolyBase for high throughput. They set the copy DIU to 64 for parallelism. They monitor using ADF alerts and Log Analytics. Common issues: network latency causing timeouts (solved by increasing DIU or using parallel copy) and authentication failures (resolved by rotating passwords in Key Vault).

Enterprise Scenario 2: Real-Time Log Ingestion with Event Hubs and ADF

A financial services company ingests real-time transaction logs from multiple sources into Azure Data Lake Storage Gen2. They use Azure Event Hubs to capture streaming data, then use a storage event trigger in ADF to process files as they land. ADF runs a Data Flow to parse JSON, apply schema validation, and enrich with lookup data from Azure SQL Database. The transformed data is written to Parquet files in a curated zone. They set the Data Flow compute to 16 cores to handle burst loads. They use Azure DevOps for CI/CD to deploy changes. Misconfiguration: if the trigger fires too frequently (every few seconds), it overwhelms the pipeline — they add a tumbling window trigger instead to batch files every 5 minutes.

Scenario 3: Hybrid Data Movement with Azure Databricks

A manufacturing company uses ADF to orchestrate an ETL process that involves heavy transformations (e.g., time-series aggregation, machine learning scoring). They use a pipeline with a Databricks Notebook activity that runs a Python script on a Databricks cluster. ADF copies raw data from IoT Hub to Blob Storage, then triggers the Databricks notebook, which reads the data, transforms it, and writes to Azure SQL Database. ADF then runs a stored procedure to update reporting tables. This hybrid approach leverages ADF's orchestration and monitoring while offloading compute-intensive work to Databricks. They use a Self-hosted IR for on-premises data sources (e.g., legacy databases). Performance bottleneck: the Databricks cluster startup time (2-3 minutes) — they use a cluster pool to reduce cold start latency.

How AZ-305 Actually Tests This

What AZ-305 Tests on Data Integration (Objective 2.4)

The exam focuses on your ability to recommend the appropriate data integration solution based on requirements. Key topics include:

Choosing between ADF, Azure Synapse Pipelines, Azure Logic Apps, and Azure Databricks for different scenarios (batch vs. real-time, code-free vs. code-heavy, orchestration vs. transformation).

Understanding the role of Integration Runtime (Self-hosted vs. Azure IR) and when each is required.

Security considerations: managed identity, Key Vault integration, firewall and VNet service endpoints.

Monitoring and alerting capabilities.

Cost optimization: DIU settings, Data Flow compute, and trigger frequency.

Common Wrong Answers and Why Candidates Choose Them

1.

Choosing Azure Logic Apps for batch data movement: Candidates see 'integration' and pick Logic Apps, but Logic Apps is for workflows and API orchestration, not high-volume data copy. ADF is the right tool for batch ETL.

2.

Selecting Azure Data Factory for real-time streaming: ADF is batch-oriented. For real-time, use Azure Stream Analytics or Event Hubs with Azure Functions.

3.

Assuming Self-hosted IR is always needed: Many think on-premises data always requires Self-hosted IR, but if the data can be accessed via public endpoint (e.g., cloud SaaS), Azure IR suffices. Self-hosted IR is only needed when the source is behind a firewall or on a private network.

4.

Confusing ADF with Azure Synapse Pipelines: They are essentially the same service. Synapse Pipelines are ADF pipelines within Synapse Studio. The exam may test that Synapse Pipelines are optimized for Synapse workspaces.

Specific Numbers and Terms That Appear on the Exam

DIU range: 2 to 256 (default 4).

Data Flow core count: 8 to 256 (default 8).

Self-hosted IR communication port: 443 (HTTPS).

Pipeline run timeout: 7 days default, max 30 days.

Schedule trigger minimum recurrence: 1 minute.

Tumbling window trigger: supports window sizes from 5 minutes to 30 days.

Azure-SSIS IR max nodes: 100.

Supported file formats for Data Flows: Parquet, CSV, JSON, Avro, ORC.

Copy Activity staging: required for large copies to Synapse via PolyBase.

Edge Cases and Exceptions

When copying from on-premises to Azure Synapse, staging via Blob is required if the data size exceeds 100 MB or if you want PolyBase performance.

Data Flows cannot be used in pipelines that are triggered by storage events if the source is on-premises; Data Flows require Azure IR.

ADF does not support incremental data loading natively; you must use watermark columns or change tracking.

For high-security environments, you can use a managed virtual network with private endpoints to avoid public internet.

How to Eliminate Wrong Answers

If the scenario mentions 'real-time' or 'streaming', eliminate ADF and consider Stream Analytics.

If the scenario mentions 'orchestration of on-premises ETL', look for Self-hosted IR.

If the scenario requires 'code-free transformations', Data Flows or Synapse Pipelines are correct.

If the scenario involves 'running SSIS packages', Azure-SSIS IR is the answer.

Key Takeaways

Azure Data Factory is a serverless data integration service for batch ETL/ELT, not real-time streaming.

The three Integration Runtime types are Azure IR, Self-hosted IR, and Azure-SSIS IR.

Self-hosted IR is required for on-premises or network-restricted data sources.

Copy Activity default DIU is 4, max is 256 (for PolyBase to Synapse).

Data Flows run on Spark and are billed per vCore-hour (default 8 cores).

Pipeline run timeout defaults to 7 days, max 30 days.

Schedule trigger minimum recurrence is 1 minute; tumbling window minimum is 5 minutes.

Use staging copy for large data transfers to Azure Synapse via PolyBase.

ADF supports CI/CD via ARM templates and Azure DevOps.

For real-time data integration, use Azure Stream Analytics or Event Hubs with Azure Functions.

Security best practices: use managed identity, Azure Key Vault, and private endpoints.

Monitoring: use ADF Monitor, Azure Monitor alerts, and Log Analytics.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Data Factory

Standalone service for data integration and ETL.

Can be used independently of Synapse Analytics.

Pipelines are managed in the ADF portal or SDK.

Supports all integration runtimes (Azure, Self-hosted, SSIS).

Ideal for heterogeneous data movement across many sources.

Azure Synapse Pipelines

Built into Azure Synapse Analytics workspace.

Tightly integrated with Synapse SQL and Spark pools.

Pipelines are managed in Synapse Studio.

Same underlying engine as ADF, but optimized for Synapse.

Best for ETL workflows that land in Synapse Analytics.

Azure Data Factory

Orchestration and code-free transformations via Data Flows.

Serverless, no cluster management.

Billed per DIU and vCore-hour.

Better for simple ETL and scheduled data movement.

Supports over 90 built-in connectors.

Azure Databricks

Compute engine for custom code (Python, Scala, SQL, R).

Requires cluster provisioning and management (auto-scaling available).

Billed per DBU (Databricks Unit) per hour.

Better for complex transformations, machine learning, and advanced analytics.

Supports Delta Lake for ACID transactions and time travel.

Watch Out for These

Mistake

Azure Data Factory can handle real-time streaming data.

Correct

ADF is a batch-oriented data integration service. It can process data in near-real-time using tumbling window triggers (minimum 5 minutes), but it is not designed for sub-second streaming. For real-time, use Azure Stream Analytics or Event Hubs with Azure Functions.

Mistake

You must use a Self-hosted Integration Runtime for all on-premises data sources.

Correct

Self-hosted IR is needed only if the data source is behind a firewall or on a private network that cannot be accessed directly from Azure. If the on-premises source is exposed via a public endpoint (e.g., a cloud-hosted database), you can use Azure IR.

Mistake

Azure Logic Apps and Azure Data Factory are interchangeable.

Correct

Logic Apps is for workflow automation and API integration, with over 200 connectors. It is suitable for orchestrating business processes and low-volume data movement. ADF is for enterprise-scale data integration, ETL, and data movement with high throughput. They are complementary, not interchangeable.

Mistake

Data Flows run on the same compute as Copy Activity.

Correct

Copy Activity runs on ADF's managed compute (DIU), while Data Flows run on Spark clusters. Data Flows are billed per vCore-hour, separate from Copy Activity DIU charges. They also have different scaling and performance characteristics.

Mistake

ADF can directly connect to on-premises data sources without any gateway.

Correct

ADF cannot directly access on-premises resources. It requires a Self-hosted Integration Runtime installed on a machine that can reach the on-premises source. The IR acts as a bridge, and all communication goes through HTTPS to Azure.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

When should I use Azure Data Factory vs. Azure Logic Apps?

Use ADF for high-volume, batch data movement and ETL with over 90 connectors. It supports complex transformations via Data Flows and can handle petabytes of data. Use Logic Apps for low-volume, event-driven workflows and API orchestration, such as sending emails or updating CRM records when a file lands. Logic Apps has 200+ connectors but is not designed for large data transfers. On the exam, if the scenario involves moving gigabytes of data or transforming data, choose ADF. If it involves a business process with human approval steps, choose Logic Apps.

How does the Self-hosted Integration Runtime connect to on-premises data?

The Self-hosted IR is installed on a Windows machine that has network access to the on-premises data source (e.g., SQL Server). It communicates with Azure over outbound HTTPS (port 443). The IR registers with ADF and polls for tasks. When a pipeline runs, the IR reads data from the source and sends it to the destination via HTTPS. The IR can be scaled out by adding more nodes (up to 4 for high availability). It supports Windows authentication and SQL authentication. It is required for any data source that cannot be accessed over the public internet.

What is a DIU and how does it affect copy performance?

DIU stands for Data Integration Unit, a measure of compute, memory, and network resources allocated to a Copy Activity. The default is 4 DIUs, but you can set a maximum up to 256. Higher DIU values increase parallelism and throughput, especially when copying from Azure Blob to Azure Synapse using PolyBase. However, not all sources benefit from high DIU — for example, an on-premises SQL Server may be bottlenecked by its own resources. You can set the DIU in the Copy Activity settings. The exam may ask you to recommend increasing DIU to improve performance.

Can Azure Data Factory handle incremental data loading?

ADF does not have a built-in incremental load feature. You must implement it yourself using watermark columns (e.g., a LastModified timestamp) or change tracking. A common pattern is to use a Lookup activity to get the last high watermark, then copy data where the timestamp is greater than that watermark, and finally update the watermark in a control table. Alternatively, you can use Azure Synapse Pipelines with change data capture (CDC) via Azure Data Factory's mapping data flows. The exam may test your ability to design an incremental load solution.

What is the difference between a tumbling window trigger and a schedule trigger?

A schedule trigger fires at a fixed time (e.g., daily at 8 AM) and does not process windows of data. A tumbling window trigger fires periodically (e.g., every 5 minutes) and processes a fixed-size window of time (e.g., the last 5 minutes of data). Tumbling window triggers are useful for processing streaming data in micro-batches. They support dependencies (e.g., wait for previous window to complete) and can be used with watermark columns. The minimum window size is 5 minutes. The exam may ask you to choose a tumbling window trigger for near-real-time processing.

How do I secure sensitive information in ADF linked services?

Use Azure Key Vault to store connection strings, passwords, and account keys. In the linked service JSON, reference the secret using the syntax: '@{Microsoft.DataFactory.Connectors.KeyVault.SecretUrl('https://myvault.vault.azure.net/secrets/mysecret')}'. ADF also supports managed identity for Azure services (e.g., Azure SQL Database, Blob Storage) to avoid storing credentials entirely. For Self-hosted IR, you can use Windows authentication or store credentials in the IR's local credential store (encrypted). The exam emphasizes using managed identity and Key Vault for security.

Can I use ADF to transform data without writing code?

Yes, using Mapping Data Flows. They provide a visual interface to design transformations like joins, aggregations, pivots, and lookups. Data Flows run on Spark clusters and are code-free. However, for very complex transformations, you may need to use a Databricks Notebook activity or a custom activity. The exam may ask you to recommend Data Flows for code-free ETL.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Designing Data Integration Solutions — now see how well it sticks with free AZ-305 practice questions. Full explanations included, no account needed.

Done with this chapter?