DP-900Chapter 69 of 101Objective 3.4

Power BI Datasets and Dataflows

This chapter covers Power BI datasets and dataflows, two critical components for managing and preparing data in Power BI. For the DP-900 exam, approximately 10–15% of questions touch on analytics topics, with a subset specifically on Power BI datasets and dataflows. Understanding how datasets store data models and how dataflows enable reusable ETL processes is essential for answering questions about data preparation, storage modes, and the differences between datasets and dataflows. This chapter provides a deep dive into both concepts, including their architecture, configuration, and common exam scenarios.

25 min read
Intermediate
Updated May 31, 2026

Power BI Datasets and Dataflows: Like a Recipe Box and Prep Kitchen

Imagine you run a restaurant chain and need to create weekly reports on sales, inventory, and customer satisfaction. Your raw data is like fresh ingredients: sales transactions from each store (like crates of tomatoes and onions), inventory counts (like bags of flour and cheese), and customer surveys (like feedback cards). You can't serve these raw ingredients directly to customers—you need to prepare them. A dataflow is like your prep kitchen. You take the crates of tomatoes, wash them, chop them, and store them in labeled containers in the walk-in cooler. This prepped ingredient is now ready to be used in multiple recipes. The dataset is like a recipe book. Each recipe (like a Margherita pizza) specifies exactly which prepped ingredients to use, in what quantities, and how to combine them. When you want to make a pizza, you don't wash and chop tomatoes again—you grab the prepped tomatoes from the cooler. In Power BI, dataflows extract, transform, and load (ETL) data into a reusable, cleaned form stored in the Power BI service (the cooler). Datasets then connect to these dataflows and define the model—relationships, measures, and calculations—that reports and dashboards use. If you change the prepped ingredients (update the dataflow), all recipes using it automatically reflect the change. Without dataflows, each recipe would need to prep its own ingredients, leading to duplicated effort and inconsistency. Just as a well-organized prep kitchen saves time and ensures consistency across dishes, dataflows centralize data preparation, while datasets provide the analytical structure for visualization.

How It Actually Works

What Are Power BI Datasets and Dataflows?

Power BI datasets and dataflows are two distinct but complementary components in the Power BI ecosystem. A dataset is a collection of data that you import or connect to, including the data model, relationships, measures, and calculated columns. It is the foundation for reports and dashboards. A dataflow is a cloud-based ETL (Extract, Transform, Load) tool that allows you to ingest, clean, transform, and load data into a reusable storage layer in the Power BI service. Dataflows are built using Power Query Online and store data in Azure Data Lake Storage Gen2 (ADLS Gen2) under the Power BI service's managed storage.

Why They Exist

Datasets exist to provide a structured, optimized data model that Power BI can query efficiently. They support different storage modes: Import, DirectQuery, and Composite (which combines Import and DirectQuery). The Import mode loads data into memory, providing fast query performance but requiring periodic refreshes. DirectQuery queries the source directly, ensuring real-time data but with potential performance trade-offs. Composite mode allows a dataset to have some tables imported and others queried directly.

Dataflows exist to solve data preparation challenges. Without dataflows, each dataset would need to independently connect to source systems, apply transformations, and manage refresh schedules. This leads to duplication of effort, inconsistent transformations, and increased load on source systems. Dataflows centralize data preparation: you define transformations once in a dataflow, and multiple datasets can reference that dataflow as a source. This reduces redundancy, ensures consistency, and offloads transformation work from the Power BI service to the dataflow's compute engine.

How Dataflows Work Internally

A dataflow is defined using Power Query Online, which generates a set of M expressions. When you create a dataflow, you specify one or more entities (similar to tables). Each entity has a query that extracts data from a source, applies transformations, and loads the result. The dataflow is stored as a set of JSON and M files in ADLS Gen2, within the Power BI service's tenant storage. When the dataflow is refreshed, the Power BI service spins up compute resources (Azure Analysis Services or Power BI's own mashup engine) to execute the M expressions. The transformed data is written back to ADLS Gen2 in CSV or Parquet format (depending on the dataflow version).

There are two types of dataflows: - Standard dataflows: Use Power BI's managed storage and compute. They support scheduled refreshes and incremental refresh. - Analytical dataflows: (Preview) Use Azure Data Lake Storage Gen2 and Azure Data Factory for compute. They offer more advanced transformation capabilities and can be used with Azure Synapse and other Azure services.

Key Components and Defaults

Entities: Each dataflow contains one or more entities. An entity is essentially a table of data with a defined schema.

Refresh schedule: Dataflows can be refreshed up to 48 times per day (every 30 minutes) in Power BI Premium. In Power BI Pro, the limit is 8 times per day (every 1 hour).

Incremental refresh: Available in Premium, this allows you to refresh only new or changed data, reducing refresh times and resource consumption.

Linked entities: Dataflows can reference entities from other dataflows, enabling modular data preparation.

Computed entities: Entities that are created by applying transformations to other entities within the same dataflow or from linked entities.

How Datasets Use Dataflows

A dataset can connect to a dataflow as a source. When you create a dataset from a dataflow, Power BI imports the data from the dataflow's entities into the dataset's model. This is a one-way relationship: the dataset reads data from the dataflow, but changes to the dataset do not affect the dataflow. The dataset can also have its own transformations applied via Power Query, but it's recommended to do all transformations in the dataflow to keep the dataset simple.

Datasets have their own refresh schedules, which can be independent of the dataflow's schedule. However, it's common to set the dataset refresh to occur after the dataflow refresh completes, ensuring the dataset uses the latest data. You can configure this using a sequential refresh in Power BI Premium.

Storage Modes and Performance

Datasets support three storage modes: - Import: Data is copied into the Power BI in-memory column store. Queries are fast, but data is only as fresh as the last refresh. Suitable for small to medium data volumes (up to 1 GB per dataset in Pro, 10 GB in Premium, with options for larger sizes). - DirectQuery: Queries are sent to the source database (e.g., SQL Server, Azure SQL Database) in real-time. No data is imported. Performance depends on the source's query speed. Suitable for large data volumes or real-time needs. - Composite: Some tables are imported, others use DirectQuery. Allows mixing modes in a single dataset. Requires Premium.

Dataflows themselves do not have storage modes; they always store data in ADLS Gen2. However, when a dataset connects to a dataflow, the dataset's storage mode determines how the data is accessed. If the dataset uses Import, it copies the dataflow's output. If the dataset uses DirectQuery, it queries the dataflow's underlying storage (ADLS Gen2) via the Power BI engine.

Configuration and Verification

To create a dataflow: 1. In Power BI service, navigate to a workspace (requires Premium or Premium Per User for some features). 2. Click New > Dataflow. 3. Choose Define new entities to start from scratch, or Attach external common data model to use a predefined schema. 4. Use Power Query Online to define transformations. 5. Save and set refresh schedule.

To create a dataset from a dataflow: 1. In Power BI service, click New > Dataset. 2. Select Dataflow as the source. 3. Choose the dataflow and entities. 4. Configure the dataset (e.g., define relationships, measures). 5. Publish.

Verification: You can monitor dataflow refresh status in the workspace view. For datasets, check the dataset settings page for last refresh time and schedule.

Interaction with Related Technologies

Dataflows can connect to a wide range of data sources, including Azure SQL Database, Azure Blob Storage, SharePoint, Excel, and many others. They also integrate with Azure Data Lake Storage Gen2, allowing other Azure services (e.g., Azure Data Factory, Azure Databricks) to consume the transformed data. Dataflows support the Common Data Model (CDM) standard, enabling interoperability with Dynamics 365, Power Apps, and other Microsoft business applications.

Datasets interact with Power BI reports and dashboards. Reports are built on top of datasets, and dashboards pin visualizations from reports. Datasets can also be used with Power BI Q&A (natural language queries) and Power BI mobile apps. In Power BI Premium, datasets can be shared across workspaces via dataset sharing or by using XMLA endpoints for programmatic access.

Exam-Relevant Details

Dataflows require Power BI Premium or Premium Per User (PPU) for most advanced features like incremental refresh, linked entities, and computed entities. Pro users can create dataflows but with limitations (e.g., no incremental refresh).

Dataflows are stored in the Power BI service's managed ADLS Gen2 storage. You cannot directly access this storage unless you have a Premium capacity and enable the 'Dataflow data storage' setting.

The maximum number of entities per dataflow is 100. Each entity can have up to 100,000 rows in Pro (limited by dataflow size), but in Premium, there is no fixed row limit, though performance may degrade.

Dataflow refresh times are subject to capacity limits. In Premium, you can have up to 48 refreshes per day, but each refresh must complete before the next can start.

Datasets can also be created from other sources like Power BI Desktop files, Excel, CSV, and direct connections to databases. The exam may ask which source is best for a given scenario.

Common Exam Traps

Confusing dataflows with datasets: Dataflows are for ETL; datasets are for modeling. A common question: 'Which component should you use to clean and transform data?' Answer: dataflow. 'Which component contains measures and relationships?' Answer: dataset.

Assuming dataflows require Premium: Basic dataflows are available in Pro, but advanced features (incremental refresh, linked entities) require Premium.

Thinking dataflows are the same as Power Query in Power BI Desktop: Power Query in Desktop is for individual datasets; dataflows are cloud-based and shareable.

Forgetting that datasets can use DirectQuery on dataflows: This is possible but not common; the exam may test that DirectQuery on a dataflow is supported but requires the dataflow to be in a Premium workspace.

Best Practices

Use dataflows to centralize data preparation, especially when multiple datasets need the same transformations.

Use incremental refresh for large dataflows to reduce refresh time.

Keep datasets lightweight by doing all transformations in dataflows.

Monitor dataflow and dataset refresh times to ensure they complete within available windows.

Use the XMLA endpoint for advanced dataset management in Premium.

Summary

Power BI datasets and dataflows are foundational components for analytics. Dataflows provide reusable, cloud-based ETL, while datasets provide the analytical model for reports. Understanding their differences, storage modes, and configuration is crucial for the DP-900 exam. Focus on when to use each, the limitations of Pro vs Premium, and how they interact with other Power BI components.

Walk-Through

1

Create a Dataflow in Power BI Service

Navigate to a workspace in the Power BI service. Click 'New' and select 'Dataflow'. You will be prompted to choose between 'Define new entities' (start from scratch) or 'Attach external common data model' (use a pre-defined schema). For most scenarios, choose 'Define new entities'. You then enter the Power Query Online editor, where you can connect to various data sources (e.g., Azure SQL Database, Excel, SharePoint). Select a source, apply transformations using the Power Query interface (e.g., remove columns, filter rows, merge queries), and name each entity. Each entity becomes a table in the dataflow. After defining all entities, click 'Save' and set a refresh schedule (e.g., daily at 6 AM). The dataflow is now ready for datasets to consume.

2

Configure Incremental Refresh for Dataflow

In a Premium workspace, you can enable incremental refresh to optimize dataflow refreshes. Open the dataflow settings, go to 'Incremental refresh' tab, and toggle it on. Define a filter column (usually a date/time column) and set parameters: 'Store rows in the last' (e.g., 5 years) and 'Incrementally refresh data from the last' (e.g., 30 days). Power BI will create a partitioned table in ADLS Gen2, where only partitions within the incremental window are refreshed during each run. This reduces the amount of data processed and speeds up refresh times. Note that the source must support query folding for incremental refresh to work efficiently.

3

Create a Dataset from a Dataflow

In the same workspace, click 'New' and select 'Dataset'. Choose 'Dataflow' as the source. You will see a list of dataflows in the workspace. Select the desired dataflow and then choose which entities to include. You can select multiple entities. After selection, Power BI automatically creates relationships based on the dataflow's schema (if foreign keys are defined). You can then add measures, calculated columns, and hierarchies. Choose the storage mode: Import (default) or DirectQuery. For Import, you can set a refresh schedule. Publish the dataset. It will now appear as a data source for reports.

4

Configure Dataset Refresh Schedule

After creating the dataset, go to the dataset settings (click the dataset in the workspace, then the gear icon). Under 'Scheduled refresh', toggle it on and set the frequency (e.g., daily) and time. You can also specify a time zone. For datasets based on dataflows, it's best to schedule the dataset refresh to occur after the dataflow refresh completes. For example, if the dataflow refreshes at 6:00 AM, schedule the dataset refresh at 7:00 AM. In Premium, you can use the XMLA endpoint to orchestrate sequential refreshes. Note that if the dataset uses DirectQuery, no refresh is needed because it queries the source directly.

5

Create a Report on the Dataset

Once the dataset is published, you can create reports. In the Power BI service, click on the dataset and select 'Create report' (or use 'Get data' in Power BI Desktop). The report canvas opens with the dataset's fields available. Drag and drop fields to create visuals. Build measures if needed using DAX. The report is automatically saved to the same workspace. You can then pin visuals to dashboards. Reports and dashboards will reflect the data from the dataset. If the dataset is refreshed, the report updates automatically. For real-time needs, use DirectQuery or push datasets.

What This Looks Like on the Job

Enterprise Scenario 1: Sales Reporting Across Multiple Regions

A multinational company needs to generate weekly sales reports for each region. The raw data comes from different sources: an on-premises SQL Server for North America, an Azure SQL Database for Europe, and CSV files from partners in Asia. Without dataflows, each regional report would need separate ETL processes, leading to inconsistent transformations. The solution: create a dataflow that connects to all three sources, applies uniform transformations (e.g., standardizing currency, date formats, and product categories), and loads the data into a single entity. This dataflow is refreshed daily at 2:00 AM. Then, a single dataset is created from this dataflow, with measures for total sales, growth %, and region comparisons. The dataset uses Import mode for fast performance. Reports are built on this dataset, and regional managers access them via Power BI mobile. The dataflow reduces ETL duplication and ensures all reports use the same cleaned data. A common misconfiguration is forgetting to handle time zone differences, causing data from Asia to appear on the wrong day. To fix, the dataflow includes a step to convert all timestamps to UTC.

Enterprise Scenario 2: Marketing Campaign Analysis

A marketing team runs dozens of campaigns simultaneously, tracking performance via Google Analytics, Facebook Ads, and a custom CRM. They need a unified view of cost, impressions, and conversions. The challenge is that each source has different schemas and refresh intervals. The team creates a dataflow with linked entities: one entity for each source, and a computed entity that merges them using campaign IDs. Incremental refresh is configured to only pull the last 7 days of data, reducing API calls. The dataset uses DirectQuery to the dataflow so that reports always show the latest data without waiting for a full import. However, DirectQuery on a dataflow can be slow if the dataflow has many transformations. To optimize, the team pushes down as much transformation as possible to the source queries (query folding). They also use the 'Single sign-on' option to avoid credential prompts. A common pitfall is exceeding the dataflow's refresh duration limit (default 2 hours in Premium). They monitor refresh times and split large entities into smaller ones.

Enterprise Scenario 3: Financial Consolidation

A finance department needs to consolidate monthly financial data from 20 subsidiaries. Each subsidiary submits an Excel file with a specific template. The old process involved manually copying data into a master spreadsheet. With Power BI, they create a dataflow that connects to a SharePoint folder containing the Excel files. The dataflow uses Power Query to combine all files, apply data validation (e.g., check for missing values), and load the data into an entity. The finance team then creates a dataset with calculated measures for variance analysis and currency conversion. The dataset uses Composite mode: the financial data is imported, but a currency exchange rate table uses DirectQuery to an Azure SQL Database that updates hourly. This allows real-time exchange rates without importing them. The dataflow is refreshed on the 1st of each month after the subsidiaries submit their files. A common issue is that Excel files may have inconsistent column names. The dataflow uses a parameter to map columns dynamically. If a file is missing, the dataflow fails; they set up alerts to notify the administrator.

How DP-900 Actually Tests This

DP-900 Exam Focus: Power BI Datasets and Dataflows

This topic aligns with objective Domain 3: Analytics, specifically 3.4 Describe Power BI datasets and dataflows. The exam expects you to differentiate between datasets and dataflows, understand their use cases, and know the limitations of Power BI Pro vs Premium. Expect 2-3 questions on this topic.

Most Common Wrong Answers and Why Candidates Choose Them

1.

'Dataflows are used to create reports.' This is wrong because reports are built on datasets, not dataflows. Candidates confuse the purpose because dataflows can be a source for datasets, but they do not directly generate reports.

2.

'Datasets store raw data.' This is wrong because datasets store a data model (tables, relationships, measures), not raw data. Raw data is stored in dataflows or source systems. Candidates think datasets are like databases.

3.

'Dataflows require Power BI Pro.' This is partially true but misleading. Basic dataflows are available in Pro, but advanced features like incremental refresh require Premium. The exam may test that a Pro user can create a dataflow but cannot use linked entities.

4.

'You can edit a dataflow in Power BI Desktop.' This is false. Dataflows are created and managed in the Power BI service (online). Power BI Desktop can connect to a dataflow but cannot edit it. Candidates assume all Power BI tools are interchangeable.

Specific Numbers and Terms That Appear on the Exam

Refresh limits: Pro: 8 per day; Premium: 48 per day.

Storage limits: Pro dataset: 1 GB; Premium: 10 GB (configurable up to 100 GB with large dataset storage).

Dataflow entity limit: 100 entities per dataflow.

Incremental refresh: Requires Premium.

Linked entities: Requires Premium.

Computed entities: Requires Premium.

Storage modes: Import, DirectQuery, Composite (Composite requires Premium).

Dataflow storage: ADLS Gen2 (Azure Data Lake Storage Gen2).

Edge Cases and Exceptions

Dataflows in shared workspaces: Dataflows can be created in shared workspaces, but if the workspace is not Premium, advanced features are unavailable.

Using DirectQuery on a dataflow: This is possible if the dataflow is in a Premium workspace. The dataset queries the dataflow's underlying storage directly.

Dataflows with on-premises data: Requires an on-premises data gateway. The gateway must be installed and configured to connect to the source.

Dataflow refresh with large data: If a dataflow exceeds the 2-hour refresh timeout (default), it will fail. You can increase the timeout in Premium capacity settings.

How to Eliminate Wrong Answers

Use the underlying mechanism: Dataflows are for ETL (Extract, Transform, Load). Datasets are for modeling and reporting. If a question asks about cleaning, transforming, or combining data from multiple sources, the answer is dataflow. If it asks about measures, relationships, or building reports, the answer is dataset. For storage mode questions, remember that Import is for performance, DirectQuery for real-time, and Composite for mixed needs. Always check if Premium is mentioned: any advanced feature (incremental refresh, linked entities, composite models) implies Premium is required.

Key Takeaways

Dataflows are cloud-based ETL tools; datasets are data models for reporting.

Dataflows require Premium for advanced features like incremental refresh and linked entities.

Datasets support three storage modes: Import, DirectQuery, and Composite (Composite requires Premium).

Dataflows store data in Azure Data Lake Storage Gen2 (ADLS Gen2).

Dataflows can have up to 100 entities; each entity can have up to 100,000 rows in Pro (no fixed limit in Premium but performance may degrade).

Refresh limits: Pro = 8 per day, Premium = 48 per day.

Incremental refresh in dataflows reduces refresh time by processing only new/changed partitions.

Datasets can use DirectQuery to query dataflows directly (requires Premium workspace).

Dataflows cannot be used directly by reports; a dataset must be created from the dataflow first.

Common exam trap: confusing dataflows with datasets; remember dataflows = ETL, datasets = model.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Power BI Dataset

Used for data modeling: defines relationships, measures, and calculated columns.

Supports three storage modes: Import, DirectQuery, Composite.

Consumed by reports and dashboards.

Can be created from various sources: dataflows, databases, files, etc.

Refresh schedule independent of dataflows; can be set to run after dataflow refresh.

Power BI Dataflow

Used for data preparation: ETL (extract, transform, load).

Stores transformed data in Azure Data Lake Storage Gen2.

Consumed by datasets as a source.

Can only be defined using Power Query Online in the service.

Supports incremental refresh, linked entities, and computed entities (Premium).

Watch Out for These

Mistake

Dataflows are the same as Power Query in Power BI Desktop.

Correct

Power Query in Desktop is used for individual datasets and runs locally. Dataflows are cloud-based ETL tools that run in the Power BI service and can be shared across multiple datasets. They use Power Query Online, which has some differences and limitations compared to the desktop version.

Mistake

You need Power BI Premium to create any dataflow.

Correct

Basic dataflows are available with Power BI Pro. However, advanced features such as incremental refresh, linked entities, and computed entities require a Premium (PPU or capacity) license. Pro users can create dataflows with up to 100 entities but without these enhancements.

Mistake

Datasets store the actual data permanently.

Correct

Datasets store a compressed, in-memory representation of the data (in Import mode) or a metadata model that points to the source (in DirectQuery mode). The actual raw data is stored in the source systems or in dataflows (ADLS Gen2). Datasets are ephemeral in the sense that they are rebuilt on refresh.

Mistake

Dataflows can be used directly as a data source for reports without a dataset.

Correct

Reports require a dataset. While you can create a report based on a dataflow in Power BI Desktop (by connecting to the dataflow), the report still uses a dataset that is created automatically. In the service, you must create a dataset explicitly. Dataflows themselves are not directly queryable by reports.

Mistake

Incremental refresh reduces the amount of data stored in the dataflow.

Correct

Incremental refresh does not reduce total storage; it only reduces the amount of data processed during each refresh. The dataflow still retains the full dataset (based on the 'store rows in the last' parameter). Partitions for older data are kept and not re-processed, but they still consume storage.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between a Power BI dataset and a dataflow?

A dataflow is an ETL tool used to ingest, clean, and transform data from multiple sources, storing the output in Azure Data Lake Storage Gen2. A dataset is a data model that contains tables, relationships, measures, and calculated columns, and is used as the source for reports and dashboards. Datasets can connect to dataflows as a source. In short, dataflows prepare data; datasets model and present it.

Do I need Power BI Premium to use dataflows?

Basic dataflows are available with Power BI Pro. However, advanced features like incremental refresh, linked entities, and computed entities require Power BI Premium (Premium Per User or Premium capacity). Also, dataflows in shared workspaces require Premium if you want to use these features.

Can I use a dataflow directly in a Power BI report?

No. Reports require a dataset. You can create a dataset from a dataflow, and then build reports on that dataset. In Power BI Desktop, you can connect to a dataflow as a data source, but the report still uses a dataset (created automatically when you load the data).

What storage modes are available for datasets, and which one should I use?

Datasets support Import, DirectQuery, and Composite modes. Import mode loads data into memory for fast performance; use for small to medium data that doesn't require real-time updates. DirectQuery queries the source directly; use for large data or real-time needs. Composite mode mixes both; use when you need some tables imported and others queried live. Composite mode requires Premium.

How do I schedule a dataflow refresh to run before a dataset refresh?

You can set the dataflow refresh schedule to an earlier time than the dataset refresh. For example, schedule the dataflow at 6:00 AM and the dataset at 7:00 AM. In Premium, you can use the XMLA endpoint or Power Automate to orchestrate sequential refreshes. There is no built-in dependency setting, so manual coordination is needed.

What is incremental refresh in dataflows?

Incremental refresh allows you to refresh only new or changed data in a dataflow, rather than the entire dataset. You define a filter column (usually a date) and set parameters for how far back to store data and how far back to refresh. This reduces refresh time and resource consumption. It requires Power BI Premium.

Can I edit a dataflow in Power BI Desktop?

No. Dataflows are created and edited in the Power BI service using Power Query Online. Power BI Desktop can connect to a dataflow as a data source, but it cannot modify the dataflow itself. To edit the dataflow, you must use the service.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Power BI Datasets and Dataflows — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?