DP-900Chapter 4 of 101Objective 1.4

Analytics and Data Warehouse Concepts

This chapter covers analytics and data warehouse concepts, a core topic for the DP-900 exam. You will learn the difference between transactional and analytical workloads, the architecture of a data warehouse (star schema, fact and dimension tables), and how Azure services like Azure Synapse Analytics and Azure Data Lake Storage support modern analytics. Approximately 15–20% of exam questions touch on data warehousing and analytics, making this a high-yield area. Mastering these concepts will help you distinguish between OLTP and OLAP systems and understand how data flows from source to insight.

25 min read
Intermediate
Updated May 31, 2026

The Library Archives vs. the Research Desk

Imagine a large library with two distinct services: the Archives and the Research Desk. The Archives is a massive, climate-controlled room where every book ever acquired is stored on shelves. Books are organized by category and date, and you can browse entire sections to explore historical trends. However, retrieving a single book takes time because you must walk through the aisles, and you cannot easily combine information from multiple books on the spot. The Archives represent a data lake or raw storage for historical data. The Research Desk, by contrast, is a small, curated table with a set of pre-prepared summary cards. These cards contain answers to common questions, like 'How many books were borrowed last month?' or 'Which author has the most titles?' The cards are updated nightly. If you ask a question that requires combining data from multiple cards, the librarian can quickly cross-reference them because the data is already structured and indexed. The Research Desk represents a data warehouse: optimized for fast, predictable queries on cleaned, aggregated data. In a company, the data warehouse (Research Desk) enables business analysts to get quick answers without waiting for huge scans of the raw data lake (Archives). The key difference is purpose: archives store everything for deep analysis, while the research desk serves fast, reliable reports.

How It Actually Works

What is Analytics and Why Do We Need Data Warehouses?

Analytics is the process of examining data to draw conclusions and support decision-making. In a business context, analytics answers questions like 'What were total sales last quarter?' or 'Which products have the highest return rate?' To answer these questions efficiently, you need a system optimized for read-heavy, complex queries that aggregate large volumes of historical data. This is where a data warehouse comes in. A data warehouse is a centralized repository designed for online analytical processing (OLAP). It stores data from multiple sources, cleaned, transformed, and structured for fast query performance. This contrasts with online transaction processing (OLTP) systems, which are optimized for high-volume, low-latency writes (e.g., order entry, banking transactions). OLTP systems use normalized schemas to minimize redundancy and ensure fast inserts/updates, but they are terrible for analytical queries because those queries would require joining many tables and scanning huge indexes. The exam expects you to know the key differences between OLTP and OLAP.

The Star Schema: Fact and Dimension Tables

The most common data warehouse schema is the star schema. It consists of one or more fact tables surrounded by dimension tables. A fact table stores quantitative, measurable data (e.g., sales amount, quantity sold, order count). It usually has a composite key made up of foreign keys referencing dimension tables. Fact tables are often very large (billions of rows) and are the primary target for aggregation queries. Dimension tables store descriptive attributes (e.g., customer name, product category, date, store location). They are smaller and allow you to slice and dice fact data. For example, a sales fact table might have columns: DateKey, ProductKey, CustomerKey, SalesAmount, Quantity. The dimension tables would be DimDate, DimProduct, DimCustomer. A query like 'Total sales by product category for Q1 2023' would join the fact table with DimProduct and DimDate and aggregate. The star schema is denormalized – dimension tables can have redundancy (e.g., product category repeated for each product) – but this is intentional to reduce the number of joins and improve query speed. The exam may ask you to identify which tables are facts and which are dimensions.

Extract, Transform, Load (ETL) Process

Data gets into a data warehouse via an ETL pipeline. Extract: Data is pulled from source systems (databases, APIs, flat files). Transform: Data is cleaned (remove duplicates, handle nulls), validated, aggregated, and reshaped into the star schema. Load: The transformed data is inserted into the warehouse tables. In Azure, you can use Azure Data Factory or Azure Synapse Pipelines to orchestrate ETL. The exam may test your understanding of ETL vs. ELT (Extract, Load, Transform), where ELT loads raw data first and transforms later inside the warehouse (common with modern cloud warehouses like Synapse). ETL is traditional and ensures the warehouse only contains clean data; ELT is faster for initial load but requires transformation compute.

Azure Synapse Analytics: The Cloud Data Warehouse

Azure Synapse Analytics is Microsoft's cloud data warehouse solution. It combines big data analytics and data warehousing. The core component is Synapse SQL, which provides two query models: dedicated SQL pools (formerly SQL Data Warehouse) and serverless SQL pools. Dedicated SQL pools provision fixed compute resources (DWU – Data Warehouse Units) and store data on persistent storage. Serverless SQL pools query data directly from files in Azure Data Lake Storage, charging per query. Synapse also integrates with Apache Spark for big data processing and Synapse Pipelines for ETL. On the exam, you should know that Synapse uses a massively parallel processing (MPP) architecture: data is distributed across 60 distributions per compute node, and queries are parallelized. The distribution is controlled by a hash or round-robin distribution key. Choosing the right distribution key is critical for performance. The exam may ask about distribution types: Hash distribution (evenly distributes rows based on a column's hash – good for large fact tables) and Round-robin (random distribution – good for small tables or staging).

PolyBase and Data Virtualization

PolyBase is a technology in Synapse that allows you to query external data sources (like Azure Blob Storage, Azure Data Lake Storage, or Hadoop) using T-SQL. It enables data virtualization – you don't need to load data into the warehouse to query it. PolyBase uses external tables with schemas defined on top of files (CSV, Parquet, ORC). This is handy for combining on-premises data with cloud data. The exam may test that PolyBase supports read-only queries on external data, and you can use CREATE EXTERNAL TABLE syntax. Note: PolyBase is also available in SQL Server 2016+ and Azure SQL Database (limited).

Data Lake vs. Data Warehouse vs. Data Lakehouse

Modern analytics often involves a data lake – a repository that stores raw data in its native format (structured, semi-structured, unstructured). Azure Data Lake Storage Gen2 is a hierarchical file system on top of Azure Blob Storage. A data lake offers flexibility and low-cost storage but requires significant effort to query directly. A data warehouse provides structured, clean data with fast query performance but is more expensive. A data lakehouse (like Azure Synapse with Spark) merges the two: it stores raw data in a data lake but adds a metadata layer (like Delta Lake) to provide ACID transactions and schema enforcement. The exam expects you to understand the trade-offs: data lake = cheap storage, schema-on-read; data warehouse = fast queries, schema-on-write; data lakehouse = best of both.

Dimensional Modeling: Slowly Changing Dimensions (SCD)

In a data warehouse, dimension attributes can change over time (e.g., a customer moves to a new city). Slowly Changing Dimensions (SCD) are techniques to handle these changes. The most common types are: - Type 0: Do nothing – keep original values. - Type 1: Overwrite – no history kept. - Type 2: Add a new row with a new surrogate key and effective dates – preserves full history. - Type 3: Add a new column to store the previous value – limited history. The exam may ask which SCD type to use for a given requirement. For example, if you need to track historical changes, use Type 2.

Partitioning and Indexing for Performance

To speed up queries, data warehouses use partitioning – dividing large tables into smaller segments based on a column (e.g., date). Partitioning allows query engines to skip irrelevant partitions (partition elimination). In Synapse, you can partition tables on the distribution key or a separate partition column. Additionally, columnstore indexes are the default in Synapse – they store data column-wise instead of row-wise, enabling high compression and faster aggregation queries. Rowstore indexes (B-tree) are better for point lookups. The exam may ask about columnstore vs. rowstore: columnstore is for large scans, rowstore for small lookups.

Querying the Data Warehouse: T-SQL and Aggregations

You query a data warehouse using standard T-SQL, but with some limitations (e.g., no UPDATE/DELETE on columnstore indexes without rebuild). Common analytical functions include SUM, COUNT, AVG, GROUP BY, and window functions like ROW_NUMBER(), RANK(). The exam may test your ability to write simple aggregation queries or interpret query results. Also, understand the concept of materialized views – pre-computed and stored query results that improve performance. In Synapse, you can create materialized views that are automatically refreshed.

Security in Data Warehouses

Data warehouses contain sensitive business data. Security mechanisms include: - Row-level security (RLS): Restrict rows based on user identity. - Column-level security: Restrict access to specific columns. - Dynamic data masking: Obfuscate sensitive data in query results. - Azure Active Directory (AAD) authentication: Use managed identities or service principals. - Firewall rules: Control IP access. The exam may ask which security feature to use for a given scenario (e.g., RLS for a salesperson to see only their region).

Monitoring and Optimization

Azure Synapse provides Dynamic Management Views (DMVs) to monitor query performance, resource usage, and wait statistics. You can also use Azure Monitor and Synapse Studio to view query plans and identify bottlenecks. Common optimization techniques include:

- Choosing the right distribution key (hash on a column with high cardinality and even distribution). - Using appropriate partitioning (e.g., monthly partitions for date). - Ensuring statistics are up-to-date (UPDATE STATISTICS). - Avoiding small tables without distribution (use round-robin). The exam may test your knowledge of DMVs like sys.dm_pdw_exec_requests to find long-running queries.

Walk-Through

1

Identify Business Questions and KPIs

Before building a data warehouse, you must understand what business questions need answers. For example, 'What are monthly sales trends?' or 'Which customer segments are most profitable?' These define the KPIs (Key Performance Indicators) that the warehouse will support. At this step, you also identify the granularity (level of detail) needed, such as daily sales per product per store. This drives the design of fact and dimension tables.

2

Design the Star Schema

Based on the business questions, you design the fact table(s) and dimension tables. Choose the grain (e.g., one row per sales transaction). Identify measures (quantitative columns like sales amount) and dimensions (descriptive attributes like time, product, customer). Define surrogate keys for dimensions and foreign keys in the fact table. For example, a sales fact table might have foreign keys to DimDate, DimProduct, DimStore, and DimCustomer. This step requires careful naming conventions and data type selection.

3

Set Up Azure Synapse Analytics Workspace

In the Azure portal, create a Synapse workspace. This provisions a resource group, a Synapse SQL pool (dedicated or serverless), a Data Lake Storage Gen2 account, and optional Spark pools. You configure firewall rules, authentication (Azure AD or SQL auth), and networking. For dedicated pools, you choose the performance tier (DW100c to DW30000c). The workspace provides a unified UI (Synapse Studio) for managing pipelines, SQL scripts, and notebooks.

4

Create External Tables for Raw Data

Use PolyBase to create external tables pointing to raw data in Azure Data Lake Storage. For example, you define a CSV file format and an external table with `CREATE EXTERNAL TABLE SalesRaw (SaleID int, SaleDate date, Amount decimal) WITH (LOCATION='sales/raw/', DATA_SOURCE=MyDataSource, FILE_FORMAT=MyFileFormat)`. This allows you to query raw data without loading it. You can then run transformations using `CREATE TABLE AS SELECT (CTAS)` to load into the warehouse.

5

Build ETL Pipeline with Azure Data Factory

Create a pipeline in Azure Data Factory or Synapse Pipelines to extract data from source systems (on-prem SQL Server, Azure SQL DB, etc.), transform it (clean, aggregate, convert to star schema), and load it into Synapse. Use activities like Copy Data, Data Flow, and Stored Procedure. Schedule the pipeline to run daily. Monitor pipeline runs in the monitoring view. For example, a pipeline might copy sales data from an OLTP database, join with product and customer tables, and insert into the warehouse fact table.

6

Optimize Query Performance

After loading data, optimize performance by creating columnstore indexes, partitioning large tables, and updating statistics. Use `ALTER INDEX ALL ON FactSales REBUILD` to rebuild indexes. Partition the fact table by month using `CREATE TABLE FactSales WITH (DISTRIBUTION=HASH(ProductKey), PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', ...)))`. Run `UPDATE STATISTICS FactSales` to help the query optimizer. Monitor query performance using DMVs like `sys.dm_pdw_exec_requests` and `sys.dm_pdw_request_steps`.

7

Implement Security and Access Control

Set up row-level security (RLS) to restrict data access. For example, create a security policy that filters sales data by region based on the user's Azure AD group. Use `CREATE SECURITY POLICY SalesFilter ADD FILTER PREDICATE dbo.fn_SalesFilter(SalesRegion) ON dbo.FactSales`. Also, configure column-level security by granting permissions on specific columns. Enable dynamic data masking for sensitive columns like email. Finally, set up firewall rules to allow only specific IP ranges.

What This Looks Like on the Job

Enterprise Scenario 1: Retail Sales Analytics

A large retail chain with thousands of stores needs to analyze daily sales across all stores. Their OLTP system (SQL Server) handles point-of-sale transactions but struggles with queries like 'Compare sales of winter jackets vs. raincoats by region for the last 3 years.' They build a data warehouse on Azure Synapse using a star schema. The fact table FactSales contains one row per transaction line (grain: product per transaction). Dimensions: DimDate, DimProduct, DimStore, DimCustomer. They use hash distribution on ProductKey to distribute data evenly. They partition by month for easy data management. ETL runs nightly via Azure Data Factory, pulling from the OLTP database, transforming (standardizing product names, calculating discounts), and loading. The warehouse supports dashboards in Power BI that refresh daily. Without the warehouse, the OLTP system would be overloaded by analytical queries, causing transaction slowdowns. Misconfiguration: initially, they used round-robin distribution on the fact table, leading to data skew and poor query performance. Switching to hash distribution on a high-cardinality column (ProductKey) balanced the load and improved query times by 70%.

Enterprise Scenario 2: Healthcare Claims Processing

A health insurance company processes millions of claims daily. They need to detect fraud patterns and analyze claim rejection rates. They use Azure Data Lake Storage Gen2 as a data lake to store raw claim files (JSON/Parquet). Using Synapse serverless SQL pool, they query the data lake directly without loading into a dedicated pool. This is cost-effective because queries are sporadic. They create external tables over the Parquet files and run T-SQL queries to aggregate claim amounts by provider, diagnosis code, and date. For deeper analysis, they use Synapse Spark notebooks to run machine learning models. The challenge is data variety: claim formats change over time. They handle schema evolution using Delta Lake on the data lake, which provides ACID transactions. A common mistake: forgetting to partition the data lake files by date, leading to full scans of years of data. They partition by year/month/day, achieving partition elimination and reducing query cost.

Scenario 3: Financial Reporting for a Global Bank

A bank must produce regulatory reports (e.g., Basel III) that require historical data up to 10 years. They use a dedicated SQL pool in Synapse with 100 TB of data. They use Type 2 slowly changing dimensions to track customer address changes. The fact table is partitioned by quarter and distributed by hash on AccountKey. They create materialized views for common aggregates (e.g., daily balance per account) to speed up reports. Security is critical: they implement row-level security so that branch managers see only their branch data. They also use column-level security to hide account numbers from non-privileged users. Performance tuning: they monitor sys.dm_pdw_exec_requests and find queries with high total_elapsed_time and use the query plan to identify missing statistics. They also use result-set caching for repeated queries. A frequent issue is that the ETL pipeline fails due to data type mismatches; they use data flow transformations to cast columns.

How DP-900 Actually Tests This

The DP-900 exam focuses on conceptual understanding rather than deep implementation. Key objective codes: 1.4 Describe analytics and data warehouse concepts. Specifically, you must be able to:

Differentiate between OLTP and OLAP workloads.

Identify components of a data warehouse (fact tables, dimension tables, star schema).

Describe ETL vs. ELT processes.

Understand the role of Azure Synapse Analytics and its components (dedicated SQL pool, serverless SQL pool, PolyBase).

Compare data lake, data warehouse, and data lakehouse.

Common Wrong Answers and Why Candidates Choose Them: 1. Choosing OLTP for analytical queries: Candidates see 'transaction' and think 'sales reporting.' Wrong – OLTP is for high-volume writes, not complex aggregates. The key is 'transactional' vs. 'analytical.' 2. Mistaking dimension tables for fact tables: In a star schema, fact tables contain measures (numbers) and foreign keys; dimension tables contain descriptive attributes. A common trick: the exam lists columns and asks which table is the fact. If the table has 'SalesAmount' and 'Quantity', it's a fact. If it has 'ProductName' and 'Category', it's a dimension. 3. Thinking ETL and ELT are the same: ETL transforms before loading, ELT transforms after. The exam may ask which is more suitable for raw data lakes – ELT, because you load raw first. 4. Assuming PolyBase is only for on-premises: PolyBase works with Azure Storage as well. Candidates might think it's legacy.

Specific Numbers and Terms: - DWU (Data Warehouse Units): range from DW100c to DW30000c. - Distribution count: 60 distributions per node. - Columnstore index is default for Synapse. - PolyBase supports CSV, Parquet, ORC, and Avro. - Synapse serverless SQL pool charges per TB of data processed.

Edge Cases: - The exam may ask about handling changing dimensions: Type 2 for history, Type 1 for overwrite. - Partitioning: range partitions on date column. Know that partition elimination speeds queries. - Data lake vs. warehouse: data lake stores raw data, schema-on-read; warehouse stores processed data, schema-on-write.

How to Eliminate Wrong Answers: - If the question mentions 'fast ingestion of raw data,' think data lake or ELT. - If it mentions 'consistent, fast query performance for business reports,' think data warehouse. - If it mentions 'combining data from multiple sources with transformations,' think ETL. - If it mentions 'querying data in Azure Storage using T-SQL without loading,' think PolyBase or serverless SQL.

Key Takeaways

Data warehouses are OLAP systems designed for fast analytical queries on historical data.

Star schema consists of fact tables (measures) and dimension tables (descriptive attributes).

ETL transforms data before loading; ELT loads raw data and transforms inside the warehouse.

Azure Synapse Analytics provides dedicated SQL pools (provisioned) and serverless SQL pools (on-demand).

PolyBase enables querying external data in Azure Storage using T-SQL via external tables.

Columnstore indexes are the default in Synapse and optimize aggregation queries.

Data lakes store raw data (schema-on-read); data warehouses store processed data (schema-on-write).

Slowly Changing Dimensions (SCD) Type 2 preserves historical changes by adding new rows.

Partitioning allows partition elimination to speed up queries on large tables.

Row-level security (RLS) restricts data access based on user identity.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

OLTP (Online Transaction Processing)

Optimized for high-volume writes/updates (INSERT/UPDATE)

Normalized schema (many tables, fewer joins)

Row-oriented storage (B-tree indexes)

Typical latency: milliseconds per transaction

Used for order entry, banking transactions

OLAP (Online Analytical Processing)

Optimized for complex read queries (aggregations)

Denormalized schema (star/snowflake, fewer joins)

Column-oriented storage (columnstore indexes)

Typical latency: seconds to minutes per query

Used for business intelligence, reporting

Watch Out for These

Mistake

A data warehouse is the same as a database.

Correct

A database (OLTP) is optimized for fast writes and updates, with normalized tables. A data warehouse (OLAP) is optimized for complex reads and aggregations, using denormalized schemas like star schema. They serve different purposes.

Mistake

ETL and ELT are interchangeable terms.

Correct

ETL transforms data before loading, requiring staging area and transformation compute. ELT loads raw data first, then transforms inside the warehouse, leveraging the warehouse's compute power. ELT is common in cloud because storage is cheap.

Mistake

PolyBase only works with Hadoop.

Correct

PolyBase works with Azure Blob Storage, Azure Data Lake Storage, and Hadoop. In Synapse, it is used to create external tables on files in Azure Storage.

Mistake

A data lake and data warehouse are the same thing.

Correct

A data lake stores raw data in native format (schema-on-read), is cheap, but requires effort to query. A data warehouse stores cleaned, structured data (schema-on-write), is more expensive, but offers fast query performance.

Mistake

You must use a dedicated SQL pool for all Synapse queries.

Correct

Synapse offers serverless SQL pool that queries data lake files directly without provisioning resources. It is ideal for ad-hoc queries and exploration, while dedicated pools are for predictable, high-performance workloads.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between a data lake and a data warehouse?

A data lake stores raw data in its native format (structured, semi-structured, unstructured) and uses schema-on-read, meaning you define the schema when querying. It is cheap and flexible. A data warehouse stores processed, cleaned, and structured data with a predefined schema (schema-on-write). It is optimized for fast analytical queries but is more expensive. In Azure, Data Lake Storage is the data lake, and Synapse SQL pool is the data warehouse.

What are fact tables and dimension tables in a star schema?

Fact tables contain quantitative measures (e.g., sales amount, quantity) and foreign keys to dimension tables. They are the central table in a star schema and are typically very large. Dimension tables contain descriptive attributes (e.g., product name, customer city, date) and are smaller. They provide context to the facts. For example, a sales fact table might join with DimProduct to get product category.

What is PolyBase in Azure Synapse?

PolyBase is a technology that allows you to query external data sources using T-SQL. In Synapse, you create external tables that point to files in Azure Data Lake Storage or Blob Storage. You can then run SELECT queries on those tables as if they were local. PolyBase is read-only for external data and supports CSV, Parquet, ORC, and other formats.

When should I use a dedicated SQL pool vs. serverless SQL pool in Synapse?

Use a dedicated SQL pool when you have predictable, high-performance workloads that require consistent query performance and you want to provision fixed compute resources (DWU). It is ideal for large-scale data warehousing with high concurrency. Use serverless SQL pool for ad-hoc queries, exploration, and when you want to pay only for the data processed per query. Serverless is great for querying data lake files without loading.

What is the purpose of columnstore indexes in a data warehouse?

Columnstore indexes store data column-wise instead of row-wise. This allows for high compression (up to 10x) and faster scans because only the columns needed are read. They are ideal for aggregation queries that scan large portions of a table. In Synapse, columnstore indexes are the default and are recommended for fact tables.

What is the difference between ETL and ELT?

ETL (Extract, Transform, Load) transforms data in a staging area before loading it into the target system. This ensures only clean data is loaded. ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the target system (e.g., using SQL or Spark). ELT is common in cloud data warehouses because they have powerful compute for transformations.

How do Slowly Changing Dimensions (SCD) work?

SCDs handle changes to dimension attributes over time. Type 0 ignores changes. Type 1 overwrites the old value (no history). Type 2 adds a new row with a new surrogate key and effective date range (preserves full history). Type 3 adds a new column to store the previous value (limited history). Type 2 is most common for tracking historical changes.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Analytics and Data Warehouse Concepts — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?