DP-900Chapter 1 of 101Objective 1.1

Data Roles and Core Concepts

This chapter covers the fundamental data roles, data types, and core concepts that form the foundation of the DP-900 exam. You will learn the distinctions between data engineers, data analysts, and data scientists, the characteristics of structured, semi-structured, and unstructured data, and the differences between transactional and analytical processing. Approximately 15-20% of DP-900 exam questions touch these core concepts, making this chapter essential for building a solid understanding before diving into Azure-specific services.

25 min read
Intermediate
Updated May 31, 2026

The Orchestra of Data Roles

Imagine a symphony orchestra preparing for a concert. The composer writes the musical score—defining the structure, melody, and harmony—but never touches an instrument. The conductor interprets the score, sets the tempo, and ensures each section plays in sync, but does not play a note themselves. The musicians (violins, brass, percussion) are the skilled players who read the sheet music and produce sound using their instruments. The audience listens to the final performance but has no role in creating it. Now map this to data: The data engineer is like the composer—they design and build the data pipelines, schemas, and infrastructure (the score). The data analyst is like the conductor—they query the data, create reports, and guide business decisions (the interpretation). The data scientist is like a soloist—they use advanced techniques to extract hidden insights (improvisation). The business user is the audience—they consume dashboards and reports. In Azure, these roles interact with specific services: data engineers use Azure Data Factory and Azure Synapse pipelines; data analysts use Power BI and Azure Synapse SQL pools; data scientists use Azure Machine Learning. Understanding who does what and which tools they use is critical for the DP-900 exam, which tests your ability to match roles to responsibilities and Azure services.

How It Actually Works

What Are Data Roles and Why Do They Exist?

In any organization that works with data, different individuals have different responsibilities. The DP-900 exam expects you to understand the three primary data roles: data engineer, data analyst, and data scientist. These roles are not just job titles; they represent distinct sets of tasks, tools, and objectives. The exam will test you on which role performs which activities and which Azure services they commonly use.

Data Engineer: Focuses on designing, building, and maintaining the infrastructure and pipelines that collect, store, and process data. They ensure data is available, reliable, and accessible for analysis. Key tasks include building ETL/ELT pipelines, managing data storage (e.g., Azure Blob Storage, Azure Data Lake Storage Gen2), and orchestrating data movement with Azure Data Factory or Azure Synapse Pipelines. They also handle data security, partitioning, and optimization.

Data Analyst: Enables business users to make data-driven decisions by creating reports, dashboards, and visualizations. They query data using SQL or tools like Power BI, identify trends, and communicate insights. Their work relies on clean, structured data prepared by data engineers. Common tools include Power BI, Azure Synapse SQL pools, and Azure Analysis Services.

Data Scientist: Uses advanced statistical and machine learning techniques to predict outcomes and discover hidden patterns. They build and train models using Azure Machine Learning, perform exploratory data analysis (EDA), and deploy models into production. They often work with large, complex datasets and require high-performance compute (e.g., Azure Databricks, GPU-enabled VMs).

How Data Roles Interact

In a typical data project, the flow is:

1.

Data Engineer ingests raw data from various sources (on-premises databases, IoT devices, SaaS applications) into Azure Data Lake Storage Gen2.

2.

Data Engineer transforms and cleans the data using Azure Data Factory or Azure Databricks, creating structured tables in Azure Synapse SQL pool or Azure SQL Database.

3.

Data Analyst connects Power BI to the curated data, builds measures and calculated columns, and publishes interactive reports.

4.

Data Scientist may use the same data to train a machine learning model in Azure Machine Learning, then deploy it as an API for predictions.

The exam often presents scenarios describing a task and asks you to identify which role is responsible. For example: "Who would design a pipeline to copy data from Azure Blob Storage to Azure SQL Database?" Answer: Data engineer.

Core Data Concepts: Structured, Semi-Structured, and Unstructured Data

Data can be classified by its structure. The DP-900 exam tests your ability to distinguish between these three types and identify appropriate storage solutions.

Structured Data: Data that conforms to a rigid schema with rows and columns, typically stored in relational databases. Each column has a defined data type (e.g., INT, VARCHAR, DATE). Examples: SQL tables, Excel spreadsheets. Azure services: Azure SQL Database, Azure SQL Managed Instance, Azure Synapse SQL pool.

Semi-Structured Data: Data that has some organizational properties (tags, key-value pairs) but does not require a fixed schema. It is self-describing, meaning the schema is embedded within the data. Common formats: JSON, XML, Parquet, Avro. Examples: Azure Cosmos DB items, Azure Blob Storage JSON files. Azure services: Azure Cosmos DB, Azure Blob Storage, Azure Data Lake Storage.

Unstructured Data: Data with no predefined structure or schema. It is typically binary or text files that cannot be easily stored in a relational model. Examples: images, videos, audio files, PDFs, log files. Azure services: Azure Blob Storage (for general unstructured data), Azure Data Lake Storage (for big data analytics), Azure Files (for file shares).

Transactional vs. Analytical Processing

Another core concept is the difference between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). The exam expects you to understand the characteristics and appropriate use cases for each.

OLTP (Transactional Processing) - Designed for high-volume, low-latency transactions (e.g., order processing, banking transactions). - Optimized for INSERT, UPDATE, DELETE operations with high concurrency. - Uses normalized schemas to reduce redundancy and ensure data integrity (ACID properties). - Azure services: Azure SQL Database, Azure SQL Managed Instance, Azure Database for MySQL/PostgreSQL.

OLAP (Analytical Processing) - Designed for complex queries and aggregations over large historical datasets (e.g., BI reports, data mining). - Optimized for read-heavy workloads with large scans and aggregations. - Uses denormalized schemas (star or snowflake) to improve query performance. - Azure services: Azure Synapse SQL pool (dedicated SQL pool), Azure Analysis Services, Power BI.

Key Differences at a Glance

| Aspect | OLTP | OLAP | |--------|------|------| | Purpose | Handle day-to-day transactions | Support business intelligence and analytics | | Data updates | Frequent, small writes | Bulk loads, periodic refreshes | | Query type | Simple, point queries (e.g., get order by ID) | Complex aggregations (e.g., total sales by region) | | Schema | Highly normalized (3NF) | Denormalized (star/snowflake) | | Response time | Milliseconds to seconds | Seconds to minutes | | Historical data | Limited (recent data) | Extensive (years of data) |

Data Processing: Batch vs. Real-Time

Data can be processed in two primary modes:

Batch Processing: Data is collected over a period (e.g., hourly, daily) and processed in a single job. Commonly used for ETL, reporting, and data warehousing. Azure services: Azure Data Factory, Azure Synapse Pipelines, Azure Databricks (batch jobs).

Real-Time (Stream) Processing: Data is processed as it arrives, with minimal latency. Used for fraud detection, IoT telemetry, and live dashboards. Azure services: Azure Stream Analytics, Azure Event Hubs, Azure IoT Hub, Azure Databricks (structured streaming).

The exam may ask you to recommend a processing approach based on latency requirements.

Data Storage Abstractions

Understanding how data is stored in Azure is critical. Key concepts include:

Blob Storage: For unstructured data. Blobs are stored in containers within a storage account. Three types: block blobs (for files), append blobs (for logging), page blobs (for VHDs).

Data Lake Storage Gen2: Combines Blob Storage with a hierarchical namespace for file system semantics. Supports POSIX permissions and is optimized for analytics workloads.

Azure Files: Managed file shares accessible via SMB protocol. Ideal for lift-and-shift migrations.

Azure SQL Database: Fully managed relational database. Supports elastic pools for resource sharing.

Azure Cosmos DB: Globally distributed NoSQL database for semi-structured data. Supports multiple APIs (SQL, MongoDB, Cassandra, Gremlin, Table).

Data Redundancy and Durability

Azure Storage accounts offer several redundancy options:

Locally Redundant Storage (LRS): Three copies within a single datacenter. Protects against server rack failures but not datacenter failure.

Zone-Redundant Storage (ZRS): Three copies across availability zones in the same region. Protects against datacenter failure.

Geo-Redundant Storage (GRS): LRS in primary region plus LRS in a paired secondary region. Protects against region-wide failure.

Read-Access Geo-Redundant Storage (RA-GRS): Same as GRS but secondary data is readable.

The exam may ask which redundancy option provides the highest durability or availability.

Command Examples

While DP-900 is conceptual, you may see Azure CLI or PowerShell commands in exam scenarios. For example:

Create a storage account:

az storage account create --name mystorageaccount --resource-group myResourceGroup --location eastus --sku Standard_LRS

Upload a blob:

az storage blob upload --account-name mystorageaccount --container-name mycontainer --name myfile.txt --file /path/to/myfile.txt

Query a SQL database:

SELECT COUNT(*) FROM Sales WHERE OrderDate > '2023-01-01';

Trap Patterns

Common wrong answers on DP-900 include:

Confusing data analyst with data scientist: Data analysts create reports; data scientists build ML models.

Mixing structured and semi-structured: JSON is semi-structured, not structured.

Thinking OLTP is for analytics: OLTP is for transactions, OLAP is for analytics.

Assuming all data in Blob Storage is unstructured: Blob can store any type, but it is commonly used for unstructured data.

Walk-Through

1

Identify the Data Role

When presented with a scenario, first determine the primary task. If the task involves building pipelines, managing storage, or ETL, the role is a data engineer. If it involves querying, reporting, or visualization, the role is a data analyst. If it involves machine learning, predictive modeling, or advanced statistics, the role is a data scientist. The exam will test this distinction directly.

2

Classify the Data Structure

Examine the data format. If it has a fixed schema with rows and columns (e.g., SQL table, CSV with header), it is structured. If it is self-describing with tags (e.g., JSON, XML, Parquet), it is semi-structured. If it has no schema (e.g., images, videos, raw text), it is unstructured. This classification determines the appropriate Azure storage service.

3

Determine Processing Type

Assess whether the workload is transactional (OLTP) or analytical (OLAP). OLTP involves many small, frequent writes and simple queries. OLAP involves large scans, complex aggregations, and historical data. Also decide between batch (scheduled, periodic) and real-time (continuous, low latency) processing based on latency requirements.

4

Select the Azure Service

Based on the role, data structure, and processing type, choose the appropriate Azure service. For relational OLTP: Azure SQL Database. For NoSQL semi-structured: Azure Cosmos DB. For big data analytics: Azure Synapse Analytics. For data movement: Azure Data Factory. For machine learning: Azure Machine Learning. For visualization: Power BI.

5

Verify with Exam Questions

Practice with sample questions that ask: 'Which Azure service should a data analyst use to create interactive reports?' (Power BI). 'Which role is responsible for designing data pipelines?' (Data engineer). 'Which type of data is JSON?' (Semi-structured). 'Which processing model is best for real-time IoT data?' (Stream processing).

What This Looks Like on the Job

Scenario 1: E-commerce Company Building a Customer 360 View

A large e-commerce company wants a unified view of customer interactions across web, mobile, and in-store. The data engineer builds Azure Data Factory pipelines to ingest clickstream logs (semi-structured JSON), transaction records (structured SQL), and customer service transcripts (unstructured text) into Azure Data Lake Storage Gen2. They then use Azure Databricks to clean and join the data, outputting structured tables to Azure Synapse SQL pool. The data analyst connects Power BI to Synapse, creates measures like 'Customer Lifetime Value', and publishes a dashboard for marketing. The data scientist uses Azure Machine Learning to build a churn prediction model using the same curated data. Common misconfiguration: if the data engineer fails to partition the data by date, queries become slow and expensive. In production, the pipeline processes 5 TB daily with a 99.9% SLA.

Scenario 2: Healthcare Provider Implementing Real-Time Patient Monitoring

A hospital network streams vital signs from IoT devices to Azure Event Hubs. A data engineer sets up Azure Stream Analytics to filter abnormal readings and write alerts to Azure SQL Database (OLTP) for immediate action. The same Stream Analytics job aggregates data into Azure Blob Storage (Parquet format) for historical analysis. The data scientist uses Azure Databricks to train a sepsis prediction model on the historical data. The data analyst creates a Power BI dashboard showing real-time bed occupancy and alert trends. Key performance consideration: Stream Analytics must handle 10,000 events per second with less than 5-second latency. A common mistake is using batch processing, which would delay critical alerts.

Scenario 3: Financial Services Firm Migrating Data Warehouse

A bank migrates its on-premises SQL Server data warehouse to Azure Synapse Analytics. The data engineer uses Azure Data Factory to copy historical data (10 years, 50 TB) using PolyBase for fast loading. They choose a dedicated SQL pool with 100 DWU (Data Warehouse Units) initially, scaling to 500 DWU during nightly ETL. The data analyst recreates existing SSRS reports in Power BI, using DirectQuery for live access. The data scientist uses the Synapse SQL pool as a source for Azure Machine Learning to detect fraudulent transactions. Common pitfall: not using appropriate distribution keys (e.g., hash distribution on CustomerID) leads to data skew and poor query performance. Monitoring with DMVs like sys.dm_pdw_exec_requests is essential.

How DP-900 Actually Tests This

DP-900 Objective 1.1: Core Data Concepts

The exam tests your ability to:

Describe the roles of data professionals (data engineer, data analyst, data scientist).

Identify structured, semi-structured, and unstructured data.

Differentiate between transactional (OLTP) and analytical (OLAP) workloads.

Understand batch vs. streaming processing.

Top 3 Wrong Answers Candidates Choose

1.

Confusing data analyst with data scientist: A scenario says 'build a machine learning model.' Many choose data analyst. Reality: Data scientists build ML models; data analysts create reports. Key differentiator: ML vs. visualization.

2.

Mixing structured and semi-structured: When asked 'what type of data is JSON?', candidates often say structured. Reality: JSON is semi-structured because it has tags (key-value pairs) but no fixed schema. The exam loves this.

3.

Thinking OLTP is for analytics: A question asks 'which workload is best for complex queries over historical data?' Candidates pick OLTP. Reality: OLAP is for analytics; OLTP is for transactions.

Specific Numbers and Terms

Redundancy options: LRS (3 copies in one datacenter), ZRS (3 copies across zones), GRS (6 copies across two regions), RA-GRS (readable secondary).

OLTP vs. OLAP: OLTP uses normalized schema; OLAP uses denormalized (star/snowflake).

Azure services: Azure SQL Database (OLTP), Azure Synapse SQL pool (OLAP), Azure Cosmos DB (NoSQL), Azure Data Factory (ETL), Power BI (visualization).

Edge Cases

A CSV file with headers is structured, but a CSV without headers is still structured if rows follow a consistent format.

Parquet and Avro are semi-structured (columnar storage with schema).

Data Lake Storage Gen2 supports both structured and unstructured data.

How to Eliminate Wrong Answers

If the question mentions 'pipeline', 'ETL', or 'storage', the answer is likely data engineer.

If it mentions 'report', 'dashboard', or 'visualization', the answer is data analyst.

If it mentions 'machine learning', 'model', or 'prediction', the answer is data scientist.

For data type questions: look for 'schema' or 'table' -> structured; 'tags' or 'self-describing' -> semi-structured; 'binary' or 'no structure' -> unstructured.

Key Takeaways

Data engineers build and maintain data pipelines; data analysts create reports; data scientists build ML models.

Structured data has a fixed schema (SQL tables); semi-structured has tags (JSON); unstructured has no schema (images).

OLTP is for high-volume transactions with normalized schema; OLAP is for analytics with denormalized schema.

Batch processing processes data in scheduled chunks; stream processing processes data in real-time.

Azure Blob Storage stores unstructured data; Azure Data Lake Storage Gen2 adds hierarchical namespace for analytics.

Azure Cosmos DB is a globally distributed NoSQL database for semi-structured data.

Power BI is the primary tool for data analysts to create interactive reports and dashboards.

Azure Machine Learning is used by data scientists to build, train, and deploy machine learning models.

Data redundancy options: LRS (local), ZRS (zone), GRS (geo), RA-GRS (readable geo).

The exam tests role-to-task and data-type-to-service matching.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Data Engineer

Builds data pipelines and infrastructure

Uses Azure Data Factory, Azure Synapse Pipelines, Azure Databricks

Focuses on data ingestion, transformation, storage

Works with raw and curated data

Ensures data reliability and availability

Data Analyst

Creates reports and dashboards

Uses Power BI, Azure Synapse SQL pools, Azure Analysis Services

Focuses on querying, visualization, insights

Works with curated, clean data

Communicates findings to business stakeholders

Structured Data

Fixed schema (rows and columns)

Stored in relational databases (Azure SQL, SQL Server)

Examples: SQL tables, CSV files with headers

Schema defined before data is stored

Easily queried with SQL

Semi-Structured Data

Self-describing schema (tags, key-value pairs)

Stored in NoSQL databases (Azure Cosmos DB) or files (JSON, Parquet)

Examples: JSON, XML, Parquet, Avro

Schema can evolve over time

Queried with specialized APIs (SQL API for Cosmos DB, or tools like Spark)

OLTP

High-volume, low-latency transactions

Optimized for INSERT/UPDATE/DELETE

Normalized schema (3NF)

Millisecond response times

Azure SQL Database, Azure Database for MySQL

OLAP

Complex queries over large historical data

Optimized for SELECT and aggregations

Denormalized schema (star/snowflake)

Seconds to minutes response times

Azure Synapse SQL pool, Azure Analysis Services

Watch Out for These

Mistake

Data analysts and data scientists are the same role.

Correct

Data analysts focus on descriptive analytics (what happened) using reports and dashboards. Data scientists focus on predictive analytics (what will happen) using machine learning models. They use different tools: Power BI vs. Azure Machine Learning.

Mistake

JSON is unstructured data.

Correct

JSON is semi-structured because it uses key-value pairs and can nest objects, but it does not require a fixed schema. Unstructured data has no structure at all, like images or raw text.

Mistake

OLTP systems are best for data warehousing.

Correct

OLTP systems are optimized for high-volume transactions and normalized schemas. Data warehousing requires OLAP systems that are optimized for complex queries and denormalized schemas (star/snowflake).

Mistake

Batch processing is always slower than real-time processing.

Correct

Batch processing can be very efficient for large volumes of data with no latency requirement. Real-time processing introduces overhead for low-latency delivery. The choice depends on the use case, not just speed.

Mistake

All data in Azure Blob Storage is unstructured.

Correct

Azure Blob Storage can store any type of data, including structured (e.g., CSV), semi-structured (e.g., JSON), and unstructured (e.g., videos). The storage service itself is agnostic; the data type is determined by the content.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between a data engineer and a data analyst?

A data engineer designs and builds the infrastructure and pipelines to collect, store, and process data. They work with raw data and ensure it is available for analysis. A data analyst uses curated data to create reports, dashboards, and visualizations to help businesses make decisions. In Azure, data engineers use Azure Data Factory and Azure Synapse Pipelines; data analysts use Power BI and Azure Synapse SQL pools. The exam tests this distinction by presenting a task and asking which role performs it.

Is JSON considered structured or unstructured data?

JSON is semi-structured data. It has a structure defined by key-value pairs and nesting, but it does not enforce a fixed schema like a relational table. Each JSON object can have different fields. This flexibility makes it semi-structured. In Azure, JSON is commonly stored in Azure Cosmos DB or as files in Azure Blob Storage. The exam expects you to classify JSON as semi-structured, not structured or unstructured.

What is the difference between OLTP and OLAP?

OLTP (Online Transaction Processing) handles high volumes of small, real-time transactions (e.g., order processing). It uses normalized schemas to ensure data integrity. OLAP (Online Analytical Processing) handles complex queries and aggregations over large historical datasets (e.g., sales reports). It uses denormalized schemas (star/snowflake) for query performance. In Azure, OLTP services include Azure SQL Database; OLAP services include Azure Synapse SQL pool. The exam tests which workload fits each type.

What is batch processing vs. stream processing?

Batch processing processes data in scheduled, discrete chunks (e.g., nightly ETL). It is suitable for large volumes with no real-time requirement. Stream processing processes data continuously as it arrives, with low latency (e.g., IoT telemetry). Azure services: Azure Data Factory for batch, Azure Stream Analytics for stream. The exam may ask you to choose the appropriate processing model based on latency requirements.

What Azure storage service should I use for unstructured data?

For general unstructured data (images, videos, documents), use Azure Blob Storage. For big data analytics workloads that require a hierarchical namespace and POSIX permissions, use Azure Data Lake Storage Gen2. Both are part of Azure Storage accounts. The exam often presents a scenario and asks you to select the appropriate storage service based on the data type and access patterns.

What is the role of a data scientist in Azure?

A data scientist uses advanced analytics and machine learning to build predictive models. They use Azure Machine Learning to train, deploy, and manage models. They also use Azure Databricks for large-scale data processing and experimentation. The exam distinguishes data scientists from data analysts: data scientists build models, while data analysts create reports.

What is the difference between LRS, ZRS, GRS, and RA-GRS?

LRS (Locally Redundant Storage) keeps three copies within a single datacenter. ZRS (Zone-Redundant Storage) keeps three copies across availability zones in the same region. GRS (Geo-Redundant Storage) keeps LRS in the primary region and LRS in a paired secondary region. RA-GRS is like GRS but allows read access to the secondary copy. The exam may ask which option provides the highest durability (GRS/RA-GRS) or which is most cost-effective (LRS).

Terms Worth Knowing

Ready to put this to the test?

You've just covered Data Roles and Core Concepts — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?