DP-900Chapter 64 of 101Objective 3.1

Bronze, Silver, Gold Medallion Architecture

This chapter covers the Bronze, Silver, Gold Medallion Architecture, a layered data design pattern used in Azure Databricks and modern data lakes to organize data by quality and readiness. For the DP-900 exam, this topic appears under Domain 3: Analytics, Objective 3.1, and typically represents 5–10% of questions. You will need to understand the purpose of each layer, how data flows between them, and the benefits of this architecture for analytics and machine learning workloads.

25 min read
Intermediate
Updated May 31, 2026

Oil Refinery: Crude to Refined Products

Imagine an oil refinery processing crude oil. The first stage (Bronze) is a distillation tower that separates crude into basic fractions like naphtha, kerosene, and gas oil. The output is raw and contains impurities—it's stored temporarily in large tanks. The second stage (Silver) takes those fractions and runs them through catalytic crackers and hydrotreaters to remove sulfur, adjust viscosity, and produce standard products like gasoline and diesel. These are blended to meet specifications and stored in labeled tanks. The third stage (Gold) takes specific blends and further processes them into high-value specialty products like lubricating oils, jet fuel, and petrochemical feedstocks. Each stage adds purification and enrichment. The refinery's value lies in transforming cheap crude into increasingly valuable, ready-to-use products. Similarly, data in the medallion architecture moves from raw ingestion (Bronze) to cleaned and enriched (Silver) to business-ready aggregates (Gold). Just as a refinery cannot skip steps—you cannot make jet fuel directly from crude—data pipelines must pass through each layer to ensure quality and usability.

How It Actually Works

What is the Medallion Architecture?

The Medallion Architecture is a data design pattern that organizes data into three incremental quality layers: Bronze (raw), Silver (cleaned and enriched), and Gold (aggregated, business-ready). It is widely used in Azure Databricks and Delta Lake environments to improve data reliability, performance, and governance. The architecture was popularized by Databricks and is now a standard recommendation for lakehouse implementations.

Why It Exists

Traditional ETL pipelines often dump raw data into a single location, making it difficult to trace lineage, enforce quality, and serve diverse consumers. The Medallion Architecture solves this by providing a clear progression: raw data is immutable, cleaned data is queryable, and aggregated data is optimized for dashboards and ML. This separation allows data engineers to reprocess from any layer without affecting downstream consumers.

How It Works Internally

#### Bronze Layer - Purpose: Ingests raw data from source systems (e.g., Azure Event Hubs, IoT Hub, Blob Storage) in its original format. - Storage: Typically Delta Lake tables with append-only mode. Data is stored in columnar Parquet format with a transaction log for ACID compliance. - Schema: Schema-on-read; the table schema is often defined loosely (e.g., all columns as strings) to avoid ingestion failures. - Key Properties: Immutable after write; records are never updated or deleted. Partitioning by ingestion date/time is common (e.g., yyyy/MM/dd). - Example: A JSON file from an IoT device is written to /bronze/iot/device_events/2025/03/15/events.json.

#### Silver Layer - Purpose: Cleanses, deduplicates, validates, and enriches the Bronze data. This layer is optimized for analysts and data scientists to explore. - Transformations:

- Type casting (e.g., string to timestamp) - Null handling (drop or default values) - Deduplication using row_number() over window - Joining with reference data (e.g., lookup tables) - Filtering out corrupted records - Storage: Delta Lake tables with upsert (merge) capability. The schema is well-defined and enforced. - Key Properties: Data is queryable; often used for ad-hoc analysis and as a source for further aggregation. - Example: The IoT events are parsed, invalid sensors are filtered, and device IDs are validated against a master device table.

#### Gold Layer - Purpose: Provides business-level aggregates, KPIs, and curated datasets for reporting, dashboards, and machine learning. - Transformations:

- Aggregations (e.g., SUM, AVG, COUNT) over time windows - Star schema modeling (fact and dimension tables) - Feature engineering for ML - Storage: Delta Lake tables, often with additional indexing or Z-ordering for performance. - Key Properties: Highly optimized for consumption by tools like Power BI, Azure Analysis Services, or ML frameworks. - Example: A table daily_sensor_agg with columns: date, device_id, avg_temperature, max_humidity.

Key Components and Defaults

Delta Lake: The default storage format for all layers in Azure Databricks. Provides ACID transactions, time travel, and schema enforcement.

Auto Loader: Used for incremental ingestion into Bronze. It automatically detects new files in cloud storage and processes them with schema inference and evolution.

Databricks Notebooks: Common tool for writing transformations between layers.

Pipeline Orchestration: Azure Data Factory or Databricks Jobs orchestrate the flow from Bronze to Silver to Gold.

Configuration and Verification

To view the layers in the Databricks catalog:

SHOW DATABASES;  -- typically 'bronze', 'silver', 'gold'
SHOW TABLES IN bronze;

To inspect data quality in Silver:

SELECT COUNT(*) AS total, 
       COUNT(DISTINCT device_id) AS distinct_devices,
       SUM(CASE WHEN temperature IS NULL THEN 1 ELSE 0 END) AS null_temps
FROM silver.iot_events;

Interaction with Related Technologies

Azure Data Lake Storage Gen2: The underlying storage for Delta tables. Hierarchical namespace enables efficient partitioning.

Azure Databricks: The compute engine that processes data through the layers using Spark.

Power BI: Connects to Gold tables for real-time dashboards.

Azure Purview: Can catalog and lineage-track data across Bronze, Silver, and Gold.

Common Traps on the Exam

Trap: Bronze data is always cleaned. Reality: Bronze is raw and may contain duplicates, nulls, or schema variations.

Trap: Silver is the final layer for reporting. Reality: Silver is cleaned but not aggregated; Gold is for reporting.

Trap: The architecture is only for streaming data. Reality: It works equally well for batch and streaming.

Trap: Gold tables are always smaller than Silver. Reality: Aggregations can reduce row count but increase column count (e.g., pivoted data), so size varies.

Walk-Through

1

Ingest raw data into Bronze

Data is ingested from source systems (e.g., Azure Event Hubs, Blob Storage, or IoT Hub) using Auto Loader or Structured Streaming. The ingestion process writes data as-is into Delta Lake tables in the Bronze layer. Each batch or micro-batch creates a new version of the table. The schema is inferred and may evolve over time. Partitioning by date (e.g., /yyyy/MM/dd) is common to optimize future reads. At this stage, no transformations are applied; even malformed records are stored with a flag or in a quarantine column.

2

Validate and clean data to Silver

A Databricks notebook or job reads from Bronze and applies transformations: type casting, null handling, deduplication using `row_number() over (partition by key order by ingestion_time desc)`, and joining with reference tables. Records that fail validation (e.g., missing required fields) are either dropped or written to a separate error table. The output is written to Silver Delta tables using merge (upsert) to handle updates. Schema is enforced; any new columns must be explicitly added via schema evolution.

3

Aggregate and curate data to Gold

From Silver, data is aggregated to produce business metrics. Common operations include daily aggregations (e.g., `SELECT date, device_id, AVG(temperature) FROM silver.iot_events GROUP BY date, device_id`), star schema modeling (creating fact and dimension tables), and feature engineering for ML (e.g., rolling windows, lag features). The output is written to Gold Delta tables with optimization techniques like Z-ordering on frequently filtered columns and `OPTIMIZE` and `VACUUM` commands to maintain performance.

4

Serve Gold data to consumers

Gold tables are exposed to end users via Power BI, Azure Analysis Services, or directly queried by data scientists. Access controls are applied using Databricks ACLs or Azure RBAC. The data is typically read-only for consumers. Refresh schedules are managed by the pipeline orchestration (e.g., Azure Data Factory) or Databricks Jobs, ensuring Gold tables are updated after each Silver-to-Gold run. Monitoring alerts are set up to detect stale data or quality issues.

5

Monitor and maintain data quality

Data quality checks are embedded in each layer transition. For example, row counts are compared between Bronze and Silver to ensure no unexpected drops. Delta Lake time travel allows rolling back to a previous version if a bad transformation is applied. Alerts are configured in Azure Monitor or Databricks to notify on schema drift, null rate spikes, or pipeline failures. Regular `VACUUM` operations remove old files to control storage costs.

What This Looks Like on the Job

Scenario 1: IoT Telemetry Pipeline

A manufacturing company ingests sensor data from thousands of IoT devices into Azure Event Hubs. The pipeline uses Azure Databricks with Auto Loader to stream data into the Bronze layer as raw JSON. The Silver layer cleans the data: it converts timestamps, filters out sensors with error codes, and joins with a device registry stored in Azure SQL Database using a lookup table in Databricks. The Gold layer computes hourly averages per production line and writes to a Delta table that feeds a Power BI dashboard. Common issues include schema drift when new sensor types are added—handled by enabling schema evolution in Auto Loader. Performance is optimized by partitioning Bronze by hour and Silver by date and device_id. Misconfiguration often occurs when the Silver layer does not deduplicate, causing inflated aggregates in Gold.

Scenario 2: Retail Clickstream Analysis

An e-commerce company captures clickstream data from its website into Azure Blob Storage as Parquet files. Bronze tables store raw events (page views, clicks, purchases) with a loose schema. Silver transforms the data: it joins with a product catalog, identifies sessions using a 30-minute inactivity timeout, and flags bot traffic. Gold creates a daily aggregated table with metrics like unique visitors, conversion rate, and average session duration. The Gold table is consumed by Tableau for executive dashboards. A frequent problem is that the Silver layer's sessionization logic incorrectly splits long sessions when the timeout is too short, leading to inflated visitor counts. The team uses Delta Lake time travel to compare versions and fix the logic.

Scenario 3: Financial Transactions Reconciliation

A bank processes millions of transactions daily from multiple source systems. Bronze ingests raw CSV files from legacy mainframes and JSON from modern APIs. Silver standardizes the schema, validates transaction amounts against reference data, and flags suspicious transactions. Gold produces daily summaries per branch and per product type for regulatory reporting. The pipeline must meet strict SLA for data freshness (Gold must be ready by 6 AM). The team uses Databricks Jobs with retry policies and Azure Monitor alerts. Common misconfiguration: not partitioning Gold tables appropriately, leading to slow Power BI refresh times. The solution is to use Z-ordering on the branch and date columns.

How DP-900 Actually Tests This

Exam Focus: DP-900 Objective 3.1

The DP-900 exam tests the Medallion Architecture under 'Describe analytics workloads' and 'Describe data warehousing and data lakehouse concepts.' You will NOT be asked to write code or configure pipelines. Instead, expect scenario-based questions that ask which layer to use for a given purpose.

Common Wrong Answers

1.

'Bronze layer is for cleaned data.' Wrong. Bronze is raw, unprocessed data. Candidates confuse 'Bronze' with 'Silver' because 'Bronze' sounds like a refined metal. Remember: Bronze = raw, Silver = cleaned, Gold = aggregated.

2.

'Gold layer is for data exploration.' Wrong. Gold is for business reporting and ML. Exploration happens in Silver because it has full detail and is clean. Candidates think 'Gold' is the most valuable, so it must be for everything, but the exam tests the specific purpose.

3.

'The architecture requires three separate storage accounts.' Wrong. All layers can reside in the same storage account, typically under different containers or directories. The question might describe a scenario with three separate accounts as a distractor.

4.

'Data flows directly from Bronze to Gold.' Wrong. The architecture mandates a sequential flow: Bronze → Silver → Gold. Skipping Silver would mean missing cleaning steps, leading to unreliable Gold data.

Specific Values and Terms

Look for keywords: 'raw,' 'immutable,' 'append-only' for Bronze; 'cleaned,' 'validated,' 'deduplicated' for Silver; 'aggregated,' 'curated,' 'business-ready' for Gold.

The exam may ask about Delta Lake as the underlying storage format.

Know that the architecture is also called 'multi-hop architecture' or 'multi-layer architecture.'

Edge Cases

What if a dataset needs no cleaning? The Silver layer could be a pass-through (same as Bronze), but conceptually it still exists as a separate layer.

What if data is already clean? You can still follow the architecture; Silver may just include minor type casting.

The exam might present a scenario where the Bronze layer is used for 'data exploration'—this is false; exploration happens in Silver.

How to Eliminate Wrong Answers

If the question says 'raw data as-is,' eliminate any option that mentions cleaning or aggregation—that's Bronze.

If the question says 'cleaned but not aggregated,' eliminate Bronze (raw) and Gold (aggregated)—that's Silver.

If the question says 'aggregated for dashboards,' eliminate Bronze and Silver—that's Gold.

Look for phrases like 'immutable' (Bronze), 'queryable' (Silver), 'optimized for consumption' (Gold).

Key Takeaways

The Medallion Architecture has three layers: Bronze (raw), Silver (cleaned), Gold (aggregated).

Bronze data is immutable and append-only; never update or delete records in Bronze.

Silver data is cleaned, deduplicated, and validated; it is the layer for data exploration.

Gold data is aggregated and curated for business reporting and machine learning.

Delta Lake is the recommended storage format for all layers due to ACID transactions.

The architecture is also called multi-hop or multi-layer architecture.

Data flows sequentially: Bronze → Silver → Gold; you cannot skip layers.

On the exam, identify the layer by keywords: raw (Bronze), clean (Silver), aggregate (Gold).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Bronze Layer

Stores raw data as-is from source systems

Schema-on-read; no strict schema enforcement

Append-only; data is immutable

Optimized for ingestion speed

No cleaning or deduplication applied

Silver Layer

Stores cleaned, validated, and enriched data

Schema enforced; well-defined structure

Supports upserts (merge) for updates

Optimized for queryability and exploration

Deduplication and type casting applied

Silver Layer

Contains detailed, row-level data

Used for ad-hoc analysis and data science

May include multiple versions of data (time travel)

Not aggregated; retains granularity

Schema is normalized or denormalized as needed

Gold Layer

Contains aggregated, business-level data

Used for dashboards, reports, and ML

Highly optimized for read performance

Aggregated by dimensions like time, geography

Often follows star schema (fact + dimension tables)

Watch Out for These

Mistake

The Bronze layer stores data in a structured, cleaned format.

Correct

Bronze stores data in its raw, original format (e.g., JSON, CSV, Parquet) with minimal schema enforcement. Cleaning happens in the Silver layer.

Mistake

The Gold layer is always smaller than Silver because it is aggregated.

Correct

While aggregations reduce row counts, Gold tables may have more columns (e.g., pivoted data, multiple metrics) and can be larger in size. Size varies by use case.

Mistake

The Medallion Architecture is only for streaming data.

Correct

It works for both batch and streaming. The layers are logical; streaming data is ingested into Bronze and processed through Silver and Gold incrementally.

Mistake

You must use Delta Lake for all layers.

Correct

Although Delta Lake is recommended for ACID transactions, the architecture can be implemented with other formats (e.g., Parquet, Avro) but loses transactional guarantees.

Mistake

Data in the Bronze layer is never updated.

Correct

Correct. Bronze is append-only and immutable. Updates are applied in Silver using merge operations. This ensures raw data is always recoverable.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the main purpose of the Bronze layer in the Medallion Architecture?

The Bronze layer serves as the initial landing zone for raw data ingested from source systems. Its primary purpose is to preserve data in its original format, ensuring immutability and enabling reprocessing if needed. Data is stored in an append-only manner, with minimal transformations (e.g., adding ingestion timestamp). This layer is critical for data lineage and auditability.

Can I skip the Silver layer and go directly from Bronze to Gold?

Technically you can, but it defeats the purpose of the architecture. Skipping Silver means you would perform cleaning and aggregation in a single step, making it harder to reprocess and debug. The Silver layer provides a clean, detailed dataset that can be reused for multiple Gold aggregations. Without Silver, you lose flexibility and data quality.

What is the difference between Bronze and Silver in terms of schema?

Bronze uses schema-on-read, meaning the schema is inferred at query time and can evolve without breaking ingestion. Silver enforces a schema-on-write, where the schema is defined and validated before writing. This ensures data quality and consistency in Silver, while Bronze can accept any format.

How does the Medallion Architecture improve data governance?

By separating raw, cleaned, and aggregated data, the architecture provides clear ownership and access controls. Bronze is typically restricted to ingestion pipelines; Silver is accessible to data engineers and analysts; Gold is for business users. Delta Lake's transaction log enables auditing and rollback.

Is the Medallion Architecture only for Azure Databricks?

No, it is a logical pattern that can be implemented in any data lakehouse environment. However, it is most commonly associated with Azure Databricks and Delta Lake because of their built-in support for ACID transactions and schema enforcement. Other platforms like Apache Spark with Parquet files can also implement it.

What typical transformations happen in the Silver layer?

Common transformations include: type casting (e.g., string to date), null handling (drop or fill), deduplication using row_number() over a key, filtering invalid records, joining with reference data, and renaming columns. The goal is to produce a clean, reliable dataset for analysis.

How often should data be promoted from Bronze to Silver to Gold?

The frequency depends on business requirements. For streaming data, it can be near real-time (every few minutes). For batch data, it might be hourly or daily. The pipeline orchestration (e.g., Azure Data Factory or Databricks Jobs) schedules these promotions. The key is that Gold must be updated frequently enough to meet reporting SLAs.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Bronze, Silver, Gold Medallion Architecture — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?