You are designing a data lake architecture using Azure Data Lake Storage Gen2. The data will be ingested from multiple sources with varying schemas. You need to organize the data in a way that supports both batch and streaming analytics while maintaining data lineage. Which folder structure convention should you use?
Medallion architecture provides clear separation and lineage.
Why this answer
Option C is correct because the medallion architecture (bronze, silver, gold) is a widely recommended layout for data lakes on Azure Data Lake Storage Gen2 when multiple sources arrive with varying schemas. Bronze stores raw data as ingested, silver holds cleaned and conformed data produced by incremental transformations, and gold serves aggregated views. Both batch and streaming pipelines can work against the same layers, and lineage is preserved through the clear layer boundaries and audit columns.
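As a rough illustration of the layering and audit columns, the sketch below builds folder paths for each layer and stamps a record with lineage metadata. The folder layout (`layer/source/dataset/ingest_date=...`), the helper names, and the underscore-prefixed column names are all hypothetical choices made for this example, not a convention mandated by Azure; gold in particular is often organized by business domain rather than by source.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical layer names following the bronze/silver/gold convention.
LAYERS = ("bronze", "silver", "gold")


def layer_path(layer: str, source: str, dataset: str, ingest_date: date) -> str:
    """Build a folder path for one medallion layer (assumed layout)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    partition = f"ingest_date={ingest_date.isoformat()}"
    return str(PurePosixPath(layer, source, dataset, partition))


def with_audit_columns(record: dict, source: str, source_path: str,
                       ingest_date: date) -> dict:
    """Return a copy of the record with illustrative lineage columns added."""
    return {
        **record,
        "_source_system": source,      # which system produced the row
        "_source_path": source_path,   # which upstream folder it was read from
        "_ingest_date": ingest_date.isoformat(),
    }


day = date(2024, 1, 15)
bronze = layer_path("bronze", "crm", "orders", day)
silver = layer_path("silver", "crm", "orders", day)

# A silver row keeps a pointer back to the bronze folder it came from.
row = with_audit_columns({"order_id": 1}, "crm", bronze, day)
print(bronze)
print(silver)
print(row)
```

Because every silver row carries the bronze path it was derived from, a question such as "where did this value originate?" can be answered by following the audit columns back one layer at a time.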
Exam trap
Candidates often choose Option B (source, then date) because it looks like a sensible way to organize files. That choice overlooks the requirement to support both batch and streaming analytics while maintaining data lineage, which the medallion architecture addresses explicitly through layered transformations.
How to eliminate wrong answers
Option A is wrong because organizing by ingestion date only, with subfolders for each source, lacks schema evolution support and makes it difficult to trace data lineage across transformations.
Option B is wrong because organizing by source system and then by date does not provide a standardized processing pipeline for both batch and streaming, and it fails to separate raw, cleaned, and aggregated states.
Option D is wrong because organizing by file format and date ignores the need for schema management and lineage tracking, and it does not facilitate incremental processing or data quality checks across layers.