A data analyst needs to query a large dataset stored in Azure Blob Storage using serverless SQL pool in Azure Synapse Analytics. Which data format should they use to minimize storage costs while still supporting efficient querying?
Parquet is a columnar format that offers high compression and efficient query performance, minimizing storage costs.
Why this answer
Parquet is a columnar storage format: values from the same column are stored together, so they compress well and keep the storage footprint small. The same layout lets serverless SQL pool in Azure Synapse read only the columns a query references and, through predicate pushdown, skip blocks of rows whose statistics rule out a match. The result is lower storage cost without sacrificing query performance, unlike row-oriented text formats such as CSV or JSON.
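The effect of reading only the needed columns and skipping rows by statistics can be illustrated with a toy model. The sketch below is not how Parquet or Synapse is implemented; `RowGroup`, `build_row_groups`, `row_scan`, and `columnar_scan` are hypothetical names, and the "cost" is simply the number of values touched while summing `amount` for one year.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A horizontal slice of the table, stored column by column (toy model)."""
    columns: dict  # column name -> list of values
    stats: dict    # column name -> (min, max) of that column in this group


def build_row_groups(rows, group_size):
    """Split rows into groups and record per-column min/max statistics."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append(RowGroup(columns, stats))
    return groups


def row_scan(rows, year):
    """Row-oriented scan: every field of every row is read."""
    total, values_read = 0, 0
    for r in rows:
        values_read += len(r)
        if r["year"] == year:
            total += r["amount"]
    return total, values_read


def columnar_scan(groups, year):
    """Columnar scan: skip groups by statistics, read only two columns."""
    total, values_read = 0, 0
    for g in groups:
        lo, hi = g.stats["year"]
        if not (lo <= year <= hi):
            continue  # statistics prove no row in this group can match
        years, amounts = g.columns["year"], g.columns["amount"]
        values_read += len(years) + len(amounts)
        total += sum(a for y, a in zip(years, amounts) if y == year)
    return total, values_read


# Illustrative data only: 12 rows, 4 columns, sorted by year.
rows = [
    {"year": 2021 + i // 4, "region": "EU" if i % 2 else "US",
     "amount": 10 * (i + 1), "note": f"order-{i}"}
    for i in range(12)
]
groups = build_row_groups(rows, group_size=4)

print(row_scan(rows, 2023))         # (420, 48)
print(columnar_scan(groups, 2023))  # (420, 8)
```

Both scans return the same total, but the columnar scan touches 8 values instead of 48 because it ignores the `region` and `note` columns and skips the two row groups whose `year` range excludes 2023.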
Exam trap
The trap is assuming that every compact, compressed format (such as Avro) is equally efficient for analytics. Avro is row-oriented, whereas Azure Synapse serverless SQL pool is specifically optimized for columnar formats like Parquet.
How to eliminate wrong answers
Option A is wrong because CSV is a row-oriented, plain-text format with no built-in compression or embedded schema, which leads to a larger storage footprint and slower queries, since whole files must be scanned.
Option B is wrong because JSON is also row-oriented text, and because it is self-describing it repeats field names in every record, which inflates file size and forces serverless SQL pool to parse the entire file.
Option D is wrong because Avro, while compact and schema-based, is row-oriented and therefore not optimized for analytical queries, which benefit from columnar storage and predicate pushdown.
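To make the JSON overhead concrete, here is a minimal sketch that serializes the same made-up records as JSON Lines and as CSV using only the Python standard library and compares the uncompressed sizes. The records and field names are illustrative assumptions, and the comparison says nothing about Parquet or Avro, which the standard library cannot write.

```python
import csv
import io
import json

# Illustrative records only; not taken from any real dataset.
records = [
    {"year": 2021 + i % 3, "region": "EU" if i % 2 else "US", "amount": 10 * i}
    for i in range(1000)
]

# JSON Lines: one self-describing object per line, so the field names
# "year", "region", and "amount" are repeated in every record.
json_bytes = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# CSV: the field names appear once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["year", "region", "amount"])
writer.writeheader()
writer.writerows(records)
csv_bytes = buf.getvalue().encode("utf-8")

print("JSON Lines bytes:", len(json_bytes))
print("CSV bytes:       ", len(csv_bytes))
```

The JSON output is larger because the keys travel with every record. Both formats still store data row by row, so neither lets a query engine read a single column in isolation.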