DP-900Chapter 56 of 101Objective 1.1

Data Formats: JSON, CSV, Parquet, and Avro

This chapter covers the four primary data formats you must understand for the DP-900 exam: JSON, CSV, Parquet, and Avro. These formats are fundamental to how data is stored, processed, and exchanged in Azure data services. Approximately 10-15% of DP-900 questions touch on data formats, either directly or in the context of Azure services like Azure Data Lake Storage, Azure Synapse Analytics, and Azure Databricks. Mastering this topic will help you answer questions about data ingestion, storage optimization, and schema management.

25 min read
Intermediate
Updated May 31, 2026

Shipping Containers for Data

Think of data formats like shipping containers used in global logistics. CSV is like a flat cardboard box—lightweight, easy to pack, but no structure for heavy or complex items. Anyone can open it with a simple tool, but if you drop it, items spill everywhere. JSON is like a plastic bin with dividers—flexible, can hold nested items, but takes up more space and requires a bit more effort to unpack. Parquet is like a custom-built steel container with compartments for different types of goods (numbers, dates, text). It's heavy to build but incredibly efficient for transport—multiple items stack neatly, and you only open the compartment you need. Avro is like a container with a built-in manifest—the manifest (schema) travels with the container, so anyone receiving it knows exactly what's inside and how to unpack it, even if they've never seen that container type before. Just as a shipping company chooses containers based on cargo type, distance, and handling equipment, a data engineer chooses a data format based on storage cost, query speed, schema evolution needs, and the tools that will process the data.

How It Actually Works

What Are Data Formats and Why Do They Matter?

Data formats define how data is structured, encoded, and stored in files. They determine how efficiently data can be read, written, queried, and transferred. In Azure, choosing the right format impacts storage costs, query performance (especially with services like Azure Synapse Serverless SQL or PolyBase), and compatibility with tools like Azure Data Factory, Azure Databricks, and Power BI.

The DP-900 exam expects you to understand the characteristics, use cases, and trade-offs of JSON, CSV, Parquet, and Avro. You will not be asked to write code but to identify which format is best for a given scenario.

CSV (Comma-Separated Values)

CSV is a plain-text format where each line represents a row, and columns are separated by a delimiter (typically a comma). It is human-readable and widely supported by almost every data tool.

How it works internally: - Each record is on a separate line. - Fields are separated by commas. If a field contains a comma, newline, or double quote, the field must be enclosed in double quotes. Double quotes inside a field are escaped by doubling them. - There is no standard for encoding, but UTF-8 is common. No schema is embedded; the first row may contain column headers, but this is not guaranteed. - CSV is row-oriented: all columns of a row are stored together.

Key characteristics: - Simple, easy to produce and consume. - No schema enforcement—data types are inferred (or all treated as strings). - No compression by default, but can be compressed with gzip, etc. - Inefficient for large datasets: reading a single column requires reading all rows; no indexing; parsing overhead.

Use cases: - Small datasets, manual editing, data exchange between legacy systems. - Export from databases, import into Excel.

Limitations: - No support for nested or hierarchical data. - No data type metadata—dates, numbers, and booleans are all strings. - No built-in compression; file sizes are large. - Parsing is error-prone due to lack of strict standard.

JSON (JavaScript Object Notation)

JSON is a lightweight text format for structured data using key-value pairs. It supports nested objects and arrays. It is language-independent but uses conventions familiar to programmers.

How it works internally: - Data is organized in objects {} (unordered collection of key-value pairs) and arrays [] (ordered list of values). - Keys are strings; values can be strings, numbers, booleans, null, objects, or arrays. - Whitespace is insignificant (except within strings). - JSON is self-describing: the structure is embedded in the data.

Key characteristics: - Human-readable (though can become verbose). - Schema-on-read: no predefined schema; each object can have different keys. - Supports nested structures, ideal for hierarchical data (e.g., IoT device readings, social media posts). - No built-in compression; files can be large. - Parsing is slower than binary formats.

Use cases: - Web APIs (REST), NoSQL databases (Azure Cosmos DB), configuration files. - Data from IoT devices, log events, telemetry.

Limitations: - Verbose: repeated keys in every record increase file size. - No indexing; full scan required for queries. - Not optimized for analytical queries (columnar access).

Parquet

Parquet is a binary, columnar storage format optimized for analytics. It is designed for efficient compression and encoding of data, especially for large datasets in data lakes.

How it works internally: - Parquet stores data by columns, not rows. All values for a column are stored together in contiguous pages. - Within each column chunk, data is further divided into pages (typically 1 MB uncompressed). Pages contain the actual data values and optional statistics (min, max, null count). - Parquet uses a row group structure: data is horizontally partitioned into row groups (e.g., 128 MB each). Within each row group, column chunks are stored separately. - Encoding techniques: dictionary encoding (replaces repeated values with integer IDs), run-length encoding (RLE), bit packing, delta encoding. These are applied per page. - Compression codecs: Snappy (default), gzip, LZO, LZ4, ZSTD. Snappy offers a good balance of speed and compression ratio. - Schema is stored in the file metadata (footer). The footer contains schema, column metadata, and row group offsets. This allows readers to skip irrelevant row groups or columns. - Parquet supports nested schemas using a repetition and definition level mechanism (Dremel paper).

Key characteristics: - Columnar: only read the columns needed for a query, drastically reducing I/O. - Highly compressed: columnar layout and encoding reduce storage. - Schema embedded: data types are explicit (int32, int64, float, double, byte_array, etc.). - Predicate pushdown: statistics in metadata allow skipping row groups that don't satisfy filter conditions. - Splittable: files can be split for parallel processing (e.g., by row groups).

Use cases: - Data warehousing, analytics, big data processing (Azure Synapse, Databricks, HDInsight). - Storing large volumes of structured data in Azure Data Lake Storage.

Limitations: - Not human-readable; requires tools to read. - Overhead for small datasets: metadata and columnar structure add overhead for small files. - Write performance is slower than row-oriented formats due to columnar assembly.

Avro

Avro is a binary, row-oriented format with a schema embedded in the file. It is designed for fast serialization and data exchange in streaming and messaging systems.

How it works internally: - Avro files consist of a header followed by data blocks. The header contains the schema (in JSON) and optionally a sync marker (a 16-byte random value). - Data blocks are groups of records. Each block starts with a sync marker, followed by the block size (in bytes), the number of records in the block, and the serialized records. - Records are serialized in binary using the schema. The schema defines the record structure, field names, and data types. Avro supports primitive types (null, boolean, int, long, float, double, bytes, string) and complex types (record, enum, array, map, union, fixed). - Avro uses a compact binary encoding: integers are variable-length (ZigZag encoding), strings are length-prefixed UTF-8. - The schema is stored in JSON in the header, making the file self-describing. Schema evolution is supported: readers can use a different schema than the writer, as long as they are compatible (field additions, deletions, type promotions). - Avro files are splittable: each data block can be read independently because the sync marker allows locating block boundaries.

Key characteristics: - Row-oriented: all fields of a record are stored together, making writes fast. - Self-describing: schema is embedded, so any reader can interpret the data. - Compact binary: smaller than JSON but larger than Parquet for analytical queries. - Supports schema evolution: add or remove fields without breaking readers. - Splittable for parallel processing.

Use cases: - Streaming data pipelines (Azure Event Hubs, Kafka, Azure Data Factory). - Data exchange between systems where schema must be preserved. - Log collection and serialization in Hadoop/Spark.

Limitations: - Not columnar: reading a single column requires reading all rows. - Larger than Parquet for analytical storage. - Less efficient compression than columnar formats.

Comparison Summary

| Feature | CSV | JSON | Parquet | Avro | |---------|-----|------|---------|------| | Readability | High | High | Low | Low | | Schema | None/optional | Self-describing | Embedded | Embedded | | Storage | Row | Row | Column | Row | | Compression | None/optional | None/optional | Excellent | Good | | Splittable | Yes (with gzip) | No (unless line-delimited) | Yes | Yes | | Nested data | No | Yes | Yes | Yes | | Schema evolution | No | No | Limited | Yes | | Best for | Small/simple data | APIs, NoSQL | Analytics | Streaming, exchange |

How Azure Services Use These Formats

Azure Data Lake Storage (ADLS) Gen2: Supports all four formats. Parquet is recommended for analytics due to performance.

Azure Synapse Analytics: Serverless SQL can query CSV, JSON, Parquet, and Avro. Dedicated SQL Pool uses tables, but COPY INTO supports these formats.

Azure Databricks: Spark natively optimizes for Parquet (Delta Lake format). Avro is used for streaming. JSON and CSV for ingestion.

Azure Data Factory: Supports all formats for copy activities and data flows.

Azure Event Hubs: Captures data in Avro format by default.

Azure Cosmos DB: Uses JSON natively for its document model.

Exam Tips

Know that Parquet is columnar and optimized for read-heavy analytics.

Know that Avro is row-oriented and optimized for write-heavy streaming.

Know that JSON is self-describing and supports nested data.

Know that CSV is simple but lacks schema and is inefficient for large data.

Remember that the DP-900 exam may ask: "Which format should you use for a data warehouse workload?" Answer: Parquet.

"Which format is best for streaming data where schema may change?" Answer: Avro.

Walk-Through

1

Choose a data format based on workload

Identify whether the workload is read-heavy (analytics) or write-heavy (streaming/ingestion). For analytical queries that read only a few columns, choose a columnar format like Parquet. For high-volume writes where schema evolution is expected, choose a row-oriented format like Avro. For simple data exchange or small datasets, CSV or JSON may suffice. This decision affects storage costs, query performance, and compatibility with Azure services.

2

Define the schema (if applicable)

For formats that support schemas (Parquet, Avro), define the schema upfront. In Parquet, the schema is stored in the file footer. In Avro, the schema is stored in JSON in the file header. For CSV, schema is typically inferred or defined externally. For JSON, schema is implicit in the data. Schema definition includes field names, data types, and nullability. Proper schema design enables efficient encoding and predicate pushdown.

3

Serialize data into the chosen format

Data is serialized according to the format's specification. For CSV, each record becomes a line with comma-separated fields, handling special characters with quotes. For JSON, data is written as key-value pairs with proper nesting. For Parquet, data is grouped into row groups, then column chunks, then pages; encoding and compression are applied. For Avro, records are grouped into blocks, each block serialized using the schema with binary encoding.

4

Write to storage (Azure Data Lake, Blob, etc.)

The serialized file is written to a storage service like Azure Data Lake Storage Gen2 or Azure Blob Storage. File size and number of files matter: many small Parquet files hurt performance due to metadata overhead. Aim for files of 100-500 MB for optimal query performance. For Avro, block size can be configured (default 64 KB). For CSV and JSON, consider compression (e.g., gzip) but note that splittability may be affected.

5

Read and query data using Azure services

When querying with Azure Synapse Serverless SQL or Spark, the engine reads the file metadata to understand schema and partitioning. For Parquet, predicate pushdown uses min/max statistics in metadata to skip irrelevant row groups. For Avro, the schema is read from the header, allowing data interpretation. For CSV, the engine must scan all rows to infer types. Query performance is best with columnar formats and proper file sizing.

What This Looks Like on the Job

Scenario 1: Enterprise Data Lake with Azure Synapse Analytics

A large retail company ingests terabytes of sales data daily from thousands of stores. They use Azure Data Lake Storage Gen2 as the single source of truth. Initially, they stored data as CSV files because it was easy to export from legacy systems. However, queries in Azure Synapse Serverless SQL took minutes even for simple aggregations. The data engineering team converted the data to Parquet format with Snappy compression. They partitioned the data by date and store ID. After conversion, query times dropped from minutes to seconds. Storage costs decreased by 60% due to better compression. The key lesson: Parquet is essential for analytical workloads. Misconfiguration like using too many small files (less than 10 MB) still caused performance issues, so they implemented a job to coalesce files into 200-500 MB chunks.

Scenario 2: Real-Time Streaming with Azure Event Hubs and Avro

A financial services firm processes millions of stock trade messages per second. They use Azure Event Hubs to capture streaming data. By default, Event Hubs Capture writes data in Avro format to Azure Blob Storage. The Avro schema includes fields like trade ID, symbol, price, volume, and timestamp. As new trade types are added, the schema evolves without breaking existing pipelines. Downstream, Azure Databricks reads the Avro files for real-time analytics. The row-oriented nature of Avro allows fast writes, which is critical for high-throughput ingestion. A common mistake is to use JSON for capture, which increases storage costs and parsing overhead. Avro's compact binary encoding reduces storage by 40% compared to JSON.

Scenario 3: IoT Data Ingestion with JSON and Conversion to Parquet

A smart city project collects sensor data from thousands of devices: temperature, humidity, traffic counts, etc. Devices send data as JSON messages via Azure IoT Hub. The raw JSON is stored in Azure Data Lake Storage for archival. However, for analytics, the data is converted to Parquet using Azure Data Factory mapping data flows. The conversion flattens nested JSON into columns and applies schema. This enables fast queries in Power BI dashboards. The challenge is that JSON files are often small (1-2 KB each), leading to the "small file problem" in Parquet. The solution is to batch JSON files and write larger Parquet files (100-200 MB) using a tumbling window in Data Factory.

How DP-900 Actually Tests This

What DP-900 Tests on Data Formats (Objective 1.1)

The DP-900 exam covers data formats under "Core Data Concepts" – specifically, identifying appropriate data storage formats for different workloads. You will not be asked to write code or parse files. Instead, you must understand:

The difference between row-oriented and column-oriented storage.

Which formats support schema evolution.

Which formats are human-readable.

Which formats are optimized for analytical queries.

Which Azure services use which formats by default.

Common Wrong Answers and Why Candidates Choose Them

1.

"CSV is best for analytical queries because it is simple." – Wrong. CSV is row-oriented and lacks compression and indexing. Analytical queries benefit from columnar formats like Parquet that reduce I/O.

2.

"JSON is a binary format." – Wrong. JSON is text-based. Binary formats are Avro and Parquet.

3.

"Avro is columnar." – Wrong. Avro is row-oriented. Candidates confuse Avro with Parquet because both are binary and self-describing.

4.

"Parquet does not support schema." – Wrong. Parquet embeds schema in the file footer.

Specific Values and Terms That Appear on the Exam

Snappy compression: Default for Parquet.

Column-oriented / columnar: Key term for Parquet.

Self-describing: Applies to JSON, Avro, and Parquet.

Schema evolution: Supported by Avro; limited in Parquet.

Row group: Parquet unit of splitting.

Block: Avro unit of records.

Event Hubs Capture: Uses Avro by default.

Azure Synapse Serverless SQL: Can query all four formats.

Edge Cases the Exam Loves

Splittability: CSV compressed with gzip is NOT splittable (unless using a splittable codec like bzip2). Parquet and Avro are splittable.

Nested data: JSON supports nested structures; CSV does not.

Schema on read vs. schema on write: JSON is schema-on-read; Parquet and Avro are schema-on-write.

How to Eliminate Wrong Answers

If the question mentions "analytics" or "query performance," eliminate CSV and JSON; choose Parquet.

If the question mentions "streaming" or "high write throughput," eliminate Parquet; choose Avro.

If the question mentions "human-readable" or "web API," eliminate Parquet and Avro; choose JSON or CSV.

If the question mentions "schema evolution" or "changing structure," choose Avro.

Key Takeaways

Parquet is columnar and best for analytical workloads in Azure Synapse and Databricks.

Avro is row-oriented and best for streaming data with schema evolution (e.g., Event Hubs Capture).

JSON is human-readable and supports nested data; used in Azure Cosmos DB and web APIs.

CSV is simple but lacks schema and compression; avoid for large-scale analytics.

Parquet uses Snappy compression by default and supports predicate pushdown.

Avro files are self-describing with embedded JSON schema.

Event Hubs Capture writes data in Avro format by default.

Azure Synapse Serverless SQL can query CSV, JSON, Parquet, and Avro.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Parquet

Columnar storage: reads only needed columns

Optimized for read-heavy analytical queries

Higher compression ratio due to columnar encoding

Slower write performance due to columnar assembly

Schema stored in footer; limited schema evolution

Avro

Row-oriented storage: writes entire records quickly

Optimized for write-heavy streaming and ingestion

Good compression but less efficient than Parquet for analytics

Faster write performance; lower overhead for serialization

Schema in header; supports full schema evolution

Watch Out for These

Mistake

CSV is the best format for large datasets because it is simple.

Correct

CSV is inefficient for large datasets due to lack of compression, no indexing, and row-oriented storage. Parquet is far better for large-scale analytics.

Mistake

JSON is a binary format.

Correct

JSON is a text-based format. Binary formats include Avro and Parquet, which are more compact and faster to parse.

Mistake

Avro and Parquet are the same because both are binary and self-describing.

Correct

Avro is row-oriented (optimized for writes and streaming), while Parquet is column-oriented (optimized for reads and analytics). Their internal storage mechanisms differ significantly.

Mistake

Parquet files cannot be split for parallel processing.

Correct

Parquet files are splittable by row groups. Each row group can be processed independently, enabling parallelism.

Mistake

All compressed CSV files are splittable.

Correct

Only compressors that support splitting (e.g., bzip2, LZ4) allow splittable CSV. Gzip-compressed CSV is not splittable, which limits parallelism.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between row-oriented and column-oriented storage?

Row-oriented storage stores all columns of a row together, making writes fast but reads of a single column require scanning all rows. Column-oriented storage stores all values of a column together, allowing queries to read only the columns they need, improving read performance for analytics. Parquet is columnar; Avro and CSV are row-oriented.

Which data format should I use for Azure Synapse Serverless SQL?

For best performance, use Parquet. It allows column pruning and predicate pushdown, reducing I/O. Serverless SQL can also query CSV, JSON, and Avro, but Parquet is optimized for analytical queries.

Does Avro support schema evolution?

Yes, Avro supports schema evolution. You can add, remove, or modify fields as long as the reader's schema is compatible with the writer's schema. This is useful for streaming data where the structure may change over time.

Is JSON a good format for large-scale analytics?

No. JSON is text-based, verbose, and row-oriented. Parsing JSON is slower than binary formats, and it lacks compression and indexing. For analytics, convert JSON to Parquet.

What is the default compression for Parquet in Azure?

The default compression codec for Parquet is Snappy. It offers a good balance of compression ratio and speed. Other options include gzip, LZ4, and ZSTD.

Can I use CSV for streaming data?

CSV is not ideal for streaming because it is text-based and lacks schema. Avro is better for streaming due to its compact binary format and schema support.

What format does Azure Event Hubs Capture use by default?

Azure Event Hubs Capture writes data in Avro format by default. This is because Avro supports schema evolution and is efficient for high-throughput streaming.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Data Formats: JSON, CSV, Parquet, and Avro — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?