Question 296 of 846
Develop data processing · hard · Multiple Choice · Objective-mapped

Quick Answer

The answer is to combine the CSV files into fewer, larger files before loading. PolyBase load performance in Synapse depends on minimizing the overhead of external file enumeration and parallel split logic. With many tiny files, PolyBase must open and close each one and spreads the work across distributions inefficiently; with files of at least 256 MB, each distribution can process a whole file, which greatly reduces I/O overhead. On the Microsoft Azure Data Engineer Associate DP-203 exam, this concept tests your understanding of PolyBase’s internal parallelism and file-splitting behavior. It often appears as a trap in which candidates choose to increase compute resources or change the file format instead of consolidating files. A common memory tip is “bigger files, fewer splits”: think of PolyBase as a librarian who wastes time opening many small books instead of handling a few thick volumes, and aim for files of around 256 MB or larger.

DP-203 Develop data processing Practice Question

This DP-203 practice question tests your understanding of the Develop data processing domain. Match the stated requirement to the specific cloud service, access model, or configuration option; many options are valid in isolation but not for this scenario. After answering, compare your reasoning against the explanation and wrong-answer breakdown below to see why each distractor is designed to mislead on exam day.

You are optimizing a data pipeline in Azure Synapse Analytics that loads data from many small CSV files in ADLS Gen2 into a dedicated SQL pool using PolyBase. The load is slow and you need to improve performance. Which action would be MOST effective?

Correct answer & explanation

Combine the CSV files into fewer, larger files before loading.

Combining many small CSV files into fewer, larger files reduces the number of file open/close operations and minimizes the overhead of PolyBase's external file enumeration and parallel split logic. PolyBase performs best when each file is at least 256 MB, as it can then assign a full file to each distribution, avoiding the overhead of splitting tiny files across multiple threads.
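
The consolidation idea can be sketched as a simple grouping step. The following is a minimal illustration, not a description of how PolyBase or any Azure tool works internally: the function name `plan_batches`, the file names, and the 1 MiB sizes are all hypothetical, and the 256 MB target is taken from the explanation above. It only plans which small files would be merged together; the actual merge would normally be done by whatever tool produces or copies the files.

```python
from typing import List, Tuple

# Target size per consolidated file, taken from the explanation above.
TARGET_BYTES = 256 * 1024 * 1024


def plan_batches(
    files: List[Tuple[str, int]], target_bytes: int = TARGET_BYTES
) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs into batches of roughly target_bytes."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: 1,000 small files of 1 MiB each.
small_files = [(f"part-{i:04d}.csv", 1024 * 1024) for i in range(1000)]
batches = plan_batches(small_files)
print(len(batches), [len(b) for b in batches])
```

With this hypothetical input, 1,000 files collapse into 4 planned output files, so the loader would have 4 objects to enumerate and open instead of 1,000.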

Key principle: Answer the scenario, not the keyword: identify the specific constraint before choosing the most familiar-sounding option.

Answer analysis

Option-by-option breakdown

For each option: why learners choose it and why it is or isn't the right answer here.

  • Increase the service level (DWU) of the dedicated SQL pool.

    Why it's wrong here

    Scaling up helps but is a costlier solution; the root cause is likely file granularity.

  • Change the file format from CSV to Avro.

    Why it's wrong here

    Format change alone does not address the small file problem; Avro may improve compression but not the overhead.

  • Combine the CSV files into fewer, larger files before loading.

    Why this is correct

    PolyBase performs more efficiently with fewer, larger files due to reduced metadata operations.

    Related concept

    Read the scenario before looking for a memorised answer.

  • Use Azure Data Factory to stage the data in Azure Blob Storage before loading.

    Why it's wrong here

    Staging adds extra steps and does not address the small file overhead.

Common exam traps

Common exam trap: answer the scenario, not the keyword

The trap here is that candidates assume scaling up (DWU) or changing file formats always improves performance, but the exam specifically tests the understanding that PolyBase's parallel processing is most efficient when file sizes align with distribution boundaries, making file consolidation the most effective optimization.

Detailed technical explanation

How to think about this question

PolyBase uses a 'splitter' component that divides each external file into row groups for parallel ingestion. When files are too small (e.g., <64 MB), the splitter creates many tiny row groups, which increases scheduling overhead and reduces the efficiency of the round-robin distribution across compute nodes. In real-world scenarios, consolidating tens of thousands of 1 MB log files into a few hundred 256 MB files can reduce load times by 10x or more.
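
One practical detail when consolidating CSV files is that each small file usually carries its own header row, which must appear only once in the merged output. The sketch below shows that step on in-memory strings using only the Python standard library. The function name `merge_csv_texts` and the sample rows are hypothetical, and it assumes all inputs share an identical header; a real pipeline would read from and write to ADLS Gen2 with a data-movement or Spark-style tool rather than local strings.

```python
import csv
import io
from typing import Iterable


def merge_csv_texts(csv_texts: Iterable[str]) -> str:
    """Concatenate CSV documents that share a header, keeping the header once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in csv_texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty inputs
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            # Assumption: inputs with a different header should not be merged.
            raise ValueError("CSV header mismatch; refusing to merge")
        writer.writerows(rows[1:])
    return out.getvalue()


# Hypothetical sample: three tiny log files with the same header.
parts = ["id,event\n1,start\n", "id,event\n2,stop\n", "id,event\n3,start\n"]
merged = merge_csv_texts(parts)
print(merged)
```

Raising on a header mismatch is a deliberate choice in this sketch: silently merging files with different columns would produce a file that loads with shifted or rejected rows.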

Key Concepts to Remember

  • Read the scenario before looking for a memorised answer.
  • Find the constraint that changes the correct option.
  • Eliminate answers that are true in general but not in this case.

Exam Day Tips

  • Watch for words such as best, first, most likely and least administrative effort.
  • Review why wrong options are wrong, not only why the correct option is correct.

Key takeaway

Answer the scenario, not the keyword: identify the specific constraint before choosing the most familiar-sounding option.

Real-world example

How this comes up in practice

A pipeline lands a large number of small CSV log files in ADLS Gen2 and loads them into a dedicated SQL pool with PolyBase. Consolidating those files into fewer, larger files before the load cuts the per-file overhead without changing the compute tier or the file format. Questions like this test whether you understand how file count and file size affect PolyBase load throughput.

What to study next

Got this wrong? Here's your next step.

Identify which exam domain this question belongs to, review the core concept, then practise similar questions from the same domain.

FAQ

Questions learners often ask

What does this DP-203 question test?

This question tests the Develop data processing domain. The key habit it checks is reading the scenario before looking for a memorised answer.

What is the correct answer to this question?

The correct answer is: Combine the CSV files into fewer, larger files before loading. — Combining many small CSV files into fewer, larger files reduces the number of file open/close operations and minimizes the overhead of PolyBase's external file enumeration and parallel split logic. PolyBase performs best when each file is at least 256 MB, as it can then assign a full file to each distribution, avoiding the overhead of splitting tiny files across multiple threads.

What should I do if I get this DP-203 question wrong?

Identify which exam domain this question belongs to, review the core concept, then practise similar questions from the same domain.

What is the key concept behind this question?

Read the scenario before looking for a memorised answer.

About these practice questions

Courseiva creates original exam-style practice questions with explanations and wrong-answer analysis. It does not publish real exam questions, exam dumps, or protected exam content.

Same concept, more angles

1 more way this is tested on DP-203

These questions test the same concept from different angles. Work through them to make sure you can recognise it however the exam phrases it.

Variation 1. You are implementing a data pipeline in Azure Synapse Analytics that uses PolyBase to load data from Azure Data Lake Storage Gen2 into a dedicated SQL pool. The pipeline runs nightly and processes approximately 500 GB of data. You notice that the load operation is slow and frequently times out. What should you do to improve performance?

medium
  • A. Create a columnstore index on the target table.
  • B. Use a clustered index on the target table.
  • C. Increase the resource class for the loading user.
  • D. Change the distribution of the target table to round-robin.

Why C: Option C is correct because increasing the resource class in a dedicated SQL pool provides more resources (CPU, memory) to the PolyBase load operation, which directly addresses the timeout issue. Option A is wrong because columnstore indexes are relevant for query performance, not load speed. Option B is wrong because a clustered index on the target table does not improve PolyBase load performance. Option D is wrong because round-robin distribution can improve load speed, but the primary issue here is resource allocation.

Last reviewed: Jun 24, 2026

This DP-203 practice question is part of Courseiva's free Microsoft certification practice question bank. Courseiva provides original exam-style practice questions with explanations, topic-based practice, mock exams, readiness tracking, and study analytics to help learners prepare for the DP-203 exam.