Quick Answer

The answer is Amazon EMR with Apache Spark. EMR provides a managed Hadoop and Spark environment built for large-scale ETL, so it can work through 10 TB of compressed JSON log data, while Spark’s in-memory processing and DataFrame API parse nested JSON and join the result with reference tables. EMR provisions and configures the cluster for you, which reduces infrastructure management compared with running Spark yourself. On the AWS Certified Machine Learning Engineer Associate (MLA-C01) exam, this scenario tests your ability to match a data processing workload to the right managed service. A common trap is choosing AWS Glue for its serverless appeal: Glue is viable, but it may be slower and more expensive for 10 TB of complex transformations, whereas EMR with Spark is built for that scale. Remember the mnemonic: “Big JSON joins? EMR Spark avoids the noise.”

MLA-C01 Data Preparation for Machine Learning Practice Question

This MLA-C01 practice question tests your understanding of data preparation for machine learning. Read the scenario carefully and evaluate each option against the stated constraints before committing to an answer. After answering, compare your reasoning against the explanation and wrong-answer breakdown below to see why each distractor is designed to mislead on exam day.

A company has 10 TB of log data in compressed JSON format stored in Amazon S3. The data needs to be processed and transformed into a structured format for machine learning. The processing requires complex transformations, including parsing nested JSON and joining with a reference table. The company wants to minimize infrastructure management. Which approach should the company use?

Clue words in this question

Noticing these words before you look at the options changes how you read each choice.

Clue: "minimum / minimize"
Why it matters: Asks for the least operational overhead. Prefer managed services over options that make you provision and operate infrastructure yourself, even if those options would technically work.

A. Use SageMaker Processing jobs to run custom scripts.
Why wrong: Processing jobs have a limit of 5 TB per instance group and may be slower.

B. Use Amazon Athena to query and transform the data.
Why wrong: Athena is for querying, not heavy ETL; nested JSON parsing is complex.

C. Use Amazon EMR with Apache Spark.
Why correct: EMR is designed for large-scale data processing with Spark.

D. Use AWS Glue ETL with PySpark.
Why wrong: Glue is viable but may be slower and more expensive for 10 TB; EMR is better for large data.

Answer choices

Why each option matters

Correct answer & explanation

✓

Use Amazon EMR with Apache Spark.

Option C is correct because Amazon EMR with Apache Spark is designed for large-scale data processing (10 TB) and can handle complex transformations like parsing nested JSON and joining with reference tables efficiently. Spark's in-memory processing and support for structured data via DataFrames make it ideal for this workload, while EMR minimizes infrastructure management by providing a managed Hadoop/Spark cluster.
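To make "transformed into a structured format" concrete, here is a minimal plain-Python sketch of the per-record work involved: decompress gzip-encoded JSON lines and flatten nested objects into flat columns. The sample records, field names, and the helper names `flatten` and `read_compressed_json_lines` are hypothetical and chosen only for illustration. At 10 TB this logic would run distributed in Spark rather than in a single process.

```python
import gzip
import io
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def read_compressed_json_lines(raw_bytes):
    """Yield one flattened row per non-empty line of gzip-compressed JSON."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield flatten(json.loads(line))


# Hypothetical log records standing in for the S3 objects in the scenario.
sample_records = [
    {"ts": "2024-01-01T00:00:00Z", "user": {"id": 7, "geo": {"country": "DE"}}},
    {"ts": "2024-01-01T00:00:05Z", "user": {"id": 9, "geo": {"country": "US"}}},
]
payload = gzip.compress(
    "\n".join(json.dumps(r) for r in sample_records).encode("utf-8")
)

rows = list(read_compressed_json_lines(payload))
for row in rows:
    print(row)
```

Each output row has only scalar columns such as `user.geo.country`, which is the shape a tabular training dataset needs.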

Key principle: Answer the scenario, not the keyword: identify the specific constraint before choosing the most familiar-sounding option.

Answer analysis

Option-by-option breakdown

For each option: why learners choose it and why it is or isn't the right answer here.

✗
Use SageMaker Processing jobs to run custom scripts.
Why it's wrong here
Processing jobs have a limit of 5 TB per instance group and may be slower.
✗
Use Amazon Athena to query and transform the data.
Why it's wrong here
Athena is for querying, not heavy ETL; nested JSON parsing is complex.
✓
Use Amazon EMR with Apache Spark.
Why this is correct
EMR is designed for large-scale data processing with Spark.
Clue confirmation
The clue word "minimum / minimize" in the question points toward this answer.
Related concept
Read the scenario before looking for a memorised answer.
✗
Use AWS Glue ETL with PySpark.
Why it's wrong here
Glue is viable but may be slower and more expensive for 10 TB; EMR is better for large data.

Common exam traps

Common exam trap: answer the scenario, not the keyword

The trap here is that candidates treat AWS Glue ETL with PySpark (Option D) as the default managed ETL service. For large-scale, complex transformations, EMR offers better performance and cost control, while Glue is better suited to smaller, simpler workloads or cases where serverless operation is the priority.

Detailed technical explanation

How to think about this question

Under the hood, Apache Spark on EMR uses Resilient Distributed Datasets (RDDs) and DataFrames to distribute data across nodes, enabling parallel processing of nested JSON via the `from_json()` and `explode()` functions. The reference table join can use a broadcast join if the table is small, or a sort-merge join for larger tables, with Spark's Catalyst optimizer planning the query. In real-world scenarios, EMR's ability to use Spot Instances for cost savings and its integration with S3 via the EMRFS connector make it a robust choice for petabyte-scale ETL pipelines.
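The two join strategies mentioned above can be illustrated with a small plain-Python sketch. This is not Spark code: it only mimics, on in-memory lists, what `explode()` does to an array column and how a broadcast (hash) join and a sort-merge join reach the same inner-join result. The field names (`request_id`, `error_codes`, `severity`) and the sample data are hypothetical, and the sort-merge version assumes the reference table has unique keys.

```python
def explode(rows, array_field, as_field):
    """Emit one row per array element; rows with empty arrays produce nothing."""
    for row in rows:
        for element in row.get(array_field, []):
            out = {k: v for k, v in row.items() if k != array_field}
            out[as_field] = element
            yield out


def broadcast_hash_join(rows, reference, key):
    """Inner join: hash the small reference table once, then probe per row."""
    lookup = {ref[key]: ref for ref in reference}
    for row in rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}


def sort_merge_join(left, right, key):
    """Inner join: sort both sides, then advance a cursor through `right`.

    Assumes keys in `right` are unique (illustrative simplification).
    """
    left_sorted = sorted(left, key=lambda r: r[key])
    right_sorted = sorted(right, key=lambda r: r[key])
    joined, j = [], 0
    for row in left_sorted:
        while j < len(right_sorted) and right_sorted[j][key] < row[key]:
            j += 1
        if j < len(right_sorted) and right_sorted[j][key] == row[key]:
            joined.append({**row, **right_sorted[j]})
    return joined


# Hypothetical parsed log events and reference table.
events = [
    {"request_id": "a", "error_codes": ["E1", "E2"]},
    {"request_id": "b", "error_codes": ["E3"]},
    {"request_id": "c", "error_codes": []},
]
reference = [
    {"error_code": "E1", "severity": "low"},
    {"error_code": "E3", "severity": "high"},
]

exploded = list(explode(events, "error_codes", "error_code"))
hash_joined = list(broadcast_hash_join(exploded, reference, "error_code"))
merge_joined = sort_merge_join(exploded, reference, "error_code")
print(hash_joined)
print(hash_joined == merge_joined)
```

The hash join avoids sorting the large side, which is why it is attractive when the reference table is small enough to copy to every worker. The sort-merge approach does not need either side to fit in memory on one node, which is why it suits larger tables.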

Key Concepts to Remember

Read the scenario before looking for a memorised answer.
Find the constraint that changes the correct option.
Eliminate answers that are true in general but not in this case.

Exam Day Tips

Watch for words such as best, first, most likely and least administrative effort.
Review why wrong options are wrong, not only why the correct option is correct.

Real-world example

How this comes up in practice

A company collects terabytes of compressed JSON logs in Amazon S3 and must flatten the nested fields and join them with a reference table before the data can be used for model training. Running the job as Spark on a managed EMR cluster lets the team process the full dataset in parallel without building and maintaining a Hadoop environment themselves. Questions like this test whether you can match the size and complexity of a transformation workload to the right data processing service.

What to study next

Got this wrong? Here's your next step.

Identify which exam domain this question belongs to, review the core concept, then practise similar questions from the same domain.

FAQ

Questions learners often ask

What does this MLA-C01 question test?

This question tests Data Preparation for Machine Learning, specifically matching a large-scale data transformation workload to the right processing service.

What is the correct answer to this question?

The correct answer is C: Use Amazon EMR with Apache Spark. EMR with Spark is designed for large-scale data processing such as this 10 TB workload, handles nested JSON parsing and reference-table joins efficiently, and reduces infrastructure management by providing a managed Hadoop/Spark cluster.

What should I do if I get this MLA-C01 question wrong?

Identify which exam domain this question belongs to, review the core concept, then practise similar questions from the same domain.

Are there clue words in this question I should notice?

Yes. Watch for: "minimum / minimize". It asks for the least operational overhead, so prefer managed services over options that make you provision and operate infrastructure yourself, even if those options would technically work.

What is the key concept behind this question?

Read the scenario before looking for a memorised answer.

About these practice questions

Courseiva creates original exam-style practice questions with explanations and wrong-answer analysis. It does not publish real exam questions, exam dumps, or protected exam content.
