Question 454 of 507
Data Preparation for Machine Learning | Hard | Multiple Choice | Objective-mapped

MLA-C01 Data Preparation for Machine Learning Practice Question

This MLA-C01 practice question tests your understanding of data preparation for machine learning. Match each stated requirement to a specific cleaning step; many options are valid in isolation but not for this scenario. After answering, compare your reasoning against the explanation and wrong-answer breakdown below.

A financial services company is building a fraud detection model using transactional data stored in Amazon S3. The data includes transaction_id, timestamp, amount, merchant_category, and fraud_label (0/1). The data is collected from multiple sources and has inconsistencies: timestamps are in different timezones (UTC and EST), merchant categories are sometimes misspelled (e.g., 'RESTAURANT', 'Restaurant', 'restaurant'), and the fraud_label is missing for about 5% of records. The data science team uses AWS Glue for ETL. They need to prepare a clean dataset for training. The final dataset must have consistent timestamps in UTC, standardized merchant categories, and no missing fraud labels. The team also wants to minimize data loss. Which set of actions should the team take?

Clue words in this question

Noticing these words before you look at the options changes how you read each choice.

  • Clue: "minimum / minimize"

    Why it matters: Here the phrase is "minimize data loss", so the goal is to keep as many records as possible. Eliminate options that drop records when the missing values could be filled instead, even if those options would technically produce a clean dataset.

Correct answer & explanation

Use AWS Glue to convert timestamps to UTC, use a mapping table to group similar merchant categories (e.g., all restaurant variants to 'Restaurant'), and impute missing fraud_label using mode (most frequent value).

Option D is correct because it preserves data by imputing missing fraud labels using the mode (most frequent value), which is appropriate for a binary classification label where the majority class is likely 0. It also standardizes timestamps to UTC and uses a mapping table to group merchant category variants, ensuring consistency without data loss. Dropping records (as in A and C) would reduce the dataset size, and imputing with the mean (as in B) is invalid for a categorical label.
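
The timestamp step can be sketched in plain Python, outside of Glue. This is a minimal illustration, not the team's actual job: it assumes each record carries a hypothetical `source_tz` field saying whether its naive timestamp is in UTC or EST, and it treats EST as a fixed UTC-5 offset with no daylight-saving handling. Neither assumption comes from the question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is a fixed UTC-5 offset (no daylight-saving handling).
TZ_OFFSETS = {
    "UTC": timezone.utc,
    "EST": timezone(timedelta(hours=-5), "EST"),
}


def to_utc(timestamp: str, source_tz: str) -> str:
    """Interpret a naive ISO-8601 timestamp in source_tz and return it in UTC."""
    naive = datetime.fromisoformat(timestamp)
    aware = naive.replace(tzinfo=TZ_OFFSETS[source_tz])
    return aware.astimezone(timezone.utc).isoformat()


# Hypothetical sample records; source_tz is an assumed helper column.
records = [
    {"transaction_id": "t1", "timestamp": "2024-01-15T09:30:00", "source_tz": "EST"},
    {"transaction_id": "t2", "timestamp": "2024-01-15T14:30:00", "source_tz": "UTC"},
]

# Every record is kept; only the timestamp representation changes.
normalized = [
    {**r, "timestamp": to_utc(r["timestamp"], r["source_tz"]), "source_tz": "UTC"}
    for r in records
]
```

Both sample records end up with the same UTC instant, and no row is removed along the way, which is the property the scenario asks for.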

Key principle: Answer the scenario, not the keyword: identify the specific constraint before choosing the most familiar-sounding option.

Answer analysis

Option-by-option breakdown

For each option: why learners choose it and why it is or isn't the right answer here.

  • A. Use AWS Glue to convert all timestamps to UTC, apply a mapping function to correct merchant category misspellings to a standard list, and drop records with missing fraud_label.

    Why it's wrong here

    Dropping 5% of records may lose important fraud cases and introduce bias.

  • B. Use AWS Glue to convert timestamps to UTC, use a fuzzy matching algorithm to standardize merchant categories, and replace missing fraud_label with the mean value (0.05).

    Why it's wrong here

    Mean imputation on a binary variable produces non-integer values, which are invalid for classification.

  • C. Use AWS Glue to convert timestamps to UTC, correct merchant categories by mapping known misspellings to correct names, and drop records with missing fraud_label.

    Why it's wrong here

    Only mapping known misspellings may miss variants, and dropping missing labels causes data loss.

  • D. Use AWS Glue to convert timestamps to UTC, use a mapping table to group similar merchant categories (e.g., all restaurant variants to 'Restaurant'), and impute missing fraud_label using mode (most frequent value).

    Why this is correct

    Mode imputation preserves the majority class and avoids data loss, while timestamp conversion and category mapping clean the data correctly.

    Clue confirmation

    The clue word "minimum / minimize" in the question points toward this answer.

    Related concept

    Read the scenario before looking for a memorised answer.
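
The difference between the three label-handling strategies in the options can be checked on a toy example. The sketch below uses a made-up list of ten labels (an assumption for illustration, not data from the scenario), with `None` marking a missing fraud_label.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy labels: None marks a missing fraud_label.
labels = [0, 0, 0, 1, None, 0, 0, 1, 0, None]

observed = [y for y in labels if y is not None]

# Dropping rows with a missing label loses records.
dropped = observed

# Filling with the mean of the observed labels yields a value that is not a class.
mean_value = mean(observed)
mean_filled = [mean_value if y is None else y for y in labels]

# Filling with the mode uses the most frequent observed label.
mode_value = Counter(observed).most_common(1)[0][0]
mode_filled = [mode_value if y is None else y for y in labels]

print("rows kept:", len(dropped), len(mean_filled), len(mode_filled))
print("fill values:", mean_value, mode_value)
print("mode-filled labels are binary:", set(mode_filled) <= {0, 1})
print("mean-filled labels are binary:", set(mean_filled) <= {0, 1})
```

Only the mode-filled list keeps all ten rows while still containing nothing but 0 and 1, which is why it satisfies both the "no missing labels" and "minimize data loss" requirements.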

Common exam traps

Common exam trap: answer the scenario, not the keyword

The trap here is that candidates often choose to drop missing values (options A and C) to avoid imputation complexity, not realizing that minimizing data loss is explicitly stated as a requirement, and that mode imputation is a standard technique for categorical labels in ML pipelines.

Detailed technical explanation

How to think about this question

In AWS Glue, DynamicFrames allow for custom transforms like Map and ApplyMapping to standardize data. For the missing labels, imputing the mode (most frequent value) keeps every record and keeps fraud_label a valid 0/1 value. Because it assigns every imputed record to the majority class, the share of imputed labels is worth tracking in fraud detection datasets, where class imbalance is common. The mapping table approach for merchant categories is scalable because it can be stored in a separate file (e.g., in S3) and updated independently, whereas hardcoded mappings in a script require code changes for each new variant.
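
The externally stored mapping table can be sketched as follows. This is a plain-Python illustration under stated assumptions: the CSV layout (`variant`, `standard` columns), the sample rows, and the helper names `load_mapping` and `standardize` are all hypothetical, and the CSV text is embedded here rather than read from S3.

```python
import csv
import io

# Hypothetical mapping table; in practice this text might be read from a
# CSV object in S3 rather than embedded in the script.
MAPPING_CSV = """variant,standard
restaurant,Restaurant
resturant,Restaurant
grocery,Grocery
"""


def load_mapping(csv_text: str) -> dict[str, str]:
    """Build a lookup keyed on the lowercased, trimmed variant."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["variant"].strip().lower(): row["standard"] for row in reader}


def standardize(category: str, mapping: dict[str, str]) -> str:
    """Return the standard name, or the input unchanged if it is unmapped."""
    return mapping.get(category.strip().lower(), category)


mapping = load_mapping(MAPPING_CSV)
raw = ["RESTAURANT", "Restaurant", "restaurant", "Resturant", "Pharmacy"]
cleaned = [standardize(c, mapping) for c in raw]
print(cleaned)
```

Unmapped values pass through unchanged instead of being dropped, so a new variant only requires adding a row to the table, not editing the script.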

Key Concepts to Remember

  • Read the scenario before looking for a memorised answer.
  • Find the constraint that changes the correct option.
  • Eliminate answers that are true in general but not in this case.

Exam Day Tips

  • Watch for words such as best, first, most likely and least administrative effort.
  • Review why wrong options are wrong, not only why the correct option is correct.

What to study next

Got this wrong? Here's your next step.

Identify which exam domain this question belongs to, review the core concept, then practise similar questions from the same domain.

About these practice questions

Courseiva creates original exam-style practice questions with explanations and wrong-answer analysis. It does not publish real exam questions, exam dumps, or protected exam content.

Last reviewed: Jun 24, 2026

This MLA-C01 practice question is part of Courseiva's free Amazon Web Services certification practice question bank. Courseiva provides original exam-style practice questions with explanations, topic-based practice, mock exams, readiness tracking, and study analytics to help learners prepare for the MLA-C01 exam.