Data Preparation for Machine Learning · medium · Multiple Choice · Objective-mapped

Quick Answer

The answer is SMOTE (Synthetic Minority Over-sampling Technique). It generates synthetic examples for the minority class by interpolating between existing minority instances and their k-nearest neighbors, rather than simply duplicating data. This directly addresses the severe 95:5 class imbalance without losing data, which undersampling would do, and reduces the overfitting risk of naive random oversampling. On the AWS Certified Machine Learning Engineer Associate (MLA-C01) exam, this question tests your understanding of class imbalance handling in data preparation for binary classification, and it often appears as a scenario where you must choose between SMOTE, undersampling, and oversampling. A common trap is selecting random oversampling because it seems simpler, but SMOTE's synthetic samples tend to give a more general decision boundary for the minority class. Memory tip: SMOTE "smooths" the imbalance by creating new, plausible minority points, not just copying old ones.

MLA-C01 Data Preparation for Machine Learning Practice Question

This MLA-C01 practice question tests your understanding of data preparation for machine learning. Read the scenario carefully and evaluate each option against the stated constraints before committing to an answer. Then compare your reasoning against the explanation and wrong-answer breakdown below to see why each distractor is designed to mislead on exam day.

A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?

Answer choices

  • SMOTE (Synthetic Minority Over-sampling Technique)
  • Random undersampling of the majority class
  • Random oversampling of the minority class
  • Apply class weights during model training

Correct answer & explanation

SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE (Synthetic Minority Over-sampling Technique) is the best choice because it generates synthetic examples for the minority class by interpolating between existing minority instances and their k-nearest neighbors, rather than simply duplicating data. This addresses the severe 95:5 class imbalance without losing data (as undersampling would) and without the overfitting risk of naive random oversampling. The synthetic samples help the model learn a more general decision boundary for the positive class.
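To make the "without losing data" constraint concrete, here is a minimal arithmetic sketch. The 10,000-row dataset size and the helper name `rebalancing_counts` are assumptions for illustration; the question itself only gives the 95:5 ratio.

```python
def rebalancing_counts(n_rows: int, positive_fraction: float) -> dict:
    """Compare what oversampling adds with what undersampling discards."""
    n_pos = round(n_rows * positive_fraction)
    n_neg = n_rows - n_pos
    gap = n_neg - n_pos  # rows needed (or removed) to reach a 1:1 ratio
    return {
        "positives": n_pos,
        "negatives": n_neg,
        "synthetic_needed_for_1_to_1": gap,
        "rows_discarded_by_undersampling_to_1_to_1": gap,
        "rows_left_after_undersampling": 2 * n_pos,
    }


# Hypothetical dataset size; only the 95:5 ratio comes from the question.
counts = rebalancing_counts(10_000, 0.05)
for name, value in counts.items():
    print(f"{name}: {value}")
```

Under these assumed numbers, balancing by undersampling would keep only 1,000 of the 10,000 rows, while oversampling keeps every original row and adds 9,000 minority rows.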

Key principle: Answer the scenario, not the keyword: identify the specific constraint before choosing the most familiar-sounding option.

Answer analysis

Option-by-option breakdown

For each option: why learners choose it and why it is or isn't the right answer here.

  • SMOTE (Synthetic Minority Over-sampling Technique)

    Why this is correct

    Generates synthetic samples for the minority class.

    Related concept

    Read the scenario before looking for a memorised answer.

  • Random undersampling of the majority class

    Why it's wrong here

    Removes potentially valuable data.

  • Random oversampling of the minority class

    Why it's wrong here

    Duplicates existing samples, which raises the risk of overfitting.

  • Apply class weights during model training

    Why it's wrong here

    Adjusts the loss function during training; it is not a data preparation step.
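The point about random oversampling can be checked with a small sketch. The function name `random_oversample` and the three sample points are made up for illustration; the idea shown is only that duplication adds rows without adding any new feature values.

```python
import random


def random_oversample(minority, target_size, seed=0):
    """Grow the minority class to target_size by copying existing rows."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


# Hypothetical minority-class points (two features each).
minority = [(0.1, 1.2), (0.4, 0.9), (0.3, 1.5)]
oversampled = random_oversample(minority, target_size=12)

print("rows after oversampling:", len(oversampled))
print("distinct points:", len(set(oversampled)))
```

The row count grows, but the number of distinct points stays the same, which is why a model can memorise the repeated rows instead of learning a wider region for the minority class.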

Common exam traps

Common exam trap: answer the scenario, not the keyword

AWS often tests the distinction between data-level techniques (such as SMOTE, oversampling, and undersampling) and algorithm-level techniques (such as class weights). The trap here is treating class weighting as a data preparation method: it is a model training adjustment, not a data transformation step.
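A short sketch of the algorithm-level side of that distinction follows. It assumes the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for its "balanced" class weight option; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# 95:5 imbalance, as in the question (100 rows assumed for illustration).
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

print(weights)
print("rows in dataset:", len(labels))  # unchanged: weights do not alter the data
```

The weights change how much each error costs during training, but the dataset keeps the same rows and the same 95:5 ratio, which is why this option does not count as data preparation.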

Detailed technical explanation

How to think about this question

SMOTE works by selecting a minority class instance, finding its k nearest minority-class neighbors (typically k=5), and creating a synthetic sample at a random position along the line segment connecting the instance to a randomly chosen neighbor. This interpolation occurs in feature space, so the synthetic points are plausible but not exact copies, which helps the model generalize better. In real-world scenarios like fraud detection or rare disease diagnosis, SMOTE can be combined with a cleaning or undersampling step (e.g., SMOTEENN, which follows SMOTE with Edited Nearest Neighbours) to remove noisy examples, but pure SMOTE is preferred when data loss is unacceptable.
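The interpolation step described above can be sketched in plain Python. This is a simplified illustration, not a library implementation: it picks a random base point for each synthetic sample, uses Euclidean distance, and handles numeric features only. The name `smote_sketch` and the sample points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbors than exist
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class points (two features each).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.4, 1.0)]
new_points = smote_sketch(minority, n_synthetic=4)
for point in new_points:
    print(point)
```

Each synthetic point lies on a segment between two real minority points, so it stays inside the region the minority class already occupies while not being an exact copy of any row.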

Key Concepts to Remember

  • Read the scenario before looking for a memorised answer.
  • Find the constraint that changes the correct option.
  • Eliminate answers that are true in general but not in this case.

Exam Day Tips

  • Watch for words such as best, first, most likely and least administrative effort.
  • Review why wrong options are wrong, not only why the correct option is correct.

Real-world example

How this comes up in practice

In practice this comes up in problems such as fraud detection or rare disease diagnosis, where positive cases are a small fraction of the data. The correct answer here reflects the specific constraint in the scenario (address the imbalance without losing too much data), not a general recommendation. Exam questions reward reading the constraint carefully: the same technique can be right or wrong depending on the use case.

What to study next

Got this wrong? Here's your next step.

Identify which exam domain this question belongs to, review the core concept, then practise similar questions from the same domain.

FAQ

Questions learners often ask

What does this MLA-C01 question test?

This question tests Data Preparation for Machine Learning, specifically how to handle class imbalance at the data level. Read the scenario before looking for a memorised answer.

What is the correct answer to this question?

The correct answer is SMOTE (Synthetic Minority Over-sampling Technique). It generates synthetic examples for the minority class by interpolating between existing minority instances and their k-nearest neighbors, rather than simply duplicating data. This addresses the severe 95:5 class imbalance without losing data (as undersampling would) and without the overfitting risk of naive random oversampling. The synthetic samples help the model learn a more general decision boundary for the positive class.

What should I do if I get this MLA-C01 question wrong?

Identify which exam domain this question belongs to, review the core concept, then practise similar questions from the same domain.

What is the key concept behind this question?

Read the scenario before looking for a memorised answer.

About these practice questions

Courseiva creates original exam-style practice questions with explanations and wrong-answer analysis. It does not publish real exam questions, exam dumps, or protected exam content.

Same concept, more angles

5 more ways this is tested on MLA-C01

These questions test the same concept from different angles. Work through them to make sure you can recognise it however the exam phrases it.

Variation 1. A dataset for binary classification has a severe class imbalance (5% positive class). Which two data preparation techniques can help address this imbalance? (Choose two.)

medium
  • A. Remove outliers from the minority class
  • B. Apply PCA to reduce dimensionality
  • C. Use stratified splitting for train/test sets
  • D. Undersample the majority class
  • E. Oversample the minority class using SMOTE

Why D and E: Option D is correct because undersampling the majority class reduces the number of instances from the dominant class, helping to balance the dataset and prevent the model from being biased toward the majority class. This technique is straightforward and can be effective when the majority class has redundant or noisy samples, though it risks losing valuable information. Option E is the second correct choice because SMOTE balances the classes from the other direction, adding synthetic minority samples instead of discarding majority ones.
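The trade-off in Option D can be seen in a minimal sketch of random undersampling. The 950/50 split and the name `random_undersample` are assumptions chosen to match a 95:5 ratio on 1,000 rows.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class the same size as the minority."""
    rng = random.Random(seed)
    kept = rng.sample(majority, k=len(minority))
    discarded = len(majority) - len(kept)
    return kept, discarded


# Hypothetical row identifiers: 950 majority rows, 50 minority rows.
majority = list(range(950))
minority = list(range(950, 1000))
kept, discarded = random_undersample(majority, minority)

print("majority rows kept:", len(kept))
print("majority rows discarded:", discarded)
```

The classes end up balanced, but 900 of the 950 majority rows are thrown away, which is the information loss the explanation warns about.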

Variation 2. A data scientist is preparing a dataset for a binary classification model. The dataset has 10,000 records with 100 features. The target variable is imbalanced, with 95% negative class and 5% positive class. Which data preparation step should the data scientist take to address the imbalance before training?

easy
  • A. Normalize all features to a 0-1 range
  • B. Use cross-validation to handle imbalance
  • C. Remove enough instances of the negative class to achieve balance
  • D. Apply SMOTE to oversample the positive class

Why D: Option D is correct because SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the minority class (positive class, 5%) by interpolating between existing minority instances. This addresses the severe class imbalance (95:5) without discarding data, allowing the model to learn decision boundaries for the minority class more effectively than simple duplication.

Variation 3. A data scientist is preparing a dataset for training a binary classification model. The dataset has 100,000 rows and 50 features. The target variable is imbalanced, with only 5% positive cases. Which technique should the data scientist apply to address the class imbalance BEFORE training?

easy
  • A. Principal Component Analysis (PCA) dimensionality reduction
  • B. Random oversampling of the minority class
  • C. Standard scaling of numerical features
  • D. One-hot encoding of categorical variables

Why B: Random oversampling of the minority class (Option B) directly addresses the class imbalance by duplicating examples from the positive class until the class distribution is more balanced. This prevents the binary classification model from being biased toward the majority class, which is critical when only 5% of the 100,000 rows are positive cases. Oversampling is applied before training to ensure the model sees sufficient minority examples during learning.

Variation 4. A data scientist is preparing a dataset for binary classification using SageMaker. The dataset has 100 features and 10,000 rows, but the target variable is highly imbalanced (95% negative, 5% positive). Which technique should the data scientist apply during data preparation to address the imbalance?

easy
  • A. Oversampling the minority class by duplicating examples
  • B. Collect more data to match the number of samples in both classes
  • C. Random undersampling of the majority class
  • D. Apply SMOTE to generate synthetic samples for the minority class

Why D: SMOTE (Synthetic Minority Oversampling Technique) is the most appropriate technique because it generates synthetic samples for the minority class by interpolating between existing minority instances, which avoids the overfitting risk of simple duplication (random oversampling) and the information loss from undersampling. In SageMaker, SMOTE can be applied during data preparation using libraries like imbalanced-learn before training, or via SageMaker Data Wrangler's built-in transform, making it a robust choice for handling class imbalance without discarding data.

Variation 5. A machine learning engineer is preparing a dataset for a binary classification model. The dataset has a severe class imbalance (95% class A, 5% class B). The engineer wants to use Amazon SageMaker to train the model. Which data preparation technique should the engineer apply to the training dataset to address the imbalance and improve model performance?

hard
  • A. Apply data augmentation to the majority class by adding noise.
  • B. Apply Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples for the minority class.
  • C. Use a weighted loss function during training to penalize misclassifications of the minority class.
  • D. Apply random under-sampling to reduce the majority class to match the minority class size.

Why B: Option B is correct because SMOTE generates synthetic samples for the minority class by interpolating between existing minority instances, which directly addresses the severe class imbalance (95% class A, 5% class B) by creating a more balanced training dataset. This technique is particularly effective for tabular data in Amazon SageMaker, as it increases the representation of the minority class without simply duplicating existing samples, thereby reducing overfitting and improving the model's ability to learn decision boundaries for the minority class.
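The question asks for a technique applied to the training dataset. One common reading, assumed here rather than stated in the explanation, is that resampling happens after a stratified train/test split, so held-out data keeps the original class ratio. The sketch below shows only the split and the resulting counts; `stratified_split` and the 950/50 labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Hypothetical labels with the 95:5 ratio from the question.
labels = ["A"] * 950 + ["B"] * 50
train_idx, test_idx = stratified_split(labels)
train_labels = [labels[i] for i in train_idx]
test_labels = [labels[i] for i in test_idx]

# Synthetic class-B rows that oversampling would add to the training split only.
needed = train_labels.count("A") - train_labels.count("B")
print("train:", train_labels.count("A"), "A /", train_labels.count("B"), "B")
print("test: ", test_labels.count("A"), "A /", test_labels.count("B"), "B")
print("synthetic B rows needed in train for 1:1:", needed)
```

Keeping the test split untouched means evaluation metrics reflect the real 95:5 distribution rather than the artificially balanced one used for training.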

Last reviewed: Jun 30, 2026

This MLA-C01 practice question is part of Courseiva's free Amazon Web Services certification practice question bank. Courseiva provides original exam-style practice questions with explanations, topic-based practice, mock exams, readiness tracking, and study analytics to help learners prepare for the MLA-C01 exam.