Sample questions
AWS Certified Machine Learning Engineer Associate MLA-C01 practice questions
A company is running a SageMaker endpoint serving multiple models. They need to monitor for data drift and model quality. Which THREE actions are necessary? (Choose three.)
Trap 1: Deploy a shadow endpoint for comparison
A shadow variant mirrors production traffic to evaluate a candidate model before promotion; it does not monitor for data drift.
Trap 2: Use SageMaker Debugger for monitoring
Debugger is for training debugging, not production monitoring.
- A
Deploy a shadow endpoint for comparison
Why wrong: A shadow variant mirrors production traffic to evaluate a candidate model before promotion; it does not monitor for data drift.
- B
Enable data capture on the endpoint
Data capture logs inference requests for monitoring.
- C
Use SageMaker Debugger for monitoring
Why wrong: Debugger is for training debugging, not production monitoring.
- D
Create a SageMaker Model Monitor schedule
Schedule defines how often to run monitoring jobs.
- E
Configure baseline constraints from training data
Baseline constraints define expected statistical properties for drift detection.
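The idea behind baseline constraints can be sketched in a few lines. This is only an illustration of comparing captured data against a training baseline; it is not the algorithm SageMaker Model Monitor uses, and the function names and the two-standard-deviation threshold are assumptions made for the example.

```python
import statistics

def baseline_constraints(training_values):
    """Summarize one training feature as a simple baseline (mean and std)."""
    return {
        "mean": statistics.mean(training_values),
        "std": statistics.stdev(training_values),
    }

def has_drifted(baseline, captured_values, max_shift_in_std=2.0):
    """Flag drift when the captured mean moves too far from the baseline mean.

    The threshold is an arbitrary illustrative choice, not a Model Monitor default.
    """
    shift = abs(statistics.mean(captured_values) - baseline["mean"])
    return shift > max_shift_in_std * baseline["std"]

baseline = baseline_constraints([10, 12, 11, 9, 10, 11])
similar = has_drifted(baseline, [10, 11, 12, 10])   # captured data close to training
shifted = has_drifted(baseline, [25, 27, 26, 24])   # captured data far from training
print(similar, shifted)
```

The three correct options map onto this sketch: data capture supplies `captured_values`, the baseline supplies the reference statistics, and the schedule decides how often the comparison runs.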
A data scientist trained a logistic regression model on a dataset with 100 features. After training, the training accuracy is 0.99 but validation accuracy is 0.75. Which action is MOST likely to reduce overfitting?
Trap 1: Increase the number of features
Adding more features increases model complexity and can worsen overfitting.
Trap 2: Use a more complex model like XGBoost
More complex models are more prone to overfitting, not less.
Trap 3: Use stratified cross-validation
Stratified CV helps evaluate generalization but does not directly reduce overfitting.
- A
Increase the number of features
Why wrong: Adding more features increases model complexity and can worsen overfitting.
- B
Increase the regularization strength
Stronger regularization (e.g., higher L2 penalty) shrinks coefficients and reduces overfitting.
- C
Use a more complex model like XGBoost
Why wrong: More complex models are more prone to overfitting, not less.
- D
Use stratified cross-validation
Why wrong: Stratified CV helps evaluate generalization but does not directly reduce overfitting.
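The shrinking effect of an L2 penalty can be seen in a minimal sketch. It fits a one-feature logistic regression by plain gradient descent; the toy data, learning rate, step count, and the `fit_logistic` helper are all assumptions chosen for illustration.

```python
import math

def fit_logistic(xs, ys, l2=0.0, lr=0.1, steps=2000):
    """Fit a 1-feature logistic regression by gradient descent with an L2 penalty on w."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x
            grad_b += (p - y)
        # The l2 * w term pulls the weight back toward zero on every step.
        w -= lr * (grad_w / n + l2 * w)
        b -= lr * (grad_b / n)
    return w, b

# Perfectly separable toy data: without a penalty the weight keeps growing.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w_weak, _ = fit_logistic(xs, ys, l2=0.0)
w_strong, _ = fit_logistic(xs, ys, l2=1.0)
print(w_weak, w_strong)
```

The weight learned with `l2=1.0` is noticeably smaller than the unpenalized one, which is the mechanism option B relies on.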
A team is training a deep learning model on Amazon SageMaker using a custom Docker container. Which three practices should they follow to optimize training performance? (Choose three.)
Trap 1: Use the largest instance type available
Largest instance types increase cost and may not yield proportional performance gains; optimization should consider cost-efficiency.
Trap 2: Increase the number of layers in the model to improve accuracy
Deeper models may improve accuracy but also increase training time and risk overfitting; this is not a general performance optimization practice.
- A
Store training data in Amazon S3 in a shuffled and compressed format
Shuffling prevents bias and compression reduces transfer time, improving training performance.
- B
Use the largest instance type available
Why wrong: Largest instance types increase cost and may not yield proportional performance gains; optimization should consider cost-efficiency.
- C
Increase the number of layers in the model to improve accuracy
Why wrong: Deeper models may improve accuracy but also increase training time and risk overfitting; this is not a general performance optimization practice.
- D
Use SageMaker Managed Spot Training with checkpointing
Spot instances are cheaper, and checkpointing allows resuming after interruptions, providing both cost savings and reliability.
- E
Use Pipe mode to stream data instead of File mode
Pipe mode streams data directly from S3 to the training container, reducing disk usage and I/O wait.
A company is using SageMaker to train a neural network for image classification. The training job is taking too long. The team wants to reduce training time without sacrificing model accuracy. Which approach should they recommend?
Trap 1: Increase the batch size to the maximum possible
Very large batch sizes can degrade model accuracy and may not fit in memory.
Trap 2: Use a learning rate scheduler that reduces the learning rate over…
Schedulers help convergence but do not directly reduce training time.
Trap 3: Add more convolutional layers to the model
Adding layers increases computation, slowing training.
- A
Increase the batch size to the maximum possible
Why wrong: Very large batch sizes can degrade model accuracy and may not fit in memory.
- B
Use a GPU-based instance such as ml.p3.2xlarge
GPUs accelerate matrix operations in neural networks, reducing training time.
- C
Use a learning rate scheduler that reduces the learning rate over time
Why wrong: Schedulers help convergence but do not directly reduce training time.
- D
Add more convolutional layers to the model
Why wrong: Adding layers increases computation, slowing training.
A team is developing a model to predict customer churn. The dataset has 10,000 samples with 20 features. The target variable is binary with 15% churn rate. The team wants to use logistic regression. Which data preprocessing step is MOST important to ensure proper convergence?
Trap 1: Remove correlated features to reduce multicollinearity
Multicollinearity affects interpretability but not necessarily convergence of logistic regression.
Trap 2: Impute missing values with the median
Missing value imputation is important but not the most critical for convergence.
Trap 3: Apply SMOTE to balance the classes
SMOTE addresses class imbalance but does not affect convergence of logistic regression.
- A
Remove correlated features to reduce multicollinearity
Why wrong: Multicollinearity affects interpretability but not necessarily convergence of logistic regression.
- B
Impute missing values with the median
Why wrong: Missing value imputation is important but not the most critical for convergence.
- C
Apply SMOTE to balance the classes
Why wrong: SMOTE addresses class imbalance but does not affect convergence of logistic regression.
- D
Standardize the features to have zero mean and unit variance
Standardization ensures gradient descent converges faster and avoids dominance by large-scale features.
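As a minimal sketch of standardization, the following uses only the Python standard library. The sample ages and incomes are made up for illustration.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages = [25, 40, 55, 70]
incomes = [20_000, 60_000, 100_000, 140_000]

z_ages = standardize(ages)
z_incomes = standardize(incomes)
print(z_ages)
print(z_incomes)
```

After scaling, both features span the same range even though the raw incomes are thousands of times larger than the ages, so neither dominates the gradient.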
A data engineer is processing a large dataset in Amazon S3 with AWS Glue ETL. The dataset contains timestamps in multiple time zones. The engineer needs to create a feature for hour-of-day consistent across all records. Which approach ensures correctness?
Trap 1: Convert all timestamps to UTC in the ETL script using Spark's…
from_utc_timestamp interprets its input as UTC and converts it to the given time zone, which is the opposite direction; to_utc_timestamp is the Spark function that converts to UTC.
Trap 2: Use AWS Glue's built-in transform to parse timestamps with timezone…
While Glue can parse timestamps, it does not automatically normalize to a common timezone for consistent hour extraction.
Trap 3: Use Python's datetime.strptime with tzlocal
tzlocal uses the system time zone, which is not reliable for multiple time zones.
- A
Convert all timestamps to UTC in the ETL script using Spark's from_utc_timestamp
Why wrong: from_utc_timestamp interprets its input as UTC and converts it to the given time zone, which is the opposite direction; to_utc_timestamp is the Spark function that converts to UTC.
- B
Use AWS Glue's built-in transform to parse timestamps with timezone offsets
Why wrong: While Glue can parse timestamps, it does not automatically normalize to a common timezone for consistent hour extraction.
- C
Use Python's datetime.strptime with tzlocal
Why wrong: tzlocal uses the system time zone, which is not reliable for multiple time zones.
- D
Convert all timestamps to UTC during the ETL process, then extract hour
Normalizing to UTC before extracting hour guarantees consistency across time zones.
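The normalize-then-extract order can be sketched with the standard library alone. This assumes each timestamp is an ISO 8601 string that carries its own UTC offset; the helper name is hypothetical, and in a Glue job the same step would be done with Spark functions rather than Python's datetime.

```python
from datetime import datetime, timezone

def hour_of_day_utc(iso_timestamp):
    """Parse an ISO 8601 timestamp with an offset and return its hour in UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Without an offset there is no way to normalize the value reliably.
        raise ValueError("timestamp must carry a UTC offset")
    return parsed.astimezone(timezone.utc).hour

new_york_hour = hour_of_day_utc("2024-03-01T09:30:00-05:00")
tokyo_hour = hour_of_day_utc("2024-03-01T23:30:00+09:00")
print(new_york_hour, tokyo_hour)  # both are 14:30 UTC, so both print 14
```

Extracting the hour before converting would have produced 9 and 23 for the same instant.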
A dataset contains a numerical feature with extreme outliers. The outliers are genuine (not errors), and the ML model is a linear regression which is sensitive to outliers. Which data transformation should be applied to reduce the impact of outliers while preserving the data?
Trap 1: Min-max scaling
Min-max scaling is affected by min and max values, which can be outliers.
Trap 2: Log transformation
Log transformation reduces skew but does not eliminate outlier influence on scaling.
Trap 3: Standardization (z-score)
Standardization uses mean and standard deviation, both influenced by outliers.
- A
Min-max scaling
Why wrong: Min-max scaling is affected by min and max values, which can be outliers.
- B
Log transformation
Why wrong: Log transformation reduces skew but does not eliminate outlier influence on scaling.
- C
Robust scaling (median and IQR)
Robust scaling uses median and interquartile range, not affected by extreme values.
- D
Standardization (z-score)
Why wrong: Standardization uses mean and standard deviation, both influenced by outliers.
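A minimal sketch of robust scaling, using the standard library's quantile function. The sample values are invented, and the "inclusive" quantile method is one of several reasonable conventions.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

values = [10, 12, 14, 16, 18, 1000]  # 1000 is a genuine extreme value
scaled = robust_scale(values)
print(scaled)
```

The five typical values land between -1 and 0.6, because the median and IQR ignore the extreme point. The outlier is kept rather than dropped, and it simply stays far from the rest.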
A data scientist is preparing a dataset for a binary classification model to predict customer churn. The dataset contains a timestamp column 'signup_date' that is not relevant for the prediction. What is the most appropriate action to handle this column?
Trap 1: Apply one-hot encoding to the year, month, and day components.
This unnecessarily increases dimensionality and does not help churn prediction directly.
Trap 2: Convert the timestamp to a numeric feature (e.g., days since…
Converting to numeric may still introduce irrelevant information and overfitting.
Trap 3: Use leave-one-out encoding based on the target variable.
Leave-one-out encoding is for categorical features with many levels, not for timestamps.
- A
Apply one-hot encoding to the year, month, and day components.
Why wrong: This unnecessarily increases dimensionality and does not help churn prediction directly.
- B
Convert the timestamp to a numeric feature (e.g., days since signup) and include it.
Why wrong: Converting to numeric may still introduce irrelevant information and overfitting.
- C
Use leave-one-out encoding based on the target variable.
Why wrong: Leave-one-out encoding is for categorical features with many levels, not for timestamps.
- D
Drop the 'signup_date' column from the dataset.
Irrelevant columns should be removed to prevent noise.
An ML team wants to deploy a model that was trained using XGBoost in SageMaker. They want to use the built-in XGBoost algorithm container for inference. Which inference option requires the least custom code?
Trap 1: Create a custom Docker container with XGBoost and deploy to an…
Requires writing and maintaining custom code.
Trap 2: Attach Elastic Inference to a generic container
Elastic Inference is an acceleration option, not a deployment method.
Trap 3: Use SageMaker Python SDK to download the model and run local…
Not a deployment option; local inference is for development.
- A
Create a custom Docker container with XGBoost and deploy to an endpoint
Why wrong: Requires writing and maintaining custom code.
- B
Deploy to a real-time endpoint using the built-in XGBoost container
The built-in container handles inference automatically.
- C
Attach Elastic Inference to a generic container
Why wrong: Elastic Inference is an acceleration option, not a deployment method.
- D
Use SageMaker Python SDK to download the model and run local inference
Why wrong: Not a deployment option; local inference is for development.
An ML engineer runs the CLI command shown in the exhibit. However, the training job fails immediately with an error: 'Unable to assume role'. What is the most likely cause?
Exhibit
Refer to the exhibit.
aws sagemaker create-training-job \
--training-job-name my-training-job \
--algorithm-specification 'TrainingImage=123456789012.dkr.ecr.us-west-2.amazonaws.com/my-custom-training:latest,TrainingInputMode=File' \
--role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole \
--input-data-config '[{"ChannelName":"train","DataSource":{"S3DataSource":{"S3Uri":"s3://my-bucket/train/","S3DataType":"S3Prefix"}},"ContentType":"text/csv"}]' \
--output-data-config '{"S3OutputPath":"s3://my-bucket/output/"}' \
--resource-config '{"InstanceType":"ml.m5.large","InstanceCount":1,"VolumeSizeInGB":30}' \
--vpc-config '{"SecurityGroupIds":["sg-12345678"],"Subnets":["subnet-12345678"]}'
Trap 1: The IAM role 'SageMakerExecutionRole' does not have permission to…
The error concerns assuming the role, not the permissions the role grants once assumed.
Trap 2: The training image in ECR does not exist.
Image existence is checked later; the immediate error is role assumption.
Trap 3: The S3 bucket 'my-bucket' does not exist.
If the bucket didn't exist, the error would be later, not an immediate role failure.
- A
The IAM role 'SageMakerExecutionRole' does not have permission to create the training job.
Why wrong: The error concerns assuming the role, not the permissions the role grants once assumed.
- B
The training image in ECR does not exist.
Why wrong: Image existence is checked later; the immediate error is role assumption.
- C
The S3 bucket 'my-bucket' does not exist.
Why wrong: If the bucket didn't exist, the error would be later, not an immediate role failure.
- D
The IAM role's trust policy does not grant SageMaker permission to assume the role.
Without proper trust policy, SageMaker cannot assume the role, causing immediate failure.
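For reference, a trust policy that lets SageMaker assume a role names the service principal sagemaker.amazonaws.com with the sts:AssumeRole action. The sketch below writes that policy as a Python dict and adds a small checker; the checker is a hypothetical helper for illustration and covers only the simple statement shapes shown here, not the full IAM policy grammar.

```python
import json

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

def allows_sagemaker_to_assume(policy):
    """Return True if any Allow statement lets SageMaker call sts:AssumeRole.

    Simplified: ignores conditions, wildcards, and NotPrincipal.
    """
    for statement in policy.get("Statement", []):
        principal = statement.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if (statement.get("Effect") == "Allow"
                and "sagemaker.amazonaws.com" in services
                and "sts:AssumeRole" in actions):
            return True
    return False

# A role that trusts only EC2 would trigger the 'Unable to assume role' failure.
ec2_only_policy = json.loads(json.dumps(trust_policy))
ec2_only_policy["Statement"][0]["Principal"]["Service"] = "ec2.amazonaws.com"

print(allows_sagemaker_to_assume(trust_policy))
print(allows_sagemaker_to_assume(ec2_only_policy))
```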
Refer to the exhibit. A data scientist creates a SageMaker Pipeline definition using the JSON shown. The pipeline runs successfully, but the scientist notices that the training step did not use the parameter 'TrainingInstanceCount' defined in Parameters. Why did this happen?
Exhibit
{
"PipelineExperimentConfig": {
"ExperimentName": "my-experiment",
"TrialName": "my-trial"
},
"Parameters": {
"TrainingInstanceType": "ml.m5.large",
"TrainingInstanceCount": 2,
"MaxRuntimeInSeconds": 86400
},
"Steps": [
{
"Name": "Preprocess",
"Type": "Processing",
"ProcessingJobName": "preprocess-job",
"ProcessingResources": {
"ClusterConfig": {
"InstanceCount": 1,
"InstanceType": "ml.m5.large",
"VolumeSizeInGB": 30
}
}
},
{
"Name": "Train",
"Type": "Training",
"TrainingJobName": "train-job",
"AlgorithmSpecification": {
"TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
"TrainingInputMode": "File"
},
"ResourceConfig": {
"InstanceCount": 2,
"InstanceType": "ml.m5.large",
"VolumeSizeInGB": 30
}
}
]
}
Trap 1: The pipeline encountered a runtime error and fell back to default…
No runtime error occurred.
Trap 2: The parameter name has a typo; it should be 'TrainingInstanceCount'…
The name matches, but it's unused.
Trap 3: The training image is not compatible with the specified instance…
No indication of incompatibility.
- A
The pipeline encountered a runtime error and fell back to default values.
Why wrong: No runtime error occurred.
- B
The parameter name has a typo; it should be 'TrainingInstanceCount' not 'TrainingInstanceCount'.
Why wrong: The name matches, but it's unused.
- C
The steps do not reference the Parameters; the values are hardcoded in the step definitions.
Parameters must be explicitly referenced in steps to take effect.
- D
The training image is not compatible with the specified instance type.
Why wrong: No indication of incompatibility.
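The difference between a hardcoded value and a parameter reference can be sketched as follows. The `{"Get": "Parameters.<name>"}` form is assumed here as the way a pipeline definition refers to a parameter, and the `resolve` function is a hypothetical stand-in for what the pipeline service does at execution time, simplified to the fields from the exhibit.

```python
def resolve(node, parameters):
    """Recursively replace {"Get": "Parameters.<name>"} references with values."""
    if isinstance(node, dict):
        ref = node.get("Get")
        if len(node) == 1 and isinstance(ref, str) and ref.startswith("Parameters."):
            return parameters[ref.split(".", 1)[1]]
        return {key: resolve(value, parameters) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(value, parameters) for value in node]
    return node

parameters = {"TrainingInstanceType": "ml.m5.large", "TrainingInstanceCount": 2}

# As in the exhibit: literal values, so the parameters are never consulted.
hardcoded = {"ResourceConfig": {"InstanceCount": 2, "InstanceType": "ml.m5.large"}}

# With references: the step picks up whatever the parameters say at run time.
referenced = {
    "ResourceConfig": {
        "InstanceCount": {"Get": "Parameters.TrainingInstanceCount"},
        "InstanceType": {"Get": "Parameters.TrainingInstanceType"},
    }
}

override = dict(parameters, TrainingInstanceCount=4)
hardcoded_count = resolve(hardcoded, override)["ResourceConfig"]["InstanceCount"]
referenced_count = resolve(referenced, override)["ResourceConfig"]["InstanceCount"]
print(hardcoded_count, referenced_count)
```

Overriding the parameter changes only the step that references it; the hardcoded step keeps its literal value.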
A machine learning team is preparing a dataset for a regression model. The dataset contains numerical features that are on different scales (e.g., age 0-100, income 0-1,000,000). The team plans to use Amazon SageMaker to train a linear regression model. Which THREE data preparation steps should the team take to ensure the model performs well? (Select THREE.)
Trap 1: Apply feature selection to reduce the number of features.
Feature selection is optional and not a mandatory step for all models.
Trap 2: Remove outliers from the dataset.
Outlier removal is not always required; it depends on the data and model.
- A
Apply feature selection to reduce the number of features.
Why wrong: Feature selection is optional and not a mandatory step for all models.
- B
Remove outliers from the dataset.
Why wrong: Outlier removal is not always required; it depends on the data and model.
- C
Handle missing values by imputation or removal.
Missing values can cause errors or biased models; handling them is necessary.
- D
Encode categorical features using one-hot encoding.
Linear regression requires numerical input; categorical features must be encoded.
- E
Scale numerical features using standardization (z-score) or normalization (min-max scaling).
Linear models are sensitive to feature scales; scaling improves convergence and performance.
A data scientist is working on a time series forecasting problem. The dataset contains a column 'sales' with occasional negative values due to returns. The model expects non-negative input. Which data preparation step should be taken?
Trap 1: Apply log transformation after adding a constant
Log transform is for positive data; adding constant is arbitrary.
Trap 2: Remove all rows with negative sales values
Loses data on returns, which may be informative.
Trap 3: Impute negative values with the mean
Incorrectly treats negative values as missing; mean may be positive.
- A
Clip negative sales values to zero
Sets returns to zero, which is appropriate for sales data.
- B
Apply log transformation after adding a constant
Why wrong: Log transform is for positive data; adding constant is arbitrary.
- C
Remove all rows with negative sales values
Why wrong: Loses data on returns, which may be informative.
- D
Impute negative values with the mean
Why wrong: Incorrectly treats negative values as missing; mean may be positive.
A team is preparing text data for a natural language processing (NLP) model. They have a corpus of customer reviews. Which THREE preprocessing steps are essential to reduce noise and improve model performance?
Trap 1: Apply one-hot encoding to each word
One-hot encoding is for categorical variables, not text preprocessing.
Trap 2: Compute TF-IDF vectors
TF-IDF is a feature extraction step, not preprocessing.
- A
Apply one-hot encoding to each word
Why wrong: One-hot encoding is for categorical variables, not text preprocessing.
- B
Remove punctuation and special characters
Removes noise that does not contribute to meaning.
- C
Compute TF-IDF vectors
Why wrong: TF-IDF is a feature extraction step, not preprocessing.
- D
Perform stemming or lemmatization
Reduces words to root form, reducing dimensionality.
- E
Convert all text to lowercase
Reduces vocabulary size by treating words like 'The' and 'the' as the same token.
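Two of the three steps, lowercasing and punctuation removal, can be sketched with the standard library. Stemming or lemmatization is left out because it normally relies on a dedicated NLP library. The helper name and sample review are assumptions, and the regular expression keeps only ASCII letters and digits, which would be too aggressive for non-English text.

```python
import re

def normalize_review(text):
    """Lowercase the text, replace punctuation with spaces, and split into tokens."""
    lowered = text.lower()
    # Anything that is not an ASCII letter, digit, or whitespace becomes a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return cleaned.split()

tokens = normalize_review("The battery's GREAT!!! Totally worth it :)")
print(tokens)
```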
A team is using Amazon SageMaker Processing for data preprocessing. They have a Parquet dataset in Amazon S3. Which configuration will provide the most efficient reading of the dataset during processing?
Trap 1: Read the Parquet files as text using SparkContext.textFile
Loses the columnar benefits of Parquet.
Trap 2: Split the dataset into many small Parquet files (e.g., 1 MB each)
Too many small files cause I/O overhead.
Trap 3: Convert the Parquet files to CSV before processing
CSV is larger and slower to read than Parquet.
- A
Read the Parquet files as text using SparkContext.textFile
Why wrong: Loses the columnar benefits of Parquet.
- B
Split the dataset into many small Parquet files (e.g., 1 MB each)
Why wrong: Too many small files cause I/O overhead.
- C
Convert the Parquet files to CSV before processing
Why wrong: CSV is larger and slower to read than Parquet.
- D
Read the Parquet files directly using SparkSession.read.parquet
Leverages Parquet's efficiency and schema.
A machine learning engineer is preparing a dataset for a multiclass classification task. The dataset has 10 features and 100,000 rows. Which TWO techniques should the engineer use to reduce the risk of overfitting during data preparation?
Trap 1: SMOTE to balance classes
Addresses class imbalance, not general overfitting.
Trap 2: One-hot encoding of all categorical features
Increases dimensionality, potentially worsening overfitting.
Trap 3: Log transformation of skewed features
Addresses skewness, not overfitting.
- A
Data augmentation (e.g., adding noise)
Increases training data diversity, reducing overfitting.
- B
SMOTE to balance classes
Why wrong: Addresses class imbalance, not general overfitting.
- C
One-hot encoding of all categorical features
Why wrong: Increases dimensionality, potentially worsening overfitting.
- D
Log transformation of skewed features
Why wrong: Addresses skewness, not overfitting.
- E
Feature selection using correlation analysis
Removes irrelevant/redundant features, reducing complexity.
A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?
Trap 1: Label encoding
Assumes ordinal relationship, not suitable for nominal categories.
Trap 2: One-hot encoding
Creates too many features, leading to high dimensionality.
Trap 3: Frequency encoding
Replaces with count/frequency, may lose information.
- A
Label encoding
Why wrong: Assumes ordinal relationship, not suitable for nominal categories.
- B
One-hot encoding
Why wrong: Creates too many features, leading to high dimensionality.
- C
Frequency encoding
Why wrong: Replaces with count/frequency, may lose information.
- D
Target encoding
Encodes using target mean, handles high cardinality well.
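A minimal sketch of target encoding follows. The smoothing term, which pulls rare categories toward the global mean, is an addition for illustration and not something the question specifies; the zip codes and labels are made up.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=1.0):
    """Map each category to a smoothed mean of the target.

    Rare categories are pulled toward the global mean so that a single
    row does not produce an extreme encoding.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }

zip_codes = ["98101", "98101", "98101", "10001"]
churned = [1, 1, 0, 0]
encoding = target_encode(zip_codes, churned)
print(encoding)
```

Each zip code becomes a single numeric column regardless of how many distinct values exist. Because the encoding uses the target, it should be computed on training data only to avoid leaking labels into validation or test sets.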
A data scientist is building a text classification model using a pre-trained BERT model from the Hugging Face library on SageMaker. The scientist wants to fine-tune the model on a custom dataset. Which TWO steps are necessary to set up the fine-tuning job? (Select TWO.)
Trap 1: Enable SageMaker Clarify for explainability during training
Clarify is for bias detection and explainability, not for training setup.
Trap 2: Build a custom Docker container with PyTorch and Transformers
SageMaker provides built-in HuggingFace containers, so custom container is not needed.
Trap 3: Use SageMaker Processing to preprocess the data in parallel
Processing is for data preprocessing, not for fine-tuning setup.
- A
Use the HuggingFace estimator provided by SageMaker
The HuggingFace estimator simplifies fine-tuning with pre-built containers.
- B
Enable SageMaker Clarify for explainability during training
Why wrong: Clarify is for bias detection and explainability, not for training setup.
- C
Build a custom Docker container with PyTorch and Transformers
Why wrong: SageMaker provides built-in HuggingFace containers, so custom container is not needed.
- D
Specify the PyTorch framework version and Transformers version in the estimator
Versions ensure compatibility with the pre-trained model.
- E
Use SageMaker Processing to preprocess the data in parallel
Why wrong: Processing is for data preprocessing, not for fine-tuning setup.
A machine learning engineer is deploying a custom PyTorch model to a SageMaker endpoint for real-time inference. The model requires GPU acceleration. The engineer wants to minimize latency and cost. Which THREE actions should the engineer take? (Select THREE.)
Trap 1: Use an ml.c5.2xlarge instance with CPU only
CPU only would not provide GPU acceleration required.
Trap 2: Use SageMaker Batch Transform for inference
Batch Transform is for offline inference, not real-time.
- A
Use an ml.c5.2xlarge instance with CPU only
Why wrong: CPU only would not provide GPU acceleration required.
- B
Use SageMaker Batch Transform for inference
Why wrong: Batch Transform is for offline inference, not real-time.
- C
Compile the model with SageMaker Neo
Neo optimizes the model for faster inference on target hardware.
- D
Use SageMaker Elastic Inference (EI) instead of a full GPU instance
EI provides GPU acceleration at lower cost for small models.
- E
Use an ml.p3.2xlarge instance for the endpoint
GPU instance provides needed acceleration for low latency.
A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?
Trap 1: Random undersampling of the majority class
Removes potentially valuable data.
Trap 2: Random oversampling of the minority class
Duplicates existing samples, risk of overfitting.
Trap 3: Apply class weights during model training
Affects loss function, not data preparation.
- A
SMOTE (Synthetic Minority Over-sampling Technique)
Generates synthetic samples for the minority class.
- B
Random undersampling of the majority class
Why wrong: Removes potentially valuable data.
- C
Random oversampling of the minority class
Why wrong: Duplicates existing samples, risk of overfitting.
- D
Apply class weights during model training
Why wrong: Affects loss function, not data preparation.
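The core step of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as below. This is a simplified illustration rather than a full implementation; the function names, the choice of k, and the sample points are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbor."""
    if rng is None:
        rng = random.Random(0)
    base_index = rng.randrange(len(minority))
    base = minority[base_index]
    others = [p for i, p in enumerate(minority) if i != base_index]
    neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment from base to neighbor
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, rng=rng) for _ in range(5)]
print(synthetic)
```

Unlike random oversampling, the new points are not copies of existing rows, and unlike undersampling, no majority rows are discarded.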
A company wants to use a pre-trained NLP model from SageMaker JumpStart for sentiment analysis. Which step is required to make predictions?
Trap 1: Label the dataset for fine-tuning
Labeling is needed only if fine-tuning; pre-trained models can be used directly.
Trap 2: Train the model from scratch on the company's data
JumpStart provides pre-trained models; fine-tuning is optional but not required for basic predictions.
Trap 3: Convert the model to ONNX format
ONNX conversion is not required; JumpStart models are already in a supported format.
- A
Label the dataset for fine-tuning
Why wrong: Labeling is needed only if fine-tuning; pre-trained models can be used directly.
- B
Train the model from scratch on the company's data
Why wrong: JumpStart provides pre-trained models; fine-tuning is optional but not required for basic predictions.
- C
Convert the model to ONNX format
Why wrong: ONNX conversion is not required; JumpStart models are already in a supported format.
- D
Deploy the model to an endpoint
Deploying to a SageMaker endpoint allows real-time inference on new data.
A data scientist is training a binary classification model using imbalanced data where the positive class is only 1% of the dataset. The scientist wants to maximize the recall for the positive class while maintaining reasonable precision. Which evaluation metric is most appropriate to tune during model selection?
Trap 1: Log loss
Log loss measures probability calibration, not classification performance for the minority class directly.
Trap 2: Area under the ROC curve (AUC)
AUC measures rank ordering but does not directly optimize recall at a specific threshold.
Trap 3: Accuracy
Accuracy can be high even if the model predicts all negatives, failing to capture the minority class.
- A
Log loss
Why wrong: Log loss measures probability calibration, not classification performance for the minority class directly.
- B
Area under the ROC curve (AUC)
Why wrong: AUC measures rank ordering but does not directly optimize recall at a specific threshold.
- C
F1 score
F1 score combines precision and recall, making it suitable for imbalanced classes when both matter.
- D
Accuracy
Why wrong: Accuracy can be high even if the model predicts all negatives, failing to capture the minority class.
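The contrast between accuracy and F1 on a 1% positive class can be worked through directly. The labels and predictions below are made up to match the scenario.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1] + [0] * 99            # 1% positive class
all_negative = [0] * 100           # a model that never predicts positive
one_false_alarm = [1, 1] + [0] * 98  # finds the positive, plus one false positive

accuracy_all_negative = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
f1_all_negative = precision_recall_f1(y_true, all_negative)[2]
f1_one_false_alarm = precision_recall_f1(y_true, one_false_alarm)[2]
print(accuracy_all_negative, f1_all_negative, f1_one_false_alarm)
```

The all-negative model scores 0.99 accuracy but an F1 of 0, while the model that catches the positive at the cost of one false alarm reaches an F1 of about 0.67.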
An ML engineer needs to split a dataset into training, validation, and test sets. The dataset has a time-based column that should not be leaked. Which split method is most appropriate?
Trap 1: Stratified split based on target
Stratification maintains class proportions but ignores time order.
Trap 2: Random split with 70/20/10
Random split disregards time order and can leak future information into training.
Trap 3: K-fold cross-validation
Standard k-fold is random and can cause leakage in time series.
- A
Stratified split based on target
Why wrong: Stratification maintains class proportions but ignores time order.
- B
Temporal split based on date
Temporal split respects chronology by using earlier data for training and later data for testing.
- C
Random split with 70/20/10
Why wrong: Random split disregards time order and can leak future information into training.
- D
K-fold cross-validation
Why wrong: Standard k-fold is random and can cause leakage in time series.
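A temporal split can be sketched as a sort followed by slicing. The 70/20/10 proportions, the row layout, and the helper name are assumptions for illustration.

```python
from datetime import date

def temporal_split(rows, date_key, train_frac=0.7, val_frac=0.2):
    """Sort rows by date and cut them into train, validation, and test in order."""
    ordered = sorted(rows, key=lambda row: row[date_key])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

# Ten daily rows, deliberately out of order.
rows = [{"day": date(2024, 1, d), "y": d % 2} for d in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]
train, val, test = temporal_split(rows, "day")
print(len(train), len(val), len(test))
```

Every training row is earlier than every validation row, and every validation row is earlier than every test row, so no future information reaches training.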
A team is building a regression model on a dataset with missing values in multiple features. They decide to use a k-Nearest Neighbors (k-NN) imputer. The dataset has 100,000 rows and 50 features. Which step should the team take to ensure the imputation is efficient and accurate?
Trap 1: Set k=1 to minimize bias
k=1 is prone to overfitting and noise.
Trap 2: Use all 100,000 rows to find neighbors for each missing value
Computationally expensive; consider sampling or approximate methods.
Trap 3: Use only the feature with missing values to find neighbors
Does not use information from other features.
- A
Set k=1 to minimize bias
Why wrong: k=1 is prone to overfitting and noise.
- B
Use all 100,000 rows to find neighbors for each missing value
Why wrong: Computationally expensive; consider sampling or approximate methods.
- C
Standardize the features before applying k-NN imputation
Ensures distance is equally weighted across features.
- D
Use only the feature with missing values to find neighbors
Why wrong: Does not use information from other features.