MLA-C01 Practice Questions

Question 1

A company is running a SageMaker endpoint serving multiple models. They need to monitor for data drift and model quality. Which THREE actions are necessary? (Choose three.)

Accepted Answer

Enable data capture on the endpoint. Option B is correct because enabling data capture on the SageMaker endpoint is a prerequisite for monitoring data drift and model quality. Data capture automatically records input requests and output responses from the endpoint, which SageMaker Model Monitor later analyzes against a baseline to detect drift. Without data capture, there is no data to compare against the baseline constraints.

Answer

Deploy a shadow endpoint for comparison

Answer

Use SageMaker Debugger for monitoring
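
To make the accepted answer for Question 1 concrete, here is a minimal sketch of a data capture configuration shaped like the `DataCaptureConfig` structure accepted by the SageMaker `CreateEndpointConfig` API. The bucket URI and the sampling percentage are placeholder assumptions, and the dictionary is only built and printed here, not sent to AWS.

```python
import json


def build_data_capture_config(s3_uri: str, sampling_percentage: int = 100) -> dict:
    """Return a DataCaptureConfig-style dict that records requests and responses."""
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 0 and 100")
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": s3_uri,
        # Capture both sides of each invocation so Model Monitor can
        # compare inputs and predictions against the baseline.
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    }


# Hypothetical bucket and prefix, used only for illustration.
capture_config = build_data_capture_config("s3://example-bucket/datacapture")
print(json.dumps(capture_config, indent=2))
```

In practice this structure would be passed alongside the production variants when the endpoint configuration is created.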

Question 2

A data scientist trained a logistic regression model on a dataset with 100 features. After training, the training accuracy is 0.99 but validation accuracy is 0.75. Which action is MOST likely to reduce overfitting?

Accepted Answer

Increase the regularization strength. The model shows high training accuracy (0.99) but significantly lower validation accuracy (0.75), which is a classic sign of overfitting. Increasing the regularization strength (e.g., L1 or L2 penalty) in logistic regression directly penalizes large coefficients, reducing the model's complexity and improving generalization. This is the most direct way to address overfitting in a logistic regression model.

Answer

Increase the number of features

Answer

Use a more complex model like XGBoost

Answer

Use stratified cross-validation
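
The effect described in the accepted answer for Question 2 can be sketched with a tiny pure-Python logistic regression trained by batch gradient descent with an L2 penalty. The toy dataset, learning rate, and epoch count are illustrative assumptions, not values from the question; the point is only that a larger penalty yields smaller coefficients.

```python
import math


def train_logreg(X, y, l2=0.0, lr=0.1, epochs=1000):
    """Fit logistic regression with batch gradient descent and an L2 penalty on w."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # The l2 * wj term is the gradient of 0.5 * l2 * ||w||^2 (bias is not penalized).
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def l2_norm(w):
    return math.sqrt(sum(wj * wj for wj in w))


# Toy, linearly separable data (assumed for illustration).
X = [[0.0, 1.0], [1.0, 0.5], [2.0, 1.5], [3.0, 1.0], [4.0, 2.0], [5.0, 1.5]]
y = [0, 0, 0, 1, 1, 1]

w_weak, _ = train_logreg(X, y, l2=0.0)
w_strong, _ = train_logreg(X, y, l2=1.0)
print("no penalty:    ", l2_norm(w_weak))
print("strong penalty:", l2_norm(w_strong))
```

Note that some libraries express the penalty inversely; in scikit-learn's `LogisticRegression`, for example, a smaller `C` means stronger regularization.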

Question 3

A team is training a deep learning model on Amazon SageMaker using a custom Docker container. Which THREE practices should they follow to optimize training performance? (Choose three.)

Accepted Answer

Store training data in Amazon S3 in a shuffled and compressed format. Storing training data in Amazon S3 in a shuffled and compressed format (Option A) optimizes training performance because shuffling prevents biased gradient updates during stochastic gradient descent, while compression reduces I/O overhead and network transfer time. SageMaker's Pipe mode can then stream this compressed data directly to the training algorithm without intermediate disk writes, further accelerating throughput.

Answer

Use the largest instance type available

Answer

Increase the number of layers in the model to improve accuracy

Question 4

A company is using SageMaker to train a neural network for image classification. The training job is taking too long. The team wants to reduce training time without sacrificing model accuracy. Which approach should they recommend?

Accepted Answer

Use a GPU-based instance such as ml.p3.2xlarge. Option B is correct because GPU-based instances like ml.p3.2xlarge are specifically designed for parallel processing of matrix operations, which are fundamental to neural network training. By offloading compute-intensive tensor operations to GPU cores, training time can be significantly reduced without altering the model architecture or data, thus preserving accuracy.

Answer

Increase the batch size to the maximum possible

Answer

Use a learning rate scheduler that reduces the learning rate over time

Answer

Add more convolutional layers to the model

Question 5

A team is developing a model to predict customer churn. The dataset has 10,000 samples with 20 features. The target variable is binary with 15% churn rate. The team wants to use logistic regression. Which data preprocessing step is MOST important to ensure proper convergence?

Accepted Answer

Standardize the features to have zero mean and unit variance. Logistic regression is typically fit with gradient descent or a similar iterative optimizer, whose behavior depends on the scale of the features. When features have different units or magnitudes, the contours of the loss surface become elongated, causing slow or unstable convergence. Standardizing to zero mean and unit variance puts all features on a comparable scale so that no single feature dominates the gradient updates, leading to faster and more reliable convergence.

Answer

Remove correlated features to reduce multicollinearity

Answer

Impute missing values with the median

Answer

Apply SMOTE to balance the classes
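
As an illustration of the accepted answer for Question 5, the following sketch standardizes two columns on very different scales using only the standard library. The `ages` and `incomes` values are made-up examples, and the population standard deviation is used here as an assumption about the convention.

```python
import statistics


def standardize(column):
    """Rescale a numeric column to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


ages = [20, 30, 40, 50, 60]
incomes = [20_000, 40_000, 60_000, 80_000, 100_000]

ages_z = standardize(ages)
incomes_z = standardize(incomes)
print(ages_z)
print(incomes_z)
```

After the transformation both columns span the same range, so neither dominates the gradient. In a real pipeline the mean and standard deviation would be computed on the training split only and then reused for validation and test data.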

Question 6

A data engineer is processing a large dataset in Amazon S3 with AWS Glue ETL. The dataset contains timestamps in multiple time zones. The engineer needs to create a feature for hour-of-day consistent across all records. Which approach ensures correctness?

Accepted Answer

Convert all timestamps to UTC during the ETL process, then extract hour. Option D is correct because converting all timestamps to UTC during the ETL process ensures a consistent time zone reference before extracting the hour-of-day feature. This avoids ambiguity from mixed time zones and aligns with best practices for machine learning feature engineering. AWS Glue ETL with Apache Spark provides built-in functions like `to_utc_timestamp()` to perform this conversion reliably.

Answer

Convert all timestamps to UTC in the ETL script using Spark's from_utc_timestamp

Answer

Use AWS Glue's built-in transform to parse timestamps with timezone offsets

Answer

Use Python's datetime.strptime with tzlocal
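
The idea behind the accepted answer for Question 6 (normalize to UTC first, then extract the hour) can be sketched in plain Python. This is not Glue or Spark code; it assumes each record is an ISO 8601 string that carries its own UTC offset, and the sample timestamps are invented for illustration.

```python
from datetime import datetime, timezone


def hour_of_day_utc(iso_timestamp: str) -> int:
    """Parse an offset-aware ISO 8601 timestamp and return its hour in UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Without an offset the instant is ambiguous, so refuse to guess.
        raise ValueError("timestamp must include a UTC offset")
    return parsed.astimezone(timezone.utc).hour


records = [
    "2024-03-01T09:30:00+01:00",
    "2024-03-01T03:30:00-05:00",
    "2024-03-01T17:30:00+09:00",
]
hours = [hour_of_day_utc(ts) for ts in records]
print(hours)  # [8, 8, 8]
```

All three records describe the same instant, so they yield the same hour once converted, whereas extracting the hour from the local strings would give 9, 3, and 17.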

Question 7

A dataset contains a numerical feature with extreme outliers. The outliers are genuine (not errors), and the ML model is a linear regression which is sensitive to outliers. Which data transformation should be applied to reduce the impact of outliers while preserving the data?

Accepted Answer

Robust scaling (median and IQR). Robust scaling centers the data on the median and scales it by the interquartile range (IQR). Because the median and IQR are barely affected by extreme values, the scaling parameters are not distorted by the outliers, unlike methods that rely on the mean and variance. Every observation is kept, and the transformation is linear, so the shape of the original distribution is preserved.

Answer

Min-max scaling

Answer

Log transformation

Answer

Standardization (z-score)
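
A small sketch of the robust scaling named in the accepted answer for Question 7, using only the standard library. The sample values are invented, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is an assumption about the quantile convention.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    return [(v - median) / iqr for v in values]


# Five typical values plus one genuine extreme value (assumed data).
values = [10, 12, 14, 16, 18, 1000]
scaled = robust_scale(values)
print(scaled)
```

The typical values stay spread around zero because the median and IQR ignore the extreme point. The outlier itself is kept and remains far from the rest, so robust scaling protects the scaling statistics rather than compressing the extreme value.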

Question 8

A data scientist is preparing a dataset for a binary classification model to predict customer churn. The dataset contains a timestamp column 'signup_date' that is not relevant for the prediction. What is the most appropriate action to handle this column?

Accepted Answer

Drop the 'signup_date' column from the dataset. Option D is correct because the 'signup_date' column is explicitly stated as not relevant for the prediction. In binary classification for customer churn, including an irrelevant timestamp can introduce noise, increase dimensionality, and potentially cause overfitting. Dropping the column is the most appropriate action to maintain model simplicity and focus on predictive features.

Answer

Apply one-hot encoding to the year, month, and day components.

Answer

Convert the timestamp to a numeric feature (e.g., days since signup) and include it.

Answer

Use leave-one-out encoding based on the target variable.

Question 9

An ML team wants to deploy a model that was trained using XGBoost in SageMaker. They want to use the built-in XGBoost algorithm container for inference. Which inference option requires the least custom code?

Accepted Answer

Deploy to a real-time endpoint using the built-in XGBoost container. Option B is correct because the built-in XGBoost container in SageMaker is pre-configured with the XGBoost serving stack, including the necessary inference code and dependencies. Deploying a model trained with XGBoost to a real-time endpoint using this container requires no custom inference script or Docker image, only the model artifact and endpoint configuration. This minimizes custom code to just the SageMaker SDK calls for creating the model and endpoint.

Answer

Create a custom Docker container with XGBoost and deploy to an endpoint

Answer

Attach Elastic Inference to a generic container

Answer

Use SageMaker Python SDK to download the model and run local inference

Question 10

An ML engineer runs the CLI command shown in the exhibit. However, the training job fails immediately with an error: 'Unable to assume role'. What is the most likely cause?

Accepted Answer

The IAM role's trust policy does not grant SageMaker permission to assume the role. The 'Unable to assume role' error indicates that SageMaker cannot assume the IAM role specified in the CLI command. This is a trust policy issue: the role's trust policy must list SageMaker as a trusted service principal (i.e., `"Service": "sagemaker.amazonaws.com"`) and allow the `sts:AssumeRole` action. Without this, SageMaker is not authorized to assume the role, regardless of the role's permissions.

Answer

The IAM role 'SageMakerExecutionRole' does not have permission to create the training job.

Answer

The training image in ECR does not exist.

Answer

The S3 bucket 'my-bucket' does not exist.
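
For the accepted answer to Question 10, the following sketch builds a trust policy document that lets the SageMaker service assume a role, plus a small checker. The helper names are hypothetical and nothing here calls AWS; the policy is only constructed and inspected locally.

```python
import json


def sagemaker_trust_policy() -> dict:
    """Trust policy allowing the SageMaker service principal to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def trusts_sagemaker(policy: dict) -> bool:
    """Return True if any Allow statement lets SageMaker call sts:AssumeRole."""
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if (
            stmt.get("Effect") == "Allow"
            and "sagemaker.amazonaws.com" in services
            and "sts:AssumeRole" in actions
        ):
            return True
    return False


policy = sagemaker_trust_policy()
print(json.dumps(policy, indent=2))
print(trusts_sagemaker(policy))  # True
```

A role whose trust policy names only another service, such as `ec2.amazonaws.com`, would fail this check, which matches the failure described in the question.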

Question 11

Refer to the exhibit. A data scientist creates a SageMaker Pipeline definition using the JSON shown. The pipeline runs successfully, but the scientist notices that the training step did not use the parameter 'TrainingInstanceCount' defined in Parameters. Why did this happen?

Accepted Answer

The steps do not reference the Parameters; the values are hardcoded in the step definitions. Option C is correct because the SageMaker Pipeline definition shows that the training step's `InstanceCount` field is hardcoded to `1` in the step definition, rather than referencing the `TrainingInstanceCount` parameter (e.g., `{"Get": "Parameters.TrainingInstanceCount"}` in the JSON definition). In SageMaker Pipelines, parameters defined in the `Parameters` section must be explicitly referenced within the step definitions; otherwise, the pipeline uses the hardcoded values and ignores the parameters entirely.

Answer

The pipeline encountered a runtime error and fell back to default values.

Answer

The parameter name has a typo; it should be 'TrainingInstanceCount' not 'TrainingInstanceCount'.

Answer

The training image is not compatible with the specified instance type.
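
The difference described in the accepted answer for Question 11 can be sketched with two hypothetical step fragments and a toy resolver. The fragments are not the exhibit from the question, and the `resolve` function is an illustrative stand-in rather than SageMaker's own logic; it assumes parameter references take the `{"Get": "Parameters.<Name>"}` form.

```python
def resolve(node, params):
    """Recursively replace {"Get": "Parameters.X"} references with parameter values."""
    if isinstance(node, dict):
        ref = node.get("Get")
        if set(node) == {"Get"} and isinstance(ref, str) and ref.startswith("Parameters."):
            return params[ref.split(".", 1)[1]]
        return {key: resolve(value, params) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, params) for item in node]
    return node


# Hypothetical step arguments: one hardcoded, one referencing the parameter.
hardcoded_step = {"ResourceConfig": {"InstanceCount": 1}}
referenced_step = {
    "ResourceConfig": {"InstanceCount": {"Get": "Parameters.TrainingInstanceCount"}}
}
params = {"TrainingInstanceCount": 4}

print(resolve(hardcoded_step, params))   # parameter value is ignored
print(resolve(referenced_step, params))  # parameter value is substituted
```

Only the second fragment picks up the parameter value; the first keeps its literal `1` no matter what is supplied at execution time.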

Question 12

A machine learning team is preparing a dataset for a regression model. The dataset contains numerical features that are on different scales (e.g., age 0-100, income 0-1,000,000). The team plans to use Amazon SageMaker to train a linear regression model. Which THREE data preparation steps should the team take to ensure the model performs well? (Select THREE.)

Accepted Answer

Handle missing values by imputation or removal. Option C is correct because missing values can cause errors or biased estimates in linear regression models. Amazon SageMaker's built-in Linear Learner algorithm does not handle missing data automatically, so imputation (e.g., mean/median) or removal is necessary to ensure the training process completes and produces reliable coefficients.

Answer

Apply feature selection to reduce the number of features.

Answer

Remove outliers from the dataset.

Question 13

A data scientist is working on a time series forecasting problem. The dataset contains a column 'sales' with occasional negative values due to returns. The model expects non-negative input. Which data preparation step should be taken?

Accepted Answer

Clip negative sales values to zero. Option A is correct because clipping negative sales values to zero directly addresses the model's requirement for non-negative input while preserving the data's temporal structure. This approach is appropriate for time series forecasting where returns cause occasional negative values, as it treats returns as zero sales rather than removing or distorting the data points.

Answer

Apply log transformation after adding a constant

Answer

Remove all rows with negative sales values

Answer

Impute negative values with the mean
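
The clipping step from the accepted answer for Question 13 is simple enough to sketch directly. The `sales` values below are invented; the function keeps every time step in place and only replaces negative entries with zero.

```python
def clip_negative_to_zero(series):
    """Replace negative values with 0.0 while keeping order and length."""
    return [max(0.0, float(value)) for value in series]


sales = [120.0, -15.0, 80.0, 0.0, -3.5, 42.0]
clipped = clip_negative_to_zero(sales)
print(clipped)  # [120.0, 0.0, 80.0, 0.0, 0.0, 42.0]
```

Because no rows are dropped, the series keeps its regular time index, which is what most forecasting models expect.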

Question 14

A team is preparing text data for a natural language processing (NLP) model. They have a corpus of customer reviews. Which THREE preprocessing steps are essential to reduce noise and improve model performance?

Accepted Answer

Remove punctuation and special characters. Option B is correct because punctuation and special characters (e.g., commas, exclamation marks) introduce irrelevant noise that does not carry semantic meaning for most NLP models. Removing them reduces vocabulary size and prevents the model from treating 'hello!' and 'hello' as distinct tokens, which improves generalization and reduces overfitting.

Answer

Apply one-hot encoding to each word

Answer

Compute TF-IDF vectors
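
To illustrate the accepted answer for Question 14, here is a minimal sketch of punctuation removal with a regular expression. The sample reviews are invented, and replacing punctuation with a space is one possible design choice; it splits contractions such as "don't" into two tokens, which may or may not be desirable.

```python
import re

# Anything that is neither a word character nor whitespace.
_NON_WORD = re.compile(r"[^\w\s]")
_WHITESPACE = re.compile(r"\s+")


def strip_punctuation(text: str) -> str:
    """Replace punctuation and special characters with spaces, then tidy whitespace."""
    cleaned = _NON_WORD.sub(" ", text)
    return _WHITESPACE.sub(" ", cleaned).strip()


reviews = ["hello!", "hello", "Great product, would buy again!!!"]
for review in reviews:
    print(strip_punctuation(review))
```

After this step 'hello!' and 'hello' map to the same token, which is the vocabulary reduction the explanation refers to.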

Question 15

A team is using Amazon SageMaker Processing for data preprocessing. They have a Parquet dataset in Amazon S3. Which configuration will provide the most efficient reading of the dataset during processing?

Accepted Answer

Read the Parquet files directly using SparkSession.read.parquet. Option D is correct because SageMaker Processing natively integrates with Apache Spark, and reading Parquet files directly via `SparkSession.read.parquet` leverages columnar storage, predicate pushdown, and compression (e.g., Snappy) to minimize I/O and deserialization overhead. This approach is far more efficient than text-based or format-conversion methods, as Parquet is optimized for analytical workloads and preserves schema information.

Answer

Read the Parquet files as text using SparkContext.textFile

Answer

Split the dataset into many small Parquet files (e.g., 1 MB each)

Answer

Convert the Parquet files to CSV before processing

AWS Certified Machine Learning Engineer Associate MLA-C01 practice test

A machine learning engineer is preparing a dataset for a multiclass classification task. The dataset has 10 features and 100,000 rows. Which TWO techniques should the engineer use to reduce the risk of overfitting during data preparation?

A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?

A data scientist is building a text classification model using a pre-trained BERT model from the Hugging Face library on SageMaker. The scientist wants to fine-tune the model on a custom dataset. Which TWO steps are necessary to set up the fine-tuning job? (Select TWO.)

A machine learning engineer is deploying a custom PyTorch model to a SageMaker endpoint for real-time inference. The model requires GPU acceleration. The engineer wants to minimize latency and cost. Which THREE actions should the engineer take? (Select THREE.)

A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?

A company wants to use a pre-trained NLP model from SageMaker JumpStart for sentiment analysis. Which step is required to make predictions?

An ML engineer needs to split a dataset into training, validation, and test sets. The dataset has a time-based column that should not be leaked. Which split method is most appropriate?

A team is building a regression model on a dataset with missing values in multiple features. They decide to use a k-Nearest Neighbors (k-NN) imputer. The dataset has 100,000 rows and 50 features. Which step should the team take to ensure the imputation is efficient and accurate?
