MLS-C01 Exam Questions and Answers

Domain 1: Data Engineering

A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?

Use SageMaker File input mode and increase the EBS volume size to 1 TB.

Use SageMaker Pipe input mode to stream data directly from S3.

Pipe mode streams data on-the-fly, eliminating the need to download the full dataset, thus reducing I/O wait time.

Convert the CSV files to Parquet format and use File input mode.

Load the data into an Amazon EFS file system and mount it to the training instance.

Why: Option B is correct because SageMaker Pipe input mode streams data directly from S3 to the training algorithm without writing to the instance's EBS volume, eliminating disk I/O bottlenecks. This is especially effective for large datasets (500 GB) that are updated daily, as it reduces startup time and avoids the need to download the entire dataset before training begins.
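As a rough sketch, the input mode is chosen per data channel in the training job request. The helper below builds one InputDataConfig entry in the shape accepted by the CreateTrainingJob API; the function name, channel name, and S3 URI are placeholders chosen for illustration, not values from the question.

```python
def build_channel(s3_uri: str, input_mode: str = "Pipe") -> dict:
    """Return one InputDataConfig entry for a SageMaker training job request."""
    if input_mode not in {"File", "Pipe", "FastFile"}:
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": "train",  # hypothetical channel name
        "InputMode": input_mode,  # "Pipe" streams from S3 instead of downloading first
        "ContentType": "text/csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


channel = build_channel("s3://example-bucket/training-data/")  # placeholder URI
print(channel["InputMode"])
```

Passing input_mode="File" instead produces the download-first behavior described in the other options.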

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by a Lambda function that writes records to an S3 bucket. Recently, the number of shards was increased from 2 to 4 to handle higher throughput. After the change, the Lambda function started processing records with increased latency and some records were being written out of order. What is the MOST likely cause?

The S3 bucket is not configured with versioning, causing overwrites.

The Lambda function is reading from the oldest sequence number, causing high IteratorAgeSeconds.

The Lambda function’s reserved concurrency is too low for the increased shard count.

The partition key used by the producer does not ensure that related records go to the same shard after resharding.

After resharding, the mapping of partition keys to shards changes. If ordering matters, the partition key must be chosen to keep related records together.

Why: Option D is correct because after resharding from 2 to 4 shards, the mapping of partition keys to shards changes. If the producer does not use a partition key that ensures related records (e.g., same user session) are routed to the same shard, records that were previously ordered within a shard may now be split across multiple shards. Since the Lambda consumer processes shards independently, records from the same logical sequence can arrive out of order, and the increased shard count can also cause higher latency if the consumer is not properly parallelized.
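The remapping can be illustrated with a small sketch. Kinesis hashes each partition key with MD5 into a 128-bit hash key space and routes the record to the shard whose hash key range contains the result. The code below assumes the shards split that space evenly, which is a simplification, and the session keys are made up for the example.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis maps partition keys into a 128-bit hash key space


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Index of the shard whose (assumed evenly split) hash range holds the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


for key in ["session-1", "session-2", "session-3"]:  # hypothetical keys
    print(key, shard_for_key(key, 2), "->", shard_for_key(key, 4))
```

For a fixed shard layout a given key always lands on the same shard, so per-key ordering depends on choosing a key that groups related records; a key that varies within a session spreads that session across shards.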

A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?

Amazon Athena

Amazon EMR with Spark

AWS Glue

AWS Glue is a serverless ETL service that can perform the transformation efficiently.

AWS Data Pipeline

Why: AWS Glue is the correct choice because it provides a fully managed, serverless ETL service that can automatically convert CSV files from S3 into Parquet format using its built-in Spark engine. It is cost-effective as you only pay for the resources consumed during the job execution, and it integrates directly with data warehouses like Amazon Redshift for loading transformed data.

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be partitioned by year, month, and day. The delivery stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The data producer sends about 1 MB per second. The data is arriving in S3 but the partitions are not being created as expected. What is the MOST likely reason?

The data is encrypted with AWS KMS and Firehose cannot write to encrypted buckets.

The delivery stream does not have dynamic partitioning enabled with the appropriate custom prefix.

Without dynamic partitioning and the correct prefix, Firehose will not partition the data by year/month/day.

The buffer interval is too short for the data volume, causing incomplete records.

The S3 bucket has versioning enabled, which prevents partitioning.

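Why: Option B is the best answer because Kinesis Data Firehose only writes Hive-style partitions such as year=/month=/day= when the delivery stream is configured with a custom prefix (e.g., 'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/'). If the partition values must come from fields inside the records rather than from the arrival time, dynamic partitioning must also be enabled so that the prefix can reference the extracted keys. Without a custom prefix, Firehose falls back to its default UTC-based YYYY/MM/dd/HH/ prefix, which does not produce the desired partition structure.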

An ML team is building a recommendation system. The training data includes user-item interactions stored in Amazon DynamoDB. The team wants to export this data to S3 in Parquet format for use with Amazon SageMaker. The export should be incremental (only new or changed records) and run daily. Which approach meets these requirements with MINIMAL operational overhead?

Use the DynamoDB Export to S3 feature and schedule it daily with AWS Glue.

Use DynamoDB Streams with AWS Lambda to write changes to S3 in Parquet format.

Streams capture changes in near-real-time, enabling incremental exports with minimal overhead.

Use a script that scans the DynamoDB table and filters by last updated timestamp.

Set up an Amazon EMR cluster running Spark jobs to read DynamoDB and write to S3.

Why: Option B is correct because DynamoDB Streams capture every change (insert, update, delete) in near real-time, and AWS Lambda can process these events to write only the changed records to S3 in Parquet format. This approach provides incremental, daily exports with minimal operational overhead, as it is fully serverless and requires no infrastructure management.

A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses a ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?

Local input mode

Batch input mode

File input mode

File mode downloads data to the local file system, making it available for training.

Pipe input mode

Why: File input mode is correct because it downloads the entire training dataset from S3 to the training instance's file system before training begins, so the data is available locally as required. The ml.m5.large is an EBS-only instance type with 8 GiB of memory and no local instance store, so SageMaker places the downloaded data on the ML storage (Amazon EBS) volume attached to the instance. The volume size configured for the training job must therefore be large enough to hold the 10 GB dataset.


Domain 2: Machine Learning Implementation and Operations

A company is using Amazon SageMaker to train a deep learning model. The training job is failing with an error 'CUDA out of memory'. The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model architecture and batch size are appropriate for this instance size. Which action is most likely to resolve this error?

Reduce the number of epochs.

Increase the number of GPUs by using a distributed training instance type.

Enable automatic mixed precision (AMP) training to reduce memory usage.

AMP uses FP16 where possible, cutting memory usage roughly in half, which often resolves out-of-memory errors.

Use a smaller instance type to force lower memory usage.

Why: Option C is correct because enabling automatic mixed precision (AMP) training reduces GPU memory usage by storing tensors in half-precision (FP16) where possible, while keeping critical operations in full precision (FP32). This directly addresses the 'CUDA out of memory' error on an ml.p3.2xlarge instance (16 GB GPU memory) without changing the model architecture or batch size, which are already appropriate.

A data scientist is deploying a model using Amazon SageMaker. The model endpoint needs to handle real-time inference requests with low latency. The model is a large ensemble of 10 deep learning models, each approximately 500 MB. What is the most cost-effective deployment strategy that meets the low-latency requirement?

Deploy each model to a separate endpoint and use a load balancer.

Use a single endpoint with multiple instances behind it.

Use a SageMaker batch transform job to process inference requests in batches.

Use a SageMaker multi-model endpoint to host all models on one or more instances.

Multi-model endpoints efficiently host multiple models on shared instances, reducing cost.

Why: A SageMaker multi-model endpoint (MME) allows hosting multiple models on a single or few instances, dynamically loading them from Amazon S3 into memory as needed. This is the most cost-effective option for a large ensemble of 500 MB models because it avoids the expense of separate endpoints or multiple instances per model, while still supporting low-latency real-time inference by keeping frequently used models cached.

A company is using Amazon SageMaker to train a model with a custom algorithm. The training script reads data from an S3 bucket using boto3. The training job fails with an 'AccessDenied' error when trying to access the S3 bucket. The IAM role attached to the SageMaker notebook instance has full S3 access. What is the most likely cause?

The S3 bucket has a bucket policy that denies access from the SageMaker service.

The SageMaker execution role used for the training job does not have S3 access permissions.

The training job uses its own execution role, which must be granted S3 access.

The training script is using an incorrect S3 bucket name.

The SageMaker training job is not configured to use the S3 VPC endpoint.

Why: The IAM role attached to the SageMaker notebook instance is used for interactive development, but training jobs run under a separate SageMaker execution role. Even if the notebook role has full S3 access, the training job's execution role must also have explicit S3 permissions. The 'AccessDenied' error indicates that the execution role lacks the necessary s3:GetObject or s3:ListBucket actions for the S3 bucket.

A machine learning engineer is deploying a model using AWS Lambda for real-time inference. The model is a scikit-learn RandomForestClassifier with 100 trees, serialized as a pickle file of 150 MB. The Lambda function has 3 GB memory allocated. However, the inference requests are timing out after 30 seconds. What is the most likely cause?

scikit-learn is not compatible with AWS Lambda.

The Lambda function does not have enough memory to load the model.

The model is loaded from S3 on every invocation, causing high latency.

Lambda should load the model outside the handler to reuse across invocations, but even then, cold starts with a large model are slow.

The Lambda function timeout is set too low; increase it to 5 minutes.

Why: Option C is correct because loading the model from S3 inside the handler repeats the work on every invocation. Each request must download the 150 MB pickle file over the network, deserialize it, and then run inference, which can easily exceed the 30-second timeout. The model should be loaded once outside the handler (in global scope) so that warm invocations reuse it; only cold starts then pay the loading cost.

A data scientist is using Amazon SageMaker for hyperparameter tuning. The tuning job uses a Bayesian optimization strategy. After 10 training jobs, the objective metric (validation accuracy) has plateaued at 0.85. The data scientist wants to explore more diverse hyperparameter combinations. What should the data scientist do?

Decrease the exploration weight in the tuning job configuration.

Switch to random search strategy.

Increase the exploration weight in the tuning job configuration.

Increasing exploration weight prompts the algorithm to try more diverse combinations.

Increase the number of parallel training jobs.

Why: In Bayesian optimization, the exploration weight controls the trade-off between exploring new hyperparameter regions and exploiting known good regions. Increasing this weight encourages the acquisition function to sample more diverse combinations, which can help escape a plateau. Option C is correct because it directly addresses the need for greater diversity in the search space.

An IAM policy is attached to a SageMaker execution role. A data scientist tries to create a training job using a custom algorithm stored in an ECR repository. The training job fails with an 'AccessDenied' error when pulling the Docker image from ECR. What is the missing permission?

ecr:GetDownloadUrlForLayer and ecr:BatchGetImage on the ECR repository

These permissions are required to pull a Docker image from ECR.

ecr:PutImage on the ECR repository

s3:GetObject on the ECR repository

sagemaker:CreateTrainingJob on the ECR resource

Why: When SageMaker pulls a custom Docker image from ECR during training job creation, the execution role needs permissions to download the image layers. The required actions are ecr:GetDownloadUrlForLayer (to generate pre-signed URLs for each layer) and ecr:BatchGetImage (to retrieve image metadata and layer manifests). Without these, the 'AccessDenied' error occurs because SageMaker cannot authenticate or fetch the container image from the ECR repository.


Domain 3: Modeling


A data scientist is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (99% negative class, 1% positive class). The model currently achieves 99% accuracy but fails to detect most positive cases. Which metric should the data scientist primarily use to evaluate model performance?

ROC AUC

F1 score

F1 score balances precision and recall, suitable for imbalanced data.

Recall

Accuracy

Why: In highly imbalanced datasets (99% negative, 1% positive), accuracy is misleading because a model can achieve 99% accuracy by simply predicting the majority class for all instances, failing to detect any positive cases. The F1 score (option B) is the harmonic mean of precision and recall, providing a balanced measure that penalizes models that trade off recall for precision or vice versa. This makes it the primary metric for evaluating binary classification performance on imbalanced data, as it directly reflects the model's ability to correctly identify positive cases while minimizing false positives.
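The gap between the two metrics can be checked with a small, self-contained sketch. The labels below are made up to mirror the 99%/1% split in the question, and the "model" simply predicts the negative class every time.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


y_true = [1] * 1 + [0] * 99   # 1% positive, 99% negative
y_pred = [0] * 100            # always predict the majority class
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```

Accuracy stays high because the negatives dominate, while F1 drops to zero because no positive case is found.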

A team is building a product recommendation system using matrix factorization in Amazon SageMaker. They notice that the model's training loss decreases steadily but validation loss starts increasing after 5 epochs. What is the most likely cause?

Underfitting

Not enough training data

Learning rate too high

Overfitting

The model is memorizing the training data.

Why: In matrix factorization for recommendation systems, a decreasing training loss with an increasing validation loss after several epochs is a classic sign of overfitting. The model is memorizing the training data (including noise) rather than learning generalizable patterns, which degrades its performance on unseen validation data.

A company is using Amazon SageMaker to train a deep learning model on a large dataset. The training job is taking too long. The team wants to reduce training time without changing the model architecture. Which action should they take?

Increase the learning rate by a factor of 10

Use SageMaker's distributed training with multiple instances

Distributed training parallelizes the workload.

Reduce the number of epochs

Reduce the batch size

Why: SageMaker's distributed training with multiple instances splits the dataset and model computations across several machines, enabling parallel processing that significantly reduces wall-clock training time. This approach leverages data parallelism or model parallelism without altering the model architecture, directly addressing the need for faster training.

A data scientist is deploying a regression model in Amazon SageMaker that predicts housing prices. The model shows high bias (underfitting). Which action is most likely to reduce bias?

Reduce the amount of training data

Increase regularization strength

Use a simpler model

Add more features or increase model complexity

More complex models can capture patterns better.

Why: High bias (underfitting) means the model is too simple to capture the underlying patterns in the data. Adding more features or increasing model complexity (e.g., using polynomial features, deeper trees, or a more flexible algorithm) directly addresses underfitting by giving the model greater capacity to learn from the data. In Amazon SageMaker, this could involve using a more complex built-in algorithm like XGBoost with deeper trees or adding feature engineering transformations in a processing job.

A machine learning engineer is training a neural network on Amazon SageMaker using a custom Docker container. The training job fails with an error: 'CUDA out of memory.' The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model and data fit into memory at batch size 32 but not at larger batch sizes, and the engineer wants to maximize GPU utilization. Which approach should the engineer use to fix the out-of-memory error while maintaining efficient training?

Enable mixed precision training

Reduce batch size to 1

Use a CPU-only instance

Implement gradient accumulation with a larger effective batch size

Accumulates gradients over smaller batches to simulate larger batches.

Why: Gradient accumulation allows the engineer to simulate a larger effective batch size by accumulating gradients over multiple forward/backward passes before performing an optimizer step. This keeps the per-step memory footprint low (avoiding CUDA out-of-memory) while maintaining training dynamics similar to a larger batch, thus maximizing GPU utilization without crashing.
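The equivalence can be sketched without a deep learning framework. The example below uses a one-parameter linear model with mean squared error, which is an illustrative assumption rather than the engineer's network, and shows that gradients accumulated over micro-batches match the gradient of the full batch.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, weighting each by its share of the batch."""
    total = len(xs)
    grad = 0.0
    for start in range(0, total, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        grad += mse_gradient(w, mx, my) * len(mx) / total
    return grad  # an optimizer step would be taken only after this loop


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = mse_gradient(0.5, xs, ys)
accum = accumulated_gradient(0.5, xs, ys, micro_batch_size=2)
print(full, accum)
```

Only one micro-batch needs to be held in memory at a time, which is why the per-step footprint stays small.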

A data scientist is training a deep learning model using Amazon SageMaker. The training loss is decreasing, but the validation loss starts increasing after 10 epochs. The model is overfitting. Which TWO actions should the data scientist take to reduce overfitting? (Choose 2.)

Increase the number of layers

Remove L2 regularization

Increase the number of training steps

Add dropout layers

Dropout regularizes by randomly dropping neurons.

Add early stopping based on validation loss

Stops training when validation loss stops improving.

Why: Options D and E are correct. Dropout layers randomly deactivate a fraction of neurons during training, which forces the network to learn more robust features and reduces co-adaptation, a common cause of overfitting. Early stopping based on validation loss halts training once the validation loss stops improving, so the model does not keep fitting noise in the training data. Both techniques are effective for deep learning models trained on SageMaker, where large architectures can quickly memorize training data.
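Early stopping can be sketched as a patience counter over validation losses. The function name, the patience value, and the loss values below are illustrative assumptions, not part of any SageMaker API.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]  # made-up validation losses
print(early_stopping_epoch(losses, patience=2))  # → 5
```

In practice the weights from the best epoch (epoch 3 here) are usually the ones kept.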


Domain 4: Exploratory Data Analysis

A data scientist is exploring a dataset of customer transactions. The dataset has 1 million rows and 50 columns. The target variable is a binary flag indicating whether a customer churned. The data scientist runs a correlation matrix on all numerical features and finds that two features have a correlation coefficient of 0.98. Which action should be taken to improve model performance?

Create an interaction term between the two features.

Remove one of the two highly correlated features from the dataset.

Removing one feature eliminates multicollinearity, simplifying the model and improving interpretability.

Increase the regularization parameter (e.g., lambda) in the model.

Apply mean-centering to both features to reduce correlation.

Why: Two features with a correlation coefficient of 0.98 are nearly perfectly multicollinear. This inflates the variance of coefficient estimates in linear models, making them unstable and reducing interpretability. Removing one of the highly correlated features is a standard dimensionality reduction technique that mitigates multicollinearity without significant information loss, as the remaining feature captures almost the same variance.

A team is building a regression model to predict house prices. The dataset includes a column 'zip_code' with 100 unique values. The data scientist one-hot encodes this column, resulting in 100 new binary columns. The model shows poor performance on a validation set. What is the most likely cause?

One-hot encoding introduced multicollinearity among the binary columns.

One-hot encoding reduced the number of features, causing underfitting.

The one-hot encoding introduced high variance, but the validation set has low variance.

The model suffers from the curse of dimensionality due to the large number of features.

With 100 additional sparse features, the model may overfit and not generalize well.

Why: Option D is correct. One-hot encoding 'zip_code' with 100 unique values adds 100 sparse binary features, and each zip code is represented by only a fraction of the rows. With that many sparse dimensions and few examples per category, the model can overfit the training data and generalize poorly to the validation set, which is the practical effect of the curse of dimensionality. One-hot encoding increases rather than reduces the number of features, so option B does not apply.

During exploratory data analysis, a data scientist plots the distribution of a numerical feature and observes a heavy right skew. The feature has many outliers at the high end. Which transformation is most appropriate to reduce skewness?

Apply a log transformation to the feature.

Log transformation compresses high values and can make the distribution more symmetric.

Apply z-score normalization.

Apply one-hot encoding.

Apply min-max scaling.

Why: A log transformation compresses the range of the data, reducing the impact of extreme values and pulling in the long tail of a right-skewed distribution. This makes the feature more normally distributed, which is often required for linear models and many statistical tests. It is the standard technique for handling positive-valued features with heavy right skew.
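A short sketch shows the effect on a made-up right-skewed sample. The skewness helper computes the standard moment-based coefficient; the data values are invented for illustration.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 3, 4, 5, 10, 100, 1000]     # heavy right tail
logged = [math.log(v) for v in raw]      # requires strictly positive values
print(round(skewness(raw), 2), round(skewness(logged), 2))
```

The transformed sample is still right-skewed, but noticeably less so; for features containing zeros, log(1 + x) is a common variant.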

A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?

The imputation will introduce bias if the missing values are not random.

Imputation using median is computationally expensive for large datasets.

The imputed values may reduce the variance of the 'age' distribution.

Replacing missing values with a constant reduces the variability of the feature.

The imputed values will increase the variance of the feature, leading to overfitting.

Why: Imputing missing values with the median of the observed data artificially concentrates imputed values around the center of the distribution. This reduces the overall variance of the 'age' column because the imputed values do not reflect the natural spread of the data, potentially distorting downstream analyses like regression or clustering that rely on variance structure.
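The variance reduction is easy to see on a toy column. The ages below are made up, with None marking the missing entries.

```python
import statistics

ages = [20, 30, None, 40, 50, None, 60]  # hypothetical 'age' column

observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)
imputed = [median_age if a is None else a for a in ages]

var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)
print(var_observed, var_imputed)
```

Every imputed value sits exactly at the center of the distribution, so the spread of the filled-in column is smaller than the spread of the observed values.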

A data scientist is exploring a dataset with 500 features and 10,000 samples. The data scientist computes the pairwise correlation matrix and finds that many features have correlations above 0.9. The data scientist wants to reduce the dataset to 50 features while preserving as much variance as possible. Which technique should be used?

Remove all but one feature from each group of highly correlated features.

Apply Principal Component Analysis (PCA) and keep the top 50 principal components.

PCA finds orthogonal directions of maximum variance and can reduce dimensionality effectively.

Use Linear Discriminant Analysis (LDA) to project to 50 dimensions.

Use t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce to 50 dimensions.

Why: Principal Component Analysis (PCA) is the correct technique because it performs an orthogonal linear transformation that projects the original 500 features into a new coordinate system where the axes (principal components) are ordered by the variance they capture. By keeping the top 50 principal components, the data scientist retains the maximum possible variance in the reduced 50-dimensional space, directly addressing the goal of preserving variance while handling high multicollinearity.

A data scientist is performing exploratory data analysis on a dataset with 10,000 rows and 20 features. The target variable is binary. The data scientist observes that one feature has 15% missing values. Which TWO actions are appropriate to handle this missing data? (Choose TWO.)

Replace missing values with the mode of the feature.

Identify and remove outliers from the feature.

Use multiple imputation to fill in the missing values.

Multiple imputation creates several plausible imputed datasets and combines results.

Delete all rows that contain missing values for this feature.

If missingness is random and 15% is acceptable, listwise deletion is straightforward.

Drop the entire feature from the dataset.

Why: Options C and D are correct. Multiple imputation is a robust statistical technique that accounts for uncertainty in missing values by creating multiple complete datasets, analyzing each, and pooling results. It preserves the sample size and avoids the bias that simpler methods might introduce. Deleting the rows with missing values (listwise deletion) is also acceptable when the values are missing at random and losing 15% of the 10,000 rows is tolerable, because the remaining rows still form a large, unbiased sample.


Frequently asked questions

How many questions are on the MLS-C01 exam?

The MLS-C01 exam has 65 questions and must be completed in 180 minutes. The passing score is 750/1000.

What types of questions appear on the MLS-C01 exam?

Scenario-based questions covering exam objectives with detailed answer explanations.

How are MLS-C01 questions organised by domain?

The exam covers 4 domains: Data Engineering, Machine Learning Implementation and Operations, Modeling, Exploratory Data Analysis. Questions are weighted by domain — higher-weight domains appear more on your actual exam.

Are these the actual MLS-C01 exam questions?

No. These are original exam-style practice questions written against the official Amazon Web Services MLS-C01 exam objectives. They are not copied from the real exam. Courseiva focuses on genuine understanding, not memorisation of braindumps.


Amazon Web Services · Free Practice Questions · Last reviewed May 2026


24 real exam-style questions organised by domain, each with the correct answer highlighted and a plain-English explanation of why it's right and why the others are wrong.

65 exam questions

180 min time limit

Pass: 750/1000

4 exam domains
