How many Modeling questions are on the MLS-C01 exam?

Modeling is Domain 3 in the MLS-C01 exam guide and carries the largest weight of the four domains, 36% of scored content. The Courseiva question bank has 300 practice questions for this domain.

How can I practice Modeling questions for MLS-C01?

Click any of the 300 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Modeling domain.

Free MLS-C01 Modeling Practice Questions (2026)

What does the Modeling domain cover on the MLS-C01 exam?

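The Modeling domain (Domain 3 in the MLS-C01 exam guide published by Amazon Web Services) covers framing business problems as machine learning problems, selecting appropriate models, training models, performing hyperparameter optimization, and evaluating models.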

MLS-C01 Modeling questions (showing 300 of 624)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A data scientist is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (99% negative class, 1% positive class). The model currently achieves 99% accuracy but fails to detect most positive cases. Which metric should the data scientist primarily use to evaluate model performance?
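As a rough illustration of the scenario above (not the official explanation for this question), the sketch below scores a classifier that always predicts the negative class on a 99%/1% split. The 1,000-sample size and the helper name `confusion_counts` are assumptions made for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical data: 990 negatives and 10 positives (a 99%/1% split).
y_true = [0] * 990 + [1] * 10
# A degenerate model that predicts the negative class for every sample.
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positive samples.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.99 recall=0.00
```

The model reaches 99% accuracy while finding none of the positive cases, which is why accuracy on its own says little about performance on the minority class.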

A team is building a product recommendation system using matrix factorization in Amazon SageMaker. They notice that the model's training loss decreases steadily but validation loss starts increasing after 5 epochs. What is the most likely cause?

A company is using Amazon SageMaker to train a deep learning model on a large dataset. The training job is taking too long. The team wants to reduce training time without changing the model architecture. Which action should they take?

A data scientist is deploying a regression model in Amazon SageMaker that predicts housing prices. The model shows high bias (underfitting). Which action is most likely to reduce bias?

A machine learning engineer is training a neural network on Amazon SageMaker using a custom Docker container. The training job fails with an error: 'CUDA out of memory.' The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model and data fit into memory when using batch size 32, but the engineer wants to maximize GPU utilization. Which approach should the engineer use to fix the out-of-memory error while maintaining efficient training?

A data scientist is training a deep learning model using Amazon SageMaker. The training loss is decreasing, but the validation loss starts increasing after 10 epochs. The model is overfitting. Which TWO actions should the data scientist take to reduce overfitting? (Choose 2.)

A company is using Amazon SageMaker to tune hyperparameters for a gradient boosting model. The objective is to minimize root mean squared error (RMSE). The data scientist wants to explore the hyperparameter space efficiently. Which THREE hyperparameter tuning strategies should the data scientist consider? (Choose 3.)

A data scientist is training a binary classifier on an imbalanced dataset where the positive class represents 1% of the data. The model currently achieves 99% accuracy but a recall of only 10% on the positive class. Which metric combination should the data scientist prioritize to evaluate model improvements?

An e-commerce company uses a linear regression model to predict customer lifetime value (LTV). The model shows high variance on the test set, with training RMSE much lower than test RMSE. Which of the following is the MOST effective approach to reduce overfitting?

A company wants to use Amazon SageMaker to train a deep learning model using a custom TensorFlow script. The data is stored in an S3 bucket. Which SageMaker API operation should be used to launch the training job?

A data scientist is building a multi-class classification model with 10 classes. The dataset has 100,000 samples. After training a random forest with 100 trees, the model achieves 85% accuracy on the test set. However, the data scientist notices that for one rare class (1% of data), recall is only 5%. Which technique is MOST likely to improve recall for the rare class without significantly reducing overall accuracy?

A company uses an XGBoost model to predict equipment failures. The model has high precision but low recall. The business impact of a false negative is very high (missing a failure). Which action would MOST effectively increase recall while keeping precision reasonably high?

Which TWO metrics are MOST appropriate for evaluating a regression model that predicts house prices, where the business is most sensitive to large errors?

Which THREE techniques can help reduce overfitting in a neural network trained on a small dataset?

A data scientist runs a SageMaker training job that fails with the above error. The S3 bucket and object exist, and the IAM role has s3:GetObject permission. What is the MOST likely cause?

A data scientist is trying to run a SageMaker training job that writes output to an S3 bucket 'my-bucket'. The IAM policy is shown. The training job fails with an AccessDenied error when trying to write to S3. What is the reason?

A data scientist is training a binary classifier to predict customer churn. The dataset has 10,000 samples, with 500 churners (positive class). The scientist trains a logistic regression model and obtains an F1-score of 0.6. To improve the F1-score, which approach is MOST likely to be effective?

A company is deploying a real-time fraud detection system using a gradient boosting model on Amazon SageMaker. The model uses 200 features and is trained on 50 GB of data. The inference latency requirement is under 10 ms per request. During load testing, the endpoint shows average latency of 15 ms. Which change is MOST likely to reduce latency below 10 ms?

A machine learning team is training a deep learning model on Amazon SageMaker and notices that the training loss is decreasing but the validation loss is increasing. What is the most likely cause?

A company is building a recommendation system for an e-commerce platform. They have user-item interaction data (clicks, purchases) and want to use matrix factorization. They plan to use Amazon SageMaker to train the model. Which dataset format is MOST appropriate for the built-in Factorization Machines algorithm?

A data scientist is tuning a gradient boosting model using Amazon SageMaker's Automatic Model Tuning (hyperparameter optimization). The objective metric is validation:auc. After 50 training jobs, the best model still has a validation AUC of only 0.65. The scientist suspects overfitting because the training AUC is 0.99. Which hyperparameter configuration is MOST likely to reduce overfitting?

A data scientist is training a neural network for image classification. The training loss is not decreasing significantly, and the validation loss is high. Which TWO actions should the scientist take to address potential vanishing gradients?

A company is using Amazon SageMaker to train a large language model. The training job is taking too long. The data scientist wants to reduce training time without sacrificing model accuracy. Which THREE strategies are MOST appropriate?

A data scientist is training a binary classification model on a dataset with a severe class imbalance (95% negative, 5% positive). The model achieves 95% accuracy but only correctly identifies 10% of the positive class. Which metric should the data scientist use to evaluate model performance?

A company is building a recommendation system for an e-commerce platform. They have user-item interaction data and want to use matrix factorization. However, the dataset is sparse (99% missing interactions). Which approach should the data scientist take to train the model effectively?

A financial services company is building a model to detect fraudulent credit card transactions. The dataset contains 1 million transactions, with only 0.1% labeled as fraud. The data scientist trains a logistic regression model on the raw dataset and obtains the following results on a held-out test set: accuracy = 99.8%, precision = 50%, recall = 60%, F1 = 0.545. The business requirement is to maximize recall while keeping precision above 80%. Which course of action should the data scientist take to improve the model?
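The F1 value quoted in the question above can be checked from the stated precision and recall. The following is a minimal sketch of that arithmetic; the function name `f1_score` is chosen here for illustration and is not tied to any particular library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values taken from the question: precision = 50%, recall = 60%.
f1 = f1_score(0.50, 0.60)
print(round(f1, 3))  # 0.545
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, so raising precision toward the 80% requirement would move F1 more than an equal gain in recall.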

A research team is developing a deep learning model to classify medical images into 10 disease categories. They have a dataset of 50,000 labeled images, but the class distribution is highly imbalanced: the most common class has 20,000 images, while the rarest class has only 200 images. To address this, they apply data augmentation (random rotations, flips, and brightness adjustments) to the minority classes until each class has 20,000 images. They then train a convolutional neural network (CNN) from scratch using cross-entropy loss. The model achieves 95% overall accuracy but only 30% recall on the rarest class. Which change is MOST likely to improve recall on the rarest class without significantly reducing overall accuracy?

A data scientist is training a linear regression model to predict house prices. The dataset includes features such as square footage, number of bedrooms, and location. After training, the model achieves an R² of 0.85 on the training set but only 0.60 on the test set. Which of the following is the MOST likely cause of this discrepancy?

A machine learning team is building a multi-class image classifier using a pre-trained ResNet-50 model in Amazon SageMaker. The dataset has 10 classes but is highly imbalanced, with one class representing 80% of the samples. The team wants to improve model performance on the minority classes. Which TWO of the following approaches are most likely to help? (Select TWO.)

A financial services company is building a fraud detection model using a large dataset of credit card transactions. The dataset contains 10 million rows with 50 features, including transaction amount, merchant category, time of day, and customer historical features. The label is binary: fraudulent (1% of data) or legitimate. The company wants to deploy a real-time inference endpoint using Amazon SageMaker that can score transactions with sub-100ms latency. The current model is a gradient boosting model (XGBoost) trained on a sample of 1 million rows due to memory constraints. The model achieves 0.95 AUC on a held-out test set but the fraud recall (sensitivity) is only 0.4, which is unacceptable because the cost of missing a fraud is high. The data science team has access to a larger compute instance (ml.m5.24xlarge) for training. Which course of action is most likely to improve fraud recall while maintaining latency requirements?

Drag and drop the steps to deploy a model as a SageMaker endpoint for real-time inference in the correct order.

Drag and drop the steps to use Amazon SageMaker Debugger to debug a training job in the correct order.

Match each hyperparameter tuning strategy to its description.

Match each AWS AI service to its capability.

A data scientist is training a binary classification model on a highly imbalanced dataset where the positive class represents only 1% of the data. Which metric should be used to evaluate model performance during training to ensure the model is learning to detect the positive class?

A machine learning team is deploying a model that performs real-time inference on streaming data from Amazon Kinesis Data Streams. The model requires sub-100ms latency. Which deployment option should the team choose?

A data scientist is training a deep learning model on a large dataset using Amazon SageMaker. The training job is taking too long. The scientist notices that GPU utilization is low and data loading is the bottleneck. Which action should the scientist take to improve training performance?

A data scientist is building a regression model to predict house prices. The dataset contains many features, some of which are highly correlated. The model is overfitting. Which regularization technique should the scientist use to penalize large coefficients and perform feature selection?

A company uses Amazon SageMaker to train a classification model. The training job fails with an error indicating that the algorithm requires a GPU but the instance type does not have one. The scientist used the built-in XGBoost algorithm. What should the scientist do to resolve the issue?

A data scientist is working on a multi-class classification problem with 10 classes. The model outputs probabilities and the scientist wants to evaluate the model's ability to rank classes correctly. Which metric is most appropriate?

A machine learning engineer is deploying a model using Amazon SageMaker and wants to automatically scale the endpoint based on the number of incoming requests. Which scaling policy should be used?

A data scientist is training a neural network on a dataset with 1 million images. The training loss decreases steadily but the validation loss starts to increase after 10 epochs. Which action should the scientist take to improve generalization?

A company uses Amazon SageMaker to train a model using the built-in Linear Learner algorithm. The training data contains missing values in some features. What is the best practice for handling missing values with this algorithm?

Which TWO metrics are appropriate for evaluating a binary classification model when the cost of false negatives is high?

Which THREE techniques are effective for reducing overfitting in a deep neural network?

Which TWO actions are best practices for tuning hyperparameters using Amazon SageMaker Automatic Model Tuning?

Refer to the exhibit. A data scientist is assigned an IAM policy to deploy a SageMaker model. When the scientist tries to create an endpoint, the action fails with an authorization error. What is the missing permission?

Refer to the exhibit. A data scientist runs the above AWS CLI command to create a SageMaker training job using the built-in Linear Learner algorithm. The training job fails with an error. What is the most likely cause?

Refer to the exhibit. A data scientist is configuring SageMaker Model Monitor for data quality checks. The configuration above is used. What is the purpose of the `ProbabilityThresholdAttribute` set to "0.5"?

A data scientist is training a binary classifier using logistic regression on a dataset that is highly imbalanced (95% negative class, 5% positive class). The model achieves 95% accuracy but only predicts the negative class. Which metric should the scientist use to evaluate the model's performance on the positive class?

A team is training a deep learning model on Amazon SageMaker using a large dataset stored in S3. The training job is taking a long time, and the team suspects that data loading is the bottleneck. The dataset consists of many small files (average size 10KB). Which change would most effectively reduce the I/O bottleneck?

A company is using Amazon SageMaker to deploy a machine learning model for real-time inference. The model was trained using XGBoost and achieves high accuracy. However, during deployment, the endpoint returns a 'ModelError' when receiving input data. The input is a CSV string. What is the most likely cause?

A data scientist is using Amazon SageMaker to perform hyperparameter tuning for a neural network. The tuning job uses the 'Random' search strategy. After 10 training jobs, the best objective metric has plateaued. The scientist wants to improve the results without increasing the total number of training jobs. Which approach should they take?

A company is building a recommendation system using collaborative filtering on Amazon SageMaker. The dataset contains user-item interactions with a long-tail distribution: a few items have millions of interactions, while most items have very few. The model currently uses matrix factorization with ALS. The recall@20 metric is low for niche items. Which modification would most likely improve recall for long-tail items?

A data scientist is training a random forest model on a dataset with 50 features. After training, the model achieves 98% accuracy on the training set but only 85% on the test set. Which technique is most appropriate to reduce the generalization error?

A company is using Amazon SageMaker to deploy a model that predicts customer churn. The model was trained using a linear learner algorithm. During inference, the endpoint returns predictions that are always 0.5 (the probability of churn). What is the most likely cause?

A data scientist is using Amazon SageMaker to train a deep learning model for image classification. The training job is using a single GPU instance and is taking too long. The scientist wants to reduce training time without sacrificing model accuracy. The dataset contains 100,000 images of size 256x256. Which change would most effectively reduce training time?

A company uses Amazon SageMaker to host a model for real-time predictions. The model endpoint is experiencing high latency during peak hours. The data scientist wants to reduce latency without increasing cost. Which action should they take?

A data scientist is training a linear regression model on a dataset with 10 numerical features. After training, the model's R-squared value is 0.99 on the training set but only 0.60 on the test set. Which TWO of the following are appropriate actions to reduce overfitting? (Choose TWO.)

A company uses Amazon SageMaker to build a text classification model using a pre-trained BERT model. The dataset contains 10,000 labeled documents. The model is overfitting: training accuracy is 99%, validation accuracy is 85%. Which TWO of the following are most likely to help reduce overfitting? (Choose TWO.)

A data scientist is training a k-means clustering model on a dataset with 1,000 points. The scientist uses the elbow method to choose the number of clusters. The elbow plot shows a clear bend at k=4. After running k-means with k=4, the scientist wants to evaluate the quality of the clustering. Which THREE of the following are suitable internal clustering validation metrics? (Choose THREE.)

A data scientist is training a binary classification model on imbalanced data (95% negative, 5% positive). The model achieves 95% accuracy on the test set but fails to detect any positive cases. Which metric should the scientist focus on to evaluate model performance?

A company is using Amazon SageMaker to train a deep learning model. The training job is taking a long time, and the data scientist wants to reduce training time without sacrificing accuracy. Which technique should they use?

A machine learning engineer is deploying a model for real-time inference using Amazon SageMaker. The model is a large ensemble that requires 8 GB of memory and 4 vCPUs. The expected traffic is 100 requests per second with a 200 ms latency requirement. Which instance configuration should they choose?

A data scientist is using Amazon SageMaker to train a linear regression model. The training data has 10 features, and the scientist wants to interpret the model's coefficients. Which algorithm should they use?

A company is building a multiclass classification model using Amazon SageMaker. The dataset has 100 classes and is highly imbalanced. The model currently achieves high accuracy on the majority classes but poor performance on minority classes. Which technique should the data scientist use to improve minority class performance?

A data scientist is training an LSTM model for time series forecasting using Amazon SageMaker. The model is overfitting. Which action is LEAST likely to reduce overfitting?

A company is using Amazon SageMaker to deploy a model for real-time inference. The model requires 500 MB of memory and has a latency requirement of 100 ms. The endpoint is receiving 10 requests per second. Which instance type should be chosen for cost-effectiveness?

A data scientist is using Amazon SageMaker to train a model. The training job fails with an error 'Insufficient instance capacity'. Which action should the scientist take to resolve this?

A machine learning engineer is training a model using Amazon SageMaker. The training data is stored in S3 and is 10 TB. The engineer wants to use Pipe input mode to stream data from S3. Which algorithm supports Pipe mode?

A data scientist is training a model using Amazon SageMaker. The training job is running on GPU instances, but the GPU utilization is low. Which TWO actions could improve GPU utilization?

A company is using Amazon SageMaker to deploy a model for real-time inference. The model is a deep neural network that requires GPU for low latency. The endpoint currently uses a single ml.p3.2xlarge instance. Traffic is expected to increase by 5x. Which TWO actions should the company take to handle the increased traffic?

A data scientist is performing feature engineering for a machine learning model. The dataset contains categorical features with high cardinality. Which THREE techniques are appropriate for encoding high-cardinality categorical features?

Refer to the exhibit. A training job failed with the error shown. What is the most likely cause?

Refer to the exhibit. A data scientist is trying to create a SageMaker training job but receives an access denied error. The IAM policy shown is attached to the user. What is the likely issue?

Refer to the exhibit. A data scientist checks the status of a SageMaker endpoint and sees the output above. What does this indicate?

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents 5% of the data. The model achieves 95% accuracy but only identifies 10% of the actual positive cases. Which metric should the data scientist focus on to evaluate the model's performance on the positive class?

A machine learning team is building a model to predict customer churn. They have a dataset with 10,000 samples and 50 features, including categorical variables with high cardinality (e.g., ZIP code). Which feature engineering technique is most appropriate to reduce dimensionality while preserving predictive information?

An ML team trains a deep learning model using Amazon SageMaker with a custom Docker container. Training completes successfully, but the model's accuracy on the test set is significantly lower than expected. The team suspects overfitting. Which TWO actions should they take to mitigate overfitting? (Choose TWO.)

A data scientist is using Amazon SageMaker to train a linear regression model. The training job fails with the error: 'AlgorithmError: Input data has NaN values'. Which step should the data scientist take to resolve this issue?

A company is building a recommendation system using collaborative filtering. The dataset contains implicit feedback (clicks) from users on items. Which algorithm is best suited for this scenario?

A machine learning engineer is using Amazon SageMaker to deploy a model for real-time inference. The model is a large ensemble that requires 4 GB of memory and has a latency requirement of 100 ms. Which instance type and deployment configuration should the engineer choose to optimize cost while meeting requirements?

A data scientist is training a gradient boosting model using SageMaker's built-in XGBoost algorithm. The model is overfitting on the training data. Which hyperparameter adjustment is most likely to reduce overfitting?

A company uses Amazon SageMaker to train a model for fraud detection. The dataset has 1 million samples with 200 features. The data is highly imbalanced (0.1% fraud). The team wants to use a random forest model. Which technique should they use to handle the class imbalance during training?

A machine learning team is developing a model to predict housing prices. They have a dataset with numerical features like square footage and number of bedrooms, and categorical features like neighborhood. Which preprocessing step is essential before training a linear regression model?

A data scientist is using Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is taking too long. The data scientist wants to reduce training time without changing the model architecture. Which action should they take?

A machine learning engineer is evaluating a classification model that predicts whether a transaction is fraudulent. The model outputs a probability score. The cost of a false negative (missed fraud) is 10 times higher than the cost of a false positive (false alarm). Which TWO evaluation metrics should the engineer use to tune the model? (Choose TWO.)

A data scientist is building a text classification model using a bag-of-words approach. The dataset contains 100,000 documents with a vocabulary of 50,000 unique words. The model is overfitting. Which THREE techniques can help reduce overfitting? (Choose THREE.)

A company is deploying a machine learning model using Amazon SageMaker. The model requires GPUs for inference. Which THREE configurations can the company use to meet this requirement? (Choose THREE.)

A data scientist wants to build a binary classifier to predict customer churn. The dataset has 10,000 records with 500 churners (5%). Which technique should the data scientist use to address class imbalance?

A company is training a deep learning model on SageMaker using a large dataset stored in S3. The training job is taking a long time due to I/O bottlenecks. Which action would MOST effectively reduce the I/O bottleneck?

A data scientist is tuning a gradient boosting model using SageMaker automatic model tuning. The hyperparameter 'num_round' ranges from 50 to 500. The tuning job uses 'ObjectiveMetric' = 'validation:auc'. After 50 training jobs, the best objective value is 0.95. The data scientist suspects overfitting. What should the data scientist do?

A company uses Amazon SageMaker to train a linear regression model. The training data includes a feature 'age' with values ranging from 0 to 100. The model's loss is not converging. What is the MOST likely cause?

A machine learning engineer is deploying a PyTorch model on SageMaker for real-time inference. The model requires GPU for low latency. Which instance type and configuration should the engineer choose?

A data scientist is using SageMaker to train a random forest model. The dataset has 100 features and 1 million rows. The training job fails with a 'ResourceLimitExceeded' error. What is the MOST likely cause?

A company wants to use Amazon SageMaker to automatically tune hyperparameters for an XGBoost model. Which built-in SageMaker feature should be used?

A data scientist trains a model using SageMaker and notices that the training loss decreases but validation loss increases after a few epochs. What is the MOST likely issue?

A company is using SageMaker to train a deep learning model with TensorFlow. The training job is running on an ml.p3.16xlarge instance. The data scientist wants to maximize GPU utilization. Which configuration should be used?

A data scientist is building a binary classifier using logistic regression. The dataset has 10 features and 100,000 observations. The model achieves 99% accuracy on the test set, but the precision is 50% and recall is 90%. Which TWO actions should the data scientist take to improve model performance? (Choose 2.)

A company is deploying a machine learning model for real-time fraud detection. The model must have extremely low latency (<10 ms) and high throughput. Which THREE design choices meet these requirements? (Choose 3.)

A data scientist is training a deep learning model on SageMaker using a custom container. The training job fails with an 'OutOfMemory' error. Which THREE actions could resolve this issue? (Choose 3.)

A data scientist is training a binary classification model on a dataset with 100 features and 10,000 samples. The model achieves 99% accuracy on the training set but only 65% on the test set. Which technique should be applied first to address this issue?

A machine learning team needs to deploy a model that makes real-time predictions with latency under 100 ms. The model is a deep neural network with 500 MB of parameters. Which AWS service should they use?

A data scientist is building a time series forecasting model for daily sales data. The data exhibits strong seasonality with a weekly pattern and a yearly trend. The scientist wants to use Amazon SageMaker's built-in algorithm. Which algorithm is most appropriate?

A company is training a large language model on Amazon SageMaker using a single GPU instance. The training is taking too long. Which change would most likely reduce training time?

A data scientist is using Amazon SageMaker to train a linear regression model. After training, the scientist notices that the model has a high bias. What is the most likely cause?

A team is deploying a model for fraud detection. The dataset is highly imbalanced (99% legitimate, 1% fraudulent). They trained a logistic regression model and achieved 99% accuracy on the test set. However, the model fails to detect most fraud cases. Which metric should the team focus on to evaluate the model?

A company wants to use Amazon SageMaker to train a model using a custom algorithm packaged in a Docker container. Which approach should they use?

A data scientist is training a neural network on Amazon SageMaker and wants to automatically stop training if the validation loss does not improve for 5 consecutive epochs. Which feature should they use?

A team is building a model to predict customer churn. They have 50 features, including categorical variables with high cardinality (e.g., zip code with 10,000 unique values). Which feature engineering technique is most appropriate?

A data scientist is training a random forest classifier on Amazon SageMaker and wants to reduce overfitting. Which TWO actions should the scientist take? (Choose TWO.)

A company is using Amazon SageMaker to deploy a model for real-time inference. The model takes 200 ms to respond, but the requirement is 100 ms. Which THREE actions could reduce latency? (Choose THREE.)

A data scientist is performing feature selection for a linear regression model. Which TWO methods are appropriate? (Choose TWO.)

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance. The data scientist runs a training job that reads from s3://my-bucket/training-data/ and writes to s3://my-bucket/output/. The training job fails with an access denied error. What is the most likely cause?

Refer to the exhibit. A custom training job using Pipe input mode fails. The logs indicate the algorithm cannot read the data. What is the most likely issue?

Refer to the exhibit. A data scientist wants to update the endpoint to use a new model image. The scientist updates the endpoint configuration with the new image and calls UpdateEndpoint. After the update, the endpoint status is 'Updating' but remains in that state for a long time. What is the most likely cause?

A data scientist is training a binary classification model on a dataset with 1,000 positive samples and 100,000 negative samples. The model achieves 99% accuracy on the test set but a very low F1 score. What is the most likely cause?

A machine learning engineer needs to deploy a real-time inference endpoint for a model that requires GPU acceleration for low latency. Which AWS service should be used?

A data scientist is training a deep learning model on Amazon SageMaker and notices that training is taking much longer than expected. The training job uses a single GPU instance. The model is a large transformer with millions of parameters. Which change would most likely reduce training time?

A data scientist is building a regression model to predict housing prices. The dataset includes numerical features such as square footage, number of bedrooms, and year built, as well as categorical features such as neighborhood and roof type. Which TWO preprocessing steps are most important to apply before training a linear regression model?

121

A machine learning engineer is evaluating a multi-class classification model that predicts product categories. The model outputs probabilities for 10 classes. The engineer wants to improve the model's calibration so that the predicted probabilities reflect the true likelihood of each class. Which THREE techniques can help?

122

A data scientist is training a text classification model using Amazon SageMaker's built-in BlazingText algorithm. The dataset contains 1 million documents. Which TWO hyperparameters are most important to tune for improving model accuracy?

123

A company is using Amazon SageMaker to train a large language model with billions of parameters. The training job uses multiple GPU instances in a distributed fashion. The training is converging but the loss is not decreasing as expected. The data scientist suspects that the learning rate is too high. Which technique should the data scientist use to automatically adjust the learning rate during training?

124

A data scientist is building a recommendation system using collaborative filtering. The dataset contains user-item interactions in a sparse matrix. The model will be trained on Amazon SageMaker using the built-in Factorization Machines algorithm. Which data format should the scientist use for the training data?

125

A data scientist is training a binary classification model and wants to evaluate its performance using a metric that is robust to class imbalance. Which metric should be used?

126

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint for real-time inference. The model requires a preprocessing step that involves tokenizing text and converting it to a numerical format. To minimize latency, where should the preprocessing logic be implemented?

127

A data scientist is training a deep learning model for image classification on Amazon SageMaker. The dataset consists of 10,000 images of size 224x224 pixels. The training job uses a single ml.p3.2xlarge instance. The data scientist notices that the GPU utilization is very low (~20%) and the training is slow. Which change would most likely improve GPU utilization?

128

A data scientist is using Amazon SageMaker to train a model and wants to automatically stop the training job if the loss does not improve for a certain number of epochs. Which SageMaker feature can be used for this purpose?

129

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents 5% of the data. Which metric is most appropriate for evaluating model performance?

130

A team is using Amazon SageMaker to train a deep learning model for image classification. The training job is taking too long, and they want to reduce training time without sacrificing model accuracy. Which approach is most effective?

131

A machine learning engineer is tuning hyperparameters for a gradient boosting model using Amazon SageMaker Automatic Model Tuning. The objective metric is validation accuracy. After several tuning jobs, the best accuracy achieved is 0.85, but the engineer suspects the model is overfitting. Which hyperparameter adjustment is most likely to reduce overfitting?

132

A data scientist needs to choose an algorithm for a regression problem with 50 features and 1 million training examples. The model must be interpretable and the training data fits in memory. Which algorithm is most appropriate?

133

A company is building a sentiment analysis model for customer reviews. The dataset includes 10,000 positive and 10,000 negative reviews. The data scientist splits the data into 70% training, 15% validation, and 15% test sets. After training, the model achieves 99% accuracy on the training set but only 82% on the validation set. What is the most likely issue?

134

A machine learning team is deploying a real-time inference endpoint for a recommendation model using Amazon SageMaker. The model takes a long time to load (several minutes) due to its size (5 GB). Which deployment strategy minimizes the cold start latency?

135

A data scientist is training a neural network for time series forecasting. The training loss decreases initially but then starts to increase after 20 epochs. Which action should the scientist take to address this?

136

A company uses a linear regression model to predict house prices. The model's R-squared is 0.95 on the training set but 0.60 on the test set. Which of the following is the most likely cause?

137

A data scientist needs to perform feature scaling for a dataset containing numerical features with different units (e.g., age in years and income in dollars). Which scaling method is most appropriate when the algorithm assumes data is normally distributed?

138

Which TWO of the following are valid techniques for handling missing values in a dataset for machine learning?

139

Which THREE of the following are appropriate methods to reduce overfitting in a decision tree model?

140

Which TWO of the following are true about the bias-variance tradeoff?

141

A data scientist is trying to create a SageMaker training job but receives an access denied error. The IAM policy attached to the role is shown in the exhibit. What is the most likely cause of the error?

142

A data scientist ran an XGBoost training job in SageMaker and it failed with the error shown in the exhibit. Which hyperparameter change is most likely to resolve the numeric overflow?

143

A team deployed a SageMaker endpoint with the configuration shown in the exhibit. During a traffic spike, the endpoint becomes unresponsive. Which change to the endpoint configuration would best improve availability?

144

A data science team is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (95% negative class, 5% positive class). The team wants to maximize the F1 score. Which built-in SageMaker algorithm is most appropriate?

145

A machine learning engineer is using Amazon SageMaker to train a deep learning model. The training job is taking longer than expected. The engineer notices that the GPU utilization is low (around 30%) while CPU utilization is high. Which action is most likely to improve training speed?

146

A company is using Amazon SageMaker to deploy a model for real-time inference. The model receives requests with varying payload sizes. The company observes occasional latency spikes. Which feature can help mitigate this?

147

A data scientist is building a regression model to predict house prices. The dataset includes features such as square footage, number of bedrooms, and location. After training a linear regression model, the scientist notices that the residuals have a pattern: they increase as the predicted value increases. Which action is most appropriate?

148

A machine learning team is using Amazon SageMaker to train a large language model. The training script uses PyTorch and the model requires significant memory. The team wants to use model parallelism across multiple GPUs. Which SageMaker feature should they use?

149

A company is using Amazon SageMaker to deploy a model for real-time inference. The model is updated frequently. Which deployment strategy allows for zero-downtime updates and easy rollback?

150

A data scientist is using Amazon SageMaker to train a model. The training dataset is stored in S3 as CSV files. The scientist wants to use the SageMaker built-in Linear Learner algorithm. Which input mode should be used for optimal performance?

151

A machine learning team is using Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error: 'Unable to write to /opt/ml/model'. The container does not have root access. What is the most likely cause?

152

A data scientist is building a classification model to predict customer churn. The dataset has 10,000 samples with 100 features. After training a logistic regression model, the scientist observes that the model has high variance (overfitting). Which technique can reduce overfitting?

153

Which TWO of the following are valid approaches to handle missing values in a dataset for a machine learning model?

154

Which THREE of the following are best practices for training a deep learning model on Amazon SageMaker?

155

Which TWO of the following are appropriate use cases for Amazon SageMaker built-in algorithms?

156

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents 1% of the data. The model needs to maximize recall while keeping precision above 0.7. Which sampling strategy should the data scientist use?

157

A data scientist is training a gradient boosting model on a large dataset (100 GB) stored in Amazon S3. The training job uses the SageMaker built-in XGBoost algorithm with a single ml.p3.2xlarge instance. The job fails with a memory error. Which solution should the data scientist adopt to resolve the memory issue?

158

A data scientist wants to evaluate the performance of a multiclass classification model. The model outputs probabilities for 10 classes. Which metric is most appropriate for evaluating the model's ranking performance across all classes?

159

A company is building a recommendation system using Amazon SageMaker Factorization Machines. The dataset includes user IDs, item IDs, and implicit feedback (clicks). The data is sparse with millions of users and items. The model needs to capture interactions between users and items. Which hyperparameter tuning strategy should be used to improve model performance?

160

A data scientist is training a neural network on image data using TensorFlow with GPU instances on SageMaker. The training is slow because the GPU utilization is low. The data pipeline uses tf.data with a large number of preprocessing operations. Which action would most likely increase GPU utilization?

161

A data scientist is using Amazon SageMaker to train a linear regression model. The dataset has 500 features and 50,000 observations. The model converges but has high bias. Which technique should the data scientist use to reduce bias?

162

A data science team is deploying a machine learning model to production using SageMaker. The model is a PyTorch model that requires custom inference logic including image preprocessing. The team needs to ensure that the endpoint can handle variable batch sizes and has low latency. Which deployment approach should the team use?

163

A data scientist is building a fraud detection model using a highly imbalanced dataset. The model uses a random forest classifier. The recall for the minority class is 0.6, and precision is 0.9. The business requires recall above 0.8. Which action should the data scientist take to improve recall?

164

A data scientist trains a convolutional neural network (CNN) for image classification. The training loss decreases steadily, but the validation loss starts increasing after 10 epochs. Which technique should the data scientist use to address this problem?

165

A data scientist is training a gradient boosting model using SageMaker's built-in XGBoost algorithm. The dataset has missing values in several features. Which TWO actions should the data scientist take to handle missing values effectively? (Choose two.)

166

A data scientist is using Amazon SageMaker to train a deep learning model for natural language processing. The training job is taking too long to converge. The data scientist wants to speed up training without significantly sacrificing model accuracy. Which THREE strategies should the data scientist consider? (Choose three.)

167

A data scientist is evaluating a binary classification model that predicts whether a customer will churn. The model achieves an AUC of 0.85 on the test set. Which TWO statements about AUC are correct? (Choose two.)

168

A data scientist trains a linear regression model to predict house prices. The model has high bias (underfitting). Which action is most likely to reduce bias?

169

A company is building a binary classifier to detect fraudulent transactions. The dataset is highly imbalanced (99% legitimate, 1% fraudulent). Which metric is most appropriate for evaluating the model?

170

A machine learning team trains a deep learning model on SageMaker. The training job uses a single ml.p3.2xlarge instance and takes 12 hours. The team needs to reduce training time without changing the algorithm. Which approach is most effective?

171

A data scientist builds a Random Forest model using SageMaker. The model performs well on training data but poorly on test data. Which step is most likely to reduce overfitting?

172

A healthcare company needs to predict patient readmission risk using clinical notes. Which AWS service can be used to preprocess the text data into numerical features for a machine learning model?

173

A data scientist uses SageMaker Autopilot to automatically build a binary classification model. The dataset has 50 features and 100,000 rows. After the experiment, Autopilot provides multiple candidate models. Which candidate should the data scientist select to minimize inference latency for real-time predictions?

174

A company uses SageMaker to train a time-series forecasting model using Amazon Forecast. The dataset contains historical sales data for 10,000 products over 2 years. Which data format is required for the target time series?

175

A data scientist needs to implement a recommendation system for an e-commerce website. Which Amazon service is specifically designed for building and deploying recommendation models?

176

A data scientist trains a neural network using TensorFlow on SageMaker. The training job fails with a 'CUDA out of memory' error. What is the most likely cause and solution?

177

Which TWO metrics are suitable for evaluating a regression model? (Select TWO.)

178

Which THREE techniques help reduce overfitting in a neural network? (Select THREE.)

179

Which TWO services can be used to serve machine learning models for real-time inference? (Select TWO.)

180

Refer to the exhibit. An IAM policy is attached to a SageMaker execution role. A data scientist tries to create a training job that reads training data from s3://my-bucket/confidential/data.csv. What will happen?

181

Refer to the exhibit. A SageMaker training job failed with the error shown. What is the most likely cause of this error?

182

Refer to the exhibit. A data scientist creates a SageMaker model using the configuration above. When deploying the model to an endpoint, the endpoint status remains 'Creating' for a long time and then fails. What is the most likely cause?

183

A data scientist is training a binary classifier on an imbalanced dataset (95% negative, 5% positive). The model achieves 99% accuracy but only correctly identifies 2% of the positive samples. Which metric should the data scientist focus on to improve the model's performance?

184

A team is using Amazon SageMaker to train a deep learning model. The training job is taking too long, and they want to reduce training time without significant accuracy loss. They have already tried increasing the number of instances. Which technique should they consider next?

185

A machine learning engineer is deploying a model that predicts loan defaults. The model uses features like income, credit score, and debt-to-income ratio. After deployment, the model's performance degrades over time. Which concept best describes this phenomenon?

186

A data scientist is building a text classification model. The dataset contains 10,000 documents, each labeled with one of 5 categories. Which algorithm is most suitable for this task?

187

A company uses Amazon SageMaker to train a model for detecting fraudulent transactions. The dataset is highly imbalanced (99.9% legitimate, 0.1% fraudulent). Which approach is most effective to address this imbalance?

188

A team is training a neural network for image classification using Amazon SageMaker. The training loss decreases rapidly but the validation loss starts increasing after a few epochs. Which action should the team take?

189

A data scientist is using Amazon SageMaker to train a linear regression model. The target variable is right-skewed. Which transformation should the data scientist apply to the target variable to improve model performance?

190

A company wants to deploy a machine learning model that predicts customer churn. The model must provide interpretable predictions to explain why a customer is likely to churn. Which algorithm is most appropriate?

191

A data scientist is tuning hyperparameters for an XGBoost model on a large dataset using Amazon SageMaker. The training job is taking too long, and they want to speed up the tuning process. Which strategy is most effective?

192

Which TWO of the following are common techniques to handle missing values in a dataset?

193

Which THREE of the following are valid metrics for evaluating a regression model?

194

Which TWO of the following are techniques used to reduce overfitting in a neural network?

195

A data scientist is training a binary classification model on imbalanced data (95% negative, 5% positive). The model achieves 95% accuracy but only 10% recall on the positive class. Which metric should be used to evaluate model performance?

196

A company uses SageMaker to train a large language model. The training job is taking too long. The data scientist wants to use distributed training with data parallelism. Which SageMaker feature should be used?

197

A data scientist is building a regression model to predict house prices. The dataset contains features like number of bedrooms, square footage, and location. After training, the model has high variance. Which technique should the data scientist use to reduce variance without significantly increasing bias?

198

A machine learning engineer is deploying a model to SageMaker for real-time inference. The model is a TensorFlow SavedModel. Which SageMaker capability should be used to create an endpoint?

199

A data scientist is training a deep learning model on a GPU instance. The training loss is decreasing, but the validation loss starts increasing after a few epochs. Which action should the data scientist take to address this?

200

A data scientist is building a recommendation system for an e-commerce platform. The dataset contains user interactions (clicks, purchases) and item metadata. The scientist wants to use matrix factorization. Which algorithm should be used?

201

A company wants to build a model to detect fraudulent transactions. The dataset has a highly imbalanced class distribution. Which technique should be used during training to handle class imbalance?

202

A data scientist is working with a dataset containing categorical features with high cardinality. The scientist wants to use a tree-based model. Which encoding method should be used?

203

A machine learning engineer is deploying a PyTorch model to SageMaker. The model requires custom inference logic. Which approach should the engineer use?

204

Which TWO metrics are appropriate for evaluating a binary classification model when the cost of false negatives is high? (Choose 2)

205

Which THREE techniques can help reduce overfitting in a neural network? (Choose 3)

206

Which TWO SageMaker features can be used to perform hyperparameter optimization? (Choose 2)

207

A data scientist is training a binary classifier on an imbalanced dataset where the positive class represents only 2% of the data. The model achieves 99% accuracy but only identifies 5% of actual positives. Which metric should the scientist use to evaluate the model's ability to detect the positive class?

208

A team is training a large deep learning model on Amazon SageMaker. The training job is taking too long and they want to reduce training time without changing the model architecture. Which action is MOST effective?

209

A machine learning engineer needs to deploy a model that makes real-time predictions with latency under 100ms. The model is a small ensemble of decision trees. Which AWS service is MOST suitable?

210

A data scientist is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. During training, the job fails with an access denied error. What is the MOST likely cause?

211

A team is training a deep learning model using TensorFlow on a single GPU instance in SageMaker. The GPU utilization is below 30%. Which change will MOST improve GPU utilization?

212

A machine learning team is using Amazon SageMaker to build a regression model. The target variable is heavily right-skewed with a long tail. Which data transformation should the team apply to the target variable before training?

213

A data scientist is using Amazon SageMaker to train a model with a large dataset that does not fit into memory on a single instance. The training algorithm supports distributed training. Which approach should the scientist use to train the model efficiently?

214

A machine learning engineer is monitoring a deployed model on SageMaker and notices that the prediction latency is increasing over time. The model is a linear regression with a small number of features. Which is the MOST likely cause?

215

A team is using Amazon SageMaker to train a model and wants to automatically stop training when the model stops improving to save costs. Which SageMaker feature should they use?

216

A data scientist is training a gradient boosting model using SageMaker. The model is overfitting to the training data. Which TWO actions can help reduce overfitting? (Choose 2)

217

A company uses a SageMaker endpoint for real-time inference. They need to ensure high availability during deployment updates. Which THREE steps achieve this? (Choose 3)

218

A data scientist is building a classification model and wants to evaluate its performance. Which TWO metrics are appropriate for a multi-class classification problem? (Choose 2)

219

A company is training a deep learning model on Amazon SageMaker. The training job is taking a long time and the data scientist suspects that the model is overfitting. Which of the following actions can help reduce overfitting and improve generalization?

220

A data scientist is building a binary classification model to predict customer churn. The dataset is highly imbalanced, with only 5% of customers churning. The scientist evaluates several models using accuracy, precision, recall, and F1 score. Which metric is most appropriate for comparing model performance in this scenario?

221

A company is using Amazon SageMaker to train a linear regression model. The data scientist notices that the training loss is decreasing but the validation loss has started to increase after a few epochs. What is the most likely cause?

222

A data scientist is using the Amazon SageMaker built-in XGBoost algorithm to train a regression model. The training job completes successfully but the model performance on the test set is poor, with high bias. Which hyperparameter adjustment is most likely to help reduce bias?

223

A company is building a sentiment analysis model using Amazon SageMaker BlazingText. The training data consists of 100,000 product reviews. The data scientist wants to use the Word2Vec algorithm to generate word embeddings. Which configuration is required to use the continuous bag-of-words (CBOW) architecture?

224

A data scientist is using Amazon SageMaker to train a model. The training job is taking longer than expected. The scientist wants to reduce training time without changing the algorithm or the hardware. Which action is most likely to help?

225

A data scientist is training a neural network on Amazon SageMaker. The network has many layers and the training is very slow. The scientist suspects that the gradients are vanishing. Which technique is most specifically designed to mitigate the vanishing gradient problem?

226

A company is using Amazon SageMaker to deploy a model for real-time inference. The model has a latency requirement of less than 100 milliseconds. During testing, the latency is around 150 milliseconds. Which action can most likely reduce the latency to meet the requirement?

227

A data scientist is using Amazon SageMaker to train a classification model. The dataset contains categorical features with high cardinality. Which encoding method is most appropriate for handling high-cardinality categorical features in a linear model?

228

A data scientist is using Amazon SageMaker to train a random forest model for a binary classification task. The dataset has 50 features and 10,000 samples. The model achieves high training accuracy but poor test accuracy. Which TWO actions should the scientist take to improve generalization?

229

A company is using Amazon SageMaker to train an XGBoost model. The training data contains missing values. Which TWO methods can XGBoost use to handle missing values internally?

230

A data scientist is building a deep learning model using Amazon SageMaker. The model is overfitting the training data. Which THREE actions can help reduce overfitting?

231

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance role. A data scientist is trying to train a model using the SageMaker built-in XGBoost algorithm with training data in 'my-bucket/training-data/' and expects output in 'my-bucket/output/'. The training job fails with an access denied error. What is the most likely missing permission?

232

Refer to the exhibit. A data scientist ran a SageMaker training job and reviewed the logs. The training completed quickly, but the model performance is very poor. What is the most likely cause?

233

Refer to the exhibit. A data scientist is using Amazon SageMaker Ground Truth to label a dataset. The output manifest file references S3 objects with metadata. The scientist notices that a training job using the labeled data yields poor accuracy. What is the most likely issue?

234

A data scientist wants to use a linear regression model to predict house prices. After training, the model shows high bias and low variance. Which action would most likely improve the model's performance?

235

A machine learning team is using SageMaker to train a deep learning model. The training job is failing due to insufficient GPU memory. Which approach should the team take to resolve this issue without changing the model architecture?

236

A company uses SageMaker to deploy a model for real-time inference. The model is a large ensemble that requires 8 GB of memory and has high latency. The team wants to reduce latency without increasing cost. Which strategy is most effective?

237

During training, a binary classification model has an AUC of 0.99 on the training set but only 0.72 on the validation set. Which of the following is the most likely cause?

238

A data scientist is using Amazon SageMaker to train a model using a built-in algorithm. The training job fails with an error indicating that the algorithm expects the data to be in recordIO-protobuf format, but the input is CSV. What is the most efficient way to resolve this?

239

A data scientist is training a recurrent neural network (RNN) for time series forecasting. The model's training loss is not decreasing, and the gradients are vanishing. Which technique should the scientist apply to address vanishing gradients?

240

A company is building a model to classify customer reviews as positive or negative. The dataset has 10,000 positive and 100 negative reviews. Which metric is most appropriate for evaluating model performance?

241

A team is using SageMaker to train a model with hyperparameter tuning. The training jobs are taking too long. The team wants to reduce time without sacrificing model quality. Which approach should they take?

242

A data scientist is using Amazon SageMaker to deploy a custom model container. The model is a large transformer that requires 16 GB of memory. The scientist wants to minimize inference latency. Which SageMaker hosting option should they choose?

243

Which TWO actions can help reduce overfitting in a neural network? (Choose 2.)

244

Which THREE evaluation metrics are appropriate for a multi-class classification problem? (Choose 3.)

245

Which TWO techniques are used to handle missing values in a dataset before training? (Choose 2.)

246

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents only 5% of the data. The model currently achieves 95% accuracy but only 10% recall on the positive class. Which metric should the scientist focus on to improve the model's ability to detect the positive class?

247

A team is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large ensemble of 10 deep learning models, each 500 MB. The inference latency requirement is under 200 ms. Currently, the endpoint using a single ml.p3.2xlarge instance takes 1.5 seconds per request. Which approach is MOST likely to meet the latency requirement?

248

A machine learning engineer is using Amazon SageMaker to train a model. The training job fails with an out-of-memory error. The training data size is 10 GB and the instance is ml.m5.xlarge (16 GB memory). Which change is MOST likely to resolve the issue without increasing cost?

249

An e-commerce company wants to build a recommendation system. They have user-item interaction data (clicks, purchases) and user demographic data. The goal is to recommend items that a user is likely to purchase. Which approach should be used?

250

A data scientist is training a deep learning model for image classification using TensorFlow on Amazon SageMaker. The model trains slowly, and the GPU utilization is below 20%. Which action will MOST effectively increase GPU utilization and reduce training time?

251

A company uses Amazon SageMaker to train a model and wants to track metrics like loss and accuracy in real-time. Which SageMaker feature should be used?

252

A data scientist is using Amazon SageMaker to build a text classification model. The dataset has 100,000 labeled samples and 20 classes. The scientist wants to use a pre-trained BERT model and fine-tune it. Which approach is MOST cost-effective?

253

A company is deploying a machine learning model for real-time fraud detection. The model must have low latency (under 100 ms) and high throughput. The model is an ensemble of 5 gradient boosted trees (XGBoost), each 200 MB. Which deployment strategy is MOST suitable?

254

A data scientist is training a regression model. The training loss is decreasing but the validation loss starts to increase after a few epochs. Which technique should the scientist use to address this issue?

255

Which TWO of the following are best practices for hyperparameter tuning using Amazon SageMaker? (Choose 2)

256

Which THREE of the following are valid strategies to reduce overfitting in a deep neural network? (Choose 3)

257

Which TWO of the following are appropriate use cases for using Amazon SageMaker BlazingText? (Choose 2)

258

A financial services company uses Amazon SageMaker to train a model for credit risk prediction. The dataset contains 500 features and 1 million records. The target variable is binary with a 20% default rate. The data scientist uses a gradient boosting algorithm (XGBoost) with default hyperparameters. After training, the model achieves 95% accuracy, but the precision for the default class is only 30%, and recall is 15%. The business requires at least 50% recall and 40% precision for the default class. The data scientist tries to adjust the decision threshold, but this does not simultaneously meet both targets. The scientist suspects that the model is not learning the default patterns well. The company also has a large dataset of unlabeled transactions that could be used. Which action should the data scientist take to improve the model?

259

A healthcare company is building a model to predict patient readmission within 30 days of discharge. The dataset includes 10,000 patient records with 200 features, including lab results, demographics, and historical admissions. The target variable is highly imbalanced: only 8% of patients are readmitted. The data scientist splits the data into 80% training and 20% test sets, ensuring the same proportion of readmissions in each. The scientist trains a logistic regression model and a random forest model. The logistic regression model achieves 92% accuracy but only 10% recall for the readmitted class. The random forest model achieves 90% accuracy but only 25% recall. The business requirement is to achieve at least 60% recall for readmissions while maintaining reasonable precision. The scientist also has access to a large collection of unlabeled patient records from other hospitals. Which strategy should the data scientist use to meet the business requirement?

260

A retail company uses Amazon SageMaker to train a model for product demand forecasting. The dataset contains daily sales data for 10,000 products over 3 years. The data includes features like price, promotions, holidays, and seasonality. The data scientist uses a linear regression model and gets an RMSE of 50 units. However, the business requires more accurate forecasts, especially for products with high variability. The scientist notices that the residuals show a pattern: the model underestimates demand during promotional periods. Which approach should the scientist take to improve the model?

261

A data scientist is training a binary classification model on a dataset with 10,000 features. The model overfits severely. Which technique is MOST appropriate to reduce overfitting?

262

A team is training a deep learning model on Amazon SageMaker. The training job is slow because the data is stored in S3 as many small files. Which approach is MOST effective to improve training throughput?

263

A machine learning engineer is using SageMaker to train an XGBoost model on a dataset with a severe class imbalance (1:1000). The goal is to maximize recall on the minority class. Which hyperparameter tuning strategy is MOST appropriate?

264

A data scientist is evaluating a regression model. The RMSE on the training set is 2.5, and on the test set is 2.7. The R² on the test set is 0.98. What does this indicate?

265

A company uses SageMaker to host a real-time inference endpoint for a classification model. The endpoint receives traffic spikes that cause high latency. The team wants a solution that automatically scales based on demand while keeping costs low. Which approach is BEST?

266

A data scientist is using SageMaker to train a custom TensorFlow model. The training script reads data from S3 using TensorFlow's tf.data API. The training is bottlenecked by I/O. Which strategy would MOST effectively improve data throughput?

267

A data scientist is building a binary classifier and obtains the following confusion matrix on the test set: TP=80, FP=20, TN=70, FN=30. What is the precision?

268

A team is training a large language model using SageMaker's distributed training. They notice that the training loss is not decreasing after the first few epochs. Which action is MOST likely to resolve this issue?

269

A company is deploying a model that predicts customer churn. The model's recall for the churn class is 0.9, but precision is 0.4. The business cost of false positives is high. Which strategy would MOST likely improve precision without significantly harming recall?

270

A data scientist is training a linear regression model and wants to handle multicollinearity among features. Which TWO actions are appropriate?

271

A machine learning engineer is using SageMaker's built-in XGBoost algorithm for a multi-class classification problem. The training job completes but the model accuracy is low. Which THREE hyperparameters should the engineer tune to improve performance?

272

A data scientist is evaluating a binary classification model. The model's AUC-ROC is 0.95. Which TWO statements are true?

273

A company uses Amazon SageMaker to train a deep learning model for image classification. The training dataset consists of 500,000 images, each 256x256 pixels, stored in S3. The team uses a single ml.p3.2xlarge instance for training. The training time is unacceptably long (over 48 hours). The team wants to reduce training time without sacrificing model accuracy. They have already optimized the data pipeline by using SageMaker Pipe mode and sharding the S3 dataset. The model is a ResNet-50 implemented in TensorFlow. The team is considering the following options: A) Switch to an ml.p3.16xlarge instance, which has 8 GPUs and more memory. B) Implement distributed data parallelism using Horovod across multiple instances. C) Use SageMaker's built-in Hyperparameter Tuning to find optimal hyperparameters. D) Reduce the image resolution to 128x128 to speed up training. Which option will MOST effectively reduce training time while maintaining accuracy?

274

A data scientist is working on a regression problem to predict house prices. The dataset has 80 features, including categorical variables with high cardinality (e.g., zip code with 10,000 unique values). The target variable is log-transformed. The data scientist trains a linear regression model and obtains an R² of 0.45 on the test set. To improve performance, the data scientist considers: A) Applying one-hot encoding to all categorical features and using Ridge regression. B) Using target encoding for high-cardinality features and using a tree-based model like XGBoost. C) Removing all categorical features and using polynomial features for numerical features. D) Using principal component analysis (PCA) on all features before training a linear model. Which approach is MOST likely to improve the model's performance?

275

A company has deployed a real-time inference endpoint using SageMaker for a fraud detection model. The model uses a Random Forest classifier. The endpoint serves predictions, but the latency is too high: monitoring shows a p99 latency of 500 ms, and the requirement is under 200 ms. The team has already moved to the largest instance type their budget allows. The data scientist suggests: A) Reducing the number of trees in the Random Forest model. B) Switching to a linear model like Logistic Regression. C) Using SageMaker batch transform instead of a real-time endpoint. D) Adding more instances to the endpoint behind a load balancer. Which option will MOST effectively reduce latency while maintaining acceptable accuracy?

276

A company is building a binary classifier to predict customer churn. The dataset has 10,000 samples with 500 churners (5% positive class). After training a logistic regression model, the precision is 0.8 and recall is 0.2. Which metric should the data scientist focus on to improve the model's ability to identify churners while minimizing false positives?

277

A data scientist is training a deep learning model on Amazon SageMaker using a custom Docker container. The training job fails with an error 'OutOfMemoryError: CUDA out of memory'. The instance type is ml.p3.2xlarge (16 GB GPU memory). The model has 50 million parameters. What is the most likely cause and solution?

278

A team deployed a SageMaker endpoint for real-time inference using a PyTorch model. After monitoring, they notice that the latency is highly variable, with p99 latency 10x the p50 latency. The endpoint uses a single ml.c5.2xlarge instance with auto-scaling based on average CPU utilization. Which change is most likely to reduce latency variability?

279

A data scientist is using Amazon SageMaker to train a model with the built-in XGBoost algorithm. The dataset contains missing values. What is the default behavior of SageMaker XGBoost regarding missing values?

280

A company is fine-tuning a BERT model on Amazon SageMaker for a text classification task. The training script uses PyTorch and Hugging Face Transformers. The training job completes successfully, but the final model accuracy is low. The dataset has 10,000 labeled samples. What is the most likely cause and solution?

281

A data scientist is using Amazon SageMaker Autopilot to automatically build a model for a regression problem. The dataset has 100 features and 50,000 rows. Autopilot recommends a model with an R² of 0.85 on the validation set. However, when deployed to production, the model performs poorly (R² of 0.2). What is the most likely cause?

282

A machine learning team is using SageMaker to train a model with the built-in Linear Learner algorithm. The dataset has 1 million rows and 20 features. The training completes, but the model's mean squared error (MSE) is high. Which parameter adjustment is most likely to reduce MSE?

283

A data scientist is training a recurrent neural network (RNN) for time series forecasting. The training loss decreases steadily for the first 10 epochs, then plateaus. The validation loss starts increasing after epoch 10. What is the most appropriate action?

284

A company is using Amazon SageMaker to host a model for real-time inference. The model is a large ensemble of 10 XGBoost models, each 2 GB. The endpoint uses a single ml.c5.18xlarge instance. The inference latency is high (average 2 seconds). Which change would most effectively reduce latency?

285

A data scientist has this IAM policy attached to their role. When trying to create a SageMaker endpoint using the AWS CLI, they get an 'AccessDenied' error. What is the most likely reason?

286

A data scientist is using Amazon SageMaker to train a linear regression model. The dataset has outliers. Which TWO techniques can help reduce the impact of outliers? (Choose TWO.)

287

A data scientist is tuning a random forest model using SageMaker Hyperparameter Tuning. The objective metric is validation:accuracy. Which THREE hyperparameters are most commonly tuned for random forest? (Choose THREE.)

288

A company is using SageMaker to train a TensorFlow model for image classification. The training is slow on a single GPU instance. Which TWO strategies can reduce training time? (Choose TWO.)

289

A data science team is using Amazon SageMaker to train a deep learning model for object detection using the built-in SSD algorithm. The dataset consists of 100,000 labeled images provided to the training job through a SageMaker Pipe mode input channel. The training job uses a single ml.p3.2xlarge instance. After 2 hours, the training job fails with the error 'ResourceLimitExceeded: The account-level service limit for ml.p3.2xlarge for training job usage is 1. Contact AWS Support to request a limit increase'. However, the team has already submitted a limit increase request and it was approved for 5 instances. What is the most likely cause of the error?

290

A financial services company is building a fraud detection model using Amazon SageMaker. The dataset has 10 million transactions, of which 0.1% are fraudulent. They train an XGBoost model with default hyperparameters. The model achieves 99.9% accuracy on the test set, but catches only 10% of actual fraud cases. The company wants to maximize the number of fraud cases caught while keeping the false positive rate below 5%. The data scientist has already tried adjusting the class weights and the decision threshold, but recall is still low. What should the data scientist do next?

291

A startup is deploying a machine learning model for real-time recommendation on Amazon SageMaker. The model is a TensorFlow model (1 GB) and the endpoint uses a single ml.c5.2xlarge instance. The inference latency is currently 500 ms per request. The startup expects traffic to increase 10x in the next month. They want to maintain latency under 500 ms. What is the most cost-effective solution?

292

A company is training a deep learning model on Amazon SageMaker using a large dataset stored in S3. The training job is failing with an error indicating insufficient memory. The model architecture and hyperparameters are fixed. Which change is MOST likely to resolve the issue without modifying the model code?

293

A data scientist is using Amazon SageMaker to train a gradient boosting model on a dataset with categorical features. The dataset contains a column 'UserID' with over 1 million unique values. The training is taking very long and the model size is large. Which technique would MOST effectively reduce training time and model size while maintaining accuracy?

294

A machine learning engineer is using Amazon SageMaker to deploy a model for real-time inference. The model must respond within 100 milliseconds. The initial deployment uses a single ml.m5.large instance, but latency is too high. Which change should the engineer make to reduce latency?

295

A data scientist is training a binary classification model using Amazon SageMaker's built-in XGBoost algorithm. The dataset is highly imbalanced (95% negative class, 5% positive class). The model achieves high accuracy but poor recall on the positive class. Which TWO actions should the data scientist take to improve recall without significantly sacrificing precision?

296

A company uses Amazon SageMaker to train a linear regression model. During evaluation, they observe that the model has high bias (underfitting). Which THREE actions can reduce bias?

297

A machine learning engineer is deploying a model using Amazon SageMaker. The model requires preprocessing steps (e.g., scaling, encoding) that were applied during training. Which TWO options can ensure the same preprocessing is applied at inference?

298

Refer to the exhibit. A data scientist is trying to run a SageMaker training job using a script that reads training data from 's3://my-bucket/training/data.csv'. The job fails with an access denied error. What is the MOST likely reason?

299

A company uses Amazon SageMaker to train a custom TensorFlow model for image classification. The training job runs on a single ml.p3.2xlarge instance. The dataset contains 500,000 images stored in S3. The training time is too long (over 24 hours). The data scientist wants to reduce training time without changing the model architecture. The dataset is already in TFRecord format. The training script uses the default TensorFlow data pipeline. Which change will MOST significantly reduce training time?

300

A data scientist is using Amazon SageMaker Autopilot to automatically build a binary classification model. The dataset has 50 features and 100,000 rows. After the experiment completes, the best candidate model achieves an F1 score of 0.85 on the validation set. However, when deployed to a real-time endpoint, the model's F1 score drops to 0.72 on production data. The data distributions between training and production are similar. What is the MOST likely cause of the performance drop?
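
Many of the questions above turn on the gap between accuracy and per-class metrics on imbalanced data. As a minimal sketch of how those metrics relate, the snippet below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical, chosen for illustration rather than taken from any question in the bank.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Precision: of everything predicted positive, how much was truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced test set (90 positives out of 1,000).
metrics = classification_metrics(tp=40, fp=10, tn=900, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is 0.940 while recall is only 0.444, which is the pattern several of the imbalanced-class questions describe: a model can look strong on accuracy and still miss most positive cases.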

Practice all 300 Modeling questions

Other MLS-C01 exam domains

Data Engineering · Machine Learning Implementation and Operations · Exploratory Data Analysis

Frequently asked questions

What does the Modeling domain cover on the MLS-C01 exam?

The Modeling domain covers framing business problems as machine learning problems, selecting and training models, tuning hyperparameters, and evaluating model performance, as set out in the MLS-C01 exam blueprint published by Amazon Web Services. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all MLS-C01 domains, with no account required.

How many Modeling questions are in the MLS-C01 question bank?

The Courseiva MLS-C01 question bank contains 300 questions in the Modeling domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Modeling for MLS-C01?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Modeling questions for MLS-C01?

Yes — the session launcher on this page draws questions exclusively from the Modeling domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
