Free MLS-C01 Machine Learning Implementation and Operations Practice Questions (2026)

Q: What does the Machine Learning Implementation and Operations domain cover on the MLS-C01 exam?

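The Machine Learning Implementation and Operations domain is Domain 4 of the MLS-C01 exam blueprint published by Amazon Web Services. It covers building ML solutions for performance, availability, scalability, resiliency, and fault tolerance; recommending and implementing the appropriate ML services and features for a given problem; applying basic AWS security practices to ML solutions; and deploying and operationalizing ML solutions.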

Q: How many Machine Learning Implementation and Operations questions are on the MLS-C01 exam?

AWS does not publish a fixed number of questions per domain. The Machine Learning Implementation and Operations domain is weighted at 20% of the scored content on the MLS-C01 exam. The Courseiva question bank has 300 practice questions for this domain.

Q: How can I practice Machine Learning Implementation and Operations questions for MLS-C01?

Click any of the 300 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Machine Learning Implementation and Operations domain.

Practice Machine Learning Implementation and Operations questions

MLS-C01 Machine Learning Implementation and Operations questions (showing 300 of 351)

A company is using Amazon SageMaker to train a deep learning model. The training job is failing with an error 'CUDA out of memory'. The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model architecture and batch size are appropriate for this instance size. What is the most likely cause of this error?

A data scientist is deploying a model using Amazon SageMaker. The model endpoint needs to handle real-time inference requests with low latency. The model is a large ensemble of 10 deep learning models, each approximately 500 MB. What is the most cost-effective deployment strategy that meets the low-latency requirement?

A company is using Amazon SageMaker to train a model with a custom algorithm. The training script reads data from an S3 bucket using boto3. The training job fails with an 'AccessDenied' error when trying to access the S3 bucket. The IAM role attached to the SageMaker notebook instance has full S3 access. What is the most likely cause?

A machine learning engineer is deploying a model using AWS Lambda for real-time inference. The model is a scikit-learn RandomForestClassifier with 100 trees, serialized as a pickle file of 150 MB. The Lambda function has 3 GB memory allocated. However, the inference requests are timing out after 30 seconds. What is the most likely cause?

A data scientist is using Amazon SageMaker for hyperparameter tuning. The tuning job uses a Bayesian optimization strategy. After 10 training jobs, the objective metric (validation accuracy) has plateaued at 0.85. The data scientist wants to explore more diverse hyperparameter combinations. What should the data scientist do?

An IAM policy is attached to a SageMaker execution role. A data scientist tries to create a training job using a custom algorithm stored in an ECR repository. The training job fails with an 'AccessDenied' error when pulling the Docker image from ECR. What is the missing permission?

A DevOps engineer created a SageMaker notebook instance using the Terraform configuration shown. The notebook instance is in a VPC with a public subnet. However, the notebook instance cannot access the internet. What is the most likely cause?

A company is using Amazon SageMaker to train an XGBoost model on a large dataset. The training job is taking a long time. The data scientist wants to reduce training time without sacrificing model accuracy. The dataset is 100 GB in CSV format stored in S3. What is the most effective approach?

A company is using Amazon SageMaker to deploy a model for real-time inference. The model endpoint is behind an Application Load Balancer (ALB) for A/B testing. The data scientist notices that the endpoint is returning HTTP 503 errors intermittently. The CloudWatch metrics show that the endpoint's Invocations metric is within limits, but the ModelLatency metric has high variance. What is the most likely cause?

A company is using Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is failing with an OutOfMemory error. The data scientist wants to minimize cost while resolving the issue. Which action should the data scientist take?

A data scientist is deploying a model using Amazon SageMaker for real-time inference. The model is memory-intensive and requires a GPU. Which instance type should be selected for the endpoint?

A company is using AWS Glue to run ETL jobs that transform data for machine learning. The jobs are failing with 'Out of Memory' errors. The data size is growing, and the company needs a cost-effective solution. Which approach should be taken?

A data scientist is training a model using Amazon SageMaker and wants to automatically stop training when the model stops improving. Which feature should be used?

A company is using Amazon SageMaker to build a machine learning pipeline. The pipeline includes data preprocessing, training, and evaluation steps. The company wants to ensure that the pipeline is reproducible and that artifacts are versioned. Which TWO actions should be taken? (Choose TWO.)

A data scientist is deploying a model on Amazon SageMaker for real-time inference. The model is a PyTorch model that requires custom inference code. The data scientist needs to handle variable-length inputs and optimize inference latency. Which THREE steps should the data scientist take? (Choose THREE.)

A data scientist is trying to create a training job named 'test-model' using an IAM role with the attached policy. The creation fails with an AccessDenied error. What is the most likely cause?

A company runs a machine learning pipeline on Amazon SageMaker. The pipeline consists of three steps: data preprocessing (using a custom container), training (using a built-in algorithm), and model evaluation (using a custom container). The pipeline is orchestrated using AWS Step Functions. Recently, the pipeline has been failing intermittently at the model evaluation step with a 'TimeoutError'. The evaluation step runs a Python script that loads the trained model and a test dataset from S3, computes metrics, and writes results back to S3. The step is configured with a timeout of 600 seconds. The test dataset size has grown over time. The data science team suspects that the timeout is due to the increased data size. They want a solution that minimizes changes to the existing infrastructure and avoids increasing the timeout arbitrarily. Which approach should the team take?

A media company uses Amazon SageMaker to train a deep learning model for video classification. The training job uses a single ml.p3.2xlarge instance and processes 50 GB of labeled video data stored in Amazon S3. The training completes successfully in 12 hours. However, the data scientists report that the model’s accuracy is lower than expected. They suspect the training data contains labeling errors. To improve model accuracy without incurring significant additional cost, they want to identify and remove mislabeled training examples before retraining. They have a small budget of $50 and need to complete the analysis within 2 hours. Which approach should the data scientists take?

A company wants to deploy a machine learning model that performs real-time inference with sub-second latency. The model is a deep neural network with 500 MB of weights. The inference endpoint must scale to zero when not in use to minimize cost. Which AWS service should the company use?

A data science team is training a large deep learning model using Amazon SageMaker. The training job is taking a long time because the model has many layers and the dataset is large. The team wants to reduce training time by distributing the training across multiple GPUs on a single instance, as well as across multiple instances. Which TWO actions should the team take? (Choose two.)

An ML engineer is troubleshooting why an automated CI/CD pipeline cannot deploy an updated model to an existing SageMaker endpoint. The pipeline uses the IAM role that has the attached policy shown in the exhibit. What is the MOST likely cause of the failure?

Drag and drop the steps to train a model using an Amazon SageMaker built-in algorithm in the correct order.

Drag and drop the steps to evaluate a trained model using SageMaker Model Monitor in the correct order.

Match each SageMaker built-in algorithm to its primary use case.

Match each SageMaker built-in metric to its meaning.

A data scientist is training a linear regression model on a dataset with 100 features. The model shows high variance on the test set. Which action is MOST likely to reduce overfitting?

A company is using Amazon SageMaker to deploy a real-time inference endpoint for a computer vision model. The endpoint receives bursts of traffic with up to 500 requests per second, but the load is unpredictable. Which scaling strategy is MOST cost-effective while maintaining low latency?

A machine learning team is building a fraud detection system using Amazon SageMaker. The training data is highly imbalanced (99% legitimate, 1% fraudulent). They need to maximize the recall of the fraud class while keeping precision above 90%. Which approach should they take?
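The constraint in this scenario (maximize recall while keeping precision at or above 90%) can be illustrated as a decision-threshold search over model scores. The following is a minimal sketch, not the question's answer key: the function name `best_threshold` and the toy scores and labels are hypothetical, and it assumes the model outputs a fraud probability per example.

```python
def best_threshold(scores, labels, min_precision=0.90):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None  # recall is undefined without positive examples
    best = None
    for t in sorted(set(scores)):
        # Predict "fraud" when the score is at or above the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if tp + fp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (t, precision, recall)
    return best


# Hypothetical validation scores (fraud probability) and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
result = best_threshold(scores, labels)
print(result)
```

On this toy data the search settles on the threshold 0.80: lowering it further would recover the fourth fraud case but push precision below 0.90.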

A data scientist wants to use Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is expected to take several hours. Which storage option should be used to minimize data loading time and cost?

An ML engineer is deploying a model to a SageMaker endpoint for real-time inference. The model requires a custom inference script that preprocesses input data and postprocesses predictions. Which SageMaker feature should be used to implement this custom logic?

A company uses Amazon SageMaker to train a text classification model. The training data is stored in S3 and contains sensitive personally identifiable information (PII). The company must ensure that the data is encrypted at rest in S3 and that the encryption key is managed by the company's own hardware security module (HSM). Which configuration should be used?

A data scientist is using Amazon SageMaker to train an XGBoost model on a dataset with missing values. The dataset has both numeric and categorical features. Which preprocessing step is MOST appropriate before training?

An ML team is using Amazon SageMaker to train a model. They notice that the training job is taking longer than expected and the CloudWatch metrics show high GPU utilization but low CPU utilization. Which action is MOST likely to improve training speed?

A company is using Amazon SageMaker to host a model that performs real-time inference. The model receives around 100 requests per second with occasional spikes up to 500 requests per second. The current endpoint uses 2 ml.m5.large instances. During spikes, latency increases significantly, and some requests time out. What is the MOST cost-effective solution to handle the spikes without losing requests?

Which TWO configuration steps are necessary to deploy a custom Docker container for training in Amazon SageMaker? (Choose two.)

Which THREE actions can help reduce the inference latency of a SageMaker endpoint? (Choose three.)

Which TWO services can be used to perform hyperparameter tuning in Amazon SageMaker? (Choose two.)

An IAM policy is attached to a SageMaker notebook instance. The data scientist wants to use the notebook to train a model using data from S3 bucket 'my-bucket'. However, the training job fails with an access denied error. What is the MOST likely cause?

A SageMaker endpoint creation fails with the above CloudWatch Logs excerpt. What is the MOST likely cause?

A data scientist runs the AWS CLI command shown in the exhibit. The output shows that job-2 failed. Which action should the data scientist take to diagnose the failure?

A data science team is deploying a machine learning model to production using Amazon SageMaker. The model requires real-time inference with low latency. Which SageMaker feature should they use to deploy the model?

During training of a deep learning model on a GPU instance in SageMaker, the training job fails with an insufficient memory error. Which step should be taken first to resolve this issue?

A company uses SageMaker to train a model each night. The training data is stored in an S3 bucket with SSE-S3 encryption. The training job fails with an access denied error. Which configuration is needed?

A data scientist is deploying a model to a SageMaker endpoint and needs to optimize for cost while maintaining low latency. Which TWO actions should the data scientist take?

You are deploying a custom Docker container for a SageMaker model that requires a specific NVIDIA CUDA version. Which THREE steps must you take to ensure the container runs correctly on SageMaker?

A machine learning pipeline uses SageMaker Processing jobs for feature engineering. Which TWO are benefits of using SageMaker Processing over running a custom script on an EC2 instance?

Refer to the exhibit. A data scientist runs the AWS CLI command to create a SageMaker training job. The training job fails because the input data is not accessible. Which step should the data scientist take to fix the issue?

Refer to the exhibit. A SageMaker training job uses an IAM role with this policy. The training job writes output to s3://my-bucket/output/. Which statement about the policy is true?

Refer to the exhibit. A SageMaker endpoint is returning 5xx errors. The logs show the above error. Which change will most likely resolve the issue?

A team is using SageMaker to train a model using the built-in XGBoost algorithm. The training job is taking longer than expected. The team suspects that the data is not being loaded efficiently. Which data format should they use to minimize training time?

A company uses SageMaker Ground Truth to label images for object detection. After labeling, they notice that the bounding boxes are often misaligned with the objects. Which action should they take to improve label quality?

A data scientist needs to create a SageMaker notebook instance with access to a private S3 bucket. The bucket uses SSE-KMS encryption. Which additional configuration is required?

You are building a CI/CD pipeline for SageMaker using AWS CodePipeline. Which THREE components are essential for a fully automated model training and deployment pipeline?

A company uses SageMaker to run training jobs on a schedule. The training data is stored in an S3 bucket that receives new data every hour. Which TWO approaches can the company use to trigger a training job when new data arrives?

You are deploying a PyTorch model to a SageMaker endpoint. The model is large (5 GB) and the endpoint is using an ml.c5.2xlarge instance. Inference latency is higher than required. Which change would most effectively reduce latency?

A data scientist needs to deploy a PyTorch model for real-time inference. Which AWS service is best suited for this task?

A company uses SageMaker to train a model, but the training job fails due to insufficient memory. What is the most cost-effective way to resolve this?

A team wants to automate the retraining of a model weekly using new data that arrives in S3. Which combination of services should they use?

A deployed SageMaker endpoint is returning high latency. The model is a scikit-learn Random Forest. Which action is most likely to reduce latency?

A data scientist needs to run a hyperparameter tuning job for a deep learning model. Which SageMaker feature should they use?

A company wants to serve predictions from a model using a REST API with low latency. Which SageMaker deployment option is most appropriate?

A team notices that a SageMaker training job using TensorFlow is running slower than expected. The training data is in S3 in TFRecord format. Which action is most likely to improve training throughput?

A company wants to monitor a deployed model for data drift. Which AWS service should they use?

A data scientist has trained a model using SageMaker and wants to deploy it to an endpoint. Which step is required before deployment?

Which TWO actions can help reduce inference latency for a SageMaker endpoint?

Which THREE factors should be considered when choosing an instance type for a SageMaker training job?

Which TWO services can be used to orchestrate a machine learning pipeline?

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance. Which action will the notebook be able to perform?

Refer to the exhibit. A SageMaker training job is launched with the CLI command shown. The job fails with an error 'S3 data distribution type not supported for File mode'. What is the most likely fix?

Refer to the exhibit. A SageMaker endpoint logs this error. What is the most likely cause?

A machine learning team is deploying a model using Amazon SageMaker. The model inference code runs on GPUs and requires a custom container. The team wants to minimize cold start latency. Which SageMaker hosting option should they use?

A data scientist is training a deep learning model on a large dataset using SageMaker. The training job is taking too long. Upon reviewing the CloudWatch logs, the scientist notices that the GPU utilization is below 10% most of the time. Which change is MOST likely to improve GPU utilization and reduce training time?

A company is using Amazon SageMaker to train a model and wants to track hyperparameter tuning jobs. Which AWS service is BEST suited to store and query metadata such as tuning job configurations and results?

A machine learning engineer needs to deploy a model that performs real-time inference with strict latency requirements of under 100 milliseconds. The model is a large ensemble of 10 deep learning models. Which SageMaker deployment strategy is MOST appropriate?

A data scientist is using SageMaker Autopilot to automatically build a classification model. The dataset is highly imbalanced (1% positive class). Which configuration should the scientist set to handle the class imbalance?

A company is using Amazon SageMaker to train a model and wants to automatically retrain the model every week using new data. Which AWS service should be used to orchestrate the retraining pipeline?

A machine learning team is using SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The training job fails with an 'AccessDenied' error. Which IAM permission is MOST likely missing from the SageMaker execution role?

A company is using Amazon SageMaker Ground Truth to create a labeled dataset for object detection. The labeling job is taking longer than expected. The team notices that many workers are spending a lot of time on images with no objects. Which labeling strategy should they use to reduce costs and time?

A data scientist needs to run a one-time training job on a large dataset using SageMaker. The job requires a specific PyTorch version and custom dependencies. Which approach is MOST efficient?

Which TWO factors should be considered when choosing between Amazon SageMaker's real-time endpoints and serverless inference? (Select TWO.)

Which THREE measures can help reduce inference latency for a deep learning model deployed on SageMaker real-time endpoints? (Select THREE.)

Which TWO actions are best practices for securing a SageMaker notebook instance? (Select TWO.)

A data scientist is training a neural network on a GPU instance in Amazon SageMaker. The training job fails with an 'OutOfMemoryError'. Which action should the data scientist take to resolve this issue?

A company is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires low latency. The team is concerned about cost. Which SageMaker hosting option should the team use?

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint. The model is a PyTorch model that requires a custom inference script. The engineer notices that the endpoint is returning 500 errors after deployment. Which step should the engineer take to debug the issue?

A data scientist is using AWS Glue to prepare training data. The job reads from an S3 bucket, performs transformations, and writes to another S3 bucket. The job is failing due to insufficient memory. Which solution should the data scientist use to fix this?

A company is building a fraud detection model. The dataset is highly imbalanced (99% legitimate, 1% fraud). The data scientist trains a model using Amazon SageMaker's built-in XGBoost algorithm. The model achieves 99% accuracy but only catches 10% of fraud cases. Which technique should the data scientist apply to improve recall for the minority class?
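The numbers in this scenario can be reproduced with simple confusion-matrix arithmetic, which shows why accuracy alone is misleading on imbalanced data. This is a sketch under stated assumptions: the 10,000-transaction sample size and the count of 10 false alarms are hypothetical, chosen so the totals match the 99% accuracy and 10% recall in the question.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical sample: 10,000 transactions, 9,900 legitimate and 100 fraudulent.
# The model catches 10 of the 100 fraud cases and raises 10 false alarms.
model = confusion_metrics(tp=10, fp=10, tn=9890, fn=90)

# Baseline that labels every transaction as legitimate.
always_legit = confusion_metrics(tp=0, fp=0, tn=9900, fn=100)

print(model)
print(always_legit)
```

Both the model and the trivial "always legitimate" baseline reach 99% accuracy here, while their recall on the fraud class is 10% and 0% respectively.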

A company uses Amazon SageMaker to train a model. The training job uses a custom Docker container. The job fails with the error 'CannotStartContainerError: API error (500).' Which of the following is the most likely cause?

A data scientist needs to version control datasets used for machine learning experiments. Which AWS service should the data scientist use?

A company is using Amazon SageMaker to train a model on a large dataset stored in S3. The training job is taking a long time due to slow data loading. Which action can the data scientist take to reduce data loading time?

A machine learning engineer is deploying a TensorFlow model to an Amazon SageMaker endpoint. The endpoint is behind an Application Load Balancer (ALB) for A/B testing. The engineer notices that the new variant is not receiving any traffic. What is the most likely cause?

A data scientist is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. Which TWO actions are necessary to allow SageMaker to access the data?

A company is deploying a machine learning model using Amazon SageMaker. The model needs to be updated frequently. Which THREE practices should the company implement for model versioning and deployment?

A company is using Amazon SageMaker to build a custom model. The training job is failing with a 'ResourceLimitExceeded' error. Which TWO actions should the company take to resolve this issue?

A data scientist is trying to create a SageMaker training job using an execution role with the attached IAM policy. The training job fails with an access denied error when trying to read training data from the S3 bucket 'my-bucket'. What is the most likely cause?

A data scientist is reviewing the training logs from a SageMaker training job. The model's loss decreases steadily and accuracy increases. However, when the model is evaluated on a holdout test set, the accuracy is only 0.65. Which issue does this behavior suggest?

A data scientist submits a SageMaker training job with the provided configuration. The job fails immediately with the error 'Algorithm not found: 382416733822.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.2-1'. What is the most likely cause?

A data scientist is using Amazon SageMaker to train a custom image classification model using a PyTorch script. The training job runs successfully but the model accuracy is lower than expected. The scientist wants to debug the training process by inspecting gradients and layer outputs. Which SageMaker feature should be used to capture this internal state during training?

A company deploys a real-time inference endpoint using Amazon SageMaker with an ML model that has strict latency requirements. The endpoint currently uses a single ml.c5.xlarge instance. During a load test, the p99 latency exceeds the 100ms threshold. The team adds more instances but latency does not improve because the model is heavily CPU-bound. What is the MOST cost-effective change to meet the latency requirement?
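For readers unfamiliar with the p99 metric used in this scenario, the sketch below computes a nearest-rank percentile over latency samples. The sample values are hypothetical load-test numbers, and nearest-rank is only one of several percentile definitions; monitoring tools may interpolate differently.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the nearest-rank percentile (pct in 0..100) of a list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Rank is 1-based: the smallest rank covering pct percent of the samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical load test: 98 requests at 40 ms and 2 slow requests at 250 ms.
latencies_ms = [40] * 98 + [250] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p99_ms = percentile_nearest_rank(latencies_ms, 99)
print(mean_ms, p50_ms, p99_ms)
```

In this toy sample the mean and median sit well under a 100 ms threshold while the p99 does not, which is why tail-latency requirements are stated as percentiles rather than averages.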

An ML engineer needs to run a hyperparameter tuning job on Amazon SageMaker. The training algorithm supports distributed training across multiple GPUs. The engineer wants to minimize the total time to find the best hyperparameters. Which strategy should be used?

A company is deploying a machine learning model for real-time fraud detection using Amazon SageMaker. The model must have a p99 inference latency under 50ms. Which TWO actions should the ML team take to meet the latency requirement?

A machine learning team is building a real-time inference pipeline using Amazon SageMaker. The team has multiple models that need to be served, but usage patterns are unpredictable and traffic spikes occur several times a day. The team wants to minimize costs while maintaining low latency. Which THREE actions should the team take?

A data scientist is using Amazon SageMaker to train a large neural network on a GPU instance. The training is taking longer than expected. The scientist wants to reduce training time without changing the model architecture. Which TWO approaches should the scientist consider?

An ML engineer is deploying a model on a SageMaker endpoint and wants to ensure that only authorized users and services can invoke the endpoint. The company uses AWS IAM for access control and requires that the endpoint be invoked only from within a specific VPC. What combination of actions should the engineer take? (Choose the single best answer.)

A company uses Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error: "Unable to write to /opt/ml/output/data". The data scientist checks the container and finds that the /opt/ml directory is not writable. What is the MOST likely cause?

An ML team wants to perform batch inference on a large dataset stored in Amazon S3 using a pre-trained model. The team needs to process the data in parallel across multiple instances to reduce processing time. Which approach should they use?

Refer to the exhibit. An ML engineer attaches this IAM policy to a user. The user wants to invoke the SageMaker endpoint my-endpoint from an EC2 instance with public IP 52.1.1.1. What will happen?

Refer to the exhibit. A data scientist is deploying a PyTorch model on a SageMaker endpoint. When the endpoint is invoked, the above error appears in CloudWatch logs. What is the MOST likely cause?

Refer to the exhibit. An ML engineer creates a CloudFormation stack with this template. The stack creation succeeds, but when the engineer tries to invoke the endpoint, it returns a ModelError. The CloudWatch logs show that the container exited with error. What is the MOST likely cause?

A financial services company is deploying a machine learning model for credit risk assessment. The model must have an inference latency under 200ms and must be able to handle up to 1000 transactions per second (TPS). The company wants to minimize costs. The model is a gradient boosting model implemented in XGBoost. Which SageMaker deployment option should the team choose?

An ML team uses Amazon SageMaker to train a deep learning model. The training job runs on a single ml.p3.2xlarge instance and is taking 10 hours. The team wants to reduce the training time to under 2 hours without changing the model architecture. Which approach is MOST effective?
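The target in this scenario (10 hours down to 2 hours) implies a minimum speedup, which can be worked out with back-of-the-envelope arithmetic. The sketch below assumes a simple linear scaling model in which `n` workers give a speedup of `n * efficiency`; both the model and the 0.8 efficiency figure are illustrative assumptions, not measured SageMaker behavior.

```python
import math


def min_workers(current_hours, target_hours, efficiency=1.0):
    """Smallest worker count whose estimated runtime is at most target_hours,
    assuming speedup = workers * efficiency (a simplifying assumption)."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    required_speedup = current_hours / target_hours
    return math.ceil(required_speedup / efficiency)


ideal = min_workers(10, 2)            # perfect scaling
realistic = min_workers(10, 2, 0.8)   # hypothetical 80% scaling efficiency
print(ideal, realistic)
```

With perfect scaling, five workers would only just reach 2 hours, so finishing strictly under the target needs some headroom; at the assumed 80% efficiency the estimate rises to seven.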

A company wants to use Amazon SageMaker to host a model that was trained using a custom algorithm. The model artifact is stored in Amazon S3. The company wants to ensure that the endpoint can automatically scale based on the number of incoming requests. Which configuration should the company use?

A company uses Amazon SageMaker to train a model. The training job fails with an 'OutOfMemory' error. The training data is stored in S3 and the instance type is ml.m5.xlarge. What is the most efficient way to resolve this issue?

A data scientist is deploying a real-time inference endpoint using SageMaker. The model is a large NLP model requiring GPU for low latency. The endpoint must be highly available across two Availability Zones. Which deployment configuration meets these requirements?

A team uses AWS Glue ETL jobs to preprocess data for SageMaker training. The job runs successfully but the output data is empty. What is the most likely cause?

A company uses SageMaker to host a model for real-time predictions. The model is updated weekly. To minimize downtime during model updates, what should the company do?

A machine learning team is using SageMaker Processing jobs to run feature engineering on large datasets. The job takes a long time to complete. Which change would most likely reduce the processing time?

A company is using SageMaker to train a model. The training data includes personally identifiable information (PII). The company must ensure that the data is encrypted at rest and in transit. Which combination of actions meets these requirements?

A data scientist wants to use AWS Step Functions to orchestrate a machine learning workflow including data preprocessing, training, and evaluation. Which SageMaker integration is best suited for this purpose?

A company is using SageMaker to host a model that makes predictions on streaming data from Amazon Kinesis. The model must provide predictions with sub-second latency. Which approach should the company use?

A team is using SageMaker to train a model. They want to track hyperparameters, metrics, and model artifacts. Which SageMaker feature should they use?

A company is deploying a SageMaker model for real-time inference. The endpoint must be highly available and cost-effective. Which TWO actions should the company take? (Select TWO.)

A data scientist is using SageMaker to train a model. The training job needs to access data in an S3 bucket in a different AWS account. The data scientist has set up proper S3 bucket policies and IAM roles. Which THREE steps are necessary to allow SageMaker to access the cross-account S3 bucket? (Select THREE.)

A company uses SageMaker to train a model. The training job is taking too long and the data scientist wants to speed it up. Which THREE strategies should the data scientist consider? (Select THREE.)

A data scientist attempts to create a SageMaker training job using the IAM policy shown in the exhibit. The training job fails with an access denied error. What is the most likely cause?

A SageMaker training job fails with the failure reason shown in the exhibit. What is the most likely cause?

A SageMaker endpoint has a CloudWatch alarm configured as shown in the exhibit. The alarm fires when the p99 latency exceeds 500 ms for two consecutive minutes. Which action should the data scientist take to reduce latency?

A data scientist is training a model on Amazon SageMaker and notices that the training job is taking much longer than expected. The instance type is ml.m5.xlarge and the dataset is 10 GB in CSV format. Which action is MOST likely to reduce training time without changing the instance type?

A machine learning engineer is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires low latency (under 100 ms) and high throughput (1000 requests per second). Which SageMaker deployment option is MOST suitable?

A company is using Amazon SageMaker Ground Truth to create labeled datasets for a text classification task. The labeling job uses a private workforce of 10 annotators. After labeling 10,000 items, the quality of labels is inconsistent. Which approach will MOST effectively improve labeling consistency?

A data scientist trains a model using Amazon SageMaker's built-in XGBoost algorithm. The model overfits on the training data. Which hyperparameter adjustment is MOST likely to reduce overfitting?

A company is building a recommendation system using Amazon SageMaker. The training data includes user-item interactions stored in a DataFrame with over 100 million rows. The data scientist wants to perform feature engineering, including one-hot encoding of categorical features with high cardinality. Which approach is MOST cost-effective and scalable?

133

A machine learning team is using Amazon SageMaker Experiments to track multiple training runs. They need to compare the performance of different models based on metrics like accuracy and F1 score. However, when they view the experiment list in SageMaker Studio, the metrics are not displayed. What is the MOST likely cause?

134

A company is deploying a PyTorch model on a SageMaker endpoint for real-time inference. The model is stored as a .pth file in an S3 bucket. The data scientist wants to use the SageMaker PyTorch inference toolkit. Which file is REQUIRED in the model artifacts to serve the model?

135

A data scientist is using Amazon SageMaker Autopilot to automatically build a binary classification model. After the Autopilot job completes, the best model has an accuracy of 0.85 on the validation set. However, the data scientist notices a class imbalance (90% negative, 10% positive). Which metric should the data scientist use to evaluate the model's performance on the positive class?

136

A company is using Amazon SageMaker to train a deep learning model for image classification. The training job is using a single ml.p3.2xlarge instance and takes 10 hours. The data scientist wants to reduce training time using distributed training. Which SageMaker feature should be used?

137

A company is deploying a machine learning model using Amazon SageMaker. The model needs to be updated frequently with new data. Which TWO approaches can be used to update the model without downtime? (Choose TWO.)

138

A data scientist is using Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error message indicating that the container exited with a non-zero code. Which THREE steps should the data scientist take to diagnose the issue? (Choose THREE.)

139

A company is building a machine learning pipeline on AWS. The pipeline includes data ingestion, preprocessing, training, and deployment. Which THREE AWS services can be used to orchestrate the pipeline? (Choose THREE.)

140

A data scientist wants to deploy a PyTorch model for real-time inference with low latency. Which AWS service should they use?

141

A company's ML model training on Amazon SageMaker is taking longer than expected. The training job uses a single ml.p3.2xlarge instance. Which change is most likely to reduce training time?

142

A team is using Amazon SageMaker Autopilot to automatically build models. The dataset has 50 features and 1 million rows. After training, Autopilot generates multiple candidates. The team wants to deploy the model with the highest accuracy. What is the best practice to select and deploy the model?

143

An ML engineer needs to store and version training datasets and model artifacts. Which AWS service should they use?

144

A team is training a large language model using PyTorch on multiple GPUs. The training is taking too long due to inefficient data loading. Which AWS service can help accelerate data loading by caching data close to the GPU instances?

145

A company's ML pipeline uses AWS Step Functions to orchestrate data preprocessing, training, and evaluation. The training step occasionally fails due to a transient error. What is the most robust way to handle this without manual intervention?

146

A data scientist needs to perform hyperparameter optimization for a gradient boosting model. Which built-in Amazon SageMaker feature should they use?

147

A company's ML model is deployed on a SageMaker endpoint. The model's predictions are used in a customer-facing application that requires low latency. Over time, the model's performance degrades due to data drift. What is the most suitable approach to detect this drift automatically?

148

An ML team is using SageMaker Processing jobs to run feature engineering scripts. The scripts require a specific Python package not included in the default SageMaker image. How should the team provide this package?

149

Which TWO options are valid ways to reduce inference latency for a model deployed on a SageMaker real-time endpoint? (Select TWO.)

150

Which THREE steps should be taken to secure a SageMaker notebook instance that accesses sensitive data? (Select THREE.)

151

Which TWO AWS services can be used to deploy a trained model for serverless inference? (Select TWO.)

152

A company runs a real-time fraud detection model on a SageMaker endpoint. The model is a TensorFlow neural network trained on transactional data. The endpoint uses a single ml.p3.2xlarge instance. Recently, the application’s latency has increased from 50 ms to 500 ms on average. The CloudWatch metrics show that CPU utilization is at 90%, GPU utilization is at 30%, and memory utilization is at 40%. The number of requests per second has remained stable. The ML team suspects the model is not fully utilizing the GPU. What action should the team take to reduce latency without changing the instance type?

153

A company is deploying a machine learning model to production on Amazon SageMaker. The model requires low-latency inference (under 10 ms) for real-time predictions. The data scientist has trained a model using XGBoost and wants to minimize cost while meeting latency requirements. Which SageMaker hosting option should be used?

154

A data scientist is training a deep learning model on Amazon SageMaker using a custom TensorFlow container. The training job fails with an OutOfMemory error. The instance type is ml.p3.2xlarge with 16 GB GPU memory and 61 GB system memory. The model uses mixed precision training. Which step should the data scientist take to resolve the issue without changing the instance type?

155

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The SageMaker training role has the necessary permissions to decrypt the data. However, the training job fails with an access denied error. What is the most likely cause?

156

A machine learning team is deploying a model using Amazon SageMaker. They need to automatically retrain the model every week with new data and update the endpoint without downtime. Which approach should they use?

157

A company is running a real-time inference endpoint on Amazon SageMaker. The endpoint is using an ml.c5.xlarge instance. Over the past month, the CPU utilization has been consistently below 10%, and the latency is well within requirements. The company wants to reduce costs. What should they do?

158

A data scientist is using Amazon SageMaker to train a model. The training job is taking longer than expected. The data scientist notices that the GPU utilization is low. Which action would most likely improve GPU utilization?

159

A company is using Amazon SageMaker to deploy a model for real-time predictions. The model requires access to a DynamoDB table to look up features. The SageMaker endpoint is configured with a VPC and subnet. However, the endpoint cannot connect to DynamoDB. What is the most likely reason?

160

A data scientist is training a model using Amazon SageMaker. The training dataset is 500 GB and is stored in S3. The data scientist wants to use Pipe input mode to stream data directly from S3 to the training container. However, the training job fails with an error indicating that the container cannot read the data. What is the most likely cause?

161

A company is using Amazon SageMaker to deploy a model. The model is a large ensemble that requires 8 GB of memory. The company wants to minimize endpoint cost. Which instance type should they choose?

162

Refer to the exhibit. A company is using an IAM role with the attached policy to deploy a SageMaker model. The data scientist can create training jobs and models, but when trying to create an endpoint, they receive an access denied error. What is the missing permission?

163

Refer to the exhibit. A data scientist is reviewing CloudWatch logs for a SageMaker real-time endpoint. The log shows that a prediction took 15 ms. The endpoint is configured with an ml.c5.large instance and the model is a small scikit-learn model. The latency requirement is under 10 ms. Which action would most likely reduce the latency?

164

A company is using Amazon SageMaker to train a model. The training data includes sensitive personally identifiable information (PII). The company needs to ensure that the training data is protected and that the trained model does not inadvertently expose PII. Which TWO actions should the company take? (Choose TWO.)

165

A data scientist is deploying a model on Amazon SageMaker. The model requires inference on images, and the data scientist wants to use a GPU instance for low latency. However, the data scientist is unsure about the instance type to choose for the endpoint. Which TWO factors should the data scientist consider when selecting the instance type? (Choose TWO.)

166

A company uses Amazon SageMaker to train models. The data scientist wants to automate the retraining process whenever new data arrives in an S3 bucket. Which THREE services can be used together to achieve this? (Choose THREE.)

167

A company operates a real-time fraud detection system using an Amazon SageMaker endpoint. The model is a gradient boosting model trained on historical transaction data. The endpoint is deployed on an ml.c5.2xlarge instance with auto-scaling enabled based on average latency. Recently, during a flash sale event, the endpoint started returning HTTP 503 errors. The CloudWatch metrics show that the CPU utilization is at 70%, and the average latency has increased from 50 ms to 200 ms. The auto-scaling policy is configured to add one instance when average latency exceeds 100 ms for 5 consecutive minutes, and remove one instance when latency drops below 50 ms for 5 minutes. The current number of instances is 2. The flash sale lasted 30 minutes. What should the company do to prevent this issue in future flash sales?

168

A data scientist is using Amazon SageMaker to train a model on a large dataset (10 TB) stored in S3 in Parquet format. The training job uses an ml.p3.16xlarge instance with multiple GPUs. The data scientist notices that the GPU utilization is low (around 30%) and the training is slow. The dataset consists of hundreds of thousands of small Parquet files. The data scientist suspects that the I/O is bottlenecked. What should the data scientist do to improve GPU utilization and training speed?

169

A company has deployed a machine learning model on a SageMaker endpoint that serves predictions to a web application. The model uses a custom inference container that loads the model artifacts from an ECR repository. After updating the model with new training data, the data scientist creates a new model and updates the endpoint. However, some users report that they still get predictions from the old model. The data scientist confirms that the endpoint configuration points to the new model. What is the most likely cause?

170

A data scientist is using SageMaker to train a model. The training job is failing with a 'ResourceLimitExceeded' error. Which action should be taken to resolve this issue?

171

A company is deploying a real-time inference endpoint using SageMaker. The model has a high memory footprint and requires GPU acceleration. Which instance type and configuration should be used to minimize cost while meeting latency requirements?

172

A machine learning team is using AWS Glue to prepare data for training. They notice that the ETL job takes a long time to process large datasets. Which change is most likely to improve performance?

173

A company is using SageMaker Ground Truth to label images for a computer vision model. After launching the labeling job, they notice that the labeling throughput is lower than expected. What should they do to increase throughput?

174

A data scientist is using SageMaker Debugger to monitor a training job. The training loss is not decreasing as expected. Which Debugger feature can help identify the issue?

175

A company is using Amazon Rekognition to detect objects in images stored in S3. They want to reduce costs by processing images only when they are uploaded. Which AWS service should be used to trigger Rekognition automatically?

176

A machine learning engineer needs to deploy a model that requires custom inference code with dependencies. Which SageMaker deployment option should be used?

177

A company is training a deep learning model on SageMaker using multiple GPUs. The training is slow due to inefficient data loading. Which TWO actions can improve I/O performance?

178

A data scientist is using SageMaker to build a model for fraud detection. The dataset is highly imbalanced. Which THREE techniques should be applied to address class imbalance?

179

A company is using SageMaker Autopilot to automatically build ML models. They want to ensure that the generated models are reproducible. Which TWO settings should they configure?

180

A company operates a real-time fraud detection system using SageMaker. The model is deployed on an ml.c5.xlarge instance behind an Application Load Balancer (ALB). Recently, during a sales event, traffic spiked and the endpoint returned HTTP 503 errors. The team scaled the instance count from 2 to 5, but errors persisted. CloudWatch metrics show low CPU utilization (~30%) and high memory usage (~90%). The model loads a large dictionary file (2 GB) into memory at startup. Which action should resolve the issue?

181

A research lab is using SageMaker to train deep learning models on a custom dataset stored in S3. Each training job uses a single ml.p3.2xlarge instance. Recently, training jobs have been failing intermittently with 'NetworkError: Connection reset by peer' during the data download phase. The data scientist notices that the dataset is 50 GB and the network throughput is low. The training script uses the default S3 download method (boto3) to copy data from S3 to the local instance storage. Which solution should the data scientist implement to resolve the issue?

182

A media company uses SageMaker to train a recommendation model. The training data is stored in an S3 bucket with versioning enabled. The data pipeline updates the training data daily by overwriting objects with new data. Recently, the model's performance degraded, and the team suspects that the training data was corrupted on a specific day. They want to train the model using the data from a previous version. How can the team retrieve the previous version of the training data?

183

A data scientist is deploying a machine learning model on Amazon SageMaker for real-time inference. The model requires low-latency predictions and must be able to handle up to 1000 requests per second. Which TWO actions should the data scientist take to ensure the endpoint can meet the performance requirements? (Choose 2.)

184

A machine learning team is using Amazon SageMaker to train a deep learning model on a large dataset stored in Amazon S3. The training job is taking too long. The team wants to reduce training time without modifying the model architecture. Which THREE actions should the team take? (Choose 3.)

185

A machine learning engineer is deploying a custom XGBoost model for real-time inference on Amazon SageMaker. The model was trained using the SageMaker XGBoost built-in algorithm. The endpoint is deployed with an ml.m5.large instance and is receiving around 50 requests per second. The engineer notices that the endpoint's latency is around 200 ms, but the requirement is under 100 ms. The model's serialized format is a .tar.gz file. The engineer wants to reduce inference latency without modifying the model or retraining. What should the engineer do?

186

A machine learning team is using Amazon SageMaker to train a PyTorch model on a dataset that is 500 GB in size. The training job runs on a single ml.p3.2xlarge instance, but the training takes over 48 hours, which exceeds the maximum allowed time. The team wants to reduce training time to under 24 hours. They are open to using multiple instances and have budget for up to 4 instances. The dataset is stored in Amazon S3 and can be split into shards by a key. The model architecture must remain unchanged. What should the team do?

187

A company is using Amazon SageMaker to host a model for real-time inference. The model was trained using SageMaker's built-in Linear Learner algorithm. The endpoint has been running for a week, and the operations team notices that the endpoint's latency has increased from 50 ms to 150 ms over the past few days. The number of requests per second has remained steady at about 200. The team suspects a memory leak in the inference container. What should the team do to diagnose the issue?

188

A data scientist is using Amazon SageMaker to train a TensorFlow model on a dataset that includes sensitive personal information (PII). The data is stored in Amazon S3 with server-side encryption using AWS KMS (SSE-KMS). The training job fails with an Access Denied error when trying to read from S3. The data scientist has already verified that the SageMaker execution role has s3:GetObject permissions on the S3 bucket. What additional configuration is needed?

189

A company is using Amazon SageMaker to deploy a model for real-time inference. The endpoint uses an ml.c5.xlarge instance. The company wants to reduce costs without affecting performance. The current traffic pattern shows a daily peak of 500 requests per second for 2 hours, and the rest of the day sees fewer than 50 requests per second. The model has a cold start time of about 30 seconds. What should the company do?

190

A machine learning engineer is deploying a model on Amazon SageMaker that was trained using a custom Docker container. The container is stored in Amazon ECR. The engineer creates a SageMaker model and endpoint configuration, but when creating the endpoint, it fails with an error: 'Could not find the inference code at the expected path.' The engineer verified that the container image is correct and the model artifacts are in S3. What is the most likely cause?

191

A data scientist is using Amazon SageMaker to train a model with the built-in XGBoost algorithm. A hyperparameter tuning job is used to optimize the hyperparameters. The tuning job has been running for 3 hours and has completed 20 training jobs. The data scientist wants to stop the tuning job early if it is not making progress. What should the data scientist do to accomplish this?

192

A company is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires GPU inference. The company wants to minimize latency and cost. Which instance type and deployment strategy should be used?

193

A data scientist is training a model using Amazon SageMaker with a custom Docker container. The training job fails with an error: 'Resource exhausted: Out of memory'. The training data is stored in S3. What should the data scientist do to resolve this issue?

194

A machine learning engineer needs to deploy a model that performs real-time fraud detection. The model must be highly available and scalable. Which AWS service should be used to host the model?

195

A company is using Amazon SageMaker to train a model. The training job is taking too long. The data scientist notices that the GPU utilization is low. Which action should be taken to improve training performance?

196

A data scientist is using Amazon SageMaker Debugger to monitor training jobs. The training loss is decreasing but then suddenly spikes. What is the most likely cause and how should it be addressed?

197

A company wants to perform automated hyperparameter tuning for a model. Which Amazon SageMaker feature should be used?

198

A machine learning engineer is building a pipeline using Amazon SageMaker Pipelines. The pipeline has multiple steps including data preprocessing, training, and evaluation. Which statement about SageMaker Pipelines is correct?

199

A company is using Amazon SageMaker to deploy a model for real-time inference. The endpoint receives variable traffic and the company wants to optimize cost while maintaining responsiveness. Which scaling policy should be used?

200

A data scientist needs to process a large dataset (100 TB) for training a machine learning model. The data is stored in Amazon S3. Which approach is most cost-effective and efficient for data processing?

201

A company is deploying a machine learning model using Amazon SageMaker. The model must be updated frequently without downtime. Which TWO strategies can achieve this? (Choose two.)

202

A data scientist is using Amazon SageMaker to train a model and wants to track experiments, including parameters and metrics. Which THREE actions should be taken? (Choose three.)

203

A machine learning engineer is setting up a training job in Amazon SageMaker. Which THREE components are required to define a training job? (Choose three.)

204

A company is training a deep learning model on Amazon SageMaker. The training job is failing with an out-of-memory error. Which SageMaker feature should the company use to resolve this issue without changing the instance type?

205

A data scientist is deploying a machine learning model using SageMaker and wants to automate the retraining pipeline. The training data is updated daily in an S3 bucket. Which combination of AWS services should the data scientist use to trigger a new training job when new data arrives?

206

A company is using SageMaker to train a large NLP model. The training job is taking too long due to high I/O wait time. The data is stored as CSV files in S3. Which optimization should the company implement to reduce I/O wait time?

207

A machine learning team is using SageMaker to build a model. They need to track hyperparameter tuning experiments, compare results, and visualize metrics. Which SageMaker feature should they use?

208

A company has deployed a model on SageMaker for real-time inference. The endpoint is experiencing high latency during traffic spikes. Which action should the company take to reduce latency?

209

A data scientist is using SageMaker to train a model with a custom algorithm. The training script uses TensorFlow and runs on GPU instances. The training job fails with 'CUDA_ERROR_OUT_OF_MEMORY'. What is the most likely cause?

210

A company wants to use SageMaker to host multiple models behind a single endpoint to reduce costs. Which SageMaker feature should they use?

211

A machine learning team is using SageMaker to train a model. They want to ensure that the training data is encrypted at rest in the S3 bucket and that the data is also encrypted during transit. Which configuration should they use?

212

A company is using SageMaker to train a model with a large dataset that is stored in S3. The training job is taking a long time due to high I/O latency. The team has already converted the data to RecordIO format. What should they do next to reduce I/O latency?

213

Which TWO of the following are benefits of using SageMaker Managed Spot Training? (Select TWO.)

214

Which THREE of the following are valid ways to deploy a model using SageMaker? (Select THREE.)

215

Which TWO of the following are valid configurations for SageMaker Training Job resource limits? (Select TWO.)

216

An IAM policy attached to a SageMaker notebook role is shown in the exhibit. A data scientist is trying to run a training job from the notebook, but the job fails with an access denied error. The training job needs to read data from 'my-bucket' and write output to 'my-bucket'. What is the most likely cause of the failure?

217

A SageMaker training job log is shown in the exhibit. The training job fails immediately after starting. The training data is supposed to be provided via Pipe mode from S3. What is the most likely cause?

218

A SageMaker endpoint configuration is shown in the exhibit. The company wants to deploy the model to a real-time endpoint. What is missing from this configuration to successfully create the endpoint?

219

A data scientist is using Amazon SageMaker to train a model. Training is taking longer than expected. The scientist notices that the training job is using a single instance with limited GPU memory. Which action will MOST likely reduce training time?

220

An ML team deploys a real-time inference endpoint on Amazon SageMaker. Users report high latency. The model is a PyTorch model using a custom container. Which combination of changes should the team implement to reduce latency? (Choose the best answer.)

221

A company uses Amazon SageMaker to host a model for fraud detection. The model uses a custom XGBoost container. The endpoint receives about 100 requests per second, each with 50 features. The team notices that the model's predictions are occasionally incorrect for a subset of requests. Which approach should the team take to debug the issue?

222

A machine learning engineer needs to deploy a TensorFlow model to a SageMaker endpoint. The model expects a specific input format. The engineer has the model artifacts stored in an S3 bucket. Which step is REQUIRED to deploy the model?

223

A company is building a recommendation system using Amazon SageMaker. The data is stored in a large S3 bucket with millions of small CSV files. The team wants to train a factorization machines model. Which data ingestion strategy will be MOST efficient?

224

An ML team is using SageMaker Autopilot to automatically build a binary classification model. The dataset has 500,000 rows and 200 columns, with a severe class imbalance (1% positive). Which configuration should the team set to address the imbalance?

225

A company wants to use Amazon Rekognition to detect objects in images stored in an S3 bucket. The images are uploaded by users. Which IAM policy statement is necessary to allow Rekognition to read from the bucket?

226

A data scientist is using SageMaker to train a deep learning model. The training script uses TensorFlow and runs on a single ml.p3.2xlarge instance. The scientist wants to reduce training time by using multiple GPUs. What should the scientist do?

227

A team has deployed a SageMaker endpoint for a sentiment analysis model. The model was trained on text data from social media. After deployment, the team notices that the model's accuracy has dropped significantly after 3 months. Which action should the team take to detect and address this issue?

228

Which TWO actions can reduce inference latency for a SageMaker real-time endpoint? (Choose 2.)

229

Which THREE are valid considerations when deploying a large deep learning model (10 GB) on a SageMaker endpoint? (Choose 3.)

230

Which TWO SageMaker features can be used to monitor and debug training jobs? (Choose 2.)

231

Refer to the exhibit. A developer has this IAM policy attached to an IAM role used by SageMaker. When attempting to create an endpoint, the operation fails with an access denied error. What is the MOST likely cause?

232

Refer to the exhibit. A data scientist is training a PyTorch model on a SageMaker ml.p3.2xlarge instance (16 GB GPU memory). The training fails with the shown error. Which change should the scientist make to resolve the error?

233

Refer to the exhibit. An administrator has attached this IAM policy to a user. The user tries to start a SageMaker training job that uses a custom Docker image from Amazon ECR. The training job fails with an access denied error. What is the MOST likely reason?

234

A data scientist is training a model using Amazon SageMaker and notices that training is taking much longer than expected. The training job uses a single ml.p3.2xlarge instance. The data is stored in S3 and is about 50 GB in size. Which action would MOST likely reduce training time?

235

A machine learning engineer is deploying a model using Amazon SageMaker. The model is a PyTorch model that performs real-time inference with low latency requirements. The engineer wants to use automatic scaling based on the number of concurrent requests. Which SageMaker feature should be used to achieve this?

236

A company is using Amazon SageMaker to build a binary classification model. The dataset is highly imbalanced, with 95% negative class and 5% positive class. Which technique should be used to address the class imbalance?

237

An ML team is deploying a model to a SageMaker endpoint for real-time inference. The model is large (2 GB) and requires GPU for low-latency inference. The team wants to minimize cost while maintaining a response time of under 200 ms. Which instance configuration and SageMaker feature would be best?

238

A data scientist is training a model using Amazon SageMaker and wants to track hyperparameter tuning jobs, training jobs, and model metrics. The team also needs to compare experiments visually. Which AWS service should be used?

239

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket in a different AWS account. Which IAM policy configuration is required to allow SageMaker to access the data?

240

A machine learning engineer is deploying a model on SageMaker and needs to ensure that the endpoint can handle a sudden spike in traffic. The engineer expects traffic to increase by 10x during a promotional event. Which scaling strategy should be used?

241

A data scientist is using Amazon SageMaker to train a model and wants to use a custom Docker container for training. The container requires access to a private Amazon ECR repository. Which IAM role configuration is needed?

242

A company wants to use Amazon SageMaker to train a model using data that is updated daily. The training data is stored in an S3 bucket, and the team wants to automate the training process whenever new data arrives. Which AWS service should be used to trigger the SageMaker training job?

243

A company is deploying a machine learning model on Amazon SageMaker. The model needs to be updated frequently with new versions. The team wants to minimize downtime and test the new model version before routing all traffic to it. Which TWO strategies should be used together?

244

A data scientist is training a model using Amazon SageMaker and wants to reduce the training time. The training job uses a single GPU instance. Which THREE actions can reduce training time?

245

A company wants to deploy a machine learning model on Amazon SageMaker and needs to monitor the model's performance in production. Which TWO AWS services can be used to set up monitoring?

246

A data scientist is deploying a PyTorch model to Amazon SageMaker for real-time inference. The model runs on a large instance but inference latency is too high. Which action is MOST likely to reduce latency without sacrificing accuracy?

247

A team is using Amazon SageMaker to train a linear regression model on a dataset with 10 features. After training, they notice the model has high bias. Which action is MOST likely to reduce bias?

248

A machine learning engineer is using Amazon SageMaker to train a deep learning model. The training job is failing with a 'ResourceLimitExceeded' error. The engineer checks the account limits and sees that the current limit for the instance type is 2, and they are already using 2 instances for other jobs. Which approach would resolve the issue MOST cost-effectively?

249

A data scientist is using Amazon SageMaker to train a model. The training job uses a custom Docker image stored in Amazon ECR. The training job fails with an error 'CannotPullContainerError'. Which TWO actions should the data scientist take to resolve this issue? (Choose TWO.)

250

A company is deploying a machine learning model using Amazon SageMaker. To reduce costs, they want to use SageMaker Managed Spot Training. Which THREE conditions must be met for the training job to use spot instances? (Choose THREE.)

251

A data engineer is building a data pipeline for a machine learning project using Amazon SageMaker. The raw data is stored in Amazon S3. Which TWO steps are essential to ensure data privacy and security before training? (Choose TWO.)

252

An IAM policy attached to a SageMaker execution role is shown in the exhibit. When a data scientist tries to create a training job that writes logs to CloudWatch Logs, the job fails. What is the MOST likely reason?

253

An engineer runs the AWS CLI command in the exhibit to create a SageMaker endpoint configuration. The endpoint is created successfully, but when invoked, the inference response is slow. The engineer wants to test with a different instance type. Which action should the engineer take?

254

An engineer sees the error in the exhibit when trying to deploy a model from a model registry in SageMaker. What is the MOST likely cause?

255

A data scientist creates a model resource in SageMaker using the JSON configuration in the exhibit. When creating an endpoint, the deployment fails with an error 'ModelError: Cannot find inference code'. What is the MOST likely cause?

256

A company is using Amazon SageMaker to train a large natural language processing model. The training job uses a GPU instance and is expected to take several hours. The data scientist wants to monitor GPU utilization in real-time. Which approach is MOST effective?

257

A data scientist needs to deploy a trained model to Amazon SageMaker for real-time inference. The model is stored as a .tar.gz file in Amazon S3. Which AWS service is used to create a SageMaker endpoint?

258

A company uses Amazon SageMaker to train machine learning models. The training data contains personally identifiable information (PII). The company needs to ensure that the data is encrypted in transit between S3 and SageMaker. Which configuration is REQUIRED?

259

A machine learning engineer is using Amazon SageMaker to train a model. The training job is taking too long. The engineer suspects the data loading is a bottleneck. Which action would MOST effectively diagnose the issue?

260

A company is using Amazon SageMaker to run a hyperparameter tuning job. The tuning job uses Bayesian optimization. Which THREE statements about Bayesian optimization are correct? (Choose THREE.)

261

A data scientist wants to deploy a PyTorch model for real-time inference. Which SageMaker deployment option provides the lowest latency for single-digit millisecond responses?

262

A team is training a large NLP model using SageMaker. The training job fails with an OutOfMemory error. The instance type is ml.p3.2xlarge with 16 GB GPU memory. Which action should the team take to resolve the issue without changing the model architecture?

263

A company uses SageMaker Pipelines to automate model retraining. The pipeline fails intermittently at the Preprocess step with a 'ResourceLimitExceeded' error. The team uses an ml.m5.xlarge instance. What is the most likely cause?

264

A machine learning engineer needs to store and version datasets for reproducibility. Which AWS service is designed for this purpose?

265

A data scientist uses SageMaker to train a model. The training job takes 10 hours, but the team needs to reduce costs. Which approach is MOST cost-effective?

266

A company deploys a SageMaker endpoint for real-time inference. After a week, the response latency increases from 50 ms to 500 ms. CPU utilization is at 30%. What is the most likely cause?

267

A team needs to automatically retrain a model every week using new data. Which SageMaker feature is designed to schedule and automate this workflow?

268

A model deployed on a SageMaker endpoint is producing predictions that are consistently biased against a certain demographic. Which step should the team take FIRST to address this issue?

269

A machine learning team is using SageMaker to train a model with a custom Docker container. The training script runs locally but fails on SageMaker with a 'Permission denied' error when writing to /opt/ml/model. What is the likely cause?

270

A company wants to monitor SageMaker endpoints for data drift. Which TWO services can be used together to detect and alert on drift?

271

A data scientist needs to deploy a model with a custom inference container. Which THREE requirements must the container meet for SageMaker hosting?

272

A machine learning team is using SageMaker Pipelines to orchestrate a multi-step workflow. The pipeline fails with a 'ThrottlingException' when submitting a training job. Which TWO actions can reduce the likelihood of throttling?

273

A data scientist is training a deep learning model on Amazon SageMaker using the built-in Object Detection algorithm. The training job is failing with a 'ResourceLimitExceeded' error when trying to launch multiple GPU instances. Which of the following is the MOST likely cause?

274

A machine learning team is deploying a real-time inference endpoint on Amazon SageMaker for a model that requires low latency (<100 ms). The model is a PyTorch model with custom pre- and post-processing logic. The team uses a SageMaker Model with a custom inference container. After deployment, they observe that the endpoint takes over 500 ms for the first request, but subsequent requests are fast (~50 ms). What is the MOST likely cause?

275

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket. The data scientist wants to use Pipe mode for training to stream data directly from S3 instead of downloading it first. Which of the following is a prerequisite for using Pipe mode?

276

A data scientist is performing hyperparameter tuning using Amazon SageMaker Automatic Model Tuning (AMT). The job uses a random search strategy. After 20 training jobs, the best objective metric value has plateaued. The data scientist wants to explore more of the hyperparameter space. Which action should the data scientist take?

277

A company is using Amazon Forecast for demand forecasting. The data includes time series data for multiple items. The company wants to ensure that the forecast is updated daily as new data arrives. Which approach should be used to automate this process?

278

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint. The model requires GPU for inference. Which instance type should be selected?

279

A data scientist is using Amazon SageMaker Ground Truth to create a labeled dataset for object detection. The team has limited budget and wants to minimize labeling costs while ensuring high-quality labels. Which approach is MOST cost-effective?

280

A company is using Amazon SageMaker to train a model on data stored in S3. The training job needs to access data from an S3 bucket in a different AWS account. The data owner has granted cross-account access via a bucket policy. However, the training job fails with an AccessDenied error. What is the MOST likely cause?

281

A data scientist is using Amazon SageMaker to train a model with a custom Docker container. The training script reads data from an S3 bucket and writes the model artifact to an S3 bucket. The training job fails with a 'NoSuchKey' error. What is the MOST likely cause?

282

A company is using Amazon SageMaker to train a machine learning model. The training job is configured to use File mode to download data from S3 to the training instances. The training data is stored in a single S3 bucket under multiple prefixes. Which TWO actions are required to ensure the training job can access the data? (Choose TWO.)

283

A data scientist is using Amazon SageMaker to deploy a model for real-time inference. The endpoint receives a large number of requests with variable traffic patterns. The team wants to minimize cost while ensuring low latency. Which THREE actions should the team take? (Choose THREE.)

284

A machine learning team is using Amazon SageMaker to train a model. The training job uses spot instances to reduce cost. However, the training job is frequently interrupted. Which TWO actions can help mitigate the impact of spot interruptions? (Choose TWO.)

285

A company is deploying a machine learning model using SageMaker. The model is a PyTorch model that requires GPU for inference. The company wants to minimize costs while ensuring low latency. Which instance type should be used for the SageMaker endpoint?

286

A data scientist is training a deep learning model on SageMaker using a custom Docker container. The training job fails with an error indicating that the container exited with a non-zero status. The CloudWatch logs show 'FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/training/data.csv''. What is the most likely cause?

287

A company uses SageMaker to host a real-time inference endpoint. The endpoint is receiving a large number of requests, but the latency is higher than expected. The data scientist observes that the CPU utilization is low but memory utilization is high. Which action should be taken to reduce latency?

288

A machine learning engineer is deploying a model using SageMaker and wants to use automatic scaling for the endpoint based on the number of concurrent requests. The engineer has defined a scaling policy using the SageMakerVariantInvocationsPerInstance metric. However, the scaling is not triggering as expected. What could be the issue?

289

A data scientist is using SageMaker Ground Truth to create a labeled dataset for object detection. After the labeling job completes, the scientist notices that the output manifest file contains incorrect labels. What is the most efficient way to correct these labels?

290

A company is using SageMaker to train a linear regression model on a dataset that fits into memory on a single instance. The training job is taking longer than expected. The data scientist wants to reduce training time without changing the algorithm. Which approach is most effective?

291

A company is using SageMaker to host a model that performs real-time fraud detection. The model receives high request volumes with occasional spikes. The company wants to ensure that the endpoint can handle spikes without throttling while minimizing cost. Which scaling strategy should be used?

292

A data scientist is using SageMaker to train a model using the built-in XGBoost algorithm. The training job fails with the error 'AlgorithmError: Framework error: No module named 'xgboost''. What is the most likely cause?

293

A company is using SageMaker to deploy a model for real-time inference. The model requires low latency, and the company wants to test the endpoint before production. Which approach should be used to validate endpoint performance?

294

A data scientist is using SageMaker to train a model and wants to track experiments, including hyperparameters and metrics. Which TWO actions should the scientist take to set up experiment tracking? (Choose TWO.)

295

A company is deploying a machine learning model to a SageMaker endpoint and wants to ensure that the endpoint is resilient to instance failures. Which THREE steps should the company take to achieve high availability? (Choose THREE.)

296

A data scientist is training a model using SageMaker and wants to use spot instances to reduce costs. Which THREE considerations should the scientist evaluate? (Choose THREE.)

297

A data scientist wants to deploy a PyTorch model for real-time inference with latency under 100 ms. Which AWS service is most suitable?

298

A team is training an XGBoost model using SageMaker with a large dataset in S3 (100 GB). Training is taking too long. Which change will most likely reduce training time without sacrificing accuracy?

299

A company deploys a SageMaker model for inference. After a few days, response times increase significantly. CloudWatch metrics show high CPU utilization and memory usage. The model is a large ensemble. What is the most cost-effective solution?

300

A data scientist needs to run a one-time SQL query on a large dataset in S3 to create a training dataset. The query involves aggregations and joins. Which service is most suitable?


Other MLS-C01 exam domains

Data Engineering · Modeling · Exploratory Data Analysis

Frequently asked questions

What does the Machine Learning Implementation and Operations domain cover on the MLS-C01 exam?

The Machine Learning Implementation and Operations domain covers the key concepts tested in this area of the MLS-C01 exam blueprint published by Amazon Web Services. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all MLS-C01 domains — no account required.

How many Machine Learning Implementation and Operations questions are in the MLS-C01 question bank?

The Courseiva MLS-C01 question bank contains 300 questions in the Machine Learning Implementation and Operations domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Machine Learning Implementation and Operations for MLS-C01?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Machine Learning Implementation and Operations questions for MLS-C01?

Yes — the session launcher on this page draws questions exclusively from the Machine Learning Implementation and Operations domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
