MLA-C01 ML Model Development — All Questions With Answers

Question 1 (easy, multiple choice)

A data scientist is training a binary classification model using imbalanced data where the positive class is only 1% of the dataset. The scientist wants to maximize the recall for the positive class while maintaining reasonable precision. Which evaluation metric is most appropriate to tune during model selection?

Question 2 (medium, multiple choice)

A machine learning engineer is training a deep learning model on SageMaker and notices that the training loss decreases rapidly in the first few epochs but then plateaus. The validation loss starts increasing after 10 epochs. Which action should the engineer take to improve generalization?
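
The pattern described here (validation loss rising while training loss keeps falling) is the usual trigger for early stopping. The sketch below is a minimal illustration in plain Python, assuming a hypothetical list of per-epoch validation losses and a hypothetical `patience` setting; it is not tied to any SageMaker API, and early stopping is only one of several possible remedies.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: improving through epoch 10, then rising.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.52, 0.50, 0.49, 0.48, 0.47, 0.46,
              0.47, 0.49, 0.52, 0.56]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

In practice the weights from the best epoch (epoch 10 in this made-up series) would be restored rather than the weights at the stopping epoch.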

Question 3 (hard, multiple choice)

A team is deploying a machine learning model for real-time fraud detection. The model must have inference latency under 10 ms and handle up to 1000 requests per second. The model is a gradient boosting model using XGBoost. Which SageMaker hosting configuration is MOST cost-effective while meeting the requirements?

Question 4 (easy, multiple choice)

A data scientist is using Amazon SageMaker to train a linear regression model. After training, the scientist notices that the training and validation errors are both low, but the model performs poorly on new test data. What is the MOST likely cause?

Question 5 (medium, multiple choice)

A company is using SageMaker to train a neural network for image classification. The training job is taking too long. The team wants to reduce training time without sacrificing model accuracy. Which approach should they recommend?

Question 6 (hard, multiple choice)

A machine learning engineer is using SageMaker Automatic Model Tuning (AMT) to optimize hyperparameters for a random forest model. The engineer notices that the tuning job is taking too long and many hyperparameter combinations are being evaluated but not improving the objective metric. Which action should the engineer take to make the tuning more efficient?

Question 7 (easy, multiple choice)

A team is developing a model to predict customer churn. The dataset has 10,000 samples with 20 features. The target variable is binary with 15% churn rate. The team wants to use logistic regression. Which data preprocessing step is MOST important to ensure proper convergence?

Question 8 (medium, multi-select)

A data scientist is training a deep learning model using SageMaker and wants to use distributed training across multiple GPUs to reduce training time. Which TWO actions should the scientist take to configure distributed training? (Select TWO.)

Question 9 (hard, multi-select)

A machine learning engineer is deploying a custom PyTorch model to a SageMaker endpoint for real-time inference. The model requires GPU acceleration. The engineer wants to minimize latency and cost. Which THREE actions should the engineer take? (Select THREE.)

Question 10 (medium, multi-select)

A data scientist is building a text classification model using a pre-trained BERT model from the Hugging Face library on SageMaker. The scientist wants to fine-tune the model on a custom dataset. Which TWO steps are necessary to set up the fine-tuning job? (Select TWO.)

Question 11 (medium, multiple choice)

A data scientist is training a binary classification model using Amazon SageMaker. The dataset has a severe class imbalance (95% negative, 5% positive). The model achieves 99% accuracy but fails to identify positive cases correctly. Which action should the data scientist take to improve the model's ability to detect positive cases?

Question 12 (hard, multiple choice)

A machine learning engineer is deploying a pre-trained NLP model on Amazon SageMaker for real-time inference. The model expects input sequences of variable length, and performance is critical. The engineer wants to minimize latency while handling the variable-length inputs efficiently. Which approach should the engineer choose?

Question 13 (easy, multiple choice)

A team wants to track and compare multiple machine learning experiments, including hyperparameters, metrics, and artifacts. They are using Amazon SageMaker. Which AWS service or feature should they use to achieve this?

Question 14 (medium, multiple choice)

A company is using Amazon SageMaker to train a large deep learning model. The training job is taking a very long time. The data scientist suspects that the GPU utilization is low due to inefficient data loading. Which action should the data scientist take to diagnose and address this issue?

Question 15 (hard, multiple choice)

An MLOps engineer is building an automated retraining pipeline for a fraud detection model. The model must be retrained weekly, and the new model should only be promoted to production if it meets predefined performance thresholds compared to the current model. Which combination of SageMaker capabilities should the engineer use?

Question 16 (easy, multiple choice)

A data scientist is training a regression model in Amazon SageMaker. The dataset contains missing values in several features. The scientist wants to handle missing values as part of the training pipeline to ensure consistency between training and inference. Which approach should the scientist use?

Question 17 (medium, multiple choice)

A machine learning team is using Amazon SageMaker to train a model. They notice that the training job is taking longer than expected and the logs show repeated warnings about 'loss not decreasing'. Which SageMaker feature should they use to diagnose and visualize the training process?

Question 18 (hard, multi-select)

A data scientist is building a text classification model using Amazon SageMaker. The dataset is stored as a CSV file in Amazon S3. The scientist wants to use the SageMaker built-in BlazingText algorithm. Which of the following steps are required to prepare the data for training? (Choose TWO.)

Question 19 (medium, multi-select)

An MLOps team is designing a CI/CD pipeline for deploying machine learning models to production on Amazon SageMaker. They want to ensure that the deployment process is automated and that models are automatically rolled back if performance degrades. Which of the following AWS services or features should they use to achieve this? (Choose THREE.)

Question 20 (hard, multiple choice)

A financial services company is deploying a real-time fraud detection model using Amazon SageMaker. The model is a gradient boosting model (XGBoost) trained on historical transaction data. The inference endpoint uses an ml.m5.2xlarge instance with a single variant. Recently, the company has experienced a 3x increase in transaction volume during peak hours, causing inference latency to exceed the 200ms SLA. The data science team has already optimized the model by reducing the number of trees and feature set, but the latency remains high during spikes. The team considers using SageMaker's built-in scaling policies. They currently have a single endpoint with one production variant. The team wants to maintain low latency without over-provisioning resources. They have ruled out model changes. Which approach should the team take?

Question 21 (hard, multiple choice)

A data science team at a financial services company is deploying a real-time fraud detection model using Amazon SageMaker. The model is a gradient boosting classifier trained on historical transaction data. The model is deployed to a SageMaker endpoint with an ml.m5.large instance for real-time inference. After deployment, the team observes that the endpoint's latency spikes to over 2 seconds during peak hours (10:00-12:00 and 14:00-16:00), causing timeouts for client applications. The average latency during off-peak hours is 200 ms. The team has enabled auto-scaling with a target average CPU utilization of 70%, but the endpoint still experiences high latency during peak hours. The instance count never scales beyond 2 instances during peaks. The model size is 500 MB, and each request includes 200 features. The team needs to reduce latency to under 500 ms at the 99th percentile during peak hours without increasing costs beyond the current budget. Which course of action should the team take?

Question 22 (medium, multiple choice)

A machine learning engineer is developing a text classification model using Amazon SageMaker. The dataset consists of 1 million customer reviews, with labels indicating sentiment (positive, negative, neutral). The engineer uses a pre-trained BERT model from the Hugging Face Model Hub and fine-tunes it on the dataset using SageMaker's Hugging Face estimator with an ml.p3.2xlarge instance. After 2 hours of training, the training job fails with a 'ResourceExhaustedError: CUDA out of memory' error. The error occurs during the forward pass of the first epoch. The engineer confirms that the batch size is set to 32, the maximum sequence length is 512 tokens, and the dataset is stored in an S3 bucket in the same AWS Region. The engineer needs to complete fine-tuning without increasing instance costs. Which course of action should the engineer take?
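
One commonly suggested way to fit a batch of 32 into limited GPU memory is to process smaller micro-batches and accumulate their gradients before each optimizer step. The sketch below uses a hypothetical 1-D linear model and toy data in plain Python (no GPU or deep learning framework) to show only the arithmetic: the average of four micro-batch gradients of size 8 matches the gradient of the full batch of 32. The micro-batch size of 8 is an assumption chosen for illustration, not a value from the question.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical toy data standing in for one batch of 32 examples.
xs = [0.1 * i for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# Gradient computed on the whole batch at once (the memory-hungry path).
full_batch_grad = mse_gradient(w, xs, ys)

# Same gradient computed from 4 micro-batches of 8 examples each.
micro_batch_size = 8
accumulation_steps = len(xs) // micro_batch_size
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    # Scale each micro-batch gradient so the running sum is a full-batch average.
    accumulated_grad += mse_gradient(w, xs[start:end], ys[start:end]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

In a real training loop the memory saving comes from holding activations for only 8 sequences at a time; the optimizer step is taken once per 4 micro-batches, so the effective batch size stays at 32.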

Question 23 (easy, multiple choice)

A data scientist is training a binary classification model using a dataset that has a severe class imbalance (90% negative, 10% positive). Which technique should be used to address the imbalance during model training?

Question 24 (easy, multiple choice)

A machine learning engineer is training a model using SageMaker's built-in XGBoost algorithm. The training job fails with an error indicating insufficient memory. Which parameter should be adjusted to reduce memory usage?

Question 25 (medium, multiple choice)

A team is tuning hyperparameters for a neural network using SageMaker's HyperparameterTuningJob with Bayesian optimization. After several trials, the objective metric has not improved significantly. Which action is most likely to help continue making progress?

Question 26 (medium, multiple choice)

A data scientist is training a model on SageMaker and wants to reduce training time by using multiple GPUs. The model is small enough to fit on a single GPU, but training is slow. Which SageMaker feature should be used?

Question 27 (hard, multiple choice)

A machine learning engineer is training a deep learning model using TensorFlow in SageMaker. The training runs on an ml.p3.16xlarge instance (8 GPUs). The engineer notices that GPU utilization is low (~30%) and time per epoch is high. The model uses a custom training loop. Which configuration change is most likely to improve GPU utilization?

Question 28 (easy, multiple choice)

A data scientist is performing feature engineering on a dataset containing a categorical feature with high cardinality (over 1000 unique values). Which encoding method is most appropriate to use as input for a tree-based model?

Question 29 (medium, multiple choice)

A team is evaluating classification models for a medical diagnosis application. The cost of a false negative is much higher than the cost of a false positive. Which metric should be optimized during model selection?

Question 30 (hard, multiple choice)

A machine learning engineer is using SageMaker to train a model with the built-in LightGBM algorithm. The engineer wants to use early stopping to prevent overfitting. The training job is configured with a validation dataset. Which hyperparameter should be set to enable early stopping?

Question 31 (medium, multiple choice)

A data scientist has trained a model that achieves 95% accuracy on the training set but only 70% on the test set. Which of the following is the most likely cause?

Question 32 (medium, multi-select)

A data engineer is preparing a dataset for training a regression model. The dataset contains numerical features with missing values. Which two methods are appropriate for handling missing values? (Choose two.)
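
Mean and median imputation are two widely used ways to fill missing numerical values. The sketch below is a minimal plain-Python illustration on a hypothetical column in which `None` marks a missing entry; the function name `impute` and the sample values are assumptions made for this example.

```python
import statistics


def impute(values, strategy="median"):
    """Fill None entries with the mean or median of the observed values.

    Returns the filled list and the fill value, so the same fill value can be
    reused on new data at inference time.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values], fill


# Hypothetical feature column with two missing values and one outlier (95.0).
column = [10.0, None, 12.0, 11.0, None, 95.0]
mean_filled, mean_fill = impute(column, strategy="mean")
median_filled, median_fill = impute(column, strategy="median")
print(mean_fill, median_fill)
```

With an outlier in the column, the mean is pulled toward it while the median stays near the bulk of the data, which is why the choice between the two usually depends on how skewed the feature is.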

Question 33 (hard, multi-select)

A machine learning engineer is using SageMaker's HyperparameterTuningJob to optimize a neural network. The engineer observes that the tuning job is taking too long. Which three actions can reduce the tuning time? (Choose three.)

Question 34 (hard, multi-select)

A data scientist is training a large transformer model using SageMaker's model parallelism library. The training job is failing with an out-of-memory (OOM) error. Which two actions can help resolve the OOM error? (Choose two.)

Question 35 (hard, multiple choice)

Refer to the exhibit. A data scientist runs a SageMaker training job with the configuration shown in the exhibit. The training completes, but the model performance is poor. Which change to the hyperparameters is most likely to improve the model's AUC?

Exhibit

{
  "TrainingJobName": "my-xgboost-job",
  "HyperParameters": {
    "num_round": "100",
    "max_depth": "6",
    "eta": "0.3",
    "subsample": "0.8",
    "colsample_bytree": "0.8",
    "objective": "binary:logistic",
    "eval_metric": "auc"
  },
  "InputDataConfig": [
    {
      "ChannelName": "train",
      "DataSource": {
        "S3DataSource": {
          "S3Uri": "s3://my-bucket/train.csv",
          "S3DataType": "S3Prefix"
        }
      }
    },
    {
      "ChannelName": "validation",
      "DataSource": {
        "S3DataSource": {
          "S3Uri": "s3://my-bucket/validation.csv",
          "S3DataType": "S3Prefix"
        }
      }
    }
  ],
  "AlgorithmSpecification": {
    "TrainingImage": "811284229777.dkr.ecr.us-west-2.amazonaws.com/xgboost:1.5-1",
    "TrainingInputMode": "File"
  },
  "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
  "OutputDataConfig": {
    "S3OutputPath": "s3://my-bucket/output"
  },
  "ResourceConfig": {
    "InstanceType": "ml.m5.xlarge",
    "InstanceCount": 1,
    "VolumeSizeInGB": 30
  },
  "StoppingCondition": {
    "MaxRuntimeInSeconds": 86400
  }
}

Question 36 (medium, multiple choice)

Refer to the exhibit. A data scientist receives the error shown in the exhibit when running a SageMaker training job. Which action will resolve the issue?

Exhibit

Question 37 (easy, multiple choice)

Refer to the exhibit. A data scientist reviews the output of a SageMaker training job. The model has 95% training accuracy and 92% validation accuracy. Which statement is true?

Exhibit

Model Artifacts:
  ModelArtifacts:
    S3ModelArtifacts: s3://my-bucket/output/model.tar.gz
  ModelMetrics:
    Metrics:
      training:accuracy: 0.95
      validation:accuracy: 0.92
  FinalHyperParameters:
    learning_rate: 0.01
    batch_size: 32
    epochs: 10

Question 38 (medium, multiple choice)

A company is building a binary classifier for credit default prediction. The dataset is highly imbalanced (98% no default). They want to maximize recall for the minority class while maintaining reasonable precision. Which metric should be optimized during hyperparameter tuning?

Question 39 (hard, multiple choice)

A data scientist trained a logistic regression model on a dataset with 100 features. After training, the training accuracy is 0.99 but validation accuracy is 0.75. Which action is MOST likely to reduce overfitting?

Question 40 (easy, multiple choice)

A team uses SageMaker for training. They need to monitor training progress and view metrics like loss and accuracy. Which SageMaker feature should they use?

Question 41 (medium, multiple choice)

A company wants to deploy a machine learning model that makes real-time predictions for a mobile app. The model is a deep neural network with a large model size (500 MB). Which SageMaker endpoint configuration is most cost-effective while meeting low-latency requirements?

Question 42 (easy, multiple choice)

A data scientist needs to store training data in Amazon S3 and wants to optimize read performance for iterative training jobs. Which S3 feature should they use?

Question 43 (hard, multiple choice)

A team trained a gradient boosting model with the following hyperparameters: learning_rate=0.1, n_estimators=1000, max_depth=6. The model achieves excellent training accuracy but poor validation accuracy. They suspect overfitting. Which hyperparameter change is LEAST likely to help?

Question 44 (easy, multiple choice)

A company uses SageMaker to train a model. The training job is failing with an error "ResourceLimitExceeded". What is the most likely cause?

Question 45 (medium, multiple choice)

A team is using SageMaker for automatic model tuning. They want to minimize the mean absolute error (MAE) and have a budget of 50 training jobs. Which tuning strategy should they choose to best explore the hyperparameter space?

Question 46 (hard, multiple choice)

A model deployed on SageMaker is returning inaccurate predictions for certain customer segments. The team suspects data drift. Which SageMaker feature should they use to continuously monitor input data distribution?

Question 47 (medium, multi-select)

A data scientist is using Amazon SageMaker Data Wrangler for data preparation. Which two tasks can be performed using Data Wrangler's built-in transforms? (Choose two.)

Question 48 (medium, multi-select)

A data scientist is building a text classification model using Amazon SageMaker. The dataset is large and includes imbalanced classes. Which three techniques can help improve model performance? (Choose three.)

Question 49 (hard, multi-select)

A company uses SageMaker to train a model. They want to ensure that training data is encrypted at rest and in transit, and that only authorized users can access the training artifacts. Which three steps should they take? (Choose three.)

Question 50 (hard, multiple choice)

Refer to the exhibit. The training job failed. What is the MOST likely cause?

Exhibit

[2024-01-15 10:30:45] Training job 'my-training-job' started.
[2024-01-15 10:31:10] Using algorithm 'built-in' with hyperparameters: {'epochs': 10, 'batch-size': 32, 'learning-rate': 0.001}
[2024-01-15 10:31:15] File system creation failed: No usable scratch space. Error: Input/output error.
[2024-01-15 10:31:15] Retrying with local SSD...
[2024-01-15 10:31:20] Training completed with status 'Failed'.

Question 51 (easy, multiple choice)

Refer to the exhibit. The data scientist wants to update the endpoint to use a new model version without downtime. Which approach should they use?

Exhibit

{
  "EndpointConfigName": "my-config",
  "ProductionVariants": [
    {
      "VariantName": "variant1",
      "ModelName": "my-model-v1",
      "InitialInstanceCount": 1,
      "InstanceType": "ml.c5.large",
      "InitialVariantWeight": 1.0
    }
  ]
}

Question 52 (easy, multiple choice)

Refer to the exhibit. A user launches a SageMaker notebook instance with this lifecycle configuration. What happens?

Exhibit

#!/bin/bash
set -e
cd /home/ec2-user/SageMaker
git clone https://github.com/org/repo.git
pip install -r requirements.txt

Question 53 (easy, multiple choice)

A data scientist is building a model to predict customer churn based on historical data. The dataset has 10 features and 100,000 records, and the target is binary. Which algorithm is most appropriate for this binary classification problem?

Question 54 (easy, multiple choice)

A machine learning engineer trains a binary classifier and obtains an accuracy of 95% on the test set. The dataset is imbalanced with 95% positive class. What is the most important metric to evaluate the model's performance?

Question 55 (easy, multiple choice)

Which technique is commonly used to handle missing values in a categorical feature?

Question 56 (medium, multiple choice)

A team is using Amazon SageMaker to train a neural network. They want to minimize training time while effectively exploring the hyperparameter space. Which approach should they use?

Question 57 (medium, multiple choice)

A trained model needs to be deployed for real-time inference with low latency. Which AWS service is best suited for this?

Question 58 (medium, multiple choice)

Which feature scaling method is most robust to outliers in the data?
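
To make the effect of outliers on scaling concrete, the sketch below compares min-max scaling with median/IQR ("robust") scaling on a small hypothetical sample containing one outlier. It is plain Python using only the standard library; the sample values and function names are assumptions chosen for illustration.

```python
import statistics


def min_max_scale(values):
    """Scale to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Hypothetical feature: five ordinary values and one outlier (100.0).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
mm = min_max_scale(data)
rb = robust_scale(data)

# Spread of the five ordinary values after each kind of scaling.
inlier_spread_min_max = max(mm[:5]) - min(mm[:5])
inlier_spread_robust = max(rb[:5]) - min(rb[:5])
print(inlier_spread_min_max, inlier_spread_robust)
```

Under min-max scaling the single outlier sets the range, so the ordinary values are squeezed into a narrow band near zero; median/IQR scaling keeps them well separated because neither statistic is moved much by one extreme point.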

Question 59 (hard, multiple choice)

A model has high training accuracy but low validation accuracy. Which action is least likely to reduce overfitting?

Question 60 (hard, multiple choice)

A company wants to forecast monthly sales that show clear seasonality. Which algorithm is most suitable?

Question 61 (hard, multiple choice)

A company wants to use a pre-trained NLP model from SageMaker JumpStart for sentiment analysis. Which step is required to make predictions?

Question 62 (easy, multi-select)

Which TWO data storage options are commonly used by Amazon SageMaker Feature Store for offline and online storage?

Question 63 (medium, multi-select)

Which THREE steps are part of the typical workflow when using SageMaker built-in algorithms?

Question 64 (hard, multi-select)

Which TWO tools are specifically designed for debugging and analyzing training jobs in SageMaker?

Question 65 (easy, multiple choice)

Refer to the exhibit. A SageMaker training job failed. Based on the error message, which action should the engineer take?

Exhibit

{
    "TrainingJobName": "job-123",
    "TrainingJobStatus": "Failed",
    "FailureReason": "ClientError: Review the error message. Training failed due to insufficient instance memory.",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1",
        "TrainingInputMode": "File"
    },
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30
    }
}

Question 66 (medium, multiple choice)

Refer to the exhibit. A data scientist receives an AccessDenied error when trying to create a training job using SageMaker. What is the most likely cause?

Exhibit

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:CreateTrainingJob",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/*"
        }
    ]
}

Question 67 (hard, multiple choice)

Refer to the exhibit. A SageMaker training job logs show training AUC increasing but validation AUC plateauing at 0.880. What is the most likely issue?

Exhibit

[1] #011train-auc:0.890
[2] #011train-auc:0.895
[3] #011train-auc:0.892
[4] #011validation-auc:0.880

Question 68 (medium, multiple choice)

A data scientist is training a deep learning model on Amazon SageMaker and notices that the training loss decreases but the validation loss starts increasing after a certain number of epochs. The model is likely overfitting. Which SageMaker feature can they use to detect and diagnose this issue during training?

Question 69 (easy, multiple choice)

A company wants to build a machine learning model to predict house prices based on features like square footage, number of bedrooms, and location. The target variable is a continuous numeric value. Which Amazon SageMaker built-in algorithm is most appropriate for this task?

Question 70 (hard, multiple choice)

A machine learning team is training a large natural language processing model on Amazon SageMaker using the SageMaker Hugging Face container. The training job runs on multiple instances and uses Managed Spot Training to reduce costs. However, the job frequently gets interrupted by Spot interruptions, causing long training times. What should the team do to mitigate this issue?

Question 71 (medium, multiple choice)

A data scientist has trained a binary classification model for fraud detection. The dataset is highly imbalanced (99% non-fraud, 1% fraud). After evaluation, the model shows an accuracy of 99%, but the recall for fraud cases is only 10%. Which metric should the data scientist prioritize to improve the model's performance for fraud detection?
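
The numbers in this scenario can be reproduced with a small worked example. The sketch below assumes a hypothetical evaluation set of 10,000 transactions (100 fraud, 9,900 non-fraud) and picks confusion-matrix counts consistent with 99% accuracy and 10% recall; the specific counts are an assumption, since the question does not give them.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 100 fraud cases, of which only 10 are caught,
# and 9,900 non-fraud cases, of which 10 are wrongly flagged.
metrics = classification_metrics(tp=10, fp=10, fn=90, tn=9890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy stays at 0.99 because the 9,890 true negatives dominate the total, while recall and F1 expose that 90 of the 100 fraud cases are missed.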

Question 72 (easy, multiple choice)

A machine learning engineer is using Amazon SageMaker Experiments to track multiple training runs. They want to compare the performance of different hyperparameter configurations visually. Which SageMaker tool provides an interactive interface to compare experiments?

Question 73 (hard, multiple choice)

A data scientist is running a SageMaker training job with a custom PyTorch image. The training script loads a large dataset into memory, and the job fails with an out-of-memory error after a few minutes. The instance type is ml.m5.xlarge (16 GB RAM). What should the data scientist do to resolve this issue without changing the instance type?

Question 74 (medium, multiple choice)

An ML engineer is using Amazon SageMaker Automatic Model Tuning (AMT) to optimize hyperparameters for a gradient boosting model. The tuning job is taking a long time and has completed many training jobs. The engineer wants to stop training jobs that are unlikely to improve the objective metric. What should they configure?
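
Stopping unpromising training jobs early is often described in terms of a median stopping rule: after each epoch, a job is stopped if its objective metric is worse than the median of the running averages of earlier jobs at the same epoch. The sketch below illustrates that general idea in plain Python with hypothetical loss histories; it is not the service's implementation, and the function names are assumptions.

```python
import statistics


def running_average(values):
    """Average of the metric values observed so far."""
    return sum(values) / len(values)


def should_stop(current_history, completed_histories, minimize=True):
    """Apply a median stopping rule to one training job at its current epoch."""
    epoch = len(current_history)
    # Only compare against earlier jobs that reached at least this epoch.
    comparable = [h[:epoch] for h in completed_histories if len(h) >= epoch]
    if not comparable:
        return False
    median_avg = statistics.median(running_average(h) for h in comparable)
    current = current_history[-1]
    return current > median_avg if minimize else current < median_avg


# Hypothetical per-epoch validation losses from three earlier training jobs.
completed = [
    [0.9, 0.7, 0.5],
    [0.8, 0.6, 0.4],
    [1.0, 0.9, 0.8],
]
poor_job_stops = should_stop([1.2, 1.1], completed)        # far worse than typical
promising_job_stops = should_stop([0.85, 0.65], completed)  # better than typical
print(poor_job_stops, promising_job_stops)
```

The rule needs a metric that is emitted after every epoch, so it only helps for algorithms and training scripts that report the objective metric incrementally.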

Question 75 (easy, multiple choice)

A company has a trained machine learning model that needs to be deployed as a real-time inference endpoint on Amazon SageMaker. The endpoint must automatically scale based on incoming traffic. Which SageMaker feature should be used?
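
Scaling a real-time endpoint on traffic is typically configured through Application Auto Scaling with a target tracking policy on invocations per instance. The sketch below only builds the two request payloads as plain dictionaries and does not call AWS; the endpoint name, variant name, capacity limits, target value, and policy name are hypothetical values chosen for illustration.

```python
def build_autoscaling_requests(endpoint_name, variant_name,
                               min_instances, max_instances,
                               target_invocations_per_instance):
    """Build request payloads for registering a scalable target and policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    scalable_target = {
        **common,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return scalable_target, scaling_policy


# Hypothetical endpoint and limits.
target, policy = build_autoscaling_requests(
    "my-endpoint", "variant1",
    min_instances=1, max_instances=4,
    target_invocations_per_instance=100.0,
)
print(target["ResourceId"])
```

With credentials configured, these dictionaries could be passed as keyword arguments to the `register_scalable_target` and `put_scaling_policy` calls of a boto3 `application-autoscaling` client.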

Question 76 (hard, multiple choice)

A data science team is using Amazon SageMaker Pipelines to orchestrate a multi-step workflow that includes data preprocessing, training, and model evaluation. They want to reuse the preprocessed data across multiple pipeline executions without re-running the preprocessing step if the source data hasn't changed. What should they configure?

Question 77 (medium, multi-select)

A data scientist is preparing text data for a sentiment analysis model using Amazon SageMaker. Which two data preprocessing techniques are commonly used when working with text data for natural language processing? (Choose two.)

Question 78 (hard, multi-select)

A machine learning engineer is evaluating a binary classification model for detecting fraudulent transactions. The dataset is highly imbalanced, and the cost of false negatives (missing a fraud) is very high. Which two evaluation metrics should the engineer consider? (Choose two.)

Question 79 (medium, multi-select)

A team is training a deep learning model on Amazon SageMaker using a custom Docker container. Which three practices should they follow to optimize training performance? (Choose three.)

Question 80 (hard, multiple choice)

What is the most likely cause of this error?

Exhibit

Refer to the exhibit. You are investigating a failed SageMaker training job. The following message appears in the training job's CloudWatch Logs:

2024-08-15 14:30:45,212 sagemaker-training-toolkit ERROR    ClientError: 
An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: 
The account-level service limit for 'ml.p3.16xlarge for training job usage' is 0.

Question 81 (medium, multiple choice)

What will the debugger do with this configuration?

Exhibit

Refer to the exhibit. You are configuring SageMaker Debugger for a training job. The following is part of the debugger configuration:

{
    "DebugHookConfig": {
        "CollectionConfigurations": [
            {
                "CollectionName": "gradients",
                "Parameters": {
                    "save_interval": "500"
                }
            }
        ]
    },
    "DebugRules": [
        {
            "RuleConfigurationName": "LossNotDecreasing",
            "RuleParameters": {
                "rule_to_use": "LossNotDecreasing",
                "save_interval": "500",
                "patience": "10",
                "threshold": "0.001"
            }
        }
    ]
}

Question 82 (easy, multiple choice)

The training job completes successfully but the model performance is poor. What is a likely cause?

Exhibit

Refer to the exhibit. A data scientist creates a SageMaker training job with the following configuration:

{
    "AlgorithmSpecification": {
        "TrainingImage": "382416733822.dkr.ecr.us-west-2.amazonaws.com/xgboost:1",
        "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/train/",
                    "S3DataDistributionType": "FullyReplicated"
                }
            }
        }
    ],
    "HyperParameters": {
        "objective": "reg:squarederror",
        "num_round": "50",
        "max_depth": "10"
    },
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10
    }
}

Question 83easymultiple choice

Read the full ML Model Development explanation →

A data scientist is training a linear regression model on a dataset with 10 features. After training, the model shows high training accuracy but poor test accuracy. Which of the following is the most likely cause?

Question 84mediummultiple choice

Read the full ML Model Development explanation →

A company uses Amazon SageMaker to train a custom XGBoost model. The training job runs on a single ml.m5.large instance and takes 2 hours. To reduce training time without changing the algorithm, what should the data scientist do?

Question 85hardmultiple choice

Read the full ML Model Development explanation →

A machine learning team is developing a deep learning model for image classification. They observe that the training loss decreases rapidly but the validation loss starts increasing after a few epochs. Which strategy should they implement to address this issue?

Question 86easymultiple choice

Read the full ML Model Development explanation →

A data scientist wants to evaluate the performance of a binary classification model. The dataset is highly imbalanced with only 5% positive class. Which metric should be used to evaluate the model?
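
As background for this scenario, the sketch below shows why plain accuracy can mislead at a 5% positive rate. It is a minimal illustration in plain Python; the 100-sample dataset, both sets of predictions, and the helper names `confusion_counts` and `summarize` are assumptions made up for the example, not part of the question.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall and F1, guarding against division by zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# A degenerate model that never predicts the positive class.
always_negative = [0] * 100
baseline = summarize(y_true, always_negative)

# A hypothetical model that finds 4 of the 5 positives at the cost of 6 false alarms.
useful_model = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 89
useful = summarize(y_true, useful_model)

print(baseline)
print(useful)
```

In this made-up case the degenerate model has the higher accuracy, while precision, recall, and F1 separate the two models in the direction one would expect.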

Question 87mediummultiple choice

Read the full ML Model Development explanation →

During model training on Amazon SageMaker, the training job fails with a 'ResourceLimitExceeded' error. What is the most likely cause?

Question 88hardmultiple choice

Read the full ML Model Development explanation →

A data scientist is using Amazon SageMaker Debugger to monitor training metrics. They want to stop training automatically if the model is overfitting. Which action should they take?

Question 89easymultiple choice

Read the full ML Model Development explanation →

Which of the following is a recommended practice for preparing training data in Amazon SageMaker?

Question 90mediummultiple choice

Read the full ML Model Development explanation →

A data scientist is training a logistic regression model and wants to use L1 regularization to create a sparse model. Which parameter should be adjusted?

Question 91hardmultiple choice

Read the full ML Model Development explanation →

A team is using Amazon SageMaker's Automatic Model Tuning (AMT) to optimize hyperparameters for a random forest model. After 10 training jobs, the best objective metric value plateaus. The team wants to explore the search space more broadly. Which AMT strategy should they use?

Question 92easymulti select

Read the full ML Model Development explanation →

A data scientist is splitting a dataset into training and test sets. Which two practices should they follow? (Select TWO.)

Question 93mediummulti select

Read the full ML Model Development explanation →

A machine learning engineer is training a neural network using Amazon SageMaker. The training job uses a single GPU instance. To improve training speed using distributed training, which two steps should they take? (Select TWO.)

Question 94hardmulti select

Read the full ML Model Development explanation →

A data scientist is developing a gradient boosting model and observes that the model is overfitting to the training data. Which three techniques can help reduce overfitting? (Select THREE.)

Question 95easymultiple choice

Read the full ML Model Development explanation →

Refer to the exhibit. A data scientist ran a training job using a custom algorithm container. The job failed with the error shown. What is the most likely cause?

Exhibit

{
    "TrainingJobName": "my-training-job",
    "TrainingJobStatus": "Failed",
    "FailureReason": "ClientError: Cannot evaluate expression: loss",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/custom-latest",
        "TrainingInputMode": "File"
    },
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400
    },
    "OutputDataConfig": {
        "S3OutputPath": "s3://my-bucket/output"
    }
}

Question 96mediummultiple choice

Read the full ML Model Development explanation →

Refer to the exhibit. A data scientist configured an automatic model tuning job for a classification model. The tuning job completed after 20 training jobs, but the best validation accuracy was only 0.65. What is the most effective way to potentially improve the result?

Exhibit

{
    "HyperParameterTuningJobConfig": {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": "validation:accuracy"
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": 20,
            "MaxParallelTrainingJobs": 5
        },
        "TrainingJobDefinition": {
            "StaticHyperParameters": {
                "epochs": "50"
            },
            "AlgorithmSpecification": {
                "TrainingImage": "some-image",
                "TrainingInputMode": "File"
            },
            "InputDataConfig": [
                {
                    "ChannelName": "train",
                    "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://bucket/train.csv" } }
                }
            ],
            "OutputDataConfig": { "S3OutputPath": "s3://bucket/output" },
            "ResourceConfig": { "InstanceType": "ml.m5.large", "InstanceCount": 1 },
            "StoppingCondition": { "MaxRuntimeInSeconds": 3600 }
        }
    }
}

Question 97hardmultiple choice

Read the full ML Model Development explanation →

Refer to the exhibit. A data scientist configured SageMaker Debugger to monitor training for overfitting. However, the rule never triggers even though the model appears to be overfitting. What is the most likely reason?

Exhibit

{
    "DebugHookConfig": {
        "S3OutputPath": "s3://my-bucket/debug/",
        "CollectionConfigurations": [
            {"CollectionName": "losses"},
            {"CollectionName": "gradients"}
        ]
    },
    "DebugRuleConfigurations": [
        {
            "RuleConfigurationName": "Overfitting",
            "RuleEvaluatorImage": "...",
            "InstanceType": "ml.t3.medium",
            "VolumeSizeInGB": 5
        }
    ]
}

Question 98mediummultiple choice

Read the full ML Model Development explanation →

A company is training a deep learning model on Amazon SageMaker. The training job started but has been stuck in 'InProgress' state for an unusually long time with low CPU utilization. The data scientist suspects a bottleneck. What should be the first troubleshooting step?

Question 99easymultiple choice

Read the full ML Model Development explanation →

A data engineer needs to split a time-series dataset into training and validation sets for a forecasting model. Which split method should be used to avoid data leakage?
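
For context, a chronological split keeps every validation record later in time than every training record, so the model never sees the future during training. The sketch below is one minimal way to express that in plain Python; the `(timestamp, value)` record layout, the 20% validation fraction, and the function name `chronological_split` are assumptions for illustration.

```python
def chronological_split(records, validation_fraction=0.2):
    """Split (timestamp, value) records so validation data is strictly the most recent.

    Records are sorted by timestamp first; nothing is shuffled.
    """
    ordered = sorted(records, key=lambda record: record[0])
    n_validation = round(len(ordered) * validation_fraction)
    cut = len(ordered) - n_validation
    return ordered[:cut], ordered[cut:]


# Hypothetical daily series, deliberately given out of order.
series = [(day, day * 1.5) for day in (7, 3, 10, 1, 5, 9, 2, 8, 4, 6)]
train, validation = chronological_split(series)

print([t for t, _ in train])       # earliest timestamps
print([t for t, _ in validation])  # latest timestamps
```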

Question 100hardmultiple choice

Read the full NAT/PAT explanation →

A machine learning engineer runs a SageMaker HyperparameterTuningJob with Bayesian optimization strategy. The job terminates earlier than the specified MaxNumberOfTrainingJobs. The engineer notices that the best objective metric value has not improved for several consecutive jobs. What is the most likely adjustment to make?

Question 101mediummultiple choice

Read the full NAT/PAT explanation →

A company deploys a real-time inference endpoint on SageMaker for a customer-facing application. Traffic patterns are unpredictable and sometimes spike. The endpoint must scale automatically to handle load while minimizing cost. Which approach should the company take?

Question 102mediummultiple choice

Read the full ML Model Development explanation →

A data scientist trains a neural network on SageMaker using the TensorFlow framework. The training accuracy is lower than expected, and the scientist suspects vanishing gradients. How can the scientist leverage SageMaker Debugger to diagnose this?

Question 103hardmultiple choice

Read the full ML Model Development explanation →

A machine learning team wants to detect bias in a binary classification model before deployment. They use SageMaker Clarify. Which type of bias metric should they compute to understand whether the model treats different demographic groups unfairly in predictions?

Question 104easymultiple choice

Study the full Python automation breakdown →

A data scientist wants to train an XGBoost model using the SageMaker Python SDK with a custom training script. Which estimator class should be used?

Question 105easymultiple choice

Read the full ML Model Development explanation →

A team uses SageMaker Experiments to track multiple training runs. They need to register the best-performing model in the model registry for approval. Which method ensures the model artifacts and metadata are captured correctly?

Question 106mediummultiple choice

Study the full Python automation breakdown →

An ML engineer creates a SageMaker inference pipeline with two containers: a preprocessor and a predictor. The preprocessor is a lightweight Python script that transforms input data. How should the engineer structure the endpoints to ensure both containers run sequentially?

Question 107hardmultiple choice

Read the full ML Model Development explanation →

Refer to the exhibit. A data scientist ran a SageMaker training job using a built-in algorithm. The job failed with the error shown in the exhibit. What is the most likely cause?

Exhibit

{
  "TrainingJobStatus": "Failed",
  "FailureReason": "AlgorithmError: Data does not conform to the expected format. Please check that the input CSV has headers matching the training schema.",
  "TrainingJobName": "my-model-training-20240301"
}

Question 108mediummulti select

Read the full ML Model Development explanation →

Which TWO options are recommended best practices for monitoring model performance in production on SageMaker? (Choose 2.)

Question 109hardmulti select

Read the full ML Model Development explanation →

Which THREE steps should be taken to optimize a large-scale distributed training job on SageMaker? (Choose 3.)

Question 110mediummulti select

Read the full ML Model Development explanation →

Which TWO SageMaker Pipelines steps are essential for automating a complete ML workflow from data processing to model deployment? (Choose 2.)

Question 111hardmultiple choice

Read the full NAT/PAT explanation →

A financial services company is training a large natural language processing (NLP) model using PyTorch on a SageMaker distributed training job. The cluster consists of 4 ml.p3.16xlarge instances (8 GPUs each). The training job runs successfully but takes 72 hours, exceeding the allotted 48-hour window. The team must reduce training time without sacrificing model quality. The model architecture has 1.5 billion parameters and currently uses the SageMaker data parallel library with Horovod for all-reduce. Observing CloudWatch metrics, the team notices that GPU utilization averages only 45% and network throughput is near maximum. Which action will most effectively reduce training time?

Question 112mediummultiple choice

Read the full NAT/PAT explanation →

A retail company uses SageMaker to train a multi-class image classification model with a custom ResNet-50 implemented in TensorFlow. The training data is 500 GB of images stored in S3. The data scientist uses an ml.p3.2xlarge instance with a single GPU. The training takes 10 hours per epoch, and the model does not converge after 5 epochs. The scientist needs to accelerate training and improve model accuracy. The current implementation loads images individually from S3 using TensorFlow's tf.data API. The scientist also notices high I/O wait time. Which combination of actions should the scientist take? (Assume the scientist is aware of best practices.) The answer is a single choice from A-D.

Question 113easymultiple choice

Read the full ML Model Development explanation →

A company is training a binary classifier in SageMaker and observes that the training loss decreases but validation loss increases after a few epochs. What is the most likely issue?

Question 114mediummultiple choice

Read the full ML Model Development explanation →

A data scientist needs to ensure that the same train/test split is used across multiple experiments for reproducibility in SageMaker. Which approach should they take?
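
To make the reproducibility idea concrete, the sketch below fixes a random seed so the same shuffle, and therefore the same split, is produced on every run. It uses only the Python standard library; the seed value 42, the 20% test fraction, and the function name `seeded_split` are assumptions for the example and do not represent a SageMaker API.

```python
import random


def seeded_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items with a private, seeded RNG and split it.

    Using random.Random(seed) avoids touching the global random state,
    so other code cannot change the split by accident.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Hypothetical record IDs.
record_ids = list(range(10))

first_train, first_test = seeded_split(record_ids)
second_train, second_test = seeded_split(record_ids)

print(first_test == second_test)  # same seed, same split
```

Persisting the resulting split (for example, writing the two subsets to storage once and reusing them) is another way to get the same guarantee without depending on the shuffle implementation.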

Question 115hardmultiple choice

Read the full ML Model Development explanation →

A company uses SageMaker to train a model with a large dataset stored in S3. They notice that the training job is taking longer than expected and the GPU utilization is low. Which action would most likely improve GPU utilization?

Question 116hardmultiple choice

Read the full ML Model Development explanation →

A team is deploying a model that requires low-latency inference for real-time predictions. They are using a SageMaker endpoint with a single instance. During testing, they observe high latency. Which change would most effectively reduce latency?

Question 117easymulti select

Read the full ML Model Development explanation →

A data scientist is using SageMaker Autopilot to automatically build a model. Which TWO aspects does Autopilot handle? (Choose TWO.)

Question 118mediummulti select

Read the full ML Model Development explanation →

A company is training a deep learning model using SageMaker's built-in PyTorch framework. They want to optimize training performance. Which THREE actions should they take? (Choose THREE.)

Question 119hardmulti select

Read the full ML Model Development explanation →

A team is using SageMaker Pipelines to automate a training workflow. They need to ensure that if a step fails, the pipeline can resume from the failed step without reprocessing prior steps. Which TWO configurations are necessary? (Choose TWO.)

Question 120easymultiple choice

Read the full ML Model Development explanation →

A company is building a recommendation system and has trained a matrix factorization model using SageMaker. They want to evaluate the model's performance using precision at k (P@k) and recall at k (R@k). They have a test set of user-item interactions. The data scientist implements a custom evaluation script that computes these metrics, but the precision values are consistently zero. What is the most likely cause?
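
As a reference for the two metrics named here, the sketch below computes precision at k and recall at k for a single user. The item IDs and function names are hypothetical. The last lines show one way precision can collapse to zero in a custom script, namely an ID type mismatch between recommendations and ground truth; that is an illustrative possibility, not a statement of the intended answer.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


# Hypothetical ranked recommendations and held-out interactions for one user.
recommended = ["a", "b", "c", "d"]
relevant = {"b", "d", "e"}

p_at_3 = precision_at_k(recommended, relevant, 3)  # only "b" is a hit in the top 3
r_at_3 = recall_at_k(recommended, relevant, 3)

# Integer IDs in the recommendations versus string IDs in the test set never match.
mismatched = precision_at_k([1, 2, 3], {"1", "2", "3"}, 3)

print(p_at_3, r_at_3, mismatched)
```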

Question 121mediummultiple choice

Read the full ML Model Development explanation →

A financial services company uses SageMaker to train a fraud detection model. They have imbalanced data with 1% fraud. They trained a Gradient Boosting model using SMOTE for oversampling and achieved 99% accuracy on the test set, but the fraud recall is only 10%. The data scientist is concerned about the model's performance. Which change is most likely to improve fraud recall without sacrificing too much precision?

Question 122hardmultiple choice

Read the full ML Model Development explanation →

A team is using SageMaker to run a large-scale distributed training job for a language model. They are using SageMaker's Pipe mode to stream data from S3 to reduce I/O. They observe that the training throughput is lower than expected, and the CPU utilization is high while GPU utilization is low. The training script uses PyTorch's DataLoader with num_workers=0. The data preprocessing is minimal. Which change is most likely to improve GPU utilization?

Question 123easymultiple choice

Read the full ML Model Development explanation →

A data scientist is using SageMaker to train a linear regression model. After training, they evaluate the model on the test set and get an R² of 0.95. However, when they deploy the model to a SageMaker endpoint and run predictions on new data, the predictions are far off. What is the most likely cause?

Question 124mediummultiple choice

Read the full ML Model Development explanation →

A company uses SageMaker Ground Truth to label a dataset for object detection. They set up a labeling job with a private workforce. After labeling, they export the dataset and train a model using SageMaker's built-in object detection algorithm. The model achieves high accuracy on the test set but low accuracy on a small holdout set that was manually labeled by an expert. What might be the issue?

Question 125hardmultiple choice

Read the full ML Model Development explanation →

A team is using SageMaker Pipelines to train a model. The pipeline has multiple steps: data processing, training, evaluation, and registration. They use a Condition step to evaluate the model's accuracy and if it exceeds a threshold, register the model. They run the pipeline and the training step succeeds, but the pipeline fails at the Condition step with an error: 'Unable to evaluate condition: the property 'Accuracy' does not exist.' The evaluation step output is a JSON file with key 'accuracy'. What is the most likely cause?
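
JSON object keys are case-sensitive, which is relevant to the 'Accuracy' versus 'accuracy' detail in this scenario. The sketch below shows the behavior in plain Python only; it does not use the SageMaker Pipelines API, and the 0.91 value and the helper name `get_metric` are hypothetical.

```python
import json

# Hypothetical evaluation report with a lowercase key, as described in the question.
evaluation_report = json.loads('{"accuracy": 0.91}')


def get_metric(report, key):
    """Return report[key], or None when that exact key is absent."""
    return report.get(key)


exact_match = get_metric(evaluation_report, "accuracy")
wrong_case = get_metric(evaluation_report, "Accuracy")

print(exact_match)  # value stored under the lowercase key
print(wrong_case)   # None, because "Accuracy" is a different key
```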

Question 126easymultiple choice

Read the full ML Model Development explanation →

A data scientist is training a deep learning model on SageMaker and notices that the training loss oscillates and does not converge. They want to debug this issue. Which SageMaker feature can they use to monitor and analyze the training process?

Question 127mediummultiple choice

Read the full ML Model Development explanation →

A company is using SageMaker to train a model for image classification. They have a dataset of 10,000 images. They use SageMaker's built-in image classification algorithm with transfer learning. During training, they notice that the training job completes successfully but the model accuracy on the validation set is very low (~30%). They suspect the model is underfitting. Which action is most likely to improve accuracy?

Question 128mediummultiple choice

Read the full ML Model Development explanation →

An ML team is developing a regression model using Amazon SageMaker. They have a 100 GB CSV dataset stored in Amazon S3. The data is contained in a single large file. They launch a SageMaker training job with an ml.p3.8xlarge instance using a custom Docker container. The training script loads the data using pandas' read_csv from S3 directly. The team observes that the training job takes over 24 hours, and CloudWatch metrics show: GPU utilization is consistently above 90%, but CPU utilization is below 30%. Network I/O is moderate, and disk I/O is low. The team has already tried switching to a larger instance type (ml.p3.16xlarge) with no significant improvement. They need to reduce training time. Which action is MOST likely to achieve this?

Question 129mediummulti select

Read the full ML Model Development explanation →

A data scientist is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (95% negative class, 5% positive class). The model is evaluated on a held-out test set, and the F1 score is 0.12. The data scientist wants to improve the F1 score. Which two actions should the data scientist take? (Choose two.)

Question 130hardmultiple choice

Read the full ML Model Development explanation →

Refer to the exhibit. A data scientist used a SageMaker training job with a custom Scikit-learn script. The training job failed with the error shown. What is the most likely cause of this failure?

Exhibit

{
    "TrainingJobName": "fraud-detection-model-20241015",
    "TrainingJobStatus": "Failed",
    "FailureReason": "AlgorithmError: Encountered an unexpected error during training: ValueError: Expected 2D array, got 1D array instead. Reshape your data using array.reshape(-1, 1) if your data has a single feature.",
    "AlgorithmSpecification": {
        "TrainingImage": "382416733822.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3",
        "TrainingInputMode": "File"
    },
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1
    },
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/train/data.csv",
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "text/csv",
            "CompressionType": "None"
        }
    ]
}

Question 131easymultiple choice

Read the full NAT/PAT explanation →

A company is deploying a real-time inference endpoint for a natural language processing model using Amazon SageMaker. The model is a fine-tuned BERT variant. The endpoint has been running for two weeks with acceptable latency (average 200 ms). However, over the past 24 hours, the latency has increased to an average of 800 ms, and the number of simultaneous requests has doubled. The team expects traffic to continue to grow. The current endpoint configuration uses a single ml.m5.large instance. The model is loaded into memory once, and the inference framework is PyTorch. The team needs to maintain latency under 500 ms. Which course of action should the team take to address the latency increase while minimizing cost?

Question 132easymulti select

Read the full ML Model Development explanation →

A data science team needs to track and compare multiple ML training runs, including hyperparameters, metrics, and output artifacts. Which TWO AWS services can be used together to meet this requirement? (Choose two.)

Question 133mediummultiple choice

Read the full ML Model Development explanation →

A machine learning engineer observes that a SageMaker training job fails with the error shown in the exhibit. What is the most likely cause of the failure?

Exhibit

Refer to the exhibit.

```
Training Job Name: my-training-job
Status: Failed
Failure Reason: ClientError: Data download failed. Unable to locate credentials. Please configure your SageMaker Execution Role with the necessary permissions.
```
This is the output from `aws sagemaker describe-training-job --training-job-name my-training-job`.

Question 134hardmultiple choice

Read the full NAT/PAT explanation →

A financial services company is developing a real-time fraud detection model using XGBoost on SageMaker. They have millions of transactions daily and train a model weekly on 6 months of historical data. The training dataset is 500 GB in CSV format stored in S3. The training job uses an ml.p3.16xlarge instance with 8 GPUs, but training takes over 12 hours, which is too long for the weekly cadence. The data scientist notices that GPU utilization averages only 15% during training. The training script uses the SageMaker XGBoost container with default hyperparameters. Which combination of actions would MOST likely reduce training time? (Choose the best answer.)