MLS-C01 Machine Learning Implementation and Operations — All Questions With Answers

Question 1 (medium, multiple choice)

A company is using Amazon SageMaker to train a deep learning model. The training job is failing with an error 'CUDA out of memory'. The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model architecture and batch size are appropriate for this instance size. What is the most likely cause of this error?

Question 2 (easy, multiple choice)

A data scientist is deploying a model using Amazon SageMaker. The model endpoint needs to handle real-time inference requests with low latency. The model is a large ensemble of 10 deep learning models, each approximately 500 MB. What is the most cost-effective deployment strategy that meets the low-latency requirement?

Question 3 (hard, multiple choice)

A company is using Amazon SageMaker to train a model with a custom algorithm. The training script reads data from an S3 bucket using boto3. The training job fails with an 'AccessDenied' error when trying to access the S3 bucket. The IAM role attached to the SageMaker notebook instance has full S3 access. What is the most likely cause?

Question 4 (medium, multiple choice)

A machine learning engineer is deploying a model using AWS Lambda for real-time inference. The model is a scikit-learn RandomForestClassifier with 100 trees, serialized as a pickle file of 150 MB. The Lambda function has 3 GB memory allocated. However, the inference requests are timing out after 30 seconds. What is the most likely cause?

Question 5 (hard, multiple choice)

A data scientist is using Amazon SageMaker for hyperparameter tuning. The tuning job uses a Bayesian optimization strategy. After 10 training jobs, the objective metric (validation accuracy) has plateaued at 0.85. The data scientist wants to explore more diverse hyperparameter combinations. What should the data scientist do?

Question 6 (medium, multiple choice)

An IAM policy is attached to a SageMaker execution role. A data scientist tries to create a training job using a custom algorithm stored in an ECR repository. The training job fails with an 'AccessDenied' error when pulling the Docker image from ECR. What is the missing permission?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::company-bucket/*"
    }
  ]
}
```

Question 7 (easy, multiple choice)

A DevOps engineer created a SageMaker notebook instance using the Terraform configuration shown. The notebook instance is in a VPC with a public subnet. However, the notebook instance cannot access the internet. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
resource "aws_sagemaker_notebook_instance" "ml_notebook" {
  name                   = "my-notebook"
  role_arn               = "arn:aws:iam::123456789012:role/sagemaker-role"
  instance_type          = "ml.t2.medium"
  direct_internet_access = "Enabled"
}
```

Question 8 (medium, multiple choice)

A company is using Amazon SageMaker to train an XGBoost model on a large dataset. The training job is taking a long time. The data scientist wants to reduce training time without sacrificing model accuracy. The dataset is 100 GB in CSV format stored in S3. What is the most effective approach?

Question 9 (hard, multiple choice)

A company is using Amazon SageMaker to deploy a model for real-time inference. The model endpoint is behind an Application Load Balancer (ALB) for A/B testing. The data scientist notices that the endpoint is returning HTTP 503 errors intermittently. The CloudWatch metrics show that the endpoint's Invocations metric is within limits, but the ModelLatency metric has high variance. What is the most likely cause?

Question 10 (medium, multiple choice)

A company is using Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is failing with an OutOfMemory error. The data scientist wants to minimize cost while resolving the issue. Which action should the data scientist take?

Question 11 (easy, multiple choice)

A data scientist is deploying a model using Amazon SageMaker for real-time inference. The model is memory-intensive and requires a GPU. Which instance type should be selected for the endpoint?

Question 12 (hard, multiple choice)

A company is using AWS Glue to run ETL jobs that transform data for machine learning. The jobs are failing with 'Out of Memory' errors. The data size is growing, and the company needs a cost-effective solution. Which approach should be taken?

Question 13 (medium, multiple choice)

A data scientist is training a model using Amazon SageMaker and wants to automatically stop training when the model stops improving. Which feature should be used?
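
The general idea of stopping once a model stops improving can be sketched in a few lines of Python. This is a generic patience-based rule, not the implementation of any SageMaker feature; the function name `stopping_epoch` and the parameters `patience` and `min_delta` are hypothetical.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None.

    Training stops after `patience` consecutive epochs in which the
    validation loss fails to improve on the best value by more than
    `min_delta`.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best value: remember it and reset the counter.
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.69]
print(stopping_epoch(losses, patience=3))
```

A larger `patience` tolerates longer plateaus at the cost of extra training time.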

Question 14 (medium, multi select)

A company is using Amazon SageMaker to build a machine learning pipeline. The pipeline includes data preprocessing, training, and evaluation steps. The company wants to ensure that the pipeline is reproducible and that artifacts are versioned. Which TWO actions should be taken? (Choose TWO.)

Question 15 (hard, multi select)

A data scientist is deploying a model on Amazon SageMaker for real-time inference. The model is a PyTorch model that requires custom inference code. The data scientist needs to handle variable-length inputs and optimize inference latency. Which THREE steps should the data scientist take? (Choose THREE.)

Question 16 (hard, multiple choice)

A data scientist is trying to create a training job named 'test-model' using an IAM role with the attached policy. The creation fails with an AccessDenied error. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sagemaker:CreateTrainingJob",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "sagemaker:TrainingJobName": "*production*"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": "sagemaker:CreateTrainingJob",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "sagemaker:TrainingJobName": "*production*"
        }
      }
    }
  ]
}
```
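
To see how the condition operators in a policy like this one treat the `*` character, the following Python sketch contrasts exact matching with wildcard matching. It assumes that `StringEquals` compares the whole string literally and that `StringLike` treats `*` and `?` as wildcards. The helpers `string_equals` and `string_like` are hypothetical and only approximate IAM's evaluation.

```python
import re


def string_equals(value, pattern):
    """Exact, case-sensitive comparison; '*' is an ordinary character."""
    return value == pattern


def string_like(value, pattern):
    """Wildcard comparison: '*' matches any run of characters, '?' one character."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, value) is not None


for name in ["test-model", "my-production-job", "*production*"]:
    print(name, string_equals(name, "*production*"), string_like(name, "*production*"))
```

Under these assumptions, only a job literally named `*production*` satisfies the exact comparison, while the wildcard comparison accepts any name containing `production`.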

Question 17 (medium, multiple choice)

A company runs a machine learning pipeline on Amazon SageMaker. The pipeline consists of three steps: data preprocessing (using a custom container), training (using a built-in algorithm), and model evaluation (using a custom container). The pipeline is orchestrated using AWS Step Functions. Recently, the pipeline has been failing intermittently at the model evaluation step with a 'TimeoutError'. The evaluation step runs a Python script that loads the trained model and a test dataset from S3, computes metrics, and writes results back to S3. The step is configured with a timeout of 600 seconds. The test dataset size has grown over time. The data science team suspects that the timeout is due to the increased data size. They want a solution that minimizes changes to the existing infrastructure and avoids increasing the timeout arbitrarily. Which approach should the team take?

Question 18 (hard, multiple choice)

A media company uses Amazon SageMaker to train a deep learning model for video classification. The training job uses a single ml.p3.2xlarge instance and processes 50 GB of labeled video data stored in Amazon S3. The training completes successfully in 12 hours. However, the data scientists report that the model’s accuracy is lower than expected. They suspect the training data contains labeling errors. To improve model accuracy without incurring significant additional cost, they want to identify and remove mislabeled training examples before retraining. They have a small budget of $50 and need to complete the analysis within 2 hours. Which approach should the data scientists take?

Question 19 (medium, multiple choice)

A company wants to deploy a machine learning model that performs real-time inference with sub-second latency. The model is a deep neural network with 500 MB of weights. The inference endpoint must scale to zero when not in use to minimize cost. Which AWS service should the company use?

Question 20 (hard, multi select)

A data science team is training a large deep learning model using Amazon SageMaker. The training job is taking a long time because the model has many layers and the dataset is large. The team wants to reduce training time by distributing the training across multiple GPUs on a single instance, as well as across multiple instances. Which TWO actions should the team take? (Choose two.)

Question 21 (easy, multiple choice)

An ML engineer is troubleshooting why an automated CI/CD pipeline cannot deploy an updated model to an existing SageMaker endpoint. The pipeline uses the IAM role that has the attached policy shown in the exhibit. What is the MOST likely cause of the failure?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint",
        "sagemaker:InvokeEndpoint"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "sagemaker.amazonaws.com"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "sagemaker:DeleteEndpoint",
        "sagemaker:DeleteEndpointConfig",
        "sagemaker:DeleteModel"
      ],
      "Resource": "*"
    }
  ]
}
```

Question 22 (medium, drag order)

Drag and drop the steps to train a model using an Amazon SageMaker built-in algorithm into the correct order.

Question 23 (medium, drag order)

Drag and drop the steps to evaluate a trained model using SageMaker Model Monitor in the correct order.

Question 24 (medium, matching)

Match each SageMaker built-in algorithm to its primary use case.

Matches

Gradient boosted trees for regression and classification

Word2Vec and text classification

Learning embeddings for pairs of objects

Anomaly detection in IP traffic

Time series forecasting

Question 25 (medium, matching)

Match each SageMaker built-in metric to its meaning.

Matches

Fraction of correct predictions on validation set

Root mean square error on validation set

Area under ROC curve on validation set

Logistic loss on validation set

Harmonic mean of precision and recall on validation set
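
The five definitions above can be made concrete with a small, dependency-free Python sketch. The function names and the toy data are hypothetical, and the code illustrates the textbook formulas rather than how SageMaker computes its metrics internally.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def auc(y_true, scores):
    """Area under the ROC curve: probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), f1(y_true, y_pred), auc(y_true, scores), log_loss(y_true, scores))
```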

Question 26 (easy, multiple choice)

A data scientist is training a linear regression model on a dataset with 100 features. The model shows high variance on the test set. Which action is MOST likely to reduce overfitting?

Question 27 (medium, multiple choice)

A company is using Amazon SageMaker to deploy a real-time inference endpoint for a computer vision model. The endpoint receives bursts of traffic with up to 500 requests per second, but the load is unpredictable. Which scaling strategy is MOST cost-effective while maintaining low latency?

Question 28 (hard, multiple choice)

A machine learning team is building a fraud detection system using Amazon SageMaker. The training data is highly imbalanced (99% legitimate, 1% fraudulent). They need to maximize the recall of the fraud class while keeping precision above 90%. Which approach should they take?
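
The constraint in this scenario (highest recall subject to a precision floor) can be checked with a small Python sketch. It scans candidate decision thresholds over predicted scores and keeps the one with the best recall among those meeting the floor. The function name `best_threshold` and the toy data are hypothetical, and this is one illustrative way to evaluate the trade-off rather than a prescribed solution.

```python
def best_threshold(y_true, scores, min_precision=0.9):
    """Return (threshold, recall, precision) with the highest recall
    among thresholds whose precision is at least `min_precision`,
    or None if no threshold qualifies."""
    best = None
    for thr in sorted(set(scores)):
        preds = [1 if s >= thr else 0 for s in scores]
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
        if tp + fp == 0 or tp + fn == 0:
            continue  # precision or recall undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision >= min_precision and (best is None or recall > best[1]):
            best = (thr, recall, precision)
    return best


y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.9]
print(best_threshold(y_true, scores, min_precision=0.9))
```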

Question 29 (easy, multiple choice)

A data scientist wants to use Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is expected to take several hours. Which storage option should be used to minimize data loading time and cost?

Question 30 (medium, multiple choice)

An ML engineer is deploying a model to a SageMaker endpoint for real-time inference. The model requires a custom inference script that preprocesses input data and postprocesses predictions. Which SageMaker feature should be used to implement this custom logic?

Question 31 (hard, multiple choice)

A company uses Amazon SageMaker to train a text classification model. The training data is stored in S3 and contains sensitive personally identifiable information (PII). The company must ensure that the data is encrypted at rest in S3 and that the encryption key is managed by the company's own hardware security module (HSM). Which configuration should be used?

Question 32 (easy, multiple choice)

A data scientist is using Amazon SageMaker to train an XGBoost model on a dataset with missing values. The dataset has both numeric and categorical features. Which preprocessing step is MOST appropriate before training?

Question 33 (medium, multiple choice)

An ML team is using Amazon SageMaker to train a model. They notice that the training job is taking longer than expected and the CloudWatch metrics show high GPU utilization but low CPU utilization. Which action is MOST likely to improve training speed?

Question 34 (hard, multiple choice)

A company is using Amazon SageMaker to host a model that performs real-time inference. The model receives around 100 requests per second with occasional spikes up to 500 requests per second. The current endpoint uses 2 ml.m5.large instances. During spikes, latency increases significantly, and some requests time out. What is the MOST cost-effective solution to handle the spikes without losing requests?

Question 35 (medium, multi select)

Which TWO configuration steps are necessary to deploy a custom Docker container for training in Amazon SageMaker? (Choose two.)

Question 36 (hard, multi select)

Which THREE actions can help reduce the inference latency of a SageMaker endpoint? (Choose three.)

Question 37 (easy, multi select)

Which TWO services can be used to perform hyperparameter tuning in Amazon SageMaker? (Choose two.)

Question 38 (medium, multiple choice)

An IAM policy is attached to a SageMaker notebook instance. The data scientist wants to use the notebook to train a model using data from S3 bucket 'my-bucket'. However, the training job fails with an access denied error. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob"
      ],
      "Resource": "*"
    }
  ]
}
```

Question 39 (hard, multiple choice)

A SageMaker endpoint creation fails, and CloudWatch Logs contains the excerpt shown in the exhibit. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

```
2023-01-15 10:30:00 ERROR - Model server did not start within 300 seconds.
2023-01-15 10:30:00 ERROR - No worker process responded to ping.
2023-01-15 10:30:00 INFO  - Starting model server...
2023-01-15 10:29:55 INFO  - Loading model from /opt/ml/model
```

Question 40 (easy, multiple choice)

A data scientist runs the AWS CLI command shown in the exhibit. The output shows that job-2 failed. Which action should the data scientist take to diagnose the failure?

Question 41 (easy, multiple choice)

A data science team is deploying a machine learning model to production using Amazon SageMaker. The model requires real-time inference with low latency. Which SageMaker feature should they use to deploy the model?

Question 42 (medium, multiple choice)

During training of a deep learning model on a GPU instance in SageMaker, the training job fails with an insufficient memory error. Which step should be taken first to resolve this issue?

Question 43 (hard, multiple choice)

A company uses SageMaker to train a model each night. The training data is stored in an S3 bucket with SSE-S3 encryption. The training job fails with an access denied error. Which configuration is needed?

Question 44 (medium, multi select)

A data scientist is deploying a model to a SageMaker endpoint and needs to optimize for cost while maintaining low latency. Which TWO actions should the data scientist take?

Question 45 (hard, multi select)

You are deploying a custom Docker container for a SageMaker model that requires a specific NVIDIA CUDA version. Which THREE steps must you take to ensure the container runs correctly on SageMaker?

Question 46 (easy, multi select)

A machine learning pipeline uses SageMaker Processing jobs for feature engineering. Which TWO are benefits of using SageMaker Processing over running a custom script on an EC2 instance?

Question 47 (easy, multiple choice)

Refer to the exhibit. A data scientist runs the AWS CLI command to create a SageMaker training job. The training job fails because the input data is not accessible. Which step should the data scientist take to fix the issue?

Exhibit

Refer to the exhibit.
```
aws sagemaker create-training-job \
    --training-job-name my-training \
    --algorithm-specification TrainingImage=... \
    --role-arn arn:aws:iam::123456789012:role/SageMakerRole \
    --input-data-config [{"ChannelName":"train","DataSource":{"S3DataSource":{"S3Uri":"s3://bucket/data","S3DataType":"S3Prefix"}}}] \
    --output-data-config {"S3OutputPath":"s3://bucket/output"} \
    --resource-config {"InstanceCount":2,"InstanceType":"ml.m5.large","VolumeSizeInGB":10}
```

Question 48 (medium, multiple choice)

Refer to the exhibit. A SageMaker training job uses an IAM role with this policy. The training job writes output to s3://my-bucket/output/. Which statement about the policy is true?

Exhibit

Refer to the exhibit.
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
```
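
The interaction of the two statements in a policy like this one can be sketched in Python. The sketch assumes that an explicit Deny overrides any Allow, and that a negated operator such as `StringNotEquals` also matches when the encryption header is absent from the request. The function `is_put_allowed` is a hypothetical simplification, not IAM's actual evaluation engine.

```python
def is_put_allowed(sse_header):
    """Evaluate an s3:PutObject request to my-bucket/* against the two statements.

    `sse_header` is the value of x-amz-server-side-encryption on the request,
    or None when the header is not sent (assumption: a missing key makes the
    StringNotEquals condition true).
    """
    allowed = True  # first statement: Allow s3:PutObject on my-bucket/*
    denied = sse_header != "AES256"  # second statement: Deny unless AES256
    # An explicit Deny wins over any Allow.
    return allowed and not denied


for header in ["AES256", "aws:kms", None]:
    print(header, is_put_allowed(header))
```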

Question 49 (hard, multiple choice)

Refer to the exhibit. A SageMaker endpoint is returning 5xx errors, and the container logs are shown in the exhibit. Which change will most likely resolve the issue?

Exhibit

Refer to the exhibit.
```
[Container] 2022/08/10 12:00:00 Starting inference server
[Container] 2022/08/10 12:00:05 Model server started
[Container] 2022/08/10 12:00:10 Invoking /invocations endpoint
[Container] 2022/08/10 12:00:15 ERROR: Exception during prediction: OutOfMemoryError
[Container] 2022/08/10 12:00:16 Shutting down
```

Question 50 (medium, multiple choice)

A team is using SageMaker to train a model using the built-in XGBoost algorithm. The training job is taking longer than expected. The team suspects that the data is not being loaded efficiently. Which data format should they use to minimize training time?

Question 51 (hard, multiple choice)

A company uses SageMaker Ground Truth to label images for object detection. After labeling, they notice that the bounding boxes are often misaligned with the objects. Which action should they take to improve label quality?

Question 52 (easy, multiple choice)

A data scientist needs to create a SageMaker notebook instance with access to a private S3 bucket. The bucket uses SSE-KMS encryption. Which additional configuration is required?

Question 53 (hard, multi select)

You are building a CI/CD pipeline for SageMaker using AWS CodePipeline. Which THREE components are essential for a fully automated model training and deployment pipeline?

Question 54 (medium, multi select)

A company uses SageMaker to run training jobs on a schedule. The training data is stored in an S3 bucket that receives new data every hour. Which TWO approaches can the company use to trigger a training job when new data arrives?

Question 55 (medium, multiple choice)

You are deploying a PyTorch model to a SageMaker endpoint. The model is large (5 GB) and the endpoint is using an ml.c5.2xlarge instance. Inference latency is higher than required. Which change would most effectively reduce latency?

Question 56 (medium, multiple choice)

A data scientist needs to deploy a PyTorch model for real-time inference. Which AWS service is best suited for this task?

Question 57 (easy, multiple choice)

A company uses SageMaker to train a model, but the training job fails due to insufficient memory. What is the most cost-effective way to resolve this?

Question 58 (hard, multiple choice)

A team wants to automate the retraining of a model weekly using new data that arrives in S3. Which combination of services should they use?

Question 59 (medium, multiple choice)

A deployed SageMaker endpoint is returning high latency. The model is a scikit-learn Random Forest. Which action is most likely to reduce latency?

Question 60 (hard, multiple choice)

A data scientist needs to run a hyperparameter tuning job for a deep learning model. Which SageMaker feature should they use?

Question 61 (easy, multiple choice)

A company wants to serve predictions from a model using a REST API with low latency. Which SageMaker deployment option is most appropriate?

Question 62 (hard, multiple choice)

A team notices that a SageMaker training job using TensorFlow is running slower than expected. The training data is in S3 in TFRecord format. Which action is most likely to improve training throughput?

Question 63 (medium, multiple choice)

A company wants to monitor a deployed model for data drift. Which AWS service should they use?

Question 64 (easy, multiple choice)

A data scientist has trained a model using SageMaker and wants to deploy it to an endpoint. Which step is required before deployment?

Question 65 (medium, multi select)

Which TWO actions can help reduce inference latency for a SageMaker endpoint?

Question 66 (hard, multi select)

Which THREE factors should be considered when choosing an instance type for a SageMaker training job?

Question 67 (easy, multi select)

Which TWO services can be used to orchestrate a machine learning pipeline?

Question 68 (medium, multiple choice)

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance. Which action will the notebook be able to perform?

Exhibit

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateEndpoint",
        "sagemaker:InvokeEndpoint",
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}
```

Question 69 (hard, multiple choice)

Refer to the exhibit. A SageMaker training job is launched with the CLI command shown. The job fails with an error 'S3 data distribution type not supported for File mode'. What is the most likely fix?

Exhibit

```
aws sagemaker create-training-job \
    --training-job-name my-job \
    --algorithm-specification TrainingImage=my-image,TrainingInputMode=File \
    --resource-config InstanceType=ml.m5.large,InstanceCount=1,VolumeSizeInGB=30 \
    --input-data-config ChannelName=training,DataSource={S3DataSource={S3Uri=s3://bucket/data,S3DataType=S3Prefix,S3DataDistributionType=FullyReplicated}} \
    --output-data-config S3OutputPath=s3://bucket/output \
    --stopping-condition MaxRuntimeInSeconds=3600
```

Question 70 (easy, multiple choice)

Refer to the exhibit. A SageMaker endpoint logs this error. What is the most likely cause?

Exhibit

```
2019-10-12 15:30:01 - ERROR - Model prediction failed: Input shape mismatch. Expected (None, 10), got (None, 8).
```
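
A mismatch like the one in this log can be caught before a request reaches the model. The Python sketch below validates the number of features per row; the function name `check_feature_count` and the default of 10 features (taken from the expected shape in the log) are illustrative assumptions, not part of any SageMaker API.

```python
def check_feature_count(rows, expected_features=10):
    """Raise ValueError if any row does not have `expected_features` values."""
    for index, row in enumerate(rows):
        if len(row) != expected_features:
            raise ValueError(
                f"row {index}: expected {expected_features} features, got {len(row)}"
            )
    return rows


try:
    check_feature_count([[0.0] * 8])
except ValueError as err:
    print(err)
```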

Question 71 (medium, multiple choice)

A machine learning team is deploying a model using Amazon SageMaker. The model inference code runs on GPUs and requires a custom container. The team wants to minimize cold start latency. Which SageMaker hosting option should they use?

Question 72 (hard, multiple choice)

A data scientist is training a deep learning model on a large dataset using SageMaker. The training job is taking too long. Upon reviewing the CloudWatch logs, the scientist notices that the GPU utilization is below 10% most of the time. Which change is MOST likely to improve GPU utilization and reduce training time?

Question 73 (easy, multiple choice)

A company is using Amazon SageMaker to train a model and wants to track hyperparameter tuning jobs. Which AWS service is BEST suited to store and query metadata such as tuning job configurations and results?

Question 74 (medium, multiple choice)

A machine learning engineer needs to deploy a model that performs real-time inference with strict latency requirements of under 100 milliseconds. The model is a large ensemble of 10 deep learning models. Which SageMaker deployment strategy is MOST appropriate?

Question 75 (hard, multiple choice)

A data scientist is using SageMaker Autopilot to automatically build a classification model. The dataset is highly imbalanced (1% positive class). Which configuration should the scientist set to handle the class imbalance?

Question 76 (easy, multiple choice)

A company is using Amazon SageMaker to train a model and wants to automatically retrain the model every week using new data. Which AWS service should be used to orchestrate the retraining pipeline?

Question 77 (medium, multiple choice)

A machine learning team is using SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The training job fails with an 'AccessDenied' error. Which IAM permission is MOST likely missing from the SageMaker execution role?

Question 78 (hard, multiple choice)

A company is using Amazon SageMaker Ground Truth to create a labeled dataset for object detection. The labeling job is taking longer than expected. The team notices that many workers are spending a lot of time on images with no objects. Which labeling strategy should they use to reduce costs and time?

Question 79 (easy, multiple choice)

A data scientist needs to run a one-time training job on a large dataset using SageMaker. The job requires a specific PyTorch version and custom dependencies. Which approach is MOST efficient?

Question 80 (medium, multi select)

Which TWO factors should be considered when choosing between Amazon SageMaker's real-time endpoints and serverless inference? (Select TWO.)

Question 81 (hard, multi select)

Which THREE measures can help reduce inference latency for a deep learning model deployed on SageMaker real-time endpoints? (Select THREE.)

Question 82 (easy, multi select)

Which TWO actions are best practices for securing a SageMaker notebook instance? (Select TWO.)

Question 83 (easy, multiple choice)

A data scientist is training a neural network on a GPU instance in Amazon SageMaker. The training job fails with an 'OutOfMemoryError'. Which action should the data scientist take to resolve this issue?

Question 84 (medium, multiple choice)

A company is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires low latency. The team is concerned about cost. Which SageMaker hosting option should the team use?

Question 85 (hard, multiple choice)

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint. The model is a PyTorch model that requires a custom inference script. The engineer notices that the endpoint is returning 500 errors after deployment. Which step should the engineer take to debug the issue?

Question 86 (easy, multiple choice)

A data scientist is using AWS Glue to prepare training data. The job reads from an S3 bucket, performs transformations, and writes to another S3 bucket. The job is failing due to insufficient memory. Which solution should the data scientist use to fix this?

Question 87 (medium, multiple choice)

A company is building a fraud detection model. The dataset is highly imbalanced (99% legitimate, 1% fraud). The data scientist trains a model using Amazon SageMaker's built-in XGBoost algorithm. The model achieves 99% accuracy but only catches 10% of fraud cases. Which technique should the data scientist apply to improve recall for the minority class?

Question 88 (hard, multiple choice)

A company uses Amazon SageMaker to train a model. The training job uses a custom Docker container. The job fails with the error 'CannotStartContainerError: API error (500).' Which of the following is the most likely cause?

Question 89 (easy, multiple choice)

A data scientist needs to version control datasets used for machine learning experiments. Which AWS service should the data scientist use?

Question 90 (medium, multiple choice)

A company is using Amazon SageMaker to train a model on a large dataset stored in S3. The training job is taking a long time due to slow data loading. Which action can the data scientist take to reduce data loading time?

Question 91 (hard, multiple choice)

A machine learning engineer is deploying a TensorFlow model to an Amazon SageMaker endpoint. The endpoint is behind an Application Load Balancer (ALB) for A/B testing. The engineer notices that the new variant is not receiving any traffic. What is the most likely cause?

Question 92 (easy, multi select)

A data scientist is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. Which TWO actions are necessary to allow SageMaker to access the data?

Question 93 (medium, multi select)

A company is deploying a machine learning model using Amazon SageMaker. The model needs to be updated frequently. Which THREE practices should the company implement for model versioning and deployment?

Question 94 (hard, multi select)

A company is using Amazon SageMaker to build a custom model. The training job is failing with a 'ResourceLimitExceeded' error. Which TWO actions should the company take to resolve this issue?

Question 95 (easy, multiple choice)

A data scientist is trying to create a SageMaker training job using an execution role that has the IAM policy shown in the exhibit attached. The training job fails with an access denied error when trying to read training data from the S3 bucket 'my-bucket'. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["sagemaker:CreateTrainingJob"],
      "Resource": "*"
    }
  ]
}
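
To see which requests the exhibit policy covers, its statements can be matched against a few sample calls. The sketch below is a simplified matcher, not IAM's full evaluation logic: it ignores Deny statements, conditions, and all other policy types. The object key `train/data.csv` and the training-job ARN are hypothetical examples.

```python
import json
from fnmatch import fnmatchcase

# The policy from the exhibit, unchanged.
POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["s3:GetObject"],
     "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Effect": "Allow", "Action": ["sagemaker:CreateTrainingJob"],
     "Resource": "*"}
  ]
}
""")


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Return True if some Allow statement matches both action and resource.

    Simplified on purpose: no Deny handling, no conditions.
    """
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action.lower(), a.lower())
                        for a in _as_list(stmt["Action"]))
        resource_ok = any(fnmatchcase(resource, r)
                          for r in _as_list(stmt["Resource"]))
        if action_ok and resource_ok:
            return True
    return False


# Object-level read on a (hypothetical) key inside the bucket.
print(is_allowed(POLICY, "s3:GetObject", "arn:aws:s3:::my-bucket/train/data.csv"))
# Bucket-level listing targets the bucket ARN itself, without the trailing "/*".
print(is_allowed(POLICY, "s3:ListBucket", "arn:aws:s3:::my-bucket"))
```

The two calls show the difference between object ARNs (`my-bucket/*`) and the bucket ARN (`my-bucket`). That difference is worth keeping in mind when reading the answer options.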

Question 96 (medium, multiple choice)

A data scientist is reviewing the training logs from a SageMaker training job. The model's loss decreases steadily and accuracy increases. However, when the model is evaluated on a holdout test set, the accuracy is only 0.65. Which issue does this behavior suggest?

Exhibit

Refer to the exhibit.

[2019-10-01 12:00:00] Training job started.
[2019-10-01 12:05:00] Epoch 1/10: loss=2.3456, accuracy=0.5432
[2019-10-01 12:10:00] Epoch 2/10: loss=1.2345, accuracy=0.6543
[2019-10-01 12:15:00] Epoch 3/10: loss=0.9876, accuracy=0.7654
[2019-10-01 12:20:00] Epoch 4/10: loss=0.8765, accuracy=0.7890
[2019-10-01 12:25:00] Epoch 5/10: loss=0.7654, accuracy=0.8123
[2019-10-01 12:30:00] Epoch 6/10: loss=0.6543, accuracy=0.8345
[2019-10-01 12:35:00] Epoch 7/10: loss=0.5432, accuracy=0.8567
[2019-10-01 12:40:00] Epoch 8/10: loss=0.4321, accuracy=0.8789
[2019-10-01 12:45:00] Epoch 9/10: loss=0.3210, accuracy=0.9012
[2019-10-01 12:50:00] Epoch 10/10: loss=0.2109, accuracy=0.9234
[2019-10-01 12:55:00] Training job completed.
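
The trend in the exhibit can be quantified by parsing the log and comparing the final training accuracy with the holdout accuracy of 0.65 given in the question. This is a minimal sketch: the regular expression assumes the exact line format shown in the exhibit, and the name `TEST_ACCURACY` is introduced here for illustration only.

```python
import re

# Log lines copied from the exhibit.
LOG = """\
[2019-10-01 12:00:00] Training job started.
[2019-10-01 12:05:00] Epoch 1/10: loss=2.3456, accuracy=0.5432
[2019-10-01 12:10:00] Epoch 2/10: loss=1.2345, accuracy=0.6543
[2019-10-01 12:15:00] Epoch 3/10: loss=0.9876, accuracy=0.7654
[2019-10-01 12:20:00] Epoch 4/10: loss=0.8765, accuracy=0.7890
[2019-10-01 12:25:00] Epoch 5/10: loss=0.7654, accuracy=0.8123
[2019-10-01 12:30:00] Epoch 6/10: loss=0.6543, accuracy=0.8345
[2019-10-01 12:35:00] Epoch 7/10: loss=0.5432, accuracy=0.8567
[2019-10-01 12:40:00] Epoch 8/10: loss=0.4321, accuracy=0.8789
[2019-10-01 12:45:00] Epoch 9/10: loss=0.3210, accuracy=0.9012
[2019-10-01 12:50:00] Epoch 10/10: loss=0.2109, accuracy=0.9234
[2019-10-01 12:55:00] Training job completed.
"""

EPOCH_RE = re.compile(r"Epoch (\d+)/\d+: loss=([\d.]+), accuracy=([\d.]+)")


def parse_epochs(log_text):
    """Return a list of (epoch, loss, accuracy) tuples found in the log."""
    return [(int(m.group(1)), float(m.group(2)), float(m.group(3)))
            for m in EPOCH_RE.finditer(log_text)]


epochs = parse_epochs(LOG)
final_epoch, final_loss, final_train_acc = epochs[-1]

TEST_ACCURACY = 0.65  # holdout accuracy stated in the question text
gap = final_train_acc - TEST_ACCURACY

print(f"final training accuracy: {final_train_acc:.4f}")
print(f"holdout accuracy:        {TEST_ACCURACY:.4f}")
print(f"train/holdout gap:       {gap:.4f}")
```

The log reports training metrics only, so the gap against the holdout set is the one signal here that comes from outside the training data.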

Question 97 (hard, multiple choice)

A data scientist submits a SageMaker training job with the provided configuration. The job fails immediately with the error 'Algorithm not found: 382416733822.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.2-1'. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
  "AlgorithmSpecification": {
    "TrainingImage": "382416733822.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.2-1",
    "TrainingInputMode": "File"
  },
  "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
  "InputDataConfig": [
    {
      "ChannelName": "train",
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "S3Prefix",
          "S3Uri": "s3://my-bucket/train/"
        }
      },
      "ContentType": "text/csv",
      "CompressionType": "None"
    }
  ],
  "OutputDataConfig": {
    "S3OutputPath": "s3://my-bucket/output/"
  },
  "ResourceConfig": {
    "InstanceType": "ml.m5.large",
    "InstanceCount": 1,
    "VolumeSizeInGB": 10
  },
  "StoppingCondition": {
    "MaxRuntimeInSeconds": 86400
  }
}
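
When reasoning about an image URI like the one in the exhibit, it can help to split it into its components (registry account, region, repository, tag) and compare each with the environment the job runs in. The sketch below only parses the string. The helper name `parse_ecr_image_uri` is hypothetical, and the pattern assumes the standard private ECR URI layout.

```python
import re

# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_RE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into account, region, repository and tag."""
    match = ECR_URI_RE.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return match.groupdict()


# TrainingImage value copied from the exhibit.
TRAINING_IMAGE = "382416733822.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.2-1"
parts = parse_ecr_image_uri(TRAINING_IMAGE)

for name, value in parts.items():
    print(f"{name:<10} {value}")
```

Each field is a separate thing to verify: whether the registry account and region match the region the job is submitted to, and whether the repository and tag exist there.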

Question 98 (medium, multiple choice)

A data scientist is using Amazon SageMaker to train a custom image classification model using a PyTorch script. The training job runs successfully but the model accuracy is lower than expected. The scientist wants to debug the training process by inspecting gradients and layer outputs. Which SageMaker feature should be used to capture this internal state during training?

Question 99 (hard, multiple choice)

A company deploys a real-time inference endpoint using Amazon SageMaker with an ML model that has strict latency requirements. The endpoint currently uses a single ml.c5.xlarge instance. During a load test, the p99 latency exceeds the 100ms threshold. The team adds more instances but latency does not improve because the model is heavily CPU-bound. What is the MOST cost-effective change to meet the latency requirement?

Question 100 (easy, multiple choice)

An ML engineer needs to run a hyperparameter tuning job on Amazon SageMaker. The training algorithm supports distributed training across multiple GPUs. The engineer wants to minimize the total time to find the best hyperparameters. Which strategy should be used?

Question 101 (medium, multi select)

A company is deploying a machine learning model for real-time fraud detection using Amazon SageMaker. The model must have a p99 inference latency under 50ms. Which TWO actions should the ML team take to meet the latency requirement?

Question 102 (hard, multi select)

A machine learning team is building a real-time inference pipeline using Amazon SageMaker. The team has multiple models that need to be served, but usage patterns are unpredictable and traffic spikes occur several times a day. The team wants to minimize costs while maintaining low latency. Which THREE actions should the team take?

Question 103 (easy, multi select)

A data scientist is using Amazon SageMaker to train a large neural network on a GPU instance. The training is taking longer than expected. The scientist wants to reduce training time without changing the model architecture. Which TWO approaches should the scientist consider?

Question 104 (hard, multiple choice)

An ML engineer is deploying a model on a SageMaker endpoint and wants to ensure that only authorized users and services can invoke the endpoint. The company uses AWS IAM for access control and requires that the endpoint be invoked only from within a specific VPC. What combination of actions should the engineer take? (Choose the single best answer.)

Question 105 (medium, multiple choice)

A company uses Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error: "Unable to write to /opt/ml/output/data". The data scientist checks the container and finds that the /opt/ml directory is not writable. What is the MOST likely cause?

Question 106 (easy, multiple choice)

An ML team wants to perform batch inference on a large dataset stored in Amazon S3 using a pre-trained model. The team needs to process the data in parallel across multiple instances to reduce processing time. Which approach should they use?

Question 107 (hard, multiple choice)

Refer to the exhibit. An ML engineer attaches this IAM policy to a user. The user wants to invoke the SageMaker endpoint my-endpoint from an EC2 instance with public IP 52.1.1.1. What will happen?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateEndpoint",
        "sagemaker:InvokeEndpoint"
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:123456789012:endpoint/my-endpoint"
    },
    {
      "Effect": "Deny",
      "Action": "sagemaker:InvokeEndpoint",
      "Resource": "*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "10.0.0.0/8",
            "172.16.0.0/12",
            "192.168.0.0/16"
          ]
        }
      }
    }
  ]
}
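
The Deny statement in the exhibit applies only when the request's source IP falls inside one of the listed CIDR ranges. A quick membership check with Python's standard `ipaddress` module shows how the address from the question compares with those ranges. This sketch covers only the `IpAddress` condition and assumes that `aws:SourceIp` for the request equals the public IP given in the question. The function name is hypothetical.

```python
import ipaddress

# CIDR ranges from the Deny statement's aws:SourceIp condition.
DENIED_CIDRS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]


def matches_deny_condition(source_ip, cidrs=DENIED_CIDRS):
    """Return True if source_ip lies inside any of the given CIDR ranges."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)


print(matches_deny_condition("52.1.1.1"))   # public IP from the question
print(matches_deny_condition("10.1.2.3"))   # an example private address
```

All three ranges are the RFC 1918 private address blocks, so the condition separates private-range callers from everything else.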

Question 108 (medium, multiple choice)

Refer to the exhibit. A data scientist is deploying a PyTorch model on a SageMaker endpoint. When the endpoint is invoked, the error shown in the exhibit appears in CloudWatch logs. What is the MOST likely cause?

Exhibit

2024-03-15 10:23:45,234 - root - ERROR - Failed to load model: 'NoneType' object has no attribute 'input_shape'
Traceback (most recent call last):
  File "/opt/ml/code/inference.py", line 45, in model_fn
    model = load_model(model_dir)
  File "/opt/ml/code/inference.py", line 30, in load_model
    input_shape = model.input_shape
AttributeError: 'NoneType' object has no attribute 'input_shape'

Question 109 (easy, multiple choice)

Refer to the exhibit. An ML engineer creates a CloudFormation stack with this template. The stack creation succeeds, but when the engineer tries to invoke the endpoint, it returns a ModelError. The CloudWatch logs show that the container exited with error. What is the MOST likely cause?

Exhibit

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MyEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointName: my-endpoint
      EndpointConfigName: !Ref MyEndpointConfig
  MyEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - InitialInstanceCount: 2
          InstanceType: ml.m5.large
          ModelName: !Ref MyModel
          VariantName: variant-1
  MyModel:
    Type: AWS::SageMaker::Model
    Properties:
      PrimaryContainer:
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest
        ModelDataUrl: s3://my-bucket/model.tar.gz
      ExecutionRoleArn: arn:aws:iam::123456789012:role/SageMakerRole

Question 110 (hard, multiple choice)

A financial services company is deploying a machine learning model for credit risk assessment. The model must have an inference latency under 200ms and must be able to handle up to 1000 transactions per second (TPS). The company wants to minimize costs. The model is a gradient boosting model implemented in XGBoost. Which SageMaker deployment option should the team choose?

Question 111 (medium, multiple choice)

An ML team uses Amazon SageMaker to train a deep learning model. The training job runs on a single ml.p3.2xlarge instance and is taking 10 hours. The team wants to reduce the training time to under 2 hours without changing the model architecture. Which approach is MOST effective?

Question 112 (easy, multiple choice)

A company wants to use Amazon SageMaker to host a model that was trained using a custom algorithm. The model artifact is stored in Amazon S3. The company wants to ensure that the endpoint can automatically scale based on the number of incoming requests. Which configuration should the company use?

Question 113 (medium, multiple choice)

A company uses Amazon SageMaker to train a model. The training job fails with an 'OutOfMemory' error. The training data is stored in S3 and the instance type is ml.m5.xlarge. What is the most efficient way to resolve this issue?

Question 114 (hard, multiple choice)

A data scientist is deploying a real-time inference endpoint using SageMaker. The model is a large NLP model requiring GPU for low latency. The endpoint must be highly available across two Availability Zones. Which deployment configuration meets these requirements?

Question 115 (easy, multiple choice)

A team uses AWS Glue ETL jobs to preprocess data for SageMaker training. The job runs successfully but the output data is empty. What is the most likely cause?

Question 116 (medium, multiple choice)

A company uses SageMaker to host a model for real-time predictions. The model is updated weekly. To minimize downtime during model updates, what should the company do?

Question 117 (hard, multiple choice)

A machine learning team is using SageMaker Processing jobs to run feature engineering on large datasets. The job takes a long time to complete. Which change would most likely reduce the processing time?

Question 118 (easy, multiple choice)

A company is using SageMaker to train a model. The training data includes personally identifiable information (PII). The company must ensure that the data is encrypted at rest and in transit. Which combination of actions meets these requirements?

Question 119 (medium, multiple choice)

A data scientist wants to use AWS Step Functions to orchestrate a machine learning workflow including data preprocessing, training, and evaluation. Which SageMaker integration is best suited for this purpose?

Question 120 (hard, multiple choice)

A company is using SageMaker to host a model that makes predictions on streaming data from Amazon Kinesis. The model must provide predictions with sub-second latency. Which approach should the company use?

Question 121 (easy, multiple choice)

A team is using SageMaker to train a model. They want to track hyperparameters, metrics, and model artifacts. Which SageMaker feature should they use?

Question 122 (medium, multi select)

A company is deploying a SageMaker model for real-time inference. The endpoint must be highly available and cost-effective. Which TWO actions should the company take? (Select TWO.)

Question 123 (hard, multi select)

A data scientist is using SageMaker to train a model. The training job needs to access data in an S3 bucket in a different AWS account. The data scientist has set up proper S3 bucket policies and IAM roles. Which THREE steps are necessary to allow SageMaker to access the cross-account S3 bucket? (Select THREE.)

Question 124 (medium, multi select)

A company uses SageMaker to train a model. The training job is taking too long and the data scientist wants to speed it up. Which THREE strategies should the data scientist consider? (Select THREE.)

Question 125 (hard, multiple choice)

A data scientist attempts to create a SageMaker training job using the IAM policy shown in the exhibit. The training job fails with an access denied error. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sagemaker:CreateTrainingJob",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}

Question 126 (medium, multiple choice)

A SageMaker training job fails with the failure reason shown in the exhibit. What is the most likely cause?

Exhibit: training job failure reason (exhibit content missing from the source)

Question 127 (hard, multiple choice)

A SageMaker endpoint has a CloudWatch alarm configured as shown in the exhibit. The alarm fires when the p99 latency exceeds 500 ms for two consecutive minutes. Which action should the data scientist take to reduce latency?

Exhibit

Refer to the exhibit.

{
  "AlarmName": "HighLatency",
  "MetricName": "ModelLatency",
  "Namespace": "AWS/SageMaker",
  "Statistic": "p99",
  "Period": 60,
  "EvaluationPeriods": 2,
  "Threshold": 500,
  "ComparisonOperator": "GreaterThanThreshold"
}
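
The alarm settings in the exhibit (`Threshold` 500, `EvaluationPeriods` 2, `GreaterThanThreshold`) can be illustrated with a small simulation over a series of one-minute p99 datapoints. This is a simplified sketch, not CloudWatch's actual evaluator: it ignores missing-data handling and the `INSUFFICIENT_DATA` state, and the sample datapoints are made up for illustration. The values are compared in the same unit as the threshold. AWS documents SageMaker's `ModelLatency` metric in microseconds, so the unit is worth checking before reading 500 as milliseconds.

```python
def alarm_states(datapoints, threshold=500, evaluation_periods=2):
    """Return "ALARM" or "OK" after each datapoint.

    The alarm is in ALARM only when the most recent `evaluation_periods`
    datapoints are all strictly greater than `threshold`.
    """
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = (len(window) == evaluation_periods
                     and all(value > threshold for value in window))
        states.append("ALARM" if breaching else "OK")
    return states


# Hypothetical p99 latency values, one per 60-second period.
sample = [400, 510, 520, 480]
print(alarm_states(sample))
```

A single breaching minute does not trigger the alarm. Two consecutive breaching periods are needed, and a value of exactly 500 does not breach because the comparison is strictly greater-than.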

Question 128 (easy, multiple choice)

A data scientist is training a model on Amazon SageMaker and notices that the training job is taking much longer than expected. The instance type is ml.m5.xlarge and the dataset is 10 GB in CSV format. Which action is MOST likely to reduce training time without changing the instance type?

Question 129 (medium, multiple choice)

A machine learning engineer is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires low latency (under 100 ms) and high throughput (1000 requests per second). Which SageMaker deployment option is MOST suitable?

Question 130 (hard, multiple choice)

A company is using Amazon SageMaker Ground Truth to create labeled datasets for a text classification task. The labeling job uses a private workforce of 10 annotators. After labeling 10,000 items, the quality of labels is inconsistent. Which approach will MOST effectively improve labeling consistency?

Question 131 (easy, multiple choice)

A data scientist trains a model using Amazon SageMaker's built-in XGBoost algorithm. The model overfits on the training data. Which hyperparameter adjustment is MOST likely to reduce overfitting?

Question 132 (medium, multiple choice)

A company is building a recommendation system using Amazon SageMaker. The training data includes user-item interactions stored in a DataFrame with over 100 million rows. The data scientist wants to perform feature engineering, including one-hot encoding of categorical features with high cardinality. Which approach is MOST cost-effective and scalable?

Question 133 (hard, multiple choice)

A machine learning team is using Amazon SageMaker Experiments to track multiple training runs. They need to compare the performance of different models based on metrics like accuracy and F1 score. However, when they view the experiment list in SageMaker Studio, the metrics are not displayed. What is the MOST likely cause?

Question 134 (easy, multiple choice)

A company is deploying a PyTorch model on a SageMaker endpoint for real-time inference. The model is stored as a .pth file in an S3 bucket. The data scientist wants to use the SageMaker PyTorch inference toolkit. Which file is REQUIRED in the model artifacts to serve the model?

Question 135 (medium, multiple choice)

A data scientist is using Amazon SageMaker Autopilot to automatically build a binary classification model. After the Autopilot job completes, the best model has an accuracy of 0.85 on the validation set. However, the data scientist notices a class imbalance (90% negative, 10% positive). Which metric should the data scientist use to evaluate the model's performance on the positive class?

Question 136 (hard, multiple choice)

A company is using Amazon SageMaker to train a deep learning model for image classification. The training job uses a single ml.p3.2xlarge instance and takes 10 hours. The data scientist wants to reduce training time using distributed training. Which SageMaker feature should be used?

Question 137 (medium, multi select)

A company is deploying a machine learning model using Amazon SageMaker. The model needs to be updated frequently with new data. Which TWO approaches can be used to update the model without downtime? (Choose TWO.)

Question 138 (hard, multi select)

A data scientist is using Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error message indicating that the container exited with a non-zero code. Which THREE steps should the data scientist take to diagnose the issue? (Choose THREE.)

Question 139 (easy, multi select)

A company is building a machine learning pipeline on AWS. The pipeline includes data ingestion, preprocessing, training, and deployment. Which THREE AWS services can be used to orchestrate the pipeline? (Choose THREE.)

Question 140 (easy, multiple choice)

A data scientist wants to deploy a PyTorch model for real-time inference with low latency. Which AWS service should they use?

Question 141 (medium, multiple choice)

A company's ML model training on Amazon SageMaker is taking longer than expected. The training job uses a single ml.p3.2xlarge instance. Which change is most likely to reduce training time?

Question 142 (hard, multiple choice)

A team is using Amazon SageMaker Autopilot to automatically build models. The dataset has 50 features and 1 million rows. After training, Autopilot generates multiple candidates. The team wants to deploy the model with the highest accuracy. What is the best practice to select and deploy the model?

Question 143 (easy, multiple choice)

An ML engineer needs to store and version training datasets and model artifacts. Which AWS service should they use?

Question 144 (medium, multiple choice)

A team is training a large language model using PyTorch on multiple GPUs. The training is taking too long due to inefficient data loading. Which AWS service can help accelerate data loading by caching data close to the GPU instances?

Question 145 (hard, multiple choice)

A company's ML pipeline uses AWS Step Functions to orchestrate data preprocessing, training, and evaluation. The training step occasionally fails due to a transient error. What is the most robust way to handle this without manual intervention?

Question 146 (easy, multiple choice)

A data scientist needs to perform hyperparameter optimization for a gradient boosting model. Which built-in Amazon SageMaker feature should they use?

Question 147 (medium, multiple choice)

A company's ML model is deployed on a SageMaker endpoint. The model's predictions are used in a customer-facing application that requires low latency. Over time, the model's performance degrades due to data drift. What is the most suitable approach to detect this drift automatically?

Question 148 (hard, multiple choice)

An ML team is using SageMaker Processing jobs to run feature engineering scripts. The scripts require a specific Python package not included in the default SageMaker image. How should the team provide this package?

Question 149 (medium, multi select)

Which TWO options are valid ways to reduce inference latency for a model deployed on a SageMaker real-time endpoint? (Select TWO.)

Question 150 (hard, multi select)

Which THREE steps should be taken to secure a SageMaker notebook instance that accesses sensitive data? (Select THREE.)

Question 151 (easy, multi select)

Which TWO AWS services can be used to deploy a trained model for serverless inference? (Select TWO.)

Question 152 (hard, multiple choice)

A company runs a real-time fraud detection model on a SageMaker endpoint. The model is a TensorFlow neural network trained on transactional data. The endpoint uses a single ml.p3.2xlarge instance. Recently, the application’s latency has increased from 50ms to 500ms on average. The CloudWatch metrics show that CPU utilization is at 90%, GPU utilization is at 30%, and memory utilization is at 40%. The number of requests per second has remained stable. The ML team suspects the model is not fully utilizing the GPU. What action should the team take to reduce latency without changing the instance type?

Question 153 (medium, multiple choice)

A company is deploying a machine learning model to production on Amazon SageMaker. The model requires low-latency inference (under 10 ms) for real-time predictions. The data scientist has trained a model using XGBoost and wants to minimize cost while meeting latency requirements. Which SageMaker hosting option should be used?

Question 154 (hard, multiple choice)

A data scientist is training a deep learning model on Amazon SageMaker using a custom TensorFlow container. The training job fails with an OutOfMemory error. The instance type is ml.p3.2xlarge with 16 GB GPU memory and 61 GB system memory. The model uses mixed precision training. Which step should the data scientist take to resolve the issue without changing the instance type?

Question 155 (easy, multiple choice)

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The SageMaker training role has the necessary permissions to decrypt the data. However, the training job fails with an access denied error. What is the most likely cause?

Question 156 (medium, multiple choice)

A machine learning team is deploying a model using Amazon SageMaker. They need to automatically retrain the model every week with new data and update the endpoint without downtime. Which approach should they use?

Question 157 (hard, multiple choice)

A company is running a real-time inference endpoint on Amazon SageMaker. The endpoint is using an ml.c5.xlarge instance. Over the past month, the CPU utilization has been consistently below 10%, and the latency is well within requirements. The company wants to reduce costs. What should they do?

Question 158easymultiple choice

A data scientist is using Amazon SageMaker to train a model. The training job is taking longer than expected. The data scientist notices that the GPU utilization is low. Which action would most likely improve GPU utilization?

Question 159 (medium, multiple choice)

A company is using Amazon SageMaker to deploy a model for real-time predictions. The model requires access to a DynamoDB table to look up features. The SageMaker endpoint is configured with a VPC and subnet. However, the endpoint cannot connect to DynamoDB. What is the most likely reason?

Question 160 (hard, multiple choice)

A data scientist is training a model using Amazon SageMaker. The training dataset is 500 GB and is stored in S3. The data scientist wants to use Pipe input mode to stream data directly from S3 to the training container. However, the training job fails with an error indicating that the container cannot read the data. What is the most likely cause?

Question 161 (easy, multiple choice)

A company is using Amazon SageMaker to deploy a model. The model is a large ensemble that requires 8 GB of memory. The company wants to minimize endpoint cost. Which instance type should they choose?

Question 162 (medium, multiple choice)

Refer to the exhibit. A company is using an IAM role with the attached policy to deploy a SageMaker model. The data scientist can create training jobs and models, but when trying to create an endpoint, they receive an access denied error. What is the missing permission?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint"
      ],
      "Resource": "*"
    }
  ]
}

Question 163 (hard, multiple choice)

Refer to the exhibit. A data scientist is reviewing CloudWatch logs for a SageMaker real-time endpoint. The log shows that a prediction took 15 ms. The endpoint is configured with an ml.c5.large instance and the model is a small scikit-learn model. The latency requirement is under 10 ms. Which action would most likely reduce the latency?

Exhibit

2023-01-01 12:00:00,000 - ERROR - Model prediction took 15 ms for request ID abc123

Question 164 (medium, multi select)

A company is using Amazon SageMaker to train a model. The training data includes sensitive personally identifiable information (PII). The company needs to ensure that the training data is protected and that the trained model does not inadvertently expose PII. Which TWO actions should the company take? (Choose TWO.)

Question 165 (hard, multi select)

A data scientist is deploying a model on Amazon SageMaker. The model requires inference on images, and the data scientist wants to use a GPU instance for low latency. However, the data scientist is unsure about the instance type to choose for the endpoint. Which TWO factors should the data scientist consider when selecting the instance type? (Choose TWO.)

Question 166 (medium, multi select)

A company uses Amazon SageMaker to train models. The data scientist wants to automate the retraining process whenever new data arrives in an S3 bucket. Which THREE services can be used together to achieve this? (Choose THREE.)

Question 167 (hard, multiple choice)

A company operates a real-time fraud detection system using an Amazon SageMaker endpoint. The model is a gradient boosting model trained on historical transaction data. The endpoint is deployed on an ml.c5.2xlarge instance with auto-scaling enabled based on average latency. Recently, during a flash sale event, the endpoint started returning HTTP 503 errors. The CloudWatch metrics show that the CPU utilization is at 70%, and the average latency has increased from 50 ms to 200 ms. The auto-scaling policy is configured to add one instance when average latency exceeds 100 ms for 5 consecutive minutes, and remove one instance when latency drops below 50 ms for 5 minutes. The current number of instances is 2. The flash sale lasted 30 minutes. What should the company do to prevent this issue in future flash sales?
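
The timing in this scenario can be explored with a rough Python sketch. This is a toy model, not SageMaker's actual scaling behavior: it assumes the latency breach lasts for the whole event, that each full 5-minute breach window adds exactly one instance, and that a new instance needs a hypothetical `provisioning_delay` of 5 minutes before it serves traffic (that figure is an assumption, not part of the question). Cooldown periods are ignored.

```python
def scale_out_timeline(event_minutes, breach_window, provisioning_delay, start_instances):
    """Return (minute, in_service_instances) pairs for a sustained latency breach.

    Toy model: one instance is requested at the end of every full breach
    window and becomes useful `provisioning_delay` minutes later.
    """
    timeline = []
    instances = start_instances
    trigger_minute = breach_window  # first alarm after one full breach window
    while trigger_minute + provisioning_delay <= event_minutes:
        instances += 1
        timeline.append((trigger_minute + provisioning_delay, instances))
        trigger_minute += breach_window
    return timeline


# Values from the question: 30-minute sale, 5-minute window, 2 starting instances.
# provisioning_delay=5 is a hypothetical figure chosen for illustration.
timeline = scale_out_timeline(
    event_minutes=30, breach_window=5, provisioning_delay=5, start_instances=2
)
for minute, count in timeline:
    print(f"minute {minute}: {count} instances in service")
```

Under these assumptions the first extra instance only helps a third of the way into the event, which matches the kind of lag the scenario describes.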

Question 168 (medium, multiple choice)

A data scientist is using Amazon SageMaker to train a model on a large dataset (10 TB) stored in S3 in Parquet format. The training job uses an ml.p3.16xlarge instance with multiple GPUs. The data scientist notices that the GPU utilization is low (around 30%) and the training is slow. The dataset consists of hundreds of thousands of small Parquet files. The data scientist suspects that the I/O is bottlenecked. What should the data scientist do to improve GPU utilization and training speed?

Question 169 (medium, multiple choice)

A company has deployed a machine learning model on a SageMaker endpoint that serves predictions to a web application. The model uses a custom inference container that loads the model artifacts from an ECR repository. After updating the model with new training data, the data scientist creates a new model and updates the endpoint. However, some users report that they still get predictions from the old model. The data scientist confirms that the endpoint configuration points to the new model. What is the most likely cause?

Question 170 (medium, multiple choice)

A data scientist is using SageMaker to train a model. The training job is failing with a 'ResourceLimitExceeded' error. Which action should be taken to resolve this issue?

Question 171 (hard, multiple choice)

A company is deploying a real-time inference endpoint using SageMaker. The model has a high memory footprint and requires GPU acceleration. Which instance type and configuration should be used to minimize cost while meeting latency requirements?

Question 172 (easy, multiple choice)

A machine learning team is using AWS Glue to prepare data for training. They notice that the ETL job takes a long time to process large datasets. Which change is most likely to improve performance?

Question 173 (hard, multiple choice)

A company is using SageMaker Ground Truth to label images for a computer vision model. After launching the labeling job, they notice that the labeling throughput is lower than expected. What should they do to increase throughput?

Question 174 (medium, multiple choice)

A data scientist is using SageMaker Debugger to monitor a training job. The training loss is not decreasing as expected. Which Debugger feature can help identify the issue?

Question 175 (medium, multiple choice)

A company is using Amazon Rekognition to detect objects in images stored in S3. They want to reduce costs by processing images only when they are uploaded. Which AWS service should be used to trigger Rekognition automatically?

Question 176 (easy, multiple choice)

A machine learning engineer needs to deploy a model that requires custom inference code with dependencies. Which SageMaker deployment option should be used?

Question 177 (hard, multi select)

A company is training a deep learning model on SageMaker using multiple GPUs. The training is slow due to inefficient data loading. Which TWO actions can improve I/O performance?

Question 178 (medium, multi select)

A data scientist is using SageMaker to build a model for fraud detection. The dataset is highly imbalanced. Which THREE techniques should be applied to address class imbalance?

Question 179 (medium, multi select)

A company is using SageMaker Autopilot to automatically build ML models. They want to ensure that the generated models are reproducible. Which TWO settings should they configure?

Question 180 (hard, multiple choice)

A company operates a real-time fraud detection system using SageMaker. The model is deployed on an ml.c5.xlarge instance behind an Application Load Balancer (ALB). Recently, during a sales event, traffic spiked and the endpoint returned HTTP 503 errors. The team scaled the instance count from 2 to 5, but errors persisted. CloudWatch metrics show low CPU utilization (~30%) and high memory usage (~90%). The model loads a large dictionary file (2 GB) into memory at startup. Which action should resolve the issue?

Question 181 (medium, multiple choice)

A research lab is using SageMaker to train deep learning models on a custom dataset stored in S3. Each training job uses a single ml.p3.2xlarge instance. Recently, training jobs have been failing intermittently with 'NetworkError: Connection reset by peer' during the data download phase. The data scientist notices that the dataset is 50 GB and the network throughput is low. The training script uses the default S3 download method (boto3) to copy data from S3 to the local instance storage. Which solution should the data scientist implement to resolve the issue?

Question 182 (medium, multiple choice)

A media company uses SageMaker to train a recommendation model. The training data is stored in an S3 bucket with versioning enabled. The data pipeline updates the training data daily by overwriting objects with new data. Recently, the model's performance degraded, and the team suspects that the training data was corrupted on a specific day. They want to train the model using the data from a previous version. How can the team retrieve the previous version of the training data?

Question 183 (medium, multi select)

A data scientist is deploying a machine learning model on Amazon SageMaker for real-time inference. The model requires low-latency predictions and must be able to handle up to 1000 requests per second. Which TWO actions should the data scientist take to ensure the endpoint can meet the performance requirements? (Choose 2.)

Question 184 (hard, multi select)

A machine learning team is using Amazon SageMaker to train a deep learning model on a large dataset stored in Amazon S3. The training job is taking too long. The team wants to reduce training time without modifying the model architecture. Which THREE actions should the team take? (Choose 3.)

Question 185 (medium, multiple choice)

A machine learning engineer is deploying a custom XGBoost model for real-time inference on Amazon SageMaker. The model was trained using the SageMaker XGBoost built-in algorithm. The endpoint is deployed with an ml.m5.large instance and is receiving around 50 requests per second. The engineer notices that the endpoint's latency is around 200 ms, but the requirement is under 100 ms. The model's serialized format is a .tar.gz file. The engineer wants to reduce inference latency without modifying the model or retraining. What should the engineer do?

Question 186 (hard, multiple choice)

A machine learning team is using Amazon SageMaker to train a PyTorch model on a dataset that is 500 GB in size. The training job runs on a single ml.p3.2xlarge instance, but the training takes over 48 hours, which exceeds the maximum allowed time. The team wants to reduce training time to under 24 hours. They are open to using multiple instances and have budget for up to 4 instances. The dataset is stored in Amazon S3 and can be split into shards by a key. The model architecture must remain unchanged. What should the team do?

Question 187 (medium, multiple choice)

A company is using Amazon SageMaker to host a model for real-time inference. The model was trained using SageMaker's built-in Linear Learner algorithm. The endpoint has been running for a week, and the operations team notices that the endpoint's latency has increased from 50 ms to 150 ms over the past few days. The number of requests per second has remained steady at about 200. The team suspects a memory leak in the inference container. What should the team do to diagnose the issue?

Question 188 (hard, multiple choice)

A data scientist is using Amazon SageMaker to train a TensorFlow model on a dataset that includes sensitive personal information (PII). The data is stored in Amazon S3 with server-side encryption using AWS KMS (SSE-KMS). The training job fails with an Access Denied error when trying to read from S3. The data scientist has already verified that the SageMaker execution role has s3:GetObject permissions on the S3 bucket. What additional configuration is needed?

Question 189 (medium, multiple choice)

A company is using Amazon SageMaker to deploy a model for real-time inference. The endpoint uses an ml.c5.xlarge instance. The company wants to reduce costs without affecting performance. The current traffic pattern shows a daily peak of 500 requests per second for 2 hours, and the rest of the day sees fewer than 50 requests per second. The model has a cold start time of about 30 seconds. What should the company do?
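
The traffic pattern in this scenario lends itself to a quick instance-hours comparison. The sketch below is illustrative arithmetic only: `RPS_PER_INSTANCE = 100` is a hypothetical per-instance capacity that does not come from the question, off-peak load is taken at its stated upper bound of 50 requests per second, and scale-out lag (including the 30-second cold start) is ignored.

```python
import math

PEAK_RPS = 500          # from the question
PEAK_HOURS = 2          # from the question
OFF_PEAK_RPS = 50       # upper bound stated in the question
RPS_PER_INSTANCE = 100  # hypothetical capacity, assumed for illustration


def instances_needed(rps, rps_per_instance):
    """Smallest whole number of instances that covers the load (at least one)."""
    return max(1, math.ceil(rps / rps_per_instance))


peak_instances = instances_needed(PEAK_RPS, RPS_PER_INSTANCE)
off_peak_instances = instances_needed(OFF_PEAK_RPS, RPS_PER_INSTANCE)

# Fleet sized for peak all day versus a fleet that follows the load.
fixed_instance_hours = peak_instances * 24
scaled_instance_hours = (
    peak_instances * PEAK_HOURS + off_peak_instances * (24 - PEAK_HOURS)
)

print(f"fixed fleet:  {fixed_instance_hours} instance-hours per day")
print(f"scaled fleet: {scaled_instance_hours} instance-hours per day")
```

The absolute numbers depend entirely on the assumed capacity. The point is only that a 2-hour peak leaves most of the day at a small fraction of peak load.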

Question 190 (hard, multiple choice)

A machine learning engineer is deploying a model on Amazon SageMaker that was trained using a custom Docker container. The container is stored in Amazon ECR. The engineer creates a SageMaker model and endpoint configuration, but when creating the endpoint, it fails with an error: 'Could not find the inference code at the expected path.' The engineer verified that the container image is correct and the model artifacts are in S3. What is the most likely cause?

Question 191 (medium, multiple choice)

A data scientist is using Amazon SageMaker to train a model using the built-in XGBoost algorithm. The training job uses a hyperparameter tuning job to optimize hyperparameters. The tuning job has been running for 3 hours and has completed 20 training jobs. The data scientist wants to stop the tuning job early if it is not making progress. What should the data scientist do to accomplish this?

Question 192 (medium, multiple choice)

A company is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires GPU inference. The company wants to minimize latency and cost. Which instance type and deployment strategy should be used?

Question 193 (hard, multiple choice)

A data scientist is training a model using Amazon SageMaker with a custom Docker container. The training job fails with an error: 'Resource exhausted: Out of memory'. The training data is stored in S3. What should the data scientist do to resolve this issue?

Question 194 (easy, multiple choice)

A machine learning engineer needs to deploy a model that performs real-time fraud detection. The model must be highly available and scalable. Which AWS service should be used to host the model?

Question 195 (medium, multiple choice)

A company is using Amazon SageMaker to train a model. The training job is taking too long. The data scientist notices that the GPU utilization is low. Which action should be taken to improve training performance?

Question 196 (hard, multiple choice)

A data scientist is using Amazon SageMaker Debugger to monitor training jobs. The training loss is decreasing but then suddenly spikes. What is the most likely cause and how should it be addressed?

Question 197 (easy, multiple choice)

A company wants to perform automated hyperparameter tuning for a model. Which Amazon SageMaker feature should be used?

Question 198 (medium, multiple choice)

A machine learning engineer is building a pipeline using Amazon SageMaker Pipelines. The pipeline has multiple steps including data preprocessing, training, and evaluation. Which statement about SageMaker Pipelines is correct?

Question 199 (hard, multiple choice)

A company is using Amazon SageMaker to deploy a model for real-time inference. The endpoint receives variable traffic and the company wants to optimize cost while maintaining responsiveness. Which scaling policy should be used?

Question 200 (medium, multiple choice)

A data scientist needs to process a large dataset (100 TB) for training a machine learning model. The data is stored in Amazon S3. Which approach is most cost-effective and efficient for data processing?

Question 201 (hard, multi select)

A company is deploying a machine learning model using Amazon SageMaker. The model must be updated frequently without downtime. Which TWO strategies can achieve this? (Choose two.)

Question 202 (medium, multi select)

A data scientist is using Amazon SageMaker to train a model and wants to track experiments, including parameters and metrics. Which THREE actions should be taken? (Choose three.)

Question 203 (easy, multi select)

A machine learning engineer is setting up a training job in Amazon SageMaker. Which THREE components are required to define a training job? (Choose three.)

Question 204 (easy, multiple choice)

A company is training a deep learning model on Amazon SageMaker. The training job is failing with an out-of-memory error. Which SageMaker feature should the company use to resolve this issue without changing the instance type?

Question 205 (medium, multiple choice)

A data scientist is deploying a machine learning model using SageMaker and wants to automate the retraining pipeline. The training data is updated daily in an S3 bucket. Which combination of AWS services should the data scientist use to trigger a new training job when new data arrives?

Question 206 (hard, multiple choice)

A company is using SageMaker to train a large NLP model. The training job is taking too long due to high I/O wait time. The data is stored as CSV files in S3. Which optimization should the company implement to reduce I/O wait time?

Question 207 (easy, multiple choice)

A machine learning team is using SageMaker to build a model. They need to track hyperparameter tuning experiments, compare results, and visualize metrics. Which SageMaker feature should they use?

Question 208 (medium, multiple choice)

A company has deployed a model on SageMaker for real-time inference. The endpoint is experiencing high latency during traffic spikes. Which action should the company take to reduce latency?

Question 209 (hard, multiple choice)

A data scientist is using SageMaker to train a model with a custom algorithm. The training script uses TensorFlow and runs on GPU instances. The training job fails with 'CUDA_ERROR_OUT_OF_MEMORY'. What is the most likely cause?

Question 210 (easy, multiple choice)

A company wants to use SageMaker to host multiple models behind a single endpoint to reduce costs. Which SageMaker feature should they use?

Question 211 (medium, multiple choice)

A machine learning team is using SageMaker to train a model. They want to ensure that the training data is encrypted at rest in the S3 bucket and that the data is also encrypted during transit. Which configuration should they use?

Question 212 (hard, multiple choice)

A company is using SageMaker to train a model with a large dataset that is stored in S3. The training job is taking a long time due to high I/O latency. The team has already converted the data to RecordIO format. What should they do next to reduce I/O latency?

Question 213 (easy, multi select)

Which TWO of the following are benefits of using SageMaker Managed Spot Training? (Select TWO.)

Question 214 (medium, multi select)

Which THREE of the following are valid ways to deploy a model using SageMaker? (Select THREE.)

Question 215 (hard, multi select)

Which TWO of the following are valid configurations for SageMaker Training Job resource limits? (Select TWO.)

Question 216 (medium, multiple choice)

An IAM policy attached to a SageMaker notebook role is shown in the exhibit. A data scientist is trying to run a training job from the notebook, but the job fails with an access denied error. The training job needs to read data from 'my-bucket' and write output to 'my-bucket'. What is the most likely cause of the failure?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}

Question 217 (hard, multiple choice)

A SageMaker training job log shows the exhibit. The training job fails immediately after starting. The training data is supposed to be provided via Pipe mode from S3. What is the most likely cause?

Exhibit

Refer to the exhibit.

2023-01-01 12:00:00,123 INFO - Starting training
2023-01-01 12:00:01,456 ERROR - Unable to read data from /opt/ml/input/data/training
2023-01-01 12:00:01,457 INFO - Training completed

Question 218 (easy, multiple choice)

A SageMaker endpoint configuration is shown in the exhibit. The company wants to deploy the model to a real-time endpoint. What is missing from this configuration to successfully create the endpoint?

Exhibit

Refer to the exhibit.

{
  "EndpointConfigName": "my-endpoint-config",
  "ProductionVariants": [
    {
      "VariantName": "variant1",
      "ModelName": "my-model",
      "InitialInstanceCount": 1,
      "InstanceType": "ml.m5.large"
    }
  ]
}

Question 219 (easy, multiple choice)

A data scientist is using Amazon SageMaker to train a model. Training is taking longer than expected. The scientist notices that the training job is using a single instance type with limited GPU memory. Which action will MOST likely reduce training time?

Question 220 (medium, multiple choice)

An ML team deploys a real-time inference endpoint on Amazon SageMaker. Users report high latency. The model is a PyTorch model using a custom container. Which combination of changes should the team implement to reduce latency? (Choose the best answer.)

Question 221 (hard, multiple choice)

A company uses Amazon SageMaker to host a model for fraud detection. The model uses a custom XGBoost container. The endpoint receives about 100 requests per second, each with 50 features. The team notices that the model's predictions are occasionally incorrect for a subset of requests. Which approach should the team take to debug the issue?

Question 222 (easy, multiple choice)

A machine learning engineer needs to deploy a TensorFlow model to a SageMaker endpoint. The model expects a specific input format. The engineer has the model artifacts stored in an S3 bucket. Which step is REQUIRED to deploy the model?

Question 223 (medium, multiple choice)

A company is building a recommendation system using Amazon SageMaker. The data is stored in a large S3 bucket with millions of small CSV files. The team wants to train a factorization machines model. Which data ingestion strategy will be MOST efficient?

Question 224 (hard, multiple choice)

An ML team is using SageMaker Autopilot to automatically build a binary classification model. The dataset has 500,000 rows and 200 columns, with a severe class imbalance (1% positive). Which configuration should the team set to address the imbalance?

Question 225 (easy, multiple choice)

A company wants to use Amazon Rekognition to detect objects in images stored in an S3 bucket. The images are uploaded by users. Which IAM policy statement is necessary to allow Rekognition to read from the bucket?

Question 226 (medium, multiple choice)

A data scientist is using SageMaker to train a deep learning model. The training script uses TensorFlow and runs on a single ml.p3.2xlarge instance. The scientist wants to reduce training time by using multiple GPUs. What should the scientist do?

Question 227 (hard, multiple choice)

A team has deployed a SageMaker endpoint for a sentiment analysis model. The model was trained on text data from social media. After deployment, the team notices that the model's accuracy has dropped significantly after 3 months. Which action should the team take to detect and address this issue?

Question 228 (medium, multi select)

Which TWO actions can reduce inference latency for a SageMaker real-time endpoint? (Choose 2.)

Question 229 (hard, multi select)

Which THREE are valid considerations when deploying a large deep learning model (10 GB) on a SageMaker endpoint? (Choose 3.)

Question 230 (easy, multi select)

Which TWO SageMaker features can be used to monitor and debug training jobs? (Choose 2.)

Question 231 (medium, multiple choice)

Refer to the exhibit. A developer has this IAM policy attached to an IAM role used by SageMaker. When attempting to create an endpoint, the operation fails with an access denied error. What is the MOST likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/sagemaker-inference"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/model/*"
    }
  ]
}

Question 232 (hard, multiple choice)

Refer to the exhibit. A data scientist is training a PyTorch model on a SageMaker ml.p3.2xlarge instance (16 GB GPU memory). The training fails with the shown error. Which change should the scientist make to resolve the error?

Exhibit

2023-01-15 10:30:45,123 INFO - Training job started
2023-01-15 10:30:50,567 INFO - Epoch 1/10: loss=2.345, accuracy=0.45
2023-01-15 10:31:00,789 INFO - Epoch 2/10: loss=2.123, accuracy=0.52
2023-01-15 10:31:10,012 INFO - Epoch 3/10: loss=1.987, accuracy=0.58
...
2023-01-15 10:32:30,456 ERROR - OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 15.90 GiB total capacity; 14.00 GiB already allocated; 1.50 GiB free; 14.10 GiB reserved in total by PyTorch)
2023-01-15 10:32:30,457 ERROR - Training terminated
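
The figures in the error line above can be pulled apart with a short Python sketch to see how far the failed request overshot the free memory. The regular expressions are written against this one exhibit line and are an assumption about its layout. They are not a general parser for PyTorch error messages.

```python
import re

# The error line from the exhibit, without the timestamp prefix.
MESSAGE = (
    "OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB "
    "(GPU 0; 15.90 GiB total capacity; 14.00 GiB already allocated; "
    "1.50 GiB free; 14.10 GiB reserved in total by PyTorch)"
)

# One pattern per figure reported in the message (all values in GiB).
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) GiB",
    "total": r"([\d.]+) GiB total capacity",
    "allocated": r"([\d.]+) GiB already allocated",
    "free": r"([\d.]+) GiB free",
    "reserved": r"([\d.]+) GiB reserved",
}


def parse_oom(message):
    """Extract the GiB figures from the exhibit's out-of-memory line."""
    return {
        name: float(re.search(pattern, message).group(1))
        for name, pattern in PATTERNS.items()
    }


stats = parse_oom(MESSAGE)
shortfall = stats["requested"] - stats["free"]
print(stats)
print(f"request exceeds free memory by {shortfall:.2f} GiB")
```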

Question 233 (medium, multiple choice)

Refer to the exhibit. An administrator has attached this IAM policy to a user. The user tries to start a SageMaker training job that uses a custom Docker image from Amazon ECR. The training job fails with an access denied error. What is the MOST likely reason?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sagemaker:*",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
    }
  ]
}
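
One way to read a policy like the exhibit above is to test individual action and resource pairs against it. The Python sketch below is a deliberately simplified matcher, not the real IAM evaluation logic: it only looks at `Allow` statements and ignores explicit denies, conditions, resource-based policies, and permissions held by other roles involved in the job. The helper names and the sample ARNs (training job, object key, ECR repository) are hypothetical.

```python
from fnmatch import fnmatchcase

# The identity policy from the exhibit.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "sagemaker:*", "Resource": "*"},
        {"Effect": "Allow", "Action": "s3:*", "Resource": "arn:aws:s3:::my-bucket"},
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        },
    ],
}


def _as_list(value):
    """IAM lets Action and Resource be a single string or a list."""
    return [value] if isinstance(value, str) else list(value)


def is_allowed(policy, action, resource):
    """Return True if some Allow statement matches both action and resource."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        # Action names are compared case-insensitively; resources are not.
        action_ok = any(
            fnmatchcase(action.lower(), pattern.lower())
            for pattern in _as_list(statement["Action"])
        )
        resource_ok = any(
            fnmatchcase(resource, pattern)
            for pattern in _as_list(statement["Resource"])
        )
        if action_ok and resource_ok:
            return True
    return False


# Hypothetical requests, chosen only to exercise the matcher.
checks = [
    ("sagemaker:CreateTrainingJob",
     "arn:aws:sagemaker:us-east-1:123456789012:training-job/demo"),
    ("s3:ListBucket", "arn:aws:s3:::my-bucket"),
    ("s3:GetObject", "arn:aws:s3:::my-bucket/train.csv"),
    ("ecr:BatchGetImage",
     "arn:aws:ecr:us-east-1:123456789012:repository/demo"),
]
for action, resource in checks:
    print(action, resource, is_allowed(POLICY, action, resource))
```

The matcher makes it easy to see the difference between a bucket ARN and an object ARN, and which service prefixes the policy mentions at all.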

Question 234 (medium, multiple choice)

A data scientist is training a model using Amazon SageMaker and notices that training is taking much longer than expected. The training job uses a single ml.p3.2xlarge instance. The data is stored in S3 and is about 50 GB in size. Which action would MOST likely reduce training time?

Question 235 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning engineer is deploying a model using Amazon SageMaker. The model is a PyTorch model that performs real-time inference with low latency requirements. The engineer wants to use automatic scaling based on the number of concurrent requests. Which SageMaker feature should be used to achieve this?

Question 236 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon SageMaker to build a binary classification model. The dataset is highly imbalanced, with 95% negative class and 5% positive class. Which technique should be used to address the class imbalance?

Question 237 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

An ML team is deploying a model to a SageMaker endpoint for real-time inference. The model is large (2 GB) and requires GPU for low-latency inference. The team wants to minimize cost while maintaining a response time of under 200 ms. Which instance configuration and SageMaker feature would be best?

Question 238 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is training a model using Amazon SageMaker and wants to track hyperparameter tuning jobs, training jobs, and model metrics. The team also needs to compare experiments visually. Which AWS service should be used?

Question 239 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket in a different AWS account. Which IAM policy configuration is required to allow SageMaker to access the data?

Question 240 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning engineer is deploying a model on SageMaker and needs to ensure that the endpoint can handle a sudden spike in traffic. The engineer expects traffic to increase by 10x during a promotional event. Which scaling strategy should be used?

Question 241 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is using Amazon SageMaker to train a model and wants to use a custom Docker container for training. The container requires access to a private Amazon ECR repository. Which IAM role configuration is needed?

Question 242 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company wants to use Amazon SageMaker to train a model using data that is updated daily. The training data is stored in an S3 bucket, and the team wants to automate the training process whenever new data arrives. Which AWS service should be used to trigger the SageMaker training job?

Question 243 (hard, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A company is deploying a machine learning model on Amazon SageMaker. The model needs to be updated frequently with new versions. The team wants to minimize downtime and test the new model version before routing all traffic to it. Which TWO strategies should be used together?

Question 244 (medium, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is training a model using Amazon SageMaker and wants to reduce the training time. The training job uses a single GPU instance. Which THREE actions can reduce training time?

Question 245 (easy, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A company wants to deploy a machine learning model on Amazon SageMaker and needs to monitor the model's performance in production. Which TWO AWS services can be used to set up monitoring?

Question 246 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is deploying a PyTorch model to Amazon SageMaker for real-time inference. The model runs on a large instance but inference latency is too high. Which action is MOST likely to reduce latency without sacrificing accuracy?

Question 247 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A team is using Amazon SageMaker to train a linear regression model on a dataset with 10 features. After training, they notice the model has high bias. Which action is MOST likely to reduce bias?

Question 248 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning engineer is using Amazon SageMaker to train a deep learning model. The training job is failing with a 'ResourceLimitExceeded' error. The engineer checks the account limits and sees that the current limit for the instance type is 2, and they are already using 2 instances for other jobs. Which approach would resolve the issue MOST cost-effectively?

Question 249 (medium, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is using Amazon SageMaker to train a model. The training job uses a custom Docker image stored in Amazon ECR. The training job fails with an error 'CannotPullContainerError'. Which TWO actions should the data scientist take to resolve this issue? (Choose TWO.)

Question 250 (hard, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A company is deploying a machine learning model using Amazon SageMaker. To reduce costs, they want to use SageMaker Managed Spot Training. Which THREE conditions must be met for the training job to use spot instances? (Choose THREE.)

Question 251 (easy, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A data engineer is building a data pipeline for a machine learning project using Amazon SageMaker. The raw data is stored in Amazon S3. Which TWO steps are essential to ensure data privacy and security before training? (Choose TWO.)

Question 252 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

An IAM policy attached to a SageMaker execution role is shown in the exhibit. When a data scientist tries to create a training job that writes logs to CloudWatch Logs, the job fails. What is the MOST likely reason?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob",
        "sagemaker:ListTrainingJobs"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}

Question 253 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

An engineer runs the AWS CLI command in the exhibit to create a SageMaker endpoint configuration. The endpoint is created successfully, but when invoked, the inference response is slow. The engineer wants to test with a different instance type. Which action should the engineer take?

Exhibit

Question 254 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

An engineer sees the error in the exhibit when trying to deploy a model from a model registry in SageMaker. What is the MOST likely cause?

Exhibit

[ERROR] 2023-01-15 10:23:45,123 - sagemaker - Could not find model package with arn:aws:sagemaker:us-east-1:123456789012:model-package/my-model/1
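
The ARN in the error message encodes the Region, account, model package group, and version being looked up. As a minimal sketch, assuming the general `arn:partition:service:region:account-id:resource` layout and a versioned resource of the form `model-package/<group>/<version>`, the fields can be separated as follows. The helper name `parse_model_package_arn` and the dictionary keys are hypothetical.

```python
# The ARN from the exhibit, copied verbatim.
ARN = "arn:aws:sagemaker:us-east-1:123456789012:model-package/my-model/1"


def parse_model_package_arn(arn):
    """Hypothetical helper: split a versioned model package ARN into fields."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts

    resource_parts = resource.split("/")
    if len(resource_parts) != 3 or resource_parts[0] != "model-package":
        raise ValueError(f"not a versioned model package resource: {resource!r}")
    _, group_name, version = resource_parts

    return {
        "partition": partition,
        "service": service,
        "region": region,
        "account_id": account_id,
        "group_name": group_name,
        "version": int(version),
    }


fields = parse_model_package_arn(ARN)
for key, value in fields.items():
    print(f"{key}: {value}")
```

Comparing each field with the Region, account, group name, and version actually used when the model package was registered is one way to narrow down why the lookup fails.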

Question 255 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist creates a model resource in SageMaker using the JSON configuration in the exhibit. When creating an endpoint, the deployment fails with an error 'ModelError: Cannot find inference code'. What is the MOST likely cause?

Exhibit

{
  "ContainerDefinitions": [
    {
      "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-custom-image:latest",
      "ModelDataUrl": "s3://my-bucket/model.tar.gz",
      "Environment": {
        "SAGEMAKER_PROGRAM": "train.py"
      }
    }
  ],
  "InferenceSpecification": {
    "Containers": [
      {
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-custom-image:latest",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
        "Environment": {}
      }
    ]
  }
}

Question 256 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon SageMaker to train a large natural language processing model. The training job uses a GPU instance and is expected to take several hours. The data scientist wants to monitor GPU utilization in real-time. Which approach is MOST effective?

Question 257 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist needs to deploy a trained model to Amazon SageMaker for real-time inference. The model is stored as a .tar.gz file in Amazon S3. Which AWS service is used to create a SageMaker endpoint?

Question 258 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company uses Amazon SageMaker to train machine learning models. The training data contains personally identifiable information (PII). The company needs to ensure that the data is encrypted in transit between S3 and SageMaker. Which configuration is REQUIRED?

Question 259 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning engineer is using Amazon SageMaker to train a model. The training job is taking too long. The engineer suspects the data loading is a bottleneck. Which action would MOST effectively diagnose the issue?

Question 260 (medium, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon SageMaker to run a hyperparameter tuning job. The tuning job uses Bayesian optimization. Which THREE statements about Bayesian optimization are correct? (Choose THREE.)

Question 261 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist wants to deploy a PyTorch model for real-time inference. Which SageMaker deployment option provides the lowest latency for single-digit millisecond responses?

Question 262 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A team is training a large NLP model using SageMaker. The training job fails with an OutOfMemory error. The instance type is ml.p3.2xlarge with 16 GB GPU memory. Which action should the team take to resolve the issue without changing the model architecture?

Question 263 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company uses SageMaker Pipelines to automate model retraining. The pipeline fails intermittently at the Preprocess step with a 'ResourceLimitExceeded' error. The team uses an ml.m5.xlarge instance. What is the most likely cause?

Question 264 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning engineer needs to store and version datasets for reproducibility. Which AWS service is designed for this purpose?

Question 265 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist uses SageMaker to train a model. The training job takes 10 hours, but the team needs to reduce costs. Which approach is MOST cost-effective?

Question 266 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company deploys a SageMaker endpoint for real-time inference. After a week, the response latency increases from 50 ms to 500 ms. CPU utilization is at 30%. What is the most likely cause?

Question 267 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A team needs to automatically retrain a model every week using new data. Which SageMaker feature is designed to schedule and automate this workflow?

Question 268 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A model deployed on a SageMaker endpoint is producing predictions that are consistently biased against a certain demographic. Which step should the team take FIRST to address this issue?

Question 269 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning team is using SageMaker to train a model with a custom Docker container. The training script runs locally but fails on SageMaker with a 'Permission denied' error when writing to /opt/ml/model. What is the likely cause?

Question 270 (easy, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A company wants to monitor SageMaker endpoints for data drift. Which TWO services can be used together to detect and alert on drift?

Question 271 (medium, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist needs to deploy a model with a custom inference container. Which THREE requirements must the container meet for SageMaker hosting?

Question 272 (hard, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning team is using SageMaker Pipelines to orchestrate a multi-step workflow. The pipeline fails with a 'ThrottlingException' when submitting a training job. Which TWO actions can reduce the likelihood of throttling?

Question 273 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is training a deep learning model on Amazon SageMaker using the built-in Object Detection algorithm. The training job is failing with a 'ResourceLimitExceeded' error when trying to launch multiple GPU instances. Which of the following is the MOST likely cause?

Question 274 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning team is deploying a real-time inference endpoint on Amazon SageMaker for a model that requires low latency (<100 ms). The model is a PyTorch model with custom pre- and post-processing logic. The team uses a SageMaker Model with a custom inference container. After deployment, they observe that the endpoint takes over 500 ms for the first request, but subsequent requests are fast (~50 ms). What is the MOST likely cause?

Question 275 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket. The data scientist wants to use the Pipe mode for training to stream data directly from S3 instead of downloading it first. Which of the following is a prerequisite for using Pipe mode?

Question 276 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is performing hyperparameter tuning using Amazon SageMaker Automatic Model Tuning (AMT). The job uses a random search strategy. After 20 training jobs, the best objective metric value has plateaued. The data scientist wants to explore more of the hyperparameter space. Which action should the data scientist take?

Question 277 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon Forecast for demand forecasting. The data includes time series data for multiple items. The company wants to ensure that the forecast is updated daily as new data arrives. Which approach should be used to automate this process?

Question 278 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint. The model requires GPU for inference. Which instance type should be selected?

Question 279 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is using Amazon SageMaker Ground Truth to create a labeled dataset for object detection. The team has limited budget and wants to minimize labeling costs while ensuring high-quality labels. Which approach is MOST cost-effective?

Question 280 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon SageMaker to train a model on data stored in S3. The training job needs to access data from an S3 bucket in a different AWS account. The data owner has granted cross-account access via a bucket policy. However, the training job fails with an AccessDenied error. What is the MOST likely cause?

Question 281 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is using Amazon SageMaker to train a model with a custom Docker container. The training script reads data from an S3 bucket and writes the model artifact to an S3 bucket. The training job fails with a 'NoSuchKey' error. What is the MOST likely cause?

Question 282 (hard, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon SageMaker to train a machine learning model. The training job is configured to use the File mode to download data from S3 to the training instances. The training data is stored in a single S3 bucket with multiple prefixes. Which TWO actions are required to ensure the training job can access the data? (Choose TWO.)

Question 283 (medium, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is using Amazon SageMaker to deploy a model for real-time inference. The endpoint receives a large number of requests with variable traffic patterns. The team wants to minimize cost while ensuring low latency. Which THREE actions should the team take? (Choose THREE.)

Question 284 (easy, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning team is using Amazon SageMaker to train a model. The training job uses spot instances to reduce cost. However, the training job is frequently interrupted. Which TWO actions can help mitigate the impact of spot interruptions? (Choose TWO.)

Question 285 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is deploying a machine learning model using SageMaker. The model is a PyTorch model that requires GPU for inference. The company wants to minimize costs while ensuring low latency. Which instance type should be used for the SageMaker endpoint?

Question 286 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is training a deep learning model on SageMaker using a custom Docker container. The training job fails with an error indicating that the container exited with a non-zero status. The CloudWatch logs show 'FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/training/data.csv''. What is the most likely cause?

Question 287 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company uses SageMaker to host a real-time inference endpoint. The endpoint is receiving a large number of requests, but the latency is higher than expected. The data scientist observes that the CPU utilization is low but memory utilization is high. Which action should be taken to reduce latency?

Question 288 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning engineer is deploying a model using SageMaker and wants to use automatic scaling for the endpoint based on the number of concurrent requests. The engineer has defined a scaling policy using the SageMakerVariantInvocationsPerInstance metric. However, the scaling is not triggering as expected. What could be the issue?

Question 289 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is using SageMaker Ground Truth to create a labeled dataset for object detection. After the labeling job completes, the scientist notices that the output manifest file contains incorrect labels. What is the most efficient way to correct these labels?

Question 290 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using SageMaker to train a linear regression model on a dataset that fits into memory on a single instance. The training job is taking longer than expected. The data scientist wants to reduce training time without changing the algorithm. Which approach is most effective?

Question 291 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using SageMaker to host a model that performs real-time fraud detection. The model receives high request volumes with occasional spikes. The company wants to ensure that the endpoint can handle spikes without throttling while minimizing cost. Which scaling strategy should be used?

Question 292 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is using SageMaker to train a model using the built-in XGBoost algorithm. The training job fails with the error 'AlgorithmError: Framework error: No module named 'xgboost''. What is the most likely cause?

Question 293 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using SageMaker to deploy a model for real-time inference. The model requires low latency, and the company wants to test the endpoint before production. Which approach should be used to validate endpoint performance?

Question 294 (medium, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is using SageMaker to train a model and wants to track experiments, including hyperparameters and metrics. Which TWO actions should the scientist take to set up experiment tracking? (Choose TWO.)

Question 295 (hard, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A company is deploying a machine learning model to a SageMaker endpoint and wants to ensure that the endpoint is resilient to instance failures. Which THREE steps should the company take to achieve high availability? (Choose THREE.)

Question 296 (medium, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is training a model using SageMaker and wants to use spot instances to reduce costs. Which THREE considerations should the scientist evaluate? (Choose THREE.)

Question 297 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist wants to deploy a PyTorch model for real-time inference with latency under 100 ms. Which AWS service is most suitable?

Question 298 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A team is training an XGBoost model using SageMaker with a large dataset in S3 (100 GB). Training is taking too long. Which change will most likely reduce training time without sacrificing accuracy?

Question 299 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company deploys a SageMaker model for inference. After a few days, response times increase significantly. CloudWatch metrics show high CPU utilization and memory usage. The model is a large ensemble. What is the most cost-effective solution?

Question 300 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist needs to run a one-time SQL query on a large dataset in S3 to create a training dataset. The query involves aggregations and joins. Which service is most suitable?

Question 301 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A team uses SageMaker to train a deep learning model. They notice the training job is using only a fraction of the GPU memory. Which configuration change would most improve GPU utilization?

Question 302 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company wants to serve a scikit-learn model via SageMaker. The inference code requires a custom preprocessing step that is not in the default scikit-learn container. What is the simplest way to deploy?

Question 303 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist needs to store and version machine learning models, along with metadata such as hyperparameters and metrics. Which AWS service is designed for this purpose?

Question 304 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using SageMaker to train a linear learner algorithm. The training log shows that the algorithm converges but the final loss is still high. Which change is most likely to improve the model?

Question 305 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company wants to automate the retraining of a model weekly using new data. The training script is in a SageMaker notebook. Which implementation is most maintainable?

Question 306 (easy, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist needs to select a model training infrastructure that supports distributed training across multiple GPUs and provides automatic model parallelism. Which TWO AWS services should the scientist consider?

Question 307 (medium, multi select)

Read the full Machine Learning Implementation and Operations explanation →

An ML team is deploying a model for real-time inference. They require A/B testing to compare a new model against the existing one. Which THREE steps should they take to set up this test?

Question 308 (hard, multi select)

Read the full Machine Learning Implementation and Operations explanation →

A company is using SageMaker to train a model and wants to ensure that the training data is encrypted at rest and in transit, and that the trained model artifacts are also encrypted. Which THREE actions should the company take?

Question 309 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist has this IAM policy attached to their IAM role. They are trying to run a SageMaker training job that reads data from 'my-bucket' and writes output to 'my-bucket'. The job fails. What is the most likely reason?

Exhibit

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob"
      ],
      "Resource": "*"
    }
  ]
}
```

Question 310 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A training job log shows this error. The training instance is an ml.m5.large with 8 GB EBS storage. The training data is 500 MB, and the model size is expected to be 200 MB. What is the most likely cause?

Exhibit

```
2024-01-15 10:23:45,123 - sagemaker - INFO - Training job created
2024-01-15 10:23:46,456 - sagemaker - INFO - Starting training...
2024-01-15 10:23:50,789 - root - ERROR - OSError: [Errno 28] No space left on device
```

Question 311 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using Amazon SageMaker to train and deploy a fraud detection model. The model is a gradient boosting machine (GBM) trained on a dataset with 10 million rows and 50 features. The training job runs on an ml.m5.2xlarge instance with 8 vCPUs and 32 GB memory. The training completes successfully, and the model is deployed to a real-time endpoint. After deployment, the inference latency is around 200 ms per request, which is acceptable. However, after a week, the company observes that latency increases to over 1 second during peak hours (12:00-13:00 UTC). CloudWatch metrics show CPU utilization on the endpoint instance reaches 95% during these peaks. The endpoint is configured with a single ml.m5.large instance. The company wants to maintain latency under 500 ms during peak hours without incurring unnecessary cost during off-peak hours. Which solution should the company implement?

Question 312 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist is training a TensorFlow model on a single GPU instance. The training is taking too long. Which AWS service should be used to reduce training time by distributing the workload across multiple GPUs?

Question 313 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company deploys a machine learning model on Amazon SageMaker for real-time inference. The model receives requests with large payloads (up to 5 MB) and the inference latency is high. Which configuration change would MOST likely reduce latency?

Question 314 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning team is using Amazon SageMaker to train a model with a custom algorithm packaged in a Docker container. The training job fails with the error 'Error: Unable to locate sagemaker-training toolkit.' What is the MOST likely cause?

Question 315 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist needs to perform hyperparameter optimization for a model. Which AWS service provides built-in hyperparameter tuning jobs?

Question 316 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is deploying a model to an Amazon SageMaker endpoint for real-time inference. The model requires a GPU for low-latency predictions. Which instance type should be chosen?

Question 317 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A machine learning engineer is using AWS Step Functions to orchestrate a SageMaker training job followed by a Lambda function for post-processing. The training job completes successfully, but the Lambda function fails with a timeout error. What is the MOST likely cause?

Question 318 (easy, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company wants to track and compare metrics from multiple machine learning experiments. Which Amazon SageMaker feature should be used?

Question 319 (medium, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A data scientist uses SageMaker to train a model and wants to automatically stop the training job if the loss is not improving after a certain number of steps. Which feature should be used?

Question 320 (hard, multiple choice)

Read the full Machine Learning Implementation and Operations explanation →

A company is using SageMaker to host a model for real-time inference. They notice that the endpoint's latency increases over time. The model is stateless and the inference code does not log any errors. What is the MOST likely cause?

Question 321 (easy, multi select)

Read the full Machine Learning Implementation and Operations explanation →

Which TWO AWS services can be used to deploy a machine learning model for serverless inference? (Choose TWO.)

Question 322 (medium, multi select)

Which THREE actions should be taken to ensure data security when training a model using Amazon SageMaker with data stored in Amazon S3? (Choose 3.)

Question 323 (hard, multi select)

Which TWO approaches can reduce inference latency on a SageMaker real-time endpoint? (Choose 2.)

Question 324 (hard, multiple choice)

A financial services company uses Amazon SageMaker to train a fraud detection model. The training data is stored in an S3 bucket encrypted with AWS KMS. The SageMaker training job is configured to use a custom Docker container that reads data from S3 and writes model artifacts back to S3. The training job fails with the error: 'Unable to write model artifact to s3://my-bucket/output/model.tar.gz. Access Denied.' The IAM role used by the training job has the following permissions: s3:GetObject and s3:PutObject on the bucket, and kms:Decrypt on the KMS key. The training job is not using a VPC. What is the MOST likely cause of the failure?

Question 325 (medium, multiple choice)

A media company uses SageMaker to deploy a real-time inference endpoint for content recommendation. The model is a PyTorch model that uses GPU. The endpoint is deployed with an ml.p3.2xlarge instance. Over time, the endpoint's latency increases significantly during peak hours. The company has enabled auto scaling based on CPU utilization. However, the latency spikes occur even when CPU utilization is low. The model is stateless and the inference code is efficient. What is the MOST likely cause of the latency spikes?

Question 326 (easy, multiple choice)

A startup is using SageMaker to train a model using the built-in XGBoost algorithm. The training job runs successfully but the resulting model performs poorly on the test data. The data scientist suspects overfitting. The training data is relatively small (10,000 rows). Which action should be taken to reduce overfitting?

Question 327 (easy, multiple choice)

A data scientist is training a deep learning model on a large dataset using Amazon SageMaker. The training job is taking too long and the scientist wants to reduce the training time by distributing the workload across multiple GPUs. Which SageMaker feature should be used to achieve this?

Question 328 (medium, multiple choice)

A team is deploying a machine learning model to production using Amazon SageMaker. They want to automatically scale the endpoint based on the incoming request volume, and they also need to ensure that the endpoint can handle sudden bursts of traffic without dropping requests. Which scaling policy should they use?
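
Whichever policy type is chosen, endpoint scaling is configured through Application Auto Scaling against the endpoint variant. As an illustration only, the sketch below builds the request for a target-tracking policy on the predefined `SageMakerVariantInvocationsPerInstance` metric. The helper name, endpoint name, target value, and cooldowns are hypothetical, and the code only builds the dictionary; it does not call AWS.

```python
def target_tracking_policy(endpoint_name, variant_name, invocations_per_instance,
                           scale_out_cooldown=60, scale_in_cooldown=300):
    """Build keyword arguments for Application Auto Scaling's put_scaling_policy."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Desired average invocations per instance per minute (assumed value).
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": scale_out_cooldown,  # seconds
            "ScaleInCooldown": scale_in_cooldown,    # seconds
        },
    }


policy = target_tracking_policy("my-endpoint", "AllTraffic", 100)
print(policy["ResourceId"])
```

With boto3 and credentials available, a dictionary like this would be passed as `put_scaling_policy(**policy)` on an `application-autoscaling` client, after the variant has been registered as a scalable target with minimum and maximum instance counts.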

Question 329 (hard, multiple choice)

A company has a real-time inference endpoint on Amazon SageMaker that uses a custom container. The endpoint is experiencing high latency and occasional 502 errors. The logs from the container show that the model inference time is low, but the overall response time is high. Which step is MOST likely to reduce the latency?

Question 330 (easy, multiple choice)

A machine learning engineer is deploying a model that was trained on a large dataset stored in Amazon S3. The model needs to be retrained daily with new data. Which approach is the MOST cost-effective for storing the training data while allowing quick access for retraining?

Question 331 (medium, multiple choice)

A data science team is using Amazon SageMaker to train a model. The training job is failing with an 'OutOfMemory' error. The team is using an ml.p3.2xlarge instance with 61 GB of memory. They need to resolve this issue as quickly as possible. Which action should they take?

Question 332 (hard, multiple choice)

A company uses Amazon SageMaker to deploy a model for real-time predictions. The model is updated weekly. The company wants to ensure that the new model version is gradually rolled out to a small percentage of traffic before full deployment, and that it can be rolled back quickly if issues are detected. Which deployment strategy should be used?
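
Sending a small percentage of traffic to a new model version can be expressed with weighted production variants, where each variant's share is its weight divided by the sum of all weights. The sketch below only computes those shares from a `ProductionVariants`-style list. The variant names, model names, and weights are hypothetical, and this is one possible way to split traffic rather than the only rollout mechanism.

```python
def traffic_shares(variants):
    """Return each variant's fraction of traffic: weight / sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical endpoint configuration: 90% to the current model, 10% to the new one.
variants = [
    {"VariantName": "current", "ModelName": "model-v1",
     "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
     "InitialVariantWeight": 9.0},
    {"VariantName": "candidate", "ModelName": "model-v2",
     "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
     "InitialVariantWeight": 1.0},
]

print(traffic_shares(variants))
```

Rolling back in this scheme amounts to setting the candidate's weight to zero, so all traffic returns to the current variant.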

Question 333 (easy, multiple choice)

A machine learning engineer is building a pipeline to preprocess data and train a model using Amazon SageMaker. The data is stored in Amazon S3 and the preprocessing step is computationally intensive. The engineer wants to minimize costs while ensuring that the preprocessing step does not fail due to instance termination. Which instance type should be used for the preprocessing step?

Question 334 (medium, multi select)

A company is using Amazon SageMaker to train and deploy machine learning models. The data science team wants to track and compare model versions, hyperparameters, and metrics across multiple training jobs. Which TWO AWS services should they use together to achieve this? (Choose TWO.)

Question 335 (hard, multi select)

A company is deploying a machine learning model to an Amazon SageMaker endpoint. The model receives requests with sensitive data that must be encrypted in transit and at rest. Additionally, the company needs to control access to the endpoint using AWS IAM. Which THREE steps should the company take to meet these requirements? (Choose THREE.)

Question 336 (easy, multi select)

A data scientist is using Amazon SageMaker to build a custom training algorithm. The algorithm requires a specific library that is not included in the default SageMaker containers. The scientist wants to create a custom container that includes this library. Which TWO steps are required? (Choose TWO.)

Question 337 (medium, multiple choice)

Refer to the exhibit. An IAM policy is attached to an IAM role used by a SageMaker training job. The training job fails with an access denied error when trying to write model artifacts to an S3 bucket. What is the most likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/training-data/*"
    }
  ]
}
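
One way to reason about an exhibit like this is to check each request (action plus resource ARN) against the policy's Allow statements. The sketch below is a deliberately simplified evaluator: it ignores Deny statements, conditions, and the case-insensitivity of IAM action names, and it is not how IAM is implemented. The function name `is_allowed` and the output object path are hypothetical.

```python
import json
from fnmatch import fnmatchcase

# The policy from the exhibit, reproduced verbatim.
POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["sagemaker:CreateTrainingJob", "sagemaker:DescribeTrainingJob"],
     "Resource": "*"},
    {"Effect": "Allow",
     "Action": ["s3:GetObject"],
     "Resource": "arn:aws:s3:::my-bucket/training-data/*"}
  ]
}
""")


def _as_list(value):
    """Policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Simplified check: does any Allow statement match both action and resource?"""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, a) for a in _as_list(statement["Action"]))
        resource_ok = any(fnmatchcase(resource, r) for r in _as_list(statement["Resource"]))
        if action_ok and resource_ok:
            return True
    return False


# Reading training data is covered by the second statement.
print(is_allowed(POLICY, "s3:GetObject", "arn:aws:s3:::my-bucket/training-data/train.csv"))
# Writing to a hypothetical output path is not covered by any statement.
print(is_allowed(POLICY, "s3:PutObject", "arn:aws:s3:::my-bucket/output/model.tar.gz"))
```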

Question 338 (hard, multiple choice)

A company has deployed a machine learning model on Amazon SageMaker for real-time inference. The endpoint uses a single ml.c5.xlarge instance. Recently, the traffic has increased, and the endpoint is returning HTTP 503 (Service Unavailable) errors during peak hours. The CloudWatch metrics show that the CPU utilization is consistently above 90% during peak times, and the Invocations metric shows that requests are being throttled. The data science team has already optimized the model to reduce inference time by 20%, but the errors persist. The company needs to resolve the issue without increasing costs significantly. Which course of action should be taken?

Question 339 (medium, multiple choice)

A machine learning engineer is responsible for deploying a model that was trained using a custom algorithm in Amazon SageMaker. The engineer has built a Docker container that includes the inference code and has tested it locally. The engineer now wants to deploy the container to a SageMaker endpoint for real-time inference. The engineer has already created the model in SageMaker by specifying the image URI and the model artifacts location in S3. However, when the engineer tries to create an endpoint configuration, the operation fails with an error indicating that the model is not in an 'Active' state. What should the engineer do to resolve this issue?

Question 340 (easy, multiple choice)

A data scientist is using Amazon SageMaker to train a model using a built-in algorithm. The training job uses a large dataset stored in Amazon S3, and the scientist wants to use pipe mode to stream the data directly from S3 to the training instance, reducing the time needed to download the data. The training job is configured with 'InputMode' set to 'Pipe'. However, the training job fails with an error indicating that the algorithm does not support pipe mode. What should the scientist do to resolve this issue?

Question 341 (medium, multiple choice)

A company has a SageMaker endpoint that serves predictions for a mobile app. The endpoint is deployed on a single ml.m5.large instance. Recently, users have reported that the app sometimes returns outdated predictions. The data science team has confirmed that the model is updated daily by retraining with new data and creating a new endpoint configuration. However, the endpoint still returns predictions from the old model for some requests. The team has verified that the new endpoint configuration is associated with the endpoint and that the endpoint is in service. What is the most likely cause of this issue?

Question 342 (medium, multi select)

A data science team is deploying a machine learning model using Amazon SageMaker. The model requires GPU inference and must handle variable traffic with low latency. Which TWO options should the team implement to meet these requirements? (Choose TWO.)

Question 343 (hard, multi select)

A machine learning engineer is designing an automated ML pipeline for training and deploying models. The pipeline must include data validation, model training, hyperparameter tuning, and model deployment. The engineer wants to use AWS services that integrate well and provide version control. Which THREE services should be combined to achieve this? (Choose THREE.)

Question 344 (hard, multiple choice)

A company uses Amazon SageMaker to train machine learning models. The data science team has developed a training script that uses TensorFlow. They want to run the training job on a GPU instance (ml.p3.2xlarge) and store the model artifact in Amazon S3. The training job completes successfully, but the model artifact is not saved to S3. The team has confirmed that the S3 bucket policy allows write access from the SageMaker execution role. The training script uses the TensorFlow estimator with the following configuration:

```
tensorflow_estimator = TensorFlow(
    entry_point='train.py',
    role='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path='s3://my-bucket/output',
    framework_version='2.3',
    py_version='py37',
)
```

The train.py script saves the model using `model.save('/opt/ml/model')`. What is the MOST likely reason the model artifact is not being saved to S3?
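
For context, SageMaker training scripts commonly read the local model directory from the `SM_MODEL_DIR` environment variable instead of hard-coding the path. The sketch below shows only that convention and is not presented as the answer to the question. The function name `parse_args` and the `--sm-model-dir` flag are illustrative, and the fallback to `/opt/ml/model` is an assumption for running outside a training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Resolve the directory where the script should write the trained model."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sm-model-dir",
        type=str,
        # Inside a SageMaker training container this variable is set by the
        # platform; the fallback is only for local runs (assumption).
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Ignore any other hyperparameters passed to the script.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args([])
print(args.sm_model_dir)
```

A script written this way would then save to `args.sm_model_dir` rather than to a literal path.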

Question 345 (medium, multiple choice)

A company is using Amazon SageMaker to host a real-time inference endpoint for a natural language processing model. The endpoint is configured with an ml.m5.large instance. After deployment, the company observes that the inference latency is higher than expected, and the endpoint is experiencing CPU utilization near 100% during peak hours. The model is a PyTorch model that uses a transformer architecture. The company wants to reduce latency without increasing cost significantly. Which approach should the company take?

Question 346 (easy, multiple choice)

A data scientist is using Amazon SageMaker to train a model using a built-in algorithm. The training job is taking a long time, and the data scientist wants to improve performance by using a larger instance type with more vCPUs. The training job is currently using an ml.m5.large instance. The data scientist changes the instance type to ml.m5.4xlarge and resubmits the training job. However, the training time does not decrease significantly. What is the MOST likely reason?

Question 347 (medium, multiple choice)

A company is deploying a machine learning model using AWS Lambda for real-time inference. The model is a large ensemble model that takes approximately 500 MB of memory. The Lambda function is configured with 1024 MB of memory and a timeout of 15 seconds. The company observes that the function frequently times out during inference. The company wants to keep using Lambda for its serverless benefits. Which solution should the company implement to reduce inference time?

Question 348 (hard, multiple choice)

A company is using Amazon SageMaker Ground Truth to build a training dataset for an image classification model. The company has a large number of unlabeled images stored in Amazon S3. The data science team wants to use a private workforce consisting of internal employees to label the images. The team creates a labeling job with a private workforce. After starting the job, the team notices that the labeling tasks are not being assigned to any workers. The workers have been added to the private workforce and have received their login credentials. What is the MOST likely cause of this issue?

Question 349 (medium, multiple choice)

A company is using Amazon SageMaker to train a deep learning model. The training job uses a script that reads data from Amazon S3 using the SageMaker SDK's `s3_input` class. The training job runs on a single ml.p3.2xlarge instance. The data scientist notices that the GPU utilization is very low during training, often below 20%. The training dataset is large, approximately 50 GB, stored as TFRecord files in S3. What is the MOST likely cause of low GPU utilization?

Question 350 (easy, multiple choice)

A company uses Amazon SageMaker to deploy a model for real-time inference. The model is a linear regression model that was trained using the SageMaker built-in Linear Learner algorithm. The endpoint is configured with an ml.m5.large instance. After deployment, the company notices that the endpoint returns incorrect predictions. The training data was normalized, but the inference requests send raw feature values without normalization. What should the company do to fix the issue?
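
The mismatch described here (normalized training data, raw inference inputs) can be illustrated with a small standardization sketch: the statistics are computed once from the training data and the same transform is applied to every inference request. The function names and sample values are hypothetical, and this shows the principle only, not a specific SageMaker feature.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Compute per-feature (mean, std) from training rows."""
    columns = list(zip(*rows))
    # Fall back to 1.0 when a feature is constant to avoid dividing by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]


def transform(row, scaler):
    """Apply the training-time normalization to one raw feature row."""
    return [(x - m) / s for x, (m, s) in zip(row, scaler)]


# Hypothetical training data with two features.
training_rows = [[1.0, 10.0], [3.0, 30.0]]
scaler = fit_scaler(training_rows)

# A raw inference request must go through the same transform before prediction.
print(transform([3.0, 30.0], scaler))
```

If the transform is skipped at inference time, the model receives values on a different scale from the one it was trained on, which matches the symptom in the question.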

Question 351 (hard, multiple choice)

A company is using Amazon SageMaker to train a model using a custom Docker container. The training script writes model artifacts to the `/opt/ml/model` directory. The training job completes successfully, but the model artifacts are not uploaded to the S3 output path specified in the training job. The company has verified that the SageMaker execution role has the necessary S3 permissions. The Docker container is built using a base image that is not one of the official SageMaker Docker images. What is the MOST likely reason for the failure to upload model artifacts?