CCNA Machine Learning Implementation and Operations Questions

75 of 351 questions · Page 1/5 · Machine Learning Implementation and Operations · Answers revealed

1

MCQmedium

A company uses Amazon SageMaker to train a model. The training job fails with an 'OutOfMemory' error. The training data is stored in S3 and the instance type is ml.m5.xlarge. What is the most efficient way to resolve this issue?

A.Enable managed spot training

B.Reduce the batch size in the training script

C.Increase the number of instances using distributed training

D.Use a larger instance type, such as ml.m5.2xlarge

AnswerD

Larger instance provides more memory.

Why this answer

Option D is correct: a larger instance type provides more memory. Option C (increasing the instance count) does not increase memory per instance. Option B (reducing the batch size) may help but is less efficient.

Option A (enabling managed spot training) does not address memory.

2

MCQmedium

A company has deployed a machine learning model on a SageMaker endpoint that serves predictions to a web application. The model uses a custom inference container that loads the model artifacts from an ECR repository. After updating the model with new training data, the data scientist creates a new model and updates the endpoint. However, some users report that they still get predictions from the old model. The data scientist confirms that the endpoint configuration points to the new model. What is the most likely cause?

A.The new model artifacts are not correctly uploaded to S3

B.The endpoint is behind a load balancer that is not updated

C.The inference container is cached and not pulling the new image

D.DNS caching on the client side is resolving to the old endpoint IP address

AnswerD

DNS caching can cause stale responses.

Why this answer

Option D is correct because DNS caching on the client side may resolve the endpoint's DNS name to an old IP address. Option A (incorrectly uploaded model artifacts) would affect all users. Option B (load balancer) is wrong because a separately updated load balancer is not part of SageMaker.

Option C (a cached container image) would likewise affect all users rather than only some.

3

MCQmedium

A data scientist is using Amazon SageMaker to train a model with a custom Docker container. The training script reads data from an S3 bucket and writes the model artifact to an S3 bucket. The training job fails with a 'NoSuchKey' error. What is the MOST likely cause?

A.The training script is not compatible with the Docker image.

B.The training data path specified in the input data channel is incorrect.

C.The Docker image is not available in Amazon ECR.

D.The SageMaker execution role does not have s3:GetObject permission.

AnswerB

NoSuchKey means the S3 key does not exist.

Why this answer

Option B is correct because the error indicates that the specified file or key does not exist in the S3 bucket. Option C is wrong because a missing ECR image would cause a different error. Option D is wrong because insufficient permissions would cause AccessDenied.

Option A is wrong because the training image is not the issue.

4

Multi-Selecthard

A machine learning engineer is designing an automated ML pipeline for training and deploying models. The pipeline must include data validation, model training, hyperparameter tuning, and model deployment. The engineer wants to use AWS services that integrate well and provide version control. Which THREE services should be combined to achieve this? (Choose THREE.)

Select 3 answers

A.AWS Glue

B.AWS Step Functions

C.AWS CodePipeline

D.Amazon EMR

E.Amazon SageMaker

AnswersB, C, E

Correct: Orchestrates the ML pipeline steps.

Why this answer

Option B (AWS Step Functions) orchestrates the pipeline steps. Option E (Amazon SageMaker) provides training, tuning, and deployment. Option C (AWS CodePipeline) manages version control and automation of the entire CI/CD workflow.

Option A (AWS Glue) is for ETL but not pipeline orchestration. Option D (Amazon EMR) is for big data processing, not ML pipeline management.

5

Multi-Selecthard

Which THREE actions can help reduce the inference latency of a SageMaker endpoint? (Choose three.)

Select 3 answers

A.Use a larger instance type with more CPU/GPU

B.Enable SageMaker Batch Transform to process predictions offline

C.Enable data compression to reduce payload size

D.Use a multi-model endpoint to share instances across models

E.Increase the number of instances in the endpoint

AnswersA, B, C

More compute power reduces per-request latency.

Why this answer

Options A, B, and C are correct. Using a larger instance with more compute power, enabling SageMaker Batch Transform for offline predictions, and enabling data compression reduce latency. Option D (multi-model endpoint) helps with memory but not latency.

Option E (more instances) improves throughput but not per-request latency.

6

MCQmedium

You are deploying a PyTorch model to a SageMaker endpoint. The model is large (5 GB) and the endpoint is using an ml.c5.2xlarge instance. Inference latency is higher than required. Which change would most effectively reduce latency?

A.Reduce the batch size in the inference code

B.Decrease the number of model server workers

C.Enable SageMaker Elastic Inference

D.Use a GPU instance type such as ml.p3.2xlarge

AnswerD

GPU accelerates matrix operations in PyTorch.

Why this answer

Option D is correct because using a GPU instance accelerates PyTorch inference. Option C is wrong because Elastic Inference could help but may not be as effective as a full GPU. Option A is wrong because a smaller batch size may increase latency per request.

Option B is wrong because reducing workers can increase latency.

7

MCQeasy

Refer to the exhibit. A SageMaker endpoint logs this error. What is the most likely cause?

A.The model is corrupted

B.There is a network connectivity issue

C.The input data type is incorrect

D.The input data has fewer features than the model expects

AnswerD

The error explicitly states shape mismatch: expected 10 features, got 8.

Why this answer

Option D is correct because the error indicates an input shape mismatch. Option C is wrong because the data type is not mentioned. Option A is wrong because model corruption would cause different errors.

Option B is wrong because network issues would not cause a shape mismatch.
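
A shape mismatch like the one described here can be caught before the model is invoked. The following is a minimal sketch, assuming a model that expects 10 features as stated in the error summary; `EXPECTED_FEATURES` and `validate_features` are hypothetical names and not part of any SageMaker API.

```python
EXPECTED_FEATURES = 10  # assumption: taken from the error summary in this question


def validate_features(row, expected=EXPECTED_FEATURES):
    """Raise ValueError if the row does not have the expected number of features."""
    if len(row) != expected:
        raise ValueError(
            f"shape mismatch: expected {expected} features, got {len(row)}"
        )
    return row
```

Calling `validate_features` on an 8-element row raises a `ValueError` whose message names both counts, which makes the cause visible in the endpoint logs.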

8

MCQmedium

A machine learning team is using SageMaker to train a model. They want to ensure that the training data is encrypted at rest in the S3 bucket and that the data is also encrypted during transit. Which configuration should they use?

A.Use client-side encryption and transfer data via HTTP

B.Use SSE-S3 encryption on the S3 bucket and enforce HTTPS

C.Use SSE-KMS encryption on the S3 bucket and disable HTTP

D.Use SSE-C encryption on the S3 bucket and HTTPS

E.Use no encryption on S3 but use HTTPS

AnswerB

SSE-S3 encrypts at rest; HTTPS encrypts in transit.

Why this answer

Option B is correct because SSE-S3 encrypts data at rest, and HTTPS ensures encryption in transit. Option C (SSE-KMS) also works but requires KMS keys. Option A (client-side encryption) is not managed by SageMaker, and HTTP is not encrypted in transit.

Option D (SSE-C) requires customer keys. Option E leaves the data unencrypted at rest.
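
One common way to enforce HTTPS on a bucket is a bucket policy that denies requests where `aws:SecureTransport` is false. The sketch below only builds the policy document; the bucket name is a hypothetical placeholder, and attaching the policy to a bucket is left out.

```python
import json

BUCKET = "example-training-bucket"  # hypothetical bucket name


def https_only_policy(bucket):
    """Build a bucket policy that denies any request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Requests made over plain HTTP have SecureTransport == false.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy_json = json.dumps(https_only_policy(BUCKET), indent=2)
```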

9

MCQmedium

A data scientist wants to use AWS Step Functions to orchestrate a machine learning workflow including data preprocessing, training, and evaluation. Which SageMaker integration is best suited for this purpose?

A.Implement each step as an AWS Lambda function and call Step Functions

B.Use the SageMaker SDK with Step Functions service integrations

C.Use SageMaker Pipelines to define the workflow

D.Use AWS Batch to run the steps sequentially

AnswerB

Step Functions has built-in integrations for SageMaker training, processing, and endpoints.

Why this answer

Option B is correct: the SageMaker SDK with Step Functions integration allows direct orchestration. Option C (SageMaker Pipelines) is a different orchestration tool. Option A (Lambda) adds complexity.

Option D (Batch) is for non-interactive jobs.

10

Multi-Selecthard

Which THREE measures can help reduce inference latency for a deep learning model deployed on SageMaker real-time endpoints? (Select THREE.)

Select 3 answers

A.Enable SageMaker Neo to compile the model.

B.Increase the batch size for inference.

C.Use GPU instances for inference.

D.Reduce the input data size (e.g., lower resolution images).

E.Use a multi-model endpoint to share the instance.

AnswersA, C, D

Neo optimizes models for target hardware, reducing latency.

Why this answer

To reduce latency, use GPU instances, compile the model with SageMaker Neo, and reduce the input size. Multi-model endpoints share resources but add latency when switching models. Increasing the batch size usually increases latency per request, although it can improve throughput.

The three correct measures are: use GPU instances, enable SageMaker Neo, and reduce input data size.

11

MCQmedium

A data scientist is using SageMaker Ground Truth to create a labeled dataset for object detection. After the labeling job completes, the scientist notices that the output manifest file contains incorrect labels. What is the most efficient way to correct these labels?

A.Create an incremental labeling job that includes only the mislabeled items.

B.Delete the labeling job and start over with a different set of workers.

C.Use the SageMaker console to edit the incorrect labels directly in the manifest file.

D.Create a new labeling job with the same dataset and manually verify all labels.

AnswerA

Efficiently corrects only errors.

Why this answer

Option A is correct because Ground Truth allows creating an incremental labeling job to correct only the mislabeled items, avoiding re-labeling all data. Option D is wrong because it would require re-labeling everything, which is inefficient. Option C is wrong because it does not address labeling errors.

Option B is wrong because it would lose the correctly labeled data.

12

Multi-Selectmedium

Which THREE actions should be taken to ensure data security when training a model using Amazon SageMaker with data stored in Amazon S3? (Choose 3.)

Select 3 answers

A.Use a VPC to isolate the SageMaker training job

B.Apply an S3 bucket policy that denies all access except from the SageMaker service

C.Attach an EBS volume for storing training data

D.Enable server-side encryption on the S3 bucket

E.Use an IAM role with least privilege permissions

AnswersA, D, E

Network isolation.

Why this answer

Encrypt data at rest (Option D), use a VPC (Option A), and use an IAM role with least privilege (Option E). Option C is wrong because SageMaker uses S3, not EBS, for training data. Option B is wrong because SageMaker does not support S3 bucket policies directly; IAM roles are used.

13

Multi-Selecteasy

A machine learning pipeline uses SageMaker Processing jobs for feature engineering. Which TWO are benefits of using SageMaker Processing over running a custom script on an EC2 instance?

Select 2 answers

A.Automatically manages the compute resources

B.Integrates with SageMaker Experiments for tracking

C.Provides a built-in VPC for network isolation

D.Allows use of custom Docker images from any registry

E.Supports multiple programming languages

AnswersA, B

SageMaker provisions and tears down resources.

Why this answer

Options A and B are correct. Option A: SageMaker Processing manages the infrastructure. Option B: it integrates with SageMaker Experiments.

Option E is wrong because EC2 can also use various runtimes. Option C is wrong because an EC2 instance can also run in a VPC. Option D is wrong because custom Docker images can also be run on EC2.

14

Multi-Selectmedium

A data scientist needs to deploy a model with a custom inference container. Which THREE requirements must the container meet for SageMaker hosting?

Select 3 answers

A.Provide a training script at /opt/ml/input/data

B.Use the SageMaker Python SDK to load the model

C.Implement a /ping endpoint for health checks

D.Serve on port 8080

E.Implement a /invocations endpoint for predictions

AnswersC, D, E

SageMaker uses /ping to check container health.

Why this answer

SageMaker requires containers to implement the /invocations and /ping endpoints, and to serve on port 8080. Options C, D, and E are correct. Option A is for training; option B is optional.

15

Multi-Selecthard

A data scientist is using SageMaker to train a model. The training job needs to access data in an S3 bucket in a different AWS account. The data scientist has set up proper S3 bucket policies and IAM roles. Which THREE steps are necessary to allow SageMaker to access the cross-account S3 bucket? (Select THREE.)

Select 3 answers

A.Configure the S3 bucket policy to grant access to the SageMaker execution role ARN from the training account

B.Create a VPC endpoint for S3 in the training account

C.Create an IAM role in the data account with permissions to read from the S3 bucket

D.Use an AWS KMS key to encrypt the data in transit

E.Configure the SageMaker execution role in the training account to assume the IAM role in the data account

AnswersA, C, E

Bucket policy must allow cross-account access.

Why this answer

Options A, C, and E are correct. C: The role must have S3 access. A: The bucket policy must allow cross-account access.

E: The SageMaker execution role must be able to assume the cross-account role. B (VPC endpoint) is not required for cross-account access. D (KMS) is relevant only if encryption is used.
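
The bucket-policy part of this setup can be sketched as a policy document built in Python. The bucket name, account ID, and role name below are hypothetical placeholders; the policy grants read access to a single role ARN from the training account.

```python
def cross_account_read_policy(bucket, role_arn):
    """Bucket policy granting list and read access to one cross-account role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Hypothetical identifiers, for illustration only.
policy = cross_account_read_policy(
    "example-data-bucket",
    "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole",
)
```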

16

MCQhard

A machine learning engineer is deploying a model using Amazon SageMaker. The model is a PyTorch model that performs real-time inference with low latency requirements. The engineer wants to use automatic scaling based on the number of concurrent requests. Which SageMaker feature should be used to achieve this?

A.Create an AWS Auto Scaling group for the SageMaker endpoint.

B.Enable Elastic Load Balancing for the endpoint.

C.Use Amazon SageMaker automatic scaling with a target tracking scaling policy.

D.Deploy the model behind Amazon API Gateway with a Lambda function.

AnswerC

This scales based on invocations per instance.

Why this answer

Option C is correct because SageMaker automatic scaling with a target tracking scaling policy based on the SageMakerVariantInvocationsPerInstance metric automatically adjusts the number of instances. Option B is wrong because Elastic Load Balancing is not something you enable on a SageMaker endpoint; the endpoint uses a built-in load balancer. Option A is wrong because SageMaker does not natively support AWS Auto Scaling groups for endpoints.

Option D is wrong because Amazon API Gateway is used for REST APIs, not for scaling SageMaker endpoints.
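
A target tracking policy of this kind is usually expressed as a request to Application Auto Scaling. The sketch below only builds the keyword arguments intended for boto3's `put_scaling_policy` on the `application-autoscaling` client and does not call AWS; the endpoint name, variant name, target value, and cooldowns are illustrative assumptions, and the variant would first need to be registered as a scalable target.

```python
def target_tracking_request(endpoint_name, variant_name, target_invocations):
    """Keyword arguments for a target tracking policy on a SageMaker variant."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Desired average invocations per instance (illustrative value).
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds, assumed
            "ScaleOutCooldown": 60,   # seconds, assumed
        },
    }


request = target_tracking_request("example-endpoint", "AllTraffic", 100)
```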

17

MCQmedium

A data scientist is training a model using Amazon SageMaker and wants to track hyperparameter tuning jobs, training jobs, and model metrics. The team also needs to compare experiments visually. Which AWS service should be used?

A.Amazon CloudWatch Logs

B.AWS Glue DataBrew

C.AWS Step Functions

D.Amazon SageMaker Experiments

AnswerD

SageMaker Experiments tracks and compares experiments.

Why this answer

Option D is correct because SageMaker Experiments is purpose-built for tracking and comparing ML experiments. Option B is wrong because AWS Glue DataBrew is for data preparation, not experiment tracking. Option A is wrong because CloudWatch Logs stores logs but does not provide experiment comparison.

Option C is wrong because Step Functions is for workflow orchestration.

18

MCQmedium

A deployed SageMaker endpoint is returning high latency. The model is a scikit-learn Random Forest. Which action is most likely to reduce latency?

A.Reduce the number of trees in the ensemble

B.Prune decision trees in the model

C.Increase the number of instances behind the endpoint

D.Switch to a GPU instance type

AnswerA

Fewer trees reduce computation time per inference.

Why this answer

Option A is correct because reducing the number of trees directly speeds up inference. Option C is wrong because increasing the instance count may not reduce per-request latency. Option D is wrong because a GPU may not help tree-based models.

Option B is wrong because pruning reduces accuracy and is not the best approach.

19

MCQhard

A machine learning engineer is deploying a TensorFlow model to an Amazon SageMaker endpoint. The endpoint is behind an Application Load Balancer (ALB) for A/B testing. The engineer notices that the new variant is not receiving any traffic. What is the most likely cause?

A.The new variant's health checks are failing.

B.The ALB target group weight for the new variant is set to 0.

C.The model is not compatible with the ALB's protocol.

D.The ALB is not configured to route to SageMaker endpoints.

AnswerB

Weight of 0 means no traffic is sent.

Why this answer

Option B is correct because the ALB target group weight determines traffic distribution, and a weight of 0 sends no traffic. Option A is wrong because routing is based on weights, not health checks, as long as the targets are healthy. Option D is wrong because the SageMaker endpoint is the target.

Option C is wrong because a protocol incompatibility would cause errors, not zero traffic.

20

Multi-Selecteasy

A company wants to deploy a machine learning model on Amazon SageMaker and needs to monitor the model's performance in production. Which TWO AWS services can be used to set up monitoring?

Select 2 answers

A.Amazon CloudWatch

B.AWS X-Ray

C.Amazon Inspector

D.Amazon SageMaker Model Monitor

E.AWS Config

AnswersA, D

CloudWatch monitors endpoint metrics like latency and invocations.

Why this answer

Options A and D are correct. Amazon CloudWatch can monitor endpoint metrics, and SageMaker Model Monitor can detect data drift. Option E is wrong because AWS Config is for resource compliance, not model monitoring.

Option C is wrong because Amazon Inspector is for security assessment. Option B is wrong because AWS X-Ray is for tracing requests, not model performance.

21

Multi-Selectmedium

A data scientist is deploying a model to a SageMaker endpoint and needs to optimize for cost while maintaining low latency. Which TWO actions should the data scientist take?

Select 2 answers

A.Use a larger instance type

B.Deploy to a single instance

C.Switch to batch transform

D.Use SageMaker Serverless Inference

E.Enable Auto Scaling on the endpoint

AnswersD, E

Pay per inference, scales automatically, cost-effective.

Why this answer

Options D and E are correct. Option D: Serverless Inference scales automatically, and costs are based on usage. Option E: Auto Scaling adjusts the instance count based on traffic, reducing cost during low demand.

Option A is wrong because larger instances increase cost. Option C is wrong because batch transform is not real-time. Option B is wrong because reducing to a single instance increases latency.

22

MCQeasy

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The SageMaker training role has the necessary permissions to decrypt the data. However, the training job fails with an access denied error. What is the most likely cause?

A.The S3 bucket policy does not grant access to the training role

B.The training image is not compatible with encrypted data

C.The training role does not have kms:Decrypt permission for the KMS key

D.CloudTrail logging is disabled

E.The training job is not in the same VPC as the S3 bucket

AnswerC

KMS requires explicit decrypt permission.

Why this answer

Option C is correct because SageMaker needs explicit permissions to use the KMS key for decryption. Option A (S3 bucket policy) is less likely if the role has S3 access. Option E (VPC) is unrelated to KMS.

Option B (training image) is not relevant. Option D (CloudTrail) is for logging.

23

MCQeasy

A team is using SageMaker to train a model. They want to track hyperparameters, metrics, and model artifacts. Which SageMaker feature should they use?

A.SageMaker Pipelines

B.SageMaker Experiments

C.SageMaker Debugger

D.SageMaker Model Registry

AnswerB

Experiments track hyperparameters, metrics, and artifacts.

Why this answer

Option B is correct: SageMaker Experiments tracks parameters, metrics, and artifacts. Option D (Model Registry) manages model versions. Option A (Pipelines) orchestrates workflows.

Option C (Debugger) monitors training.

24

MCQeasy

A data scientist wants to deploy a PyTorch model for real-time inference with latency under 100 ms. Which AWS service is most suitable?

A.Amazon SageMaker real-time endpoint

B.Amazon SageMaker Processing

C.AWS Lambda with container image

D.Amazon SageMaker Batch Transform

AnswerA

Provides low-latency inference suitable for real-time applications.

Why this answer

Option A is correct because Amazon SageMaker real-time endpoints provide low-latency inference. Option D (SageMaker Batch Transform) is for batch predictions, not real-time. Option C (Lambda) has limited runtime and scalability for large models.

Option B (SageMaker Processing) is for data processing, not inference.

25

Multi-Selectmedium

Which TWO actions can help reduce inference latency for a SageMaker endpoint?

Select 2 answers

A.Switch to batch transform

B.Use SageMaker Neo to optimize the model

C.Use a larger instance type

D.Enable SageMaker Endpoint Cache

E.Use a multi-model endpoint

AnswersB, D

Neo optimizes models for target hardware, reducing latency.

Why this answer

Options B and D are correct. Option B optimizes the model for the target hardware, and Option D enables caching. Option C is wrong because larger instances may not reduce latency.

Option A is wrong because batch transform is for batch workloads, not real-time inference. Option E is wrong because multi-model endpoints are for hosting multiple models, not latency reduction.

26

MCQmedium

A data scientist is reviewing the training logs from a SageMaker training job. The model's loss decreases steadily and accuracy increases. However, when the model is evaluated on a holdout test set, the accuracy is only 0.65. Which issue does this behavior suggest?

A.The model is overfitting to the training data.

B.The learning rate is too high.

C.The model is underfitting the training data.

D.The training data is corrupted.

AnswerA

High training accuracy but low test accuracy is classic overfitting.

Why this answer

Option A is correct because improving training metrics combined with low test accuracy indicates overfitting. Option B is wrong because loss and accuracy improve steadily. Option D is wrong because it is not a data issue.

Option C is wrong because the model is not underfitting.

27

MCQhard

A company's ML pipeline uses AWS Step Functions to orchestrate data preprocessing, training, and evaluation. The training step occasionally fails due to a transient error. What is the most robust way to handle this without manual intervention?

A.Implement a retry policy with exponential backoff on the training step in the state machine

B.Configure a CloudWatch alarm to notify the team when the step fails

C.Use a parallel state to run multiple training instances simultaneously

D.Use a custom Lambda function to catch the error and restart the training step

AnswerA

Step Functions supports retry policies for transient errors.

Why this answer

Option A is correct: retry with exponential backoff handles transient errors. Option D is wrong because a separate Lambda function adds complexity. Option B is wrong because it only alerts and does not fix the failure.

Option C is wrong because running multiple training instances in parallel wastes resources.
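
A retry policy of this kind is declared on the task state in the Amazon States Language. The sketch below builds such a state as a Python dictionary and computes the resulting wait times; the attempt count, interval, and backoff rate are illustrative assumptions, and the training job parameters are left empty.

```python
import json


def training_state_with_retry(max_attempts=3, interval_seconds=30, backoff_rate=2.0):
    """Task state for a SageMaker training job with a Retry block."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
        "Parameters": {},  # training job parameters omitted in this sketch
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": interval_seconds,
                "MaxAttempts": max_attempts,
                "BackoffRate": backoff_rate,
            }
        ],
        "End": True,
    }


def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Seconds waited before each retry under exponential backoff."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


state_json = json.dumps(training_state_with_retry(), indent=2)
```

With the defaults above, `retry_delays(30, 2.0, 3)` gives waits of 30, 60, and 120 seconds before the state finally fails.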

28

MCQhard

A team has deployed a SageMaker endpoint for a sentiment analysis model. The model was trained on text data from social media. After deployment, the team notices that the model's accuracy has dropped significantly after 3 months. Which action should the team take to detect and address this issue?

A.Use SageMaker A/B testing to compare with a new model.

B.Enable SageMaker Model Monitor to detect data drift and trigger a retraining pipeline.

C.Re-deploy the model using the same training script.

D.Create a CloudWatch alarm on invocation errors.

AnswerB

Model Monitor can detect drift and trigger automated retraining.

Why this answer

Setting up SageMaker Model Monitor to detect drift and triggering a retraining pipeline (Option B) automates detection and correction. Option C (re-deploy) does not address the root cause. Option D (CloudWatch alarm) only monitors invocation errors, not accuracy.

Option A (A/B testing) helps compare models but does not detect drift automatically.

29

MCQhard

A company uses SageMaker to train a model each night. The training data is stored in an S3 bucket with SSE-S3 encryption. The training job fails with an access denied error. Which configuration is needed?

A.Configure the training job to run in a VPC with S3 VPC Endpoint

B.Create an IAM role with S3 read access and assign it to the SageMaker training job

C.Enable SSE-KMS on the S3 bucket

D.Add a bucket policy allowing s3:GetObject for all principals

AnswerB

SageMaker needs a role with permissions to read S3 data.

Why this answer

Option B is correct because SageMaker needs an IAM role with S3 permissions and the role must be passed to the training job. Option C is wrong because KMS is not used with SSE-S3. Option D is wrong because a bucket policy alone is insufficient without the SageMaker role.

Option A is wrong because the VPC is not related to encryption.

30

Multi-Selectmedium

A company is deploying a machine learning model for real-time fraud detection using Amazon SageMaker. The model must have a p99 inference latency under 50ms. Which TWO actions should the ML team take to meet the latency requirement?

Select 2 answers

A.Use a multi-model endpoint to reduce cold starts.

B.Use SageMaker Neo to compile and optimize the model for the target instance type.

C.Use SageMaker Batch Transform for near-real-time inference.

D.Configure automatic scaling to add instances based on CPU utilization.

E.Select a GPU instance type such as ml.g4dn.xlarge.

AnswersB, E

Neo optimizes the model to run faster on specific hardware.

Why this answer

Using SageMaker Neo (B) optimizes the model for the target hardware, reducing latency. Using GPU instances (E) can accelerate inference for compute-intensive models. SageMaker Batch Transform (C) is for offline inference.

Automatic scaling (D) handles throughput, not latency. A multi-model endpoint (A) can help with many models but not single-model latency.

31

Multi-Selectmedium

A company is using Amazon SageMaker to train a model. The training data includes sensitive personally identifiable information (PII). The company needs to ensure that the training data is protected and that the trained model does not inadvertently expose PII. Which TWO actions should the company take? (Choose TWO.)

Select 2 answers

A.Encrypt the training data in S3 using AWS KMS

B.Use server-side encryption with S3-managed keys

C.Use SageMaker's data processing to redact PII before training

D.Enable AWS CloudTrail to log all access to the data

E.Grant public read access to the training data for faster access

AnswersA, C

Encryption protects data at rest.

Why this answer

Options A and C are correct. A: Encrypt the data at rest using KMS. C: Use SageMaker's data processing to redact PII before training.

Option E (grant public access) is wrong. Option D (CloudTrail) is for auditing, not protection. Option B (server-side encryption with S3-managed keys) is less secure than KMS but is still encryption; however, the question asks for TWO actions, and A is the best encryption choice, while C directly addresses PII exposure.

32

MCQmedium

A company is deploying a machine learning model using AWS Lambda for real-time inference. The model is a large ensemble model that takes approximately 500 MB of memory. The Lambda function is configured with 1024 MB of memory and a timeout of 15 seconds. The company observes that the function frequently times out during inference. The company wants to keep using Lambda for its serverless benefits. Which solution should the company implement to reduce inference time?

A.Increase the Lambda function memory to 3008 MB to provide more CPU resources.

B.Deploy the model on Amazon SageMaker hosting instead of Lambda.

C.Use AWS Step Functions to invoke the Lambda function asynchronously.

D.Use Amazon ElastiCache to cache model predictions and reduce computation.

AnswerA

Correct: More memory allocates more CPU, reducing inference time.

Why this answer

Lambda has a maximum memory of 10,240 MB and a maximum timeout of 15 minutes. Increasing memory to 3008 MB gives more CPU power and reduces inference time. Option A is correct.

Option D (ElastiCache) adds latency and cost. Option C (Step Functions) adds orchestration overhead. Option B (SageMaker) moves away from serverless, which the company wants to keep.

33

MCQeasy

A data scientist is training a neural network on a GPU instance in Amazon SageMaker. The training job fails with an 'OutOfMemoryError'. Which action should the data scientist take to resolve this issue?

A.Enable automatic hyperparameter tuning.

B.Switch to distributed training across multiple instances.

C.Use a smaller instance type with less GPU memory.

D.Reduce the batch size in the training script.

AnswerD

Smaller batch size reduces memory footprint.

Why this answer

Option D is correct because reducing the batch size reduces memory usage per iteration. Option C is wrong because a smaller instance type would have less memory. Option A is wrong because hyperparameter tuning does not directly reduce memory.

Option B is wrong because distributed training typically increases memory usage per node.

34

Multi-Selectmedium

A data scientist is using Amazon SageMaker to train a model. The training job uses a custom Docker image stored in Amazon ECR. The training job fails with an error 'CannotPullContainerError'. Which TWO actions should the data scientist take to resolve this issue? (Choose TWO.)

Select 2 answers

A.Use a public Docker image instead of a custom one

B.Confirm that the image tag exists in the ECR repository

C.Verify that the IAM role used for training has permissions to pull from ECR

D.Increase the training job timeout

E.Ensure the training instance has internet access

AnswersB, C

A missing tag causes CannotPullContainerError.

Why this answer

Options B and C are correct because a missing image tag and missing ECR permissions are common causes of this error. Option D (increasing the timeout) is unrelated to pulling the image. Option E (internet access) might be needed but does not directly address the error.

Option A (using a public image) is irrelevant to the error.

35

MCQhard

A data scientist is training a deep learning model on Amazon SageMaker using a custom TensorFlow container. The training job fails with an OutOfMemory error. The instance type is ml.p3.2xlarge with 16 GB GPU memory and 61 GB system memory. The model uses mixed precision training. Which step should the data scientist take to resolve the issue without changing the instance type?

A.Reduce the batch size

B.Use gradient accumulation to simulate a larger batch size

C.Use model parallelism across multiple GPUs

D.Enable automatic mixed precision (AMP)

E.Increase the instance type to ml.p3.8xlarge

AnswerA

Smaller batch size reduces memory usage.

Why this answer

Option A is correct because reducing the batch size lowers memory usage during training. Option B (gradient accumulation) simulates a larger batch size without increasing the memory footprint, but on its own it does not reduce memory usage per step. Option C (model parallelism) is complex and unnecessary when the model fits in memory once the batch size is reduced.

Option D (automatic mixed precision) is already in use. Option E (a larger instance type) increases cost and is ruled out because the instance type must not change.
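
The relationship between batch size and gradient accumulation can be illustrated with a minimal pure-Python sketch. This is not SageMaker or TensorFlow code; the toy one-parameter model and the function names are assumptions made for illustration. It suggests that averaging gradients over small micro-batches reproduces the full-batch gradient, while only one micro-batch has to be held in memory at a time.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of equally sized micro-batches, one at a time."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal micro-batches"
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Only this slice needs to be resident while its gradient is computed.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi])
    return total / steps


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

full = grad_mse(0.5, xs, ys)                                # batch of 8 at once
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)   # 4 steps of 2
print(full, accum)
```

In a real framework the memory saving comes from the smaller micro-batch; accumulation only restores the effective batch size afterwards.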

36

Multi-Selectmedium

Which THREE of the following are valid ways to deploy a model using SageMaker? (Select THREE.)

Select 3 answers

A.Deploy to AWS Lambda

B.Deploy to a SageMaker batch transform job

C.Deploy to a SageMaker asynchronous endpoint

D.Deploy to a SageMaker real-time endpoint

E.Deploy to Amazon EC2 directly

AnswersB, C, D

Batch transform processes large batches of data asynchronously.

Why this answer

Options B, C, and D are correct. SageMaker can deploy to a batch transform job (B), an asynchronous endpoint (C), or a real-time endpoint (D). Option A is wrong because SageMaker does not deploy directly to Lambda.

Option E is wrong because SageMaker does not deploy to EC2 directly without containerization.

37

MCQeasy

A company is using Amazon SageMaker to deploy a model. The model is a large ensemble that requires 8 GB of memory. The company wants to minimize endpoint cost. Which instance type should they choose?

A.ml.c5.xlarge

B.ml.m5.large

C.ml.t2.medium

D.ml.c5.large

E.ml.c5.2xlarge

AnswerA

8 GB memory, minimum required.

Why this answer

Option A is correct because ml.c5.xlarge has 8 GB of memory, sufficient for the model. Option D (ml.c5.large) has 4 GB, not enough. Option E (ml.c5.2xlarge) has 16 GB, over-provisioned.

Option B (ml.m5.large) has 8 GB but is more expensive than c5. Option C (ml.t2.medium) has 4 GB.

38

MCQmedium

Refer to the exhibit. A developer has this IAM policy attached to an IAM role used by SageMaker. When attempting to create an endpoint, the operation fails with an access denied error. What is the MOST likely cause?

A.The policy is missing ecr:DescribeRepositories.

B.The policy is missing s3:ListBucket on the model bucket.

C.The policy is missing sagemaker:DescribeEndpoint.

D.The policy is missing sagemaker:InvokeEndpoint.

AnswerB

SageMaker needs to list the bucket to access model artifacts.

Why this answer

The policy grants permissions to create the model, endpoint configuration, and endpoint. It does not include sagemaker:InvokeEndpoint (Option D), but creating an endpoint does not require InvokeEndpoint, so that is not the cause.

The correct answer is that the policy is missing s3:ListBucket (Option B), because SageMaker needs to list objects in the bucket when accessing the model artifacts. Option A (ecr:DescribeRepositories) is not needed. Option C (sagemaker:DescribeEndpoint) is not required for creation.

39

MCQmedium

A machine learning team is using SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The training job fails with an 'AccessDenied' error. Which IAM permission is MOST likely missing from the SageMaker execution role?

A.s3:GetObject

B.s3:ListBucket

C.kms:Decrypt

D.kms:GenerateDataKey

AnswerC

To read encrypted objects, SageMaker needs kms:Decrypt permission.

Why this answer

SageMaker needs kms:Decrypt permission to read KMS-encrypted data from S3. The s3:GetObject permission is also needed, but an AccessDenied error on encrypted data most often points to missing KMS permissions. s3:ListBucket is for listing objects, not reading them. kms:GenerateDataKey is needed for writing encrypted objects, not for reading them.
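
As a rough illustration of the permissions discussed above, the following sketch builds an IAM policy document as a Python dictionary. The bucket name, key ARN, and helper name are hypothetical placeholders, and a real execution role would likely need further statements (for example, for writing output and logs).

```python
import json


def build_read_encrypted_data_policy(bucket_name, kms_key_arn):
    """Return a policy document allowing reads of KMS-encrypted S3 objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # List the bucket contents.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {   # Read the objects themselves.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {   # Decrypt objects encrypted with the KMS key.
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": kms_key_arn,
            },
        ],
    }


# Placeholder values, not taken from the question.
policy = build_read_encrypted_data_policy(
    "example-training-bucket",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(json.dumps(policy, indent=2))
```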

40

Multi-Selectmedium

A company is deploying a machine learning model using Amazon SageMaker. The model needs to be updated frequently with new data. Which TWO approaches can be used to update the model without downtime? (Choose TWO.)

Select 2 answers

A.Delete the existing endpoint and create a new one with the updated model.

B.Directly update the model artifact in the existing endpoint configuration.

C.Use SageMaker A/B testing to gradually shift traffic to the new model variant.

D.Stop the endpoint, update the model, and restart the endpoint.

E.Use a blue/green deployment by deploying the new model on a separate endpoint and then updating the DNS record.

AnswersC, E

A/B testing with production variants allows traffic shifting without downtime.

Why this answer

Options C and E are correct because A/B testing with production variants allows a gradual rollout, and a blue/green deployment brings up the new model alongside the old one before switching traffic, so there is no downtime. Option A is wrong because deleting and recreating the endpoint causes downtime. Option B is wrong because updating the model artifact while the endpoint is active is not supported without redeployment.

Option D is wrong because stopping the endpoint causes downtime.

41

MCQmedium

A data scientist needs to process a large dataset (100 TB) for training a machine learning model. The data is stored in Amazon S3. Which approach is most cost-effective and efficient for data processing?

A.Use AWS Glue ETL jobs.

B.Use Amazon EMR with Apache Spark.

C.Use Amazon Athena to run SQL queries.

D.Use Amazon SageMaker Processing with a single large instance.

AnswerB

Distributed processing is efficient for large data.

Why this answer

Option B is correct because Amazon EMR with Spark is designed for large-scale data processing and is cost-effective. Option A is wrong because Glue handles ETL but may be slower for data at this scale. Option C is wrong because Athena is for querying, not complex processing.

Option D is wrong because SageMaker processing with a single instance may not handle 100 TB efficiently.

42

MCQeasy

A company wants to use Amazon SageMaker to host a model that was trained using a custom algorithm. The model artifact is stored in Amazon S3. The company wants to ensure that the endpoint can automatically scale based on the number of incoming requests. Which configuration should the company use?

A.Create a SageMaker multi-model endpoint with automatic scaling.

B.Create a SageMaker real-time endpoint and configure automatic scaling using a target tracking policy.

C.Use SageMaker Serverless Inference which scales automatically.

D.Use SageMaker Batch Transform with a scheduled job.

AnswerB

Real-time endpoints with auto-scaling adjust instance count based on load.

Why this answer

SageMaker real-time endpoints support automatic scaling using Application Auto Scaling, which can adjust the number of instances based on metrics like request count. Multi-model endpoints (A) are for serving multiple models. Batch Transform (D) is for offline inference.

Serverless Inference (C) scales automatically but has limitations; the question asks for endpoint scaling, and a real-time endpoint with auto scaling is the standard approach.
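
A target tracking policy of the kind described here could be expressed roughly as follows. The sketch only builds the request dictionaries and makes no AWS calls; the endpoint name, variant name, capacity limits, target value, and cooldowns are illustrative assumptions rather than values from the question.

```python
def build_scaling_requests(endpoint_name, variant_name,
                           min_instances, max_instances,
                           invocations_per_instance):
    """Build Application Auto Scaling requests for one endpoint variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # Registers the variant's instance count as something that can be scaled.
    scalable_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }

    # Keeps invocations per instance near the target value.
    scaling_policy = {
        "PolicyName": f"{endpoint_name}-target-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds, assumed value
            "ScaleOutCooldown": 60,   # seconds, assumed value
        },
    }
    return scalable_target, scaling_policy


target, policy = build_scaling_requests("my-endpoint", "AllTraffic", 1, 4, 70)
print(target["ResourceId"])
```

With boto3, these dictionaries would typically be passed to the `application-autoscaling` client as `register_scalable_target(**target)` and `put_scaling_policy(**policy)`.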

43

MCQeasy

A data scientist needs to perform hyperparameter optimization for a gradient boosting model. Which built-in Amazon SageMaker feature should they use?

A.Amazon SageMaker Automatic Model Tuning

B.Amazon SageMaker Clarify

C.Amazon SageMaker Debugger

D.Amazon SageMaker Neo

AnswerA

Performs hyperparameter optimization.

Why this answer

Option A is correct because SageMaker Automatic Model Tuning performs hyperparameter optimization. Option B is wrong because SageMaker Clarify is for bias detection. Option C is wrong because SageMaker Debugger is for monitoring training.

Option D is wrong because SageMaker Neo is for model compilation.

44

Multi-Selectmedium

A data scientist is using SageMaker to build a model for fraud detection. The dataset is highly imbalanced. Which THREE techniques should be applied to address class imbalance?

Select 3 answers

A.Train the model only on the majority class.

B.Use accuracy as the evaluation metric.

C.Apply SMOTE to generate synthetic samples of the minority class.

D.Use class weights in the loss function.

E.Undersample the majority class.

AnswersC, D, E

SMOTE creates synthetic examples to balance classes.

Why this answer

Options C, D, and E are correct. Oversampling the minority class (e.g., SMOTE) and undersampling the majority class help balance the dataset. Using class weights in the loss function penalizes misclassifications of the minority class more heavily.

Option A is incorrect because training only on the majority class would ignore the minority class entirely. Option B is incorrect because accuracy can be misleading for imbalanced datasets; precision-recall or AUC is better.
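
To make the class-weight idea and the accuracy pitfall concrete, here is a small pure-Python sketch. The inverse-frequency formula used (total samples divided by number of classes times class count) is one common convention, chosen here as an assumption; the 990/10 split is invented for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical fraud dataset: 990 legitimate (0) and 10 fraudulent (1) records.
labels = [0] * 990 + [1] * 10

weights = balanced_class_weights(labels)

# A model that always predicts the majority class still scores high accuracy
# while catching no fraud at all.
majority_accuracy = labels.count(0) / len(labels)

print(weights, majority_accuracy)
```

The minority class receives a much larger weight, so each misclassified fraud record contributes far more to a weighted loss than a misclassified legitimate one.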

45

MCQmedium

A data scientist is training a deep learning model on Amazon SageMaker using the built-in Object Detection algorithm. The training job is failing with a 'ResourceLimitExceeded' error when trying to launch multiple GPU instances. Which of the following is the MOST likely cause?

A.The training script has a syntax error.

B.The dataset is too large for the selected instance type.

C.The account has reached the limit for the number of GPU instances in the current AWS Region.

D.The S3 bucket containing the training data has insufficient permissions.

AnswerC

ResourceLimitExceeded indicates service limit reached; contact AWS to increase limits.

Why this answer

Option C is correct because the 'ResourceLimitExceeded' error typically indicates that the requested instance type or count exceeds the account's service limit for SageMaker training instances. Option B is wrong because SageMaker does not enforce a limit on the total dataset size. Option D is wrong because S3 bucket permissions would cause a different error (e.g., AccessDenied).

Option A is wrong because a faulty training script would fail inside the training container after launch, not at the point of acquiring instances.

46

MCQmedium

A data scientist creates a model resource in SageMaker using the JSON configuration in the exhibit. When creating an endpoint, the deployment fails with an error 'ModelError: Cannot find inference code'. What is the MOST likely cause?

A.The model.tar.gz file is missing the model weights

B.The ECR image does not exist

C.The inference container environment does not specify SAGEMAKER_PROGRAM

D.The training container does not have the SAGEMAKER_PROGRAM variable

AnswerC

The inference container needs the SAGEMAKER_PROGRAM variable to point to the inference script.

Why this answer

Option C is correct because the inference container's Environment is empty; it should define SAGEMAKER_PROGRAM for the inference script. Option A is unlikely, since missing weights would not produce a 'Cannot find inference code' error. Option B would cause a different error.

Option D is wrong because the training container's variable is not required for inference.
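
The exhibit is not reproduced here, so the following is only a hypothetical sketch of what a corrected container definition might look like. The image URI, S3 path, script name, and helper names are placeholders; the check simply flags a container whose Environment lacks SAGEMAKER_PROGRAM.

```python
def build_primary_container(image_uri, model_data_url, entry_point):
    """Build a container definition that names the inference script."""
    return {
        "Image": image_uri,
        "ModelDataUrl": model_data_url,
        "Environment": {
            # Tells the framework container which script to run for inference.
            "SAGEMAKER_PROGRAM": entry_point,
        },
    }


def missing_inference_settings(container):
    """Return the required environment variables that are absent or empty."""
    env = container.get("Environment") or {}
    return [name for name in ("SAGEMAKER_PROGRAM",) if not env.get(name)]


# Placeholder values for illustration only.
broken = {
    "Image": "example-image-uri",
    "ModelDataUrl": "s3://example-bucket/model.tar.gz",
    "Environment": {},
}
fixed = build_primary_container(
    "example-image-uri", "s3://example-bucket/model.tar.gz", "inference.py"
)

print(missing_inference_settings(broken))
print(missing_inference_settings(fixed))
```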

47

MCQmedium

A data scientist is deploying a PyTorch model to Amazon SageMaker for real-time inference. The model runs on a large instance but inference latency is too high. Which action is MOST likely to reduce latency without sacrificing accuracy?

A.Compile the model using SageMaker Neo

B.Switch from a GPU instance to a CPU instance

C.Quantize the model weights from FP32 to INT8

D.Deploy the model to a multi-model endpoint

AnswerA

Neo optimizes the model for the target hardware, reducing latency without retraining or accuracy loss.

Why this answer

Option A is correct because SageMaker Neo optimizes trained models for the target hardware, reducing latency without retraining. Option B changes the instance type but does not optimize the model. Option C (quantization) may reduce latency but could affect accuracy.

Option D changes endpoint type, not latency.

48

MCQeasy

A company uses SageMaker to host a real-time inference endpoint. The endpoint is receiving a large number of requests, but the latency is higher than expected. The data scientist observes that the CPU utilization is low but memory utilization is high. Which action should be taken to reduce latency?

A.Switch to an instance type with more memory or optimize the model to reduce memory footprint.

B.Enable VPC traffic mirroring to diagnose network issues.

C.Use an instance type with more vCPUs.

D.Increase the number of instances in the endpoint.

AnswerA

Addresses memory bottleneck.

Why this answer

Option A is correct because high memory utilization suggests the model is memory-bound; increasing instance memory or using a model with a lower memory footprint can reduce latency. Option B is wrong because the issue is memory, not network. Option C is wrong because CPU utilization is low, so more CPU cores won't help.

Option D is wrong because increasing the instance count can help throughput but not necessarily latency per request; it may also increase cost.

49

MCQeasy

A company is using Amazon SageMaker to train a model and wants to track hyperparameter tuning jobs. Which AWS service is BEST suited to store and query metadata such as tuning job configurations and results?

A.Amazon CloudWatch Logs

B.Amazon S3 with Amazon Athena

C.Amazon SageMaker Experiments

D.Amazon DynamoDB

AnswerC

SageMaker Experiments is the native solution for tracking tuning jobs and their results.

Why this answer

Amazon SageMaker Experiments is the native service to track experiments, trials, and trial components. SageMaker automatically logs hyperparameters and metrics. Amazon DynamoDB could be used but requires custom integration.

CloudWatch Logs stores logs, not structured metadata. Athena queries data in S3 but is not optimized for real-time tracking. S3 is storage, not a queryable metadata store.

50

MCQeasy

A company is using Amazon SageMaker to train a model and wants to automatically retrain the model every week using new data. Which AWS service should be used to orchestrate the retraining pipeline?

A.Amazon CloudWatch Events

B.AWS Lambda

C.AWS Step Functions

D.AWS Data Pipeline

AnswerC

Step Functions can orchestrate multiple SageMaker API calls and handle retries.

Why this answer

AWS Step Functions is the recommended service for orchestrating complex workflows, including SageMaker training jobs, model creation, and deployment. Lambda is for short functions. CloudWatch Events can trigger on a schedule but not orchestrate a pipeline.

Data Pipeline is older and less flexible. S3 Events trigger on object creation but not scheduling.

51

Multi-Selectmedium

A company is deploying a machine learning model using Amazon SageMaker. The model needs to be updated frequently. Which THREE practices should the company implement for model versioning and deployment?

Select 3 answers

A.Use AWS CodePipeline to automate the training and deployment pipeline.

B.Use the SageMaker Model Registry to catalog model versions.

C.Manually update the endpoint configuration each time.

D.Store all training datasets in a single S3 bucket without versioning.

E.Deploy new model versions using canary deployments with SageMaker endpoints.

AnswersA, B, E

CodePipeline automates CI/CD.

Why this answer

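Option A is correct because CodePipeline automates the training and deployment pipeline. Option B is correct because the SageMaker Model Registry catalogs model versions. Option E is correct because canary deployments roll out new versions gradually.

Option C is wrong because manual endpoint updates are error-prone and do not scale with frequent releases. Option D is wrong because unversioned datasets make it hard to trace which data produced a given model version.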

52

MCQmedium

A data scientist is deploying a machine learning model using SageMaker and wants to automate the retraining pipeline. The training data is updated daily in an S3 bucket. Which combination of AWS services should the data scientist use to trigger a new training job when new data arrives?

A.Amazon SQS queue to store S3 events and a cron job to poll and start training

B.Use SageMaker Pipelines with a schedule to check for new data every hour

C.Amazon S3 event notification to directly start a SageMaker training job

D.Amazon CloudWatch Events to run an AWS Step Functions state machine that starts a SageMaker training job

E.Amazon CloudWatch Events to invoke an AWS Lambda function that starts a SageMaker training job

AnswerE

CloudWatch Events can capture S3 events and invoke Lambda to start training.

Why this answer

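Option E is correct because CloudWatch Events can capture S3 events and invoke a Lambda function, which starts a SageMaker training job. Option A relies on a cron job polling a queue, which adds delay and extra infrastructure. Option B checks on a schedule instead of reacting when data arrives, and SageMaker Pipelines requires additional setup.

Option C is wrong because an S3 event notification cannot start a training job directly; it needs an intermediary such as Lambda. Option D uses Step Functions unnecessarily for a single API call.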

53

MCQeasy

A data scientist is deploying a model using Amazon SageMaker for real-time inference. The model is memory-intensive and requires a GPU. Which instance type should be selected for the endpoint?

A.i3.2xlarge

B.c5.2xlarge

C.r5.2xlarge

D.p3.2xlarge

AnswerD

GPU instance suitable for memory-intensive models.

Why this answer

The p3.2xlarge instance is correct because it provides a GPU (NVIDIA Tesla V100) with high memory bandwidth, which is essential for memory-intensive deep learning models requiring GPU acceleration for real-time inference. SageMaker endpoints for GPU-based models must use instance types from the P or G families, as CPU-only instances like i3, c5, or r5 lack the parallel processing capabilities needed for efficient GPU inference.

Exam trap

The exam often tests the distinction between CPU-only instance families (c5, r5, i3) and GPU-accelerated families (p3, g4dn), where candidates mistakenly assume that high RAM (r5) or high compute (c5) can substitute for a GPU, ignoring the fundamental hardware requirement for GPU-based inference.

How to eliminate wrong answers

Option A (i3.2xlarge) is wrong because it is a storage-optimized instance with NVMe SSD storage, designed for high I/O workloads, not for GPU-accelerated inference. Option B (c5.2xlarge) is wrong because it is a compute-optimized instance with only CPUs, lacking a GPU, which is explicitly required for the memory-intensive model. Option C (r5.2xlarge) is wrong because it is a memory-optimized instance with high RAM but no GPU, making it unsuitable for GPU-dependent inference tasks.

54

Multi-Selectmedium

A company uses SageMaker to train a model. The training job is taking too long and the data scientist wants to speed it up. Which THREE strategies should the data scientist consider? (Select THREE.)

Select 3 answers

A.Reduce the number of training epochs

B.Use a GPU instance type like ml.p3.2xlarge

C.Use distributed training with multiple instances

D.Use Pipe input mode to stream data from S3

E.Increase the batch size in the training script

AnswersB, C, D

GPUs accelerate training for deep learning.

Why this answer

Options B, C, and D are correct. B: Using a GPU instance accelerates compute. C: Distributed training reduces time.

D: Using Pipe mode reduces I/O time. A (reducing epochs) may reduce accuracy. E (increasing batch size) can help but may cause memory issues.

Practice this question →

55

Multi-Selecthard

A data science team is training a large deep learning model using Amazon SageMaker. The training job is taking a long time because the model has many layers and the dataset is large. The team wants to reduce training time by distributing the training across multiple GPUs on a single instance, as well as across multiple instances. Which TWO actions should the team take? (Choose two.)

Select 2 answers

A.Use SageMaker's distributed data parallelism (SMDDP) library to shard the model across GPUs.

B.Configure the training job to use SageMaker's model parallelism (SMP) library for pipeline or tensor parallelism.

C.Use SageMaker's managed training with a single instance containing multiple GPUs and enable data parallelism.

D.Use Horovod for data parallelism across multiple instances.

E.Set the instance type to a single GPU instance and rely on automatic model parallelism.

AnswersB, D

SMP allows splitting the model across multiple GPUs and instances, reducing memory footprint per GPU and enabling training of large models that would otherwise not fit. This complements data parallelism.

Why this answer

Option B is correct because the SageMaker model parallelism (SMP) library is specifically designed to split large deep learning models across multiple GPUs using pipeline or tensor parallelism. This allows the team to train models that are too large to fit on a single GPU and to reduce training time by parallelizing computation across devices within and across instances.

Exam trap

The trap here is that candidates often confuse data parallelism (which shards data) with model parallelism (which shards the model), and assume that simply using multiple GPUs on a single instance automatically distributes the model, when in fact explicit model parallelism libraries like SMP are required for large models that do not fit in GPU memory.

56

MCQmedium

A data scientist is using SageMaker Debugger to monitor a training job. The training loss is not decreasing as expected. Which Debugger feature can help identify the issue?

A.Automatic hyperparameter tuning

B.Saving tensors every step

C.Deploying a model endpoint for real-time monitoring

D.Built-in rules to detect training anomalies

AnswerD

Rules like vanishing gradient can pinpoint issues.

Why this answer

Option D is correct because Debugger's built-in rules can detect vanishing gradients, overfitting, and similar anomalies. Option A is incorrect because Debugger does not automatically tune hyperparameters. Option B is incorrect because saving tensors does not identify the issue automatically.

Option C is incorrect because Debugger is not for model deployment.

57

Multi-Selecteasy

A machine learning team is using Amazon SageMaker to train a model. The training job uses spot instances to reduce cost. However, the training job is frequently interrupted. Which TWO actions can help mitigate the impact of spot interruptions? (Choose TWO.)

Select 2 answers

A.Increase the number of training instances.

B.Use a larger instance type that is less likely to be interrupted.

C.Use managed spot training with SageMaker's 'ManagedSpotTraining' parameter set to True.

D.Enable checkpointing to save intermediate results to Amazon S3.

E.Switch to on-demand instances.

AnswersC, D

Managed spot training handles interruptions.

Why this answer

Options C and D are correct. Option C: Managed spot training automatically handles interruptions. Option D: Checkpoints allow the job to resume from the last saved state.

Option A is wrong because more instances increase cost and interruption risk. Option B is wrong because larger instances are more expensive and are not guaranteed to be interrupted less often. Option E is wrong because on-demand instances cost more.
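
The two correct options are usually configured together. The sketch below builds the relevant fragment of a training job request as a plain dictionary, without calling AWS; the S3 URI and time limits are assumed values, and the helper name is hypothetical. The maximum wait time has to cover the maximum runtime, so the helper checks that first.

```python
def build_spot_training_settings(checkpoint_s3_uri,
                                 max_runtime_seconds,
                                 max_wait_seconds):
    """Build the spot and checkpoint fields of a training job request."""
    if max_wait_seconds < max_runtime_seconds:
        raise ValueError("max wait time must be at least the max runtime")
    return {
        # Run on spot capacity and let SageMaker handle interruptions.
        "EnableManagedSpotTraining": True,
        # Files written to LocalPath are synced to S3, so a restarted job
        # can reload the last checkpoint instead of starting over.
        "CheckpointConfig": {
            "S3Uri": checkpoint_s3_uri,
            "LocalPath": "/opt/ml/checkpoints",
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
    }


# Placeholder bucket and limits for illustration.
settings = build_spot_training_settings(
    "s3://example-bucket/checkpoints/", 3600, 7200
)
print(settings["EnableManagedSpotTraining"])
```

The training script itself still has to write checkpoints to the local path and load the latest one on startup; the configuration alone does not make a job resumable.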

58

MCQmedium

A company deploys a machine learning model on Amazon SageMaker for real-time inference. The model receives requests with large payloads (up to 5 MB) and the inference latency is high. Which configuration change would MOST likely reduce latency?

A.Pre-load multiple model containers on the same endpoint

B.Reduce the batch size for inference requests

C.Use a larger instance type with more memory and compute

D.Enable payload compression using SageMaker built-in compression

AnswerC

Larger instances can process large payloads faster.

Why this answer

Option C is correct because a larger instance type with more memory and compute can process large payloads faster. Option A is wrong because containers are loaded automatically; pre-loading is not an option. Option B is wrong because reducing the batch size does not lower the cost of handling each large request.

Option D is wrong because SageMaker does not support payload compression natively.

59

Multi-Selecthard

A company is using SageMaker to train a model and wants to ensure that the training data is encrypted at rest and in transit, and that the trained model artifacts are also encrypted. Which THREE actions should the company take?

Select 3 answers

A.Specify a KMS key in the SageMaker training job configuration to encrypt the ML storage volume

B.Enable SageMaker model encryption using a KMS key

C.Configure the training job to run in a VPC with no internet access

D.Enable AWS CloudTrail to log all API calls

E.Enable S3 server-side encryption (SSE-KMS) on the training data bucket

AnswersA, B, E

Encrypts the training instance's storage volume.

Why this answer

Options A, B, and E are correct. A: Use a KMS key for the SageMaker training volumes. B: Enable SageMaker model encryption.

E: Enable S3 encryption for the data bucket. Option C (VPC) is for network isolation, not encryption. Option D (CloudTrail) is for auditing, not encryption.

60

MCQmedium

A company is using Amazon SageMaker to train an XGBoost model on a large dataset. The training job is taking a long time. The data scientist wants to reduce training time without sacrificing model accuracy. The dataset is 100 GB in CSV format stored in S3. What is the most effective approach?

A.Reduce the number of instances to avoid communication overhead.

B.Use Pipe mode to stream data from S3 instead of downloading it first.

C.Use random sampling to reduce the dataset size to 10 GB.

D.Use SageMaker Managed Spot Training to reduce cost, but training time may increase due to interruptions.

AnswerB

Pipe mode reduces I/O time by streaming data directly to the algorithm.

Why this answer

Option B is correct because SageMaker's Pipe mode streams data directly from S3 to the training algorithm without writing it to disk, eliminating the I/O bottleneck of downloading the full 100 GB dataset. This reduces training time significantly by overlapping data loading with computation, while preserving model accuracy since the entire dataset is still used.

Exam trap

The trap here is that candidates often confuse cost optimization (Spot Training) with performance optimization, or incorrectly assume that reducing instances or data size is the only way to speed up training, ignoring SageMaker's specialized data streaming capability.

How to eliminate wrong answers

Option A is wrong because reducing the number of instances increases per-instance data load and can increase training time due to less parallelism, and communication overhead is negligible compared to I/O for large datasets. Option C is wrong because random sampling reduces dataset size, which sacrifices model accuracy by discarding potentially important data patterns, and the goal is to reduce time without sacrificing accuracy. Option D is wrong because SageMaker Managed Spot Training reduces cost, not training time; interruptions can actually increase training time due to checkpoint restarts, making it ineffective for the stated goal.

61

MCQmedium

A media company uses SageMaker to deploy a real-time inference endpoint for content recommendation. The model is a PyTorch model that uses GPU. The endpoint is deployed with an ml.p3.2xlarge instance. Over time, the endpoint's latency increases significantly during peak hours. The company has enabled auto scaling based on CPU utilization. However, the latency spikes occur even when CPU utilization is low. The model is stateless and the inference code is efficient. What is the MOST likely cause of the latency spikes?

A.The model uses stateful processing that accumulates requests

B.Auto scaling is configured based on CPU utilization, but the bottleneck is GPU utilization

C.The inference container has a memory leak that causes gradual slowdown

D.The instance type is too small for the model

AnswerB

GPU metrics should be used for auto scaling.

Why this answer

Option B is correct because GPU utilization is the right scaling metric for GPU-based models; auto scaling on CPU utilization does not capture GPU load. Option A is wrong because the model is stateless.

Option C is wrong because there is no mention of memory issues. Option D is wrong because the spikes appear only during peak hours, which points to a scaling problem rather than an undersized instance.

62

MCQeasy

A team needs to automatically retrain a model every week using new data. Which SageMaker feature is designed to schedule and automate this workflow?

A.SageMaker Pipelines

B.SageMaker Automatic Model Tuning

C.SageMaker Model Monitor

D.SageMaker Data Wrangler

AnswerA

Pipelines can define and schedule training workflows.

Why this answer

SageMaker Pipelines allows building end-to-end ML workflows with scheduling, so Option A is correct. Option B (Automatic Model Tuning) is for hyperparameter optimization.

Option C (Model Monitor) is for monitoring deployed models. Option D (Data Wrangler) is for data preparation and feature engineering.

63

MCQhard

A company is using Amazon SageMaker to train a deep learning model for image classification. The training job is using a single p3.2xlarge instance and takes 10 hours. The data scientist wants to reduce training time using distributed training. Which SageMaker feature should be used?

A.Use the SageMaker distributed data parallelism library with multiple p3.2xlarge instances.

B.Use SageMaker Managed Spot Training to reduce cost, but training time remains the same.

C.Use SageMaker Hyperparameter Tuning to find optimal hyperparameters faster.

D.Use the SageMaker distributed model parallelism library with a single p3dn.24xlarge instance.

AnswerA

Data parallelism divides the batch across GPUs and synchronizes gradients, scaling training.

Why this answer

Option A is correct because SageMaker's distributed data parallelism automatically splits batches across GPUs and synchronizes gradients, reducing training time. Option B is wrong because Managed Spot Training saves cost but does not reduce training time. Option C is wrong because Hyperparameter Tuning does not distribute training.

Option D is wrong because model parallelism is for models too large for a single GPU.

64

MCQhard

A company is running a real-time inference endpoint on Amazon SageMaker. The endpoint is using an ml.c5.xlarge instance. Over the past month, the CPU utilization has been consistently below 10%, and the latency is well within requirements. The company wants to reduce costs. What should they do?

A.Use a smaller instance type

B.Set up a scaling policy to scale down to zero

C.Switch to a multi-model endpoint

D.Use a batch transform job instead

E.Move to a serverless inference endpoint

AnswerA

A smaller instance can reduce cost while meeting performance.

Why this answer

Option A is correct because a smaller instance type (e.g., ml.c5.large) can handle the load at lower cost. Option B (scaling policy) is not needed when utilization is consistently low. Option C (multi-model endpoint) can reduce cost by sharing an instance among models, but offers no benefit with only one model.

Option D (batch transform) is for offline inference. Option E (serverless) may have cold starts and suits sporadic traffic, not constant low utilization.

65

MCQ · hard

A company is using Amazon SageMaker Ground Truth to create labeled datasets for a text classification task. The labeling job uses a private workforce of 10 annotators. After labeling 10,000 items, the quality of labels is inconsistent. Which approach will MOST effectively improve labeling consistency?

A. Remove annotations from annotators with low agreement after the job completes.

B. Increase the number of annotators to 20 to average out inconsistencies.

C. Configure the labeling job to use annotation consolidation with majority voting and require multiple annotations per item.

D. Use active learning to automatically label the most confident samples and only send uncertain ones to annotators.

Answer: C

Consensus from multiple annotators and majority voting yields more consistent labels.

Why this answer

Option C is correct because requiring multiple annotations per item and consolidating them by majority voting reduces the influence of any single annotator's bias. Option B is wrong because increasing workforce size does not directly improve consistency. Option A is wrong because removing outliers after labeling may discard valid data.

Option D is wrong because active learning selects examples for labeling, but does not improve annotator consistency.
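The majority-voting idea can be sketched in a few lines of plain Python. This is an illustrative stand-in, not Ground Truth's own consolidation logic, which may weight annotators differently; the function name and sample labels are assumptions.

```python
from collections import Counter
from typing import Dict, List


def consolidate_by_majority(annotations: Dict[str, List[str]]) -> Dict[str, str]:
    """Return the most frequent label per item; ties go to the label seen first."""
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in annotations.items()
    }


# Three annotators label each item (made-up data).
raw = {
    "doc-1": ["spam", "ham", "spam"],
    "doc-2": ["ham", "ham", "spam"],
}
consolidated = consolidate_by_majority(raw)
```

With an odd number of annotators per item and two classes, a single dissenting annotator can never change the consolidated label.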

66

MCQ · easy

A data scientist is deploying a model using Amazon SageMaker. The model endpoint needs to handle real-time inference requests with low latency. The model is a large ensemble of 10 deep learning models, each approximately 500 MB. What is the most cost-effective deployment strategy that meets the low-latency requirement?

A. Deploy each model to a separate endpoint and use a load balancer.

B. Use a single endpoint with multiple instances behind it.

C. Use a SageMaker batch transform job to process inference requests in batches.

D. Use a SageMaker multi-model endpoint to host all models on one or more instances.

Answer: D

Multi-model endpoints efficiently host multiple models on shared instances, reducing cost.

Why this answer

A SageMaker multi-model endpoint (MME) hosts multiple models on one or a few instances, loading them dynamically from Amazon S3 into memory as needed. This is the most cost-effective option for an ensemble of ten 500 MB models because it avoids the expense of separate endpoints or multiple instances per model, while still supporting low-latency real-time inference by keeping frequently used models cached.

Exam trap

The trap here is that candidates may confuse multi-model endpoints with multi-container endpoints or assume that a single endpoint cannot host multiple models, leading them to choose the expensive separate-endpoint approach (Option A) or the memory-inefficient single-endpoint approach (Option B).

How to eliminate wrong answers

Option A is wrong because deploying each model to a separate endpoint and using a load balancer would incur high costs (10 endpoints × instance costs) and add network latency from the load balancer, making it neither cost-effective nor optimal for low latency. Option B is wrong because a single endpoint with multiple instances behind it would require all 10 models to be loaded on every instance, consuming excessive memory (5 GB per instance) and increasing cost without leveraging model-sharing efficiencies. Option C is wrong because SageMaker batch transform is designed for asynchronous, offline inference on large datasets, not for real-time requests, and would introduce unacceptable latency for live inference.
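To make the multi-model endpoint mechanics more concrete, the sketch below builds the arguments for a SageMaker runtime `invoke_endpoint` call, where `TargetModel` selects which artifact to load and run. The endpoint name, artifact name, and payload are hypothetical, and the code only constructs the request rather than calling AWS.

```python
import json
from typing import Any, Dict


def build_mme_request(endpoint_name: str, target_model: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for the SageMaker runtime InvokeEndpoint call."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,  # artifact path relative to the endpoint's S3 prefix
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


# Hypothetical endpoint and artifact names, for illustration only.
request = build_mme_request("ensemble-endpoint", "model-03.tar.gz", {"features": [1.0, 2.0]})
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("sagemaker-runtime").invoke_endpoint(**request)
```

The first request for a given `TargetModel` pays the cost of loading it from S3; later requests are served from the cached copy while it stays in memory.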

67

Multi-Select · medium

A data scientist is training a model using SageMaker and wants to use spot instances to reduce costs. Which THREE considerations should the scientist evaluate? (Choose THREE.)

Select 3 answers

A. Spot instances have a fixed, lower price than on-demand.

B. The training job must support checkpointing to save progress.

C. Spot instances are only available for inference, not training.

D. The training algorithm must be fault-tolerant to handle interruptions.

E. Spot instances can be reclaimed with a two-minute notice.

Answers: B, D, E

Needed to resume after interruption.

Why this answer

Option B is correct because spot instances can be interrupted, so the training job must checkpoint its progress in order to resume. Option E is correct because spot capacity can be reclaimed with a two-minute notice, and frequent interruptions erode the cost savings. Option D is correct because the training algorithm must tolerate interruptions and continue from the last checkpoint.

Option A is wrong because spot instances are dynamically priced, not fixed. Option C is wrong because spot instances are available for training, not just inference.
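The checkpoint-and-resume pattern can be sketched with a toy training loop. The `train` function and its `interrupt_at` parameter are hypothetical; on SageMaker the directory would typically be the local checkpoint path that the service syncs to S3, and a temporary directory stands in for it here.

```python
import json
import os
import tempfile
from typing import Optional


def train(checkpoint_dir: str, total_epochs: int, interrupt_at: Optional[int] = None) -> int:
    """Run a toy training loop that resumes from the last saved epoch.

    Returns the number of epochs completed so far.
    """
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    start_epoch = 0
    if os.path.exists(path):
        with open(path) as f:
            start_epoch = json.load(f)["epoch"]
    for epoch in range(start_epoch, total_epochs):
        if interrupt_at is not None and epoch == interrupt_at:
            return epoch  # simulated spot interruption before this epoch runs
        # ... one epoch of real training would happen here ...
        with open(path, "w") as f:
            json.dump({"epoch": epoch + 1}, f)  # record the count of completed epochs
    return total_epochs


with tempfile.TemporaryDirectory() as ckpt_dir:
    first_run = train(ckpt_dir, total_epochs=5, interrupt_at=3)  # stops after 3 epochs
    resumed = train(ckpt_dir, total_epochs=5)  # continues from epoch 3
```

Without the checkpoint file, the second run would start again from epoch 0 and the interrupted work would be lost.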

68

MCQ · hard

A company uses Amazon SageMaker to deploy a model for real-time predictions. The model is updated weekly. The company wants to ensure that the new model version is gradually rolled out to a small percentage of traffic before full deployment, and that it can be rolled back quickly if issues are detected. Which deployment strategy should be used?

A. Blue/green deployment

B. A/B testing with a holdout group

C. Canary deployment using SageMaker endpoint variants

D. Rolling deployment across multiple endpoints

Answer: C

Canary deployment allows sending a small percentage of traffic to the new variant and can be rolled back by shifting traffic back.

Why this answer

Option C is correct because SageMaker supports canary deployments using endpoint variants, allowing a small percentage of traffic to be directed to the new model version and shifted back if issues appear. Option A is wrong because blue/green deployment switches all traffic at once. Option B is wrong because A/B testing is for comparing models, not gradual rollout.

Option D is wrong because rolling a release across multiple separate endpoints does not provide a controlled, weighted traffic split on a single endpoint.
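How variant weights translate into a traffic split can be sketched as below. The dictionary keys mirror the `VariantName` and `InitialVariantWeight` fields of a SageMaker production variant, while the helper `traffic_shares` and the variant names are assumptions for illustration.

```python
from typing import Any, Dict, List


def traffic_shares(variants: List[Dict[str, Any]]) -> Dict[str, float]:
    """Convert variant weights into the fraction of requests each variant receives."""
    total = sum(float(v["InitialVariantWeight"]) for v in variants)
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {v["VariantName"]: float(v["InitialVariantWeight"]) / total for v in variants}


# Hypothetical variants: the current model keeps most traffic, the canary gets a small share.
variants = [
    {"VariantName": "current", "InitialVariantWeight": 9.0},
    {"VariantName": "canary", "InitialVariantWeight": 1.0},
]
shares = traffic_shares(variants)
```

Rolling back amounts to setting the canary's weight to zero, so all traffic returns to the current variant.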

69

MCQ · easy

A data scientist needs to create a SageMaker notebook instance with access to a private S3 bucket. The bucket uses SSE-KMS encryption. Which additional configuration is required?

A. Add a lifecycle configuration script

B. Modify the bucket policy to allow s3:GetObject

C. Place the notebook instance in a VPC

D. Attach a policy to the notebook's IAM role that allows kms:Decrypt

Answer: D

Needed to decrypt objects encrypted with SSE-KMS.

Why this answer

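Option D is correct because the notebook instance's IAM role needs permission to use the KMS key to decrypt the objects. Option C is wrong because placing the notebook in a VPC is optional and does not grant access to the key. Option A is wrong because a lifecycle configuration does not handle encryption.

Option B is wrong because the bucket policy governs S3 access and does not grant permission to use the KMS key.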
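A minimal sketch of such a policy, built as a Python dictionary and serialized to JSON, might look like the following. The key ARN is a placeholder, and a real role would also need the usual S3 read permissions, which are omitted here.

```python
import json

# Placeholder ARN: substitute the KMS key that encrypts the bucket.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],  # lets the role decrypt SSE-KMS objects on read
            "Resource": KMS_KEY_ARN,
        }
    ],
}

# JSON text suitable for attaching to the notebook's IAM role.
policy_json = json.dumps(policy, indent=2)
```

Scoping `Resource` to the single key, rather than `*`, keeps the grant as narrow as the scenario requires.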

70

MCQ · medium

A data scientist is using SageMaker to train a model using the built-in XGBoost algorithm. The training job fails with the error "AlgorithmError: Framework error: No module named 'xgboost'". What is the most likely cause?

A. The training data is not in CSV format.

B. The training job is using a custom container that does not have XGBoost installed.

C. The IAM role does not have permission to access SageMaker.

D. The S3 output path is incorrect.

Answer: B

Missing module indicates container issue.

Why this answer

Option B is correct because the built-in XGBoost algorithm requires the 'xgboost' Python package; SageMaker's built-in algorithms provide the necessary environment, but if the container is overridden or the wrong image is used, the module may be missing. Option A is wrong because the error is about missing module, not data format. Option C is wrong because the error is not about permissions.

Option D is wrong because the error is not about output path.

71

Multi-Select · medium

A company is using SageMaker Autopilot to automatically build ML models. They want to ensure that the generated models are reproducible. Which TWO settings should they configure?

Select 2 answers

A. Set a random seed.

B. Specify a validation split.

C. Use multiple trials.

D. Enable early stopping.

E. Enable automatic feature engineering.

Answers: A, B

Random seeds make train/test split and model initialization deterministic.

Why this answer

Options A and B are correct. Setting a random seed ensures that random processes (like the train/test split) are deterministic. Disabling automatic data splitting and manually providing a validation split gives control over the split.

Option D is incorrect because early stopping can vary with training dynamics. Option C is incorrect because using multiple trials introduces randomness. Option E is incorrect because automatic feature engineering may introduce non-deterministic transforms.
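The effect of a fixed seed on a data split can be shown with the standard library alone. This is a generic illustration rather than Autopilot's API; the function `seeded_split` and its parameters are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def seeded_split(items: Sequence[int], validation_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle with a private, seeded generator and split into train and validation lists."""
    rng = random.Random(seed)  # private generator, independent of global random state
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]


# The same seed yields the same split on every call.
first = seeded_split(range(10), validation_fraction=0.2, seed=42)
second = seeded_split(range(10), validation_fraction=0.2, seed=42)
```

Supplying a fixed validation set explicitly removes this source of randomness altogether, which is why the two settings complement each other.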

72

MCQ · medium

A company is using Amazon SageMaker to train a model. The training job is taking too long. The data scientist notices that the GPU utilization is low. Which action should be taken to improve training performance?

A. Increase the number of training instances.

B. Use spot instances to reduce cost.

C. Decrease the batch size to reduce memory usage.

D. Increase the batch size in the training script.

Answer: D

Larger batch size keeps GPU busy.

Why this answer

Option D is correct because increasing the batch size can improve GPU utilization by giving the GPU more work per step. Option A is wrong because increasing the number of instances may not help if each instance is underutilized. Option B is wrong because using spot instances can reduce cost but does not improve utilization.

Option C is wrong because decreasing the batch size would reduce utilization further.

73

MCQ · medium

A data science team is using Amazon SageMaker to train a model. The training job is failing with an 'OutOfMemory' error. The team is using a p3.2xlarge instance with 61 GB of memory. They need to resolve this issue as quickly as possible. Which action should they take?

A. Use a larger instance type, such as p3.8xlarge

B. Reduce the batch size in the training script

C. Use a spot instance to save costs

D. Enable distributed training across multiple instances

Answer: A

Larger instance types have more memory and can handle the workload.

Why this answer

Option A is correct because switching to a larger instance type with more memory will immediately resolve the out-of-memory error. Option B is wrong because reducing batch size may help but requires code changes and might still not be enough. Option C is wrong because using spot instances does not affect memory.

Option D is wrong because using distributed training adds complexity and may not resolve memory issues on a single instance.

74

Matching · medium

Match each SageMaker built-in algorithm to its primary use case.

Matches

XGBoost: Gradient boosted trees for regression and classification

BlazingText: Word2Vec and text classification

Object2Vec: Learning embeddings for pairs of objects

IP Insights: Anomaly detection in IP traffic

DeepAR: Time series forecasting

Why these pairings

These are some of the built-in algorithms in SageMaker.

75

MCQ · hard

A machine learning team is using Amazon SageMaker to train a PyTorch model on a dataset that is 500 GB in size. The training job runs on a single ml.p3.2xlarge instance, but the training takes over 48 hours, which exceeds the maximum allowed time. The team wants to reduce training time to under 24 hours. They are open to using multiple instances and have budget for up to 4 instances. The dataset is stored in Amazon S3 and can be split into shards by a key. The model architecture must remain unchanged. What should the team do?

A. Use SageMaker distributed data parallelism with 4 ml.p3.2xlarge instances.

B. Use SageMaker Processing to split the data and train separate models.

C. Change the instance type to ml.p3.16xlarge.

D. Switch to Pipe input mode to stream data faster.

Answer: A

Distributed training can reduce time proportionally with data parallelism.

Why this answer

Option A is correct because SageMaker's distributed data parallelism library (SMDDP) can efficiently split data across multiple GPUs with minimal code changes. Option C is wrong because a larger instance type alone may not halve training time. Option D is wrong because Pipe mode reduces I/O time but not computation time.

Option B is wrong because SageMaker Processing is for preprocessing, not training.
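The data-sharding side of data parallelism can be sketched as follows. This is a plain Python illustration, not SMDDP itself; the function `shard_for_worker` and the object keys are hypothetical.

```python
from typing import List


def shard_for_worker(keys: List[str], rank: int, world_size: int) -> List[str]:
    """Assign every world_size-th key to this worker so shards are disjoint and near-equal."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return keys[rank::world_size]


# Hypothetical S3 object keys for eight data shards, divided among 4 workers.
keys = [f"train/part-{i:04d}.csv" for i in range(8)]
shards = [shard_for_worker(keys, rank, world_size=4) for rank in range(4)]
```

Each worker reads only its own quarter of the data per epoch, which is where the reduction in wall-clock time comes from; the model replicas stay in sync through gradient averaging.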
