Knowledge + Practice

CCNA Machine Learning Implementation and Operations Questions

75 of 351 questions · Page 2/5 · Machine Learning Implementation and Operations · Answers revealed


76

Multi-Select · hard

You are building a CI/CD pipeline for SageMaker using AWS CodePipeline. Which THREE components are essential for a fully automated model training and deployment pipeline?

Select 3 answers

A. AWS CodeCommit to store the training script and model code

B. AWS CodeBuild to run the training job as a build step

C. AWS Lambda function to create or update the SageMaker endpoint

D. AWS CodeDeploy to deploy the model to an endpoint

E. AWS CloudFormation to define the infrastructure

Answers: A, B, C

Source control is essential for CI/CD.

Why this answer

Options A, B, and C are correct. Option A: CodeCommit stores the training script and model code under source control. Option B: CodeBuild can run the training job as a build step.

Option C: A Lambda function can create or update the SageMaker endpoint. Option D is wrong because CodeDeploy does not deploy models to SageMaker endpoints. Option E is wrong because CloudFormation is optional.


77

MCQ · hard

A machine learning engineer is deploying a model using SageMaker and wants to use automatic scaling for the endpoint based on the number of concurrent requests. The engineer has defined a scaling policy using the SageMakerVariantInvocationsPerInstance metric. However, the scaling is not triggering as expected. What could be the issue?

A. A scheduled scaling action must be created first.

B. The scaling policy does not have a cooldown period configured, or the cooldown period is too long.

C. The metric must be published to CloudWatch manually.

D. The metric is not available for automatic scaling.

Answer: B

Cooldown prevents scaling actions from triggering too frequently.

Why this answer

Option B is correct because scaling policies use a cooldown period (default 300 seconds) to prevent rapid scaling; if it is missing or too long, the policy may not scale when expected. Option D is wrong because the metric is valid for automatic scaling.

Option C is wrong because SageMaker emits the metric by default. Option A is wrong because a scaling policy can be defined without a scheduled action.
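For reference, a target-tracking policy on this metric with explicit cooldowns might look like the sketch below. The endpoint name, variant name, target value, and cooldown values are assumptions chosen for illustration, and the function only builds the request dictionary; nothing is sent to AWS. The keys follow the Application Auto Scaling `put_scaling_policy` request.

```python
def build_scaling_policy(endpoint_name, variant_name, target_invocations,
                         scale_in_cooldown=300, scale_out_cooldown=60):
    """Build keyword arguments for Application Auto Scaling put_scaling_policy."""
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-invocations",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            # Cooldowns are in seconds; the values here are illustrative.
            "ScaleInCooldown": scale_in_cooldown,
            "ScaleOutCooldown": scale_out_cooldown,
        },
    }


# Hypothetical endpoint and variant names.
policy = build_scaling_policy("my-endpoint", "AllTraffic", 100)
print(policy["ResourceId"])
```

With boto3, a dictionary like this could be passed as `boto3.client("application-autoscaling").put_scaling_policy(**policy)` once the variant has been registered as a scalable target with `register_scalable_target`.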


78

MCQ · easy

A data scientist is using Amazon SageMaker to train a model using a built-in algorithm. The training job uses a large dataset stored in Amazon S3, and the scientist wants to use pipe mode to stream the data directly from S3 to the training instance, reducing the time needed to download the data. The training job is configured with 'InputMode' set to 'Pipe'. However, the training job fails with an error indicating that the algorithm does not support pipe mode. What should the scientist do to resolve this issue?

A. Change the 'InputMode' to 'File'

B. Use a different instance type that supports pipe mode

C. Use AWS Glue to stream the data to the training instance

D. Switch to a different built-in algorithm that supports pipe mode

Answer: A

File mode downloads the data first; it is supported by all algorithms.

Why this answer

Option A is correct because not all built-in algorithms support Pipe mode; the scientist should use File mode instead. Option B is wrong because the issue is not with the instance type. Option D is wrong because switching algorithms is unnecessary when the current algorithm works in File mode.

Option C is wrong because SageMaker does not support using Glue to stream data directly to training jobs.


79

MCQ · easy

A team uses AWS Glue ETL jobs to preprocess data for SageMaker training. The job runs successfully but the output data is empty. What is the most likely cause?

A. There is a data type mismatch between source and target

B. The source data is partitioned and only a subset of partitions is read

C. The filter transformation condition is too restrictive, removing all rows

D. The Glue job runs out of memory and fails silently

Answer: C

Filtering all rows results in empty output.

Why this answer

Option C is correct: if the filter condition excludes all records, the output is empty. Option B (partition pruning) would not cause empty output if data exists in the partitions that are read. Option A (data type mismatch) causes errors, not empty output.

Option D (insufficient memory) causes job failure, not empty output.


80

Multi-Select · medium

A company uses Amazon SageMaker to train models. The data scientist wants to automate the retraining process whenever new data arrives in an S3 bucket. Which THREE services can be used together to achieve this? (Choose THREE.)

Select 3 answers

A. Amazon S3

B. Amazon EC2

C. AWS Lambda

D. Amazon SageMaker

E. AWS Glue

Answers: A, C, D

S3 events can trigger the pipeline.

Why this answer

Options A, C, and D are correct. A: Amazon S3 can trigger events on new data. C: AWS Lambda can process the event and start the training job.

D: SageMaker can run the training job. Option B (Amazon EC2) is not needed. Option E (AWS Glue) is for ETL, not directly for triggering retraining.
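The S3, Lambda, and SageMaker flow described above can be sketched as a Lambda handler. This is a minimal sketch under stated assumptions: the image URI, role ARN, output location, instance settings, and job-name scheme are hypothetical placeholders, and the optional `sagemaker_client` argument exists only so the function can be exercised without AWS access. The request keys follow the SageMaker `CreateTrainingJob` API.

```python
import time
from urllib.parse import unquote_plus

# Hypothetical placeholders; replace with real values for your account.
TRAINING_IMAGE = "<training-image-uri>"
ROLE_ARN = "<sagemaker-execution-role-arn>"
OUTPUT_S3_URI = "s3://<output-bucket>/models/"


def build_training_request(bucket, key, job_name):
    """Build CreateTrainingJob arguments that train on the uploaded object's prefix."""
    prefix = key.rsplit("/", 1)[0] if "/" in key else ""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": TRAINING_IMAGE,
            "TrainingInputMode": "File",
        },
        "RoleArn": ROLE_ARN,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/{prefix}",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": OUTPUT_S3_URI},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # illustrative choice
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def handler(event, context, sagemaker_client=None):
    """Lambda entry point: start one training job per S3 object-created record."""
    if sagemaker_client is None:
        import boto3  # imported lazily so the sketch can run without boto3
        sagemaker_client = boto3.client("sagemaker")
    started = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        job_name = f"retrain-{int(time.time())}-{len(started)}"
        request = build_training_request(bucket, key, job_name)
        sagemaker_client.create_training_job(**request)
        started.append(job_name)
    return started
```

In practice the bucket's event notification would be configured to invoke this function on object-created events, and the Lambda role would need permission to call `sagemaker:CreateTrainingJob` and to pass the execution role.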


81

MCQ · medium

A machine learning engineer is responsible for deploying a model that was trained using a custom algorithm in Amazon SageMaker. The engineer has built a Docker container that includes the inference code and has tested it locally. The engineer now wants to deploy the container to a SageMaker endpoint for real-time inference. The engineer has already created the model in SageMaker by specifying the image URI and the model artifacts location in S3. However, when the engineer tries to create an endpoint configuration, the operation fails with an error indicating that the model is not in an 'Active' state. What should the engineer do to resolve this issue?

A. Check the CloudWatch logs for the container to ensure the inference server starts correctly

B. Create the endpoint configuration with a different model name

C. Delete and re-create the model, then wait for a few minutes

D. Re-create the model using a different image URI

Answer: A

The health check requires the container to respond to a ping request. Logs will show if the server failed to start.

Why this answer

Option A is correct because the model must be in an 'Active' state before it can be deployed, and this requires the container to pass SageMaker's health check. The engineer should check the CloudWatch logs for the container to diagnose the health check failure. Option C is wrong because re-creating the model with the same image will not fix the health check issue.

Option D is wrong because the model is already created; the issue is its state, not the image URI. Option B is wrong because the endpoint configuration cannot be created while the model is not active, whatever name is used.


82

MCQ · easy

A company is using Amazon SageMaker to build a binary classification model. The dataset is highly imbalanced, with 95% negative class and 5% positive class. Which technique should be used to address the class imbalance?

A. Use a weighted loss function during training.

B. Use accuracy as the primary evaluation metric.

C. Perform random under-sampling of the majority class.

D. Remove all examples from the majority class.

Answer: A

Weighted loss penalizes errors on minority class more heavily.

Why this answer

Option A is correct because using a weighted loss function during training assigns higher weight to the minority class, helping the model learn better from imbalanced data. Option D is wrong because removing the majority class reduces data size and may lose important patterns. Option C is wrong because random under-sampling can discard useful data.

Option B is wrong because using accuracy as the evaluation metric is inappropriate for imbalanced data; precision/recall or AUC are better.
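To make the weighting idea concrete, the sketch below computes class weights for the 95%/5% split in the question and applies them in a weighted binary cross-entropy. The weighting scheme (weights inversely proportional to class frequency) and the function names are assumptions for illustration, not a specific SageMaker feature.

```python
import math


def class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * n_c)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    k = len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Weighted binary cross-entropy; probs are predicted P(y = 1)."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum


labels = [0] * 95 + [1] * 5          # 95% negative, 5% positive
weights = class_weights(labels)      # the positive class gets weight 10.0

# A model that always predicts a low positive probability is penalized
# much more heavily once the minority class is up-weighted.
constant_probs = [0.05] * len(labels)
unweighted = weighted_log_loss(labels, constant_probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, constant_probs, weights)
print(weights, unweighted, weighted)
```

Many training libraries expose the same idea through a per-class or per-example weight parameter, so the loss rarely needs to be written by hand.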


83

MCQ · medium

A data scientist uses SageMaker to train a model and wants to automatically stop the training job if the loss is not improving after a certain number of steps. Which feature should be used?

A. SageMaker Experiments

B. SageMaker Debugger

C. SageMaker Automatic Model Tuning

D. SageMaker Ground Truth

Answer: B

Debugger can monitor and stop jobs based on rules.

Why this answer

SageMaker Debugger can monitor loss and trigger actions like stopping the job, so Option B is correct. Option C is wrong because Automatic Model Tuning is for hyperparameter optimization.

Option A is wrong because Experiments is for tracking. Option D is wrong because Ground Truth is for labeling.


84

MCQ · hard

An ML team is using SageMaker Processing jobs to run feature engineering scripts. The scripts require a specific Python package not included in the default SageMaker image. How should the team provide this package?

A. Include 'pip install <package>' in the processing script

B. Use the SageMaker prebuilt deep learning container with the package

C. Place a requirements.txt file in the input data S3 bucket

D. Create a custom Docker image that includes the package and use it for the Processing job

Answer: D

Standard best practice for custom dependencies.

Why this answer

A custom container allows full control over dependencies, so Option D is correct. Option A is wrong because a pip install in the script is not persisted and must run again on every job. Option C is wrong because SageMaker doesn't support requirements.txt directly for Processing jobs.

Option B is wrong because a prebuilt image may not have the package.


85

MCQ · medium

A team is training an XGBoost model using SageMaker with a large dataset in S3 (100 GB). Training is taking too long. Which change will most likely reduce training time without sacrificing accuracy?

A. Reduce the number of training instances

B. Configure Pipe mode for data input

C. Enable SageMaker Managed Spot Training

D. Use a larger instance type with more vCPUs

Answer: B

Pipe mode streams data directly from S3, reducing I/O bottleneck and training time.

Why this answer

Option B is correct: Pipe mode streams data from S3 for faster I/O, so training starts without first downloading the full 100 GB. Option A (reducing instance count) would increase training time. Option C (SageMaker Managed Spot Training) reduces cost but does not accelerate training.

Option D (a larger instance type) adds compute but does not address the data loading bottleneck.


86

MCQ · hard

A data scientist is training a deep learning model on a large dataset using SageMaker. The training job is taking too long. Upon reviewing the CloudWatch logs, the scientist notices that the GPU utilization is below 10% most of the time. Which change is MOST likely to improve GPU utilization and reduce training time?

A. Increase the batch size in the training script.

B. Use a different optimizer that requires less computation.

C. Switch to a smaller instance type to reduce data transfer overhead.

D. Reduce the size of the training dataset.

Answer: A

Increasing batch size can improve GPU utilization by processing more data per step.

Why this answer

Low GPU utilization often indicates a data loading bottleneck. Increasing the batch size can improve GPU utilization by feeding more data at once, but it may also cause memory issues. Using a larger instance type with more GPU memory could help if the model is large.

However, the most common fix is to use SageMaker Pipe Mode or Fast File Mode to stream data efficiently, reducing I/O wait. Among the options, increasing batch size is a direct way to increase GPU utilization.


87

MCQ · easy

A data scientist has trained a model using SageMaker and wants to deploy it to an endpoint. Which step is required before deployment?

A. Upload the training data to S3

B. Create a custom Docker image

C. Retrain the model with more data

D. Register the model in SageMaker Model Registry

Answer: D

Model must be registered to be deployable.

Why this answer

Option D is correct because a model must be registered. Option A is wrong because training data is not needed after training. Option C is wrong because the model is already trained.

Option B is wrong because Docker images are not required for built-in algorithms.


88

MCQ · easy

An ML engineer is troubleshooting why an automated CI/CD pipeline cannot deploy an updated model to an existing SageMaker endpoint. The pipeline uses the IAM role that has the attached policy shown in the exhibit. What is the MOST likely cause of the failure?

A. The pipeline tries to update an existing endpoint, but the sagemaker:UpdateEndpoint action is not allowed.

B. The pipeline tries to create a new endpoint, but the sagemaker:CreateEndpoint action is denied.

C. The pipeline tries to delete the old endpoint, but the sagemaker:DeleteEndpoint action is denied by a Deny statement.

D. The pipeline attempts to invoke the endpoint, but the sagemaker:InvokeEndpoint action is denied.

Answer: A

The policy does not include sagemaker:UpdateEndpoint, which is required to update an existing endpoint. Without this permission, the update fails.

Why this answer

The pipeline is attempting to deploy an updated model to an existing SageMaker endpoint, which requires the sagemaker:UpdateEndpoint action. The IAM policy shown in the exhibit (not provided here but implied) does not include this action, so the API call fails with an access denied error. Without explicit permission to update the endpoint, the CI/CD pipeline cannot modify the deployed configuration.

Exam trap

The trap here is that candidates may confuse the actions required for updating an existing endpoint (UpdateEndpoint) with those for creating a new one (CreateEndpoint), leading them to incorrectly select Option B when the pipeline is actually performing an update.

How to eliminate wrong answers

Option B is wrong because the pipeline is not creating a new endpoint; it is updating an existing one, so sagemaker:CreateEndpoint is not the required action. Option C is wrong because the pipeline does not need to delete the old endpoint; SageMaker endpoints are updated in-place via UpdateEndpoint, which handles traffic shifting automatically. Option D is wrong because the pipeline is not invoking the endpoint during deployment; InvokeEndpoint is used for inference requests, not for model deployment operations.


89

MCQ · easy

An ML engineer needs to store and version training datasets and model artifacts. Which AWS service should they use?

A. Amazon DynamoDB

B. Amazon Simple Storage Service (S3)

C. Amazon Elastic File System (EFS)

D. Amazon Elastic Block Store (EBS)

Answer: B

S3 supports versioning and is commonly used for ML artifacts.

Why this answer

Amazon S3 provides scalable object storage with versioning, so Option B is correct. Option D is wrong because EBS is block storage for single EC2 instances. Option C is wrong because EFS is file storage for shared access.

Option A is wrong because DynamoDB is a NoSQL database.


90

MCQ · medium

An IAM policy attached to a SageMaker notebook role is shown in the exhibit. A data scientist is trying to run a training job from the notebook, but the job fails with an access denied error. The training job needs to read data from 'my-bucket' and write output to 'my-bucket'. What is the most likely cause of the failure?

A. The policy does not allow s3:ListBucket

B. The training job execution role does not have the same permissions

C. The policy does not allow sagemaker:CreateTrainingJob

D. The S3 bucket is not specified in the Resource

E. The policy does not allow s3:GetObject

Answer: B

The notebook role is used for the notebook; the training job uses an execution role that may lack permissions.

Why this answer

Option B is correct because the training job runs under its own execution role, which is separate from the notebook role; permissions granted to the notebook role do not carry over to the job. Options A, C, and E are wrong because the required actions are allowed by the notebook role's policy.

Option D is wrong because the bucket is specified in the Resource.


91

MCQ · easy

A data scientist needs to run a one-time training job on a large dataset using SageMaker. The job requires a specific PyTorch version and custom dependencies. Which approach is MOST efficient?

A. Create a custom Docker container and push to ECR.

B. Launch a SageMaker notebook instance, install dependencies, and run training script.

C. Use the SageMaker PyTorch estimator with a pre-built container.

D. Use the SageMaker generic container and install PyTorch via a lifecycle configuration.

Answer: C

The framework estimator manages the container and allows adding custom dependencies via source_dir.

Why this answer

SageMaker provides pre-built deep learning containers (DLCs) for PyTorch. Using a SageMaker framework estimator with a pre-built PyTorch container is the easiest and most efficient. The framework estimator automatically handles the container and dependencies.

Creating a custom container is more work. Using the generic container requires manually installing dependencies. Using a notebook instance is for interactive development, not one-time training.


92

MCQ · medium

A data scientist is performing hyperparameter tuning using Amazon SageMaker Automatic Model Tuning (AMT). The job uses a random search strategy. After 20 training jobs, the best objective metric value has plateaued. The data scientist wants to explore more of the hyperparameter space. Which action should the data scientist take?

A. Change the tuning strategy from Random to Bayesian.

B. Enable early stopping.

C. Decrease the maximum number of training jobs.

D. Increase the number of parallel training jobs.

Answer: A

Bayesian search uses past results to guide exploration.

Why this answer

Option A is correct because switching to Bayesian search will explore new regions based on previous results, potentially finding better hyperparameters. Option D is wrong because increasing the number of parallel jobs does not change the search strategy. Option C is wrong because decreasing the number of training jobs reduces exploration.

Option B is wrong because early stopping does not change the search strategy.


93

MCQ · easy

A data scientist is training a model on Amazon SageMaker and notices that the training job is taking much longer than expected. The instance type is ml.m5.xlarge and the dataset is 10 GB in CSV format. Which action is MOST likely to reduce training time without changing the instance type?

A. Change the instance type to ml.p3.2xlarge (GPU) for faster computation.

B. Reduce the number of training epochs to speed up convergence.

C. Convert the dataset to RecordIO or Parquet format before training.

D. Increase the batch size to the maximum supported by the instance memory.

Answer: C

RecordIO and Parquet are binary formats (Parquet is columnar) that reduce I/O and allow faster data loading in SageMaker than CSV.

Why this answer

Option C is correct because converting CSV to optimized formats like Parquet or RecordIO reduces I/O overhead and improves throughput. Option D is wrong because increasing batch size may help but does not address I/O. Option A is wrong because the stem rules out changing the instance type.

Option B is wrong because reducing epochs sacrifices accuracy and does not address the underlying I/O cost.


94

MCQ · easy

A data scientist is using Amazon SageMaker to train a model. Training is taking longer than expected. The scientist notices that the training job is using a single instance type with limited GPU memory. Which action will MOST likely reduce training time?

A. Configure the training job to use distributed data parallelism across multiple instances.

B. Use SageMaker Managed Spot Training to lower cost.

C. Use batch normalization layers.

D. Enable SageMaker Debugger for real-time monitoring.

Answer: A

Distributed data parallelism splits the dataset across multiple GPUs/instances, reducing per-worker memory and training time.

Why this answer

Distributed data parallelism (Option A) splits the data across multiple GPUs, reducing per-worker memory load and speeding up training. Option C (batch normalization) does not reduce training time. Option B (Spot Instances) introduces interruptions and may increase total time.

Option D (SageMaker Debugger) is for monitoring, not performance.


95

Multi-Select · medium

A company is using Amazon SageMaker to build a machine learning pipeline. The pipeline includes data preprocessing, training, and evaluation steps. The company wants to ensure that the pipeline is reproducible and that artifacts are versioned. Which TWO actions should be taken? (Choose TWO.)

Select 2 answers

A. Use a naming convention for training jobs that includes the date.

B. Use SageMaker Pipelines to create the pipeline and enable versioning on the pipeline artifacts.

C. Create a requirements.txt file with specific library versions for the training script.

D. Use AWS CodePipeline to trigger the pipeline on code changes.

E. Store the training dataset in a versioned S3 bucket.

Answers: B, C

SageMaker Pipelines version artifacts automatically.

Why this answer

SageMaker Pipelines provides a native way to define, orchestrate, and version machine learning pipelines. With versioning enabled on pipeline artifacts (for example, by registering models in SageMaker Model Registry), each pipeline run is tracked with a unique version, which supports reproducibility. Pinning specific library versions in a requirements.txt file (Option C) fixes the training environment, so a rerun uses the same dependencies. Together these address the requirement for reproducible pipelines and versioned artifacts.

Exam trap

The trap here is that candidates often confuse data versioning (Option E) with pipeline versioning, or assume that a naming convention (Option A) or CI/CD trigger (Option D) is sufficient for reproducibility, when in fact only a purpose-built pipeline orchestration service with artifact versioning (Option B) combined with environment pinning (Option C) meets both requirements.


96

MCQ · medium

A data scientist is training a model using Amazon SageMaker and notices that training is taking much longer than expected. The training job uses a single ml.p3.2xlarge instance. The data is stored in S3 and is about 50 GB in size. Which action would MOST likely reduce training time?

A. Enable automatic data sharding in the SageMaker training job.

B. Enable S3 server-side encryption on the training data.

C. Use a larger instance type, such as ml.p3.16xlarge.

D. Change the input mode from File to Pipe.

Answer: D

Pipe mode streams data directly from S3, reducing I/O time.

Why this answer

Option D is correct because using Pipe input mode streams data directly from S3 to the training algorithm without writing to disk, reducing I/O latency. Option C is wrong because increasing instance size may help but incurs higher cost and may not address the data loading bottleneck. Option A is wrong because the job runs on a single instance, so sharding data across instances does not apply.

Option B is wrong because server-side encryption protects data at rest and does not reduce training time.


97

MCQ · hard

A machine learning engineer is using Amazon SageMaker to train a model. The training job is taking too long. The engineer suspects the data loading is a bottleneck. Which action would MOST effectively diagnose the issue?

A. Monitor CPU utilization in CloudWatch

B. Enable SageMaker Model Monitor

C. Use SageMaker Debugger to profile the training job

D. Increase the instance type to a larger one

Answer: C

Debugger can capture detailed metrics like data loading time.

Why this answer

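Option C is correct because SageMaker Debugger can profile the training job and identify bottlenecks such as data loading. Option A is wrong because CPU utilization in CloudWatch is not detailed enough to isolate data loading. Option B is wrong because Model Monitor is for deployed endpoints (inference), not training jobs.

Option D is wrong because a larger instance changes the resources but does not diagnose the issue.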


98

MCQ · hard

A company is using Amazon SageMaker to train and deploy a fraud detection model. The model is a gradient boosting machine (GBM) trained on a dataset with 10 million rows and 50 features. The training job runs on an ml.m5.2xlarge instance with 8 vCPUs and 32 GB memory. The training completes successfully, and the model is deployed to a real-time endpoint. After deployment, the inference latency is around 200 ms per request, which is acceptable. However, after a week, the company observes that latency increases to over 1 second during peak hours (12:00-13:00 UTC). CloudWatch metrics show CPU utilization on the endpoint instance reaches 95% during these peaks. The endpoint is configured with a single ml.m5.large instance. The company wants to maintain latency under 500 ms during peak hours without incurring unnecessary cost during off-peak hours. Which solution should the company implement?

A. Reduce the number of instances to zero during off-peak hours and manually launch a new endpoint every day at 12:00

B. Switch to SageMaker Batch Transform and have the application send requests in batches

C. Configure SageMaker endpoint auto scaling with a target CPU utilization of 70% and a minimum instance count of 1

D. Replace the endpoint instance type with ml.m5.4xlarge to handle peak load

Answer: C

Auto scaling dynamically adjusts instance count to handle load, keeping latency low and cost efficient.

Why this answer

Option C is correct: configuring auto scaling based on CPU utilization adds instances during peak and removes them during off-peak, meeting latency and cost goals. Option B (Batch Transform) is for offline inference. Option D (larger instance) is less cost-effective because the extra capacity is paid for around the clock.

Option A (scaling to zero and relaunching manually) is error-prone and leaves the endpoint unavailable during off-peak hours.


99

MCQ · hard

A company is using SageMaker to train a large NLP model. The training job is taking too long due to high I/O wait time. The data is stored as CSV files in S3. Which optimization should the company implement to reduce I/O wait time?

A. Convert CSV files to RecordIO format

B. Use SageMaker Pipe mode to stream data directly from S3

C. Use SageMaker batch transform before training

D. Use SageMaker File mode with larger instance storage

E. Use SageMaker ShardedByS3Key data distribution

Answer: B

Pipe mode avoids disk I/O by streaming data.

Why this answer

Option B is correct because Pipe mode streams data directly from S3, reducing disk I/O. Option D (File mode) writes to disk first, causing I/O wait. Option E (ShardedByS3Key) is for distributed training but does not reduce I/O.

Option A (RecordIO) reduces file size but still uses File mode. Option C (SageMaker batch transform) is for inference, not training.


100

MCQ · medium

A data scientist is using Amazon SageMaker Autopilot to automatically build a binary classification model. After the Autopilot job completes, the best model has an accuracy of 0.85 on the validation set. However, the data scientist notices a class imbalance (90% negative, 10% positive). Which metric should the data scientist use to evaluate the model's performance on the positive class?

A. Area Under the ROC Curve (AUC)

B. Recall

C. Accuracy

D. Precision

Answer: A

AUC is robust to class imbalance and evaluates overall ranking performance.

Why this answer

Option A is correct because AUC measures the model's ability to distinguish between classes irrespective of threshold, and is suitable for imbalanced datasets. Option C is wrong because accuracy is misleading when classes are imbalanced. Option D is wrong because precision only considers positive predictions.

Option B is wrong because recall only considers actual positives.
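A small worked example shows why accuracy misleads on a 90%/10% split. The sketch below scores a deliberately useless model (the same score for every example, an assumption made for illustration) and compares accuracy, recall, and AUC; the helper names are hypothetical.

```python
def confusion_counts(labels, preds):
    """Return (tp, fp, fn, tn) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return tp, fp, fn, tn


def auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0] * 90 + [1] * 10     # 90% negative, 10% positive
scores = [0.1] * 100             # useless model: identical score for everyone
preds = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(labels, preds)
accuracy = (tp + tn) / len(labels)   # high, because negatives dominate
recall = tp / (tp + fn)              # no positives are found
model_auc = auc(labels, scores)      # no better than chance
print(accuracy, recall, model_auc)
```

Accuracy looks strong only because the negative class dominates, while recall and AUC both expose that the model cannot separate the classes.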


101

Multi-Select · medium

An ML team is deploying a model for real-time inference. They require A/B testing to compare a new model against the existing one. Which THREE steps should they take to set up this test?

Select 3 answers

A. Set up a second production variant for the existing model

B. Set up a SageMaker Batch Transform job for each model

C. Configure CloudWatch alarms to trigger variant switching

D. Configure the endpoint to route a percentage of traffic to each variant

E. Create a SageMaker production variant for the new model

Answers: A, D, E

Both models must be variants to split traffic.

Why this answer

Options A, D, and E are correct. E: Create a production variant with the new model. A: Set up a production variant for the existing model.

D: Configure the endpoint to route traffic between variants. Option C (CloudWatch) is monitoring, not part of setup. Option B (Batch Transform) is for batch, not real-time.
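The three steps can be sketched as a single endpoint configuration with two weighted production variants. This is a minimal sketch: the configuration name, model names, variant names, instance type, and the 10% traffic share are assumptions for illustration, and the function only builds the request dictionary. The keys follow the SageMaker `CreateEndpointConfig` API.

```python
def build_ab_endpoint_config(config_name, existing_model, new_model,
                             new_traffic_share=0.1, instance_type="ml.m5.large"):
    """Build create_endpoint_config arguments with two weighted production variants."""
    def variant(name, model, weight):
        return {
            "VariantName": name,
            "ModelName": model,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            variant("existing", existing_model, 1.0 - new_traffic_share),
            variant("candidate", new_model, new_traffic_share),
        ],
    }


def traffic_shares(config):
    """Return each variant's traffic share: its weight divided by the total weight."""
    variants = config["ProductionVariants"]
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical names for the configuration and the two models.
config = build_ab_endpoint_config("fraud-ab-test", "model-v1", "model-v2")
shares = traffic_shares(config)
print(shares)
```

SageMaker routes requests in proportion to the variant weights, so the weights do not need to sum to 1; only their ratio matters.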


102

Multi-Select · hard

A company is using Amazon SageMaker to train a machine learning model. The training job is configured to use the File mode to download data from S3 to the training instances. The training data is stored in a single S3 bucket with multiple prefixes. Which TWO actions are required to ensure the training job can access the data? (Choose TWO.)

Select 2 answers

A. Grant the SageMaker execution role s3:GetObject permission for the data bucket.

B. Configure the training job to use Pipe mode.

C. Specify the S3 data channel with the correct prefix.

D. Concatenate all data files into a single file.

E. Convert the data to RecordIO-protobuf format.

Answers: A, C

Needed to read objects.

Why this answer

Options A and C are correct. Option A: The IAM role must have s3:GetObject permission for the bucket. Option C: The input data channel must specify the S3 URI with the correct prefix.

Option E is wrong because File mode does not require RecordIO. Option B is wrong because Pipe mode is not used. Option D is wrong because File mode does not require data in a single file.
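As an illustration of the permission side, the sketch below builds an IAM policy document that lets an execution role read objects under specific prefixes. The bucket name and prefixes are hypothetical, and the `s3:ListBucket` statement is an assumption added because prefix-based inputs generally need to list the bucket as well; the explanation above only names `s3:GetObject`.

```python
import json


def build_read_policy(bucket, prefixes):
    """Build an IAM policy document granting read access to the given prefixes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Object-level permission: one resource per data prefix.
                "Resource": [f"arn:aws:s3:::{bucket}/{p.strip('/')}/*" for p in prefixes],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                # Bucket-level permission (assumed to be needed for prefix listing).
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
        ],
    }


# Hypothetical bucket and prefixes.
policy = build_read_policy("my-data-bucket", ["train/", "validation"])
print(json.dumps(policy, indent=2))
```

Each training channel would then point at one of these prefixes, for example `s3://my-data-bucket/train/`, so that the role's permissions and the channel's S3 URI agree.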


103

Multi-Select · hard

Which THREE are valid considerations when deploying a large deep learning model (10 GB) on a SageMaker endpoint? (Choose 3.)

Select 3 answers

A. Enable SageMaker Data Compression for network transfer.

B. Use GPU instances (e.g., p3, inf1) for faster inference.

C. Use SageMaker Multi-Model Endpoints to serve multiple models.

D. Use SageMaker Serverless Inference to avoid managing instances.

E. Attach Elastic Inference accelerators.

Answers: A, B, C

Compression reduces data transfer time.

Why this answer

Using GPU instances (Option B), enabling data compression (Option A), and using multi-model endpoints (Option C) are valid considerations. Option E (Elastic Inference) is deprecated and not recommended. Option D (serverless inference) has a payload limit and cold starts unsuitable for large models.


104

MCQ · hard

A media company uses Amazon SageMaker to train a deep learning model for video classification. The training job uses a single ml.p3.2xlarge instance and processes 50 GB of labeled video data stored in Amazon S3. The training completes successfully in 12 hours. However, the data scientists report that the model’s accuracy is lower than expected. They suspect the training data contains labeling errors. To improve model accuracy without incurring significant additional cost, they want to identify and remove mislabeled training examples before retraining. They have a small budget of $50 and need to complete the analysis within 2 hours. Which approach should the data scientists take?

A. Use SageMaker Ground Truth to create a new labeling job for the entire dataset, then compare the new labels with the original labels to identify discrepancies.

B. Use SageMaker Clarify to generate a bias report for the training data and remove instances that contribute to bias.

C. Train a small, fast model on a random sample of the data (e.g., 1 GB) using a cheaper instance like ml.m5.xlarge, then use the model's prediction confidence to flag low-confidence examples as potential mislabels for manual review.

D. Manually review all 50 GB of video data to correct labels.

Answer: C

This approach is cost-effective (within $50) and fast (under 2 hours). The small model can identify likely mislabeled examples by low confidence, allowing targeted manual review.

Why this answer

Option C is correct because training a small, fast model on a 1 GB random sample using a cheaper instance (ml.m5.xlarge) allows the team to quickly identify low-confidence predictions, which are strong indicators of mislabeled examples. This approach fits within the $50 budget and 2-hour time constraint, as it avoids processing the full 50 GB dataset and leverages a lightweight model for rapid iteration. By flagging only suspicious samples for manual review, the team can efficiently clean the training data without incurring the cost of re-labeling the entire dataset.

Exam trap

The trap here is that candidates may choose SageMaker Ground Truth (Option A) assuming it is the standard tool for label correction, but they overlook the strict budget and time constraints that make it infeasible for the full dataset.

How to eliminate wrong answers

Option A is wrong because using SageMaker Ground Truth to create a new labeling job for the entire 50 GB dataset would exceed the $50 budget and 2-hour time limit, as labeling large video datasets is expensive and time-consuming. Option B is wrong because SageMaker Clarify is designed for detecting bias in data and models, not for identifying individual mislabeled examples; it generates bias reports but cannot pinpoint which specific labels are erroneous. Option D is wrong because manually reviewing all 50 GB of video data is impractical within the 2-hour window and would far exceed the $50 budget, requiring significant human effort and cost.
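The confidence-based flagging idea can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual method: it assumes the small model outputs one probability per class for each example, and it flags an example when the probability assigned to its existing label falls below a cutoff. The function name `flag_suspect_labels` and the 0.5 threshold are hypothetical choices.

```python
from typing import List, Sequence


def flag_suspect_labels(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float = 0.5,  # hypothetical cutoff; tune on a reviewed sample
) -> List[int]:
    """Return indices of examples whose current label looks doubtful.

    probs[i][c] is the model's predicted probability that example i
    belongs to class c; labels[i] is the label currently in the dataset.
    """
    suspects = []
    for index, (row, label) in enumerate(zip(probs, labels)):
        # Low probability on the recorded label suggests a possible mislabel.
        if row[label] < threshold:
            suspects.append(index)
    return suspects
```

Only the returned indices would go to manual review, which is what keeps the cost and time small compared with relabeling everything.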

Practice this question →

105

Multi-Selecthard

A company is deploying a machine learning model to a SageMaker endpoint and wants to ensure that the endpoint is resilient to instance failures. Which THREE steps should the company take to achieve high availability? (Choose THREE.)

Select 3 answers

A.Deploy the endpoint in a VPC with subnets in at least two Availability Zones.

B.Use a single instance type with the largest size to handle capacity.

C.Configure the endpoint with an initial instance count of at least 2.

D.Use a single Availability Zone for simplicity.

E.Enable auto-scaling to automatically replace unhealthy instances.

AnswersA, C, E

Provides AZ redundancy.

Why this answer

Option A is correct because deploying across multiple Availability Zones provides zone redundancy. Option C is correct because using multiple instances in the endpoint configuration ensures that if one instance fails, others handle traffic. Option E is correct because enabling auto-scaling can replace failed instances.

Option B is wrong because a single large instance remains a single point of failure. Option D is wrong because a single AZ does not protect against AZ failure.
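As a rough sketch of the multi-instance and multi-subnet settings, the helpers below build the request dictionaries that would be passed to the SageMaker `create_endpoint_config` and `create_model` calls. The helper names, the variant name `AllTraffic`, and the default instance type are illustrative assumptions; the code only builds dictionaries and does not call AWS.

```python
from typing import Dict, List


def build_endpoint_config(name: str, model_name: str,
                          instance_type: str = "ml.m5.large",
                          instance_count: int = 2) -> Dict:
    """Build an endpoint configuration with at least two instances."""
    if instance_count < 2:
        raise ValueError("use at least 2 instances for high availability")
    return {
        "EndpointConfigName": name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",  # hypothetical variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }


def build_vpc_config(subnet_ids: List[str], security_group_ids: List[str]) -> Dict:
    """Build the VpcConfig block for create_model.

    The subnets should sit in different Availability Zones; that cannot be
    checked from the IDs alone, so only the count is validated here.
    """
    if len(subnet_ids) < 2:
        raise ValueError("provide subnets in at least two Availability Zones")
    return {"Subnets": subnet_ids, "SecurityGroupIds": security_group_ids}
```

With a boto3 SageMaker client, the first dictionary could be passed as `client.create_endpoint_config(**config)`.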

Practice this question →

106

Multi-Selecthard

A data scientist is using Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error message indicating that the container exited with a non-zero code. Which THREE steps should the data scientist take to diagnose the issue? (Choose THREE.)

Select 3 answers

A.Retry the training job with the same configuration; the error might be transient.

B.Use the SageMaker Debugger to capture system metrics and output tensors for analysis.

C.Check the CloudWatch Logs for the training job to see the container's stdout and stderr.

D.Increase the number of training instances to distribute the workload.

E.Run the container locally using SageMaker Local Mode to simulate the training environment.

AnswersB, C, E

Debugger can capture detailed metrics that help identify why the container exited.

Why this answer

Options B, C, and E are correct because using the SageMaker Debugger, checking CloudWatch Logs, and testing locally with SageMaker Local Mode can all help identify the error. Option D is wrong because increasing the instance count does not fix the error. Option A is wrong because retrying with the same configuration may mask the issue.

Practice this question →

107

MCQeasy

A data scientist is trying to create a SageMaker training job using an execution role with the attached IAM policy. The training job fails with an access denied error when trying to read training data from the S3 bucket 'my-bucket'. What is the most likely cause?

A.The S3 bucket policy explicitly denies access to the role.

B.The IAM policy does not include s3:ListBucket permission.

C.The S3 bucket is in a different AWS account.

D.The sagemaker:CreateTrainingJob action is not allowed.

AnswerA

Even if IAM allows, bucket policy can deny.

Why this answer

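Option A is correct because a bucket policy can explicitly deny access even when the role's IAM policy allows it, and an explicit deny overrides an allow.

The remaining options are wrong because the attached policy already grants the permissions they describe (for example, the role has s3:GetObject).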

Practice this question →

108

MCQhard

A machine learning engineer is deploying a model on Amazon SageMaker that was trained using a custom Docker container. The container is stored in Amazon ECR. The engineer creates a SageMaker model and endpoint configuration, but when creating the endpoint, it fails with an error: 'Could not find the inference code at the expected path.' The engineer verified that the container image is correct and the model artifacts are in S3. What is the most likely cause?

A.The container is not compatible with the SageMaker inference environment.

B.The SageMaker execution role does not have ECR pull permissions.

C.The model artifacts are not in the correct format.

D.The inference code is not placed in the /opt/ml/model directory inside the container.

AnswerD

SageMaker expects code in /opt/ml/model for custom containers.

Why this answer

Option D is correct because SageMaker expects the inference code to be in the /opt/ml/model directory. Option B is wrong because missing ECR permissions would cause a different error. Option C is wrong because model artifacts are separate from the inference code.

Option A is wrong because the engineer has already verified that the container image is correct.

Practice this question →

109

MCQhard

A company deploys a real-time inference endpoint using Amazon SageMaker with an ML model that has strict latency requirements. The endpoint currently uses a single ml.c5.xlarge instance. During a load test, the p99 latency exceeds the 100ms threshold. The team adds more instances but latency does not improve because the model is heavily CPU-bound. What is the MOST cost-effective change to meet the latency requirement?

A.Change the instance type to a GPU instance such as ml.g4dn.xlarge.

B.Use a multi-model endpoint to serve multiple models on the same instance.

C.Enable automatic scaling based on inference latency.

D.Increase the number of instances and use a target tracking scaling policy.

AnswerA

GPU instances accelerate model inference, reducing per-request latency.

Why this answer

Switching to an instance with GPU acceleration (e.g., ml.g4dn.xlarge) offloads computation to the GPU, reducing CPU-bound latency. More instances (D) increase throughput but not per-request latency if the model is CPU-bound. Multi-model endpoints (B) help with many models but not single-model latency.

Automatic scaling (C) helps with varying load but does not improve per-request latency.

Practice this question →

110

MCQeasy

A data scientist needs to deploy a trained model to Amazon SageMaker for real-time inference. The model is stored as a .tar.gz file in Amazon S3. Which AWS service is used to create a SageMaker endpoint?

A.SageMaker Model and Endpoint Configuration

B.AWS Lambda

C.AWS CloudFormation

D.Amazon ECS

AnswerA

You create a Model, then EndpointConfig, then Endpoint.

Why this answer

Option A is correct because a SageMaker endpoint requires a model and an endpoint configuration. Options B, C, and D are not required for creating the endpoint.

Practice this question →

111

MCQeasy

A data scientist trains a model using Amazon SageMaker's built-in XGBoost algorithm. The model overfits on the training data. Which hyperparameter adjustment is MOST likely to reduce overfitting?

A.Increase the value of the max_depth hyperparameter.

B.Increase the value of the subsample hyperparameter to 1.0.

C.Increase the value of the lambda (L2 regularization) hyperparameter.

D.Increase the value of the num_round hyperparameter.

AnswerC

L2 regularization penalizes large coefficients, reducing model complexity and overfitting.

Why this answer

Option C is correct because increasing the L2 regularization lambda penalizes large weights and reduces overfitting. Option A is wrong because increasing max_depth increases model complexity, worsening overfitting. Option D is wrong because increasing num_round can lead to more overfitting.

Option B is wrong because raising subsample to 1.0 makes every tree use all training rows, which removes the row sampling that helps counter overfitting.
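To see why a larger lambda shrinks the model, it can help to look at the closed-form leaf weight used in gradient-boosted trees with an L2 penalty, w = -G / (H + lambda), where G and H are the sums of first and second derivatives of the loss over the examples in a leaf. The sketch below assumes that standard formula (with no L1 term); the function name is illustrative.

```python
def optimal_leaf_weight(grad_sum: float, hess_sum: float, reg_lambda: float) -> float:
    """Leaf weight under an L2 penalty: w = -G / (H + lambda).

    A larger reg_lambda increases the denominator, so the weight moves
    toward zero and each tree contributes a more conservative update.
    """
    return -grad_sum / (hess_sum + reg_lambda)
```

For a leaf with G = -4 and H = 2, the weight is 2.0 with lambda = 0 and 1.0 with lambda = 2, which illustrates the shrinkage effect.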

Practice this question →

112

Multi-Selecthard

A company is using Amazon SageMaker to build a custom model. The training job is failing with a 'ResourceLimitExceeded' error. Which TWO actions should the company take to resolve this issue?

Select 2 answers

A.Request a service quota increase for SageMaker training instances.

B.Use spot instances for training.

C.Use Amazon EFS for training data.

D.Reduce the size of the training dataset.

E.Use a smaller instance type.

AnswersA, B

Increases the maximum number of instances.

Why this answer

Option A is correct because it increases the limit. Option B is correct because spot instances can help. Option D is wrong because reducing the dataset size doesn't address the limit.

Option C is wrong because it's for storage. Option E is wrong because it doesn't help with compute limits.

Practice this question →

113

MCQeasy

A data science team is deploying a machine learning model to production using Amazon SageMaker. The model requires real-time inference with low latency. Which SageMaker feature should they use to deploy the model?

A.SageMaker Notebook Instance

B.SageMaker Batch Transform

C.SageMaker Autopilot

D.SageMaker Realtime Endpoint

AnswerD

Provides low-latency, real-time inference.

Why this answer

Option D is correct because SageMaker Realtime Endpoints provide low-latency, synchronous inference. Option B is wrong because batch transform is for asynchronous, batch predictions. Option A is wrong because SageMaker Notebooks are for development, not deployment.

Option C is wrong because SageMaker Autopilot automates model building, not deployment.

Practice this question →

114

MCQhard

A machine learning team is using Amazon SageMaker Experiments to track multiple training runs. They need to compare the performance of different models based on metrics like accuracy and F1 score. However, when they view the experiment list in SageMaker Studio, the metrics are not displayed. What is the MOST likely cause?

A.The training job did not define metric definitions in the algorithm specification.

B.The training script did not use the SageMaker SDK to log the metrics.

C.The training job is running on an instance type that does not support Experiments.

D.The IAM role used by SageMaker does not have permission to write to the Experiments table.

AnswerB

Metrics must be logged using experiment.log_metric() or automatically if using frameworks with SageMaker integration.

Why this answer

Option B is correct because metrics must be explicitly logged via the SageMaker SDK's log_metric() or in the training job definition to appear in Experiments. Option D is wrong because permission to view is separate from metric capture. Option A is wrong because the metric definitions in the training job are optional and not required for logging.

Option C is wrong because Experiments run on any instance type.

Practice this question →

115

MCQhard

A training job log shows this error. The training instance is an ml.m5.large with 8 GB EBS storage. The training data is 500 MB, and the model size is expected to be 200 MB. What is the most likely cause?

A.The training data is not fully downloaded from S3 before processing

B.The S3 bucket does not have write permissions

C.The training instance does not have enough RAM

D.The training process is generating large temporary files that fill the instance's local storage

AnswerD

Intermediate files, checkpoints, or logs can exceed the 8 GB storage.

Why this answer

Option D is correct: the instance's local storage is full due to temporary files or checkpoints. Option C (insufficient memory) would show MemoryError. Option B (S3 permissions) would show AccessDenied.

Option A (data download) would show download errors.

Practice this question →

116

Multi-Selecteasy

A data scientist needs to select a model training infrastructure that supports distributed training across multiple GPUs and provides automatic model parallelism. Which TWO AWS services should the scientist consider?

Select 2 answers

A.AWS Glue

B.AWS Lambda

C.Amazon Redshift

D.Amazon EMR

E.Amazon SageMaker

AnswersD, E

EMR with Spark MLlib can perform distributed training.

Why this answer

Options E (SageMaker) and D (Amazon EMR) are correct. SageMaker supports distributed training with model parallelism. EMR with Spark supports distributed ML.

Option A (AWS Glue) is for ETL, not training. Option C (Amazon Redshift) is a data warehouse. Option B (AWS Lambda) is not for large-scale training.

Practice this question →

117

MCQmedium

A data scientist is using Amazon SageMaker to train a model using the built-in XGBoost algorithm. The training job uses a hyperparameter tuning job to optimize hyperparameters. The tuning job has been running for 3 hours and has completed 20 training jobs. The data scientist wants to stop the tuning job early if it is not making progress. What should the data scientist do to accomplish this?

A.Configure the tuning job with early stopping enabled.

B.Set up a CloudWatch alarm to stop the tuning job if a metric does not improve.

C.Use SageMaker Experiments to monitor and manually stop the tuning job.

D.Use SageMaker Debugger to stop training jobs that are not improving.

AnswerA

Built-in early stopping stops underperforming training jobs.

Why this answer

Option A is correct because SageMaker's automatic model tuning supports early stopping through the training job early stopping type setting. Option C is wrong because SageMaker Experiments is for tracking, not stopping. Option D is wrong because SageMaker Debugger stops training jobs, not tuning jobs.

Option B is wrong because a CloudWatch alarm cannot stop a tuning job directly.
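A minimal sketch of how early stopping might be switched on in a tuning job configuration follows. It builds the `HyperParameterTuningJobConfig` dictionary used by the SageMaker `CreateHyperParameterTuningJob` API, where `TrainingJobEarlyStoppingType` accepts `Auto` or `Off`. The helper name, the metric name `validation:auc`, and the job limits are assumptions for illustration, and the required `ParameterRanges` block is omitted for brevity.

```python
from typing import Dict


def build_tuning_config(metric_name: str, max_jobs: int, max_parallel: int,
                        early_stopping: bool = True) -> Dict:
    """Build a tuning job config with early stopping on or off."""
    return {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": metric_name,  # e.g. "validation:auc" (assumed)
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel,
        },
        # "Auto" lets SageMaker stop training jobs that are unlikely to
        # beat the best objective value found so far.
        "TrainingJobEarlyStoppingType": "Auto" if early_stopping else "Off",
    }
```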

Practice this question →

118

MCQeasy

A company is using SageMaker to train a model. The training data includes personally identifiable information (PII). The company must ensure that the data is encrypted at rest and in transit. Which combination of actions meets these requirements?

A.Use S3 server-side encryption with S3 managed keys (SSE-S3)

B.Enable SSL for data in transit and use VPC endpoints

C.Place all resources in a private VPC subnets with no internet access

D.Use S3 server-side encryption and enable SageMaker inter-container traffic encryption

AnswerD

S3 SSE encrypts at rest; SageMaker inter-container encryption uses TLS for in-transit.

Why this answer

Option D is correct: S3 SSE-S3 or SSE-KMS encrypts data at rest, and SageMaker uses HTTPS for in-transit encryption. Option B (only SSL) lacks at-rest encryption. Option A (only S3 SSE) lacks in-transit encryption.

Option C (VPC only) does not encrypt.
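The two encryption settings can be sketched as fragments of a `create_training_job` request. `EnableInterContainerTrafficEncryption` and `OutputDataConfig.KmsKeyId` are real request fields; the helper name and the use of a KMS key (rather than S3 managed keys) are assumptions made for this illustration.

```python
from typing import Dict


def build_encryption_settings(output_s3_uri: str, kms_key_id: str) -> Dict:
    """Return the encryption-related parts of a training job request."""
    return {
        # Encrypts traffic between containers in a distributed training job.
        "EnableInterContainerTrafficEncryption": True,
        # Encrypts the model artifacts written to S3 with the given key.
        "OutputDataConfig": {
            "S3OutputPath": output_s3_uri,
            "KmsKeyId": kms_key_id,
        },
    }
```

These fragments would be merged with the rest of the training job parameters before the request is sent.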

Practice this question →

119

MCQmedium

A research lab is using SageMaker to train deep learning models on a custom dataset stored in S3. Each training job uses a single ml.p3.2xlarge instance. Recently, training jobs have been failing intermittently with 'NetworkError: Connection reset by peer' during the data download phase. The data scientist notices that the dataset is 50GB and the network throughput is low. The training script uses the default S3 download method (boto3) to copy data from S3 to the local instance storage. Which solution should the data scientist implement to resolve the issue?

A.Mount an EBS volume to the instance and copy data there before training.

B.Use SageMaker Pipe mode to stream data directly from S3.

C.Add retry logic in the training script to handle network errors.

D.Use a larger instance type like p3.8xlarge for better network bandwidth.

AnswerB

Pipe mode avoids large local file downloads and is more resilient.

Why this answer

Option B is correct because using SageMaker's Pipe mode streams data directly from S3 without writing to local disk, which is more reliable and avoids large downloads. Option C is incorrect because simply retrying may not fix the underlying network issue. Option D is incorrect because using a larger instance does not guarantee improved network reliability.

Option A is incorrect because using EBS volumes adds cost and does not solve the network reset issue; data still needs to be downloaded.
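As an illustration of where Pipe mode is selected, the helper below builds one entry of the `InputDataConfig` list for a training job, with `InputMode` set to `Pipe`. The helper name and the example channel name are hypothetical; the training script must also read from the stream rather than from local files for Pipe mode to work.

```python
from typing import Dict

VALID_INPUT_MODES = ("File", "Pipe", "FastFile")


def build_input_channel(channel_name: str, s3_uri: str,
                        input_mode: str = "Pipe") -> Dict:
    """Build one training input channel that streams data from S3."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "InputMode": input_mode,
    }
```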

Practice this question →

120

MCQeasy

An ML engineer needs to run a hyperparameter tuning job on Amazon SageMaker. The training algorithm supports distributed training across multiple GPUs. The engineer wants to minimize the total time to find the best hyperparameters. Which strategy should be used?

A.Use random search to explore a wide range.

B.Use grid search to cover all combinations.

C.Use Hyperband which is designed for distributed training.

D.Use Bayesian optimization as the tuning strategy.

AnswerD

Bayesian optimization adaptively selects hyperparameters, reducing total tuning time.

Why this answer

Bayesian search uses past results to select hyperparameters, converging faster than random or grid search. Random search (A) does not use past results. Grid search (B) is exhaustive and slow.

Hyperband (C) is a bandit method but requires early stopping; Bayesian is better for minimizing total time with a fixed budget.

Practice this question →

121

MCQhard

Refer to the exhibit. A SageMaker training job is launched with the CLI command shown. The job fails with an error 'S3 data distribution type not supported for File mode'. What is the most likely fix?

A.Change TrainingInputMode to Pipe

B.Change InstanceCount to 2

C.Increase VolumeSizeInGB to 100

D.Increase MaxRuntimeInSeconds to 7200

AnswerA

FullyReplicated is only supported in Pipe mode.

Why this answer

Option A is correct because FullyReplicated data distribution is only supported in Pipe mode. Option C is wrong because increasing volume size does not affect data distribution. Option B is wrong because changing the instance count does not fix the mode mismatch.

Option D is wrong because MaxRuntime is not related.

Practice this question →

122

MCQmedium

A company is using Amazon SageMaker to host a model for real-time inference. The model was trained using SageMaker's built-in Linear Learner algorithm. The endpoint has been running for a week, and the operations team notices that the endpoint's latency has increased from 50 ms to 150 ms over the past few days. The number of requests per second has remained steady at about 200. The team suspects a memory leak in the inference container. What should the team do to diagnose the issue?

A.Enable CloudWatch Logs and use Container Insights to view memory utilization.

B.Use Amazon CloudWatch to monitor the endpoint's latency metric.

C.Use SageMaker Debugger to inspect the inference container.

D.Use SageMaker Model Monitor to detect data drift.

AnswerA

Container Insights shows memory usage trends, helping diagnose leaks.

Why this answer

Option A is correct because CloudWatch Logs and Container Insights can show memory usage over time. Option B is wrong because CloudWatch metrics show latency but not memory directly. Option C is wrong because SageMaker Debugger is for training.

Option D is wrong because SageMaker Model Monitor is for data drift, not memory leaks.

Practice this question →

123

Multi-Selectmedium

A company is using Amazon SageMaker to train and deploy machine learning models. The data science team wants to track and compare model versions, hyperparameters, and metrics across multiple training jobs. Which TWO AWS services should they use together to achieve this? (Choose TWO.)

Select 2 answers

A.Amazon RDS

B.Amazon CloudWatch Logs

C.AWS Glue

D.Amazon S3

E.Amazon SageMaker Experiments

AnswersD, E

S3 stores experiment artifacts and outputs.

Why this answer

Options D and E are correct. SageMaker Experiments provides tracking of training jobs, metrics, and parameters. Amazon S3 stores the experiment outputs and artifacts.

Option C is wrong because AWS Glue is for ETL, not for experiment tracking. Option A is wrong because Amazon RDS is a relational database. Option B is wrong because CloudWatch Logs is for logs, not for experiment tracking.

Practice this question →

124

Multi-Selectmedium

Which TWO configuration steps are necessary to deploy a custom Docker container for training in Amazon SageMaker? (Choose two.)

Select 2 answers

A.Expose a REST API endpoint for inference

B.Implement the train function in the container that saves model artifacts to /opt/ml/model

C.Define a Docker Compose file to manage multi-container training

D.Include a training script that reads hyperparameters from /opt/ml/input/config/hyperparameters.json

E.Push the container image to Docker Hub

AnswersB, D

SageMaker expects the model to be saved in /opt/ml/model.

Why this answer

Options B and D are correct. SageMaker requires the training container to read its hyperparameters from /opt/ml/input/config/hyperparameters.json and to implement the train function, which saves model artifacts to /opt/ml/model. Option E is incorrect because the container must be stored in ECR, not Docker Hub.

Option A is not needed for training; a REST API endpoint is only required for inference. Option C is incorrect because SageMaker does not use Docker Compose.
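The container contract described here can be sketched as a small training entry point. This is a minimal sketch under stated assumptions: the function name `run_training`, the output file name `model.json`, and the `prefix` argument (added so the paths can be redirected for local testing) are hypothetical, and a real script would fit a model where the comment indicates.

```python
import json
import os


def run_training(prefix: str = "/opt/ml") -> str:
    """Read hyperparameters, 'train', and save an artifact for SageMaker."""
    hp_path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    with open(hp_path) as f:
        # SageMaker passes hyperparameter values as strings.
        hyperparameters = json.load(f)

    # A real script would train a model here using the hyperparameters.

    model_dir = os.path.join(prefix, "model")
    os.makedirs(model_dir, exist_ok=True)
    artifact_path = os.path.join(model_dir, "model.json")  # hypothetical name
    with open(artifact_path, "w") as f:
        json.dump({"hyperparameters": hyperparameters}, f)
    return artifact_path
```

Inside the container, the entry point would call `run_training()` with the default prefix; anything written under /opt/ml/model is what SageMaker packages and uploads when the job finishes.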

Practice this question →

125

Multi-Selecthard

A machine learning team is using Amazon SageMaker to train a deep learning model on a large dataset stored in Amazon S3. The training job is taking too long. The team wants to reduce training time without modifying the model architecture. Which THREE actions should the team take? (Choose 3.)

Select 3 answers

A.Enable SageMaker Managed Spot Training to use cheaper spot instances.

B.Use a larger instance type with more vCPUs and memory.

C.Use distributed training with multiple GPU instances.

D.Use Pipe input mode to stream data from S3 instead of downloading it.

E.Use SageMaker Processing to preprocess the data.

AnswersA, C, D

Spot instances can reduce cost and training time if interruptions are handled.

Why this answer

Option D is correct because Pipe mode streams data directly from S3, reducing download time. Option A is correct because SageMaker Managed Spot Training can reduce training time by using cheaper spot instances. Option C is correct because distributed training with multiple GPUs can parallelize computation.

Option B is wrong because increasing instance size may help but is less effective than distributed training. Option E is wrong because SageMaker Processing is for data processing, not training.

Practice this question →

126

Multi-Selecthard

Which THREE factors should be considered when choosing an instance type for a SageMaker training job?

Select 3 answers

A.The number of vCPUs needed for parallel processing

B.The memory requirements of the model

C.The endpoint latency requirement

D.The AWS region where the instance is launched

E.The GPU requirements for model training

AnswersA, B, E

More vCPUs can speed up training.

Why this answer

Options A, B, and E are correct. Option D is wrong because the region is not a factor; you can use any region. Option C is wrong because endpoints are for serving, not training.

Practice this question →

127

MCQmedium

A company is deploying a machine learning model using SageMaker. The model is a PyTorch model that requires GPU for inference. The company wants to minimize costs while ensuring low latency. Which instance type should be used for the SageMaker endpoint?

A.ml.m5.large

B.ml.p3.2xlarge

C.ml.c5.2xlarge

D.ml.g4dn.xlarge

AnswerB

GPU instance optimized for inference, cost-effective.

Why this answer

Option B is correct because ml.p3.2xlarge is a GPU instance optimized for inference with low latency, and it is cost-effective for production workloads. Option C is wrong because ml.c5.2xlarge is a CPU instance and does not have a GPU. Option D is wrong because ml.g4dn.xlarge is also a GPU instance but may not provide the same performance for PyTorch models as ml.p3.

Option A is wrong because ml.m5.large is a general-purpose CPU instance without GPU support.

Practice this question →

128

MCQhard

A company operates a real-time fraud detection system using SageMaker. The model is deployed on an ml.c5.xlarge instance behind an Application Load Balancer (ALB). Recently, during a sales event, traffic spiked and the endpoint returned HTTP 503 errors. The team scaled the instance count from 2 to 5, but errors persisted. CloudWatch metrics show low CPU utilization (~30%) and high memory usage (~90%). The model loads a large dictionary file (2GB) into memory at startup. Which action should resolve the issue?

A.Enable auto-scaling with Spot instances.

B.Switch to a compute-optimized instance type like c5.2xlarge.

C.Increase the number of instances further to 10.

D.Use a memory-optimized instance type like r5.large.

AnswerD

r5.large has 16 GiB of memory vs 8 GiB for c5.xlarge, allowing the model to handle more requests.

Why this answer

Option D is correct because the high memory usage indicates the instance type does not have enough memory to handle concurrent requests. Using a memory-optimized instance like r5.large provides more memory per vCPU, reducing memory pressure and preventing OOM errors. Option B is incorrect because low CPU utilization suggests the bottleneck is not CPU.

Option C is incorrect because increasing instances does not help if each instance is memory-constrained. Option A is incorrect because using Spot instances may cause interruptions and does not fix the memory issue.

Practice this question →

129

MCQeasy

A data scientist wants to deploy a PyTorch model for real-time inference with low latency. Which AWS service should they use?

A.Amazon Elastic Container Service (ECS)

B.Amazon SageMaker batch transform

C.Amazon SageMaker real-time endpoint

D.AWS Lambda

AnswerC

Designed for low-latency inference.

Why this answer

Amazon SageMaker real-time endpoints provide low-latency inference for custom models. Option B is wrong because SageMaker batch transform is for offline processing. Option D is wrong because AWS Lambda has a 15-minute timeout and is not optimized for large model inference.

Option A is wrong because Amazon ECS requires more operational overhead.

Practice this question →

130

MCQmedium

A company is using Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is failing with an OutOfMemory error. The data scientist wants to minimize cost while resolving the issue. Which action should the data scientist take?

A.Increase the instance type to one with more memory.

B.Use the 'auto' setting for the input mode.

C.Reduce the batch size hyperparameter.

D.Change the input mode from 'File' to 'Pipe'.

AnswerD

Pipe mode streams data, reducing memory footprint.

Why this answer

The OutOfMemory error occurs because the 'File' input mode downloads the entire training dataset to the instance's local storage before training begins, consuming significant memory. Switching to 'Pipe' mode streams data directly from S3 to the training algorithm, reducing memory footprint and avoiding the need for larger instances. This minimizes cost by using the existing instance type while resolving the memory issue.

Exam trap

The trap here is that candidates may assume reducing the batch size (Option C) is the standard fix for memory issues, but they overlook that the 'File' input mode's full dataset download is the primary cause, and 'Pipe' mode directly addresses this without additional cost.

How to eliminate wrong answers

Option A is wrong because increasing the instance type to one with more memory would resolve the error but at a higher cost, contradicting the goal to minimize cost. Option B is wrong because the 'auto' setting for input mode does not exist in SageMaker; the valid input modes are 'File', 'Pipe', and 'FastFile', and 'auto' is not a recognized configuration. Option C is wrong because reducing the batch size hyperparameter may reduce memory usage per step but does not address the root cause of the dataset being fully loaded into memory in 'File' mode, and it could negatively impact model convergence or training time.

Practice this question →

131

MCQmedium

Refer to the exhibit. An administrator has attached this IAM policy to a user. The user tries to start a SageMaker training job that uses a custom Docker image from Amazon ECR. The training job fails with an access denied error. What is the MOST likely reason?

A.The s3:* action is too permissive and should be scoped.

B.The policy is missing ecr:GetDownloadUrlForLayer and ecr:BatchGetImage.

C.The iam:PassRole permission is missing the SageMaker service principal.

D.The sagemaker:* action should be restricted to specific resources.

AnswerB

ECR permissions are required to pull the custom image.

Why this answer

The policy grants s3:* on the bucket but does not include ecr:* permissions (Option B) needed to pull the Docker image. Option A (s3:* on objects) is too broad but not the issue. Option C (iam:PassRole on role) is allowed.

Option D (sagemaker:* on specific resources) is not the issue.

Practice this question →

132

MCQeasy

A data scientist runs the AWS CLI command shown in the exhibit. The output shows that job-2 failed. Which action should the data scientist take to diagnose the failure?

A.Check the CloudWatch Logs log group for job-2

B.Check the S3 bucket for any error logs uploaded by the training job

C.Run `aws sagemaker list-training-jobs --name-contains job-2` to get more details

D.Run `aws sagemaker describe-training-job --training-job-name job-2` to see the failure reason

AnswerD

DescribeTrainingJob includes a FailureReason field.

Why this answer

Option D is correct because the DescribeTrainingJob API provides detailed status and the failure reason. Option C (list-training-jobs) has already been used. Option A (CloudWatch Logs) is useful, but the first step is to get the failure reason from DescribeTrainingJob.

Option B (S3 logs) is not standard.
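The same lookup can be done from Python. The sketch below assumes a boto3 SageMaker client is passed in (created with `boto3.client("sagemaker")`); `describe_training_job` and the `TrainingJobStatus` and `FailureReason` response fields are part of the SageMaker API, while the helper name is illustrative.

```python
from typing import Optional


def get_failure_reason(sagemaker_client, job_name: str) -> Optional[str]:
    """Return the failure reason for a failed training job, else None."""
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    if response.get("TrainingJobStatus") == "Failed":
        # FailureReason is populated by SageMaker when the job fails.
        return response.get("FailureReason", "No FailureReason reported")
    return None
```

Passing the client as an argument keeps the function easy to exercise with a stub before pointing it at a real account.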

Practice this question →

133

MCQhard

A data scientist is training a deep learning model on SageMaker using a custom Docker container. The training job fails with an error indicating that the container exited with a non-zero status. The CloudWatch logs show 'FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/training/data.csv''. What is the most likely cause?

A.The container's Docker entry point is misconfigured.

B.The S3 data source path is incorrect or the data has not been uploaded.

C.The training script has a syntax error.

D.The model output path in the training job configuration is wrong.

AnswerB

Missing training data leads to FileNotFoundError.

Why this answer

Option B is correct because the error indicates the training data is missing at the expected path, which typically occurs when the S3 data source configuration is incorrect or the data is not uploaded. Option A is wrong because the error is about missing data, not the container entry point. Option C is wrong because the error is not related to training script syntax.

Option D is wrong because the error does not mention model artifacts.

Practice this question →

134

MCQhard

A data scientist attempts to create a SageMaker training job using the IAM policy shown in the exhibit. The training job fails with an access denied error. What is the most likely cause?

A.The S3 bucket policy does not grant access to the SageMaker service principal

B.The IAM policy is missing the sagemaker:DescribeTrainingJob permission

C.The IAM policy is missing the s3:ListBucket action

D.The IAM policy is missing the s3:PutObject permission

AnswerC

SageMaker needs ListBucket to read objects from the bucket.

Why this answer

Option C is correct: the policy allows s3:GetObject but not s3:ListBucket, which is required to read objects. Option B (missing sagemaker:DescribeTrainingJob) is not needed for creation. Option A (bucket policy not granting access to the SageMaker service principal) is not the most likely cause, since the training job reads S3 through the permissions in the IAM policy.

Option D (missing s3:PutObject) is not required for reading training data.
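A policy that covers both actions might look like the sketch below. The key detail is that `s3:ListBucket` applies to the bucket ARN while `s3:GetObject` applies to the objects under it. The helper name and the bucket name passed in are placeholders, since the exhibit's actual policy is not shown here.

```python
from typing import Dict


def build_s3_read_policy(bucket: str) -> Dict:
    """Build an IAM policy document that allows listing and reading a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is granted on the bucket itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # GetObject is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }
```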

Practice this question →

135

MCQhard

A company is using Amazon SageMaker to train a model using a custom Docker container. The training script writes model artifacts to the `/opt/ml/model` directory. The training job completes successfully, but the model artifacts are not uploaded to the S3 output path specified in the training job. The company has verified that the SageMaker execution role has the necessary S3 permissions. The Docker container is built using a base image that is not one of the official SageMaker Docker images. What is the MOST likely reason for the failure to upload model artifacts?

A.The training script's entry point is not correctly specified in the container.

B.The custom container does not include the SageMaker training toolkit, which handles artifact uploads.

C.The output path in the training job configuration is incorrectly formatted.

D.The SageMaker execution role does not have s3:PutObject permission on the output bucket.

AnswerB

Correct: Without the toolkit, SageMaker does not automatically upload artifacts.

Why this answer

When using a custom Docker container, SageMaker expects the container to have a training entry point that follows the SageMaker toolkit conventions. If the container does not include the SageMaker training toolkit, SageMaker cannot automatically upload artifacts. Option B is correct.

Option A (entry point) is not the issue because the script runs to completion. Option D (S3 permissions) has already been verified. Option C (output path) is not indicated as misconfigured in the scenario.

Practice this question →

136

MCQmedium

A company has a SageMaker endpoint that serves predictions for a mobile app. The endpoint is deployed on a single ml.m5.large instance. Recently, users have reported that the app sometimes returns outdated predictions. The data science team has confirmed that the model is updated daily by retraining with new data and creating a new endpoint configuration. However, the endpoint still returns predictions from the old model for some requests. The team has verified that the new endpoint configuration is associated with the endpoint and that the endpoint is in service. What is the most likely cause of this issue?

A.The old model artifacts are still being cached by the endpoint

B.The endpoint has multiple variants and the old variant still has a weight assigned

C.The mobile app is using a CDN that caches the predictions

D.The new endpoint configuration has not been deployed to the endpoint

AnswerB

If the old variant has a weight, it will continue to serve traffic. The new variant should get a weight of 1 and the old variant weight should be set to 0.

Why this answer

Option B is correct because if the old production variant still has a weight greater than zero, some traffic will continue to be routed to the old model. Option D is wrong because the team has verified that the new endpoint configuration is associated with the endpoint and that the endpoint is in service. Option A is wrong because model artifacts are stored in S3 and loaded per variant; stale artifact caching is not what keeps an old model serving traffic.

Option C is wrong because CloudFront does not cache SageMaker endpoint responses by default.
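To make the weight arithmetic concrete, here is a minimal sketch of how traffic is split between production variants: each variant receives its weight divided by the sum of all weights. The variant names are hypothetical, and the helper is only an illustration of the calculation, not part of any AWS SDK.

```python
from typing import Dict


def traffic_share(weights: Dict[str, float]) -> Dict[str, float]:
    """Return the fraction of requests each variant receives.

    Each variant gets weight / sum(weights), which is how traffic is
    distributed across production variants on one endpoint.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Old variant still weighted: half of the requests hit the old model.
    print(traffic_share({"old-model": 1.0, "new-model": 1.0}))
    # Old variant set to zero: all requests go to the new model.
    print(traffic_share({"old-model": 0.0, "new-model": 1.0}))
```

On a live endpoint, the weights themselves can be changed with the boto3 SageMaker client's update_endpoint_weights_and_capacities call, passing a DesiredWeightsAndCapacities list with a VariantName and DesiredWeight per variant.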

Practice this question →

137

MCQeasy

A data scientist is training a deep learning model on a large dataset using Amazon SageMaker. The training job is taking too long and the scientist wants to reduce the training time by distributing the workload across multiple GPUs. Which SageMaker feature should be used to achieve this?

A.Use SageMaker's distributed training libraries

B.Use Amazon EMR to distribute the training

C.Use SageMaker Automatic Model Tuning

D.Use SageMaker Hyperparameter Tuning

AnswerA

SageMaker provides built-in distributed training libraries that can split the workload across multiple GPUs.

Why this answer

Option A is correct because SageMaker's distributed training libraries enable efficient distribution of the workload across multiple GPUs. Option D is wrong because Hyperparameter Tuning is for optimizing hyperparameters, not for distributed training. Option C is wrong because Automatic Model Tuning is also for hyperparameter optimization.

Option B is wrong because Amazon EMR is for big data processing, not for deep learning training.

Practice this question →

138

MCQhard

A company uses SageMaker Pipelines to automate model retraining. The pipeline fails intermittently at the Preprocess step with a 'ResourceLimitExceeded' error. The team uses a ml.m5.xlarge instance. What is the most likely cause?

A.The account has reached the limit for concurrent ml.m5.xlarge instances

B.The preprocessing script has a memory leak

C.The S3 bucket has insufficient permissions

D.The pipeline execution role is missing the PassRole permission

AnswerA

ResourceLimitExceeded typically means hitting a service limit like concurrent instances.

Why this answer

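The error indicates that a service quota was reached. SageMaker enforces account-level quotas on how many instances of a given type, such as ml.m5.xlarge, can run concurrently, so the step fails intermittently when other jobs are already using that quota. Option A is correct.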

Option B would cause different errors. Option C is unrelated. Option D would cause a SageMaker service error, not a resource limit.

Practice this question →

139

MCQmedium

A company uses Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error: "Unable to write to /opt/ml/output/data". The data scientist checks the container and finds that the /opt/ml directory is not writable. What is the MOST likely cause?

A.The Docker image is built from a base image that does not have the required libraries.

B.The container runs as a non-root user that lacks write permissions to /opt/ml.

C.The SageMaker training job is configured with insufficient memory.

D.The training script is not copying the model to /opt/ml/model.

AnswerB

SageMaker mounts volumes as root by default; if the container runs as a different user, it may not have write access.

Why this answer

SageMaker expects the training container to write output to /opt/ml/model and /opt/ml/output. If the container does not have write permissions to these directories, the training fails. The most likely cause is that the Docker image was built with a non-root user that does not have write permissions to /opt/ml.

The correct fix is to ensure the container runs as root or grants write permissions.

Practice this question →

140

MCQmedium

An IAM policy is attached to a SageMaker execution role. A data scientist tries to create a training job using a custom algorithm stored in an ECR repository. The training job fails with an 'AccessDenied' error when pulling the Docker image from ECR. What is the missing permission?

A.ecr:GetDownloadUrlForLayer and ecr:BatchGetImage on the ECR repository

B.ecr:PutImage on the ECR repository

C.s3:GetObject on the ECR repository

D.sagemaker:CreateTrainingJob on the ECR resource

AnswerA

These permissions are required to pull a Docker image from ECR.

Why this answer

When SageMaker pulls a custom Docker image from ECR during training job creation, the execution role needs permissions to download the image layers. The required actions are ecr:GetDownloadUrlForLayer (to generate pre-signed URLs for each layer) and ecr:BatchGetImage (to retrieve image metadata and layer manifests). Without these, the 'AccessDenied' error occurs because SageMaker cannot authenticate or fetch the container image from the ECR repository.

Exam trap

The trap here is that candidates often confuse ECR pull permissions with S3 permissions (option C) or assume that the SageMaker CreateTrainingJob permission (option D) implicitly covers the ECR pull, when in fact the IAM role must explicitly grant the specific ECR read actions for image retrieval.

How to eliminate wrong answers

Option B is wrong because ecr:PutImage is used to push images into ECR, not to pull them; the training job only needs read access. Option C is wrong because s3:GetObject is an S3 permission, not an ECR permission; ECR uses its own API actions for image retrieval, not S3. Option D is wrong because sagemaker:CreateTrainingJob is a SageMaker API action that allows creating the training job itself, not the ECR pull operation; the error occurs at the ECR layer, not at the SageMaker API level.
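The pull permissions discussed above could be expressed roughly as follows. This is a sketch under stated assumptions: the repository ARN is a hypothetical placeholder, and the two extra actions (ecr:BatchCheckLayerAvailability and ecr:GetAuthorizationToken) go beyond what the question names; they are included because they commonly accompany image pulls.

```python
import json

# Hypothetical repository ARN, used only for illustration.
REPOSITORY_ARN = "arn:aws:ecr:us-east-1:111122223333:repository/example-training-image"


def build_ecr_pull_policy(repository_arn: str) -> dict:
    """Return an IAM policy document that allows pulling one ECR image."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    # The two actions named in the answer.
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                    # Assumption: not named in the question, but commonly
                    # granted together with the pull actions.
                    "ecr:BatchCheckLayerAvailability",
                ],
                "Resource": [repository_arn],
            },
            {
                # Assumption: registry authentication is account-wide,
                # so this action is granted on "*".
                "Effect": "Allow",
                "Action": ["ecr:GetAuthorizationToken"],
                "Resource": ["*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_ecr_pull_policy(REPOSITORY_ARN), indent=2))
```

Note that none of the actions is a write action such as ecr:PutImage; the execution role only needs read access to the repository.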

Practice this question →

141

MCQhard

A company is using Amazon Forecast for demand forecasting. The data includes time series data for multiple items. The company wants to ensure that the forecast is updated daily as new data arrives. Which approach should be used to automate this process?

A.Use AWS Lambda to invoke the Forecast CreateDatasetImportJob and CreatePredictorBacktestExportJob APIs on a daily schedule triggered by Amazon CloudWatch Events.

B.Use Amazon Kinesis Data Streams to stream new data directly into Forecast.

C.Use Amazon SageMaker to retrain the model daily and replace the forecast endpoint.

D.Enable the 'AutoPredictor' feature in Forecast to automatically update the predictor when new data arrives.

AnswerA

This automates data import and retraining.

Why this answer

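Option A is correct because an Amazon Forecast predictor can be refreshed with new data by importing it with the CreateDatasetImportJob API and then retraining the predictor, and a scheduled Lambda function automates these steps. Option B is wrong because Forecast does not support real-time streaming. Option C is wrong because Forecast is a managed service with its own training and forecasting APIs; retraining in SageMaker and replacing an endpoint does not update a Forecast forecast.

Option D is wrong because Forecast does not update the predictor automatically when new data arrives; there is no built-in scheduler, so custom automation is needed.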

Practice this question →

142

MCQhard

A company is using Amazon SageMaker to deploy a model for real-time inference. The endpoint receives variable traffic and the company wants to optimize cost while maintaining responsiveness. Which scaling policy should be used?

A.Target tracking scaling based on invocation count

B.Simple scaling with a cooldown period

C.Scheduled scaling

D.Manual scaling

AnswerA

Automatically adjusts to traffic.

Why this answer

Option A is correct because target tracking scaling based on invocation count is the best approach for variable traffic, as it adjusts capacity to demand. Option D is wrong because manual scaling requires constant monitoring. Option B is wrong because simple scaling with fixed steps may not adapt well.

Option C is wrong because scheduled scaling is for predictable traffic.
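A target tracking policy on invocation count might be set up roughly as sketched below, using the Application Auto Scaling client in boto3. The endpoint name, variant name, policy name, target value, and capacity bounds are all hypothetical, and the cooldown values are illustrative choices rather than recommendations.

```python
def variant_resource_id(endpoint_name: str, variant_name: str) -> str:
    """Build the resource ID that Application Auto Scaling uses for a variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def build_target_tracking_config(target_invocations_per_instance: float,
                                 scale_in_cooldown: int = 300,
                                 scale_out_cooldown: int = 60) -> dict:
    """Build a target tracking configuration on invocations per instance."""
    return {
        "TargetValue": float(target_invocations_per_instance),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        # Cooldowns are in seconds; the defaults here are assumptions.
        "ScaleInCooldown": scale_in_cooldown,
        "ScaleOutCooldown": scale_out_cooldown,
    }


def apply_policy(endpoint_name: str, variant_name: str, config: dict,
                 min_capacity: int = 1, max_capacity: int = 4) -> None:
    """Register the variant as a scalable target and attach the policy."""
    import boto3  # imported lazily so the builders above work without AWS access

    client = boto3.client("application-autoscaling")
    resource_id = variant_resource_id(endpoint_name, variant_name)
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    client.put_scaling_policy(
        PolicyName="invocations-target-tracking",  # hypothetical name
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=config,
    )
```

A call such as apply_policy("example-endpoint", "AllTraffic", build_target_tracking_config(70)) would ask the service to keep each instance near 70 invocations per minute, within the given capacity bounds.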

Practice this question →

143

MCQeasy

A company wants to use Amazon SageMaker to train a model using data that is updated daily. The training data is stored in an S3 bucket, and the team wants to automate the training process whenever new data arrives. Which AWS service should be used to trigger the SageMaker training job?

A.AWS Lambda triggered by S3 event notifications

B.Amazon CloudWatch Events

C.Amazon Simple Queue Service (SQS)

D.AWS Step Functions with a scheduled trigger

AnswerA

Lambda can be triggered by S3 events to start the training job.

Why this answer

Option A is correct because S3 event notifications can invoke a Lambda function, which can then start the SageMaker training job. Option B is wrong because CloudWatch Events can schedule events but is not directly triggered by S3 object creation. Option D is wrong because Step Functions can orchestrate the workflow, but a scheduled trigger does not react to the arrival of new data.

Option C is wrong because SQS is a queue service; it does not start a training job from S3 events without an additional consumer.
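The S3-to-Lambda-to-SageMaker flow could look roughly like the handler below. It is a sketch under stated assumptions: the image URI, role ARN, output location, instance settings, and job-name prefix are hypothetical placeholders, and only the first record of the S3 event is used.

```python
from urllib.parse import unquote_plus

# Hypothetical placeholders; replace with real values for your account.
TRAINING_IMAGE = "111122223333.dkr.ecr.us-east-1.amazonaws.com/example-training-image:latest"
ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole"
OUTPUT_S3_URI = "s3://example-model-artifacts/output/"


def s3_uri_from_event(event: dict) -> str:
    """Extract the S3 URI of the newly created object from an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])
    return f"s3://{bucket}/{key}"


def build_training_request(event: dict, job_name: str) -> dict:
    """Build the keyword arguments for the CreateTrainingJob API call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": TRAINING_IMAGE,
            "TrainingInputMode": "File",
        },
        "RoleArn": ROLE_ARN,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri_from_event(event),
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": OUTPUT_S3_URI},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def lambda_handler(event, context):
    """Lambda entry point: start one training job per S3 notification."""
    import time
    import boto3  # available in the Lambda Python runtime

    job_name = f"retrain-{int(time.time())}"  # job names must be unique
    boto3.client("sagemaker").create_training_job(
        **build_training_request(event, job_name)
    )
    return {"TrainingJobName": job_name}
```

Keeping the request construction in a separate function makes it possible to unit test the event parsing without calling AWS.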

Practice this question →

144

Multi-Selecteasy

A data engineer is building a data pipeline for a machine learning project using Amazon SageMaker. The raw data is stored in Amazon S3. Which TWO steps are essential to ensure data privacy and security before training? (Choose TWO.)

Select 2 answers

A.Create a bucket policy that restricts access to the data scientist's IAM role only

B.Enable versioning on the S3 bucket

C.Encrypt the data at rest using S3 server-side encryption

D.Use S3 Transfer Acceleration for faster uploads

E.Use Amazon SageMaker in a VPC and configure VPC endpoints to access S3 securely

AnswersC, E

Encryption protects data at rest.

Why this answer

Options C and E are correct because encrypting the data at rest and keeping access to S3 on a private, secured network path are essential. Option B (versioning) is optional. Option A is an access control measure, but encryption is more fundamental.

Option D is not a security measure.

Practice this question →

145

MCQeasy

A data scientist is training a linear regression model on a dataset with 100 features. The model shows high variance on the test set. Which action is MOST likely to reduce overfitting?

A.Use a more complex model like XGBoost

B.Increase the number of training iterations

C.Apply L2 regularization (Ridge regression)

D.Add more feature engineering to increase model complexity

AnswerC

L2 regularization penalizes large coefficients, reducing overfitting.

Why this answer

Option C is correct because adding L2 regularization (ridge regression) penalizes large coefficients and reduces model complexity, which helps with overfitting. Option D (adding more feature engineering) would increase variance. Option B (increasing training iterations) doesn't address overfitting.

Option A (using a more complex model) would worsen overfitting.
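The shrinking effect of the L2 penalty can be seen in a deliberately tiny case: a single feature with no intercept, where minimizing the squared error plus lambda times the squared coefficient has the closed-form solution w = sum(x*y) / (sum(x*x) + lambda). The data and lambda values below are made up for illustration; the 100-feature setting in the question would use a matrix form of the same idea.

```python
from typing import Sequence


def ridge_coefficient(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """Closed-form ridge solution for one feature and no intercept.

    Minimizes sum((y - w * x) ** 2) + lam * w ** 2 over w.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [2.0, 4.0, 6.0]
    # No penalty recovers the ordinary least squares slope.
    print(ridge_coefficient(xs, ys, lam=0.0))   # 2.0
    # A larger penalty pulls the coefficient toward zero.
    print(ridge_coefficient(xs, ys, lam=14.0))  # 1.0
```

Larger values of lambda trade a little bias for lower variance, which is the mechanism by which ridge regression reduces overfitting.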

Practice this question →

146

Multi-Selecteasy

Which TWO AWS services can be used to deploy a machine learning model for serverless inference? (Choose 2.)

Select 2 answers

A.Amazon SageMaker Serverless Inference

B.AWS Lambda

C.Amazon EMR

D.Amazon ECS with Fargate

E.AWS Batch

AnswersA, B

Serverless inference option.

Why this answer

SageMaker Serverless Inference (Option A) and AWS Lambda (Option B) both support serverless inference. Option D (Amazon ECS with Fargate) requires managing clusters. Option C (Amazon EMR) is for big data.

Option E (AWS Batch) is for batch computing.

Practice this question →

147

MCQmedium

A team is deploying a machine learning model to production using Amazon SageMaker. They want to automatically scale the endpoint based on the incoming request volume, and they also need to ensure that the endpoint can handle sudden bursts of traffic without dropping requests. Which scaling policy should they use?

A.Scheduled scaling policy for peak hours

B.Target tracking scaling policy based on the number of invocations

C.Simple scaling policy based on average latency

D.Manual scaling by monitoring CloudWatch alarms

AnswerB

Target tracking automatically adjusts capacity to maintain a target metric and can handle bursts.

Why this answer

Option B is correct because a target tracking scaling policy with a specified target value lets the endpoint automatically adjust capacity to keep the metric near that target, adding instances as request volume rises so that bursts can be absorbed. Option C is wrong because a simple scaling policy based on average latency may not handle bursts quickly. Option A is wrong because a scheduled scaling policy is for predictable traffic patterns.

Option D is wrong because manual scaling is not automatic.

Practice this question →

148

Multi-Selecteasy

A company is building a machine learning pipeline on AWS. The pipeline includes data ingestion, preprocessing, training, and deployment. Which THREE AWS services can be used to orchestrate the pipeline? (Choose THREE.)

Select 3 answers

A.AWS Glue Workflows

B.Amazon CloudWatch

C.Amazon SageMaker Pipelines

D.AWS Lambda

E.AWS Step Functions

AnswersA, C, E

Glue Workflows can orchestrate ETL jobs.

Why this answer

Options A, C, and E are correct because AWS Glue Workflows, SageMaker Pipelines, and Step Functions provide orchestration capabilities. Option D is wrong because Lambda is a compute service, not a workflow orchestrator. Option B is wrong because CloudWatch is for monitoring.

Practice this question →

149

MCQhard

A data scientist is using Amazon SageMaker Debugger to monitor training jobs. The training loss is decreasing but then suddenly spikes. What is the most likely cause and how should it be addressed?

A.Gradient explosion; apply gradient clipping.

B.Overfitting; apply regularization.

C.Learning rate too low; increase learning rate.

D.Vanishing gradients; use ReLU activation.

AnswerA

Gradient clipping limits the gradient magnitude.

Why this answer

Option A is correct because a sudden spike in loss after a steady decrease often indicates a gradient explosion. Gradient clipping prevents this. Option C is wrong because a learning rate that is too low causes slow convergence, not spikes.

Option B is wrong because overfitting shows decreasing training loss but increasing validation loss. Option D is wrong because vanishing gradients cause loss to plateau, not spike.
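Gradient clipping by norm can be sketched in a few lines. The version below works on a flat list of gradient values and is framework-independent; deep learning frameworks ship their own equivalents, and the threshold used here is an arbitrary illustrative value.

```python
import math
from typing import List, Sequence


def clip_by_norm(grads: Sequence[float], max_norm: float) -> List[float]:
    """Rescale gradients so their L2 norm does not exceed max_norm.

    Gradients whose norm is already within the limit are returned unchanged,
    so clipping only affects the unusually large updates that cause spikes.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    # Norm is 5.0, so the vector is scaled down to norm 1.0.
    print(clip_by_norm([3.0, 4.0], max_norm=1.0))
    # Norm is 0.5, so the vector is left as it is.
    print(clip_by_norm([0.3, 0.4], max_norm=1.0))
```

Clipping preserves the direction of the gradient and only limits its magnitude, which is why it suppresses loss spikes without otherwise changing the optimization path.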

Practice this question →

150

MCQeasy

A company is training a deep learning model on Amazon SageMaker. The training job is failing with an out-of-memory error. Which SageMaker feature should the company use to resolve this issue without changing the instance type?

A.Use SageMaker distributed training with model parallelism

B.Use SageMaker Savings Plans

C.Enable SageMaker Managed Spot Training

D.Use SageMaker Debugger to monitor memory usage

E.Enable SageMaker Profiler to profile memory

AnswerA

Model parallelism splits the model across multiple instances, reducing memory per instance.

Why this answer

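Option A is correct because SageMaker distributed training with model parallelism splits the model across devices or instances, reducing per-instance memory usage without changing the instance type. Option C is wrong because SageMaker Managed Spot Training can reduce costs but does not fix memory issues. Options D and E are wrong because Debugger and Profiler help observe memory usage but do not reduce it.

Option B is wrong because Savings Plans are about cost, not memory.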

Practice this question →
