Knowledge + Practice

CCNA ML Model Development Questions

59 of 134 questions · Page 2/2 · ML Model Development · Answers revealed

76

MCQhard

A model deployed on SageMaker is returning inaccurate predictions for certain customer segments. The team suspects data drift. Which SageMaker feature should they use to continuously monitor input data distribution?

A.SageMaker Clarify

B.SageMaker Debugger

C.SageMaker Model Monitor

D.SageMaker Feature Store

AnswerC

Model Monitor can track input data distributions and alert on drift.

Why this answer

SageMaker Model Monitor continuously monitors input data for drift, alerting when distributions change. Other features serve different purposes: Clarify for bias and explainability, Debugger for training, Feature Store for feature management.

77

MCQmedium

A data scientist is training a logistic regression model and wants to use L1 regularization to create a sparse model. Which parameter should be adjusted?

A.alpha

B.lambda

C.penalty

D.C (inverse of regularization strength)

AnswerC

Setting penalty='l1' enables L1 regularization, which induces sparsity.

Why this answer

In scikit-learn's LogisticRegression, setting the 'penalty' parameter to 'l1' selects L1 regularization. 'C' is the inverse of regularization strength; it controls how strong the penalty is, not which penalty is applied, so adjusting it without penalty='l1' does not give L1 regularization. 'alpha' and 'lambda' are regularization parameters in other estimators, such as scikit-learn's Lasso and Ridge models (alpha) or XGBoost (lambda), and are not parameters of LogisticRegression.
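
The effect can be seen in a minimal sketch, assuming a scikit-learn version that accepts the `penalty` argument. The synthetic dataset and the value of `C` are illustrative choices, not part of the question. Note that `penalty="l1"` needs a solver that supports it, such as `liblinear` or `saga`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative): 20 features, only 3 of them carry signal.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=3,
    n_redundant=0, random_state=0,
)

# penalty="l1" selects L1 regularization; C is the inverse strength,
# so a smaller C means a stronger penalty and a sparser model.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.02)
model.fit(X, y)

n_zero = int((model.coef_ == 0).sum())
print(f"{n_zero} of {model.coef_.size} coefficients are exactly zero")
```

Raising `C` weakens the penalty and typically leaves more coefficients non-zero.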

78

Multi-Selecthard

A company uses SageMaker to train a model. They want to ensure that training data is encrypted at rest and in transit, and that only authorized users can access the training artifacts. Which three steps should they take? (Choose three.)

Select 3 answers

A.Configure IAM policies to restrict access to SageMaker resources

B.Use SageMaker Model Monitor

C.Use a VPC with private subnets and VPC endpoints

D.Enable S3 server-side encryption for training data

E.Use SageMaker Network Isolation

AnswersA, C, D

Controls who can create, modify, and access SageMaker resources.

Why this answer

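Option A is correct because IAM policies allow you to define fine-grained permissions to control which users or roles can create, describe, or delete SageMaker resources (e.g., training jobs, endpoints). By restricting access via IAM, you ensure that only authorized principals can interact with training artifacts, such as model output in S3 or logs in CloudWatch. This directly addresses the requirement of limiting access to authorized users. Option D is correct because S3 server-side encryption encrypts the training data at rest. Option C is correct because running the job in a VPC with private subnets and VPC endpoints keeps traffic between SageMaker and S3 on the AWS network instead of the public internet, which supports the in-transit requirement.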

Exam trap

The trap here is that candidates often confuse network isolation (Option E) with encryption or access control, but network isolation only restricts network connectivity, not data encryption or authorization.

79

MCQhard

A data scientist trained a logistic regression model on a dataset with 100 features. After training, the training accuracy is 0.99 but validation accuracy is 0.75. Which action is MOST likely to reduce overfitting?

A.Increase the number of features

B.Increase the regularization strength

C.Use a more complex model like XGBoost

D.Use stratified cross-validation

AnswerB

Stronger regularization (e.g., higher L2 penalty) shrinks coefficients and reduces overfitting.

Why this answer

The model shows high training accuracy (0.99) but significantly lower validation accuracy (0.75), which is a classic sign of overfitting. Increasing the regularization strength (e.g., L1 or L2 penalty) in logistic regression directly penalizes large coefficients, reducing the model's complexity and improving generalization. This is the most direct way to address overfitting in a logistic regression model.

Exam trap

AWS often tests the misconception that adding more data or using more complex models always improves performance, but here the correct answer is to increase regularization strength, which directly counters overfitting in a logistic regression model.

How to eliminate wrong answers

Option A is wrong because increasing the number of features would give the model more parameters to fit the training data even more closely, worsening overfitting rather than reducing it. Option C is wrong because using a more complex model like XGBoost would increase the model's capacity to memorize noise, which typically exacerbates overfitting unless accompanied by strong regularization or pruning. Option D is wrong because stratified cross-validation ensures class distribution balance across folds but does not directly reduce overfitting; it improves the reliability of validation metrics but does not change the model's tendency to overfit.

80

MCQeasy

A company is deploying a real-time inference endpoint for a natural language processing model using Amazon SageMaker. The model is a fine-tuned BERT variant. The endpoint has been running for two weeks with acceptable latency (average 200 ms). However, over the past 24 hours, the latency has increased to an average of 800 ms, and the number of simultaneous requests has doubled. The team expects traffic to continue to grow. The current endpoint configuration uses a single ml.m5.large instance. The model is loaded into memory once, and the inference framework is PyTorch. The team needs to maintain latency under 500 ms. Which course of action should the team take to address the latency increase while minimizing cost?

A.Switch to ml.c5.large instances because CPU-optimized instances provide better inference performance for NLP models.

B.Increase the instance size to ml.m5.xlarge and keep a single instance.

C.Enable automatic scaling for the endpoint with a target average latency of 500 ms and use multiple ml.m5.large instances.

D.Implement a multi-model endpoint with multiple ml.m5.large instances and use Amazon Elastic Inference (EI) accelerators.

AnswerC

Correct: Auto scaling adds instances based on latency, distributing load and maintaining under 500 ms, and minimizes cost by scaling only when needed.

Why this answer

With increased traffic, a single instance is overloaded. Auto scaling with a latency target dynamically adds instances to handle load, keeping latency under the 500 ms requirement. Option A switches instance family but doesn't address scaling; Option B scales up but keeps a single instance, which adds no redundancy and will not keep pace as traffic grows; Option D suggests a multi-model endpoint, which is for hosting multiple models, not scaling a single model, and EI may not be cost-effective.

Therefore C is correct.
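
A latency-based scaling setup could be sketched as the two request payloads below, which follow the shape of the Application Auto Scaling `RegisterScalableTarget` and `PutScalingPolicy` requests. The endpoint name, variant name, capacity bounds, and cooldowns are assumptions made for illustration. The sketch only builds the configuration and does not call AWS.

```python
import json

# Hypothetical names, for illustration only.
ENDPOINT_NAME = "bert-nlp-endpoint"
VARIANT_NAME = "AllTraffic"

# Scalable target: the variant's instance count, bounded to control cost.
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 4,  # assumed upper bound
}

# Target-tracking policy on average ModelLatency.
# CloudWatch reports ModelLatency in microseconds.
scaling_policy = {
    "PolicyName": "latency-target-tracking",  # hypothetical policy name
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 500 * 1000,  # 500 ms expressed in microseconds
        "CustomizedMetricSpecification": {
            "MetricName": "ModelLatency",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [
                {"Name": "EndpointName", "Value": ENDPOINT_NAME},
                {"Name": "VariantName", "Value": VARIANT_NAME},
            ],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,   # assumed cooldowns, in seconds
        "ScaleOutCooldown": 60,
    },
}

print(json.dumps(scaling_policy, indent=2))
```

Bounding `MaxCapacity` is what keeps the cost side of the requirement under control: instances are added only while the latency target is exceeded.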

81

MCQmedium

During model training on Amazon SageMaker, the training job fails with a 'ResourceLimitExceeded' error. What is the most likely cause?

A.The algorithm's learning rate is too high

B.The dataset is too large for the instance

C.The training script has a syntax error

D.The account's instance limit for the chosen instance type has been reached

AnswerD

ResourceLimitExceeded indicates the account has exceeded the allowed number of instances for that instance type.

Why this answer

The ResourceLimitExceeded error typically indicates that the AWS account has reached its service limit for the specified instance type. A syntax error would cause a different error (e.g., ClientError). A dataset too large might cause an out-of-memory error but not this specific error.

Learning rate does not cause resource limits.

82

MCQeasy

A team wants to track and compare multiple machine learning experiments, including hyperparameters, metrics, and artifacts. They are using Amazon SageMaker. Which AWS service or feature should they use to achieve this?

A.AWS CloudTrail

B.Amazon SageMaker Experiments

C.Amazon SageMaker Model Registry

D.Amazon SageMaker Studio

AnswerB

Experiments is the correct service for tracking.

Why this answer

Amazon SageMaker Experiments is the correct service because it is specifically designed to track and compare machine learning experiments, including hyperparameters, metrics, and artifacts. It provides a structured way to log, organize, and analyze multiple runs, enabling teams to identify the best-performing model configurations.

Exam trap

The trap here is that candidates confuse SageMaker Studio (the IDE) with SageMaker Experiments (the tracking service), assuming Studio alone provides experiment tracking, but Studio is merely the interface that can visualize experiment data stored by Experiments.

How to eliminate wrong answers

Option A is wrong because AWS CloudTrail records API activity for auditing and governance, not for tracking ML experiment metadata like hyperparameters or metrics. Option C is wrong because Amazon SageMaker Model Registry is used for cataloging and managing approved model versions, not for tracking the iterative experiments that produce them. Option D is wrong because Amazon SageMaker Studio is an integrated development environment (IDE) for ML workflows; while it can display experiment data, it is not the service that tracks experiments itself.

83

MCQhard

Refer to the exhibit. A data scientist configured SageMaker Debugger to monitor training for overfitting. However, the rule never triggers even though the model appears to be overfitting. What is the most likely reason?

A.The debug hook is not collecting the validation loss

B.The instance type for the rule is too small

C.The S3 output path is not writable

D.The rule evaluator image is incorrect

AnswerA

The hook only collects 'losses' and 'gradients', lacking a validation loss collection needed to detect overfitting.

Why this answer

The DebugHookConfig only collects losses and gradients. The overfitting rule likely compares training loss to validation loss, but validation loss is not being collected. Without a collection for validation loss (e.g., validation:loss), the rule cannot evaluate the condition for overfitting.

The rule evaluator image, instance type, and S3 path are less likely causes; the image is a placeholder but might be valid.

84

MCQhard

A model has high training accuracy but low validation accuracy. Which action is least likely to reduce overfitting?

A.Use dropout

B.Increase regularization strength

C.Add more training data

D.Increase model complexity

AnswerD

Increasing complexity makes the model more prone to overfitting.

Why this answer

Increasing model complexity (e.g., adding more layers or parameters) makes the model more flexible, which typically exacerbates overfitting by allowing it to memorize noise in the training data. Since the goal is to reduce overfitting, this action is counterproductive and therefore the least likely to help.

Exam trap

AWS often tests the misconception that 'more complex models always perform better,' leading candidates to incorrectly select increasing model complexity as a solution to overfitting rather than recognizing it as a cause.

How to eliminate wrong answers

Option A is wrong because dropout randomly deactivates neurons during training, which forces the network to learn redundant representations and reduces co-adaptation, directly combating overfitting. Option B is wrong because increasing regularization strength (e.g., L1/L2 penalty) adds a cost for large weights, shrinking the hypothesis space and preventing the model from fitting noise. Option C is wrong because adding more training data provides the model with more diverse examples, reducing the chance of memorizing spurious patterns and improving generalization.

85

MCQmedium

A data scientist is training a binary classification model using Amazon SageMaker. The dataset has a severe class imbalance (95% negative, 5% positive). The model achieves 99% accuracy but fails to identify positive cases correctly. Which action should the data scientist take to improve the model's ability to detect positive cases?

A.Switch to a logistic regression model with balanced class weights.

B.Use accuracy as the evaluation metric and retrain the model.

C.Apply SMOTE (Synthetic Minority Over-sampling Technique) to the training data.

D.Use the F1 score as the evaluation metric and adjust the classification threshold based on the precision-recall curve.

AnswerD

F1 score and threshold tuning directly address the imbalance.

Why this answer

Option D is correct because in a severely imbalanced dataset (95% negative, 5% positive), accuracy is misleading. The F1 score balances precision and recall, and adjusting the classification threshold based on the precision-recall curve allows the model to prioritize recall for the minority class, directly improving detection of positive cases. This approach is recommended in SageMaker when using built-in algorithms or custom models with imbalanced data.

Exam trap

The trap here is that candidates often think oversampling (SMOTE) or changing the model type is the primary fix, but the exam tests understanding that evaluation metrics and threshold tuning are critical for imbalanced classification, not just data preprocessing.

How to eliminate wrong answers

Option A is wrong because switching to logistic regression with balanced class weights may help, but it is not the best action; the question asks for a single action to improve detection, and adjusting the threshold and metric (D) is more direct and effective than changing the model type. Option B is wrong because using accuracy as the evaluation metric will continue to favor the majority class and fail to reflect poor positive detection, reinforcing the original problem. Option C is wrong because applying SMOTE to the training data can introduce synthetic samples, but it does not address the need to evaluate and tune the model's decision threshold; SMOTE alone may not fix the detection issue if the threshold remains at 0.5.
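
Threshold tuning against F1 can be illustrated with a small, self-contained sketch. The labels, scores, and candidate thresholds are made up for the example; in practice the scores would come from the trained model on a validation set.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Compute precision, recall, and F1 at a given decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up validation data: 2 positives among 10 examples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.35, 0.30, 0.45]

candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
for t in candidates:
    p, r, f1 = precision_recall_f1(y_true, scores, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Pick the candidate threshold with the highest F1.
best = max(candidates, key=lambda t: precision_recall_f1(y_true, scores, t)[2])
```

With these numbers the default threshold of 0.5 flags no positives at all, while a lower threshold recovers both positives at the cost of one false positive.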

86

MCQmedium

A data scientist is training a deep learning model on Amazon SageMaker and notices that the training loss decreases but the validation loss starts increasing after a certain number of epochs. The model is likely overfitting. Which SageMaker feature can they use to detect and diagnose this issue during training?

A.SageMaker Model Monitor

B.SageMaker Automatic Model Tuning

C.SageMaker Experiments

D.SageMaker Debugger

AnswerD

SageMaker Debugger provides built-in rules such as OverfitRule to monitor training and detect issues like overfitting in real time.

Why this answer

SageMaker Debugger includes built-in rules like OverfitRule that monitor training and validation metrics during training and emit alerts when overfitting is detected. SageMaker Experiments tracks runs but does not diagnose, Model Monitor is for inference, and Automatic Model Tuning optimizes hyperparameters.

87

MCQmedium

Refer to the exhibit. A data scientist receives the above error when running a SageMaker training job. Which action will resolve the issue?

A.Change the training instance type to ml.m5.xlarge

B.Add an S3 bucket policy granting s3:GetObject to the SageMaker role

C.Use s3g:// instead of s3:// in the data source URI

D.Increase the volume size in the ResourceConfig

AnswerB

Granting the missing permission allows SageMaker to read the training data.

Why this answer

Option B is correct because the error indicates the SageMaker execution role lacks s3:GetObject permission on the training data. Granting that permission to the role resolves the issue. Option A (changing instance type) is unrelated.

Option D (increase volume size) does not affect S3 access. Option C (s3g://) is an invalid S3 URI scheme.
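
As a rough illustration, the missing permission could be expressed as a policy statement like the one below. This is a sketch of an identity-based policy attached to the execution role; the bucket name and prefix are hypothetical, and a real setup may need further actions that the exhibit does not show.

```python
import json

BUCKET = "example-training-bucket"  # hypothetical bucket name
PREFIX = "train/"                   # hypothetical key prefix

# Policy document allowing the role to read training objects.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```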

88

MCQmedium

A financial services company uses SageMaker to train a fraud detection model. They have imbalanced data with 1% fraud. They trained a Gradient Boosting model using SMOTE for oversampling and achieved 99% accuracy on the test set, but the fraud recall is only 10%. The data scientist is concerned about the model's performance. Which change is most likely to improve fraud recall without sacrificing too much precision?

A.Use a different evaluation metric like F1-score during training.

B.Increase the weight of the fraud class in the loss function.

C.Reduce the SMOTE sampling ratio to create more synthetic samples.

D.Use a random undersampling of the majority class.

AnswerB

Correct: Class weighting focuses the model on the minority class, improving recall.

Why this answer

Option B is correct because increasing the weight of the fraud class in the loss function penalizes misclassifications of fraud more, improving recall. Option C is wrong because reducing the SMOTE ratio (i.e., less oversampling) would likely reduce recall. Option A is wrong because using F1-score as a metric does not change the training objective.

Option D is wrong because random undersampling may lose important majority class data, reducing precision.
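
The effect of class weighting on the loss can be sketched in a few lines. The function below is a hypothetical helper written for this example, not a SageMaker or library API, and the weight of 99 is an assumption derived from the 99:1 class ratio in the question.

```python
import math


def weighted_log_loss(y_true, probs, pos_weight=1.0):
    """Mean binary cross-entropy with extra weight on the positive class."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1 - p)
    return total / len(y_true)


# Made-up example: the model scores every case as unlikely fraud,
# so the single fraud case (label 1) is missed.
y_true = [0, 0, 0, 1]
probs = [0.1, 0.1, 0.1, 0.1]

unweighted = weighted_log_loss(y_true, probs)
weighted = weighted_log_loss(y_true, probs, pos_weight=99.0)
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

With the weight applied, the missed fraud case dominates the loss, so the optimizer is pushed to raise its predicted probability.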

89

MCQmedium

An ML team is developing a regression model using Amazon SageMaker. They have a 100 GB CSV dataset stored in Amazon S3. The data is contained in a single large file. They launch a SageMaker training job with an ml.p3.8xlarge instance using a custom Docker container. The training script loads the data using pandas' read_csv from S3 directly. The team observes that the training job takes over 24 hours, and CloudWatch metrics show: GPU utilization is consistently above 90%, but CPU utilization is below 30%. Network I/O is moderate, and disk I/O is low. The team has already tried switching to a larger instance type (ml.p3.16xlarge) with no significant improvement. They need to reduce training time. Which action is MOST likely to achieve this?

A.Use SageMaker Pipe Mode to stream data directly from S3 to the algorithm, bypassing the local file system.

B.Split the CSV file into multiple smaller files (e.g., 100 MB each) and update the training script to read from a list of files in S3.

C.Use Amazon SageMaker Managed Spot Training to reduce cost, then use the savings to rent a larger instance.

D.Increase the number of training instances by using a distributed training configuration with Horovod.

AnswerB

This allows SageMaker to parallelize data loading across multiple instances or even multiple processes within one instance, improving I/O throughput.

Why this answer

The bottleneck is data loading. The single large CSV file prevents parallelism; SageMaker's Pipe mode streams data directly to the algorithm, but custom containers must support it. However, a simpler and effective approach is to split the data into multiple smaller files, enabling SageMaker's distributed data loading across instances and improving I/O parallelism.

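Option D is wrong because increasing the instance count while the data stays in a single file doesn't help: each instance still reads the same file. Switching to a larger instance type has already been tried without improvement. Option C is wrong because Managed Spot Training reduces cost, not training time.

EBS volume size is not a factor here, since disk I/O is low.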
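
Splitting can be done in a streaming fashion so the large file is never held in memory. The sketch below works on an in-memory sample to stay self-contained; `split_csv` is a hypothetical helper, and writing each shard to its own S3 object is left out.

```python
import csv
import io


def _to_csv(header, rows):
    """Serialize a header and rows into one CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def split_csv(source, rows_per_shard):
    """Yield CSV shards, each repeating the header row.

    Rows are read one at a time, so only one shard is buffered.
    """
    reader = csv.reader(source)
    header = next(reader)
    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) == rows_per_shard:
            yield _to_csv(header, rows)
            rows = []
    if rows:  # final, possibly smaller shard
        yield _to_csv(header, rows)


# Tiny stand-in for the 100 GB file.
sample = io.StringIO("x,y\n1,2\n3,4\n5,6\n7,8\n9,10\n")
shards = list(split_csv(sample, rows_per_shard=2))
print(len(shards), "shards")
```

Repeating the header in every shard keeps each file readable on its own, which matters when files are read independently by different workers.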

90

MCQmedium

A trained model needs to be deployed for real-time inference with low latency. Which AWS service is best suited for this?

A.SageMaker Batch Transform

B.SageMaker endpoints

C.SageMaker Hyperparameter Tuning

D.AWS Lambda with model packaged

AnswerB

Endpoints are designed for real-time inference with automatic scaling and low latency.

Why this answer

SageMaker endpoints provide managed, scalable, and low-latency real-time inference. Batch Transform is for offline inference, Hyperparameter Tuning is for training, and Lambda is for serverless but lacks native ML optimizations.

91

MCQhard

A financial services company is training a large natural language processing (NLP) model using PyTorch on a SageMaker distributed training job. The cluster consists of 4 ml.p3.16xlarge instances (8 GPUs each). The training job runs successfully but takes 72 hours, exceeding the allotted 48-hour window. The team must reduce training time without sacrificing model quality. The model architecture has 1.5 billion parameters and currently uses the SageMaker data parallel library with Horovod for all-reduce. Observing CloudWatch metrics, the team notices that GPU utilization averages only 45% and network throughput is near maximum. Which action will most effectively reduce training time?

A.Enable Elastic Fabric Adapter (EFA) for faster inter-node connectivity.

B.Increase the batch size to improve GPU utilization.

C.Increase the number of instances from 4 to 8 to add more GPUs.

D.Switch to SageMaker model parallel library with pipeline parallelism to reduce communication overhead.

AnswerD

Model parallelism partitions the model across devices, reducing communication volume and improving utilization.

Why this answer

Option D is correct because with low GPU utilization and high network bandwidth consumption, the bottleneck is likely communication overhead. Model parallelism splits the model across GPUs, reducing the need for frequent all-reduce of large gradients, thus improving GPU utilization. Option C is wrong because increasing instance count would increase communication overhead and likely not improve utilization.

Option B is wrong because data parallelism already uses GPUs; increasing batch size may cause memory overflow. Option A is wrong because enabling EFA improves network, but network is already near maximum; the bottleneck is not network speed but the frequency of communication.

92

MCQhard

A data science team is using Amazon SageMaker Pipelines to orchestrate a multi-step workflow that includes data preprocessing, training, and model evaluation. They want to reuse the preprocessed data across multiple pipeline executions without re-running the preprocessing step if the source data hasn't changed. What should they configure?

A.Use SageMaker Training steps with checkpointing

B.Use SageMaker Processing steps with caching

C.Use SageMaker Feature Store to store the preprocessed features

D.Use SageMaker Data Wrangler for the preprocessing

AnswerB

Caching in SageMaker Pipelines reuses step outputs when inputs are identical, avoiding redundant computation.

Why this answer

SageMaker Pipelines supports step caching, which allows reusing the output of a step if its inputs and parameters are unchanged. SageMaker Feature Store is for feature storage and serving, not for pipeline step reuse. Checkpointing is for training resumption, not step caching.

Data Wrangler can perform the preprocessing, but it does not decide whether a pipeline step is re-run; caching is a Pipelines feature configured on the step.

93

MCQmedium

A data scientist is training a large model on SageMaker and wants to reduce training time by using multiple GPUs. The model is small enough to fit on a single GPU but training is slow. Which SageMaker feature should be used?

A.Data parallelism using SageMaker's Distributed Data Parallel

B.Use a larger instance with more vCPUs

C.Model parallelism using SageMaker's Model Parallel

D.Use Elastic Inference

AnswerA

Data parallelism distributes the training across multiple GPUs, reducing training time for models that fit on a single GPU.

Why this answer

Option A is correct because SageMaker's Distributed Data Parallel (DDP) replicates the model across multiple GPUs and splits mini-batches, speeding up training for models that fit on a single GPU. Option C (Model Parallel) is for models too large for one GPU. Option B (larger instance) may provide more vCPUs but not necessarily more GPUs.

Option D (Elastic Inference) accelerates inference, not training.
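
The idea behind data parallelism can be shown with a toy simulation in plain Python. This is not the SageMaker library; the one-parameter model, the data, and the two "workers" are assumptions chosen so the arithmetic is easy to follow.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up mini-batch of (x, y) pairs and a starting weight.
batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 1.0

# Single device: gradient over the full mini-batch.
full_grad = gradient(w, batch)

# Data parallelism: each worker holds a replica of the model
# and computes the gradient on an equal-sized shard of the batch.
shards = [batch[:2], batch[2:]]
worker_grads = [gradient(w, shard) for shard in shards]

# All-reduce step: average the per-worker gradients.
avg_grad = sum(worker_grads) / len(worker_grads)

print(full_grad, avg_grad)
```

Because the shards are the same size, the averaged gradient matches the single-device gradient, so each step does the same update while the per-device work is halved.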

94

MCQhard

A data science team at a financial services company is deploying a real-time fraud detection model using Amazon SageMaker. The model is a gradient boosting classifier trained on historical transaction data. The model is deployed to a SageMaker endpoint with an ml.m5.large instance for real-time inference. After deployment, the team observes that the endpoint's latency spikes to over 2 seconds during peak hours (10:00-12:00 and 14:00-16:00), causing timeouts for client applications. The average latency during off-peak hours is 200 ms. The team has enabled auto-scaling with a target average CPU utilization of 70%, but the endpoint still experiences high latency during peak hours. The instance count never scales beyond 2 instances during peaks. The model size is 500 MB, and each request includes 200 features. The team needs to reduce latency to under 500 ms at the 99th percentile during peak hours without increasing costs beyond the current budget. Which course of action should the team take?

A.Configure SageMaker batch transform for the real-time endpoint to process requests asynchronously.

B.Increase the auto-scaling maximum instance count to 10 and set target CPU utilization to 50%.

C.Switch the endpoint instance type to a GPU instance such as ml.g4dn.xlarge to accelerate inference.

D.Enable data compression on the endpoint to reduce payload size and network latency.

AnswerC

GPU instances can accelerate inference for gradient boosting models by parallelizing computations, reducing per-request latency significantly.

Why this answer

Option C is correct because GPU instances like ml.g4dn.xlarge are optimized for compute-intensive workloads such as gradient boosting inference, which involves numerous matrix operations. By offloading the computation to the GPU, the model can process each request faster, reducing latency from over 2 seconds to under 500 ms at the 99th percentile without increasing the instance count or budget. This directly addresses the root cause—CPU-bound inference during peak hours—while keeping costs stable.

Exam trap

The trap here is that candidates assume auto-scaling or instance count adjustments will solve latency issues, but the real bottleneck is per-instance compute capacity, which GPU acceleration directly addresses without increasing costs.

How to eliminate wrong answers

Option A is wrong because SageMaker batch transform is designed for offline, asynchronous processing of large datasets, not for real-time inference; it would introduce unacceptable delays and cannot meet the sub-500 ms latency requirement. Option B is wrong because increasing the maximum instance count to 10 and lowering CPU target to 50% would significantly increase costs (more instances running) and still not guarantee sub-500 ms latency if each instance is CPU-bound; the current scaling limit of 2 instances suggests the bottleneck is per-instance compute capacity, not scaling policy. Option D is wrong because data compression reduces payload size and network latency, but the primary latency spike is due to compute time (model inference), not network transfer; the 500 MB model and 200 features are already moderate, and compression would offer minimal improvement for the compute-bound bottleneck.

95

MCQmedium

A machine learning engineer is training a deep learning model on SageMaker and notices that the training loss decreases rapidly in the first few epochs but then plateaus. The validation loss starts increasing after 10 epochs. Which action should the engineer take to improve generalization?

A.Add more layers to the model

B.Use early stopping with validation loss monitoring

C.Increase the learning rate

D.Decrease the batch size

AnswerB

Early stopping halts training when validation loss stops decreasing, reducing overfitting.

Why this answer

Early stopping is the correct action because a validation loss that starts rising after 10 epochs, while the training loss is not getting worse, is a classic sign of overfitting. By monitoring validation loss and halting training when it stops improving (e.g., using a patience parameter), the engineer prevents the model from memorizing noise in the training data, thereby improving generalization. This can be implemented in the training script, for example with the `EarlyStopping` callback in Keras or PyTorch Lightning, or with a manual patience check in a plain PyTorch training loop.

Exam trap

AWS often tests the distinction between underfitting and overfitting symptoms, and the trap here is that candidates mistake a plateauing training loss for a need to increase model complexity or learning rate, when the rising validation loss clearly signals overfitting that early stopping can mitigate.

How to eliminate wrong answers

Option A is wrong because adding more layers increases model capacity, which would exacerbate overfitting and likely cause validation loss to rise even sooner, not improve generalization. Option C is wrong because increasing the learning rate would make training more unstable, potentially causing the loss to diverge or oscillate, and would not address the overfitting indicated by the rising validation loss. Option D is wrong because decreasing the batch size introduces more noise into gradient estimates, which can sometimes help escape local minima but does not directly prevent overfitting; it may even slow convergence and does not target the core issue of validation loss increasing.
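
A patience-based early stopping check can be sketched without any framework. The function name, the patience value, and the validation losses below are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) using 1-based epoch numbers.

    Training stops once the validation loss has failed to improve
    by more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation curve: improves for 5 epochs, then rises.
val_losses = [0.90, 0.70, 0.58, 0.52, 0.50, 0.51, 0.53, 0.57]
stop, best = early_stopping_epoch(val_losses, patience=2)
print(f"stop at epoch {stop}, restore weights from epoch {best}")
```

In a real training loop the model weights from the best epoch would be checkpointed and restored after stopping.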

96

Multi-Selectmedium

A data scientist is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (95% negative class, 5% positive class). The model is evaluated on a held-out test set, and the F1 score is 0.12. The data scientist wants to improve the F1 score. Which two actions should the data scientist take? (Choose two.)

Select 2 answers

A.Reduce the model complexity by decreasing the number of layers in a deep neural network.

B.Apply SMOTE (Synthetic Minority Oversampling Technique) to the training data using a preprocessing script in SageMaker Processing.

C.Increase the decision threshold to reduce false positives.

D.Use recall as the primary evaluation metric instead of F1.

E.Set the `scale_pos_weight` parameter in the SageMaker XGBoost estimator to the ratio of negative to positive samples.

AnswersB, E

Correct: SMOTE generates synthetic samples of the minority class, balancing the dataset and improving F1.

Why this answer

Setting scale_pos_weight in XGBoost adjusts the loss function to penalize misclassification of the minority class more heavily, improving recall and F1. SMOTE oversamples the minority class to balance the dataset. Option A reduces model capacity, which may not help; Option D changes the metric but doesn't fix the problem; Option C increases the threshold and likely reduces recall further.
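
The value for `scale_pos_weight` follows directly from the class counts. The sketch below computes it for the 95/5 split in the question; the helper function and the hyperparameter dictionary are illustrative, and passing the dictionary to an estimator is not shown.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive samples."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    if positives == 0:
        raise ValueError("no positive samples in labels")
    return negatives / positives


# Stand-in labels matching the 95% / 5% split.
labels = [0] * 95 + [1] * 5

# Hyperparameters as they might be handed to an XGBoost training job.
hyperparameters = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight(labels),
}
print(hyperparameters)
```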

97

Multi-Selecthard

Which THREE steps should be taken to optimize a large-scale distributed training job on SageMaker? (Choose 3.)

Select 3 answers

A.Attach multiple EBS volumes with throughput provisioning.

B.Use GPU instances with high bandwidth and memory (e.g., ml.p4d.24xlarge).

C.Enable batch transform for offline inference after training.

D.Use Elastic Fabric Adapter (EFA) for low-latency inter-node communication.

E.Select the appropriate distributed training strategy (e.g., Horovod, SageMaker data parallel, or model parallel).

AnswersB, D, E

GPU instances are necessary for large model training.

Why this answer

Options B, D, and E are correct. Using EFA (Elastic Fabric Adapter) reduces network latency, choosing the right distribution strategy (e.g., data parallelism vs model parallelism) improves scaling, and using GPU-optimized instances provides high compute. Option A is wrong because attaching additional EBS volumes does not directly help distributed training performance.

Option C is wrong because batch transform is for inference, not training.

98

MCQeasy

A company is building a recommendation system and has trained a matrix factorization model using SageMaker. They want to evaluate the model's performance using precision at k (P@k) and recall at k (R@k). They have a test set of user-item interactions. The data scientist implements a custom evaluation script that computes these metrics, but the precision values are consistently zero. What is the most likely cause?

A.The model outputs are not being ranked correctly.

B.The model is overfitting.

C.The test set contains only positive interactions.

D.The k value is too large.

AnswerC

Correct: Without negative examples, precision is undefined or zero if no test items are in the recommendation list.

Why this answer

Option C is correct because if the test set contains only positive interactions (items the user interacted with), then there are no negative examples. In precision at k, if the recommended items do not overlap with the test set items (which is likely), precision will be zero. Options A and D are incorrect because ranking errors or a large k value would not cause consistently zero precision unless there is no overlap at all.

Option B is incorrect because overfitting would cause high training accuracy, not zero precision.
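
To make the metric concrete, here is a minimal sketch of precision at k and recall at k for one user. The function name and the sample item IDs are hypothetical; it only illustrates that both metrics drop to zero when the top-k list and the held-out items do not overlap.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute P@k and R@k for one user.

    recommended: ranked list of item IDs (best first)
    relevant:    set of held-out items the user actually interacted with
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# One of the top-2 recommendations is in the held-out set.
p, r = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "x"}, k=2)

# No overlap at all: both metrics are zero.
p_zero, r_zero = precision_recall_at_k(["a", "c"], {"x"}, k=2)
```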

99

Multi-Selecthard

A data scientist is developing a gradient boosting model and observes that the model is overfitting to the training data. Which three techniques can help reduce overfitting? (Select THREE.)

Select 3 answers

A.Reduce the learning rate

B.Apply early stopping

C.Increase the maximum depth of trees

D.Increase the regularization parameters (e.g., lambda, alpha)

E.Add subsampling of data or features

F.Increase the number of trees

AnswersA, B, E

A lower learning rate makes the model more robust and reduces overfitting.

Why this answer

Reducing the learning rate slows down learning and helps reduce overfitting. Subsampling (data/feature sampling) adds randomness and reduces overfitting. Early stopping stops training before overfitting occurs.

Increasing the number of trees or tree depth increases model complexity, worsening overfitting. Increasing regularization parameters (like lambda, alpha) also helps reduce overfitting, but the three most common for gradient boosting are reducing learning rate, subsampling, and early stopping.
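
Early stopping can be sketched in a few lines. This is an illustration of the general idea rather than any library's implementation: the function name, the patience value, and the validation losses are all hypothetical.

```python
def early_stopping_round(val_losses, patience):
    """Return the index of the best boosting round.

    Scanning stops once the validation loss has failed to improve
    for `patience` consecutive rounds.
    """
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break
    return best_round


# Validation loss improves, then rises as the model starts to overfit.
val_losses = [0.70, 0.55, 0.48, 0.47, 0.49, 0.52, 0.56]
best = early_stopping_round(val_losses, patience=2)
```

Keeping only the trees up to the best round limits the capacity that later rounds would spend fitting noise.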

100

Multi-Selecteasy

Which TWO data storage options are commonly used by Amazon SageMaker Feature Store for offline and online storage?

Select 2 answers

A.Amazon Redshift

B.Amazon RDS

C.Amazon ElastiCache

D.Amazon S3

E.Amazon DynamoDB

AnswersD, E

S3 is the default offline store for large historical feature data.

Why this answer

Amazon SageMaker Feature Store uses Amazon S3 as the default offline storage layer because it provides durable, scalable, and cost-effective object storage for large volumes of historical feature data. Amazon DynamoDB is used as the default online storage layer because it offers low-latency, single-digit millisecond read/write performance required for real-time inference serving.

Exam trap

The trap here is that candidates often confuse Amazon ElastiCache (a caching layer) with the primary online storage service, or assume Amazon Redshift is used for offline storage due to its analytical capabilities, but SageMaker Feature Store specifically integrates DynamoDB for online and S3 for offline storage as first-class options.

101

Multi-Selectmedium

An MLOps team is designing a CI/CD pipeline for deploying machine learning models to production on Amazon SageMaker. They want to ensure that the deployment process is automated and that models are automatically rolled back if performance degrades. Which of the following AWS services or features should they use to achieve this? (Choose THREE.)

Select 3 answers

A.Amazon SageMaker Model Registry

B.Amazon SageMaker Ground Truth

C.Amazon CloudWatch

D.Amazon SageMaker Pipelines

E.AWS CloudTrail

AnswersA, C, D

Model Registry manages model versions and approvals.

Why this answer

Amazon SageMaker Model Registry is correct because it provides a centralized catalog for managing, versioning, and approving ML models. It enables automated deployment by triggering CI/CD pipelines when a model version is approved, and supports automatic rollback by allowing you to revert to a previous approved version if performance degrades, as detected by monitoring metrics.

Exam trap

The trap here is that candidates may confuse SageMaker Ground Truth (a data labeling service) or CloudTrail (an auditing service) with the core MLOps components needed for automated deployment and rollback, overlooking that Model Registry, Pipelines, and CloudWatch are the precise services that form the CI/CD and monitoring backbone.

102

MCQhard

A team is deploying a model that requires low-latency inference for real-time predictions. They are using a SageMaker endpoint with a single instance. During testing, they observe high latency. Which change would most effectively reduce latency?

A.Use a multi-model endpoint

B.Add Elastic Inference

C.Enable SageMaker Batch Transform

D.Switch to a larger instance type

AnswerD

Correct: Larger instances provide more CPU/GPU for faster inferences.

Why this answer

Option D is correct because switching to a larger instance type provides more compute capacity, reducing inference latency. Option A is wrong because multi-model endpoints may increase latency due to model loading. Option C is wrong because Batch Transform is for batch, not real-time, inference.

Option B is wrong because Elastic Inference adds GPU acceleration but may not reduce latency as much as a compute upgrade.

103

MCQmedium

A data scientist trains a neural network on SageMaker using the TensorFlow framework. The training accuracy is lower than expected, and the scientist suspects vanishing gradients. How can the scientist leverage SageMaker Debugger to diagnose this?

A.Increase the number of training epochs to allow gradients to propagate.

B.Export model summaries to TensorBoard for manual inspection.

C.Reduce the learning rate to prevent gradient explosion.

D.Use a built-in Debugger rule to monitor gradient magnitudes during training.

AnswerD

Built-in rules like VanishingGradient can detect and alert when gradients become too small.

Why this answer

Option D is correct because SageMaker Debugger includes built-in rules such as vanishing_gradient and exploding_tensor that automatically monitor tensors. Option B is wrong because exporting summaries to TensorBoard relies on manual inspection and does not provide Debugger's rule-based alerts. Option A is wrong because adding more epochs may not solve vanishing gradients.

Option C is wrong because reducing the learning rate addresses exploding gradients and does not diagnose the issue.
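
The kind of check a vanishing-gradient rule performs can be illustrated in plain Python. This is a conceptual sketch only, not Debugger's implementation: the function name, the threshold, the layer names, and the gradient values are all assumptions made for the example.

```python
def vanishing_layers(grads_by_layer, threshold=1e-7):
    """Return the layers whose mean absolute gradient is below `threshold`."""
    flagged = []
    for name, grads in grads_by_layer.items():
        mean_abs = sum(abs(g) for g in grads) / len(grads)
        if mean_abs < threshold:
            flagged.append(name)
    return flagged


# Hypothetical gradients captured at one training step.
grads = {
    "dense_1": [1e-9, -2e-9, 5e-10],   # early layer, gradients near zero
    "dense_3": [0.02, -0.01, 0.03],    # later layer, healthy gradients
}
flagged = vanishing_layers(grads)
```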

104

MCQmedium

A team is using Amazon SageMaker to train a neural network. They want to minimize training time while effectively exploring the hyperparameter space. Which approach should they use?

A.Random search

B.Bayesian optimization

C.Grid search

D.Manual tuning

AnswerB

Bayesian optimization uses past evaluations to focus on promising regions, reducing training time.

Why this answer

Bayesian optimization is efficient as it builds a probabilistic model and selects hyperparameters that are likely to improve performance, making it faster than exhaustive methods like grid search.

105

MCQmedium

A team is tuning hyperparameters for a neural network using SageMaker's HyperparameterTuningJob with Bayesian optimization. After several trials, the objective metric has not improved significantly. Which action is most likely to help continue making progress?

A.Expand the hyperparameter ranges

B.Switch to random search strategy

C.Use a warm start with previous tuning results

D.Switch to Bayesian search

AnswerB

Random search introduces exploration and can discover new promising regions beyond the current exploitation focus.

Why this answer

Option B is correct because switching to random search introduces exploration and can help escape local optima that Bayesian optimization might be stuck exploiting. Option D (switch to Bayesian search) changes nothing because Bayesian optimization is already in use. Option C (warm start) reuses previous results but does not change the search strategy.

Option A (expand ranges) might help if the optimum lies outside the current ranges, but stagnation often requires more exploration.

106

MCQeasy

A data scientist is using SageMaker to train a linear regression model. After training, they evaluate the model on the test set and get an R² of 0.95. However, when they deploy the model to a SageMaker endpoint and run predictions on new data, the predictions are far off. What is the most likely cause?

A.The endpoint is using a different inference script.

B.The test set is not representative of the production data distribution.

C.The model was trained with a wrong algorithm.

D.The model is overfitting the training data.

AnswerB

Correct: Data drift causes model to perform poorly on new data despite good test metrics.

Why this answer

Option B is correct because the test set is not representative of the production data distribution (data drift). The high R² on the test set suggests the model fits well, but production data differs. Option D is wrong because overfitting would show a lower test R².

Option A is wrong because a different inference script would typically cause errors, not just poor predictions. Option C is wrong because the algorithm is appropriate.
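
A crude way to spot this kind of mismatch is to compare a feature's production mean against its training distribution. The sketch below is an assumption-laden illustration, not how SageMaker Model Monitor computes drift: the function name and sample values are hypothetical.

```python
import statistics


def mean_shift_in_std(train_values, prod_values):
    """Return the shift of the production mean, in training standard deviations."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(prod_values) - mu) / sigma


# Hypothetical feature values seen in training vs. in production.
train_feature = [10, 12, 11, 9, 13]
prod_feature = [20, 22, 21, 19, 23]
shift = mean_shift_in_std(train_feature, prod_feature)
```

A shift of several standard deviations on an important feature is a strong hint that the test set no longer represents production traffic.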

107

MCQmedium

A company wants to deploy a machine learning model that makes real-time predictions for a mobile app. The model is a deep neural network with a large model size (500 MB). Which SageMaker endpoint configuration is most cost-effective while meeting low-latency requirements?

A.Multi-model endpoint

B.Serverless inference

C.Real-time endpoint with a single instance

D.Batch transform

AnswerC

Ensures low latency and is cost-effective for a single model with sustained traffic.

Why this answer

A real-time endpoint with a single instance provides consistent low latency and is cost-effective for a single large model. Other options either do not meet latency requirements or are designed for multiple models.

108

MCQeasy

A data scientist is training a regression model in Amazon SageMaker. The dataset contains missing values in several features. The scientist wants to handle missing values as part of the training pipeline to ensure consistency between training and inference. Which approach should the scientist use?

A.Impute missing values in a separate Jupyter notebook and save the cleaned data.

B.Use SageMaker Autopilot to automatically handle missing values.

C.Drop all rows with missing values before training.

D.Use a scikit-learn container in SageMaker to create a preprocessing step that imputes missing values and include it in the inference pipeline.

AnswerD

Consistent preprocessing in pipeline.

Why this answer

Option D is correct because it uses a scikit-learn container within SageMaker to create a preprocessing step that imputes missing values, then includes that step in the inference pipeline. This ensures the same imputation logic (e.g., mean, median, or custom strategy) is applied consistently during both training and inference, preventing data drift and maintaining reproducibility. SageMaker Pipelines or the built-in scikit-learn container allow the preprocessing to be serialized as part of the model artifact, so inference requests automatically undergo the same transformation.

Exam trap

The trap here is that candidates often assume SageMaker Autopilot (Option B) is the correct choice because it automates preprocessing, but they miss that the question specifically requires a custom, reproducible pipeline that ensures consistency between training and inference, which Autopilot does not expose for custom control.

How to eliminate wrong answers

Option A is wrong because handling missing values in a separate Jupyter notebook and saving the cleaned data breaks the training-inference consistency; the imputation logic is not captured in a reusable pipeline, leading to potential mismatch when new data arrives during inference. Option B is wrong because SageMaker Autopilot is an automated machine learning service that handles missing values internally during model selection, but it does not allow the data scientist to control the imputation method or integrate a custom preprocessing step into a production inference pipeline. Option C is wrong because dropping all rows with missing values can discard valuable data, reduce model performance, and is not feasible when missing values appear in inference-time data, as the pipeline would have no strategy to handle them.
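
The consistency argument can be sketched with a tiny fit/transform imputer: statistics are learned once from training data and reused unchanged at inference. This is a plain-Python illustration with a hypothetical class name and sample rows, not the scikit-learn container code itself.

```python
import statistics


class MedianImputer:
    """Learn per-column medians at training time and reuse them at inference."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.medians_ = [
            statistics.median(v for v in col if v is not None) for col in columns
        ]
        return self

    def transform(self, rows):
        return [
            [self.medians_[j] if v is None else v for j, v in enumerate(row)]
            for row in rows
        ]


# Fit on training data only.
train_rows = [[1.0, 10.0], [None, 20.0], [3.0, None], [5.0, 30.0]]
imputer = MedianImputer().fit(train_rows)

# At inference, the same learned medians fill the gaps.
inference_rows = imputer.transform([[None, None]])
```

Serializing the fitted imputer alongside the model is what keeps training and inference in agreement.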

109

MCQhard

A machine learning engineer is training a deep learning model using TensorFlow in SageMaker. The training runs on an ml.p3.16xlarge instance (8 GPUs). The engineer notices that GPU utilization is low (~30%) and time per epoch is high. The model uses a custom training loop. Which configuration change is most likely to improve GPU utilization?

A.Increase the batch size to match GPU memory

B.Reduce the number of data loading workers

C.Use mixed precision training

D.Enable SageMaker Managed Warm Pools

AnswerA

Larger batch size increases the amount of computation per step, keeping GPUs more fully utilized.

Why this answer

Option A is correct because increasing the batch size increases the computational work per GPU, keeping them busier and improving utilization. Option C (mixed precision) can improve throughput but not necessarily utilization if the batch size remains small. Option D (SageMaker Managed Warm Pools) reduces startup time between consecutive training jobs and does not affect GPU utilization during a job.

Option B (reducing data loading workers) could worsen data starvation, decreasing utilization.

110

MCQhard

A data scientist is running a SageMaker training job with a custom PyTorch image. The training script loads a large dataset into memory, and the job fails with an out-of-memory error after a few minutes. The instance type is ml.m5.xlarge (16 GB RAM). What should the data scientist do to resolve this issue without changing the instance type?

A.Enable SageMaker Managed Spot Training to free memory

B.Implement data loading with multiprocessing and increase the number of workers

C.Reduce the batch size in the training script

D.Use SageMaker Pipe mode to stream data from S3

AnswerC

Smaller batch sizes reduce memory consumption per step, helping to fit within the available RAM.

Why this answer

Reducing the batch size decreases the memory required for each training step, which can prevent out-of-memory errors. Pipe mode can help by streaming data, but if the entire dataset is loaded, it may not be sufficient. Multiprocessing can increase memory usage.

Spot training does not free memory.

111

MCQhard

A machine learning engineer is deploying a pre-trained NLP model on Amazon SageMaker for real-time inference. The model expects input sequences of variable length, and performance is critical. The engineer wants to minimize latency while handling the variable-length inputs efficiently. Which approach should the engineer choose?

A.Reduce the model size by pruning and quantization.

B.Pad all input sequences to the maximum length in the batch.

C.Use dynamic batching with a custom inference script that groups requests by sequence length.

D.Process each request individually to avoid padding overhead.

AnswerC

Dynamic batching reduces padding and latency.

Why this answer

Option C is correct because dynamic batching with a custom inference script that groups requests by sequence length minimizes padding overhead and maximizes hardware utilization. By batching similar-length sequences together, the model avoids excessive padding to the maximum length in the batch, which reduces wasted computation and latency. This approach is particularly effective for variable-length NLP inputs on SageMaker, where the inference container can be customized to implement the grouping logic.

Exam trap

AWS often tests the misconception that padding to the maximum length is always necessary or efficient, but the trap here is that dynamic batching with length-based grouping is a more sophisticated technique that balances batching efficiency with minimal padding overhead.

How to eliminate wrong answers

Option A is wrong because pruning and quantization reduce model size and can improve latency, but they do not address the core issue of efficiently handling variable-length input sequences; they are orthogonal optimizations. Option B is wrong because padding all sequences to the maximum length in the batch introduces significant wasted computation and memory, especially when sequence lengths vary widely, leading to higher latency. Option D is wrong because processing each request individually eliminates batching benefits, resulting in lower throughput and higher per-request latency due to underutilized hardware accelerators.
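
Length-based grouping can be sketched as follows. The helper names, the bucket width of 8, and the token IDs are hypothetical; a real inference script would also need request queuing and timeouts, which are omitted here.

```python
from collections import defaultdict


def bucket_by_length(requests, bucket_width):
    """Group token sequences by length rounded up to a multiple of bucket_width."""
    buckets = defaultdict(list)
    for tokens in requests:
        # Ceiling division gives the padded length for this sequence's bucket.
        padded_len = -(-len(tokens) // bucket_width) * bucket_width
        buckets[padded_len].append(tokens)
    return dict(buckets)


def pad(tokens, length, pad_id=0):
    """Right-pad a token list to `length` with pad_id."""
    return tokens + [pad_id] * (length - len(tokens))


# Two short and two long requests (hypothetical token IDs).
requests = [[5, 6], [7, 8, 9], [1] * 30, [2] * 31]
batches = {
    n: [pad(t, n) for t in group]
    for n, group in bucket_by_length(requests, bucket_width=8).items()
}
```

Here the short requests are padded to 8 tokens instead of 32, which is where the saved computation comes from.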

112

MCQeasy

Refer to the exhibit. A data scientist reviews the output of a SageMaker training job. The model has 95% training accuracy and 92% validation accuracy. Which statement is true?

A.The model has acceptable performance with a small generalization gap

B.The model is underfitting because the validation accuracy is too low

C.The model needs more epochs to improve validation accuracy

D.The model is overfitting because the training accuracy is higher than validation accuracy

AnswerA

The 3% gap is typical and the accuracy values are high.

Why this answer

Option A is correct because a 3% gap between training and validation accuracy is typically considered a small generalization gap and indicates acceptable performance. Option D (overfitting) would show a larger gap. Option B (underfitting) would show low accuracy on both sets.

Option C (more epochs) may not help if the model is already converging.

113

Multi-Selectmedium

A machine learning engineer is training a neural network using Amazon SageMaker. The training job uses a single GPU instance. To improve training speed using distributed training, which two steps should they take? (Select TWO.)

Select 2 answers

A.Split the dataset into smaller files

B.Use SageMaker's distributed data parallelism library

C.Modify the training script to use Horovod or PyTorch DistributedDataParallel

D.Enable automatic mixed precision

E.Increase the number of worker instances in the training job

AnswersC, E

These frameworks enable multi-GPU communication and are necessary for distributed training.

Why this answer

Distributed training requires both modifying the training script to use a distributed framework (e.g., Horovod, PyTorch DDP) and increasing the number of instances. Splitting the dataset into smaller files can improve I/O but is not about distribution. SageMaker's distributed data parallelism library is one option, but modifying the script with a framework is the general step.

Automatic mixed precision improves speed on a single GPU but does not enable distributed training.

114

Multi-Selecteasy

A data scientist is using SageMaker Autopilot to automatically build a model. Which TWO aspects does Autopilot handle? (Choose TWO.)

Select 2 answers

A.Data ingestion

B.Model deployment

C.Feature engineering

D.Data labeling

E.Hyperparameter tuning

AnswersC, E

Correct: Autopilot automatically explores different feature transformations.

Why this answer

Options C and E are correct because Autopilot performs automated feature engineering and hyperparameter tuning. Option B is wrong because Autopilot does not deploy the model; that is a separate step. Option D is wrong because data labeling is not part of Autopilot.

Option A is wrong because Autopilot does not handle data ingestion beyond using the provided dataset.

115

Multi-Selectmedium

Which TWO options are recommended best practices for monitoring model performance in production on SageMaker? (Choose 2.)

Select 2 answers

A.Retrain the model daily based on recent data without evaluation.

B.Use SageMaker Clarify for bias monitoring and feature importance drift.

C.Enable SageMaker Model Monitor to capture data drift and model quality metrics.

D.Set up a CloudWatch alarm on the endpoint's Invocations metric.

E.Manually compare prediction distributions weekly.

AnswersB, C

Clarify can monitor bias and explainability over time.

Why this answer

Options B and C are correct. SageMaker Model Monitor can track data quality and model quality drift, and SageMaker Clarify can monitor bias and feature importance drift over time. Option D is wrong because CloudWatch alarms can notify but are not a complete monitoring solution.

Option E is wrong because manual review is not scalable. Option A is wrong because retraining without evaluation or monitoring may be premature.

116

Multi-Selecteasy

A data science team needs to track and compare multiple ML training runs, including hyperparameters, metrics, and output artifacts. Which TWO AWS services can be used together to meet this requirement? (Choose two.)

Select 2 answers

A.Amazon SageMaker Experiments

B.Amazon S3

C.Amazon SageMaker Studio notebooks

D.Amazon SageMaker Model Registry

E.Amazon CloudWatch Logs

AnswersA, D

SageMaker Experiments captures and compares training runs, metrics, and parameters.

Why this answer

Amazon SageMaker Experiments provides experiment tracking and comparison. Amazon SageMaker Model Registry helps manage model versions and artifacts. SageMaker Studio notebooks alone lack tracking; S3 provides storage but no tracking; CloudWatch logs is for monitoring.

117

MCQeasy

A company uses SageMaker to train a model. The training job is failing with an error "ResourceLimitExceeded". What is the most likely cause?

A.The account has reached the limit for number of training instances

B.The model artifact is too large to upload

C.Invalid hyperparameters

D.The training data size exceeds the available instance storage

AnswerA

ResourceLimitExceeded occurs when you exceed a service quota, such as instance count.

Why this answer

The error indicates that a service limit has been reached, commonly the number of concurrent training instances. Other options would produce different error messages.

118

MCQhard

A team is using Amazon SageMaker's Automatic Model Tuning (AMT) to optimize hyperparameters for a random forest model. After 10 training jobs, the best objective metric value plateaus. The team wants to explore the search space more broadly. Which AMT strategy should they use?

A.Grid search

B.Random search

C.Bayesian optimization

D.Hyperband

AnswerB

Random search explores the entire search space uniformly, increasing the chance of finding new promising regions.

Why this answer

Random search samples hyperparameters randomly and covers the search space more broadly, which can help escape a plateau. Bayesian optimization focuses on promising regions, which may not explore broadly. Grid search is exhaustive but expensive.

Hyperband uses early stopping to allocate resources efficiently but still may not explore broadly if the plateau persists.
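
The exploration behavior of random search can be illustrated with a small sampler. This is a generic sketch, not the AMT implementation: the function name, the hyperparameter names, and their integer ranges are hypothetical.

```python
import random


def sample_random_configs(int_ranges, n, seed=0):
    """Draw n configurations uniformly at random from inclusive integer ranges."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in int_ranges.items()}
        for _ in range(n)
    ]


# Hypothetical search space for a random forest.
ranges = {"max_depth": (2, 12), "min_samples_leaf": (1, 20)}
configs = sample_random_configs(ranges, n=5)
```

Because each draw ignores earlier results, the samples spread across the whole space instead of clustering around the current best region.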

119

Multi-Selectmedium

A data scientist is using Amazon SageMaker Data Wrangler for data preparation. Which two tasks can be performed using Data Wrangler's built-in transforms? (Choose two.)

Select 2 answers

A.Running a SQL query on the data

B.Encoding categorical variables

C.Handling missing values

D.Creating an ensemble of models

E.Training a custom machine learning model

AnswersB, C

Built-in transform for one-hot encoding or label encoding.

Why this answer

Data Wrangler includes built-in transforms for handling missing values and encoding categorical variables. Running SQL queries is possible via custom import, but not a built-in transform. Training models and creating ensembles are not part of Data Wrangler.

120

MCQhard

Refer to the exhibit. A data scientist runs a SageMaker training job with the above configuration. The training completes but the model performance is poor. Which change to the hyperparameters is most likely to improve the model's AUC?

A.Increase max_depth to 10

B.Increase subsample to 1.0

C.Increase num_round to 200

D.Decrease eta to 0.1

AnswerD

A lower learning rate improves generalization by taking smaller steps, often yielding better AUC.

Why this answer

Option D is correct because reducing eta (learning rate) from 0.3 to 0.1 allows the model to converge more carefully, often improving generalization and AUC. Option C (increase num_round) may cause overfitting, especially with a high learning rate. Option A (increase max_depth) can also lead to overfitting.

Option B (increase subsample to 1.0) uses all data per round, which may reduce regularization and exacerbate overfitting.

121

Multi-Selectmedium

A team is training a deep learning model on Amazon SageMaker using a custom Docker container. Which three practices should they follow to optimize training performance? (Choose three.)

Select 3 answers

A.Store training data in Amazon S3 in a shuffled and compressed format

B.Use the largest instance type available

C.Increase the number of layers in the model to improve accuracy

D.Use SageMaker Managed Spot Training with checkpointing

E.Use Pipe mode to stream data instead of File mode

AnswersA, D, E

Shuffling prevents bias and compression reduces transfer time, improving training performance.

Why this answer

Storing training data in Amazon S3 in a shuffled and compressed format (Option A) optimizes training performance because shuffling prevents biased gradient updates during stochastic gradient descent, while compression reduces I/O overhead and network transfer time. SageMaker's Pipe mode can then stream this compressed data directly to the training algorithm without intermediate disk writes, further accelerating throughput.

Exam trap

AWS often tests the misconception that bigger instances always mean faster training, but the real optimization lies in data pipeline efficiency (e.g., Pipe mode, compression, and shuffling) and cost management (e.g., Managed Spot Training with checkpointing).

122

MCQmedium

A company is training a deep learning model on Amazon SageMaker. The training job started but has been stuck in 'InProgress' state for an unusually long time with low CPU utilization. The data scientist suspects a bottleneck. What should be the first troubleshooting step?

A.Switch the training job to use Spot instances to reduce cost and potentially improve throughput.

B.Increase the number of training instances to parallelize data loading.

C.Stop and restart the training job with a different instance type.

D.Review CloudWatch Logs for the training container to identify errors or warnings.

AnswerD

Logs often show the exact cause of hanging, such as waiting for data or resource constraints.

Why this answer

Option D is correct because checking CloudWatch Logs for the training container can reveal errors like resource limits or data loading issues. Option C is wrong because restarting without investigation may waste time. Option A is wrong because using Spot instances does not address the stuck job.

Option B is wrong because increasing the instance count may not help if the bottleneck is elsewhere.

123

Multi-Selecthard

A data scientist is building a text classification model using Amazon SageMaker. The dataset is stored as a CSV file in Amazon S3. The scientist wants to use the SageMaker built-in BlazingText algorithm. Which of the following steps are required to prepare the data for training? (Choose TWO.)

Select 2 answers

A.Convert the text to one-hot encoded vectors.

B.Tokenize and remove stop words from the text.

C.Convert the CSV file to the format of a single file with one instance per line.

D.Upload the data to an Amazon SageMaker notebook instance.

E.Ensure each line in the training file contains a single text instance with the label prefixed by '__label__'.

AnswersC, E

BlazingText expects a single file with one instance per line.

Why this answer

Option C is correct because BlazingText expects input data in a single file where each line represents one training instance. This is a specific requirement of the algorithm's input format, not a general SageMaker practice. The CSV file must be converted to this line-per-instance format for BlazingText to process it correctly.

Exam trap

The trap here is that candidates assume general NLP preprocessing (like tokenization or stop word removal) is always required, but BlazingText is designed to handle raw text and expects a specific line format, not preprocessed vectors.
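
A minimal sketch of the conversion, assuming a CSV with hypothetical column names "label" and "text"; it writes one instance per line with the label prefixed by __label__, as described above.

```python
import csv
import io


def to_blazingtext_lines(csv_text, label_col, text_col):
    """Convert CSV rows to '__label__<label> <text>' lines, one instance per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [f"__label__{row[label_col]} {row[text_col]}" for row in reader]


# Hypothetical two-row dataset.
csv_text = "label,text\nspam,win a prize now\nham,see you at lunch\n"
lines = to_blazingtext_lines(csv_text, label_col="label", text_col="text")
```

Joining the lines with newlines and uploading the result to S3 gives a single training file in the expected shape.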

124

MCQeasy

Which technique is commonly used to handle missing values in a categorical feature?

A.One-hot encoding

B.Mean imputation

C.Mode imputation

D.Standard scaling

AnswerC

Mode imputation replaces missing categorical values with the most frequent category, a common practice.

Why this answer

Mode imputation (replacing missing values with the most frequent category) is a standard method for categorical data. Mean imputation is for numerical data, standard scaling is for feature scaling, and one-hot encoding encodes categories without handling missing values.
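
Mode imputation fits in a few lines of plain Python. The function name and sample values are hypothetical, and missing entries are assumed to be represented as None.

```python
from collections import Counter


def impute_mode(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


colors = ["red", "blue", None, "red", None, "green"]
filled = impute_mode(colors)
```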

125

MCQhard

An MLOps engineer is building an automated retraining pipeline for a fraud detection model. The model must be retrained weekly, and the new model should only be promoted to production if it meets predefined performance thresholds compared to the current model. Which combination of SageMaker capabilities should the engineer use?

A.Amazon SageMaker Debugger and Amazon SageMaker Clarify

B.Amazon SageMaker Model Monitor and Amazon SageMaker Ground Truth

C.Amazon SageMaker Autopilot and Amazon SageMaker Experiments

D.Amazon SageMaker Pipelines and Amazon SageMaker Model Registry

AnswerD

Pipelines orchestrate the workflow, Model Registry manages model versions and approvals.

Why this answer

Option D is correct because Amazon SageMaker Pipelines provides the orchestration for the automated retraining workflow (including weekly scheduling and conditional logic), while SageMaker Model Registry enables versioning, approval, and promotion of models based on performance thresholds. Together, they allow the engineer to define a pipeline that trains a new model, evaluates it against the current production model, and only registers it for deployment if it meets the predefined criteria.

Exam trap

AWS often tests the distinction between monitoring tools (Model Monitor, Debugger) and orchestration/registry services (Pipelines, Model Registry), so the trap here is that candidates may confuse Model Monitor's drift detection with the need for a retraining pipeline, overlooking that the question specifically requires automated retraining and conditional promotion.

How to eliminate wrong answers

Option A is wrong because SageMaker Debugger monitors training metrics and detects anomalies (e.g., vanishing gradients), but it does not orchestrate retraining pipelines or manage model promotion. SageMaker Clarify is used for bias detection and feature importance, not for automated retraining workflows. Option B is wrong because SageMaker Model Monitor detects data drift in production, not for retraining orchestration, and SageMaker Ground Truth is a labeling service for creating training datasets, not for pipeline automation or model promotion.

Option C is wrong because SageMaker Autopilot automates model building (feature engineering, algorithm selection) but does not provide pipeline orchestration or model registry capabilities for conditional promotion; SageMaker Experiments tracks trial runs but lacks the workflow automation and approval gates needed for this use case.

126

MCQeasy

A team uses SageMaker for training. They need to monitor training progress and view metrics like loss and accuracy. Which SageMaker feature should they use?

A.SageMaker Ground Truth

B.SageMaker Debugger

C.SageMaker Model Monitor

D.SageMaker Experiments

AnswerB

Debugger can output tensors and metrics during training for real-time monitoring.

Why this answer

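SageMaker Debugger captures training metrics in real time and can raise alerts, making it the right fit for monitoring training progress. Ground Truth is a data labeling service, Model Monitor watches deployed endpoints, and Experiments tracks and compares runs rather than monitoring a job as it trains.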

127

MCQeasy

A data scientist wants to evaluate the performance of a binary classification model. The dataset is highly imbalanced with only 5% positive class. Which metric should be used to evaluate the model?

A.Accuracy

B.Mean Squared Error

C.R-squared

D.F1-score

AnswerD

F1-score considers both precision and recall, providing a balanced measure for imbalanced classes.

Why this answer

F1-score balances precision and recall, making it suitable for imbalanced datasets. Accuracy can be misleading (e.g., 95% if always predicting negative). Mean Squared Error and R-squared are for regression.
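
The accuracy pitfall can be checked with a few lines of plain Python. This is a minimal sketch on a made-up dataset (100 samples, 5 positive) scored against a classifier that always predicts negative; the helper names are illustrative and not taken from any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Made-up data: 5 positives out of 100 samples (5% positive class).
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

acc = accuracy(y_true, always_negative)
f1 = f1_score(y_true, always_negative)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

The always-negative classifier scores high on accuracy while its F1-score is zero, because it never finds a single positive.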

128

MCQhard

A machine learning team wants to detect bias in a binary classification model before deployment. They use SageMaker Clarify. Which type of bias metric should they compute to understand whether the model treats different demographic groups unfairly in predictions?

A.SHAP (SHapley Additive exPlanations) values from the test predictions.

B.A re-run of the training job with a fairness constraint.

C.Pre-training bias metrics like Class Imbalance (CI) and Difference in Proportions of Labels (DPL).

D.Feature importance values after training.

AnswerC

Pre-training metrics identify bias in the training data that could lead to unfair models.

Why this answer

Option C is correct because pre-training bias metrics (such as Class Imbalance and Kullback-Leibler divergence) reveal bias in the data before modeling, while post-training metrics assess bias in predictions. Option A is wrong because SHAP values are for model interpretability, not bias metrics per se. Option B is wrong because retraining does not detect bias.

Option D is wrong because feature importance explains model behavior but not bias.
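
To make the two pre-training metrics concrete, here is a small sketch that computes them by hand rather than through SageMaker Clarify. It assumes the usual definitions: Class Imbalance as (n_a - n_d) / (n_a + n_d) for two group sizes, and the label-proportion difference (which the Clarify documentation appears to abbreviate as DPL) as the gap between the groups' positive-label rates. The group counts are hypothetical.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d), where n_a and n_d are the two group sizes."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d):
    """DPL = q_a - q_d, the gap in positive-label rates between the two groups."""
    return pos_a / n_a - pos_d / n_d


# Hypothetical training set:
#   group a: 800 rows, 400 labeled positive
#   group d: 200 rows,  50 labeled positive
ci = class_imbalance(800, 200)
dpl = difference_in_proportions_of_labels(400, 800, 50, 200)
print(f"CI={ci}, DPL={dpl}")
```

Both values lie in [-1, 1]. Values near zero suggest the groups are similarly represented and similarly labeled, while values far from zero flag data that could lead to an unfair model.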

129

Multi-Selectmedium

A data scientist is building a text classification model using Amazon SageMaker. The dataset is large and includes imbalanced classes. Which three techniques can help improve model performance? (Choose three.)

Select 3 answers

A.Performing feature extraction using TF-IDF

B.Using cost-sensitive learning

C.Oversampling the minority class

D.Using a linear classifier only

E.Using SMOTE

AnswersB, C, E

Cost-sensitive learning assigns higher misclassification costs to the minority class, improving performance on it.

Why this answer

Oversampling, SMOTE, and cost-sensitive learning are standard approaches to handle class imbalance. Using a linear classifier only is limiting, and TF-IDF is a feature extraction method that does not address imbalance directly.
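
Two of these techniques are simple enough to sketch in plain Python. The example below is illustrative only: it derives per-class weights with the common "balanced" heuristic, n_samples / (n_classes * class_count), as one way to set misclassification costs, and it performs random oversampling by duplicating minority rows. SMOTE, which synthesizes new minority points by interpolation, is not shown. The function names and the toy data are assumptions.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - cnt):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy dataset: 90 negative reviews and 10 positive reviews.
labels = ["neg"] * 90 + ["pos"] * 10
rows = [f"review {i}" for i in range(100)]

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)
print(weights)
print(Counter(new_labels))
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into validation or test data.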

130

Multi-Selecthard

Which TWO tools are specifically designed for debugging and analyzing training jobs in SageMaker?

Select 2 answers

A.SageMaker Autopilot

B.SageMaker Experiments

C.SageMaker Debugger

D.SageMaker Clarify

E.SageMaker Model Monitor

AnswersB, C

Experiments organizes training runs for analysis and comparison.

Why this answer

SageMaker Debugger provides real-time monitoring and debugging of training jobs, and SageMaker Experiments helps track and compare runs. Model Monitor is for deployed endpoints, Clarify for bias, and Autopilot for automated model creation.

131

MCQhard

A data scientist is using Amazon SageMaker Debugger to monitor training metrics. They want to stop training automatically if the model is overfitting. Which action should they take?

A.Define a Debugger rule that monitors the loss plateau

B.Configure a custom rule that triggers a STOP training action when validation loss stops decreasing

C.Create a SageMaker Training Compiler

D.Use a built-in rule that checks for vanishing gradients

AnswerB

A custom rule can monitor validation loss and stop training when it plateaus or increases, indicating overfitting.

Why this answer

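SageMaker Debugger rules can be paired with actions such as stopping the training job. Debugger does ship built-in rules related to overfitting, but among the listed options only B couples a check on validation loss (a plateau, or an increase while training loss keeps falling) with a STOP action.

Option A monitors a loss plateau but takes no action, Option C (Training Compiler) targets training speed rather than overfitting, and Option D checks for vanishing gradients, which is a different problem.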
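
The check such a rule performs can be sketched independently of SageMaker. The function below is not the Debugger API; it is a hypothetical helper showing one common patience-based test that a custom rule could apply to the validation losses recorded so far before triggering a stop.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True when none of the last `patience` validation losses
    improved on the best earlier loss by more than `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)


# Validation loss bottoms out at 0.70 and then rises: a sign of overfitting.
overfitting_run = [1.00, 0.80, 0.70, 0.72, 0.75, 0.80]
# Validation loss is still improving at every evaluation.
healthy_run = [1.00, 0.90, 0.80, 0.70, 0.60, 0.50]

print(should_stop(overfitting_run))
print(should_stop(healthy_run))
```

The `patience` and `min_delta` values are tuning choices: a larger patience tolerates noisy validation curves at the cost of training longer before stopping.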

132

MCQmedium

Refer to the exhibit. A data scientist receives an AccessDenied error when trying to create a training job using SageMaker. What is the most likely cause?

A.Missing s3:PutObject permission

B.Missing sagemaker:CreateTrainingJob permission

C.Missing sagemaker:DescribeTrainingJob permission

D.Using wrong AWS region

AnswerA

Training jobs require put access to S3 to write model artifacts and other outputs.

Why this answer

The policy allows sagemaker:CreateTrainingJob and s3:GetObject, but training jobs also need to write model artifacts and other outputs to S3 (s3:PutObject). The missing s3:PutObject permission causes the AccessDenied error.

133

MCQmedium

A company is using Amazon SageMaker to train a large deep learning model. The training job is taking a very long time. The data scientist suspects that the GPU utilization is low due to inefficient data loading. Which action should the data scientist take to diagnose and address this issue?

A.Switch to a CPU-only instance to reduce overhead.

B.Check GPU utilization using Amazon CloudWatch metrics, and if low, optimize the data loading pipeline by using Pipe mode or faster data formats.

C.Reduce the batch size to speed up training.

D.Increase the number of GPUs in the training instance.

AnswerB

Monitoring GPU utilization and optimizing data loading addresses the bottleneck.

Why this answer

Option B is correct because low GPU utilization during deep learning training often indicates a data loading bottleneck, where the GPU spends cycles waiting for data. Amazon CloudWatch provides GPU utilization metrics for SageMaker training jobs, and if utilization is low, optimizing the data pipeline with Pipe mode (streaming data directly from Amazon S3) or using faster data formats like RecordIO or TFRecord can reduce I/O overhead and keep the GPU busy.

Exam trap

The trap here is that candidates often assume adding more GPUs or reducing batch size will speed up training, but without addressing the data pipeline bottleneck, these changes can actually worsen GPU utilization and training time.

How to eliminate wrong answers

Option A is wrong because switching to a CPU-only instance would eliminate GPU acceleration entirely, making training even slower, and does not address the root cause of inefficient data loading. Option C is wrong because reducing the batch size typically decreases GPU utilization further, as the GPU processes fewer samples per step, increasing the relative overhead of data loading and model synchronization. Option D is wrong because increasing the number of GPUs does not fix a data loading bottleneck; it can actually exacerbate the issue by requiring even more data to be fed to multiple GPUs, potentially lowering per-GPU utilization further.

134

MCQmedium

A machine learning engineer is developing a text classification model using Amazon SageMaker. The dataset consists of 1 million customer reviews, with labels indicating sentiment (positive, negative, neutral). The engineer uses a pre-trained BERT model from the Hugging Face Model Hub and fine-tunes it on the dataset using SageMaker's Hugging Face estimator with an ml.p3.2xlarge instance. After 2 hours of training, the training job fails with a 'ResourceExhaustedError: CUDA out of memory' error. The error occurs during the forward pass of the first epoch. The engineer confirms that the batch size is set to 32, the maximum sequence length is 512 tokens, and the dataset is stored in an S3 bucket in the same AWS region. The engineer needs to complete fine-tuning without increasing instance costs. Which course of action should the engineer take?

A.Reduce the batch size to 8 and enable gradient accumulation with 4 steps to maintain effective batch size.

B.Enable SageMaker Managed Spot Training to reduce costs and use the savings to upgrade to an ml.p3.8xlarge instance.

C.Switch to a CPU-based instance like ml.c5.2xlarge to avoid GPU memory constraints.

D.Reduce the maximum sequence length to 128 tokens to lower memory consumption.

AnswerA

Reducing batch size lowers GPU memory usage, and gradient accumulation allows the model to see the same number of samples per update without increasing memory.

Why this answer

Option A is correct because reducing the batch size to 8 directly lowers GPU memory usage per forward pass, and enabling gradient accumulation with 4 steps allows the model to simulate the original effective batch size of 32 (8 × 4 = 32) without increasing memory footprint. This approach resolves the CUDA out-of-memory error while keeping the same instance type (ml.p3.2xlarge) and without incurring additional costs.
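
The equivalence behind gradient accumulation can be verified on a toy problem. The sketch below uses a one-parameter linear model with mean squared error in plain Python, not BERT or the Hugging Face trainer, and all names and data are made up. It shows that summing the gradients of 4 micro-batches of 8, each scaled by 1/4, reproduces the gradient of a single batch of 32.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch, accum_steps):
    """Sum micro-batch gradients, each scaled by 1/accum_steps,
    as would be done before a single optimizer update."""
    total = 0.0
    for step in range(accum_steps):
        lo = step * micro_batch
        hi = lo + micro_batch
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total


# Made-up data: 32 samples that follow y = 3x.
xs = [i / 10 for i in range(32)]
ys = [3 * x for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)                 # one batch of 32
accum = accumulated_grad(w, xs, ys, 8, 4)  # 4 micro-batches of 8
print(abs(full - accum) < 1e-9)
```

Only one micro-batch's activations need to be held in memory at a time, which is why peak memory drops while the update still reflects 32 samples. The trade-off is more forward and backward passes per optimizer step.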

Exam trap

The trap here is that candidates may think reducing sequence length (Option D) is the simplest fix, but they overlook that it can severely impact model performance for sentiment analysis on long reviews, while gradient accumulation (Option A) is the standard technique to handle GPU memory limits without sacrificing batch size or accuracy.

How to eliminate wrong answers

Option B is wrong because upgrading to an ml.p3.8xlarge instance increases costs (it has four GPUs and 4× the total GPU memory, at a higher hourly price), and Managed Spot Training only reduces cost without changing the instance type; the engineer explicitly needs to avoid increasing instance costs. Option C is wrong because switching to a CPU-based instance (ml.c5.2xlarge) would dramatically increase training time for a BERT model, which relies on GPU parallelism, and may still run out of memory at sequence length 512, making it impractical for completing the fine-tuning. Option D is wrong because reducing the maximum sequence length to 128 tokens would truncate input texts, potentially losing critical context in customer reviews and degrading model accuracy; the engineer needs to maintain model quality while fixing the memory error.
