CCNA ML Model Development Questions

75 of 134 questions · Page 1/2 · ML Model Development · Answers revealed

1
MCQhard

Refer to the exhibit. A data scientist used a SageMaker training job with a custom Scikit-learn script. The training job failed with the error shown. What is the most likely cause of this failure?

A.The training script is reading the CSV file incorrectly, causing a shape mismatch.
B.The InputDataConfig specifies ContentType text/csv but the actual file is not CSV.
C.The SageMaker training image is outdated and does not support Scikit-learn 1.0.
D.The training data contains missing values that need to be imputed.
AnswerA

Correct: The error indicates a shape issue, and SageMaker's CSV loading can produce 1D arrays for single-column data, which the script must handle.

Why this answer

The error 'Expected 2D array, got 1D array' indicates the input data is being interpreted as a single-dimensional array. In SageMaker, when reading a CSV file with one column, the default behavior may produce a 1D array. The script likely expects a 2D array.

Option A is correct because the script is incorrectly reading or processing the CSV, causing a shape mismatch.
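
The shape fix can be sketched as follows. This is a minimal illustration, assuming the script loads a single-column CSV with NumPy; the in-memory CSV text is hypothetical and stands in for the file in the training channel.

```python
import io

import numpy as np

# Hypothetical single-column CSV standing in for the training file.
csv_text = "1.0\n2.0\n3.0\n"

# np.loadtxt returns a 1D array of shape (3,) for one column of data.
features = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Scikit-learn estimators expect shape (n_samples, n_features),
# so reshape the single feature into one column.
features_2d = features.reshape(-1, 1)

print(features.shape, features_2d.shape)
```

Reshaping with `(-1, 1)` treats the values as many samples of one feature; `(1, -1)` would instead treat them as one sample with many features.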

2
MCQhard

Refer to the exhibit. A data scientist ran a SageMaker training job using a built-in algorithm. The job failed with the above error. What is the most likely cause?

A.The S3 bucket lacks proper permissions for SageMaker to read the training data.
B.The input CSV file has missing or mismatched column headers.
C.The built-in algorithm does not support CSV input format.
D.The training instance ran out of memory.
AnswerB

The failure reason indicates the CSV headers do not match the training schema.

Why this answer

Option B is correct because the error explicitly states the data format is incorrect; the CSV headers do not match the expected schema. Option A is wrong because the error is about data format, not permissions. Option C is wrong because the algorithm is built-in and should support CSV if headers match.

Option D is wrong because the error does not mention memory.

3
MCQhard

A machine learning team is training a large natural language processing model on Amazon SageMaker using the SageMaker Hugging Face container. The training job runs on multiple instances and uses Managed Spot Training to reduce costs. However, the job frequently gets interrupted by Spot interruptions, causing long training times. What should the team do to mitigate this issue?

A.Use a reserved capacity with Savings Plans
B.Use a larger instance type to finish faster
C.Enable checkpointing and increase the number of save intervals
D.Disable Managed Spot Training and use On-Demand instances
AnswerC

Checkpointing saves model state so training can resume after a Spot interruption; more frequent saves reduce the amount of work lost.

Why this answer

Enabling checkpointing and saving intermediate model states at appropriate intervals allows the training job to resume from the last checkpoint after a Spot interruption, significantly reducing wasted time. Increasing the number of save intervals means saving more frequently, which reduces the work lost per interruption. Reserved capacity does not help with interruptions; using larger instances doesn't prevent interruptions; disabling Spot increases cost.
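
The resume-from-checkpoint idea can be sketched with a toy training loop. This is an illustration only: the "model state" is a running sum, the helper names are hypothetical, and a temporary directory stands in for the local checkpoint path that SageMaker syncs to S3 (by default `/opt/ml/checkpoints` when a checkpoint S3 URI is configured).

```python
import json
import os
import tempfile


def save_checkpoint(checkpoint_dir, step, state):
    """Write the training state for `step` to a JSON checkpoint file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)


def load_checkpoint(checkpoint_dir):
    """Return (step, state) from the last checkpoint, or (0, 0.0) if none exists."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def run_training(checkpoint_dir, total_steps, save_every, stop_at=None):
    """Toy loop: the 'model state' is a running sum of step numbers."""
    step, state = load_checkpoint(checkpoint_dir)
    while step < total_steps:
        step += 1
        state += step
        if step % save_every == 0:
            save_checkpoint(checkpoint_dir, step, state)
        if stop_at is not None and step == stop_at:
            return step, state  # simulate a Spot interruption
    return step, state


checkpoint_dir = tempfile.mkdtemp()
run_training(checkpoint_dir, total_steps=10, save_every=2, stop_at=5)
resumed_from, _ = load_checkpoint(checkpoint_dir)  # last save before the interruption
final_step, final_state = run_training(checkpoint_dir, total_steps=10, save_every=2)
```

With `save_every=2` the interrupted run loses only the one step after the last save; a larger save interval would lose more work per interruption.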

4
MCQhard

A machine learning team is developing a deep learning model for image classification. They observe that the training loss decreases rapidly but the validation loss starts increasing after a few epochs. Which strategy should they implement to address this issue?

A.Increase the batch size
B.Add more convolutional layers
C.Increase the learning rate
D.Apply dropout regularization
AnswerD

Dropout prevents co-adaptation of neurons, acting as a regularizer to reduce overfitting.

Why this answer

Increasing validation loss indicates overfitting. Dropout regularization randomly drops neurons during training, which reduces overfitting. Increasing learning rate would make training unstable.

Adding more layers increases capacity and likely worsens overfitting. Increasing batch size can have a regularizing effect but is not as direct as dropout.
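
For intuition, dropout can be sketched in a few lines. This is a minimal version of the common "inverted dropout" formulation on a plain list of activations, not any framework's implementation; the function name and values are illustrative.

```python
import random


def dropout(activations, rate, training, rng):
    """Inverted dropout: during training, zero each unit with probability `rate`
    and scale survivors by 1 / (1 - rate); at inference, pass values through."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 8
train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
eval_out = dropout(hidden, rate=0.5, training=False, rng=rng)
```

Scaling the surviving units keeps the expected activation the same in training and inference, which is why no rescaling is needed at evaluation time.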

5
MCQmedium

A retail company uses SageMaker to train a multi-class image classification model with a custom ResNet-50 implemented in TensorFlow. The training data is 500 GB of images stored in S3. The data scientist uses a ml.p3.2xlarge instance with a single GPU. The training takes 10 hours per epoch, and the model does not converge after 5 epochs. The scientist needs to accelerate training and improve model accuracy. The current implementation loads images individually from S3 using TensorFlow's tf.data API. The scientist also notices high I/O wait time. Which combination of actions should the scientist take? (Assume the scientist is aware of best practices.) The answer is a single choice from A-D.

A.Increase the number of epochs to 20 and enable early stopping with patience 5.
B.Convert images to RecordIO format and store them on Amazon EFS for faster access.
C.Deploy the model on a SageMaker endpoint and use batch transform for offline predictions.
D.Use SageMaker Pipe mode for data ingestion and upgrade to a ml.p3.8xlarge instance.
AnswerD

Pipe mode reduces I/O wait by streaming data; more GPUs parallelize training.

Why this answer

Option D is correct because using SageMaker Pipe mode streams data directly from S3 to the training container, reducing I/O bottlenecks. Additionally, switching to a multi-GPU instance like ml.p3.8xlarge speeds up computation. Option A is wrong because increasing epochs does not address I/O or speed.

Option C is wrong because batch transform is for inference. Option B is wrong because RecordIO is not natively supported by TensorFlow tf.data without conversion, and EFS adds network latency.

6
MCQeasy

A data scientist is performing feature engineering on a dataset containing a categorical feature with high cardinality (over 1000 unique values). Which encoding method is most appropriate to use as input for a tree-based model?

A.One-hot encoding
B.Label encoding
C.Target encoding
D.Binary encoding
AnswerB

Label encoding converts categories to integers, which tree-based models can handle without expanding feature space.

Why this answer

Option B is correct because label encoding assigns integer labels to categories, and tree-based models can effectively split on these ordinal-like values without creating a large number of features. Option A (one-hot encoding) would produce too many features. Option C (target encoding) risks data leakage.

Option D (binary encoding) creates fewer features than one-hot but still many and may not be as interpretable for trees.
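
Label encoding itself is a simple mapping, sketched below in plain Python. The helper names and the `-1` code for unseen categories are assumptions made for illustration.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order for determinism."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def transform(values, mapping, unknown=-1):
    """Encode values; categories unseen at fit time get the `unknown` code."""
    return [mapping.get(v, unknown) for v in values]


cities = ["lima", "oslo", "lima", "cairo"]
mapping = fit_label_encoder(cities)
encoded = transform(cities + ["quito"], mapping)
```

The feature stays a single column no matter how many categories exist, which is the property the explanation relies on.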

7
Multi-Selecthard

A machine learning engineer is deploying a custom PyTorch model to a SageMaker endpoint for real-time inference. The model requires GPU acceleration. The engineer wants to minimize latency and cost. Which THREE actions should the engineer take? (Select THREE.)

Select 3 answers
A.Use an ml.c5.2xlarge instance with CPU only
B.Use SageMaker Batch Transform for inference
C.Compile the model with SageMaker Neo
D.Use SageMaker Elastic Inference (EI) instead of a full GPU instance
E.Use an ml.p3.2xlarge instance for the endpoint
AnswersC, D, E

Neo optimizes the model for faster inference on target hardware.

Why this answer

SageMaker Neo compiles the PyTorch model into an optimized runtime binary that is specifically tuned for the target hardware (e.g., GPU instances like ml.p3). This reduces inference latency by applying graph-level optimizations, operator fusion, and memory layout transformations without changing the model's accuracy, while also lowering compute resource usage and cost.

Exam trap

AWS often tests the distinction between real-time vs. batch inference and the trade-off between full GPU instances and lighter acceleration options like Elastic Inference, expecting candidates to recognize that Batch Transform is not suitable for low-latency endpoints and that CPU-only instances cannot meet GPU requirements.

8
MCQmedium

A data scientist needs to ensure that the same train/test split is used across multiple experiments for reproducibility in SageMaker. Which approach should they take?

A.Use the same SageMaker instance type
B.Use the same hyperparameter values
C.Use the same dataset version
D.Set a random seed in the training script
AnswerD

Correct: Setting a random seed ensures reproducibility of random operations like data splits.

Why this answer

Option D is correct because setting a random seed in the training script ensures reproducibility of the data split. Options A, B, and C are incorrect because instance type, hyperparameters, and dataset version do not control the random split.
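
A seeded split can be sketched with the standard library. The function name is illustrative, and the sketch assumes the rows arrive in the same order on every run; the seed only makes the shuffle repeatable.

```python
import random


def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of `rows` with a seeded RNG, then split it."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))
train_a, test_a = train_test_split(rows, 0.2, seed=42)
train_b, test_b = train_test_split(rows, 0.2, seed=42)
```

Using a dedicated `random.Random(seed)` instance rather than the global RNG keeps the split independent of any other random calls in the script.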

9
MCQmedium

A company deploys a real-time inference endpoint on SageMaker for a customer-facing application. Traffic patterns are unpredictable and sometimes spike. The endpoint must scale automatically to handle load while minimizing cost. Which approach should the company take?

A.Switch to batch transform for all inference requests.
B.Use a larger instance type to handle peak traffic.
C.Configure a target tracking scaling policy on the endpoint using CloudWatch metrics.
D.Deploy multiple models behind an Application Load Balancer.
AnswerC

Target tracking scales instances in/out based on a metric threshold, matching demand.

Why this answer

Option C is correct because a target tracking scaling policy with a metric like average latency or request count scales automatically based on demand. Option D is wrong because using multiple models does not address scaling. Option B is wrong because increasing instance type without auto-scaling leads to over-provisioning.

Option A is wrong because batch transform is for asynchronous, not real-time.
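
A target tracking policy for an endpoint variant is configured through Application Auto Scaling. The sketch below only builds the request parameters so it runs without AWS access; the endpoint name, variant name, capacity limits, and target value are hypothetical.

```python
def target_tracking_requests(endpoint_name, variant_name, min_instances,
                             max_instances, invocations_per_instance):
    """Build keyword arguments for the Application Auto Scaling calls
    `register_scalable_target` and `put_scaling_policy`."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    register = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-target-tracking",  # hypothetical name
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return register, policy


# Hypothetical endpoint and values, for illustration only.
register_kwargs, policy_kwargs = target_tracking_requests(
    "churn-endpoint", "AllTraffic", 1, 4, 70)
# With boto3 installed and credentials configured, these would be passed as:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**register_kwargs)
#   client.put_scaling_policy(**policy_kwargs)
```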

10
MCQeasy

A machine learning engineer is training a model using SageMaker's built-in XGBoost algorithm. The training job fails with an error indicating insufficient memory. Which parameter should be adjusted to reduce memory usage?

A.subsample
B.num_round
C.max_depth
D.colsample_bytree
AnswerC

Reducing max_depth decreases tree depth and memory usage.

Why this answer

Option C is correct because decreasing max_depth reduces the size of each tree, which lowers memory consumption during training. Option A (subsample) reduces the number of rows per iteration but may not directly address tree size. Option D (colsample_bytree) reduces the number of features per tree but is less effective than max_depth.

Option B (num_round) sets the number of trees but does not directly reduce per-tree memory.

11
MCQeasy

A data scientist is building a model to predict customer churn based on historical data. The dataset has 10 features and 100,000 records, and the target is binary. Which algorithm is most appropriate for this binary classification problem?

A.Principal component analysis
B.K-means clustering
C.Linear regression
D.Logistic regression
AnswerD

Logistic regression is designed for binary classification and handles large datasets efficiently.

Why this answer

Logistic regression is a standard algorithm for binary classification, providing probabilistic outputs and interpretability. Linear regression is for regression, K-means is for clustering, and PCA is for dimensionality reduction.

12
MCQhard

A team is using SageMaker to run a large-scale distributed training job for a language model. They are using SageMaker's Pipe mode to stream data from S3 to reduce IO. They observe that the training throughput is lower than expected, and the CPU utilization is high while GPU utilization is low. The training script uses PyTorch's DataLoader with num_workers=0. The data preprocessing is minimal. Which change is most likely to improve GPU utilization?

A.Increase the number of data loading workers (num_workers).
B.Use a larger instance with more vCPUs.
C.Increase the number of GPUs per instance.
D.Switch from Pipe mode to File mode.
AnswerA

Correct: Increasing num_workers parallelizes data loading, reducing CPU bottleneck and improving GPU utilization.

Why this answer

Option A is correct because setting num_workers=0 forces the main process to load data, causing a CPU bottleneck. Increasing num_workers parallelizes data loading, reducing GPU idle time. Option C is wrong because adding GPUs does not address the data loading bottleneck.

Option B is wrong because more vCPUs without more workers does not help. Option D is wrong because switching to File mode does not remove the data loading bottleneck and adds the cost of downloading the dataset before training starts.

13
Multi-Selectmedium

A data scientist is training a deep learning model using SageMaker and wants to use distributed training across multiple GPUs to reduce training time. Which TWO actions should the scientist take to configure distributed training? (Select TWO.)

Select 2 answers
A.Reduce the number of epochs to match the number of GPUs
B.Use the SageMaker distributed data parallelism library
C.Manually split the training data into shards and upload to S3
D.Configure the SageMaker estimator with a distribution parameter
E.Set the instance count to 1 with a multi-GPU instance
AnswersB, D

The library automatically distributes data across GPUs.

Why this answer

The SageMaker distributed data parallelism library (option B) automatically partitions training data and synchronizes gradients across multiple GPUs, reducing training time without manual data splitting. Configuring the SageMaker estimator with a distribution parameter (option D) enables this library by specifying the distribution strategy (e.g., {'smdistributed': {'dataparallel': {'enabled': True}}} for the SageMaker data parallelism library, or {'torch_distributed': {'enabled': True}} for the native PyTorch launcher), which is required to activate distributed training.

Exam trap

The trap here is that candidates confuse single-instance multi-GPU training (option E) with true distributed training across multiple instances, or assume manual data sharding (option C) is required when SageMaker automates it.
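
As a configuration sketch, the estimator settings for the data parallelism library might look like the following. The instance type and count are assumptions for illustration, and the dictionary is only the subset of estimator arguments relevant to distribution; a real estimator also needs an entry point, role, and framework version.

```python
# Subset of SageMaker Python SDK estimator arguments relevant to distribution.
# Instance type and count are illustrative assumptions.
estimator_kwargs = {
    "instance_count": 2,
    "instance_type": "ml.p4d.24xlarge",
    # Enables the SageMaker distributed data parallelism library.
    "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
}
```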

14
MCQeasy

A data scientist is training a binary classification model using a dataset that has a severe class imbalance (90% negative, 10% positive). Which technique should be used to address the imbalance during model training?

A.Use a larger batch size
B.Use L2 regularization
C.Apply random oversampling of the minority class
D.Increase the learning rate
AnswerC

Random oversampling balances the class distribution by replicating minority class samples.

Why this answer

Random oversampling of the minority class (Option C) directly addresses class imbalance by duplicating examples from the positive class, which balances the training distribution and prevents the model from becoming biased toward the majority class. This technique is specifically designed to mitigate the skewed gradient updates that occur when the minority class is underrepresented, which typically improves recall for the positive class in binary classification tasks.

Exam trap

AWS often tests the misconception that hyperparameter tuning (like batch size or learning rate) can fix data imbalance, when in fact only data-level or algorithm-level techniques (e.g., oversampling, undersampling, or cost-sensitive learning) directly address the skewed class distribution.

How to eliminate wrong answers

Option A is wrong because using a larger batch size does not correct class imbalance; it may even exacerbate the issue by making each batch more likely to contain only majority-class samples, reducing the model's exposure to minority examples. Option B is wrong because L2 regularization is a technique to prevent overfitting by penalizing large weights, but it has no effect on the class distribution or the imbalance between positive and negative samples. Option D is wrong because increasing the learning rate can cause unstable training or divergence, and it does not address the underlying data imbalance; it may lead to the model ignoring minority class patterns altogether.
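
Random oversampling can be sketched in plain Python. The function name is illustrative, and the data mirrors the 90/10 split in the question; in practice the resampling would be applied to the training split only, so duplicated rows never reach the validation set.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


labels = [0] * 90 + [1] * 10
rows = list(range(100))
balanced_rows, balanced_labels = random_oversample(rows, labels)
```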

15
MCQhard

A team is deploying a machine learning model for real-time fraud detection. The model must have inference latency under 10 ms and handle up to 1000 requests per second. The model is a gradient boosting model using XGBoost. Which SageMaker hosting configuration is MOST cost-effective while meeting the requirements?

A.Use SageMaker Batch Transform with multiple instances
B.Use a SageMaker Multi-Model Endpoint (MME) on an ml.c5.4xlarge instance with auto scaling
C.Deploy on a single ml.c5.xlarge instance with a real-time endpoint
D.Deploy separate real-time endpoints for each model on ml.m5.large instances
AnswerB

MME allows multiple models to share a container, reducing cost while scaling to meet demand.

Why this answer

Option B is correct because a Multi-Model Endpoint (MME) on a single ml.c5.4xlarge instance allows multiple models to share the same endpoint, reducing cost while still meeting the latency (<10 ms) and throughput (1000 req/s) requirements. The ml.c5.4xlarge provides sufficient compute (16 vCPUs, 32 GB memory) for XGBoost inference, and auto scaling ensures capacity adjusts to handle peak load without over-provisioning.

Exam trap

The trap here is that candidates often assume a single large instance is insufficient for high throughput, but MME allows efficient resource sharing across models, making a single ml.c5.4xlarge cost-effective when the model is small and CPU-bound.

How to eliminate wrong answers

Option A is wrong because SageMaker Batch Transform is designed for offline, asynchronous inference on large datasets, not real-time fraud detection with sub-10 ms latency. Option C is wrong because a single ml.c5.xlarge instance (4 vCPUs, 8 GB memory) cannot handle 1000 requests per second with <10 ms latency for an XGBoost model; it would be CPU-bound and cause request throttling or timeouts. Option D is wrong because deploying separate real-time endpoints on ml.m5.large instances (2 vCPUs, 8 GB memory each) is cost-inefficient and would require many instances to meet throughput, increasing cost without latency benefit; also, ml.m5 instances are general-purpose, whereas XGBoost inference is CPU-intensive, making compute-optimized ml.c5 instances more suitable.

16
MCQmedium

An ML engineer is using Amazon SageMaker Automatic Model Tuning (AMT) to optimize hyperparameters for a gradient boosting model. The tuning job is taking a long time and has completed many training jobs. The engineer wants to stop training jobs that are unlikely to improve the objective metric. What should they configure?

A.Reduce the number of hyperparameter ranges
B.Use a random search strategy instead of Bayesian
C.Increase the maximum number of training jobs
D.Enable early stopping in the hyperparameter tuning job
AnswerD

Early stopping terminates training jobs that are not meeting an improvement threshold, reducing overall tuning time.

Why this answer

Early stopping in AMT automatically stops training jobs that are not improving the objective, saving time. Reducing hyperparameter ranges may narrow the search but does not stop unpromising jobs. Random search does not incorporate early stopping.

Increasing max jobs would prolong the process.
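
In the tuning job configuration, early stopping is a single setting. The fragment below is a sketch of the relevant part of a `HyperParameterTuningJobConfig`; the metric name and resource limits are assumptions for illustration.

```python
# Sketch of the tuning job configuration passed to CreateHyperParameterTuningJob.
tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Minimize",
        "MetricName": "validation:rmse",  # assumed metric for illustration
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 50,  # assumed budget
        "MaxParallelTrainingJobs": 5,
    },
    # "Auto" lets the tuning job stop unpromising training jobs; "Off" disables it.
    "TrainingJobEarlyStoppingType": "Auto",
}
```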

17
MCQhard

A company uses SageMaker to train a model with a large dataset stored in S3. They notice that the training job is taking longer than expected and the GPU utilization is low. Which action would most likely improve GPU utilization?

A.Increase the batch size
B.Disable distributed training
C.Use a smaller instance type
D.Decrease the batch size
AnswerA

Correct: Larger batch sizes better utilize GPU memory and compute.

Why this answer

Option A is correct because increasing the batch size can help saturate GPU compute. Option D is wrong because decreasing batch size would lower GPU utilization. Option C is wrong because using a smaller instance provides less compute.

Option B is wrong because disabling distributed training reduces parallelism.

18
MCQeasy

A data engineer needs to split a time-series dataset into training and validation sets for a forecasting model. Which split method should be used to avoid data leakage?

A.Use k-fold cross-validation with random shuffling.
B.Use feature importance scores to weight the splitting process.
C.Random split with 80% training and 20% validation.
D.Temporal split where training uses data up to a cutoff date and validation uses later data.
AnswerD

Time-series data must be split chronologically to preserve the temporal dependencies.

Why this answer

Option D is correct because for time-series data, a temporal split ensures that validation data comes from a later time period than training data. Option C is wrong because random splits can cause future data to leak into training. Option A is wrong because, while cross-validation is useful, it must be done in a time-aware manner (e.g., rolling origin); standard k-fold cross-validation with random shuffling is not appropriate.

Option B is wrong because feature importance is not a splitting method.
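
A temporal split can be sketched as a cutoff on the timestamp. The records and cutoff date below are made up for illustration.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split (date, value) records: training up to and including `cutoff`,
    validation strictly after it."""
    ordered = sorted(records, key=lambda r: r[0])
    train_set = [r for r in ordered if r[0] <= cutoff]
    valid_set = [r for r in ordered if r[0] > cutoff]
    return train_set, valid_set


# Twelve hypothetical monthly observations for 2023.
records = [(date(2023, m, 1), 100 + m) for m in range(1, 13)]
train_set, valid_set = temporal_split(records, cutoff=date(2023, 9, 1))
```

Every validation record is later than every training record, so nothing from the future can influence the fitted model.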

19
MCQmedium

A machine learning team is using Amazon SageMaker to train a model. They notice that the training job is taking longer than expected and the logs show repeated warnings about 'loss not decreasing'. Which SageMaker feature should they use to diagnose and visualize the training process?

A.Amazon SageMaker Clarify
B.Amazon SageMaker Experiments
C.Amazon SageMaker Debugger
D.Amazon SageMaker Model Monitor
AnswerC

Debugger provides real-time training diagnostics.

Why this answer

Amazon SageMaker Debugger is the correct choice because it provides real-time monitoring and visualization of training metrics, including loss values, gradients, and weights. The repeated 'loss not decreasing' warnings indicate a training issue (e.g., vanishing gradients or learning rate problems), and Debugger can capture these tensors and emit alerts or trigger actions (like stopping the job) via built-in or custom rules. It also integrates with SageMaker Studio for interactive visualization of the training progress.

Exam trap

The trap here is that candidates often confuse SageMaker Debugger with SageMaker Experiments, thinking both are for monitoring training metrics, but Experiments only logs high-level metrics (like final loss or accuracy) while Debugger provides deep, step-by-step tensor-level diagnostics for issues like loss stagnation.

How to eliminate wrong answers

Option A is wrong because Amazon SageMaker Clarify is designed for bias detection and explainability of model predictions, not for monitoring training metrics like loss. Option B is wrong because Amazon SageMaker Experiments is used for tracking and comparing different training runs (e.g., hyperparameters, metrics), but it does not provide real-time, in-depth debugging of internal tensors or loss plateaus during a single training job. Option D is wrong because Amazon SageMaker Model Monitor focuses on detecting data drift and quality issues in deployed models (inference endpoints), not on diagnosing training-time problems like loss stagnation.

20
MCQeasy

A team is developing a model to predict customer churn. The dataset has 10,000 samples with 20 features. The target variable is binary with 15% churn rate. The team wants to use logistic regression. Which data preprocessing step is MOST important to ensure proper convergence?

A.Remove correlated features to reduce multicollinearity
B.Impute missing values with the median
C.Apply SMOTE to balance the classes
D.Standardize the features to have zero mean and unit variance
AnswerD

Standardization ensures gradient descent converges faster and avoids dominance by large-scale features.

Why this answer

Logistic regression uses gradient descent or similar optimization algorithms that rely on the scale of the features. When features have different units or magnitudes, the cost function becomes elongated, causing slow or unstable convergence. Standardizing to zero mean and unit variance ensures that all features contribute equally to the gradient updates, leading to faster and more reliable convergence.

Exam trap

AWS often tests the misconception that class imbalance is the primary barrier to convergence, when in fact feature scaling is the fundamental requirement for optimization algorithms in logistic regression.

How to eliminate wrong answers

Option A is wrong because while multicollinearity can inflate standard errors in logistic regression, it does not prevent convergence; the model can still converge with correlated features, though interpretation may suffer. Option B is wrong because imputing missing values with the median is a general preprocessing step but is not the most critical for convergence; logistic regression can handle missing data through other methods, and median imputation does not address the scale issue. Option C is wrong because SMOTE addresses class imbalance, which affects model bias and performance metrics, but logistic regression can converge perfectly well on imbalanced data; the optimizer does not require balanced classes for convergence.
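
Standardization is the z-score transform, sketched here for a single feature column with the standard library. The income values are illustrative; in practice the mean and standard deviation would be computed on the training set and reused for validation and inference data.

```python
import statistics


def standardize(column):
    """Rescale a feature column to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # a constant column carries no signal
    return [(x - mean) / std for x in column]


income = [20_000.0, 40_000.0, 60_000.0, 80_000.0]
scaled = standardize(income)
```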

21
MCQmedium

Refer to the exhibit. A data scientist configured an automatic model tuning job for a classification model. The tuning job completed after 20 training jobs, but the best validation accuracy was only 0.65. What is the most effective way to potentially improve the result?

A.Increase MaxNumberOfTrainingJobs to 100
B.Change the strategy to Random
C.Change the objective metric to training:accuracy
D.Increase MaxParallelTrainingJobs to 10
AnswerA

More training jobs allow Bayesian optimization to explore more hyperparameter combinations, potentially finding a better optimum.

Why this answer

With only 20 training jobs, Bayesian optimization may not have fully explored the hyperparameter space. Increasing the maximum number of training jobs allows more exploration and increases the chance of finding better hyperparameters. Changing to random search could help but Bayesian is generally more efficient.

Changing the objective to training accuracy would not improve generalization. Increasing parallel jobs does not increase total exploration.

22
MCQhard

A company wants to forecast monthly sales that show clear seasonality. Which algorithm is most suitable?

A.ARIMA (Seasonal ARIMA)
B.Random forest
C.K-means clustering
D.Linear regression
AnswerA

Seasonal ARIMA explicitly models seasonality and autocorrelation, ideal for seasonal time series.

Why this answer

ARIMA (specifically SARIMA for seasonality) is designed for univariate time series forecasting with seasonal patterns. Linear regression may capture trends but does not model seasonality on its own, random forest can be used but is not optimal for time series, and K-means is a clustering algorithm, not a forecasting method.

23
MCQeasy

A company has a trained machine learning model that needs to be deployed as a real-time inference endpoint on Amazon SageMaker. The endpoint must automatically scale based on incoming traffic. Which SageMaker feature should be used?

A.SageMaker Endpoint Auto Scaling
B.SageMaker Elastic Inference
C.SageMaker Batch Transform
D.SageMaker Model Monitor
AnswerA

Auto Scaling automatically adjusts the instance count based on configured policies to handle traffic changes.

Why this answer

SageMaker Endpoint Auto Scaling adjusts the number of instances behind an endpoint based on demand. Batch Transform is for batch predictions, Model Monitor for monitoring, and Elastic Inference for accelerating inference.

24
MCQhard

A company wants to use a pre-trained NLP model from SageMaker JumpStart for sentiment analysis. Which step is required to make predictions?

A.Label the dataset for fine-tuning
B.Train the model from scratch on the company's data
C.Convert the model to ONNX format
D.Deploy the model to an endpoint
AnswerD

Deploying to a SageMaker endpoint allows real-time inference on new data.

Why this answer

D is correct because SageMaker JumpStart provides pre-trained models that are ready for inference without additional training. To make predictions, you must deploy the model to a SageMaker endpoint, which creates a hosted inference endpoint that can accept input data and return sentiment analysis results.

Exam trap

AWS often tests the misconception that pre-trained models require fine-tuning or additional data preparation before inference, when in fact they can be used directly for predictions after deployment to an endpoint.

How to eliminate wrong answers

Option A is wrong because labeling the dataset for fine-tuning is only necessary if you want to adapt the pre-trained model to a specific domain or task, but it is not required for making predictions with the pre-trained model as-is. Option B is wrong because training from scratch defeats the purpose of using a pre-trained model from JumpStart, which is designed to avoid the cost and time of training from scratch. Option C is wrong because converting the model to ONNX format is an optimization step for cross-platform deployment or performance, but it is not a prerequisite for making predictions with SageMaker JumpStart models, which natively support SageMaker inference.

25
MCQeasy

A company wants to build a machine learning model to predict house prices based on features like square footage, number of bedrooms, and location. The target variable is a continuous numeric value. Which Amazon SageMaker built-in algorithm is most appropriate for this task?

A.Object2Vec
B.XGBoost
C.Linear Learner
D.BlazingText
AnswerC

Linear Learner is designed for regression and classification, and is the most direct choice for predicting a continuous value with linear relationships.

Why this answer

Linear Learner is the most appropriate built-in algorithm for this regression task because it is specifically designed for predicting continuous numeric values (house prices) using linear models. It supports both regression and classification, and for regression, it minimizes mean squared error (MSE) to fit a linear relationship between features and the target variable. The algorithm also offers automatic feature scaling and model tuning, making it a direct fit for this use case.

Exam trap

The trap here is that candidates often choose XGBoost (Option B) because it is a popular and powerful algorithm for tabular data, but the question specifically asks for the most appropriate built-in algorithm for a linear regression task, and Linear Learner is the direct, optimized choice for that purpose.

How to eliminate wrong answers

Option A (Object2Vec) is wrong because it is designed for learning embeddings from pairs of objects (e.g., recommendation systems or similarity tasks), not for regression on tabular data. Option B (XGBoost) is wrong because, although it is also a SageMaker built-in algorithm and supports regression, it is a gradient-boosted tree algorithm better suited to structured data with complex non-linear relationships; the expected answer treats this as a linear regression task, for which Linear Learner is the direct choice. Option D (BlazingText) is wrong because it is designed for natural language processing tasks like word embeddings and text classification, not for numerical regression on tabular features.

26
Multi-Selectmedium

A data scientist is preparing text data for a sentiment analysis model using Amazon SageMaker. Which two data preprocessing techniques are commonly used when working with text data for natural language processing? (Choose two.)

Select 2 answers
A.One-hot encoding of all words
B.Image resizing
C.Tokenization
D.Principal component analysis (PCA)
E.Stop word removal
AnswersC, E

Tokenization splits text into tokens (words or subwords), a fundamental step in NLP preprocessing.

Why this answer

Stop word removal and tokenization are standard text preprocessing steps. One-hot encoding of all words leads to high dimensionality and is rarely used directly. Image resizing is for images, and PCA is for numerical dimensionality reduction.
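
Both steps can be sketched with the standard library. The stop word list here is a tiny illustrative set, not a standard one, and the regular expression is a simple word tokenizer rather than a subword tokenizer.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "and", "was", "but", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("The delivery was late, but the product is great!")
filtered = remove_stop_words(tokens)
```

For sentiment analysis it is usually worth checking that negations such as "not" are not in the stop word list, since removing them can flip the meaning of a sentence.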

27
MCQeasy

Refer to the exhibit. A data scientist ran a training job using a custom algorithm container. The job failed with the error shown. What is the most likely cause?

A.The S3 output path is incorrect
B.The algorithm script references an undefined variable or metric named 'loss'
C.The training image is not accessible
D.The instance type is insufficient
AnswerB

The error directly states it cannot evaluate 'loss', meaning the variable is not defined or out of scope.

Why this answer

The error 'Cannot evaluate expression: loss' indicates that the training script attempted to compute or log a variable named 'loss' that is not defined in the code. The training image access, S3 output path, and instance type are not related to this specific error.

28
MCQmedium

A team is using SageMaker for automatic model tuning. They want to minimize the mean absolute error (MAE) and have a budget of 50 training jobs. Which tuning strategy should they choose to best explore the hyperparameter space?

A.Grid search
B.Hyperband
C.Random search
D.Bayesian optimization
AnswerD

Bayesian optimization uses past results to guide the search and is sample-efficient.

Why this answer

Bayesian optimization builds a probabilistic model of the objective function and is effective for finding good hyperparameters with a limited budget. Other strategies are less efficient: random search is less directed, grid search is expensive, and Hyperband is better with larger budgets.

29
Multi-Selectmedium

Which THREE steps are part of the typical workflow when using SageMaker built-in algorithms?

Select 3 answers
A.Set up a real-time inference endpoint
B.Create a training job
C.Create a custom training image
D.Set hyperparameters
E.Monitor training with CloudWatch
AnswersB, D, E

A training job is required to start model training.

Why this answer

Creating a training job, setting hyperparameters, and monitoring training with CloudWatch are core steps. Creating a training image is not required for built-in algorithms (they are provided by AWS), and setting up an inference endpoint is a deployment step after training.

30
MCQmedium

A company is using SageMaker to train a model for image classification. They have a dataset of 10,000 images. They use SageMaker's built-in image classification algorithm with transfer learning. During training, they notice that the training job completes successfully but the model accuracy on the validation set is very low (~30%). They suspect the model is underfitting. Which action is most likely to improve accuracy?

A.Use a different algorithm.
B.Add more layers to the model architecture.
C.Use a smaller batch size.
D.Increase the number of training epochs.
AnswerD

Correct: More epochs allow the model to learn patterns better, reducing underfitting.

Why this answer

Option D is correct because underfitting often results from insufficient training. Increasing the number of epochs gives the model more opportunity to learn. Option C is wrong because a smaller batch size may help but not as directly as more epochs.

Option B is wrong because adding layers could lead to overfitting if not regularized. Option A is wrong because changing the algorithm may not address underfitting.

31
MCQmedium

A data scientist has trained a model that achieves 95% accuracy on the training set but only 70% on the test set. Which of the following is the most likely cause?

A.Data leakage
B.Overfitting
C.Convergence to local minimum
D.Underfitting
AnswerB

Overfitting causes high training accuracy and low test accuracy due to poor generalization.

Why this answer

Option B is correct because a large gap between training and test accuracy indicates overfitting, where the model memorizes the training data but fails to generalize. Option D (underfitting) would show low accuracy on both. Option A (data leakage) could cause high accuracy on both if the leak is consistent.

Option C (convergence to local minimum) is a training issue but does not directly explain the gap.

32
MCQeasy

A data scientist is training a binary classification model using imbalanced data where the positive class is only 1% of the dataset. The scientist wants to maximize the recall for the positive class while maintaining reasonable precision. Which evaluation metric is most appropriate to tune during model selection?

A.Log loss
B.Area under the ROC curve (AUC)
C.F1 score
D.Accuracy
AnswerC

F1 score combines precision and recall, making it suitable for imbalanced classes when both matter.

Why this answer

The F1 score is the harmonic mean of precision and recall, making it ideal for imbalanced datasets where the positive class is only 1%. By tuning the F1 score, the data scientist directly balances the trade-off between maximizing recall (capturing true positives) and maintaining reasonable precision (avoiding false positives), which aligns with the stated goal.

Exam trap

AWS often tests the misconception that AUC-ROC is always the best metric for imbalanced data, but the trap here is that AUC-ROC can remain high even when the model fails to recall the minority class, whereas the F1 score directly penalizes poor recall.

How to eliminate wrong answers

Option A is wrong because log loss measures the probabilistic accuracy of predictions, penalizing confident wrong predictions, but it does not directly optimize recall or precision for the minority class in imbalanced data. Option B is wrong because AUC-ROC evaluates the model's ability to rank positive instances higher than negative ones across all thresholds, but it can be misleadingly high even when recall for the minority class is poor, as it is insensitive to class imbalance. Option D is wrong because accuracy is the ratio of correct predictions to total predictions, and with only 1% positive class, a model that predicts all negatives achieves 99% accuracy, completely failing to capture any positive instances.
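
The accuracy argument above can be checked with a small worked example. This is a sketch using a made-up dataset of 1,000 samples with 1% positives and a hypothetical model that always predicts the negative class; the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# 10 positives (1%) and 990 negatives; the model predicts negative for everything.
y_true = [1] * 10 + [0] * 990
all_negative = [0] * 1000

acc = accuracy(y_true, all_negative)
precision, recall, f1 = precision_recall_f1(y_true, all_negative)
print(acc, precision, recall, f1)  # 0.99 0.0 0.0 0.0
```

Accuracy looks excellent while recall and F1 are zero, which is why F1 is the more informative tuning target here.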

33
MCQhard

A machine learning engineer runs a SageMaker HyperparameterTuningJob with Bayesian optimization strategy. The job terminates earlier than the specified MaxNumberOfTrainingJobs. The engineer notices that the best objective metric value has not improved for several consecutive jobs. What is the most likely adjustment to make?

A.Adjust the early stopping tolerance (e.g., increase the number of consecutive jobs with no improvement allowed).
B.Switch to a grid search strategy to cover all hyperparameter combinations.
C.Increase the MaxNumberOfTrainingJobs parameter to allow more exploration.
D.Decrease the number of hyperparameters being tuned.
AnswerA

Early stopping is likely too aggressive; increasing the tolerance allows more exploration before terminating.

Why this answer

Option A is correct because the tuning job is being stopped early to avoid wasting resources once the objective metric stops improving, and that early stopping tolerance can be configured to be less aggressive. Option C is wrong because increasing max jobs would not help when the job terminates before reaching that limit.

Option D is wrong because decreasing the number of hyperparameters may reduce the search space but does not address early stopping. Option B is wrong because grid search is less efficient and would abandon the benefits of Bayesian optimization.

34
MCQeasy

A machine learning engineer trains a binary classifier and obtains an accuracy of 95% on the test set. The dataset is imbalanced with 95% positive class. What is the most important metric to evaluate the model's performance?

A.R-squared
B.F1 score
C.Accuracy
D.RMSE
AnswerB

F1 score combines precision and recall, making it suitable for imbalanced classification.

Why this answer

Accuracy is misleading on imbalanced datasets because a model that always predicts the majority class achieves high accuracy. F1 score balances precision and recall, providing a more reliable measure.

35
Multi-Selecthard

A data scientist is training a large transformer model using SageMaker's model parallelism library. The training job is failing with an out-of-memory (OOM) error. Which two actions can help resolve the OOM error? (Choose two.)

Select 2 answers
A.Reduce the sequence length
B.Enable activation checkpointing
C.Increase the batch size per GPU
D.Switch to a smaller instance type
E.Decrease the pipeline parallelism degree
AnswersA, B

Shorter sequences directly reduce memory usage for attention and hidden states.

Why this answer

Options A and B are correct. Activation checkpointing (B) trades compute for memory by recomputing activations during backpropagation rather than storing them. Reducing sequence length (A) directly decreases memory usage for attention layers.

Option E (decrease pipeline parallelism degree) can increase per-stage memory. Option C (increase batch size) increases memory. Option D (smaller instance) reduces available memory, worsening OOM.
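
A rough back-of-the-envelope calculation shows why sequence length matters so much. The sketch below assumes a plain attention implementation that materializes the full score matrix of shape (batch, heads, seq_len, seq_len) in 16-bit precision; the batch size and head count are made-up values, and memory-efficient attention kernels would behave differently.

```python
def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_value=2):
    """Bytes needed for one layer's attention score matrix.

    Assumes the full (batch, heads, seq_len, seq_len) matrix is stored.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


# Hypothetical configuration: batch of 8, 16 heads, fp16 values.
full = attention_scores_bytes(batch_size=8, num_heads=16, seq_len=2048)
half = attention_scores_bytes(batch_size=8, num_heads=16, seq_len=1024)

print(full // 2**20, "MiB at seq_len=2048")  # 1024 MiB
print(half // 2**20, "MiB at seq_len=1024")  # 256 MiB
```

Under these assumptions, halving the sequence length cuts this term to a quarter, because it grows with the square of the sequence length.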

36
Multi-Selectmedium

A company is training a deep learning model using SageMaker's built-in PyTorch framework. They want to optimize training performance. Which THREE actions should they take? (Choose THREE.)

Select 3 answers
A.Use Pipe mode to stream data from S3
B.Use a spot instance for training
C.Enable SageMaker Debugger for profiling
D.Increase the number of workers in the DataLoader
E.Use a SageMaker ML Storage volume for checkpointing
AnswersA, C, D

Correct: Pipe mode reduces IO overhead by streaming data directly.

Why this answer

Options A, C, and D are correct. Using SageMaker Debugger for profiling (C) helps identify bottlenecks. Pipe mode (A) streams data from S3 efficiently.

Increasing DataLoader workers (D) parallelizes data loading. Option E is wrong because checkpoint storage does not directly improve performance. Option B is wrong because spot instances reduce cost but not performance.

37
Multi-Selectmedium

A data scientist is building a text classification model using a pre-trained BERT model from the Hugging Face library on SageMaker. The scientist wants to fine-tune the model on a custom dataset. Which TWO steps are necessary to set up the fine-tuning job? (Select TWO.)

Select 2 answers
A.Use the HuggingFace estimator provided by SageMaker
B.Enable SageMaker Clarify for explainability during training
C.Build a custom Docker container with PyTorch and Transformers
D.Specify the PyTorch framework version and Transformers version in the estimator
E.Use SageMaker Processing to preprocess the data in parallel
AnswersA, D

The HuggingFace estimator simplifies fine-tuning with pre-built containers.

Why this answer

Option A is correct because the SageMaker HuggingFace estimator is specifically designed to simplify fine-tuning of pre-trained Hugging Face models like BERT. It handles the underlying infrastructure through pre-built containers for supported PyTorch/TensorFlow and Transformers versions, without requiring custom Docker containers. Option D is also correct because the estimator selects its pre-built container from the framework and Transformers versions passed to it, so those versions must be specified. This is the recommended approach for Hugging Face model fine-tuning on SageMaker.

Exam trap

AWS often tests the misconception that custom Docker containers are required for any non-standard framework, but the HuggingFace estimator eliminates that need by providing a managed environment with version control.

38
Multi-Selectmedium

Which TWO SageMaker Pipelines steps are essential for automating a complete ML workflow from data processing to model deployment? (Choose 2.)

Select 2 answers
A.A TuningStep for hyperparameter tuning.
B.A ProcessingStep to run data preprocessing and feature engineering.
C.A TransformStep for batch inference on the training data.
D.A CreateModelStep (or RegisterModelStep) to register or deploy the trained model.
E.A ConditionStep to decide whether to train a model based on data quality.
AnswersB, D

Processing is typically required to prepare data.

Why this answer

Options B and D are correct. A ProcessingStep runs data preprocessing and feature engineering, and a CreateModelStep (or RegisterModelStep) creates or registers the trained model so that it can be deployed. Option A (TuningStep), Option C (TransformStep), and Option E (ConditionStep) are wrong because hyperparameter tuning, batch inference, and conditional branching are optional parts of a pipeline.

A training step is also needed to produce a model, but it is not among the listed options, so the two essential steps to choose here are processing and model creation.

39
MCQhard

A machine learning engineer is using SageMaker to train a model with the built-in LightGBM algorithm. The engineer wants to use early stopping to prevent overfitting. The training job is configured with a validation dataset. Which hyperparameter should be set to enable early stopping?

A.early_stopping_rounds
B.num_iterations
C.early_stopping
D.num_boost_round
AnswerA

early_stopping_rounds triggers early stopping after a specified number of rounds without validation improvement.

Why this answer

Option A is correct because the SageMaker LightGBM implementation uses the early_stopping_rounds hyperparameter to specify the number of consecutive rounds without improvement before stopping. Option B (num_iterations) sets the maximum number of rounds. Option D (num_boost_round) is an alias for the number of boosting rounds.

Option C (early_stopping) is not a valid hyperparameter name.
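
The idea behind a "rounds without improvement" setting can be sketched in a few lines of plain Python. This is a conceptual illustration only, not LightGBM's internal implementation; the function name, the `patience` parameter, and the validation losses are hypothetical.

```python
def early_stopping_round(val_losses, patience):
    """Return the 1-based round at which training would stop.

    Training stops once `patience` consecutive rounds pass without a new
    best validation loss. If that never happens, all rounds are run.
    """
    best = float("inf")
    rounds_without_improvement = 0
    for round_number, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return round_number
    return len(val_losses)


# Made-up validation losses: the best value so far is reached at round 3.
losses = [0.60, 0.52, 0.48, 0.49, 0.50, 0.51, 0.47]
stopped_at = early_stopping_round(losses, patience=3)
print(stopped_at)  # 6
```

With a patience of 3, training stops at round 6 and never sees the improvement at round 7, which illustrates the trade-off in choosing this value.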

40
MCQmedium

A data scientist has trained a binary classification model for fraud detection. The dataset is highly imbalanced (99% non-fraud, 1% fraud). After evaluation, the model shows an accuracy of 99%, but the recall for fraud cases is only 10%. Which metric should the data scientist prioritize to improve the model's performance for fraud detection?

A.Log loss
B.F1-score
C.Precision
D.Area under the ROC curve (AUC-ROC)
AnswerB

F1-score is the harmonic mean of precision and recall, making it a balanced metric for imbalanced classification.

Why this answer

F1-score balances precision and recall, making it more informative than accuracy for imbalanced datasets. AUC-ROC is also used but F1 directly addresses the trade-off between false positives and false negatives. Precision alone does not capture recall, and Log loss does not directly indicate recall improvement.

41
MCQhard

A financial services company is deploying a real-time fraud detection model using Amazon SageMaker. The model is a gradient boosting model (XGBoost) trained on historical transaction data. The inference endpoint uses an ml.m5.2xlarge instance with a single variant. Recently, the company has experienced a 3x increase in transaction volume during peak hours, causing inference latency to exceed the 200ms SLA. The data science team has already optimized the model by reducing the number of trees and feature set, but the latency remains high during spikes. The team considers using SageMaker's built-in scaling policies. They currently have a single endpoint with one production variant. The team wants to maintain low latency without over-provisioning resources. They have ruled out model changes. Which approach should the team take?

A.Configure an Application Auto Scaling target tracking scaling policy for the variant based on the 'SageMakerVariantInvocationsPerInstance' metric, with a target value that keeps the inference latency within the SLA.
B.Deploy the model on multiple endpoints behind an Application Load Balancer.
C.Use scheduled scaling to increase the instance count during known peak hours.
D.Manually increase the instance count during peak hours.
AnswerA

This auto-scales based on load.

Why this answer

Option A is correct because SageMaker's built-in target tracking scaling policy using the 'SageMakerVariantInvocationsPerInstance' metric allows the endpoint to automatically adjust the instance count based on real-time invocation load. By setting a target value that correlates with the 200ms SLA, the policy dynamically scales out during traffic spikes and scales in during lulls, preventing over-provisioning while maintaining low latency. This approach directly addresses the 3x peak-hour volume increase without requiring manual intervention or model changes.

Exam trap

The trap here is that candidates may confuse scheduled scaling (Option C) as a valid solution for predictable peaks, but the question's emphasis on 'real-time' and 'without over-provisioning' points to dynamic scaling, which target tracking provides; scheduled scaling cannot adapt to unexpected volume variations within the peak window.

How to eliminate wrong answers

Option B is wrong because deploying multiple endpoints behind an Application Load Balancer adds unnecessary complexity and does not leverage SageMaker's native auto-scaling capabilities; it also introduces additional latency from the load balancer and requires manual management of endpoint distribution. Option C is wrong because scheduled scaling assumes predictable peak hours, but the question states the volume increase occurs 'during peak hours' which may vary day-to-day; scheduled scaling cannot react to real-time spikes and may over-provision or under-provision if the timing shifts. Option D is wrong because manually increasing the instance count during peak hours is reactive, error-prone, and violates the requirement to avoid over-provisioning; it also requires constant human monitoring and cannot scale down automatically when traffic subsides.
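
A target tracking policy of this kind is usually expressed as two Application Auto Scaling requests: one that registers the variant as a scalable target and one that attaches the policy. The sketch below only builds the request dictionaries and does not call AWS. The endpoint name, variant name, capacity limits, and target value are hypothetical; with boto3 the dictionaries would be passed to the `application-autoscaling` client's `register_scalable_target` and `put_scaling_policy` calls.

```python
def build_scaling_requests(endpoint_name, variant_name, target_invocations,
                           min_instances=1, max_instances=4):
    """Build request payloads for target tracking on a SageMaker variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # Payload for register_scalable_target: bounds on the instance count.
    register_request = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }

    # Payload for put_scaling_policy: track invocations per instance.
    policy_request = {
        "PolicyName": f"{variant_name}-invocations-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return register_request, policy_request


# Hypothetical names and target value, chosen only for illustration.
register_request, policy_request = build_scaling_requests(
    "fraud-endpoint", "AllTraffic", target_invocations=500
)
print(policy_request["ResourceId"])  # endpoint/fraud-endpoint/variant/AllTraffic
```

The target value would normally come from load testing: it is the per-instance invocation rate at which latency still stays within the SLA.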

42
MCQhard

A machine learning engineer is using SageMaker Automatic Model Tuning (AMT) to optimize hyperparameters for a random forest model. The engineer notices that the tuning job is taking too long and many hyperparameter combinations are being evaluated but not improving the objective metric. Which action should the engineer take to make the tuning more efficient?

A.Switch the strategy from Bayesian to random search
B.Use a smaller instance type for each training job
C.Increase the maximum number of training jobs
D.Enable early stopping for the tuning job
AnswerD

Early stopping terminates poorly performing trials, reducing wasted computation.

Why this answer

Option D is correct because enabling early stopping in SageMaker Automatic Model Tuning (AMT) terminates poorly performing training jobs before they complete, which reduces wasted compute time and speeds up the tuning process. This is especially effective when using Bayesian optimization, as it allows the algorithm to focus on promising hyperparameter regions and avoid evaluating combinations that are unlikely to improve the objective metric.

Exam trap

The trap here is that candidates may confuse early stopping with reducing instance size or changing search strategies, not realizing that early stopping directly addresses wasted computation on poor trials without sacrificing search quality.

How to eliminate wrong answers

Option A is wrong because switching from Bayesian to random search would likely make the tuning less efficient, as random search does not use past results to guide future evaluations and often requires more trials to find optimal hyperparameters. Option B is wrong because using a smaller instance type for each training job reduces per-job compute capacity, which can slow down individual training runs and may not address the core issue of evaluating many unproductive combinations. Option C is wrong because increasing the maximum number of training jobs would evaluate even more hyperparameter combinations, prolonging the tuning job and potentially increasing wasted resources without improving efficiency.

43
MCQmedium

A machine learning engineer observes that a SageMaker training job fails with the error shown in the exhibit. What is the most likely cause of the failure?

A.The SageMaker execution role does not have an IAM policy that grants read access to the S3 bucket containing the training data.
B.The training data is stored in an unsupported format like Parquet.
C.The training job is using an incorrect AWS Region for the S3 bucket.
D.The VPC configuration prevents the training job from reaching the S3 bucket.
AnswerA

The error message explicitly says 'Unable to locate credentials', indicating missing permissions for the role.

Why this answer

The error clearly states that the SageMaker execution role lacks the necessary permissions to download data. The role assigned to the training job must have S3 read access. Option A is correct.

Option B is incorrect because the error does not mention data format. Option C is incorrect because the error is explicit about credentials, not the Region. Option D is incorrect because the error is not about network connectivity.

44
MCQeasy

Refer to the exhibit. A SageMaker training job failed. Based on the error message, which action should the engineer take?

A.Change the algorithm
B.Use a larger instance type
C.Increase the volume size
D.Increase the instance count
AnswerB

A larger instance type has more memory, addressing the out-of-memory error.

Why this answer

The error indicates insufficient instance memory. The ml.m5.large instance has limited memory; using a larger instance type (e.g., ml.m5.xlarge) provides more memory. Increasing instance count would distribute but not increase per-instance memory; volume size affects storage, not RAM; changing the algorithm may not help.

45
MCQeasy

Refer to the exhibit. A user launches a SageMaker notebook instance with this lifecycle configuration. What happens?

A.The script runs every time the notebook is started
B.The script runs after each kernel reset
C.The script runs only on the first start
D.The script runs only when creating the instance
AnswerA

Lifecycle configurations execute on each instance start event.

Why this answer

SageMaker lifecycle configurations run every time the notebook instance is started (including after stop and start). They do not run only on first start or after kernel resets.

46
MCQhard

Refer to the exhibit. The training job failed. What is the MOST likely cause?

A.The learning rate is too high
B.The instance type does not have SSD storage
C.The instance type does not have enough memory
D.The training data size exceeds the available EBS volume size
E.The number of epochs is too low
AnswerD

The error 'No usable scratch space' indicates disk space exhaustion on the EBS volume.

Why this answer

The log indicates no usable scratch space, meaning the default EBS volume is full. The retry with local SSD suggests that the instance might not have local SSD, or it also fails. The most likely cause is that the training data or intermediate files exceed the available EBS volume size.

47
MCQmedium

A company uses Amazon SageMaker to train a custom XGBoost model. The training job runs on a single ml.m5.large instance and takes 2 hours. To reduce training time without changing the algorithm, what should the data scientist do?

A.Increase the number of epochs
B.Use SageMaker's built-in XGBoost algorithm
C.Enable automatic model tuning
D.Use a larger instance type
AnswerD

A larger instance offers more compute resources, reducing training time for the same algorithm.

Why this answer

Using a larger instance type (e.g., ml.m5.xlarge) provides more compute capacity, directly reducing training time. Automatic model tuning adds overhead and does not reduce time. Built-in XGBoost runs faster but changes the algorithm.

Increasing epochs would increase training time.

48
Multi-Selectmedium

A data engineer is preparing a dataset for training a regression model. The dataset contains numerical features with missing values. Which two methods are appropriate for handling missing values? (Choose two.)

Select 2 answers
A.Perform one-hot encoding on the feature
B.Replace missing values with a constant, such as -999
C.Replace missing values with the mean of the feature
D.Use a model that supports missing values natively, such as XGBoost
E.Remove all rows with missing values
AnswersC, D

Mean imputation is simple and preserves data size, though it may reduce variance.

Why this answer

Options C and D are correct. Imputing missing values with the feature mean (C) is a common and straightforward technique. Using a model that supports missing values natively (D), like XGBoost or LightGBM, can handle missing data without explicit imputation.

Option E (removing rows) may discard valuable data. Option B (constant imputation) can introduce bias. Option A (one-hot encoding) is for categorical features.
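
Mean imputation is simple enough to show directly. This is a minimal sketch in plain Python in which missing entries are represented as `None`; the function name and the sample column are illustrative.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Made-up feature column with two missing entries.
ages = [30.0, None, 50.0, 40.0, None]
imputed = impute_mean(ages)
print(imputed)  # [30.0, 40.0, 50.0, 40.0, 40.0]
```

The dataset keeps all five rows, but every filled-in value sits exactly at the mean, which is why this technique tends to reduce the feature's variance.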

49
MCQeasy

A data scientist is training a deep learning model on SageMaker and notices that the training loss oscillates and does not converge. They want to debug this issue. Which SageMaker feature can they use to monitor and analyze the training process?

A.SageMaker Profiler
B.SageMaker Gradient Descent optimization
C.SageMaker Debugger
D.SageMaker Automatic Model Tuning
AnswerC

Correct: Debugger can monitor training metrics and alert on anomalies.

Why this answer

Option C is correct because SageMaker Debugger provides monitoring, visualization, and built-in rules to detect issues like oscillating loss. Option D is wrong because Automatic Model Tuning is for hyperparameter search, not real-time monitoring. Option A is wrong because SageMaker Profiler focuses on resource utilization.

Option B is wrong because Gradient Descent is a method, not a feature.

50
Multi-Selecthard

A machine learning engineer is using SageMaker's HyperparameterTuningJob to optimize a neural network. The engineer observes that the tuning job is taking too long. Which three actions can reduce the tuning time? (Choose three.)

Select 3 answers
A.Use a warm start from a previous tuning job
B.Use early stopping to prune poorly performing training jobs
C.Switch to a smaller instance type for each training job
D.Increase the number of concurrent training jobs
E.Reduce the number of hyperparameter combinations by using a smaller search space
AnswersB, D, E

Early stopping kills underperforming trials early, saving time.

Why this answer

Options B, D, and E are correct. Reducing the search space (E) decreases the number of configurations to try. Early stopping (B) terminates poorly performing trials early.

Increasing concurrent jobs (D) runs multiple trials in parallel. Option C (smaller instance) may slow each trial, increasing total time. Option A (warm start) does not reduce the time of the current tuning job.

51
MCQeasy

The training job completes successfully but the model performance is poor. What is a likely cause?

A.The training data is not shuffled.
B.The instance type is too small for the dataset.
C.The number of rounds (num_round) is too high.
D.The max_depth hyperparameter is too high, leading to overfitting.
AnswerD

A max_depth of 10 can cause overfitting, especially on smaller datasets, resulting in poor generalization.

Why this answer

A max_depth of 10 is high for many datasets, often leading to overfitting. Overfitting results in poor generalization. The other options are less likely: num_round=50 is moderate, instance type is not directly related to model performance, and data shuffling is not specified but the primary issue is hyperparameter choice.

52
MCQmedium

An ML engineer creates a SageMaker inference pipeline with two containers: a preprocessor and a predictor. The preprocessor is a lightweight Python script that transforms input data. How should the engineer structure the endpoints to ensure both containers run sequentially?

A.Use batch transform with two transform jobs chained together.
B.Use an AWS Lambda function as a proxy to invoke the preprocessor and then the predictor separately.
C.Combine the preprocessor and predictor into a single Docker container.
D.Create a PipelineModel in SageMaker with both containers listed in order: first preprocessor, then predictor.
AnswerD

PipelineModel automatically sends the output of the first container as input to the second.

Why this answer

Option D is correct because SageMaker inference pipelines chain containers in the order specified in the pipeline model. You create a PipelineModel with a list of containers. Option B is wrong because Lambda is not needed; the pipeline handles sequencing.

Option A is wrong because batch transform is for batch processing, not sequential inference. Option C is wrong because using a single container would require bundling, which is less modular.

53
Multi-Selecteasy

A data scientist is splitting a dataset into training and test sets. Which two practices should they follow? (Select TWO.)

Select 2 answers
A.Shuffle the data before splitting
B.Use stratified sampling for classification to preserve class proportions
C.Use a 50/50 split to maximize test data
D.Ensure the test set is representative of real-world distribution
E.Use a 80/20 split for large datasets
AnswersA, D

Shuffling prevents bias from data order and is a standard practice.

Why this answer

Shuffling the data before splitting ensures randomness and prevents ordering biases. Ensuring the test set is representative of real-world distribution (e.g., using stratified sampling) improves generalization evaluation. An 80/20 split is common but not always optimal; 50/50 is not recommended.

Stratified sampling is a specific technique for classification, but the general practice of representativeness is broader.
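
Shuffling and stratification can be combined in a short standard-library sketch. The function name, the 20% test fraction, the fixed seed, and the toy labels are assumptions made for illustration; libraries such as scikit-learn provide equivalent utilities.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before splitting
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])

    # Shuffle again so the classes are not grouped together in the output.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy dataset: 10 positives and 90 negatives.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))         # 80 20
print(sum(labels[i] for i in test_idx))      # 2 positives in the test set
```

Both splits keep the original 10% positive rate, so the test set reflects the class balance of the full dataset.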

54
MCQeasy

Refer to the exhibit. The data scientist wants to update the endpoint to use a new model version without downtime. Which approach should they use?

A.Delete the existing endpoint and create a new one
B.Update the endpoint's model name directly
C.Create a new endpoint configuration with a second variant and update the endpoint
D.Use SageMaker Model Monitor to automatically switch
AnswerC

This allows a blue/green deployment with zero downtime by shifting traffic gradually or instantly.

Why this answer

To update an endpoint without downtime, create a new endpoint configuration that includes both the old and new model variants, then update the endpoint to use that configuration. The endpoint gradually shifts traffic to the new variant. Deleting the endpoint causes downtime, direct model name update is not allowed, and Model Monitor does not handle deployments.

55
MCQeasy

A data scientist needs to store training data in Amazon S3 and wants to optimize read performance for iterative training jobs. Which S3 feature should they use?

A.S3 Transfer Acceleration
B.S3 Glacier
C.S3 Byte-Range Fetches
D.S3 Select
AnswerC

Allows parallel range requests to improve throughput for large objects.

Why this answer

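S3 Byte-Range Fetches enable parallel reads of parts of objects, improving read performance for large files during training. The other options serve different purposes: S3 Transfer Acceleration speeds up transfers over long distances, S3 Glacier is for archival, and S3 Select is for querying a subset of an object's data.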
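
To make the idea concrete, the sketch below splits an object into byte ranges that could be requested in parallel. It only computes the range strings and does not contact S3; the function name and sizes are illustrative. Each string has the form used by the HTTP `Range` header, which is also what the `Range` parameter of an S3 GetObject request expects.

```python
def byte_ranges(object_size, part_size):
    """Split an object of `object_size` bytes into inclusive Range strings."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    ranges = []
    start = 0
    while start < object_size:
        # Range ends are inclusive, so the last byte index is size - 1.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


# A made-up 10-byte object fetched in 4-byte parts.
parts = byte_ranges(object_size=10, part_size=4)
print(parts)  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```

Each range can then be fetched by a separate worker and the parts concatenated in order.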

56
MCQhard

A team trained a gradient boosting model with the following hyperparameters: learning_rate=0.1, n_estimators=1000, max_depth=6. The model achieves excellent training accuracy but poor validation accuracy. They suspect overfitting. Which hyperparameter change is LEAST likely to help?

A.Increase learning_rate to 0.5
B.Decrease n_estimators to 100
C.Add a subsample fraction of 0.8
D.Decrease max_depth to 3
AnswerA

A higher learning rate can cause the model to overfit more quickly, often worsening overfitting.

Why this answer

Increasing the learning rate makes the model more aggressive and can worsen overfitting. Decreasing n_estimators, decreasing max_depth, and adding subsampling all reduce model complexity and help mitigate overfitting.

57
MCQmedium

A company uses SageMaker Ground Truth to label a dataset for object detection. They set up a labeling job with a private workforce. After labeling, they export the dataset and train a model using SageMaker's built-in object detection algorithm. The model achieves high accuracy on the test set but low accuracy on a small holdout set that was manually labeled by an expert. What might be the issue?

A.The dataset size is too small.
B.The object detection algorithm is not suitable.
C.The holdout set uses a different labeling schema.
D.The labeling job had insufficient worker consensus.
AnswerD

Correct: Low consensus leads to noisy training labels, degrading model quality.

Why this answer

Option D is correct because insufficient worker consensus can lead to noisy labels, causing the model to learn incorrect patterns. The expert holdout set is accurate, so the discrepancy indicates poor label quality. Option C is wrong because a holdout set that uses a different labeling schema would cause systematic differences, not just lower accuracy.

Option A is wrong because a small dataset would affect both the test and holdout sets. Option B is wrong because the algorithm is appropriate.

58
Multi-Selecthard

A machine learning engineer is evaluating a binary classification model for detecting fraudulent transactions. The dataset is highly imbalanced, and the cost of false negatives (missing a fraud) is very high. Which two evaluation metrics should the engineer consider? (Choose two.)

Select 2 answers
A.F1-score
B.Accuracy
C.Recall
D.Precision
E.Mean absolute error
AnswersA, C

F1-score combines precision and recall, giving a balanced measure that penalizes low recall.

Why this answer

Recall captures the proportion of actual frauds correctly identified, directly addressing false negatives. F1-score balances precision and recall, providing a single score. Accuracy is misleading on imbalanced data, precision focuses on false positives, and mean absolute error is for regression.
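The sketch below works through why accuracy misleads on imbalanced data while recall and F1 do not. The confusion-matrix counts are made up for illustration (1,000 transactions, 10 of them fraudulent), and the helper name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A model that never predicts fraud: high accuracy, zero recall.
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# A model that catches 8 of 10 frauds at the cost of 20 false alarms.
fraud_catcher = classification_metrics(tp=8, fp=20, fn=2, tn=970)

print(always_negative)
print(fraud_catcher)
```

The first model scores 0.99 accuracy while missing every fraud; the second has slightly lower accuracy but a recall of 0.8, which is what matters when false negatives are expensive.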

59
MCQeasy

A machine learning engineer is using Amazon SageMaker Experiments to track multiple training runs. They want to compare the performance of different hyperparameter configurations visually. Which SageMaker tool provides an interactive interface to compare experiments?

A.SageMaker Studio
B.SageMaker Model Monitor
C.SageMaker Experiments SDK
D.SageMaker Debugger Insights
AnswerA

SageMaker Studio offers a rich visual interface to browse, compare, and analyze experiments.

Why this answer

SageMaker Studio provides a visual interface with experiment lists, charts, and comparisons. The SageMaker Experiments SDK is programmatic, not visual. Debugger Insights is for debugging, and Model Monitor is for inference monitoring.

60
MCQeasy

A data scientist is training a linear regression model on a dataset with 10 features. After training, the model shows high training accuracy but poor test accuracy. Which of the following is the most likely cause?

A.Data leakage
B.Feature scaling
C.Overfitting
D.Underfitting
AnswerC

Overfitting occurs when the model learns noise in the training data, leading to high training accuracy but poor generalization.

Why this answer

High training accuracy combined with poor test accuracy indicates overfitting. Underfitting would show poor training accuracy. Data leakage could cause high accuracy but not necessarily overfitting.

Feature scaling is a preprocessing step and not directly a cause of this behavior.

61
MCQeasy

Which of the following is a recommended practice for preparing training data in Amazon SageMaker?

A.Store training data in Amazon DynamoDB
B.Compress data using gzip to reduce transfer time
C.Convert data to RecordIO format for built-in algorithms
D.Use Amazon S3 with public read access
AnswerC

RecordIO is a binary format that SageMaker built-in algorithms use for efficient data loading.

Why this answer

RecordIO format is recommended for built-in algorithms as it improves I/O performance. Storing data in Amazon DynamoDB is not suitable for training datasets. Public read access on S3 is insecure.

While gzip compression is common, RecordIO is a specific best practice for SageMaker.

62
MCQmedium

A team is evaluating classification models for a medical diagnosis application. The cost of a false negative is much higher than the cost of a false positive. Which metric should be optimized during model selection?

A.Recall
B.Accuracy
C.F1 score
D.Precision
AnswerA

Recall minimizes false negatives, directly addressing the high cost of missed diagnoses.

Why this answer

Option A is correct because recall (true positive rate) focuses on minimizing false negatives, which is the priority when a missed diagnosis is costly. Option D (precision) minimizes false positives. Option B (accuracy) treats all errors equally.

Option C (F1 score) balances precision and recall but does not emphasize recall over precision.

63
MCQhard

What is the most likely cause of this error?

A.The account has not requested a service limit increase for the specified instance type.
B.The training job is configured to use Managed Spot Training and the spot market is unavailable.
C.The IAM role used for the training job does not have sufficient permissions to launch the instance.
D.The instance type specified is not available in the current AWS Region.
AnswerA

A ResourceLimitExceeded error with a limit of 0 means the account needs to request a service limit increase from AWS Support.

Why this answer

The error indicates that the service limit for the specified instance type is set to 0, meaning the account has not been granted a limit increase for that instance. The instance might be available in the region, but the limit is zero. IAM permissions would give an access denied error, not ResourceLimitExceeded.

Spot unavailability generates a different error.

64
MCQeasy

A data scientist is using Amazon SageMaker to train a linear regression model. After training, the scientist notices that the training and validation errors are both low, but the model performs poorly on new test data. What is the MOST likely cause?

A.There is data leakage from the validation set into the training set
B.The features are not scaled properly
C.The model is overfitting the training data
D.The model has high bias
AnswerA

Data leakage artificially inflates performance on validation but fails on true unseen data.

Why this answer

Option A is correct because data leakage from the validation set into the training set would allow the model to learn patterns that are not present in truly unseen data, leading to artificially low training and validation errors but poor generalization to new test data. In SageMaker, this can occur if the dataset is not properly split before feature engineering or if preprocessing (e.g., scaling or imputation) is applied to the entire dataset before splitting, causing the validation set to influence the training process.

Exam trap

The trap here is that candidates often confuse overfitting (low training error, high validation error) with data leakage (low training and validation errors, but poor test performance), so they incorrectly select Option C without recognizing that the validation error is also low.

How to eliminate wrong answers

Option B is wrong because improper feature scaling typically leads to slow convergence or suboptimal performance during training, but it would not cause low training and validation errors with poor test performance; scaling issues usually affect both training and validation errors similarly. Option C is wrong because overfitting would result in low training error but high validation error, not low validation error as described in the scenario. Option D is wrong because high bias (underfitting) would cause both training and validation errors to be high, not low.
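The preprocessing form of leakage mentioned above can be sketched in a few lines. The data values and helper names are hypothetical; the point is only that scaler statistics fitted on the full dataset absorb information from the validation rows, whereas statistics fitted on the training split alone do not.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, standard deviation) used for standardization."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize values using previously fitted parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0]
validation = [10.0, 12.0]

# Leaky: statistics are computed before the split, so validation rows shape them.
leaky_params = fit_scaler(train + validation)

# Clean: statistics come from the training split only and are then reused.
clean_params = fit_scaler(train)

print("leaky mean:", leaky_params[0])
print("clean mean:", clean_params[0])
print("validation scaled without leakage:", transform(validation, clean_params))
```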

65
MCQmedium

What will the debugger do with this configuration?

A.It will only capture gradients and not run any rules because the rule name is misspelled.
B.It will capture gradients every 10 steps and trigger a rule if loss does not decrease for 500 epochs.
C.It will capture gradients every 500 steps and trigger a rule if loss does not decrease for 10 steps with a threshold of 0.001.
D.It will capture gradients every 500 steps and trigger a rule if loss does not decrease for 500 iterations with a patience of 10.
AnswerC

The collection captures gradients every 500 steps; the rule parameters (patience=10, threshold=0.001) define when to alert.

Why this answer

The 'save_interval' in the collection captures gradients every 500 steps. The rule 'LossNotDecreasing' checks if the loss does not decrease for 'patience' consecutive steps (10) within a tolerance of 'threshold' (0.001). Option A is wrong because the rule name is valid. Option B swaps the values, and option D misreads the patience as 500 iterations.

66
MCQhard

Refer to the exhibit. A SageMaker training job logs show training AUC increasing but validation AUC plateauing at 0.880. What is the most likely issue?

A.Overfitting
B.Learning rate too high
C.Underfitting
D.Insufficient training data
AnswerA

The model is memorizing training data (train AUC up) but not generalizing (validation AUC flat).

Why this answer

Training AUC continues to increase while validation AUC stops improving and even drops slightly, indicating overfitting. Underfitting would show both low, high learning rate would cause erratic behavior, and insufficient data typically causes high variance but not this pattern.

67
MCQeasy

A company is training a binary classifier in SageMaker and observes that the training loss decreases but validation loss increases after a few epochs. What is the most likely issue?

A.Learning rate too high
B.Overfitting
C.Underfitting
D.Data imbalance
AnswerB

Correct: Overfitting occurs when the model performs well on training data but poorly on validation data.

Why this answer

Option B is correct because overfitting occurs when the model performs well on training data but poorly on validation data. Option C is wrong because underfitting would show high loss on both datasets. Option A is wrong because a high learning rate may cause divergence but not necessarily a validation loss increase.

Option D is wrong because data imbalance typically affects both training and validation metrics.

68
MCQmedium

Which feature scaling method is most robust to outliers in the data?

A.Normalization (L2)
B.Standardization (Z-score)
C.Robust scaling
D.Min-max scaling
AnswerC

Robust scaling uses median and IQR, thus resilient to outliers.

Why this answer

Robust scaling uses median and interquartile range, making it less sensitive to outliers than standardization (mean and standard deviation) or min-max scaling (range dependent on extremes).
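A small worked comparison may help here. The sketch below applies both scalers to a made-up sample with one extreme outlier, using only the standard library; the helper names are hypothetical, and the inclusive quantile method is chosen as an assumption to keep the arithmetic simple.

```python
from statistics import mean, median, pstdev, quantiles


def robust_scale(values):
    """Scale with the median and interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]


def z_score(values):
    """Scale with the mean and (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is the outlier

robust = robust_scale(data)
standard = z_score(data)

print("robust:  ", robust)
print("z-score: ", [round(v, 3) for v in standard])
```

With robust scaling the four ordinary points keep a usable spread from -1.0 to 0.5, because the median and IQR ignore the outlier. With z-scores the outlier inflates both the mean and the standard deviation, so the same four points collapse into a narrow band.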

69
MCQmedium

A company is building a binary classifier for credit default prediction. The dataset is highly imbalanced (98% no default). They want to maximize recall for the minority class while maintaining reasonable precision. Which metric should be optimized during hyperparameter tuning?

A.AUC-ROC
B.F1 score
C.Accuracy
D.Precision
AnswerB

F1 score is the harmonic mean of precision and recall, addressing both metrics.

Why this answer

F1 score balances precision and recall, making it suitable for imbalanced datasets when both metrics are important. Other options are less appropriate because accuracy is misleading due to imbalance, precision ignores recall, and AUC-ROC does not directly optimize recall at a decision threshold.

70
MCQhard

A team is using SageMaker Pipelines to train a model. The pipeline has multiple steps: data processing, training, evaluation, and registration. They use a Condition step to evaluate the model's accuracy and if it exceeds a threshold, register the model. They run the pipeline and the training step succeeds, but the pipeline fails at the Condition step with an error: 'Unable to evaluate condition: the property 'Accuracy' does not exist.' The evaluation step output is a JSON file with key 'accuracy'. What is the most likely cause?

A.The evaluation step did not produce the output correctly.
B.The training step output is being used instead of the evaluation step output.
C.The pipeline definition has a syntax error.
D.The Condition step is referencing the wrong property name.
AnswerD

Correct: 'Accuracy' vs 'accuracy' case mismatch causes the error.

Why this answer

Option D is correct because the Condition step references 'Accuracy' (capital A) but the evaluation output uses 'accuracy' (lowercase). Property names are case-sensitive. Option A is wrong because the evaluation step produced output correctly.

Option B is wrong because if the training step output were used, the property name would still be mismatched. Option C is wrong because the error is specific to the property name, not syntax.
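The case-sensitivity problem is easy to reproduce outside of SageMaker. The sketch below parses a JSON report with the key 'accuracy' and looks it up under both spellings; the metric value and the helper name are hypothetical, and this is plain JSON handling rather than the Pipelines API itself.

```python
import json

# Stand-in for the evaluation step's JSON output; 0.91 is an invented value.
evaluation_report = json.loads('{"accuracy": 0.91}')


def lookup(report, key):
    """Return the value for key, or None when the exact key is absent."""
    try:
        return report[key]
    except KeyError:
        return None


print(lookup(evaluation_report, "accuracy"))  # 0.91
print(lookup(evaluation_report, "Accuracy"))  # None
```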

71
MCQhard

A financial services company is developing a real-time fraud detection model using XGBoost on SageMaker. They have millions of transactions daily and train a model weekly on 6 months of historical data. The training dataset is 500 GB in CSV format stored in S3. The training job uses an ml.p3.16xlarge instance with 8 GPUs, but training takes over 12 hours, which is too long for the weekly cadence. The data scientist notices that GPU utilization averages only 15% during training. The training script uses the SageMaker XGBoost container with default hyperparameters. Which combination of actions would MOST likely reduce training time? (Choose the best answer.)

A.Increase the instance type to ml.p3dn.24xlarge and use EFA networking.
B.Tune hyperparameters using SageMaker Automatic Model Tuning to reduce training epochs.
C.Use SageMaker Debugger to profile the training and adjust the batch size to maximize GPU memory usage.
D.Convert the training data to Parquet format, use Pipe input mode in the training job, and increase the instance count to run distributed training.
AnswerD

Parquet reduces data size and improves I/O; Pipe mode streams data efficiently; distributed training scales out to reduce time.

Why this answer

The low GPU utilization suggests an I/O bottleneck (data loading) or an inefficient data format. Converting CSV to Parquet reduces data size and speeds up I/O. Using Pipe mode streams data from S3 rather than downloading it before training starts.

Option D directly addresses the root cause, and increasing the instance count helps further once I/O is resolved. Option A might not help because the GPUs are already underutilized.

Option B focuses on hyperparameters, which are unlikely to be the primary bottleneck. Option C profiles the job and adjusts batch size but does not fix I/O if the data is still CSV.

72
Multi-Selecthard

A team is using SageMaker Pipelines to automate a training workflow. They need to ensure that if a step fails, the pipeline can resume from the failed step without reprocessing prior steps. Which TWO configurations are necessary? (Choose TWO.)

Select 2 answers
A.Set the Pipeline's parallel flag to True
B.Set a retry policy on the step
C.Use a Lambda step for retry logic
D.Store intermediate artifacts in S3
E.Enable caching on each step
AnswersB, E

Correct: Retry policies automatically retry a step upon failure.

Why this answer

Options B and E are correct. Enabling caching on each step (E) allows outputs to be reused from previous runs. Setting a retry policy (B) allows the pipeline to retry the failed step.

Option A is wrong because parallelism does not affect resumption. Option C is wrong because Lambda steps are for custom processing, not resumption. Option D is wrong because storing artifacts is common but not sufficient for resumption without caching.

73
MCQeasy

A data scientist wants to train an XGBoost model using the SageMaker Python SDK with a custom training script. Which estimator class should be used?

A.sagemaker.sklearn.SKLearnEstimator.
B.sagemaker.tensorflow.TensorFlowEstimator.
C.sagemaker.xgboost.estimator.XGBoost with a script mode entry point.
D.sagemaker.xgboost.XGBoostEstimator with the built-in algorithm mode.
AnswerC

Framework estimator allows custom scripts and leverages the XGBoost container.

Why this answer

Option C is correct because the SageMaker XGBoost framework estimator allows users to bring their own training script while using the optimized XGBoost container. Option D is wrong because the built-in XGBoost algorithm does not support custom scripts; it expects a specific input format. Option A is wrong because the scikit-learn estimator does not natively support XGBoost training.

Option B is wrong because the TensorFlow estimator is for TensorFlow models.

74
MCQeasy

A team uses SageMaker Experiments to track multiple training runs. They need to register the best-performing model in the model registry for approval. Which method ensures the model artifacts and metadata are captured correctly?

A.Write an AWS Lambda function to copy the best model to a specific S3 prefix.
B.Manually download the best model artifact and upload to S3, then create a model in SageMaker.
C.Use the SageMaker Model Registry's create_model_package_from_estimator or equivalent API to register the model.
D.Use Experiment analytics to view results and then create a model package using the Run's artifact URI.
AnswerC

Model Registry captures artifacts, metrics, and supports approval workflow.

Why this answer

Option C is correct because SageMaker Model Registry provides a centralized catalog for model versions with associated metadata, metrics, and approval status. Option B is wrong because manually downloading and re-uploading the artifact is error-prone. Option D is wrong because Experiments track runs but do not natively register models.

Option A is wrong because Lambda is not a direct mechanism for model registration.

75
MCQmedium

A company is using SageMaker to train a neural network for image classification. The training job is taking too long. The team wants to reduce training time without sacrificing model accuracy. Which approach should they recommend?

A.Increase the batch size to the maximum possible
B.Use a GPU-based instance such as ml.p3.2xlarge
C.Use a learning rate scheduler that reduces the learning rate over time
D.Add more convolutional layers to the model
AnswerB

GPUs accelerate matrix operations in neural networks, reducing training time.

Why this answer

Option B is correct because GPU-based instances like ml.p3.2xlarge are specifically designed for parallel processing of matrix operations, which are fundamental to neural network training. By offloading compute-intensive tensor operations to GPU cores, training time can be significantly reduced without altering the model architecture or data, thus preserving accuracy.

Exam trap

AWS often tests the misconception that any change to hyperparameters or architecture can reduce training time without side effects, but the trap here is that candidates confuse 'reducing training time' with 'improving convergence speed'—only hardware acceleration (GPU) directly reduces wall-clock time without risking accuracy degradation.

How to eliminate wrong answers

Option A is wrong because increasing batch size to the maximum possible can lead to degraded model accuracy due to reduced gradient noise, causing the model to converge to sharp minima or even fail to converge; it also risks out-of-memory errors. Option C is wrong because a learning rate scheduler that reduces the learning rate over time helps with convergence stability and final accuracy, but it does not directly reduce training time—it may even extend it if the learning rate becomes too small too early. Option D is wrong because adding more convolutional layers increases model complexity and the number of parameters, which typically increases training time and can lead to overfitting without guaranteeing improved accuracy.
