Knowledge + Practice

AWS Certified Machine Learning Specialty MLS-C01 (MLS-C01) — Questions 451–525

1755 questions total · 24 pages · All types, answers revealed


Page 7 of 24

451

Multi-Selectmedium

Which THREE factors should a data engineer consider when choosing between Amazon S3 and Amazon Redshift for storing large datasets used for machine learning? (Choose 3.)

Select 3 answers

A.Query performance and latency requirements

B.Encryption at rest capabilities

C.Cost of storage vs. compute

D.Data format and compression support

E.Data retention policies

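Answers: A, C, D

Redshift provides fast SQL analytics; queries that run directly against S3 are typically slower.

Why this answer

Options A, C, and D are the key considerations. Option B is wrong because both services support encryption at rest, so it does not differentiate them. Option E is wrong because data retention policies are a governance concern rather than a factor in choosing between the two services.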

452

Multi-Selectmedium

A data scientist is building a regression model to predict house prices. The dataset contains 10 features, including 'number_of_bedrooms' and 'square_footage'. The scientist observes that the model has high variance. Which TWO actions are most appropriate to reduce overfitting? (Choose TWO.)

Select 2 answers

A.Reduce model complexity by using a simpler model

B.Add L2 regularization to the model

C.Increase the number of training epochs

D.Decrease the amount of training data

E.Add more polynomial features

Answers: A, B

Simpler models have lower variance.

Why this answer

Options A and B are correct. Reducing model complexity (e.g., using a simpler model) reduces overfitting, and adding L2 regularization penalizes large weights, which reduces variance.

Option C is wrong because increasing the number of training epochs may lead to more overfitting. Option D is wrong because decreasing the amount of training data increases variance. Option E is wrong because adding polynomial features increases model complexity.
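
The shrinking effect of L2 regularization can be illustrated with a minimal sketch. It assumes a one-feature linear model through the origin, for which the penalized least-squares weight has the closed form w = Σxy / (Σx² + λ). The data values and the name `ridge_weight` are illustrative assumptions, not part of the question.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form weight for y ~ w * x with an L2 penalty lam * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data (made up for illustration): roughly y = 2x plus noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_ols = ridge_weight(xs, ys, lam=0.0)     # no regularization
w_ridge = ridge_weight(xs, ys, lam=10.0)  # penalized weight is pulled toward zero

print(w_ols, w_ridge)
```

A larger `lam` pulls the weight further toward zero, which is the mechanism by which the penalty trades a little bias for lower variance.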

453

MCQmedium

A data science team is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (95% negative class, 5% positive class). The team wants to maximize the F1 score. Which built-in SageMaker algorithm is most appropriate?

A.Linear Learner

B.XGBoost

C.PCA

D.K-Means

Answer: B

XGBoost has a scale_pos_weight parameter to handle imbalance and can be tuned for F1.

Why this answer

XGBoost supports the scale_pos_weight hyperparameter, which reweights the positive class to counter class imbalance and typically improves F1 on skewed data. Linear Learner with balanced class weights can also help, but it typically optimizes log loss. K-Means is unsupervised, and PCA is for dimensionality reduction.
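
A common starting point for scale_pos_weight is the ratio of negative to positive examples. The sketch below computes that ratio for the 95% / 5% split in the question. The row counts and the function name are illustrative assumptions, and the result is a heuristic to tune rather than a fixed rule.

```python
def suggested_scale_pos_weight(num_negative, num_positive):
    """Heuristic starting value: ratio of negative to positive examples."""
    if num_positive <= 0:
        raise ValueError("need at least one positive example")
    return num_negative / num_positive

# Hypothetical dataset of 10,000 rows with the 95% / 5% split from the question.
weight = suggested_scale_pos_weight(num_negative=9_500, num_positive=500)
print(weight)  # 19.0
```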

454

Multi-Selectmedium

A company is using Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by a fleet of EC2 instances running a custom consumer application. The consumer is falling behind and the shard iterator age is increasing. Which TWO actions should the data engineer take to improve consumer performance? (Choose TWO.)

Select 2 answers

A.Increase the number of shards in the stream

B.Decrease the data retention period

C.Use an AWS Lambda function to process the data

D.Enable enhanced fan-out on the stream

E.Switch to the Kinesis Client Library (KCL)

Answers: A, D

More shards increase the total read capacity.

Why this answer

Increasing the number of shards increases parallelism, and enabling enhanced fan-out gives each consumer its own dedicated read throughput. Option B is wrong because decreasing the retention period does not improve consumer performance. Option C is wrong because a Lambda function may not help if the bottleneck is shard throughput. Option E is wrong because the Kinesis Client Library (KCL) is already the standard way to build consumers and is not a direct fix.
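
The capacity reasoning can be sketched as simple arithmetic. The example assumes the documented Kinesis Data Streams read limit of 2 MB/s per shard, shared by standard consumers and dedicated per consumer with enhanced fan-out. The function name and the even split between consumers are simplifying assumptions.

```python
SHARD_READ_MB_PER_SEC = 2.0  # per-shard read limit for Kinesis Data Streams

def per_consumer_read_mb(shards, consumers, enhanced_fan_out):
    """Approximate read throughput available to each consumer application."""
    total = shards * SHARD_READ_MB_PER_SEC
    if enhanced_fan_out:
        # Each registered consumer gets its own dedicated per-shard throughput.
        return total
    # Standard consumers share the per-shard read limit (assumed evenly here).
    return total / consumers

print(per_consumer_read_mb(shards=4, consumers=2, enhanced_fan_out=False))  # 4.0
print(per_consumer_read_mb(shards=8, consumers=2, enhanced_fan_out=False))  # 8.0
print(per_consumer_read_mb(shards=4, consumers=2, enhanced_fan_out=True))   # 8.0
```

Either doubling the shard count or enabling enhanced fan-out doubles the throughput each consumer can read in this toy setup.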

455

Multi-Selectmedium

A data scientist is building a binary classifier using logistic regression. The dataset has 10 features and 100,000 observations. The model achieves 99% accuracy on the test set, but the precision is 50% and recall is 90%. Which TWO actions should the data scientist take to improve model performance? (Choose 2.)

Select 2 answers

A.Increase the regularization strength (C) in logistic regression.

B.Adjust the decision threshold to increase precision at the cost of recall.

C.Use a random forest classifier instead of logistic regression.

D.Collect more training data.

E.Remove features that have low correlation with the target.

Answers: B, C

Lowering the threshold increases recall; raising the threshold increases precision.

Why this answer

Precision is low and recall is high. To improve precision, the data scientist can raise the decision threshold (Option B), trading some recall for precision, or switch to a different algorithm such as a random forest (Option C), which could improve precision. Option A (increasing regularization) may help but is not a direct fix. Option D (collecting more data) may not help. Option E (removing features) might hurt performance.
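
The threshold trade-off can be seen in a small sketch. The scores and labels below are made up for illustration, and `precision_recall` is a hypothetical helper rather than a library function.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up model scores and true labels, for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

low = precision_recall(scores, labels, threshold=0.5)
high = precision_recall(scores, labels, threshold=0.75)
print("threshold 0.50:", low)
print("threshold 0.75:", high)
```

Raising the threshold from 0.5 to 0.75 removes both false positives in this toy data but also drops one true positive, so precision rises while recall falls.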

456

MCQmedium

A data scientist is using Amazon SageMaker to perform hyperparameter tuning for a neural network. The tuning job uses the 'Random' search strategy. After 10 training jobs, the best objective metric has plateaued. The scientist wants to improve the results without increasing the total number of training jobs. Which approach should they take?

A.Use a different objective metric that is easier to optimize

B.Normalize the input features to have zero mean and unit variance

C.Increase the maximum number of training jobs

D.Switch the hyperparameter tuning strategy to 'Bayesian'

Answer: D

Bayesian optimization uses past trials to inform future hyperparameter choices, often converging faster.

Why this answer

Switching to Bayesian search (e.g., 'Bayesian' strategy) is more efficient because it uses past results to choose the next hyperparameters, potentially finding better values in fewer jobs. Increasing the number of jobs would increase cost. Random search might get lucky but is less efficient.

Changing the objective metric or scaling features would not directly improve the tuning process.


457

MCQmedium

A data engineer runs the AWS CLI command shown in the exhibit to find large log files in S3. The command returns an empty list, but the engineer knows there are files larger than 1 MB in that prefix. What is the MOST likely issue?

A.The JMESPath query syntax is incorrect

B.The command does not paginate through all objects; only the first 1000 are returned

C.The prefix is incorrect; there are no objects under that prefix

D.The Size value is in kilobytes, not bytes

Answer: B

list-objects returns at most 1000 keys per call; use pagination to retrieve the rest.

Why this answer

Option B is correct. The list-objects command returns up to 1000 objects per call, so if the prefix contains more objects, pagination is needed to see them all. Option A is incorrect because the query syntax is correct. Option C is incorrect because the prefix is fine. Option D is incorrect because Size is reported in bytes, so 1000000 is about 1 MB.
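
The same search can be written so that it walks every page of results. The sketch below uses the boto3 paginator for `list_objects_v2` (the newer counterpart of list-objects) instead of the CLI command in the exhibit. The helper name, bucket, and prefix are placeholders chosen for illustration.

```python
def keys_larger_than(s3_client, bucket, prefix, min_bytes):
    """Return keys under prefix whose Size exceeds min_bytes, across all pages."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page that has no objects.
        for obj in page.get("Contents", []):
            if obj["Size"] > min_bytes:
                keys.append(obj["Key"])
    return keys

# Usage with a real client (bucket and prefix are placeholders):
#   import boto3
#   keys_larger_than(boto3.client("s3"), "my-log-bucket", "logs/", 1_000_000)
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without AWS credentials.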

458

MCQeasy

An ML engineer needs to store and version training datasets and model artifacts. Which AWS service should they use?

A.Amazon DynamoDB

B.Amazon Simple Storage Service (S3)

C.Amazon Elastic File System (EFS)

D.Amazon Elastic Block Store (EBS)

Answer: B

S3 supports versioning and is commonly used for ML artifacts.

Why this answer

Amazon S3 provides scalable object storage with versioning. Option A is wrong because DynamoDB is a NoSQL database. Option C is wrong because EFS is file storage for shared access. Option D is wrong because EBS is block storage for single EC2 instances.

459

MCQhard

Refer to the exhibit. A data scientist queries the table with 'SELECT COUNT(*) FROM mytable' in Athena and gets a result of 1000 rows. However, the scientist knows there are 1500 data files in the S3 location. What is the most likely reason for the discrepancy?

A.Some files may use a different delimiter (e.g., tab) and are not parsed correctly, resulting in zero rows from those files.

B.The table schema does not match the data, causing some files to be skipped.

C.Some files may be empty or contain only headers, so they contribute 0 rows.

D.Athena skips files larger than a certain size to prevent scanning too much data.

Answer: A

If the delimiter is not a comma, rows are not parsed, which reduces the count.

Why this answer

The table is not partitioned (empty PartitionKeys), and the SerDe expects comma-delimited files. If some files use a different delimiter (e.g., tab), those rows may not be parsed correctly, leading to fewer rows counted. Option B is wrong because the table schema matches the data. Option C is not the best answer because Athena can handle many small files, and the count should still include all the rows they contain. Option D is wrong because Athena does not skip files based on size unless explicitly filtered.

460

Multi-Selecteasy

A data scientist is analyzing a dataset with a mix of numerical and categorical features. The target variable is binary. The data scientist wants to visualize the distribution of a numerical feature across the two target classes. Which TWO visualization techniques are appropriate? (Choose 2.)

Select 2 answers

A.Heatmap of the correlation matrix

B.Stacked bar chart of the feature binned

C.Overlapping histograms with transparency

D.Side-by-side boxplots

E.Scatter plot with color-coded classes

Answers: C, D

Histograms show distribution shapes; transparency allows comparison.

Why this answer

Option C is correct because overlapping histograms with transparency show the distribution shape for each class. Option D is correct because side-by-side boxplots show distribution and outliers across categories. Option A is wrong because heatmaps are for correlation or contingency tables. Option B is wrong because bar charts are for categorical data, not numerical distributions. Option E is wrong because scatter plots require two numerical variables.

461

Multi-Selecthard

A company is using SageMaker to train a TensorFlow model for image classification. The training is slow on a single GPU instance. Which TWO strategies can reduce training time? (Choose TWO.)

Select 2 answers

A.Increase the image size

B.Use SageMaker Pipe Mode for data ingestion

C.Increase the number of training epochs

D.Use distributed training with multiple GPUs

E.Decrease the batch size

Answers: B, D

Pipe Mode streams data directly to the training container, reducing I/O time.

Why this answer

Options B and D are correct. SageMaker Pipe Mode streams data from S3, reducing download time. Using multiple GPUs (distributed training) parallelizes computation.

Option A (larger images) increases computation. Option C (more epochs) increases training time. Option E (smaller batch size) may slow training.

462

MCQmedium

A data scientist is working with a dataset that has missing values in 30% of rows for a categorical feature 'city'. Which EDA step should be performed before deciding on imputation?

A.Check if missingness is related to other features or random

B.Impute missing values with the mode of the column

C.Drop all rows with missing values

D.Encode the city feature using label encoding

Answer: A

Understanding why values are missing guides the imputation strategy.

Why this answer

Option A is correct because analyzing the missingness pattern (e.g., MCAR, MAR, MNAR) guides the imputation strategy. Option B is wrong because mode imputation is a method, not a diagnostic step. Option C is wrong because dropping rows may lose data; the missingness needs to be understood first. Option D is wrong because label encoding is a modeling step, not part of the imputation decision.

463

Multi-Selecteasy

Which TWO are appropriate visualizations for exploring the distribution of a single numeric variable? (Select TWO.)

Select 2 answers

A.Heatmap

B.Histogram

C.Bar chart

D.Scatter plot

E.Box plot

Answers: B, E

A histogram displays the frequency distribution of a single numeric variable.

Why this answer

Options B and E are correct. Option A is wrong because a heatmap shows correlation. Option C is wrong because a bar chart is for categorical data. Option D is wrong because a scatter plot is for two variables.

464

MCQhard

A machine learning team trains a deep learning model on SageMaker. The training job uses a single ml.p3.2xlarge instance and takes 12 hours. The team needs to reduce training time without changing the algorithm. Which approach is most effective?

A.Increase the learning rate

B.Switch to a larger instance type, such as ml.p3.16xlarge

C.Use managed Spot Training

D.Use SageMaker's distributed data parallelism across multiple instances

Answer: D

Distributed data parallelism scales training across GPUs, reducing wall-clock time.

Why this answer

Using SageMaker's distributed data parallelism (e.g., with the SageMaker distributed training libraries) across multiple instances can significantly reduce training time by splitting each mini-batch across GPUs. Switching to a larger instance type (e.g., ml.p3.16xlarge, which has multiple GPUs) helps only if the training script uses all of them, and scaling stops at that one instance. Increasing the learning rate is a hyperparameter change that does not reliably reduce training time. Managed Spot Training reduces cost rather than training time, and Spot instances may be interrupted.
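
The idea behind data parallelism can be sketched without any framework: each worker computes the gradient on its shard of the mini-batch, and the averaged result matches the single-device gradient when the shards are equal in size. The model (y = w * x with squared error), the data, and the function name are assumptions made for illustration; this is not the SageMaker library's implementation.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Made-up mini-batch of (x, y) pairs.
batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 1.0

# Single-device gradient over the full mini-batch.
full_grad = mse_gradient(w, batch)

# Data parallelism: each of two workers handles an equal-sized shard,
# then the per-worker gradients are averaged (the all-reduce step).
shards = [batch[:2], batch[2:]]
worker_grads = [mse_gradient(w, shard) for shard in shards]
averaged_grad = sum(worker_grads) / len(worker_grads)

print(full_grad, averaged_grad)
```

Because the two results agree, the workers can process their shards at the same time without changing the update the optimizer sees.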

465

MCQmedium

A company captures streaming data from IoT devices using Amazon Kinesis Data Streams. The data is consumed by a custom application that processes records in near real-time. Recently, the application has been falling behind, and the stream is showing increased 'iterator age' metrics in CloudWatch. Which action is MOST likely to reduce the iterator age?

A.Increase the data retention period of the stream

B.Decrease the number of shards in the stream

C.Increase the number of shards in the stream

D.Reduce the data retention period of the stream

Answer: C

More shards increase throughput, allowing the consumer to process faster.

Why this answer

Option C is correct because increasing the number of shards increases the capacity of the stream, allowing more parallel consumers and reducing the backlog. Option A is wrong because increasing the retention period only keeps records longer; it does not speed up processing. Option B is wrong because decreasing the number of shards reduces capacity, worsening the issue. Option D is wrong because reducing the retention period does not affect processing speed.

466

MCQhard

A data scientist ran an XGBoost training job in SageMaker and it failed with the error shown in the exhibit. Which hyperparameter change is most likely to resolve the numeric overflow?

A.Reduce max_depth to a lower value

B.Increase subsample

C.Increase eta (learning rate)

D.Increase num_round

Answer: A

Numeric overflow often occurs when trees are too deep, leading to large leaf weights. Reducing depth prevents this.

Why this answer

Option A is correct because reducing max_depth prevents trees from growing too deep, which can cause numeric overflow. Option B is wrong because subsample does not affect depth. Option C is wrong because increasing eta affects convergence but does not address overflow. Option D is wrong because increasing num_round may worsen overflow.

467

MCQmedium

A data scientist is analyzing a dataset with a binary target variable. They compute the correlation matrix and find that all features have correlations between -0.1 and 0.1 with the target. They suspect that the relationship might be non-linear. Which of the following techniques should they use to detect non-linear relationships?

A.ANOVA test

B.Spearman's rank correlation

C.Pearson correlation coefficient

D.Mutual information

Answer: D

Mutual information measures any kind of dependency, linear or non-linear.

Why this answer

Option D is correct because mutual information can capture any dependency. Option A is wrong because ANOVA compares a continuous variable across categorical groups. Option B is wrong because Spearman's rank correlation captures monotonic relationships but not all non-linear ones. Option C is wrong because Pearson correlation captures only linear relationships.
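
A small discrete example shows the gap between the two measures. The feature values and the target rule below are made up so that the target depends on the magnitude of the feature rather than its sign. Both helper functions are plain-Python sketches; for continuous features an estimator such as scikit-learn's `mutual_info_classif` would be the more usual choice.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Made-up feature and binary target: positive only at the extremes.
xs = [-2, -1, 0, 1, 2] * 20
ys = [1 if abs(x) == 2 else 0 for x in xs]

r = pearson(xs, ys)
mi = mutual_information(xs, ys)
print(r, mi)
```

The correlation is essentially zero because the relationship is symmetric around zero, while the mutual information is clearly positive because the feature fully determines the target.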

468

Multi-Selecthard

A company uses Amazon Redshift for data warehousing. The data engineering team notices that query performance has degraded over time. Which THREE actions should the team take to improve performance? (Choose THREE.)

Select 3 answers

A.Increase the number of nodes in the Redshift cluster

B.Define appropriate sort keys on large tables

C.Define appropriate distribution keys on large tables

D.Delete old data that is no longer needed

E.Run the ANALYZE command to update table statistics

Answers: B, C, E

Sort keys minimize the amount of data scanned, improving query performance.

Why this answer

Sort keys help the query optimizer scan less data. Distribution keys reduce data shuffling. The ANALYZE command updates statistics for the optimizer.

Option A (increasing node count) is costly and not always necessary. Option D (deleting old data) may help but is not a direct performance tuning technique.

469

Multi-Selectmedium

Which THREE of the following are best practices for training deep learning models on Amazon SageMaker?

Select 3 answers

A.Disable automatic scaling to avoid interruptions

B.Use SageMaker Debugger to profile system bottlenecks

C.Use Pipe mode for training data stored in S3 to reduce startup time

D.Always use the largest instance type available for faster training

E.Use managed spot training to reduce cost

Answers: B, C, E

Debugger provides insights into GPU utilization and I/O bottlenecks.

Why this answer

SageMaker Debugger is a best practice because it provides real-time profiling of system bottlenecks such as CPU/GPU utilization, memory I/O, and network throughput during training. This allows you to identify and resolve performance issues early, optimizing training efficiency and cost. It integrates directly with SageMaker's training jobs without requiring code changes.

Exam trap

The trap here is that candidates may confuse 'avoiding interruptions' with disabling automatic scaling, when in fact automatic scaling is designed to prevent interruptions by dynamically adjusting capacity, and disabling it increases the risk of failures.


470

MCQeasy

During exploratory data analysis, a data scientist notices that a feature has a highly skewed distribution. Which transformation is most likely to make the distribution approximately normal?

A.Log transformation

B.Min-max scaling

C.One-hot encoding

D.Standardization (z-score)

Answer: A

Log transformation is commonly used to reduce right skewness.

Why this answer

Option A is correct because log transformation is commonly used to reduce right skewness. Option B is wrong because min-max scaling does not change the shape of the distribution. Option C is wrong because one-hot encoding is for categorical variables, not continuous ones. Option D is wrong because standardization does not change the shape of the distribution.
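
The effect of a log transformation on skewness can be checked numerically. The feature values below are made up (exponentially spaced), and `skewness` is a hand-written helper for the third standardized moment. Note that `math.log` requires strictly positive values.

```python
import math

def skewness(values):
    """Population skewness: third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up right-skewed feature: powers of 2 (exponentially spaced).
raw = [2.0 ** k for k in range(1, 11)]
logged = [math.log(v) for v in raw]

print(skewness(raw), skewness(logged))
```

The raw values have a long right tail and a clearly positive skewness, while their logarithms are evenly spaced, so the skewness drops to roughly zero.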

471

MCQmedium

A team is using SageMaker to train a model with hyperparameter tuning. The training jobs are taking too long. The team wants to reduce time without sacrificing model quality. Which approach should they take?

A.Use random search instead of Bayesian optimization.

B.Enable early stopping in the hyperparameter tuning job.

C.Increase the maximum number of training jobs.

D.Reduce the maximum runtime per training job.

Answer: B

Early stopping halts poorly performing training jobs, saving time.

Why this answer

Option B is correct because early stopping terminates poorly performing jobs early, saving time. Option A is wrong because random search may miss optimal hyperparameters, whereas early stopping directly reduces the time spent on bad trials. Option C is wrong because increasing the maximum number of jobs would increase time. Option D is wrong because reducing the maximum runtime may not allow jobs to converge.

472

MCQhard

A machine learning engineer is training a model using the SageMaker built-in XGBoost algorithm. The training job is taking longer than expected. The engineer notices that the training data is stored in S3 in CSV format and is 500 GB in size. The instance type is ml.c4.8xlarge with 10 instances. Which change would most likely reduce training time?

A.Convert the data to Parquet format.

B.Increase the number of instances to 20.

C.Use Pipe input mode instead of File input mode.

D.Increase the size of the EBS volume attached to each instance.

Answer: C

Pipe mode streams data directly, reducing I/O bottleneck.

Why this answer

Pipe input mode streams data directly from S3 to the training instances without first downloading it to the local EBS volume, eliminating the I/O bottleneck of reading a 500 GB CSV file. This reduces the time spent on data loading and allows the XGBoost algorithm to begin training sooner, which is especially beneficial for large datasets.

Exam trap

The exam often tests the distinction between data format optimization (Parquet) and data ingestion mode (Pipe vs. File), where candidates mistakenly choose a format change without recognizing that the primary bottleneck is the data transfer mechanism, not the storage format.

How to eliminate wrong answers

Option A is wrong because converting to Parquet format would reduce storage size and improve read efficiency, but the primary bottleneck here is the data transfer time from S3 to the instances, not the format overhead; Pipe mode addresses the transfer bottleneck more directly. Option B is wrong because increasing the number of instances to 20 would add more parallelism but also increase the overhead of data distribution and coordination, and the training job is already bottlenecked by data ingestion, not compute capacity. Option D is wrong because increasing the EBS volume size does not improve I/O throughput for reading from S3; the data must still be downloaded from S3 to the EBS volume, so the bottleneck remains.


473

MCQmedium

A data scientist is analyzing a dataset with 10 million rows and 50 columns. The target variable is highly imbalanced (99% negative, 1% positive). Which approach is most appropriate for exploratory data analysis before modeling?

A.Remove all negative examples and analyze only the positive ones.

B.Take a random sample of 100,000 rows from the entire dataset.

C.Take a stratified sample that preserves the 99:1 ratio.

D.Up-sample the minority class to balance the dataset before analysis.

Answer: C

Stratified sampling ensures representation of both classes.

Why this answer

Option C is correct because stratified sampling preserves the class proportion in the sample, which is critical for imbalanced data. Option A (removing negatives) loses information. Option B (random sample) may miss positives. Option D (up-sampling the minority class) changes the distribution.
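
Stratified sampling can be sketched in a few lines: group the rows by class, then draw the same fraction from each group. The dataset, the `label` field, and the function name below are illustrative assumptions.

```python
import random

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample the same fraction from every class so class proportions are kept."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    sample = []
    for members in by_class.values():
        # Keep at least one row per class so no class disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Made-up dataset: 9,900 negatives and 100 positives (99:1).
rows = [{"id": i, "label": 1 if i < 100 else 0} for i in range(10_000)]
subset = stratified_sample(rows, lambda r: r["label"], fraction=0.1)
positives = sum(r["label"] for r in subset)
print(len(subset), positives)
```

A 10% sample drawn this way always contains its share of positives, whereas a plain random sample of the same size could contain noticeably fewer by chance.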

474

MCQeasy

A data scientist needs to detect outliers in a dataset with multiple features that follow different distributions. Which method is most robust for multivariate outlier detection?

A.Z-score threshold

B.Interquartile range (IQR)

C.DBSCAN clustering

D.Isolation Forest

Answer: D

Isolation Forest works well for multivariate data without distributional assumptions.

Why this answer

Option D is correct because Isolation Forest is an ensemble method that isolates anomalies effectively in high-dimensional spaces without assuming a distribution. Option A is wrong because the Z-score assumes a normal distribution. Option B is wrong because IQR is univariate. Option C is wrong because DBSCAN is for clustering, not specifically for outlier detection.

475

Multi-Selecteasy

Which TWO actions should a data scientist take when exploring a dataset that contains missing values and outliers? (Select TWO.)

Select 2 answers

A.Calculate the percentage of missing values per column.

B.Normalize all features using Min-Max scaling.

C.Remove all rows with outliers.

D.Impute missing values with the mean immediately.

E.Visualize the distribution of each feature using histograms.

Answers: A, E

Missing value counts inform imputation strategy.

Why this answer

Options A and E are correct. A: Reporting the percentage of missing values per column is a standard EDA step. E: Visualizing the distribution helps identify shape and outliers.

B is wrong because normalization is not a first step. C is wrong because removing outliers without analysis is premature. D is wrong because imputation should be done after analysis.
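
A per-column missing-value report takes only a few lines. The records and column names below are made up, and `None` is assumed to be the marker for a missing value.

```python
def missing_percentages(rows, columns):
    """Percentage of rows where each column is None or absent."""
    total = len(rows)
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }

# Made-up records; None marks a missing value.
rows = [
    {"age": 34, "city": "Austin"},
    {"age": None, "city": "Boston"},
    {"age": 51, "city": None},
    {"age": None, "city": "Denver"},
]
report = missing_percentages(rows, ["age", "city"])
print(report)  # {'age': 50.0, 'city': 25.0}
```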

476

MCQmedium

A data scientist is analyzing a dataset with a target variable that is heavily imbalanced (e.g., 99% negative class, 1% positive class). Which exploratory data analysis technique is most appropriate to understand the relationship between features and the target before modeling?

A.Randomly sample 10% of the data and plot feature distributions by class.

B.Apply PCA to reduce dimensionality, then visualize the first two components.

C.Use stratified sampling to create a balanced subset, then compute correlation matrices and box plots.

D.Focus only on the majority class features to avoid bias.

Answer: C

Stratifying by class guarantees that the rare class is well represented, enabling meaningful EDA.

Why this answer

Option C is correct because sampling within each class (stratifying by the target) guarantees that both classes appear in the subset, and drawing a similar number of rows from each class produces a balanced subset for exploratory analysis. Computing correlation matrices and box plots on this balanced subset reveals feature-target relationships without being overwhelmed by the majority class, which is critical for imbalanced datasets like 99% negative vs. 1% positive.

Exam trap

The trap here is that candidates may think random sampling (Option A) is sufficient for EDA, but they overlook that severe class imbalance (99:1) makes random samples uninformative for the minority class, whereas stratified sampling explicitly addresses this by ensuring both classes are represented in the analysis subset.

How to eliminate wrong answers

Option A is wrong because random sampling of 10% of the data will likely preserve the original class imbalance (99:1), so feature distributions by class will still be dominated by the negative class, obscuring patterns for the rare positive class. Option B is wrong because PCA is an unsupervised dimensionality reduction technique that does not use the target variable; the first two components may capture variance unrelated to the target, and the resulting visualization may not highlight class-specific separations. Option D is wrong because focusing only on the majority class features ignores the minority class entirely, which is the very class of interest in imbalanced problems; this approach would miss important discriminative features and introduce bias.


477

MCQmedium

An IAM policy attached to a SageMaker notebook role is shown in the exhibit. A data scientist is trying to run a training job from the notebook, but the job fails with an access denied error. The training job needs to read data from 'my-bucket' and write output to 'my-bucket'. What is the most likely cause of the failure?

A.The policy does not allow s3:ListBucket

B.The training job execution role does not have the same permissions

C.The policy does not allow sagemaker:CreateTrainingJob

D.The S3 bucket is not specified in the Resource

E.The policy does not allow s3:GetObject

Answer: B

The notebook role is used for the notebook; the training job uses an execution role that may lack permissions.

Why this answer

Option B is correct because the training job runs under its own execution role, which is separate from the notebook role, so permissions attached to the notebook role do not apply to the job. Options A and E are wrong because the S3 actions are allowed by the policy. Option C is unlikely because the job was started and then failed, which suggests the notebook role could call sagemaker:CreateTrainingJob. Option D is wrong because the bucket is specified in the Resource.

478

MCQeasy

A data scientist needs to run a one-time training job on a large dataset using SageMaker. The job requires a specific PyTorch version and custom dependencies. Which approach is MOST efficient?

A.Create a custom Docker container and push to ECR.

B.Launch a SageMaker notebook instance, install dependencies, and run training script.

C.Use the SageMaker PyTorch estimator with a pre-built container.

D.Use the SageMaker generic container and install PyTorch via a lifecycle configuration.

Answer: C

The framework estimator manages the container and allows adding custom dependencies via source_dir.

Why this answer

SageMaker provides pre-built deep learning containers (DLCs) for PyTorch. Using a SageMaker framework estimator with a pre-built PyTorch container is the easiest and most efficient. The framework estimator automatically handles the container and dependencies.

Creating a custom container is more work. Using the generic container requires manually installing dependencies. Using a notebook instance is for interactive development, not one-time training.


479

MCQeasy

A data scientist runs the above AWS CLI command on a file in S3. What can be concluded from the output?

A.The ETag can be used for integrity checking.

B.The file is 1 GB in size.

C.The object has S3 versioning enabled.

D.The file has not been preprocessed.

Answer: A

For objects uploaded in a single part, the ETag is typically an MD5 hash of the object, so it can be used to detect changes.

Why this answer

Option A is correct because the ETag can be used to verify file integrity. Option B is wrong because ContentLength is reported in bytes, and 1048576 bytes is 1 MB, not 1 GB. Option C is wrong because the ETag is not a version ID; S3 versioning uses VersionId. Option D is wrong because the preprocessed metadata indicates the file has been processed.
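
An integrity check against the ETag can be sketched with the standard library. The sketch assumes the ETag is a plain MD5 digest, which holds for single-part uploads but not for multipart uploads (whose ETags carry a `-N` suffix). The payload and the helper name are made up for illustration.

```python
import hashlib

def matches_etag(data: bytes, etag: str) -> bool:
    """Compare local bytes with an S3 ETag, assuming the ETag is a plain MD5 digest."""
    # S3 returns the ETag wrapped in double quotes; strip them before comparing.
    return hashlib.md5(data).hexdigest() == etag.strip('"')

payload = b"hello world\n"
# Stand-in for the ETag a head-object call would return for this payload.
reported_etag = '"' + hashlib.md5(payload).hexdigest() + '"'

print(matches_etag(payload, reported_etag))           # True
print(matches_etag(b"tampered data", reported_etag))  # False
```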

480

MCQmedium

A data scientist builds a Random Forest model using SageMaker. The model performs well on training data but poorly on test data. Which step is most likely to reduce overfitting?

A.Reduce the maximum depth of each tree

B.Increase the number of trees

C.Switch to a linear model

D.Increase the number of features considered at each split

Answer: A

Shallower trees reduce model complexity and help prevent overfitting.

Why this answer

Reducing the maximum depth of each tree limits the complexity of individual decision trees, preventing them from memorizing noise and specific patterns in the training data. This directly addresses overfitting by enforcing simpler, more generalized splits, which improves performance on unseen test data.

Exam trap

The trap here is that candidates often assume adding more trees (Option B) always improves generalization, but they miss that overfitting in Random Forest is primarily caused by individual trees being too deep, not by the ensemble size.

How to eliminate wrong answers

Option B is wrong because increasing the number of trees does not fix trees that are individually too deep; adding trees reduces variance through averaging, but if each tree already overfits, the ensemble still produces overfit predictions. Option C is wrong because switching to a linear model is an extreme and unnecessary step; Random Forest can be regularized effectively by tuning hyperparameters like max_depth, and a linear model may underfit if the data has non-linear relationships. Option D is wrong because increasing the number of features considered at each split makes the trees more similar to one another and lets each tree fit the training data more closely, especially if the features are noisy or irrelevant, so it does not reduce overfitting.


481

MCQmedium

A team is training a large language model using SageMaker's distributed training. They notice that the training loss is not decreasing after the first few epochs. Which action is MOST likely to resolve this issue?

A.Increase the batch size

B.Add L2 regularization

C.Reduce the learning rate

D.Switch from Adam to SGD optimizer

Answer: C

A high learning rate can cause the loss to stall; reducing it allows finer updates.

Why this answer

A learning rate that is too high can cause the loss to plateau or diverge, so reducing it often helps. Increasing the batch size may stabilize training but does not directly address a plateau.

Switching the optimizer may help but is less direct, and adding L2 regularization targets overfitting rather than a stalled training loss.

Full explanation →

482

MCQeasy

A data scientist is using Amazon SageMaker Data Wrangler for exploratory data analysis. The dataset contains a column with missing values that are encoded as 'NA' strings. The data scientist wants to treat these as missing values during the import. Which step should the data scientist take?

A.Configure a custom missing value symbol 'NA' in the import settings of Data Wrangler.

B.Use the 'Impute' transform to fill 'NA' with the mean of the column.

C.Use the 'Replace missing' transform to replace 'NA' with null after import.

D.Use the 'Drop missing' transform to remove rows containing 'NA'.

AnswerA

Data Wrangler supports custom missing value symbols during data import.

Why this answer

Option A is correct because Data Wrangler allows specifying custom missing value symbols during import. Option B is wrong because imputation only makes sense after 'NA' has been recognized as missing. Option C is wrong because replacing after import is less efficient.

Option D is wrong because dropping rows is premature.

Full explanation →

483

MCQmedium

A data scientist is using Amazon SageMaker to train a model. The training data is stored in Amazon S3 and is approximately 500 GB. The data scientist notices that the training job is taking a long time to start because the data is being copied to the training instance's storage. The data scientist wants to reduce the startup time for subsequent training jobs. Which action should the data scientist take?

A.Use Pipe input mode instead of File input mode for the training job

B.Use an EBS-optimized instance type

C.Use Amazon FSx for Lustre as a high-performance file system mounted to the training instance

D.Increase the size of the training instance's Amazon EBS storage volume

AnswerA

Pipe mode streams data from S3 directly, reducing startup time.

Why this answer

Option A is correct because using Pipe input mode streams data directly from S3 to the training algorithm without downloading, reducing startup time. Option B is wrong because using EBS-optimized instances does not change the data loading mechanism. Option C is wrong because FSx for Lustre is not needed for simple streaming.

Option D is wrong because increasing instance storage does not address the data transfer issue.
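As a rough sketch, the input mode is set per data channel in the training job request. The dictionary below mirrors the shape of one `InputDataConfig` entry in the `CreateTrainingJob` API; the helper name `make_channel`, the bucket, and the prefix are placeholders chosen for illustration, and nothing is sent to AWS.

```python
# Sketch of one training channel that requests Pipe input mode.
# The S3 URI is a placeholder (assumption), not a real location.
def make_channel(name, s3_uri, input_mode="Pipe"):
    """Build one InputDataConfig entry for a CreateTrainingJob request."""
    return {
        "ChannelName": name,
        # "File" copies the data to the instance first; "Pipe" streams it.
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


train_channel = make_channel("train", "s3://example-bucket/train/")
print(train_channel["InputMode"])
```

Note that the training script or algorithm also has to support reading from a stream for Pipe mode to work.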

Full explanation →

484

MCQmedium

A team trained a multiclass classification model using SageMaker built-in XGBoost. The model's accuracy is high, but for a specific class, recall is very low. The team wants to improve recall for that class without significant accuracy drop. Which approach is MOST effective?

A.Add more training data from all classes

B.Resample the training data to balance the class representation

C.Increase the max_depth hyperparameter of XGBoost

D.Switch from XGBoost to a linear learner

AnswerB

Resampling addresses class imbalance, improving recall for minority class.

Why this answer

Option B is correct because resampling the training data to balance class representation directly addresses the root cause of low recall for a specific class in a multiclass XGBoost model. XGBoost's built-in objective functions (e.g., 'multi:softmax') optimize for overall accuracy, which can bias the model toward majority classes; resampling (e.g., oversampling the minority class or undersampling the majority) forces the model to learn decision boundaries that better capture the minority class, improving recall without drastically reducing overall accuracy.

Exam trap

The trap here is that candidates often assume increasing model complexity (max_depth) or switching algorithms will fix class imbalance, when in fact the most effective and direct approach is to rebalance the training data through resampling.

How to eliminate wrong answers

Option A is wrong because adding more data from all classes does not specifically target the underrepresented class; it may even worsen the imbalance if the new data is also skewed, and it does not guarantee improved recall for the minority class. Option C is wrong because increasing max_depth can lead to overfitting, which might temporarily boost recall on training data but often degrades generalization and overall accuracy, and it does not systematically address class imbalance. Option D is wrong because switching from XGBoost to a linear learner (e.g., LinearLearner in SageMaker) assumes linear separability, which is rarely true for complex multiclass problems; linear models typically have lower capacity to model minority class patterns and often yield worse recall than tree-based methods like XGBoost.

Full explanation →

485

MCQmedium

A data scientist is performing hyperparameter tuning using Amazon SageMaker Automatic Model Tuning (AMT). The job uses a random search strategy. After 20 training jobs, the best objective metric value has plateaued. The data scientist wants to explore more of the hyperparameter space. Which action should the data scientist take?

A.Change the tuning strategy from Random to Bayesian.

B.Enable early stopping.

C.Decrease the maximum number of training jobs.

D.Increase the number of parallel training jobs.

AnswerA

Bayesian search uses past results to guide exploration.

Why this answer

Option A is correct because switching to Bayesian search uses the results of previous jobs to choose which regions to try next, potentially finding better hyperparameters. Option B is wrong because early stopping ends unpromising jobs sooner but does not change the search strategy. Option C is wrong because decreasing the number of training jobs reduces exploration.

Option D is wrong because increasing the number of parallel jobs does not change the search strategy.

Full explanation →

486

MCQeasy

A data scientist is training a model and wants to monitor training progress. Which AWS service can be used to track metrics like loss and accuracy in real time?

A.Amazon SageMaker Ground Truth

B.Amazon SageMaker Automatic Model Tuning

C.AWS Glue

D.Amazon CloudWatch

AnswerD

CloudWatch can monitor custom metrics.

Why this answer

Amazon CloudWatch is the correct service because it provides real-time monitoring of metrics such as loss and accuracy during model training. When using SageMaker, the metrics a training job writes to its logs are parsed according to the algorithm's metric definitions and published to CloudWatch, allowing you to view them and set alarms on metric thresholds in near real time.

Exam trap

The trap here is that candidates may confuse Amazon SageMaker Automatic Model Tuning (which orchestrates hyperparameter searches) with a monitoring service, but it does not provide real-time metric tracking itself—only CloudWatch does.

How to eliminate wrong answers

Option A is wrong because Amazon SageMaker Ground Truth is a data labeling service, not a monitoring tool for training metrics. Option B is wrong because Amazon SageMaker Automatic Model Tuning (hyperparameter tuning) launches training jobs with different hyperparameters but does not itself track real-time metrics; it relies on CloudWatch for that. Option C is wrong because AWS Glue is a serverless data integration and ETL service, not designed for real-time metric tracking during model training.
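To make the log-to-metric step concrete, here is a minimal sketch of how regex-based metric definitions can pull loss and accuracy out of a training log line. The `Name`/`Regex` pairs follow the style used for training job metric definitions, but the log format, the patterns, and the helper `extract_metrics` are assumptions for illustration only.

```python
import re

# Hypothetical metric definitions in the Name/Regex style.
metric_definitions = [
    {"Name": "train:loss", "Regex": r"loss=([0-9.]+)"},
    {"Name": "train:accuracy", "Regex": r"accuracy=([0-9.]+)"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every definition that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            # The first capture group holds the numeric value.
            found[definition["Name"]] = float(match.group(1))
    return found


line = "epoch=3 loss=0.412 accuracy=0.873"  # made-up log line
metrics = extract_metrics(line, metric_definitions)
print(metrics)
```

Each extracted value would become one data point on the corresponding CloudWatch metric.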

Full explanation →

487

MCQeasy

A data scientist is training a model on Amazon SageMaker and notices that the training job is taking much longer than expected. The instance type is ml.m5.xlarge and the dataset is 10 GB in CSV format. Which action is MOST likely to reduce training time without changing the instance type?

A.Change the instance type to ml.p3.2xlarge (GPU) for faster computation.

B.Reduce the number of training epochs to speed up convergence.

C.Convert the dataset to RecordIO or Parquet format before training.

D.Increase the batch size to the maximum supported by the instance memory.

AnswerC

RecordIO (a binary record format) and Parquet (a columnar format) reduce I/O and allow faster data loading in SageMaker.

Why this answer

Option C is correct because converting CSV to optimized formats like Parquet or RecordIO reduces I/O overhead and improves throughput. Option A is wrong because changing to a GPU instance is not allowed by the question stem. Option B is wrong because reducing epochs shortens training only at the risk of an under-trained model and does not address the I/O bottleneck.

Option D is wrong because increasing the batch size may help but does not address I/O.

Full explanation →

488

MCQmedium

A company is using Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be transformed into Parquet format before delivery. Which approach should the data engineer use?

A.Send the data to Amazon Kinesis Data Analytics to convert to Parquet

B.Configure Kinesis Data Firehose to convert the record format to Parquet using a schema from AWS Glue Data Catalog

C.Use an AWS Lambda function to transform JSON to Parquet and write to S3

D.Use an AWS Glue ETL job to read from Firehose and write Parquet to S3

AnswerB

Firehose can convert JSON to Parquet using a Glue Data Catalog schema.

Why this answer

Option B is correct because Kinesis Data Firehose supports built-in record format conversion to Parquet, using a schema from the AWS Glue Data Catalog. Option A is wrong because Kinesis Data Analytics is for streaming analytics, not format conversion. Option C is wrong because a Lambda transformation must return its output to Firehose rather than write Parquet to S3 itself. Option D is wrong because an AWS Glue ETL job is a separate ETL service, not an in-stream transformation for Firehose.
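For orientation, the conversion is configured on the delivery stream's S3 destination. The sketch below builds a dictionary shaped like the `DataFormatConversionConfiguration` block; the helper name `parquet_conversion`, the role ARN, and the database and table names are placeholders (assumptions), and no AWS call is made.

```python
# Sketch of the record format conversion block for a Firehose S3 destination.
# All names and the ARN passed in below are placeholders (assumptions).
def parquet_conversion(role_arn, database, table, region):
    """Build a JSON-to-Parquet conversion config using a Glue table schema."""
    return {
        "Enabled": True,
        # The Glue Data Catalog table supplies the schema for the Parquet output.
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
            "Region": region,
        },
        # Incoming records are parsed as JSON ...
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # ... and written out as Parquet.
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }


config = parquet_conversion(
    "arn:aws:iam::123456789012:role/example-firehose-role",
    "example_db",
    "example_table",
    "us-east-1",
)
print(sorted(config))
```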

Full explanation →

489

MCQhard

The exhibit shows an IAM policy for a SageMaker notebook. A data scientist wants to use the notebook to run an Athena query and then load the results into a pandas DataFrame. Which action is NOT possible with this policy?

A.Read the Athena query results from the output S3 location

B.Start an Athena query execution

C.Read a specific object from the my-training-data bucket

D.List objects in the my-training-data bucket

AnswerA

The policy only allows read on my-training-data, not the Athena output bucket.

Why this answer

Option A is the correct answer because the policy allows s3:GetObject only on the 'my-training-data' bucket, but Athena writes results to a different S3 location (specified in the query configuration). Without permission to read from that output bucket, the notebook cannot load the results. Option B is wrong because Athena actions are allowed.

Option C is wrong because reading individual objects from the training data bucket is allowed. Option D is wrong because listing the training data bucket is allowed.

Full explanation →

490

MCQeasy

A data scientist is using Amazon SageMaker to train a model. Training is taking longer than expected. The scientist notices that the training job is using a single instance type with limited GPU memory. Which action will MOST likely reduce training time?

A.Configure the training job to use distributed data parallelism across multiple instances.

B.Use SageMaker Managed Spot Training to lower cost.

C.Use batch normalization layers.

D.Enable SageMaker Debugger for real-time monitoring.

AnswerA

Distributed data parallelism splits the dataset across multiple GPUs/instances, reducing per-worker memory and training time.

Why this answer

Distributed data parallelism (Option A) splits the data across multiple GPUs, reducing per-worker memory load and speeding up training. Option B (Managed Spot Training) lowers cost but introduces interruptions and may increase total time. Option C (batch normalization) does not reduce training time.

Option D (SageMaker Debugger) is for monitoring, not performance.

Full explanation →

491

MCQeasy

A data scientist uses Amazon SageMaker Data Wrangler to explore a dataset and notices that the target variable is highly imbalanced. Which technique should the data scientist apply to balance the dataset before training?

A.Synthetic Minority Oversampling Technique (SMOTE)

B.One-hot encoding of the target variable

C.Random undersampling of the majority class

D.Min-Max scaling of all features

AnswerA

SMOTE creates synthetic minority samples to balance the dataset.

Why this answer

Option A is correct because SMOTE generates synthetic samples for the minority class. Option B is wrong because encoding is for categorical features and does not balance classes. Option C is wrong because it discards majority class data.

Option D is wrong because scaling does not balance classes.
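The core idea behind SMOTE is to create new minority points by interpolating between an existing minority point and one of its nearest minority neighbours. The following is a simplified, standard-library sketch of that idea, not the full algorithm or any library's implementation; the function name `smote_like` and the toy data are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.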

Full explanation →

492

MCQeasy

A company is using Amazon S3 as a data lake. The data engineering team needs to catalog the schema of the data and make it available for querying with Amazon Athena. Which AWS Glue component should be used?

A.AWS Glue Studio

B.AWS Glue Crawlers

C.AWS Glue ETL jobs

D.AWS Glue DataBrew

AnswerB

Crawlers populate the Glue Data Catalog with table definitions.

Why this answer

AWS Glue Crawlers (Option B) automatically scan data sources, infer schemas, and populate the AWS Glue Data Catalog. Option A is wrong because Glue Studio is a visual ETL development tool. Option C is wrong because Glue ETL jobs are for transforming data.

Option D is wrong because Glue DataBrew is for visual data preparation.

Full explanation →

493

Multi-Selectmedium

A company is using Amazon SageMaker to build a machine learning pipeline. The pipeline includes data preprocessing, training, and evaluation steps. The company wants to ensure that the pipeline is reproducible and that artifacts are versioned. Which TWO actions should be taken? (Choose TWO.)

Select 2 answers

A.Use a naming convention for training jobs that includes the date.

B.Use SageMaker Pipelines to create the pipeline and enable versioning on the pipeline artifacts.

C.Create a requirements.txt file with specific library versions for the training script.

D.Use AWS CodePipeline to trigger the pipeline on code changes.

E.Store the training dataset in a versioned S3 bucket.

AnswersB, C

SageMaker Pipelines version artifacts automatically.

Why this answer

SageMaker Pipelines provides a native way to define, orchestrate, and track machine learning pipelines. Each pipeline execution is recorded, and artifacts can be versioned (for example, by registering models in SageMaker Model Registry), which supports reproducibility. Pinning library versions in a requirements.txt file keeps the training environment identical across runs. Together these address the requirement for reproducible pipelines and versioned artifacts.

Exam trap

The trap here is that candidates often confuse data versioning (Option E) with pipeline versioning, or assume that a naming convention (Option A) or CI/CD trigger (Option D) is sufficient for reproducibility, when in fact only a purpose-built pipeline orchestration service with artifact versioning (Option B) combined with environment pinning (Option C) meets both requirements.

Full explanation →

494

MCQeasy

A data analyst is using Amazon QuickSight to explore a dataset with 10 million rows. The analyst wants to create a histogram of a numerical column. However, the query is taking too long. Which action should the analyst take to improve performance without losing accuracy?

A.Change the data source to Amazon Athena directly with a limit clause.

B.Reduce the number of bins in the histogram.

C.Use a sample of the data (e.g., 1 million rows) for the histogram.

D.Import the dataset into SPICE (Super-fast, Parallel, In-memory Calculation Engine).

AnswerD

SPICE accelerates queries by loading data into memory.

Why this answer

Option D is correct because importing the data into the SPICE in-memory engine speeds up queries while still using the full dataset. Option A is wrong because querying Athena directly with a LIMIT clause truncates the data and does not use SPICE's in-memory acceleration. Option B is wrong because reducing the bin count trades detail for speed.

Option C is wrong because sampling loses accuracy.

Full explanation →

495

MCQeasy

After loading a dataset into a pandas DataFrame, a data scientist runs df.info() and sees that a column 'income' has object dtype. What does this indicate, and what EDA step should be taken?

A.The column has missing values; impute them.

B.The column contains strings; convert to numeric using pd.to_numeric() and investigate non-convertible values.

C.Normalize the column to a 0-1 range.

D.The column is already numeric; proceed.

AnswerB

Conversion to numeric is necessary for analysis; non-convertible values may indicate errors.

Why this answer

Option B is correct because object dtype typically indicates string or mixed types; converting to numeric allows mathematical operations. Option A is wrong because object dtype does not necessarily indicate missing values. Option C is wrong because normalization is done after conversion.

Option D is wrong because an object-dtype column is not numeric.
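A minimal sketch of that conversion step, assuming pandas is installed: `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into NaN, and comparing against the original column isolates the values that need investigating. The sample values are made up.

```python
import pandas as pd

# Made-up sample: strings, a placeholder word, a thousands separator, a real gap.
df = pd.DataFrame({"income": ["52000", "61000", "unknown", "48,500", None]})

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(df["income"], errors="coerce")

# Values that were present originally but failed to convert.
failed = converted.isna() & df["income"].notna()
bad_values = df.loc[failed, "income"].tolist()

print(bad_values)
print(int(converted.isna().sum()))
```

Here the flagged values point at two different cleaning needs: a placeholder string that should be treated as missing, and a formatting issue (the comma) that can be fixed before converting again.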

Full explanation →

496

MCQhard

A data scientist is tuning a gradient boosting model using SageMaker automatic model tuning. The hyperparameter 'num_round' ranges from 50 to 500. The tuning job uses 'ObjectiveMetric' = 'validation:auc'. After 50 training jobs, the best objective value is 0.95. The data scientist suspects overfitting. What should the data scientist do?

A.Increase 'max_depth' to capture more complex patterns.

B.Add an early stopping round and increase the range for regularization hyperparameters like 'gamma' and 'lambda'.

C.Increase 'num_round' to 1000 and keep other hyperparameters unchanged.

D.Decrease the range of 'num_round' to 10-100.

AnswerB

Early stopping prevents overfitting; regularization penalizes complexity.

Why this answer

Adding early stopping and widening the search over regularization hyperparameters (such as gamma and lambda) helps reduce overfitting. Lowering the learning rate with more rounds can also help. Option A (increasing max_depth) worsens overfitting.

Option C (increasing num_round) with no regularization may overfit more. Option D (decreasing the num_round range) might underfit.

Full explanation →

497

Multi-Selecthard

A company uses Amazon S3 to store historical transaction data in CSV format. The data is partitioned by transaction_date. A data analyst runs Amazon Athena queries that frequently filter on customer_id and transaction_date. The queries are slow and expensive. The team needs to improve query performance and reduce cost. Which combination of actions should the team take? (Choose TWO.)

Select 2 answers

A.Enable S3 Select pushdown in Athena to reduce data transfer.

B.Convert the data to JSON format for better query performance.

C.Convert the data from CSV to Parquet format.

D.Reorganize the data by partitioning on customer_id first, then transaction_date.

E.Increase the number of Athena query workers.

AnswersC, D

Parquet is columnar and compressed, reducing data scanned.

Why this answer

Options C and D are correct. Converting to Parquet reduces data scanned due to columnar storage and compression. Partitioning by customer_id (which is frequently filtered) improves partition pruning.

Option A is wrong because S3 Select is not something Athena lets you enable as a pushdown option. Option B is wrong because converting to JSON increases size. Option E is wrong because increasing workers is not applicable to Athena (serverless).

Full explanation →

498

MCQmedium

A data scientist is training a model using Amazon SageMaker and notices that training is taking much longer than expected. The training job uses a single ml.p3.2xlarge instance. The data is stored in S3 and is about 50 GB in size. Which action would MOST likely reduce training time?

A.Enable automatic data sharding in the SageMaker training job.

B.Enable S3 server-side encryption on the training data.

C.Use a larger instance type, such as ml.p3.16xlarge.

D.Change the input mode from File to Pipe.

AnswerD

Pipe mode streams data directly from S3, reducing I/O time.

Why this answer

Option D is correct because using Pipe input mode streams data directly from S3 to the training algorithm without writing to disk, reducing I/O latency. Option A is wrong because SageMaker automatically manages data channels. Option B is wrong because server-side encryption protects data at rest and does nothing to speed up training.

Option C is wrong because increasing instance size may help but incurs higher cost and may not address the data loading bottleneck.

Full explanation →

499

MCQhard

A machine learning engineer is using Amazon SageMaker to train a model. The training job is taking too long. The engineer suspects the data loading is a bottleneck. Which action would MOST effectively diagnose the issue?

A.Monitor CPU utilization in CloudWatch

B.Enable SageMaker Model Monitor

C.Use SageMaker Debugger to profile the training job

D.Increase the instance type to a larger one

AnswerC

Debugger can capture detailed metrics like data loading time.

Why this answer

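Option C is correct because SageMaker Debugger can profile the training job and identify bottlenecks such as data loading. Option A is not detailed enough: CPU utilization alone does not isolate time spent loading data. Option B is for monitoring deployed models at inference time, not training jobs.

Option D adds cost without diagnosing the cause.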

Full explanation →

500

Multi-Selecteasy

Which TWO techniques are used for feature scaling? (Choose 2.)

Select 2 answers

A.One-hot encoding

B.Standardization (Z-score normalization)

C.Min-Max scaling

D.Principal Component Analysis (PCA)

E.Label encoding

AnswersB, C

Standardization scales features to have mean 0 and variance 1.

Why this answer

Standardization (Z-score normalization) is a feature scaling technique that transforms data to have a mean of 0 and a standard deviation of 1, using the formula z = (x - μ) / σ. Min-Max scaling rescales each feature to a fixed range, usually 0 to 1, using (x - min) / (max - min). Scaling matters for algorithms like SVM, k-means, and PCA, which are sensitive to feature magnitudes.

Exam trap

The exam often tests the distinction between feature scaling techniques (which transform numerical feature values) and encoding or dimensionality reduction techniques, leading candidates to mistakenly select one-hot encoding or PCA as scaling methods.
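The two scaling techniques can be sketched in a few lines of standard-library Python. The function names and sample values are illustrative, and the sketch assumes the feature is not constant (otherwise the denominators would be zero).

```python
import statistics


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Min-Max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


square_footage = [800.0, 1200.0, 1500.0, 2500.0]  # made-up sample
z_scores = standardize(square_footage)
scaled = min_max_scale(square_footage)

print(z_scores)
print(scaled)
```

Standardization leaves the values unbounded but centred, while Min-Max scaling bounds them and is more sensitive to a single extreme value.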

Full explanation →

501

MCQhard

A data analyst is examining a dataset with a target variable that has three classes: A, B, C. They plot the distribution of a feature 'X' for each class and notice that for classes A and B, the distributions are bimodal, while for class C it is unimodal. They want to assess whether feature 'X' is useful for separating the classes. Which of the following metrics should they compute to quantify the separability?

A.ANOVA F-statistic between feature X and the target.

B.Variance ratio (between-group variance / within-group variance).

C.Chi-square test of independence.

D.Mutual information between X and the target.

AnswerA

ANOVA tests if the mean of X differs across classes.

Why this answer

Option A is correct because the ANOVA F-statistic tests whether the means of X differ significantly across the classes. Option B is wrong because a raw variance ratio is not a standard named test statistic; the F-statistic is the degrees-of-freedom-adjusted form of that ratio. Option C is wrong because chi-square is for categorical features.

Option D is wrong because mutual information is used for feature selection but does not directly test separability.
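For reference, the one-way ANOVA F-statistic is the between-group mean square divided by the within-group mean square. Below is a small sketch of that calculation using only built-ins; the function name `anova_f` and the per-class samples of feature X are made up for illustration.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of numeric values."""
    k = len(groups)                      # number of groups (classes)
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


x_by_class = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 3.0, 4.0],
    "C": [5.0, 6.0, 7.0],
}
f_stat = anova_f(list(x_by_class.values()))
print(f_stat)
```

A larger F means the class means are further apart relative to the spread inside each class. Since the statistic compares means only, it can understate separability when class distributions are bimodal.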

Full explanation →

502

MCQeasy

A company is using Amazon SageMaker to deploy a model for real-time inference. The model receives requests with varying payload sizes. The company observes occasional latency spikes. Which feature can help mitigate this?

A.Multi-model endpoints

B.Amazon Elastic Inference

C.Automatic scaling

D.Amazon SageMaker Inference Recommender

AnswerD

Inference Recommender runs benchmarks to recommend optimal instance and endpoint configuration.

Why this answer

SageMaker Inference Recommender provides load testing and recommendations for instance type and endpoint configuration. It can help identify optimal settings to reduce latency spikes. Multi-model endpoints are for hosting multiple models, not directly for latency spikes.

Elastic Inference is for accelerating deep learning inference, not general latency. Automatic scaling adjusts capacity but not per-request latency.

Full explanation →

503

MCQmedium

A data scientist is training a regression model. The training loss is decreasing but the validation loss starts to increase after a few epochs. Which technique should the scientist use to address this issue?

A.Decrease the batch size.

B.Implement early stopping based on validation loss.

C.Add more layers to the model.

D.Increase the learning rate.

AnswerB

Stops training before overfitting.

Why this answer

Option B is correct because early stopping halts training when validation loss stops improving, preventing overfitting. Option A is wrong because decreasing batch size may increase noise but does not directly address overfitting. Option C is wrong because adding more layers may worsen overfitting.

Option D is wrong because increasing the learning rate can cause divergence.
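A minimal sketch of patience-based early stopping, assuming the validation loss for each epoch is already available as a list. In a real training loop the check runs after each epoch, and the weights from the best epoch are restored. The function name and loss values are illustrative.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; keep the weights from best_epoch
    return best_epoch


# Validation loss falls, then rises as the model starts to overfit.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
best = early_stopping_epoch(val_losses)
print(best)
```

The `patience` value is a trade-off: too small and training may stop on a noisy epoch, too large and the model trains well past the point where it began to overfit.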

Full explanation →

504

MCQhard

A company is using Amazon SageMaker to train and deploy a fraud detection model. The model is a gradient boosting machine (GBM) trained on a dataset with 10 million rows and 50 features. The training job runs on an ml.m5.2xlarge instance with 8 vCPUs and 32 GB memory. The training completes successfully, and the model is deployed to a real-time endpoint. After deployment, the inference latency is around 200 ms per request, which is acceptable. However, after a week, the company observes that latency increases to over 1 second during peak hours (12:00-13:00 UTC). CloudWatch metrics show CPU utilization on the endpoint instance reaches 95% during these peaks. The endpoint is configured with a single ml.m5.large instance. The company wants to maintain latency under 500 ms during peak hours without incurring unnecessary cost during off-peak hours. Which solution should the company implement?

A.Reduce the number of instances to zero during off-peak hours and manually launch a new endpoint every day at 12:00

B.Switch to SageMaker Batch Transform and have the application send requests in batches

C.Configure SageMaker endpoint auto scaling with a target CPU utilization of 70% and a minimum instance count of 1

D.Replace the endpoint instance type with ml.m5.4xlarge to handle peak load

AnswerC

Auto scaling dynamically adjusts instance count to handle load, keeping latency low and cost efficient.

Why this answer

Option C is correct: configuring auto scaling based on CPU utilization adds instances during peak and removes them during off-peak, meeting latency and cost goals. Option A (scaling to zero and manually relaunching the endpoint each day) is error-prone and leaves the endpoint unavailable off-peak. Option B (Batch Transform) is for offline inference.

Option D (larger instance) is less cost-effective because the extra capacity is paid for around the clock.
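As a sketch of what such a setup might look like, the dictionaries below follow the request shapes used by Application Auto Scaling for a SageMaker endpoint variant (one for registering the scalable target, one for a target-tracking policy). The endpoint name, variant name, policy name, and the maximum of 4 instances are assumptions for illustration, and no AWS call is made.

```python
def endpoint_scaling_config(endpoint, variant, min_instances=1,
                            max_instances=4, target_cpu=70.0):
    """Build request dictionaries for CPU-based target tracking on one variant."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    scalable_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,   # never scale below this
        "MaxCapacity": max_instances,   # assumed ceiling for peak hours
    }
    scaling_policy = {
        "PolicyName": "cpu-target-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,
            "CustomizedMetricSpecification": {
                "MetricName": "CPUUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint},
                    {"Name": "VariantName", "Value": variant},
                ],
                "Statistic": "Average",
                "Unit": "Percent",
            },
        },
    }
    return scalable_target, scaling_policy


target, policy = endpoint_scaling_config("fraud-endpoint", "AllTraffic")
print(target["ResourceId"])
```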

Full explanation →

505

MCQeasy

A data scientist is exploring a dataset with 10 features and observes that the correlation between feature A and feature B is 0.98. Which action should be taken to address multicollinearity before training a linear regression model?

A.Use Principal Component Analysis (PCA) to combine them.

B.Apply Min-Max scaling to both features.

C.Remove one of the two features from the dataset.

D.Add polynomial features to both.

AnswerC

Dropping one highly correlated feature directly addresses multicollinearity.

Why this answer

Option C is correct because dropping one of the highly correlated features reduces redundancy and mitigates multicollinearity. Option A is wrong because PCA creates orthogonal components but reduces interpretability; dropping a feature is more straightforward. Option B is wrong because scaling does not address collinearity.

Option D is wrong because adding polynomial features increases correlation.
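A short sketch of the drop-one-feature approach: compute the Pearson correlation between features and keep a feature only if it is not too strongly correlated with one already kept. The 0.95 threshold, the function names, and the sample values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def drop_correlated(features, threshold=0.95):
    """Keep features in order, skipping any whose |r| with an
    already-kept feature exceeds the threshold."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, k)) <= threshold for k in kept.values()):
            kept[name] = values
    return kept


features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * feature_a
    "feature_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(features)
print(list(kept))
```

Which of the two correlated features to keep is a modelling decision; this sketch simply keeps whichever appears first.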

Full explanation →

506

Multi-Selecthard

Which TWO statements about handling missing data during exploratory data analysis are correct? (Select TWO.)

Select 2 answers

A.Missing values can be ignored during EDA and handled during model training.

B.Visualizing the pattern of missingness can help determine if data is missing at random.

C.Understanding the missing data mechanism (MCAR, MAR, MNAR) is important for choosing an imputation strategy.

D.Listwise deletion (removing rows with missing values) is always safe and unbiased.

E.Imputing missing values with the mean preserves the original variance.

AnswersB, C

Missingness patterns inform assumptions about missing data mechanisms.

Why this answer

Options B and C are correct. B: Visualizing missing patterns (e.g., with a missingno matrix) is a good EDA practice. C: Understanding the mechanism (MCAR, MAR, MNAR) is critical for choosing an imputation method.

A is wrong because missing values should be examined during EDA and handled before model training, not ignored. D is wrong because listwise deletion can introduce bias if the data is not MCAR. E is wrong because mean imputation reduces variance.
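The effect of mean imputation on variance is easy to check numerically. The sketch below uses made-up values with two missing entries and compares the variance of the observed values with the variance after filling the gaps with the mean.

```python
import statistics

# Made-up column with two missing entries.
observed = [10.0, 12.0, None, 15.0, None, 20.0, 11.0]

present = [v for v in observed if v is not None]
mean_value = statistics.fmean(present)

# Fill every gap with the mean of the observed values.
imputed = [mean_value if v is None else v for v in observed]

var_before = statistics.pvariance(present)
var_after = statistics.pvariance(imputed)

print(var_before, var_after)
```

The imputed points sit exactly at the mean, so they add no spread while increasing the count, which pulls the variance down.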

Full explanation →

507

MCQeasy

A data scientist is exploring a dataset containing customer transactions. The dataset has a column 'transaction_amount' with values ranging from $0.01 to $10,000. Which EDA step is most appropriate to detect skewed distribution?

A.Create a count plot of transaction amounts

B.Create a correlation heatmap of all numeric columns

C.Create a histogram of transaction amounts

D.Create a scatter plot of transaction amount vs. customer age

AnswerC

A histogram shows the shape of the distribution, making skew visible.

Why this answer

Option C is correct because a histogram or density plot reveals skewness visually. Option A is wrong because a count plot is for categorical data. Option B is wrong because a heatmap shows correlation, not skewness.

Option D is wrong because a scatter plot shows the relationship between two variables.

Full explanation →

508

MCQhard

A company is using SageMaker to train a large NLP model. The training job is taking too long due to high I/O wait time. The data is stored as CSV files in S3. Which optimization should the company implement to reduce I/O wait time?

A.Convert CSV files to RecordIO format

B.Use SageMaker Pipe mode to stream data directly from S3

C.Use SageMaker batch transform before training

D.Use SageMaker File mode with larger instance storage

E.Use SageMaker ShardedByS3Key data distribution

AnswerB

Pipe mode avoids disk I/O by streaming data.

Why this answer

Option B is correct because Pipe mode streams data directly from S3, reducing disk I/O. Option A (RecordIO) reduces file size but, on its own, still uses File mode. Option C (SageMaker batch transform) is for inference, not training.

Option D (File mode) writes to disk first, causing I/O wait. Option E (ShardedByS3Key) is for distributed training but does not reduce I/O.

Full explanation →

509

MCQhard

A research lab is training a large language model (LLM) on SageMaker using PyTorch. The model has 1 billion parameters and does not fit on a single GPU. They have access to a cluster of 16 p4d.24xlarge instances (each with 8 A100 GPUs). They need to train the model with minimal changes to the training script. Which SageMaker feature should they use?

A.SageMaker's model parallelism with automatic partitioning

B.SageMaker's distributed data parallelism with Horovod

C.Use SageMaker's built-in BlazingText algorithm

D.SageMaker's managed spot training with checkpointing

AnswerA

Model parallelism splits the model across GPUs, and SageMaker's library automates this.

Why this answer

SageMaker's model parallelism is designed for large models that don't fit on a single device.

Full explanation →

510

MCQmedium

A data scientist is using Amazon SageMaker Autopilot to automatically build a binary classification model. After the Autopilot job completes, the best model has an accuracy of 0.85 on the validation set. However, the data scientist notices a class imbalance (90% negative, 10% positive). Which metric should the data scientist use to evaluate the model's performance on the positive class?

A.Area Under the ROC Curve (AUC)

B.Recall

C.Accuracy

D.Precision

AnswerA

AUC is robust to class imbalance and evaluates overall ranking performance.

Why this answer

Option A is correct because AUC measures the model's ability to distinguish between classes irrespective of threshold, and is suitable for imbalanced datasets. Option B is wrong because recall only considers actual positives. Option C is wrong because accuracy is misleading when classes are imbalanced.

Option D is wrong because precision only considers positive predictions.
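To see why accuracy misleads here, consider a classifier that always predicts the negative class on a 90/10 split. The sketch below computes accuracy and recall for that case, plus a pairwise AUC (the fraction of positive/negative pairs that the scores rank correctly). All data, scores, and function names are made up for illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Accuracy over all samples and recall on the positive class (label 1)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    recall = tp / positives if positives else 0.0
    return correct / len(y_true), recall


def pairwise_auc(pos_scores, neg_scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


y_true = [1] * 10 + [0] * 90          # 10% positive, 90% negative
always_negative = [0] * 100           # trivial classifier
accuracy, recall = accuracy_and_recall(y_true, always_negative)

auc = pairwise_auc([0.9, 0.4], [0.1, 0.5, 0.3])
print(accuracy, recall, auc)
```

The trivial classifier scores high on accuracy while finding none of the positives, which is why a ranking metric such as AUC, or a class-specific metric, gives a more honest picture on imbalanced data.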

Full explanation →

511

MCQeasy

A data scientist wants to identify outliers in a dataset with 1,000 samples and 5 numerical features. Which technique is most appropriate for univariate outlier detection?

A.Principal component analysis (PCA)

B.Interquartile range (IQR) method

C.Mahalanobis distance

D.Z-score with a threshold of 3

AnswerB

IQR is robust and suitable for univariate outlier detection.

Why this answer

Option B is correct because the IQR method identifies outliers as points more than 1.5*IQR below the first quartile or above the third quartile, which is robust to non-normal distributions. Option D is incorrect because Z-score assumes normality and is sensitive to extreme outliers. Option C is incorrect because Mahalanobis distance is for multivariate outliers.

Option A is incorrect because PCA is for dimensionality reduction, not outlier detection.
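
As a rough illustration of the two univariate rules, the sketch below applies both to one hypothetical feature column using only the Python standard library. The sample values are invented, and the quartiles come from `statistics.quantiles` with its default method, which is one of several common conventions.

```python
from statistics import mean, quantiles, stdev

# Hypothetical feature column with one obviously extreme value.
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]


def iqr_outliers(data, k=1.5):
    """Return points more than k*IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


def zscore_outliers(data, threshold=3.0):
    """Return points whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(data), stdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]


print("IQR rule:", iqr_outliers(values))
print("Z-score rule:", zscore_outliers(values))
```

On this small sample the IQR rule flags 102, while the z-score rule flags nothing because the extreme value inflates the standard deviation it is measured against.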

512

MCQeasy

A company uses Amazon SageMaker to host a model for real-time predictions. The model endpoint is experiencing high latency during peak hours. The data scientist wants to reduce latency without increasing cost. Which action should they take?

A.Enable data capture for the endpoint to log requests

B.Switch to a larger instance type

C.Reduce the number of instances behind the endpoint

D.Enable auto-scaling for the endpoint based on latency metrics

AnswerD

Auto-scaling adjusts capacity to demand, maintaining low latency without over-provisioning.

Why this answer

Using SageMaker's production variants with auto-scaling can help handle traffic spikes without over-provisioning, thus managing latency and cost. Switching to a larger instance would increase cost. Reducing the number of instances would increase latency.

Enabling data capture logs requests and responses for monitoring; it does nothing to reduce latency.

513

Multi-Selecteasy

Which TWO of the following are common techniques to handle missing values in a dataset?

Select 2 answers

A.Standardization

B.Principal Component Analysis (PCA)

C.One-hot encoding

D.Remove rows with missing values

E.Imputation with mean or median

AnswersD, E

Removing rows is a simple approach.

Why this answer

Options D and E are correct. E is correct because imputation with mean/median fills missing values. D is correct because removing rows with missing values is a valid approach.

C is wrong because one-hot encoding is for categorical data, not missing values. B is wrong because PCA is for dimensionality reduction, not missing value handling. A is wrong because standardization is for scaling, not missing values.
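
The two accepted techniques can be sketched in a few lines of plain Python. The rows and column names below are hypothetical, and missing values are represented as `None`.

```python
from statistics import median

# Hypothetical rows; None marks a missing value.
rows = [
    {"sqft": 1200, "bedrooms": 3},
    {"sqft": None, "bedrooms": 2},
    {"sqft": 1500, "bedrooms": 4},
    {"sqft": 900, "bedrooms": None},
    {"sqft": 2000, "bedrooms": 3},
]


def drop_missing(data):
    """Keep only rows in which every column has a value."""
    return [r for r in data if all(v is not None for v in r.values())]


def impute_median(data, column):
    """Fill missing entries of one column with the median of observed values."""
    observed = [r[column] for r in data if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in data]


dropped = drop_missing(rows)
imputed = impute_median(impute_median(rows, "sqft"), "bedrooms")
print(len(dropped), imputed)
```

Dropping rows shrinks the dataset from five rows to three, whereas imputation keeps all five at the cost of introducing estimated values.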

514

Multi-Selectmedium

An ML team is deploying a model for real-time inference. They require A/B testing to compare a new model against the existing one. Which THREE steps should they take to set up this test?

Select 3 answers

A.Set up a second production variant for the existing model

B.Set up a SageMaker Batch Transform job for each model

C.Configure CloudWatch alarms to trigger variant switching

D.Configure the endpoint to route a percentage of traffic to each variant

E.Create a SageMaker production variant for the new model

AnswersA, D, E

Both models must be variants to split traffic.

Why this answer

Options A, D, and E are correct. E: Create a production variant with the new model. A: Set up a production variant for the existing model.

D: Configure the endpoint to route traffic between variants. Option C (CloudWatch) is monitoring, not part of setup. Option B (Batch Transform) is for batch, not real-time.
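
The sketch below shows how two production variants and their weights translate into a traffic split. The dictionaries mirror the shape of the `ProductionVariants` list in an endpoint configuration, but the variant names, model names, instance type and weights are hypothetical, and no AWS call is made.

```python
# Hypothetical variants for one endpoint configuration.
production_variants = [
    {
        "VariantName": "existing-model",
        "ModelName": "example-model-v1",   # hypothetical model name
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 9.0,
    },
    {
        "VariantName": "new-model",
        "ModelName": "example-model-v2",   # hypothetical model name
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    },
]


def traffic_share(variants):
    """Each variant receives its weight divided by the sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


shares = traffic_share(production_variants)
print(shares)
```

With weights of 9 and 1, roughly 90% of requests go to the existing model and 10% to the new one; adjusting the weights changes the split without creating a second endpoint.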

515

MCQhard

A team is using SageMaker to train a custom PyTorch model on a large dataset (10 TB) stored in S3. The training job is repeatedly failing due to 'OutOfMemory' errors on the GPU. The team is using a single ml.p3.8xlarge instance. Which change is most likely to resolve the issue?

A.Change the instance type to ml.p3.16xlarge (more GPUs)

B.Use managed spot training to reduce cost

C.Reduce the batch size in the training script

D.Switch the input mode from Pipe to File

AnswerC

Reducing batch size decreases GPU memory usage per step, resolving OOM errors.

Why this answer

The 'OutOfMemory' error on the GPU indicates that the model and its associated data exceed the available GPU memory. Reducing the batch size directly decreases the memory footprint per training step, allowing the model to fit within the GPU's memory limits. This is the most direct and effective fix for GPU OOM errors, as it reduces the amount of data processed simultaneously without changing the instance type or input mode.

Exam trap

The exam often tests the misconception that adding more GPUs (Option A) solves per-GPU memory issues, but the OOM error is per-device and requires reducing per-device memory usage, not increasing the number of devices.

How to eliminate wrong answers

Option A is wrong because switching to ml.p3.16xlarge adds more GPUs but does not increase the memory per GPU (each GPU still has 16 GB); the OOM error occurs on a single GPU, so more GPUs won't resolve the per-GPU memory exhaustion. Option B is wrong because managed spot training reduces cost but does not affect GPU memory usage; it could even cause interruptions that complicate debugging. Option D is wrong because switching from Pipe to File input mode changes how data is streamed (Pipe streams directly from S3, File downloads to local storage) but does not reduce the memory consumed by batches during training; in fact, File mode may increase local disk usage but not GPU memory.
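
A back-of-the-envelope model can show why batch size is the lever here. The sketch below assumes per-GPU memory is a fixed cost (weights, gradients, optimizer state) plus an activation cost that grows linearly with batch size. All numbers are hypothetical apart from the 16 GB per-GPU figure quoted above, and real memory use depends on the model and framework.

```python
def estimated_gpu_memory_gb(batch_size, fixed_gb, per_sample_gb):
    """Simplified model: fixed cost plus a per-sample activation cost."""
    return fixed_gb + batch_size * per_sample_gb


def largest_fitting_batch(start_batch, gpu_memory_gb, fixed_gb, per_sample_gb):
    """Halve the batch size until the estimate fits on one GPU."""
    batch = start_batch
    while batch >= 1:
        if estimated_gpu_memory_gb(batch, fixed_gb, per_sample_gb) <= gpu_memory_gb:
            return batch
        batch //= 2
    return None  # even a batch of 1 does not fit under this model


# Hypothetical figures: 6 GB fixed cost, 0.25 GB of activations per sample.
batch = largest_fitting_batch(
    start_batch=64, gpu_memory_gb=16, fixed_gb=6, per_sample_gb=0.25
)
print(batch)
```

Under these assumptions a batch of 64 needs about 22 GB and fails, while a batch of 32 needs about 14 GB and fits. Adding more GPUs of the same size would leave the per-device estimate unchanged.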

516

MCQmedium

A data analyst is using Amazon Athena to query a partitioned dataset in S3. They notice that queries are scanning more data than expected. Which step should they take during exploratory data analysis to optimize query performance?

A.Convert the data to Parquet format.

B.Use S3 Select to filter data before querying.

C.Increase the number of workers in Athena.

D.Check the partition metadata to ensure queries are pruning partitions.

AnswerD

Verifying partition structure ensures efficient partition pruning.

Why this answer

Option D is correct because checking the partition metadata using SHOW PARTITIONS or querying the information_schema helps verify that the partition structure is correct and that queries are using partition pruning. Option A is incorrect because converting to Parquet improves columnar scanning but does not address partition misuse. Option B is incorrect because S3 Select filters the contents of individual objects and does not change how Athena prunes partitions.

Option C is incorrect because increasing workers does not fix incorrect partition usage.
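
The effect of partition pruning on data scanned can be imitated without Athena. The sketch below assumes Hive-style `dt=` prefixes and invented object sizes; it only models which objects a query would have to read, not how Athena is implemented.

```python
# Hypothetical S3 keys mapped to object sizes in MB.
objects = {
    "logs/dt=2024-01-01/part-0.csv": 120,
    "logs/dt=2024-01-02/part-0.csv": 130,
    "logs/dt=2024-01-03/part-0.csv": 125,
}


def partition_value(key, column="dt"):
    """Extract the value of a Hive-style partition column from a key."""
    for segment in key.split("/"):
        if segment.startswith(column + "="):
            return segment.split("=", 1)[1]
    return None


def megabytes_scanned(listing, dt=None):
    """Sum object sizes, skipping partitions that do not match the filter."""
    return sum(
        size
        for key, size in listing.items()
        if dt is None or partition_value(key) == dt
    )


full_scan = megabytes_scanned(objects)
pruned_scan = megabytes_scanned(objects, dt="2024-01-02")
print(full_scan, pruned_scan)
```

A query that filters on the partition column reads 130 MB instead of 375 MB in this toy listing. If the filter is written against a non-partition column, or the partitions are not registered as expected, the query falls back to the full scan.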

517

Multi-Selecthard

A company is using Amazon SageMaker to train a machine learning model. The training job is configured to use the File mode to download data from S3 to the training instances. The training data is stored in a single S3 bucket with multiple prefixes. Which TWO actions are required to ensure the training job can access the data? (Choose TWO.)

Select 2 answers

A.Grant the SageMaker execution role s3:GetObject permission for the data bucket.

B.Configure the training job to use Pipe mode.

C.Specify the S3 data channel with the correct prefix.

D.Concatenate all data files into a single file.

E.Convert the data to RecordIO-protobuf format.

AnswersA, C

Needed to read objects.

Why this answer

Options A and C are correct. Option A: The IAM role must have s3:GetObject permission for the bucket. Option C: The input data channel must specify the S3 URI with the correct prefix.

Option E is wrong because File mode does not require RecordIO. Option B is wrong because Pipe mode is not used. Option D is wrong because File mode does not require data in a single file.
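
For illustration, the two required pieces might look like the sketch below: an IAM policy statement granting `s3:GetObject`, and the S3 URI a training channel would point at. The bucket name and prefix are hypothetical, and the policy is deliberately minimal; a real execution role usually needs further permissions (for example, to list the bucket) that are omitted here.

```python
import json

BUCKET = "example-training-bucket"   # hypothetical bucket name
PREFIX = "datasets/train/"           # hypothetical prefix

# Minimal policy statement covering only the permission named in Option A.
execution_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        }
    ],
}

# S3 URI for the training channel, including the prefix (Option C).
train_channel_uri = f"s3://{BUCKET}/{PREFIX}"

policy_json = json.dumps(execution_role_policy, indent=2)
print(policy_json)
print(train_channel_uri)
```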

518

Multi-Selecthard

A data scientist is performing EDA on a dataset stored in Amazon S3 using Amazon Athena. The dataset is partitioned by date, and each partition contains CSV files. The data scientist notices that some queries return zero rows for partitions that should have data. Which THREE steps should the data scientist take to troubleshoot? (Choose 3.)

Select 3 answers

A.Verify that the CSV files exist in the S3 bucket for the specific partition.

B.Run MSCK REPAIR TABLE to add new partitions to the Glue Data Catalog.

C.Convert the CSV files to Parquet format.

D.Check the data types of the columns used in the query's WHERE clause.

E.Re-run the query with a LIMIT clause to force partition discovery.

AnswersA, B, D

Files may have been moved or deleted.

Why this answer

Option A is correct because manually checking files confirms data presence. Option B is correct because MSCK REPAIR TABLE adds partitions not yet registered. Option D is correct because incorrect data types can cause filters to exclude rows.

Option E is wrong because adding a LIMIT clause does not make the query discover or register partitions. Option C is wrong because converting to Parquet is not a troubleshooting step.

519

MCQmedium

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?

A.The data retention period of the stream is too short.

B.The S3 bucket has insufficient write capacity.

C.The Kinesis stream has too few shards for the data volume.

D.The Lambda function's reserved concurrency is set too high.

AnswerC

Insufficient shards cause ProvisionedThroughputExceededException.

Why this answer

The 'ProvisionedThroughputExceededException' error in Amazon Kinesis Data Streams indicates that requests exceed the throughput provisioned for the stream's shards. Each shard supports up to 1 MB/s or 1,000 records/s for writes. When the clickstream volume outgrows what the existing shards can carry, calls against the stream, including the reads made on behalf of the Lambda consumer, are throttled with this exception.

Increasing the number of shards scales the write capacity to match the data volume.

Exam trap

The trap here is that candidates confuse Kinesis throughput limits with Lambda concurrency or S3 capacity, but the specific exception name 'ProvisionedThroughputExceededException' is a direct indicator of insufficient shard write capacity in Kinesis.

How to eliminate wrong answers

Option A is wrong because the data retention period (default 24 hours, up to 365 days) controls how long records are stored, not the write throughput; a short retention period would cause data loss, not throughput errors. Option B is wrong because S3 buckets have virtually unlimited write capacity (thousands of PUT requests per second per prefix) and do not produce 'ProvisionedThroughputExceededException' errors, which are specific to Kinesis. Option D is wrong because setting reserved concurrency too high for the Lambda function would not cause a Kinesis throughput error; it might lead to throttling of the Lambda itself, but the exception originates from the Kinesis stream's shard limits.
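
The per-shard write limits quoted above (1 MB/s or 1,000 records/s) give a quick way to estimate a minimum shard count. The sketch below is a simple sizing calculation with hypothetical traffic figures; it ignores read-side limits, hot partition keys, and headroom for spikes.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # per-shard write limit in MB/s
WRITE_RECORDS_PER_SHARD = 1000  # per-shard write limit in records/s


def required_shards(mb_per_second, records_per_second):
    """Smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(mb_per_second / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_per_second / WRITE_RECORDS_PER_SHARD)
    return max(1, by_bytes, by_records)


# Hypothetical clickstream loads.
print(required_shards(3.5, 2500))   # limited by bytes
print(required_shards(0.5, 4200))   # limited by record count
```

Either limit can be the binding one: large records exhaust the byte limit first, while many small records exhaust the record limit first.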

520

Multi-Selecthard

Which THREE are valid considerations when deploying a large deep learning model (10 GB) on a SageMaker endpoint? (Choose 3.)

Select 3 answers

A.Enable SageMaker Data Compression for network transfer.

B.Use GPU instances (e.g., p3, inf1) for faster inference.

C.Use SageMaker Multi-Model Endpoints to serve multiple models.

D.Use SageMaker Serverless Inference to avoid managing instances.

E.Attach Elastic Inference accelerators.

AnswersA, B, C

Compression reduces data transfer time.

Why this answer

Using GPU instances (Option B), enabling data compression (Option A), and using multi-model endpoints (Option C) are valid considerations. Option E (Elastic Inference) is deprecated and not recommended. Option D (serverless inference) has a payload limit and cold starts unsuitable for large models.

521

MCQmedium

A data scientist is training a binary classification model on a dataset with 100 features and 10,000 rows. The model overfits significantly: training accuracy is 99%, but validation accuracy is 80%. The data scientist has tried L1 and L2 regularization without improvement. The dataset is clean and representative. The data scientist needs to maintain a validation accuracy above 85%, but the current model is too complex. The company has limited budget for data labeling. Which approach is MOST likely to reduce overfitting?

A.Use a simpler model like logistic regression

B.Add more training data by generating synthetic samples using SMOTE

C.Reduce the number of features using PCA

D.Increase the number of training epochs

AnswerA

Simpler model reduces capacity and overfitting.

Why this answer

Option A is correct because the current model (likely a decision tree ensemble like Random Forest or XGBoost) is too complex for the dataset, causing overfitting. Switching to a simpler model like logistic regression reduces variance by limiting the hypothesis space, which directly addresses overfitting without requiring additional data or feature engineering. Given the limited labeling budget, this approach is cost-effective and can improve generalization, potentially achieving the required >85% validation accuracy.

Exam trap

The trap here is that candidates often assume more data (SMOTE) or dimensionality reduction (PCA) will always reduce overfitting, but in this scenario the core issue is model complexity, not data quantity or feature noise.

How to eliminate wrong answers

Option B is wrong because SMOTE generates synthetic samples by interpolating between existing minority class instances, which does not add new independent information; it can exacerbate overfitting by creating artificial patterns that the model already memorizes. Option C is wrong because PCA reduces dimensionality by projecting features onto principal components, but it is unsupervised and may discard features that are discriminative for the binary classification task, potentially harming validation accuracy. Option D is wrong because increasing the number of training epochs allows the model to further minimize training loss, which worsens overfitting by making the model memorize noise rather than generalize.

522

Multi-Selecthard

A company is deploying a real-time inference endpoint with SageMaker. The model is a large neural network that requires GPU acceleration. Which TWO configurations must be set?

Select 2 answers

A.Instance type with GPU

B.Create a SageMaker model with the inference code and model artifacts

C.Batch transform job

D.Production variant

E.Training container image

AnswersA, B

Required for GPU inference.

Why this answer

Option A is correct because deploying a real-time inference endpoint with a large neural network that requires GPU acceleration necessitates selecting an instance type with a GPU, such as the ml.p3 or ml.g4dn series, to provide the parallel processing power needed for low-latency inference. Without a GPU instance, the model would fall back to CPU, leading to unacceptable inference times for large neural networks.

Exam trap

The trap here is that candidates often confuse the required configurations for deploying a real-time endpoint with those for training or batch processing, mistakenly selecting Batch Transform or Training Container Image instead of recognizing that the instance type with GPU and the SageMaker model definition are the two essential components.

523

MCQhard

A media company uses Amazon SageMaker to train a deep learning model for video classification. The training job uses a single ml.p3.2xlarge instance and processes 50 GB of labeled video data stored in Amazon S3. The training completes successfully in 12 hours. However, the data scientists report that the model’s accuracy is lower than expected. They suspect the training data contains labeling errors. To improve model accuracy without incurring significant additional cost, they want to identify and remove mislabeled training examples before retraining. They have a small budget of $50 and need to complete the analysis within 2 hours. Which approach should the data scientists take?

A.Use SageMaker Ground Truth to create a new labeling job for the entire dataset, then compare the new labels with the original labels to identify discrepancies.

B.Use SageMaker Clarify to generate a bias report for the training data and remove instances that contribute to bias.

C.Train a small, fast model on a random sample of the data (e.g., 1 GB) using a cheaper instance like ml.m5.xlarge, then use the model's prediction confidence to flag low-confidence examples as potential mislabels for manual review.

D.Manually review all 50 GB of video data to correct labels.

AnswerC

This approach is cost-effective (within $50) and fast (under 2 hours). The small model can identify likely mislabeled examples by low confidence, allowing targeted manual review.

Why this answer

Option C is correct because training a small, fast model on a 1 GB random sample using a cheaper instance (ml.m5.xlarge) allows the team to quickly identify low-confidence predictions, which are strong indicators of mislabeled examples. This approach fits within the $50 budget and 2-hour time constraint, as it avoids processing the full 50 GB dataset and leverages a lightweight model for rapid iteration. By flagging only suspicious samples for manual review, the team can efficiently clean the training data without incurring the cost of re-labeling the entire dataset.

Exam trap

The trap here is that candidates may choose SageMaker Ground Truth (Option A) assuming it is the standard tool for label correction, but they overlook the strict budget and time constraints that make it infeasible for the full dataset.

How to eliminate wrong answers

Option A is wrong because using SageMaker Ground Truth to create a new labeling job for the entire 50 GB dataset would exceed the $50 budget and 2-hour time limit, as labeling large video datasets is expensive and time-consuming. Option B is wrong because SageMaker Clarify is designed for detecting bias in data and models, not for identifying individual mislabeled examples; it generates bias reports but cannot pinpoint which specific labels are erroneous. Option D is wrong because manually reviewing all 50 GB of video data is impractical within the 2-hour window and would far exceed the $50 budget, requiring significant human effort and cost.
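
The flagging step in Option C can be sketched as follows. This version assumes the small model outputs a probability per class and treats an example as suspect when the probability assigned to its existing label falls below a threshold. The example IDs, class names, probabilities and the 0.3 threshold are all hypothetical.

```python
def flag_suspect_labels(examples, threshold=0.3):
    """Return (example_id, confidence) pairs whose given label looks doubtful.

    `examples` is a list of (example_id, given_label, class_probabilities).
    """
    suspects = []
    for example_id, label, probs in examples:
        confidence = probs[label]  # probability the model gives the existing label
        if confidence < threshold:
            suspects.append((example_id, confidence))
    # Least confident first, so reviewers start with the likeliest errors.
    return sorted(suspects, key=lambda item: item[1])


# Hypothetical model outputs for four labeled clips.
predictions = [
    ("clip-001", "sports", {"sports": 0.92, "news": 0.08}),
    ("clip-002", "news", {"sports": 0.85, "news": 0.15}),
    ("clip-003", "sports", {"sports": 0.55, "news": 0.45}),
    ("clip-004", "sports", {"sports": 0.05, "news": 0.95}),
]

review_queue = flag_suspect_labels(predictions)
print(review_queue)
```

Only the flagged clips go to manual review, which is what keeps the approach within a small budget. A low score is a hint rather than proof of a labeling error, since a weak model is also unsure about genuinely hard examples.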

524

MCQhard

A data scientist is using Amazon SageMaker to train a model with a large dataset that does not fit into memory on a single instance. The training algorithm supports distributed training. Which approach should the scientist use to train the model efficiently?

A.Use SageMaker File mode and increase the instance volume size

B.Use Amazon EMR to preprocess data and then train on a smaller sample

C.Split the data into smaller files and use multiple training jobs sequentially

D.Use SageMaker Pipe mode to stream data directly from S3

AnswerD

Pipe mode allows the algorithm to read data on the fly, handling large datasets.

Why this answer

SageMaker Pipe mode streams data from S3 directly to the training algorithm without writing to disk, enabling processing of large datasets beyond memory.

525

Multi-Selecthard

A company is deploying a machine learning model to a SageMaker endpoint and wants to ensure that the endpoint is resilient to instance failures. Which THREE steps should the company take to achieve high availability? (Choose THREE.)

Select 3 answers

A.Deploy the endpoint in a VPC with subnets in at least two Availability Zones.

B.Use a single instance type with the largest size to handle capacity.

C.Configure the endpoint with an initial instance count of at least 2.

D.Use a single Availability Zone for simplicity.

E.Enable auto-scaling to automatically replace unhealthy instances.

AnswersA, C, E

Provides AZ redundancy.

Why this answer

Option A is correct because deploying across multiple Availability Zones provides zone redundancy. Option C is correct because using multiple instances in the endpoint configuration ensures that if one instance fails, others handle traffic. Option E is correct because enabling auto-scaling can replace failed instances.

Option B is wrong because a single instance, however large, is a single point of failure. Option D is wrong because a single AZ does not protect against AZ failure.
